PySpark DataFrame: count distinct values row by row, considering history


I want to reproduce the "cookie_per_device" column using a PySpark DataFrame. I have tried many methods, but all of them have failed.
The following is the original dataset, in semicolon-separated CSV format:
device;cookie;id
a;a;1
b;f;2
c;f;3
a;a;4
b;d;5
c;f;6
b;c;7
c;f;8
a;a;9
d;r;10
e;f;11
b;r;12
c;e;13
a;b;14
b;w;15
I need to keep the row order and the columns of the DataFrame the same; please see the figure above.
If this can be achieved with a pandas DataFrame, that is also fine.
This is what I have done so far. It works, but it does not carry the id column along, and I need to keep id in the same order as in the dataset:
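from pyspark.sql.functions import col

# Read the semicolon-separated file; without a schema, every column is a string.
Test = spark.read.csv("prueba.csv", header=True, sep=";")

# Rank each distinct (device, cookie) pair in sorted order.
lookup = (Test.select("device", "cookie")
          .distinct()
          .orderBy("device", "cookie")
          .rdd
          .zipWithIndex()
          .map(lambda x: x[0] + (x[1],))
          .toDF(["device", "cookie", "rank"]))

# Attach the 1-based rank to every row.
Test = Test.join(lookup, ["device", "cookie"]).withColumn("rank", col("rank") + 1)

# Smallest rank per device, used to restart the numbering at 1 for each device.
aux = Test.groupby("device").min("rank").withColumnRenamed("min(rank)", "min")

# Note: id is not selected here, so it is missing from the output.
Test = Test.join(aux, ["device"]).select(
    col("device"), col("cookie"),
    (col("rank") - col("min") + 1).alias("dift_dev"))
Test.show()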
Below is the current output:
+------+------+--------+
|device|cookie|dift_dev|
+------+------+--------+
|     e|     f|       1|
|     d|     r|       1|
|     c|     e|       1|
|     c|     f|       2|
|     c|     f|       2|
|     c|     f|       2|
|     b|     c|       1|
|     b|     d|       2|
|     b|     f|       3|
|     b|     r|       4|
|     b|     w|       5|
|     a|     a|       1|
|     a|     a|       1|
|     a|     a|       1|
|     a|     b|       2|
+------+------+--------+
What I need instead is output that keeps the rows in the same order as the original dataset and includes the id column, also in its original order.
Finally, if a DataFrame cannot achieve this, could PySpark Streaming handle this problem?
Thanks for your help.
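For the pandas alternative mentioned in the question, a minimal sketch could look like the following. It relies on the same assumption about the missing figure: "cookie_per_device" is taken to be the running number of distinct cookies per device in id order. The CSV text is embedded as a string here only to keep the example self-contained.

```python
import io

import pandas as pd

# Dataset from the question, embedded as text for a self-contained example.
CSV_TEXT = """device;cookie;id
a;a;1
b;f;2
c;f;3
a;a;4
b;d;5
c;f;6
b;c;7
c;f;8
a;a;9
d;r;10
e;f;11
b;r;12
c;e;13
a;b;14
b;w;15
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), sep=";")

# Assumption: id defines the history, so process rows in ascending id order.
df = df.sort_values("id").reset_index(drop=True)

# True the first time a (device, cookie) pair appears, False for repeats.
is_new_cookie = ~df.duplicated(subset=["device", "cookie"])

# Running total of first-time cookies within each device.
df["cookie_per_device"] = is_new_cookie.astype(int).groupby(df["device"]).cumsum()

print(df)
```

The row order and the original columns are unchanged; only the new column is appended.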
