PySpark DataFrame: count distinct values row by row, considering history


I want to get the same effect as the "cookie_per_device" column using a PySpark DataFrame. I have tried many methods, but all of them seem to have failed.
The following is the original dataset, in CSV format:
device;cookie;id
a;a;1
b;f;2
c;f;3
a;a;4
b;d;5
c;f;6
b;c;7
c;f;8
a;a;9
d;r;10
e;f;11
b;r;12
c;e;13
a;b;14
b;w;15
I need to keep the row order and the columns of the DataFrame the same; please check the figure above.
If this effect can be achieved with a pandas DataFrame, that is also fine.
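A minimal pandas sketch of one way this might be done. It assumes that "cookie_per_device" means the number of distinct cookies seen so far for a device when the rows are read in id order; the figure is not reproduced here, so that reading is an assumption. The inline CSV_TEXT string and the first_seen helper name are only for illustration.

```python
import io

import pandas as pd

# The sample dataset from the question, inlined so the sketch is self-contained.
CSV_TEXT = """device;cookie;id
a;a;1
b;f;2
c;f;3
a;a;4
b;d;5
c;f;6
b;c;7
c;f;8
a;a;9
d;r;10
e;f;11
b;r;12
c;e;13
a;b;14
b;w;15
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), sep=";")

# Make the "history" order explicit: rows are processed by increasing id.
df = df.sort_values("id", kind="stable").reset_index(drop=True)

# True only on the first row where a given (device, cookie) pair appears.
first_seen = ~df.duplicated(subset=["device", "cookie"])

# Running total of first appearances within each device
# (assumed meaning of cookie_per_device).
df["cookie_per_device"] = first_seen.astype(int).groupby(df["device"]).cumsum()

print(df)
```

With the sample data, ids 1 to 15 get the values 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 4, 2, 2, 5, and the device, cookie and id columns stay in their original order.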
This is what I have done so far. It works, but it does not keep the id column; I need to keep id in the same order as in the dataset.
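from pyspark.sql.functions import col

# `spark` is the active SparkSession (for example, the one the pyspark shell provides).
Test = spark.read.csv("prueba.csv", header=True, sep=";")
# Number every distinct (device, cookie) pair in alphabetical order.
lookup = (Test.select("device", "cookie")
          .distinct()
          .orderBy("device", "cookie")
          .rdd
          .zipWithIndex()
          .map(lambda x: x[0] + (x[1],))
          .toDF(["device", "cookie", "rank"]))
# Attach the 1-based rank to every row.
Test = Test.join(lookup, ["device", "cookie"]).withColumn("rank", col("rank") + 1)
# Smallest rank per device, used to restart the numbering at 1 for each device.
aux = Test.groupby("device").min("rank").withColumnRenamed("min(rank)", "min")
# Note: this select drops the id column.
Test = Test.join(aux, ["device"]).select(
    col("device"), col("cookie"),
    (col("rank") - col("min") + 1).alias("dift_dev"))
Test.show()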
Below is the current output:
+------+------+--------+
|device|cookie|dift_dev|
+------+------+--------+
|     e|     f|       1|
|     d|     r|       1|
|     c|     e|       1|
|     c|     f|       2|
|     c|     f|       2|
|     c|     f|       2|
|     b|     c|       1|
|     b|     d|       2|
|     b|     f|       3|
|     b|     r|       4|
|     b|     w|       5|
|     a|     a|       1|
|     a|     a|       1|
|     a|     a|       1|
|     a|     b|       2|
+------+------+--------+
What I need is to keep the rows in the same order as in the original dataset and to keep the id column, also in its original order.
Finally, if a DataFrame cannot achieve this, could PySpark Streaming handle this problem?
Thanks for your help.
