openrefine


Keep newest duplicate row depending on multiple Columns


I have a workflow problem with OpenRefine (Google Refine 2.5 [r2407]): I need duplicate-row cleaning that goes beyond the basics. All I have found so far is how to delete duplicate rows based on a single column.
My aim is to delete duplicate rows based on multiple columns, ideally following a specific hierarchy.
Example
Given the following dummy data in Refine
+----+---------+---------+--------+------------+------+-----------------------------------+
| id | timeAgo | title   | author | date       | val1 | [After Refine, keep Record]       |
+----+---------+---------+--------+------------+------+-----------------------------------+
| 1  | 10      | Faust   | Mr. A  | 2014-01-15 | 10   | ->B, older entry                  |
| 2  | 11      | Faust   | Mr. A  | 2014-01-21 | 10   | A (because of Date)               |
| 3  | 8       | Faust   | Mr. A  | 2014-01-15 | 10   | B                                 |
| 4  | 8       | RedHead | Mr. B  | 2014-01-21 | 34   | ->D, older entry                  |
| 5  | 7       | RedHead | Mr. B  | 2014-01-21 | 34   | ->D, same time Ago, but lower ID  |
| 6  | 7       | RedHead | Mr. A  | 2014-01-01 | 13   | C (because of author, date, val1) |
| 7  | 7       | RedHead | Mr. B  | 2014-01-21 | 34   | D                                 |
+----+---------+---------+--------+------------+------+-----------------------------------+
I want to remove the duplicate rows according to the following logic:
If title && author && date && val1 are the same, then
keep the newest row (lowest timeAgo); if several rows share that timeAgo, then
keep the one with the highest id.
The result would be:
+---------+----+---------+---------+--------+------------+------+
| Refined | id | timeAgo | title   | author | date       | val1 |
+---------+----+---------+---------+--------+------------+------+
| A       | 2  | 11      | Faust   | Mr. A  | 2014-01-21 | 10   |
| B       | 3  | 8       | Faust   | Mr. A  | 2014-01-15 | 10   |
| C       | 6  | 7       | RedHead | Mr. A  | 2014-01-01 | 13   |
| D       | 7  | 7       | RedHead | Mr. B  | 2014-01-21 | 34   |
+---------+----+---------+---------+--------+------------+------+
Easy Approach?
If there is no other solution, I will gladly take a scripting/GREL one.
But could the above logic be achieved with Refine's well-known workflow "recording", so that it could be extracted and applied to other datasets in the same format?
My motivation is to enable employees to work more thoughtfully with data (beyond Excel) without confronting them right away with a full-blown scripting language.
That sounds like a straightforward sorting problem.
1. Sort the records by title, author, time ago, and ID
2. Re-order rows permanently (IMPORTANT - it won't work if you forget this step)
3. Blank down on Title & Author
4. Move those two columns to the two leftmost positions
5. Join multivalued cells on the remaining columns
6. Transform all columns from step 5 using value.split(',')[0] to extract the first value (which should be the value for the record you want, if you sorted them in the right order)
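To check what the sort-then-keep-first idea should produce before building it in Refine, the same logic can be sketched outside the tool. The following is only an illustrative Python sketch, not an OpenRefine feature: the function names (duplicate_key, sort_key, keep_newest) are made up for this example. It assumes the duplicate key is all four columns from the question (title, author, date, val1) rather than only title and author, and that id is sorted descending so that the highest id comes first among rows with equal timeAgo.

```python
from itertools import groupby

# Dummy rows from the question: (id, timeAgo, title, author, date, val1)
ROWS = [
    (1, 10, "Faust", "Mr. A", "2014-01-15", 10),
    (2, 11, "Faust", "Mr. A", "2014-01-21", 10),
    (3, 8, "Faust", "Mr. A", "2014-01-15", 10),
    (4, 8, "RedHead", "Mr. B", "2014-01-21", 34),
    (5, 7, "RedHead", "Mr. B", "2014-01-21", 34),
    (6, 7, "RedHead", "Mr. A", "2014-01-01", 13),
    (7, 7, "RedHead", "Mr. B", "2014-01-21", 34),
]


def duplicate_key(row):
    """Columns that must all match for two rows to count as duplicates."""
    _id, _time_ago, title, author, date, val1 = row
    return (title, author, date, val1)


def sort_key(row):
    """Order rows so that, within each duplicate group, the keeper comes first."""
    row_id, time_ago = row[0], row[1]
    # Lowest timeAgo first; ties are broken by highest id (hence the negation).
    return duplicate_key(row) + (time_ago, -row_id)


def keep_newest(rows):
    """Sort, then keep only the first row of every duplicate group."""
    ordered = sorted(rows, key=sort_key)
    return [next(group) for _, group in groupby(ordered, key=duplicate_key)]


kept = keep_newest(ROWS)
for row in kept:
    print(row)
```

The sort corresponds to steps 1 and 2, and taking the first row of each group corresponds to steps 3 to 6. Note that the sort direction on id matters: with an ascending id sort, the lowest id in a tie would be kept instead of the highest.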
