openrefine


Keep newest duplicate row depending on multiple Columns


I have a workflow problem with OpenRefine (Google Refine 2.5 [r2407]): I need to do more sophisticated duplicate-row cleaning. All I have found so far is how to delete duplicate rows based on a single column.
My aim is to delete duplicate rows based on multiple columns, ideally in a specific hierarchy.
Example
Given the following dummy data in Refine
+----+---------+---------+--------+------------+------+-----------------------------------+
| id | timeAgo | title   | author | date       | val1 | [After Refine, keep Record]       |
+----+---------+---------+--------+------------+------+-----------------------------------+
| 1  | 10      | Faust   | Mr. A  | 2014-01-15 | 10   | ->B, older entry                  |
| 2  | 11      | Faust   | Mr. A  | 2014-01-21 | 10   | A (because of Date)               |
| 3  | 8       | Faust   | Mr. A  | 2014-01-15 | 10   | B                                 |
| 4  | 8       | RedHead | Mr. B  | 2014-01-21 | 34   | ->D, older entry                  |
| 5  | 7       | RedHead | Mr. B  | 2014-01-21 | 34   | ->D, same time Ago, but lower ID  |
| 6  | 7       | RedHead | Mr. A  | 2014-01-01 | 13   | C (because of author, date, val1) |
| 7  | 7       | RedHead | Mr. B  | 2014-01-21 | 34   | D                                 |
+----+---------+---------+--------+------------+------+-----------------------------------+
I want to remove the duplicate rows based on the following logic:
If title && author && date && val1 are the same, then
keep the newest row (lowest timeAgo); if there are several, then
keep the one with the highest id.
The result would be:
+---------+----+---------+---------+--------+------------+------+
| Refined | id | timeAgo | title   | author | date       | val1 |
+---------+----+---------+---------+--------+------------+------+
| A       | 2  | 11      | Faust   | Mr. A  | 2014-01-21 | 10   |
| B       | 3  | 8       | Faust   | Mr. A  | 2014-01-15 | 10   |
| C       | 6  | 7       | RedHead | Mr. A  | 2014-01-01 | 13   |
| D       | 7  | 7       | RedHead | Mr. B  | 2014-01-21 | 34   |
+---------+----+---------+---------+--------+------------+------+
Easy Approach?
If there is no other solution, I will gladly take a scripting/GREL one.
But could the above logic be achieved with Refine's well-known workflow "recording", so that it can be extracted and applied to other datasets in the same format?
My motivation is to enable employees to work more thoughtfully with data (beyond Excel) without confronting them right away with a full-blown scripting language.
That sounds like a straightforward sorting problem.
1. Sort the records by title, author, timeAgo, and id (timeAgo ascending and id descending, so that the row you want to keep comes first within each group)
2. Re-order rows permanently (IMPORTANT: it won't work if you forget this step)
3. Blank down on title & author
4. Move those two columns to the two leftmost positions
5. Join multi-valued cells on the remaining columns
6. Transform all columns from step 5 using value.split(',')[0] to extract the first value (which should be the value for the record you want, if you sorted them in the right order)
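To check the result outside Refine, the same sort-then-keep-first idea can be sketched in plain Python. This is only an illustration, not part of the Refine workflow: the tuple layout and the names ROWS, duplicate_key and keep_newest are made up for this example, and it assumes all four columns from the question (title, author, date, val1) form the duplicate key, which is stricter than blanking down on title and author alone.

```python
from itertools import groupby

# Dummy data from the question: (id, timeAgo, title, author, date, val1)
ROWS = [
    (1, 10, "Faust", "Mr. A", "2014-01-15", 10),
    (2, 11, "Faust", "Mr. A", "2014-01-21", 10),
    (3, 8, "Faust", "Mr. A", "2014-01-15", 10),
    (4, 8, "RedHead", "Mr. B", "2014-01-21", 34),
    (5, 7, "RedHead", "Mr. B", "2014-01-21", 34),
    (6, 7, "RedHead", "Mr. A", "2014-01-01", 13),
    (7, 7, "RedHead", "Mr. B", "2014-01-21", 34),
]


def duplicate_key(row):
    """Columns that define a duplicate (assumed): title, author, date, val1."""
    return row[2:6]


def keep_newest(rows):
    # Sort so that, within each duplicate group, the row to keep comes first:
    # lowest timeAgo, then highest id (hence the negated id).
    ordered = sorted(rows, key=lambda r: (duplicate_key(r), r[1], -r[0]))
    # Keep only the first row of every group of equal keys.
    return [next(group) for _, group in groupby(ordered, key=duplicate_key)]


kept_ids = sorted(row[0] for row in keep_newest(ROWS))
print(kept_ids)  # [2, 3, 6, 7]
```

The kept ids match rows A to D in the expected result from the question, which suggests the sort order (timeAgo ascending, id descending) is the one to use in step 1.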
