OpenRefine: cluster and edit two datasets


I have two datasets. In dataset one, column A holds the IDs and column B holds the data I need to cluster and edit using the various available algorithms. Dataset two has the same layout: IDs in the first column and the data in the next. I need to reconcile only the data from dataset one against the data from dataset two. What I have done so far is merge the two into a single dataset, but then OpenRefine gives me mixed results, i.e. messy data that exists only in dataset two, which is not what I want at this stage.
I have also investigated Reconcile-csv, but without success in achieving the desired result. Any ideas?
Reconcile-CSV is a very good tool, but not very user friendly. As an alternative, you can use the free Fuzzy Lookup Add-In for Excel. It is very easy to use, as evidenced by this screencast. One constraint: the two tables to be reconciled must be in Excel table format (select the range and press CTRL + L).
And here is the same procedure with reconcile-csv (the GREL formula used is cell.recon.best.name and comes from here).
An alternative to the reconciliation approach described by Ettore is to use algorithms similar to the 'key collision' clustering algorithms to create shared keys between the two datasets, and then use those keys to do lookups between the datasets with the 'cross' function.
As an example, for Column B in each dataset you could use 'Add column based on this column' with the GREL:
value.fingerprint()
This creates the same key as is used by the "Fingerprint" clustering method. Let's call the new column 'Column C'.
You can then look up between the two projects using the following GREL in Dataset 2:
cells["Column C"].cross("Dataset 1","Column C")
If the values in Dataset 1 and Dataset 2 would have clustered together under the fingerprint method, then the lookup between the projects will work.
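To illustrate why the shared key makes the lookup work, here is a minimal Python sketch of the same idea outside OpenRefine. The `fingerprint` function below only approximates a fingerprint-style key (it is not OpenRefine's exact implementation), and `cross_lookup`, the sample rows and the ids are hypothetical names and data made up for the example.

```python
import re
import unicodedata


def fingerprint(value: str) -> str:
    """Approximate a fingerprint-style key (an assumption, not OpenRefine's exact code)."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, keeping letters, digits, underscores and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Sort the unique tokens so word order and repeats do not matter.
    return " ".join(sorted(set(text.split())))


def cross_lookup(rows_from, rows_to):
    """For each (id, value) in rows_from, list the ids in rows_to sharing its key."""
    index = {}
    for row_id, value in rows_to:
        index.setdefault(fingerprint(value), []).append(row_id)
    return {row_id: index.get(fingerprint(value), []) for row_id, value in rows_from}


# Hypothetical sample data: (id, messy value) pairs.
dataset_1 = [(1, "Smith, John"), (2, "ACME Corp.")]
dataset_2 = [(10, "john smith"), (11, "Acme  Corp"), (12, "Jon Smith")]

# Look up each Dataset 2 row in Dataset 1 via the shared key.
matches = cross_lookup(dataset_2, dataset_1)
print(matches)
```

In this sketch, "Jon Smith" finds no partner because its key differs from that of "John Smith"; only values that collapse to an identical key are matched.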
You can also use the phonetic keying algorithms to create the match keys in Column C if that works better. What you can't do with this method (as far as I know) is the equivalent of Nearest Neighbour matching. To achieve that you'd need a reconciliation service with fuzzy matching of some kind, or you'd have to merge the two datasets.
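For comparison, the kind of fuzzy matching that a key-based lookup cannot provide can be sketched with Python's standard `difflib` module. This is only a stand-in to show the idea, not OpenRefine's Nearest Neighbour clustering or any reconciliation service; `closest_matches`, the 0.8 cutoff and the sample values are assumptions chosen for the example.

```python
import difflib


def closest_matches(value, candidates, cutoff=0.8):
    """Return candidates whose similarity ratio to value is at least cutoff."""
    # Compare case-insensitively, but return the original spellings.
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(value.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[h] for h in hits]


# Hypothetical cleaned values from Dataset 1.
dataset_1_values = ["John Smith", "Acme Corp"]

print(closest_matches("Jon Smith", dataset_1_values))
```

Here a misspelling such as "Jon Smith" is still paired with "John Smith", which an exact key comparison would miss.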
Owen
