OpenRefine: cluster and edit two datasets


I have two datasets. In dataset one, column A holds the IDs and column B holds the data I need to cluster and edit using the various available algorithms. Dataset two has the same layout: IDs in the first column and the data in the next. I need to reconcile the data from dataset one only against the data from dataset two. So far I have merged the two into a single dataset, but then OpenRefine gives me mixed results, i.e. clusters that include messy data existing only in dataset two, which is not what I want at this stage.
I have also looked into reconcile-csv, but could not get the desired result. Any ideas?
Reconcile-CSV is a very good tool, but not very user-friendly. As an alternative, you can use the free Fuzzy Lookup Add-In for Excel. It is very easy to use, as shown in this screencast. One constraint: the two tables to be reconciled must be in Excel table format (select the range and press CTRL + L).
And here is the same procedure with reconcile-csv (the GREL formula used is cell.recon.best.name and comes from here).
An alternative to the reconciliation approach described by Ettore is to use algorithms similar to the 'key collision' clustering algorithms to create shared keys between the two datasets, and then use those keys to do lookups between the datasets with the 'cross' function.
As an example, for Column B in each dataset you could use 'Add column based on this column' with the GREL:
value.fingerprint()
This creates the same key as is used by the "Fingerprint" clustering method. Let's call the new column 'Column C'.
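To make the keying step more concrete, here is a rough Python sketch of what a fingerprint key looks like. It is an approximation written for illustration, not OpenRefine's actual implementation: the function name and the exact order of the normalization steps are assumptions, and edge cases may differ from what value.fingerprint() returns.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Approximate a fingerprint-style key (illustrative, not OpenRefine's code)."""
    # Trim surrounding whitespace and lowercase everything.
    text = value.strip().lower()
    # Fold accented characters to their base letters (e.g. "é" becomes "e").
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop ASCII punctuation and control characters, but keep whitespace
    # so that tokens can still be split apart afterwards.
    text = "".join(
        ch
        for ch in text
        if ch.isspace()
        or (ch not in string.punctuation and not unicodedata.category(ch).startswith("C"))
    )
    # Split into tokens, remove duplicates, sort, and join with single spaces.
    return " ".join(sorted(set(text.split())))
```

With this sketch, "Smith,  John " and "john SMITH" both produce the key "john smith", which is why two differently formatted values can end up sharing a key.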
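You can then look up between the two projects using the following GREL in Dataset 2:
cells["Column C"].cross("Dataset 1","Column C")
cross returns an array of the matching rows in Dataset 1, so to pull out a value such as the id from the first match, append something like [0].cells["Column A"].value. If the values in Dataset 1 and Dataset 2 would have clustered together under the fingerprint method, the lookup between the projects will work.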
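The lookup amounts to an exact join on the shared key. The Python sketch below mimics that idea outside OpenRefine; the cross helper here is a hypothetical stand-in for the GREL function, and the sample rows and ids are invented for illustration.

```python
from collections import defaultdict


def cross(rows, other_rows, key_column):
    """For each row, list the rows in other_rows that share its key value."""
    # Index the other dataset by key so each lookup is a dictionary access.
    index = defaultdict(list)
    for other in other_rows:
        index[other[key_column]].append(other)
    return [index.get(row[key_column], []) for row in rows]


# Invented sample data: "Column C" holds precomputed fingerprint keys.
dataset_1 = [
    {"Column A": "a1", "Column B": "Smith, John", "Column C": "john smith"},
    {"Column A": "a2", "Column B": "Acme Ltd.", "Column C": "acme ltd"},
]
dataset_2 = [
    {"Column A": "b1", "Column B": "john SMITH", "Column C": "john smith"},
    {"Column A": "b2", "Column B": "Globex", "Column C": "globex"},
]

# Look up each Dataset 2 row in Dataset 1 and keep the id of the first match.
matches = cross(dataset_2, dataset_1, "Column C")
matched_ids = [m[0]["Column A"] if m else None for m in matches]
```

Rows whose key has no counterpart in the other dataset simply get no match, so messy values that exist in only one dataset stay out of the result.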
You can also use the phonetic keying algorithms to create the match keys in Column C if that works better. What you can't do with this method (as far as I know) is the equivalent of Nearest Neighbour matching: for that you would need a reconciliation service with fuzzy matching of some kind, or you would have to merge the two datasets.
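If fuzzy matching is needed and doing it outside OpenRefine is acceptable, a one-directional match (values from dataset one compared only against values from dataset two) can be sketched in Python with the standard library's difflib. This is a stand-in for illustration only: difflib's similarity ratio is not the same measure as OpenRefine's Nearest Neighbour distances, and the function name and the 0.8 cutoff are assumptions.

```python
import difflib


def nearest_matches(source_values, target_values, cutoff=0.8):
    """Map each source value to its closest target value, or None if too far."""
    result = {}
    for value in source_values:
        # get_close_matches returns the best candidates at or above the cutoff.
        candidates = difflib.get_close_matches(value, target_values, n=1, cutoff=cutoff)
        result[value] = candidates[0] if candidates else None
    return result


# Invented sample values: dataset one is matched against dataset two only.
result = nearest_matches(["Jon Smith", "Globex"], ["John Smith", "Acme Ltd"])
```

Because the comparison only runs from the source list to the target list, values in dataset two are never compared with each other, which avoids the mixed results that merging both datasets produces.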
Owen
