OpenRefine: cluster and edit two datasets
I have two datasets. In dataset one, column A holds the IDs and column B holds the data I need to cluster and edit using the various available algorithms. Dataset two has the same layout: the IDs in the first column and the data in the next. I need to reconcile only the data from dataset one against the data from dataset two. So far I have merged the two datasets into a single project, but OpenRefine then gives me mixed results, i.e. messy data that exists only in dataset two, which is not what I want at this stage. I have also tried reconcile-csv, but without achieving the desired result. Any ideas?
Reconcile-CSV is a very good tool, but it is not very user friendly. As an alternative, you can use the free Fuzzy Lookup Add-In for Excel. It is very easy to use, as shown in this screencast. One constraint: the two tables to be reconciled must be in Excel table format (select the range and press CTRL + L). And here is the same procedure with reconcile-csv (the GREL formula used is cell.recon.best.name and comes from here).
An alternative to the reconciliation approach described by Ettore is to use algorithms similar to the 'key collision' clustering algorithms to create shared keys in the two datasets, and then use those keys to do lookups between the datasets with the 'cross' function.

As an example, for Column B in each dataset you could use 'Add column based on this column' with the GREL:

value.fingerprint()

This creates the same key that the "Fingerprint" clustering method uses. Let's call the new column 'Column C'.

You can then look up between the two projects using the following GREL in Dataset 2:

cells["Column C"].cross("Dataset 1","Column C")

If the values in Dataset 1 and Dataset 2 would have clustered together under the fingerprint method, the lookup between the projects will work. Note that cross returns the matching rows from the other project, so you typically append something like [0].cells["Column A"].value to pull out a specific value such as the ID.

You can also use the phonetic keying algorithms to create the match keys in Column C if that works better.

What you can't do with this method (as far as I know) is the equivalent of Nearest Neighbour matching. For that you would need a reconciliation service with fuzzy matching of some kind, or you would have to merge the two datasets.

Owen
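To make the key-then-lookup idea concrete outside OpenRefine, here is a minimal Python sketch. It is only an approximation of what value.fingerprint() and cross do, not OpenRefine's actual implementation: the fingerprint function below trims, lowercases, folds accents, strips ASCII punctuation, and sorts the unique tokens. The function names (fingerprint, build_index, cross) and the sample rows are made up for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key: order- and punctuation-insensitive."""
    text = value.strip().lower()
    # Fold accented characters to their base letters (e.g. "é" -> "e").
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop ASCII punctuation (an assumption; OpenRefine's rules may differ).
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Split on whitespace, remove duplicate tokens, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def build_index(rows: dict) -> dict:
    """Map each fingerprint key to the IDs in `rows` that produce it."""
    index = defaultdict(list)
    for row_id, value in rows.items():
        index[fingerprint(value)].append(row_id)
    return index


def cross(rows: dict, index: dict) -> dict:
    """For each row, list the IDs in the indexed dataset sharing its key."""
    return {row_id: index.get(fingerprint(value), [])
            for row_id, value in rows.items()}


# Hypothetical sample data: ID -> messy value.
dataset1 = {"a1": "Smith, John", "a2": "ACME Corp."}
dataset2 = {"b1": "john smith", "b2": "Acme corp", "b3": "Other Ltd"}

matches = cross(dataset2, build_index(dataset1))
for row_id, found in matches.items():
    print(row_id, found)
```

Rows of Dataset 2 with no counterpart in Dataset 1 simply get an empty list, so values that exist only in one dataset never end up mixed into the other's matches.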