OpenRefine: cluster and edit two datasets
I have two datasets. In dataset 1, column A holds the ids and column B holds the data I need to cluster and edit using the various available algorithms. Dataset 2 has the same layout: ids in the first column and the data in the next column. I need to reconcile only the data from dataset 1 against the data from dataset 2. What I have done so far is merge the two into a single dataset, but then OpenRefine gives me mixed results, i.e. clusters of messy data that exist only in dataset 2, which is not what I want at this stage. I have also tried Reconcile-csv, but without achieving the desired result. Any ideas?
Reconcile-CSV is a very good tool, but it is not very user-friendly. As an alternative, you can use the free Fuzzy Lookup Add-In for Excel. It is very easy to use, as shown in this screencast. One constraint: the two tables to be reconciled must be in Excel table format (select the range and press CTRL + L). And here is the same procedure with reconcile-csv (the GREL formula used is cell.recon.best.name and comes from here).
An alternative to the reconciliation approach described by Ettore is to use algorithms similar to the 'key collision' clustering algorithms to create shared keys in the two data sets, and then use those keys to do lookups between the data sets with the 'cross' function.

As an example, for Column B in each data set you could 'Add column based on this column' using the GREL:

value.fingerprint()

This creates the same key that is used by the "Fingerprint" clustering method. Let's call the new column 'Column C'.

You can then look up between the two projects using the following GREL in Dataset 2:

cells["Column C"].cross("Dataset 1","Column C")

If the values in Dataset 1 and Dataset 2 would have clustered together under the fingerprint method, then the lookup between the projects will work. Note that cross returns an array of matching rows from Dataset 1 (empty if there is no match), so to pull out a value you append something like [0].cells["Column A"].value to the expression.

You can also use the phonetic keying algorithms to create the match keys in Column C if that works better.

What you can't do using this method (as far as I know) is the equivalent of the Nearest Neighbour matching. To achieve that you'd have to have a reconciliation service with fuzzy matching of some kind, or merge the two data sets.

Owen
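To see why the shared-key lookup works, here is a minimal Python sketch of the idea outside OpenRefine. The fingerprint function below only approximates what value.fingerprint() does (trim, lowercase, strip punctuation, fold accents, then sort and de-duplicate the tokens); it is not OpenRefine's actual implementation. The names cross_lookup, dataset1 and dataset2, and the sample rows, are made up for illustration.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key; an assumption, not OpenRefine's exact code."""
    text = value.strip().lower()
    # Fold accented characters to their base letters.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate and sort the tokens, then rejoin.
    return " ".join(sorted(set(text.split())))


def cross_lookup(dataset1, dataset2):
    """For each (id, value) row in dataset2, list the dataset1 ids sharing its key."""
    index = {}
    for row_id, value in dataset1:
        index.setdefault(fingerprint(value), []).append(row_id)
    return [
        (row_id, value, index.get(fingerprint(value), []))
        for row_id, value in dataset2
    ]


# Hypothetical sample rows: (id, messy value)
dataset1 = [("a1", "Smith, John"), ("a2", "ACME Corp.")]
dataset2 = [("b1", "john smith"), ("b2", "Acme  corp"), ("b3", "Jon Smyth")]

matches = cross_lookup(dataset1, dataset2)
for row_id, value, ids in matches:
    print(row_id, repr(value), "->", ids)
```

Rows b1 and b2 find their counterparts because their keys collide with keys in dataset 1. Row b3 ("Jon Smyth") gets an empty list, because a key-collision lookup only matches values whose keys are identical; a misspelling like this is the case that needs nearest-neighbour style fuzzy matching.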