OpenRefine - Cross-column clustering
As it seems, cross-column clustering isn't supported yet with OpenRefine. Does anyone have any suggestions of how to cluster 'models' based on 'manufacturers', much like a 'city' would be based on a 'state' (many 'Springfield' could exist in the US, but only cluster "city": 'Springfield', if the relative 'state' column is the same)? The relative column is already normalized.
One easy way to do it would be to create a column which was the concatenation of the model+manufacturer, cluster on the joined fields, then (if needed) split the two pieces back apart again.
I had a similar requirement for de-duplicating address strings. So I created a new column (say COMPLETE_ADDRESS) and concatenated the STREET, CITY, PROVINCE, COUNTRY and ZIPCODE fields using the below GREL expression cells["STREET"].value + " " + cells["CITY"].value + " " + cells["PROVINCE"].value + " " + cells["COUNTRY"].value + " " + cells["ZIPCODE"].value Then I did the following : Clustered the new COMPLETE_ADDRESS column with the default algorithm Merged the values in each cluster (now the values are perfect duplicates) Sort the column permanently. Do a "blank down" operation. Finally pick only non-null values in the COMPLETE_ADDRESS Having said that, as of this writing, there is no feature to merge the independent columns. The only way to do that it is to split the COMPLETE_ADDRESS into separate columns suitably. In this case, you will have to use a better separator such as pipe "|" symbol which will not conflict with existing values.
Progressive number in Openrefine column
Lost all my files on Openrefine [closed]
freebaseapps reconciliation stuck in Open Refine 2.6
OpenRefine - add sequence number, reset for each record
How to transpose cell data by section in Open Refine?
OpenRefine columnwise scripting
Remove content inside parentheses
Extra blank space between words
forNonBlank function in OpenRefine
Import columns to existing OpenRefine project
Bulk replace text in all columns
Split multi valued cells in more than one column into rows (Open Refine)
OpenRefine - Fill between cells but not at the end of the list
Reconciliation services for OpenRefine not working?
Appending a specific string in GREL
How to extract ONLY lat, lon values for node “osm_type”:“node” in a Nominatim response using Google Refine