

OpenRefine - Cross-column clustering


It seems that OpenRefine does not yet support cross-column clustering.
Does anyone have suggestions for how to cluster 'models' based on 'manufacturers', much like a 'city' would be clustered based on a 'state'? Many cities named 'Springfield' exist in the US, so two 'Springfield' values in the 'city' column should only be clustered together if the corresponding 'state' value is the same. The related column ('manufacturers') is already normalized.
One easy way to do it would be to create a new column that concatenates model and manufacturer, cluster on that joined column, then (if needed) split the two pieces back apart again.
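To illustrate why clustering on the joined value keeps different manufacturers apart, here is a minimal Python sketch. It is not OpenRefine code: `fingerprint` is only a simplified approximation of a key-collision fingerprint (lowercase, strip punctuation, sort unique tokens), and the function names and sample rows are made up for the example.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Simplified clustering key: lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster_models(rows):
    """Group model spellings that share a key within the same manufacturer."""
    clusters = defaultdict(set)
    for manufacturer, model in rows:
        # The manufacturer is part of the key, so identical model names
        # under different manufacturers never land in the same cluster.
        clusters[(manufacturer, fingerprint(model))].add(model)
    # Keep only keys with more than one distinct spelling.
    return {key: values for key, values in clusters.items() if len(values) > 1}


# Hypothetical sample data: (manufacturer, model)
rows = [
    ("Ford", "Focus ST"),
    ("Ford", "focus st."),
    ("Ford", "ST Focus"),
    ("Acme", "Focus ST"),
]
clusters = cluster_models(rows)
```

Here the three "Ford" spellings end up in one cluster, while the "Acme" row is left alone because its manufacturer differs.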
I had a similar requirement for de-duplicating address strings. I created a new column (say COMPLETE_ADDRESS) and concatenated the STREET, CITY, PROVINCE, COUNTRY and ZIPCODE fields using the following GREL expression:
cells["STREET"].value + " " + cells["CITY"].value + " " + cells["PROVINCE"].value + " " + cells["COUNTRY"].value + " " + cells["ZIPCODE"].value
Then I did the following:
1. Clustered the new COMPLETE_ADDRESS column with the default algorithm.
2. Merged the values in each cluster (so the values are now exact duplicates).
3. Sorted the column permanently.
4. Did a "blank down" operation.
5. Finally, kept only the rows with non-null values in COMPLETE_ADDRESS.
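The effect of the sort, blank-down, and filter steps can be sketched outside OpenRefine as follows. This is a rough Python illustration under the assumption that clustering and merging have already made duplicates identical; `blank_down` is a hypothetical helper and the addresses are invented sample data.

```python
def blank_down(sorted_values):
    """Replace each value that repeats the one directly above it with None."""
    result = []
    previous = object()  # sentinel that never equals a real value
    for value in sorted_values:
        result.append(None if value == previous else value)
        previous = value
    return result


# Hypothetical COMPLETE_ADDRESS values after merging clusters.
addresses = [
    "1 Main St Springfield IL US 62701",
    "9 Oak Ave Portland OR US 97201",
    "1 Main St Springfield IL US 62701",
]

# Sorting puts duplicates next to each other so blank down can catch them.
blanked = blank_down(sorted(addresses))

# Keeping only the non-null cells leaves one row per distinct address.
unique = [value for value in blanked if value is not None]
```

Sorting first matters: blank down only compares a cell with the one directly above it, so unsorted duplicates would survive.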
Having said that, as of this writing there is no feature to merge the clustered values back into the independent columns. The only way to do that is to split COMPLETE_ADDRESS back into separate columns. For that to work, you will have to concatenate with a better separator than a space, such as the pipe symbol "|", which will not conflict with existing values.
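As a sketch of why the separator choice matters, the following Python snippet joins fields with a pipe and splits them back. The helper names are hypothetical, and the check that no field already contains the separator is an added assumption rather than something OpenRefine does for you.

```python
SEPARATOR = "|"


def join_fields(fields):
    """Join field values with the separator, refusing values that contain it."""
    for field in fields:
        if SEPARATOR in field:
            raise ValueError(f"field {field!r} already contains {SEPARATOR!r}")
    return SEPARATOR.join(fields)


def split_fields(joined):
    """Recover the original field values from a joined string."""
    return joined.split(SEPARATOR)


# Hypothetical STREET, CITY, PROVINCE, COUNTRY, ZIPCODE values.
fields = ["1 Main St", "Springfield", "IL", "US", "62701"]
joined = join_fields(fields)
restored = split_fields(joined)
```

With a space as the separator, "1 Main St" would split into three pieces and the columns could not be recovered reliably.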
