OpenRefine - Cross-column clustering


It seems that cross-column clustering isn't supported yet in OpenRefine.
Does anyone have suggestions for clustering 'models' per 'manufacturer', much as a 'city' would be clustered per 'state'? Many cities named 'Springfield' exist in the US, so 'Springfield' values should only be clustered together when the corresponding 'state' value is the same. The related column ('manufacturer' here) is already normalized.
One easy way to do it is to create a column that concatenates model and manufacturer, cluster on the joined values, then (if needed) split the two pieces back apart again.
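To illustrate why clustering on the joined value keeps different manufacturers apart, here is a minimal Python sketch run outside OpenRefine. The `fingerprint` function is only a rough approximation of a key-collision fingerprint (lowercase, strip punctuation, sort unique tokens), not OpenRefine's actual implementation, and the function names and sample rows are made up for the example.

```python
import string
from collections import defaultdict


def fingerprint(text):
    """Rough key-collision fingerprint (an approximation, not OpenRefine's code)."""
    cleaned = text.strip().lower()
    # Drop ASCII punctuation so "civic." and "Civic" produce the same key.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster_joined(rows):
    """Group (model, manufacturer) pairs by the fingerprint of the joined string."""
    clusters = defaultdict(list)
    for model, manufacturer in rows:
        joined = model + " " + manufacturer
        clusters[fingerprint(joined)].append((model, manufacturer))
    return dict(clusters)


# Hypothetical sample data: the same model name under two manufacturers.
rows = [
    ("Civic", "Honda"),
    ("civic.", "Honda"),
    ("Civic", "Acme"),
]
clusters = cluster_joined(rows)
for key, members in clusters.items():
    print(key, members)
```

The two Honda spellings end up in one cluster, while the Acme row stays separate because the manufacturer is part of the key.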
I had a similar requirement for de-duplicating address strings. I created a new column (say COMPLETE_ADDRESS) and concatenated the STREET, CITY, PROVINCE, COUNTRY and ZIPCODE fields using the GREL expression below:
cells["STREET"].value + " " + cells["CITY"].value + " " + cells["PROVINCE"].value + " " + cells["COUNTRY"].value + " " + cells["ZIPCODE"].value
Then I did the following:
1. Clustered the new COMPLETE_ADDRESS column with the default algorithm.
2. Merged the values in each cluster (the values are now exact duplicates).
3. Sorted the column permanently.
4. Performed a "blank down" operation.
5. Finally, kept only the rows with non-null values in COMPLETE_ADDRESS.
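The steps above can be sketched in Python to show what each one does to the data. This is only an illustration under assumptions: `simple_key` is a hypothetical stand-in for the clustering key (it ignores case and extra whitespace), and merging to the most frequent spelling is one possible choice of cluster value, not necessarily what you would pick in the clustering dialog.

```python
from collections import Counter, defaultdict


def simple_key(text):
    # Hypothetical clustering key: case- and whitespace-insensitive.
    return " ".join(text.lower().split())


def dedupe(addresses):
    # Cluster: group the values that share the same key.
    groups = defaultdict(list)
    for address in addresses:
        groups[simple_key(address)].append(address)
    # Merge: replace every value with its cluster's most frequent spelling.
    canonical = {key: Counter(vals).most_common(1)[0][0] for key, vals in groups.items()}
    merged = [canonical[simple_key(address)] for address in addresses]
    # Sort so that duplicates become adjacent.
    merged.sort()
    # Blank down: blank a cell when it repeats the cell directly above it.
    blanked = [
        value if i == 0 or value != merged[i - 1] else None
        for i, value in enumerate(merged)
    ]
    # Keep only the non-null cells: one row per distinct address.
    return [value for value in blanked if value is not None]


# Hypothetical sample data.
sample = [
    "1 Main St Springfield IL",
    "1 main st  springfield il",
    "1 Main St Springfield IL",
    "9 Oak Ave Springfield MO",
]
print(dedupe(sample))
```

Sorting before blanking down matters: blank down only compares a cell with the one directly above it, so duplicates that are not adjacent would survive.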
That said, as of this writing there is no feature to merge values across the independent columns. The only way to do that is to split COMPLETE_ADDRESS back into separate columns. For that to work, concatenate with a separator such as the pipe symbol "|" instead of a space, so that it does not conflict with characters in the existing values.
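A short Python sketch of the separator idea, assuming the five column names used above; the helper names `join_fields` and `split_fields` are invented for illustration. It checks that the separator does not already occur in any value, because the split is only reliable in that case.

```python
SEPARATOR = "|"
FIELDS = ["STREET", "CITY", "PROVINCE", "COUNTRY", "ZIPCODE"]


def join_fields(row):
    """Build the COMPLETE_ADDRESS value from a dict of column values."""
    values = [row[name] for name in FIELDS]
    # The split below is only safe if no value contains the separator.
    if any(SEPARATOR in value for value in values):
        raise ValueError("separator occurs inside a field value")
    return SEPARATOR.join(values)


def split_fields(joined):
    """Split a COMPLETE_ADDRESS value back into the original columns."""
    parts = joined.split(SEPARATOR)
    if len(parts) != len(FIELDS):
        raise ValueError("unexpected number of fields")
    return dict(zip(FIELDS, parts))


# Hypothetical sample row.
row = {
    "STREET": "1 Main St",
    "CITY": "Springfield",
    "PROVINCE": "IL",
    "COUNTRY": "US",
    "ZIPCODE": "62701",
}
joined = join_fields(row)
restored = split_fields(joined)
print(joined)
```

With a space as the separator, a street such as "1 Main St" could not be told apart from the following city, which is why a character absent from the data is needed.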
