openrefine


Fetch URLs from a word list in OpenRefine


I have a list of organisations in Column 1 (strings with spaces, e.g. United Nations) and want to populate a second column with the associated URLs (e.g. www.un.org/), using the Column 1 values as search strings. The geocoding procedure is rather straightforward (http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial#Geocoding_names_and_addresses), so I wonder if there is a way to perform this task using Google Search or another web service. It would be a hit-and-miss approach, but it beats manual editing. Thanks!
It's hard to answer such a broad question without specific examples. But yes, OpenRefine can enrich data by calling a wide range of APIs or by scraping web pages. The procedure is almost always the same: build one URL per row, use "Add column by fetching URLs", and then parse the resulting column of HTML, XML or JSON responses.
Here is an example of how to call the Wikipedia search API from a list of names.
Building the URLs is quite simple:
"https://en.wikipedia.org/w/api.php?action=opensearch&search="
+ value.escape('url')
+ "&limit=10&namespace=0&format=xml"
For value = 'United Nations', this gives:
https://en.wikipedia.org/w/api.php?action=opensearch&search=United+Nations&limit=10&namespace=0&format=xml
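To check what the generated URLs look like outside OpenRefine, the same string building can be sketched in plain Python. This is only an illustration, not part of the OpenRefine workflow: the function name build_search_url is hypothetical, and quote_plus is assumed to be a close enough stand-in for GREL's escape('url') (both encode a space as +).

```python
from urllib.parse import quote_plus

# Same fixed parts as in the GREL expression above.
BASE = "https://en.wikipedia.org/w/api.php?action=opensearch&search="
SUFFIX = "&limit=10&namespace=0&format=xml"


def build_search_url(name):
    """Return the opensearch URL for one cell value (hypothetical helper)."""
    # quote_plus encodes spaces as '+' and escapes characters such as '&',
    # so a name cannot break the query string.
    return BASE + quote_plus(name) + SUFFIX


if __name__ == "__main__":
    print(build_search_url("United Nations"))
```

Names containing characters such as & are the main reason to encode the value rather than concatenate it raw.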
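The XML content can then be parsed to extract the items you need. select() returns an array of matching elements, so pick one by index before calling htmlText(). For example, to get the description of the first Wikipedia result:
value.parseHtml().select('Description')[0].htmlText()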
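Since the original goal is a URL rather than a description, the same parsing step can be pointed at a different element. The Python sketch below shows the idea on a hand-written sample; the sample's shape (a namespaced SearchSuggestion root with Item, Text and Url elements) is an assumption about what the opensearch XML looks like, so check it against a real response before relying on it. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, ASSUMED to resemble the opensearch XML response.
SAMPLE = """<SearchSuggestion xmlns="http://opensearch.org/searchsuggest2" version="2.0">
  <Query>United Nations</Query>
  <Section>
    <Item>
      <Text>United Nations</Text>
      <Url>https://en.wikipedia.org/wiki/United_Nations</Url>
    </Item>
    <Item>
      <Text>United Nations Security Council</Text>
      <Url>https://en.wikipedia.org/wiki/United_Nations_Security_Council</Url>
    </Item>
  </Section>
</SearchSuggestion>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def first_text(xml_text, wanted):
    """Return the text of the first element named `wanted`, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():  # document order, so the top hit comes first
        if local_name(element.tag) == wanted:
            return (element.text or "").strip()
    return None


if __name__ == "__main__":
    print(first_text(SAMPLE, "Url"))
```

Note that this yields the address of the Wikipedia page, not the organisation's own website, so it is only a first step toward the result asked for in the question.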
