

Fetch URLs from a word list in OpenRefine


I have a list of organisations in Column 1 (strings with spaces, e.g. United Nations) and want to populate a second column with the associated URLs (e.g. www.un.org/), using the Column 1 values as search strings. The geocoding procedure is rather straightforward (http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial#Geocoding_names_and_addresses), so I wonder if there is a way to perform this task using Google Search or another web service. It would be a hit-or-miss approach, but it beats manual editing. Thanks!
It's hard to answer such a broad question without specific examples. But yes, OpenRefine can enrich data by calling many different APIs or by scraping web pages. The procedure is almost always the same: build a URL for each row, use "Add column by fetching URLs", and then parse the resulting column of HTML, XML or JSON responses.
Here is an example of how to call the Wikipedia search API from a list of names.
Building the URLs is quite simple:
"https://en.wikipedia.org/w/api.php?action=opensearch&search="
+ value.escape('url')
+ "&limit=10&namespace=0&format=xml"
Which, for value = 'United Nations', gives:
https://en.wikipedia.org/w/api.php?action=opensearch&search=United+Nations&limit=10&namespace=0&format=xml
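For readers who want to check the URL construction outside OpenRefine, here is a minimal Python sketch of the same step. The function name build_search_url and the two constants are hypothetical names chosen for illustration; quote_plus from the standard library is assumed to be a reasonable stand-in for escape('url'), since both encode a space as "+".

```python
from urllib.parse import quote_plus

# The two fixed parts of the Wikipedia opensearch URL used in the answer.
API_BASE = "https://en.wikipedia.org/w/api.php?action=opensearch&search="
API_TAIL = "&limit=10&namespace=0&format=xml"


def build_search_url(name: str) -> str:
    """Return the search URL for one organisation name (hypothetical helper)."""
    # quote_plus percent-encodes reserved characters and turns spaces into '+'.
    return API_BASE + quote_plus(name) + API_TAIL


if __name__ == "__main__":
    for organisation in ["United Nations", "World Health Organization"]:
        print(build_search_url(organisation))
```

Names containing characters such as "&" or "/" are percent-encoded as well, so they do not break the query string.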
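The XML content can then be parsed to extract the items you need. Note that select() returns an array of matching elements, so pick one by index before calling htmlText(). For example, to get the description of the first Wikipedia page returned:
value.parseHtml().select('Description')[0].htmlText()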
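The same extraction can be sketched in Python to show what the parsing step does. This is an illustration under stated assumptions, not the exact API output: SAMPLE_RESPONSE is a hand-written fragment that only mimics the shape of an opensearch XML reply (the element names Item, Text, Url and Description and the namespace are assumed), and first_text is a hypothetical helper. It ignores XML namespaces so that it works whether or not the response declares one.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that imitates the assumed shape of an opensearch reply.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<SearchSuggestion xmlns="http://opensearch.org/searchsuggest2" version="2.0">
  <Query>United Nations</Query>
  <Section>
    <Item>
      <Text>United Nations</Text>
      <Url>https://en.wikipedia.org/wiki/United_Nations</Url>
      <Description>Intergovernmental organization</Description>
    </Item>
  </Section>
</SearchSuggestion>
"""


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def first_text(xml_text: str, wanted: str):
    """Return the text of the first element named `wanted`, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == wanted:
            return (element.text or "").strip()
    return None


if __name__ == "__main__":
    print(first_text(SAMPLE_RESPONSE, "Description"))
    print(first_text(SAMPLE_RESPONSE, "Url"))
```

Taking only the first match mirrors the hit-or-miss nature of the approach: the top search result is not guaranteed to be the organisation you meant, so the results still need a quick review.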
