Fetch URLs from a word list in OpenRefine
I have a list of organisations in column 1 (strings with spaces, e.g. United Nations) and want to populate a second column with the associated URLs (e.g. www.un.org/), using the column 1 values as the search string. The geocoding procedure is rather straightforward (http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial#Geocoding_names_and_addresses), so I wonder whether there is a way to perform this task using Google Search or another web service. It would be a hit-and-miss approach, but it beats manual editing. Thanks!
It's hard to answer such a broad question without specific examples. But yes, OpenRefine can enrich data through a wide range of APIs or by web scraping, and the procedure is almost always the same: build the URLs, use "Add column by fetching URLs", then parse the resulting column of HTML, XML or JSON.

Here is an example of how to call the Wikipedia search API from a list of names. Building the URL is quite simple:

"https://en.wikipedia.org/w/api.php?action=opensearch&search=" + value.escape('url') + "&limit=10&namespace=0&format=xml"

For value = 'United Nations', this gives:

https://en.wikipedia.org/w/api.php?action=opensearch&search=United+Nations&limit=10&namespace=0&format=xml

The XML content can then be parsed to extract the items you need. Note that select() returns an array of elements, so pick one before calling htmlText(). For example, to get the description of the first matching Wikipedia page:

value.parseHtml().select('Description')[0].htmlText()
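For readers who want to try the same build-fetch-parse steps outside OpenRefine first, here is a minimal Python sketch. It is not part of the OpenRefine workflow itself. The function names (build_search_url, first_text) and the SAMPLE response are hypothetical: the XML is hand-written to illustrate the assumed shape of an opensearch reply, with a placeholder description, so check it against a real response before relying on it.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

API = "https://en.wikipedia.org/w/api.php"


def build_search_url(name, limit=10):
    """Build the same opensearch URL as the GREL expression above."""
    params = {
        "action": "opensearch",
        "search": name,
        "limit": limit,
        "namespace": 0,
        "format": "xml",
    }
    # urlencode escapes spaces as '+', like value.escape('url') in GREL.
    return API + "?" + urlencode(params)


def first_text(xml_text, element_name):
    """Return the text of the first element with this local name, or None."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Drop any '{namespace}' prefix so the lookup ignores XML namespaces.
        if el.tag.rsplit("}", 1)[-1] == element_name and el.text:
            return el.text.strip()
    return None


# Hand-written stand-in for an API response (assumed shape, placeholder text).
SAMPLE = (
    '<SearchSuggestion xmlns="http://opensearch.org/searchsuggest2">'
    "<Section><Item>"
    "<Text>United Nations</Text>"
    "<Description>placeholder description</Description>"
    "<Url>https://en.wikipedia.org/wiki/United_Nations</Url>"
    "</Item></Section>"
    "</SearchSuggestion>"
)

if __name__ == "__main__":
    print(build_search_url("United Nations"))
    print(first_text(SAMPLE, "Description"))
    print(first_text(SAMPLE, "Url"))
```

Fetching the URL is left out on purpose so the sketch runs offline. Also note that the Url element here would be the Wikipedia page, not the organisation's own website, so a further lookup would still be needed to reach something like www.un.org.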