Tutorial: scraping and turning a web site into a widget with YQL
During the mentoring sessions at last weekend’s Young Rewired State, one of the most frequently asked questions was how to easily re-use content on the web. My answer was to use YQL, and I promised a short introduction to the topic, so here it is. What we are going to do here is turn a web site into a widget with YQL and a few lines of JavaScript.
Say you have a web site with a list of content and you want to turn it into a widget to include in other web sites. For example, this list of funny TV facts (which is really a Usenet classic). The first thing you need to do is find out its structure, either by looking at the source code of the page or by using Firebug.
If you right-click on the item in Firebug you can get the XPath to the element you want to reach – we’ll need this later. In this case the XPath is /html/body/ul/li[92], which gets us that single element. If we want all TV facts, we need to shorten this to //ul/li.
The next step is to go to the YQL console and enter the following statement.
select * from html where url='http://www.dcs.gla.ac.uk/~joy/fun/jokes/TV.html' and xpath='//ul/li'
This follows the syntax select * from html where url='{url}' and xpath='{xpath}'. This will result in YQL pulling the page content and giving it back to us as XML:
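If you want to build that statement in code rather than typing it by hand, a minimal sketch could look like the following. The helper name buildStatement is made up for this example and is not part of YQL; it also assumes that neither value contains a single quote, since it does no escaping.

```javascript
// Hypothetical helper (not part of YQL): fills the
// select * from html where url='{url}' and xpath='{xpath}' pattern.
// Assumption: url and xpath contain no single quotes (no escaping is done).
function buildStatement(url, xpath) {
  return "select * from html where url='" + url + "' and xpath='" + xpath + "'";
}

// The statement used in this tutorial
var statement = buildStatement(
  'http://www.dcs.gla.ac.uk/~joy/fun/jokes/TV.html',
  '//ul/li'
);
```

The resulting string is the same statement you would paste into the YQL console.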
Notice that YQL has inserted P elements in the results. This is because YQL runs the retrieved markup through HTML Tidy to clean up invalid HTML. This means that we need to alter our XPath to be //ul/li/p to get to the texts. The REST URL in this tutorial uses the shorter form //li/p.
The next step is to define the output format as JSON, define a callback function with the name funfacts, hit the test button, wait for the results and copy and paste the REST query.
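Copying the REST query from the console is the easiest way, but the URL can also be assembled by hand. The sketch below assumes the public endpoint and the q, format and callback parameters that appear in the REST URL used in this tutorial; buildRestUrl is a name invented for this example.

```javascript
// Hypothetical helper: assembles a YQL REST URL from a statement and
// the name of the JSON-P callback function.
// Assumption: endpoint and parameter names are taken from the REST URL
// shown in this tutorial, not from any other documentation.
function buildRestUrl(statement, callbackName) {
  return 'http://query.yahooapis.com/v1/public/yql' +
    '?q=' + encodeURIComponent(statement) +
    '&format=json' +
    '&callback=' + encodeURIComponent(callbackName);
}

var restUrl = buildRestUrl(
  "select * from html where url='http://www.dcs.gla.ac.uk/~joy/fun/jokes/TV.html' and xpath='//li/p'",
  'funfacts'
);
```

encodeURIComponent leaves characters such as *, ~ and the single quote untouched, which is why they also show up unencoded in the URL copied from the console.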
That’s all you need to do. You will now have the HTML as a JavaScript-readable object and all you need to do is to define a function called funfacts that gets the data from YQL and add another SCRIPT node with the REST URL you copied from YQL as the src attribute:
<script>
function funfacts(o){
}
</script>
<script src="http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D'http%3A%2F%2Fwww.dcs.gla.ac.uk%2F~joy%2Ffun%2Fjokes%2FTV.html'%20and%20xpath%3D'%2F%2Fli%2Fp'&format=json&callback=funfacts"></script>
The function will get the data from YQL as you were able to see in the console. Therefore getting to the TV facts is as easy as accessing o.query.results.p.
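To make the access path a bit more robust, the lookup can be wrapped in a small function. This is only a sketch: the sample object is invented for illustration and shows nothing beyond the query.results.p path, and the single-value case is an assumption about how a one-item result might come back.

```javascript
// Hypothetical helper: returns the facts as an array, or an empty
// array when the response carries no results.
function extractFacts(o) {
  if (!o || !o.query || !o.query.results) {
    return [];
  }
  var p = o.query.results.p;
  if (p === undefined || p === null) {
    return [];
  }
  // Assumption: a single match may arrive as one value rather than an array
  return Array.isArray(p) ? p : [p];
}

// Invented sample data, only the query.results.p path mirrors the tutorial
var sample = { query: { results: { p: ['First fact', 'Second fact'] } } };
var facts = extractFacts(sample);
```

A callback could then call extractFacts(o) and simply do nothing when the returned array is empty.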
The rest of the functionality is plain and simple DOM Scripting. Check the comments for explanations:
<div id="funfacts"><h2>Funny TV facts</h2><p><a href="http://www.dcs.gla.ac.uk/~joy/fun/jokes/TV.html">Some funny TV facts</a></p></div>
<script>
function funfacts(o){
  // get the DIV with the ID funfacts
  var facts = document.getElementById('funfacts');
  // only continue if the DIV exists and YQL returned results
  if(facts && o.query && o.query.results){
    // add a class for styling
    facts.className = 'js';
    // get the TV facts data returned from YQL
    var data = o.query.results.p;
    // get the original link and change its text content
    var link = facts.getElementsByTagName('a')[0];
    link.innerHTML = '(see all facts)';
    // create a container to host the TV fact and add it
    // to the main container DIV
    var out = document.createElement('p');
    out.className = 'fact';
    facts.insertBefore(out,link.parentNode);
    // this function gets a random fact from the dataset
    // and adds it as content to the element stored in out
    var seed = function(){
      var ran = Math.floor(Math.random()*data.length);
      out.innerHTML = data[ran];
    };
    // create a button to get another random fact and
    // add it to the container
    var b = document.createElement('button');
    b.innerHTML = 'get another fact';
    b.onclick = seed;
    link.parentNode.insertBefore(b,link);
    // show the first fact
    seed();
  }
}
</script>
<script src="http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D'http%3A%2F%2Fwww.dcs.gla.ac.uk%2F~joy%2Ffun%2Fjokes%2FTV.html'%20and%20xpath%3D'%2F%2Fli%2Fp'&format=json&callback=funfacts"></script>
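The random selection inside seed() can also be pulled out into a helper of its own, which makes it easy to try without a page. The sketch below is an illustration, not part of the demo: pickRandom is a made-up name, and the optional second argument is an assumption added so the random source can be swapped for a fixed one.

```javascript
// Hypothetical helper: returns one random item of an array.
// random is an optional function returning a number in [0, 1);
// it defaults to Math.random (the parameter is an assumption of this sketch).
function pickRandom(items, random) {
  var rnd = random || Math.random;
  // Math.floor maps [0, items.length) onto the valid indexes
  return items[Math.floor(rnd() * items.length)];
}

// Usage with invented data
var fact = pickRandom(['Fact one', 'Fact two', 'Fact three']);
```

Inside the widget, seed() would then only need to write the returned item into the out element.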
Add a bit of styling and you’ll end up with quite a cool little widget powered by the data on the jokes site. Check the source of the demo to see all the CSS needed.
That is all there is to it – get scraping!
Tags: development, javascript, scraping, widget, yql







August 25th, 2009 at 10:45 am
RT @codepo8: Tutorial: converting a web site into a widget using YQL and a few lines of JavaScript: [link to post]
– Posted using Chat Catcher
August 25th, 2009 at 11:34 am
(and well explained:-) RT @codepo8 Tutorial: converting a web site into a widget using YQL and a few lines of JavaScript: [link to post]
– Posted using Chat Catcher
August 25th, 2009 at 12:07 pm
That is so cool! I used to do something like this with curl, but this is much more elegant. Thanks for the tip.
Mihai
August 25th, 2009 at 3:35 pm
RT: @acarlos1000 @codepo8: Tutorial: converting a web site into a widget using YQL and a few lines of JavaScript: [link to post]
– Posted using Chat Catcher
August 25th, 2009 at 4:45 pm
How to create your own widget with YQL [link to post] from @codepo8 Amazingly simple!
– Posted using Chat Catcher
August 25th, 2009 at 6:02 pm
@iDevGeek [link to post] – sweet, really smart to leverage mental-models from related technologies – eg jQuery & css selectors #YQL
– Posted using Chat Catcher
August 25th, 2009 at 7:01 pm
good tutorial on scraping html with YQL [link to post]
– Posted using Chat Catcher
August 26th, 2009 at 12:40 am
Tutorial: scraping and turning a web site into a widget with YQL [link to post]
– Posted using Chat Catcher
August 26th, 2009 at 1:42 pm
I have been implementing a similar toolkit, MetaSeeker, whose code is open and can be downloaded from http://www.gooseeker.com. I take a different approach than YQL to extract and integrate Web data. I like to integrate the Web data on the client side, where specific applications are developed and overlaid onto the web browser, e.g. Firefox. I think copyright issues could be avoided this way. I am afraid the YQL way is not good because the contents are redistributed. At the same time, performance might be a concern because all requests for the contents are proxied by YQL’s servers. With MetaSeeker, the data structures are stored separately on a semantic layer and can be accessed by all semantic-aware applications on the client side. They work as the accelerators of IE do.
August 26th, 2009 at 5:19 pm
Using the following example, any idea how to use YQL to get only the ‘Good’ xxxx values?
[tr]
…[td][span]Good[/span][/td]
…[td][span]xxxx[/span][/td]
[/tr]
[tr]
…[td][span]Bad[/span][/td]
…[td][span]xxxx[/span][/td]
[/tr]
Thanks…
August 26th, 2009 at 6:31 pm
I’ve been playing around with Yahoo Pipes a bit, since reading about it and YQL here. I’m liking it so far, but I think I’m running into some limitations. :-/ As an idea for a further post slash a question for you, do you know if it’s possible to combine a number of items into 1 item – say, union several feeds and then filter them and output the filtered elements as a single RSS item rather than 10 or however many matched the filter? I’ve been messing with it, but so far no luck. It’s still a cool service with unplumbed depths I’m enjoying learning. :-)
August 27th, 2009 at 3:42 am
good tutorial
thanks bro
August 27th, 2009 at 8:33 pm
How to create a nice little screen-scraping widget with YQL #javascript [link to post]
– Posted using Chat Catcher
August 29th, 2009 at 11:41 pm
@doriantaylor btw, you can transclude with js + yql [link to post]
– Posted using Chat Catcher
August 29th, 2009 at 11:42 pm
I love it, but what about security, what if at xpath ‘//ul/li’ somebody adds later some inline javascript … would YQL have a secure tidy/xpath option, that would clean-up the HTML fragment? hmmm, I’ll probably for the moment use an XSL to whitelist what nodes/attributes I permit.
August 30th, 2009 at 12:37 am
@gridinoc it occurs to me that transcluding markup would in fact be safer because it’s deterministic – you can just not execute scripts.
– Posted using Chat Catcher
August 30th, 2009 at 12:38 am
@gridinoc as you just pointed out you can transclude with js+yql which is turing-complete and therefore subject to the halting problem.
– Posted using Chat Catcher
August 30th, 2009 at 1:38 am
@doriantaylor I would prefer to sandbox rather than filter elements/attributes
– Posted using Chat Catcher
August 30th, 2009 at 2:41 am
@gridinoc ah, see i’d rather just not execute ad-hoc code whose behaviour i can’t predict.
– Posted using Chat Catcher
August 30th, 2009 at 3:39 am
@doriantaylor I agree, but filtering code is a tedious continuos task, the hacking community is very creative in encoding+escaping things.
– Posted using Chat Catcher
August 30th, 2009 at 4:36 am
@gridinoc it’s simple: you don’t execute script tags or on* attributes in transcluded markup. done deal. honest.
– Posted using Chat Catcher
August 30th, 2009 at 5:37 am
@doriantaylor I would prefer a DOM implementation where I can tag a sub-tree as non-executable, akin to mounting a file system with noexec
– Posted using Chat Catcher
August 30th, 2009 at 5:38 am
@doriantaylor it is simple, you can white or blacklist and it would work until somebody will trick your parser …
– Posted using Chat Catcher
August 30th, 2009 at 6:37 am
@gridinoc what i’m saying is that x?html doesn’t possess the semantics to describe anything tricky whereas JS is provably unpredictable.
– Posted using Chat Catcher
August 30th, 2009 at 7:41 am
@doriantaylor I assume that is too hard to ward off the js from arbitrary html, therefore I would rather sandbox than tidy.
– Posted using Chat Catcher
August 30th, 2009 at 8:36 am
@gridinoc no, really, you just don’t execute the contents of script tags, on* attributes or javascript: URIs. it’s as simple as that.
– Posted using Chat Catcher
August 30th, 2009 at 9:35 am
@doriantaylor and filter all javascript: URIs, and do it for all the imaginable encodings/escapes: http://ha.ckers.org/xss.html
– Posted using Chat Catcher
August 30th, 2009 at 10:35 am
@gridinoc like imagine an in-markup transclusion mechanism that indiscriminately marks all transcluded script content inert.
– Posted using Chat Catcher
August 30th, 2009 at 10:36 am
@gridinoc no, not filter, just not run. i’m not talking about filtering anything, just zero activity.
– Posted using Chat Catcher
August 30th, 2009 at 11:36 am
@doriantaylor well, I doubt the current browsers would allow some granularity at DOM level of where to stop scripting and just keep render.
– Posted using Chat Catcher
August 30th, 2009 at 11:37 am
@doriantaylor I wonder how simple would be browser makers to do it, hmmm, let’s ping azaaza
– Posted using Chat Catcher
August 30th, 2009 at 12:36 pm
@gridinoc @noscript already basically does this.
– Posted using Chat Catcher
August 30th, 2009 at 1:36 pm
@doriantaylor ask @noscript if they can do it on DOM sub-trees and not on provenance only
– Posted using Chat Catcher
August 30th, 2009 at 2:36 pm
@gridinoc based on my current understanding it should be possible to guarantee that behaviour.
– Posted using Chat Catcher
August 30th, 2009 at 3:36 pm
@doriantaylor you made me remember some old private notes I made on transclusion and RDFa … now digging for those scraps
– Posted using Chat Catcher
August 31st, 2009 at 1:13 am
@crustyadventure codepo8 Tutorial: converting a web site into a widget using YQL and a few lines of JavaScript: [link to post]
– Posted using Chat Catcher
September 8th, 2009 at 3:59 am
i get the error [object Object]
when selecting divs with a
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.faberludens.com.br%22%20and%20xpath%3D'%2F%2Fdiv%5B%40id%3D%22block-views-featured_projects%22%5D%2F%2Fdiv%2F%2Ful%2F%2Fli%2F%2Fdiv%5B2%5D'&format=json&callback=funfacts
the sql seems correct on yahoo Yql query
help please!
November 16th, 2009 at 11:16 am
@dmje BTW, here is @codepo8’s tutorial on scraping HTML with YQL: [link to post]
– Posted using Chat Catcher