Tutorial: scraping and turning a web site into a widget with YQL

During the mentoring sessions at last weekend’s Young Rewired State, one of the most-asked questions was how you can easily re-use content on the web. The answer I gave was to use YQL, and I promised a short introduction to the topic, so here it is. What we are going to do here and now is turn a web site into a widget with YQL and a few lines of JavaScript:

[Image: Turning a web page into a widget with YQL]

Say you have a web site with a list of content and you want to turn it into a widget to include in other web sites. For example, this list of funny TV facts (which is really a Usenet classic). The first thing you need to do is find out its structure, either by looking at the source code of the page or by using Firebug:

[Image: Finding out the HTML structure by using Firebug]

If you right-click on the item in Firebug you can get the XPath to the element you want to reach – we’ll need this later. In this case the XPath is /html/body/ul/li[92], which gets us that single element. If we want all TV facts, then we need to shorten this to //ul/li.

[Image: Copying the XPath in Firebug]

The next step is to go to the YQL console and enter the following statement.

select * from html where url='http://www.dcs.gla.ac.uk/~joy/fun/jokes/TV.html' and xpath='//ul/li'

This follows the syntax select * from html where url='{url}' and xpath='{xpath}'. This will result in YQL pulling the page content and giving it back to us as XML:
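If you want to build that statement in code rather than typing it into the console, a minimal sketch could look like the following. Note that buildHtmlQuery is a made-up helper name for illustration and not part of YQL; it only fills in the template shown above.

```javascript
// Hypothetical helper (not part of YQL): fills the
// select * from html where url='{url}' and xpath='{xpath}' template.
function buildHtmlQuery(url, xpath) {
  // The statement wraps both values in single quotes, so a single quote
  // inside either value would break it; this sketch simply refuses those.
  if (url.indexOf("'") !== -1 || xpath.indexOf("'") !== -1) {
    throw new Error('single quotes are not supported in this sketch');
  }
  return "select * from html where url='" + url + "' and xpath='" + xpath + "'";
}

// The same statement as the one entered in the console above.
var statement = buildHtmlQuery(
  'http://www.dcs.gla.ac.uk/~joy/fun/jokes/TV.html',
  '//ul/li'
);
```

Refusing single quotes is a deliberate shortcut to keep the sketch small; it is not a statement about how YQL itself handles quoting.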

[Image: Yahoo! Query Language - YDN]

Notice that YQL has inserted P elements in the results. This is because YQL runs the retrieved HTML through HTML Tidy to remove invalid markup. This means that we need to alter our XPath to //ul/li/p to get to the text.

The next step is to define the output format as JSON, define a callback function with the name funfacts, hit the test button, wait for the results and copy and paste the REST query.

[Image: Steps to get the HTML in the right format]

That’s all you need to do. You will now have the HTML as a JavaScript-readable object and all you need to do is to define a function called funfacts that gets the data from YQL and add another SCRIPT node with the REST URL you copied from YQL as the src attribute:

<script>
function funfacts(o){
}
</script>  
<script src="http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D'http%3A%2F%2Fwww.dcs.gla.ac.uk%2F~joy%2Ffun%2Fjokes%2FTV.html'%20and%20xpath%3D'%2F%2Fli%2Fp'&format=json&callback=funfacts"></script>
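The src URL above is nothing more than the YQL statement, URL-encoded and sent to the public endpoint together with the format and callback parameters. If you would rather assemble it yourself than copy it from the console, here is a rough sketch. buildRestUrl and YQL_ENDPOINT are names I made up for this example; the endpoint and parameters are the ones visible in the URL above.

```javascript
// Endpoint taken from the REST query shown above.
var YQL_ENDPOINT = 'http://query.yahooapis.com/v1/public/yql';

// Hypothetical helper: turns a YQL statement and a callback name
// into a JSON-P URL that can be used as the src of a SCRIPT node.
function buildRestUrl(statement, callbackName) {
  return YQL_ENDPOINT +
    '?q=' + encodeURIComponent(statement) +
    '&format=json' +
    '&callback=' + encodeURIComponent(callbackName);
}

var restUrl = buildRestUrl(
  "select * from html where url='http://www.dcs.gla.ac.uk/~joy/fun/jokes/TV.html' and xpath='//li/p'",
  'funfacts'
);
```

encodeURIComponent leaves characters such as the asterisk, the tilde and single quotes alone, which is why they also show up unencoded in the URL copied from the console.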

The function will get the data from YQL as you were able to see in the console. Therefore getting to the TV facts is as easy as accessing o.query.results.p.
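To make that path a bit more robust, you could wrap the lookup in a small helper. This is only a sketch under two assumptions of mine that the console output above does not show: that results comes back as null when the XPath matches nothing, and that a single match arrives as a bare value instead of an array. extractFacts is a hypothetical name.

```javascript
// Hypothetical helper: always returns the facts as an array.
function extractFacts(o) {
  // Assumption: results is null (or missing) when nothing matched.
  if (!o || !o.query || !o.query.results) {
    return [];
  }
  var p = o.query.results.p;
  if (p === undefined || p === null) {
    return [];
  }
  // Assumption: a single match is not wrapped in an array.
  return Array.isArray(p) ? p : [p];
}

// Hand-written sample in the shape described above, not real YQL output.
var sample = { query: { results: { p: ['First fact', 'Second fact'] } } };
var facts = extractFacts(sample);
```

With this in place, the callback can work with extractFacts(o) and does not have to care which of the three shapes it received.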

The rest of the functionality is plain and simple DOM Scripting. Check the comments for explanations:

<div id="funfacts"><h2>Funny TV facts</h2><p><a href="http://www.dcs.gla.ac.uk/~joy/fun/jokes/TV.html">Some funny TV facts</a></p></div>
<script>
function funfacts(o){
  // get the DIV with the ID funfacts
  var facts = document.getElementById('funfacts');
  // if it exists
  if(facts){
    // add a class for styling
    facts.className = 'js';
    // get the TV facts data returned from YQL
    var data = o.query.results.p;
    // get the original link and change its text content
    var link = facts.getElementsByTagName('a')[0];
    link.innerHTML = '(see all facts)';
    // create a container to host the TV fact and add it 
    // to the main container DIV
    var out = document.createElement('p');
    out.className = 'fact';
    facts.insertBefore(out,link.parentNode);
    // this function gets a random fact from the dataset
    // and adds it as content to the element stored in out
    function seed(){
      var ran = Math.floor(Math.random()*data.length);
      out.innerHTML = data[ran];
    }
    // create a button to get another random fact and 
    // add it to the container
    var b = document.createElement('button');
    b.innerHTML = 'get another fact';
    b.onclick = seed;
    link.parentNode.insertBefore(b,link);
    // call the first fact
    seed();
  }
}
</script>  
<script src="http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D'http%3A%2F%2Fwww.dcs.gla.ac.uk%2F~joy%2Ffun%2Fjokes%2FTV.html'%20and%20xpath%3D'%2F%2Fli%2Fp'&format=json&callback=funfacts"></script>
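One thing to keep in mind: the script above writes data[ran] straight into innerHTML, which works for this list because every fact is a plain string. If you scrape a page where the matched elements contain links or other markup, the entries are likely to be objects rather than strings, and you may not want to trust the markup anyway. The following is a cautious sketch; factToText and escapeHtml are hypothetical helpers, and the content property is my assumption about how YQL represents the text of such elements.

```javascript
// Hypothetical helper: escapes the characters that matter in HTML
// so scraped text cannot inject markup or scripts.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical helper: gets displayable text out of one result item.
function factToText(item) {
  if (typeof item === 'string') {
    return item;
  }
  // Assumption: elements with attributes or children arrive as objects
  // that keep their own text in a "content" property.
  if (item && typeof item.content === 'string') {
    return item.content;
  }
  return '';
}

var safe = escapeHtml(factToText({ content: '<b>TV</b> & film' }));
```

Inside seed() you would then write out.innerHTML = escapeHtml(factToText(data[ran])); instead of assigning data[ran] directly.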

Add a bit of styling and you’ll end up with quite a cool little widget powered by the data on the jokes site. Check the source of the demo to see all the CSS needed.

That is all there is to it – get scraping!


38 Responses to “Tutorial: scraping and turning a web site into a widget with YQL”

  1. ossreleasefeed (Schalk Neethling) Says:


    RT @codepo8: Tutorial: converting a web site into a widget using YQL and a few lines of JavaScript: [link to post]

    Posted using Chat Catcher

  2. guillelamb (Guillermo Álvaro Rey) Says:


    (and well explained:-) RT @codepo8 Tutorial: converting a web site into a widget using YQL and a few lines of JavaScript: [link to post]

    Posted using Chat Catcher

  3. mihaitodor Says:

    That is so cool! I used to do something like this with curl, but this is much more elegant. Thanks for the tip.

    Mihai

  4. daniloefbento (Danilo Bento) Says:


    RT: @acarlos1000 @codepo8: Tutorial: converting a web site into a widget using YQL and a few lines of JavaScript: [link to post]

    Posted using Chat Catcher

  5. iDevGeek (Stanislav Cmakal) Says:


    How to create your own widget with YQL [link to post] from @codepo8 Amazingly simple!

    Posted using Chat Catcher

  6. johnhunter (John Hunter) Says:


    @iDevGeek [link to post] – sweet, really smart to leverage mental-models from related technologies – eg jQuery & css selectors #YQL

    Posted using Chat Catcher

  7. joedag32 (joedag32) Says:


    good tutorial on scraping html with YQL [link to post]

    Posted using Chat Catcher

  8. rvramesh (Ramesh Vijayaraghava) Says:


    Tutorial: scraping and turning a web site into a widget with YQL [link to post]

    Posted using Chat Catcher

  9. Fuller Says:

    I have been implementing a similar toolkit, MetaSeeker whose codes are open and can be downloaded from http://www.gooseeker.com. I take a different approach than YQL to extract and integrated Web data. I like to integrate the Web data on the client side where specific applications are developed and overlayed onto the web browser, e.g. Firefox. I think things about copyright could be avoided by this way. I am afraid the way like YQL was not good because the contents are redistributed. At the same time, the performance might be concerned because all requests for the contents are proxyed by YQL’s servers. With MetaSeeker, the data structures are stored separatedly on a semantic layer and can be accessed by all semantic-aware applications on the client sides. They work as the accelerators of IE do.

  10. G. Says:

    Using the following example, any idea how to use YQL to get only the ‘Good’ xxxx values?

    [tr]
    …[td][span]Good[/span][/td]
    …[td][span]xxxx[/span][/td]
    [/tr]
    [tr]
    …[td][span]Bad[/span][/td]
    …[td][span]xxxx[/span][/td]
    [/tr]

    Thanks…

  11. Rhyaniwyn Says:

    I’ve been playing around with Yahoo Pipes a bit, since reading about it and YQL here. I’m liking it so far, but I think I’m running into some limitations. :-/ As an idea for a further post slash a question for you, do you know if it’s possible to combine a number of items into 1 item — say, union several feeds and then filter them and output the filtered elements as a single RSS item rather than 10 or however many matched the filter? I’ve been messing with it, but so far no luck. It’s still a cool service with unplumbed depths I’m enjoying learning. :-)

  12. rozy Says:

    good tutorial
    thanks bro

  13. 5tuarth (Stuart Homfray) Says:


    How to create a nice little screen-scraping widget with YQL #javascript [link to post]

    Posted using Chat Catcher

  14. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor btw, you can transclude with js + yql [link to post]

    Posted using Chat Catcher

  15. Laurian Gridinoc Says:

    I love it, but what about security, what if at xpath ‘//ul/li’ somebody adds later some inline javascript … would YQL have an secure tidy/xpath option, that would clean-up the HTML fragment? hmmm, I’ll probably for the moment use an XSL to whitelist what nodes/attributes I permit.

  16. doriantaylor (Dorian Taylor) Says:


    @gridinoc it occurs to me that transcluding markup would in fact be safer because it’s deterministic – you can just not execute scripts.

    Posted using Chat Catcher

  17. doriantaylor (Dorian Taylor) Says:


    @gridinoc as you just pointed out you can transclude with js+yql which is turing-complete and therefore subject to the halting problem.

    Posted using Chat Catcher

  18. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor I would prefer to sandbox rather than filter elements/attributes

    Posted using Chat Catcher

  19. doriantaylor (Dorian Taylor) Says:


    @gridinoc ah, see i’d rather just not execute ad-hoc code whose behaviour i can’t predict.

    Posted using Chat Catcher

  20. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor I agree, but filtering code is a tedious continuos task, the hacking community is very creative in encoding+escaping things.

    Posted using Chat Catcher

  21. doriantaylor (Dorian Taylor) Says:


    @gridinoc it’s simple: you don’t execute script tags or on* attributes in transcluded markup. done deal. honest.

    Posted using Chat Catcher

  22. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor I would prefer a DOM implementation where I can tag a sub-tree as non-executable, akin to mounting a file system with noexec

    Posted using Chat Catcher

  23. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor it is simple, you can white or blacklist and it would work until somebody will trick your parser …

    Posted using Chat Catcher

  24. doriantaylor (Dorian Taylor) Says:


    @gridinoc what i’m saying is that x?html doesn’t possess the semantics to describe anything tricky whereas JS is provably unpredictable.

    Posted using Chat Catcher

  25. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor I assume that is too hard to ward off the js from arbitrary html, therefore I would rather sandbox than tidy.

    Posted using Chat Catcher

  26. doriantaylor (Dorian Taylor) Says:


    @gridinoc no, really, you just don’t execute the contents of script tags, on* attributes or javascript: URIs. it’s as simple as that.

    Posted using Chat Catcher

  27. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor and filter all javascript: URIs, and do it for all the imaginable encodings/escapes: http://ha.ckers.org/xss.html

    Posted using Chat Catcher

  28. doriantaylor (Dorian Taylor) Says:


    @gridinoc like imagine an in-markup transclusion mechanism that indiscriminately marks all transcluded script content inert.

    Posted using Chat Catcher

  29. doriantaylor (Dorian Taylor) Says:


    @gridinoc no, not filter, just not run. i’m not talking about filtering anything, just zero activity.

    Posted using Chat Catcher

  30. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor well, I doubt the current browsers would allow some granularity at DOM level of where to stop scripting and just keep render.

    Posted using Chat Catcher

  31. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor I wonder how simple would be browser makers to do it, hmmm, let’s ping azaaza

    Posted using Chat Catcher

  32. doriantaylor (Dorian Taylor) Says:


    @gridinoc @noscript already basically does this.

    Posted using Chat Catcher

  33. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor ask @noscript if they can do it on DOM sub-trees and not on provenance only

    Posted using Chat Catcher

  34. doriantaylor (Dorian Taylor) Says:


    @gridinoc based on my current understanding it should be possible to guarantee that behaviour.

    Posted using Chat Catcher

  35. gridinoc (Laurian Gridinoc) Says:


    @doriantaylor you made me remember some old private notes I made on transclusion and RDFa … now digging for those scraps

    Posted using Chat Catcher

  36. aussie_ian (aussie_ian) Says:


    @crustyadventure codepo8Tutorial: converting a web site into a widget using YQL and a few lines of JavaScript: [link to post]

    Posted using Chat Catcher

  37. edyd Says:

    i get the error [object Object]
    when selecting divs with a

    http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.faberludens.com.br%22%20and%20xpath%3D’%2F%2Fdiv%5B%40id%3D%22block-views-featured_projects%22%5D%2F%2Fdiv%2F%2Ful%2F%2Fli%2F%2Fdiv%5B2%5D’&format=json&callback=funfacts

    the sql seems correct on yahoo Yql query
    help please!

  38. pekingspring (Evil Jim O'Donnell) Says:


    @dmje BTW, here is @codepo8’s tutorial on scraping HTML with YQL: [link to post]

    Posted using Chat Catcher


Wait till I come! is the blog of Christian Heilmann, a developer evangelist living and working in London, England.
