Freelancer's Playground! Learn Programming, The Freelancer's Way

21Jan/100

Web Scraping Made Easy, With YQL

Usually, when someone asks me, "How do you scrape data from a certain website?", my answer is to use some REGEXes. However, Yahoo now has something interesting on its developer site. Instead of writing a REGEX, we can just QUERY our target page with YQL and extract the data as XML or JSON.

What? YQL? Hmm... Care to share some examples? Why not. Follow me.

For our use case, let's say I'm interested in MANGA and I want to show the currently released MANGA on Manga Stream. All I need to do is examine the XPath of the new release section on that site.

To examine the XPath, I use XPather for Firefox.


Right-click on the section you want and choose Show in XPather. Another Firefox window will appear.

The XPather window has three parts:
  1. This is where your XPath statement goes.
  2. Use this to evaluate your XPath statement.
  3. This is where the results appear.

Here's a good and comprehensive tutorial on XPath.

For my purpose, the XPath is /html/body/div[@id='contentwrap']/div[@id='contentwrap-inner']/div[2]/ul/li[*]
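To get a feel for what an XPath like this selects, here is a minimal Python sketch. The SAMPLE_PAGE markup is made up for illustration (it is not Manga Stream's real HTML), and select_release_items is a hypothetical helper name. It uses xml.etree.ElementTree from the standard library, which only understands a subset of XPath, so the path starts with .// instead of /html/body/, and the trailing [*] (li elements that have at least one child element) is applied in Python instead.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that mimics the structure the XPath expects.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="contentwrap">
      <div id="contentwrap-inner">
        <div><h2>Header</h2></div>
        <div>
          <ul>
            <li><a href="/read/one">Title One</a></li>
            <li><a href="/read/two">Title Two</a></li>
            <li>plain text item</li>
          </ul>
        </div>
      </div>
    </div>
  </body>
</html>
"""


def select_release_items(markup):
    """Return the <li> elements matched by the release-list XPath."""
    root = ET.fromstring(markup.strip())
    # ElementTree paths are relative, hence the leading ".//".
    path = ".//div[@id='contentwrap']/div[@id='contentwrap-inner']/div[2]/ul/li"
    # Emulate the li[*] predicate: keep only <li> elements with a child element.
    return [li for li in root.findall(path) if len(li) > 0]


if __name__ == "__main__":
    for item in select_release_items(SAMPLE_PAGE):
        link = item.find("a")
        print(link.text, link.get("href"))
```

Note that ElementTree needs well-formed markup, so this is only a way to experiment with the path, not a replacement for XPather on a real page.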

Now that we have it, let's head to the YQL console to try it out.

  • Go to http://developer.yahoo.com/yql/console/
  • From the bottom-right panel, choose the "Data Tables" panel
  • Then, choose "Data" and select "html"
  • In "Your YQL Statement", replace the text with: select * from html where url="http://mangastream.com/" and xpath='//div[@id=\'contentwrap\']/div[@id=\'contentwrap-inner\']/div[2]/ul/li[*]'
    Please note that I removed /html/body/ from my XPath, because YQL can't evaluate the statement if I include it. I don't know why. If you know, please tell me in the comment form below.
  • Next, hit the "test" button and examine the result.
  • If it's OK, just click "COPY URL" in the REST query panel and use it in your script. FYI, you can choose XML or JSON for the result. Choose whichever suits you best.
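
If you would rather build that URL in your script than copy it from the console, the steps above can be sketched in Python roughly like this. The endpoint in YQL_ENDPOINT is my assumption of what the public REST address looks like, so compare it with what "COPY URL" gives you, and build_yql_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: the public YQL REST endpoint. Check it against the URL
# that the console's "COPY URL" button produces.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(page_url, xpath, fmt="json"):
    """Build a REST URL that runs an html-table YQL query.

    fmt is "json" or "xml", matching the two result formats in the console.
    """
    # The XPath sits inside a single-quoted YQL string, so escape its quotes.
    escaped_xpath = xpath.replace("'", "\\'")
    statement = 'select * from html where url="{}" and xpath=\'{}\''.format(
        page_url, escaped_xpath
    )
    # urlencode takes care of percent-encoding the statement.
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})


if __name__ == "__main__":
    print(build_yql_url(
        "http://mangastream.com/",
        "//div[@id='contentwrap']/div[@id='contentwrap-inner']/div[2]/ul/li[*]",
    ))
```

The printed URL can then be fetched from your script, for example with urllib.request.urlopen, and the response parsed as JSON or XML depending on the format you chose.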