Web Scraping Made Easy, With YQL
Usually, when someone asks me "How do you scrape data from a certain website?", my answer is to use some regular expressions. However, Yahoo now has something interesting on its developer site. Instead of writing regexes, we can just query our target page with YQL and extract the data as XML or JSON.
What? YQL? Hmm... Care to share some examples? Why not. Follow me.
For our use case, let's say I'm interested in manga and I want to show the latest manga releases on Manga Stream. All I need to do is examine the XPath of the new-release section on that site.
To examine XPath, I use XPather for Firefox.
Right-click on the section you want and choose Show in XPather. Another Firefox window will appear.
- This is where your XPath statement is.
- Use it to evaluate your XPath statement.
- This is where the results appear.
Here's a good and comprehensive tutorial on XPath.
For my purpose, the XPath is /html/body/div[@id='contentwrap']/div[@id='contentwrap-inner']/div[2]/ul/li[*]
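If you'd like to double-check an XPath like this outside the browser, here's a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample markup, the link inside each li, and the names SAMPLE_PAGE, RELEASES_PATH and extract_releases are all my assumptions for illustration, not the real Manga Stream page. ElementTree only understands a subset of XPath, so the path is written relative to the html root and the trailing [*] is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page (NOT the real markup).
SAMPLE_PAGE = """
<html>
  <body>
    <div id="contentwrap">
      <div id="contentwrap-inner">
        <div>sidebar</div>
        <div>
          <ul>
            <li><a href="/read/one">Title One 1</a></li>
            <li><a href="/read/two">Title Two 2</a></li>
          </ul>
        </div>
      </div>
    </div>
  </body>
</html>
"""

# Same steps as the XPath above, relative to the <html> root element.
RELEASES_PATH = (
    "body/div[@id='contentwrap']/div[@id='contentwrap-inner']/div[2]/ul/li"
)


def extract_releases(page_source):
    """Return (title, href) pairs for every <li> matched by RELEASES_PATH."""
    root = ET.fromstring(page_source.strip())
    releases = []
    for item in root.findall(RELEASES_PATH):
        link = item.find("a")  # assumed: each entry wraps a single link
        if link is not None:
            releases.append((link.text, link.get("href")))
    return releases


if __name__ == "__main__":
    for title, href in extract_releases(SAMPLE_PAGE):
        print(title, href)
```

Keep in mind that real-world HTML is often not well-formed XML, so ET.fromstring may refuse an actual page. This only shows how the path steps map onto the document tree.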
Now that we have it, let's head to the YQL console to try it out.
- Go to http://developer.yahoo.com/yql/console/
- From the bottom-right panel, choose the "Data Tables" panel
- Then, choose "Data" and select "html"
- In "Your YQL Statement", replace the text with:
select * from html where url="http://mangastream.com/" and xpath='//div[@id=\'contentwrap\']/div[@id=\'contentwrap-inner\']/div[2]/ul/li[*]'
Please note that I removed /html/body/ from my XPath, because YQL can't evaluate the statement if I include it. I don't know why. If you know, please tell me in the comment form below.
- Your next step: hit the "test" button and examine the result.
- If it's OK, just click "COPY URL" in the REST query panel and use it in your script. FYI, you can choose XML or JSON for the result. Choose whichever suits you best.
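
To use the copied URL in a script, something along these lines should work. This is only a rough Python sketch: the endpoint http://query.yahooapis.com/v1/public/yql, the q and format parameters, and the query/results layout of the JSON are assumptions on my part (check them against the URL and output the console gives you), and build_yql_url and parse_results are made-up names. The statement here uses double quotes inside the XPath instead of the escaped single quotes above.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoint; prefer the URL copied from the YQL console.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"

XPATH = '//div[@id="contentwrap"]/div[@id="contentwrap-inner"]/div[2]/ul/li[*]'
STATEMENT = (
    "select * from html where url='http://mangastream.com/' "
    "and xpath='{}'".format(XPATH)
)


def build_yql_url(statement, fmt="json"):
    """URL-encode the YQL statement into a REST query URL."""
    return YQL_ENDPOINT + "?" + urlencode({"q": statement, "format": fmt})


def parse_results(body, tag="li"):
    """Pull the matched elements out of a JSON response body.

    Assumes the layout {"query": {"results": {tag: ...}}}, where the
    value is a list for several matches or a single object for one.
    """
    payload = json.loads(body)
    results = payload.get("query", {}).get("results")
    if not results:
        return []
    items = results.get(tag, [])
    return items if isinstance(items, list) else [items]


if __name__ == "__main__":
    print(build_yql_url(STATEMENT))
```

Fetching is then a matter of passing build_yql_url(STATEMENT) to urllib.request.urlopen and handing the response body to parse_results.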







