Getting the Data: OutWit Hub

So, after weeks of wrestling with Yahoo Pipes and ScraperWiki, I finally succumbed to the pay model of OutWit Hub (more on this below).

I know, I know – I sold out. However, the £25 (or so) for the full version of this web scraper tool has saved me literally hours of work and means I can get stuck into the data itself.

I was also reluctant to take up the several Python and Yahoo Pipes experts on their offers of assistance, as I would have had to rely on them too much at this stage.

NOTE: I fully intend to return to this problem at a later date and finally crack it, as it has frustrated me no end … Yahoo Pipes and ScraperWiki, it’s not over!

OutWit Hub

[Image: OutWit Hub’s left-hand column, showing the search options available]

OutWit Hub is a web data tool (a limited version is available free as a Firefox extension) that offers a feast of insights into a page, including its images, lists, tables and links (see image).

There is also the option to set up your own scraper by setting search terms at the beginning and end of the data that you want.
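
For anyone who wants to see the "marker before, marker after" idea outside of OutWit Hub, here is a minimal Python sketch of it. This is not how OutWit Hub works internally (I don't know that); it is just the same principle done with a regular expression. The function name extract_between, the sample HTML and the artist and album names are all made up for illustration, and a real chart page will have different markup.

```python
import re

def extract_between(html, start_marker, end_marker):
    """Return every piece of text found between start_marker and end_marker."""
    # re.escape treats the markers as literal text; (.*?) grabs as little as possible.
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    return [found.strip() for found in re.findall(pattern, html, flags=re.DOTALL)]

# Hypothetical page fragment (assumption: the real site's markup will differ).
sample_html = """
<td class="artist">Example Artist A</td><td class="title">Example Album A</td>
<td class="artist">Example Artist B</td><td class="title">Example Album B</td>
"""

artists = extract_between(sample_html, '<td class="artist">', "</td>")
titles = extract_between(sample_html, '<td class="title">', "</td>")
print(list(zip(artists, titles)))
```

The trick is picking markers that appear right next to the data you want and nowhere else on the page, which is much the same job as choosing the start and end terms in the scraper set-up.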

The information drops into a window, which can be automatically filtered and moved with a “catch” function into another window.

I wanted to scrape a large number of pages. OutWit Hub has a function that lets a scraper run over an entire site by following the “next” page link, but the site I was scraping did not have this “next” structure.

This was my solution:

  • I asked OutWit Hub to search the entire site for ALL the URLs of the pages I needed, specifying them by a common term. (This required me to purchase the full version of OutWit Hub, as the free version limits you to 100 items within a search, far fewer than I required.)
  • I then set up a scraper and asked it to search these URLs for artist, album title and position. (However, I knew that at a later stage I would also need the DATE of the chart, something I would have to manipulate from the URL.)
  • Once the scraper had found the correct information I saved it and exported it into Excel.
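
The steps above could be sketched in Python roughly as follows. To be clear, this is not what I actually ran (OutWit Hub did the work for me); it is a rough outline of the same three ideas: keep only the URLs containing a common term, pull the chart date out of the URL, and write rows to a CSV file that Excel can open. The example.com addresses, the "albums-chart" term, the YYYYMMDD date format and the placeholder rows are all assumptions for illustration.

```python
import csv
import io
import re
from datetime import date

def filter_urls(urls, common_term):
    """Keep the URLs that contain common_term, in order, without repeats."""
    seen = set()
    kept = []
    for url in urls:
        if common_term in url and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

def chart_date_from_url(url):
    """Read a YYYYMMDD date out of a URL (assumed format); None if absent."""
    match = re.search(r"(\d{4})(\d{2})(\d{2})", url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)

def rows_to_csv(rows):
    """Turn scraped rows into CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "position", "artist", "album"])
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical links, standing in for everything found on the site.
all_links = [
    "http://example.com/about",
    "http://example.com/albums-chart/20100103",
    "http://example.com/albums-chart/20100110",
    "http://example.com/albums-chart/20100103",  # repeat, should be dropped
]

chart_urls = filter_urls(all_links, "albums-chart")

# Placeholder rows; a real run would scrape position, artist and album per page.
rows = [
    (chart_date_from_url(url).isoformat(), 1, "Example Artist", "Example Album")
    for url in chart_urls
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Saving csv_text to a file ending in .csv gives something Excel will open directly, with the chart date already sitting in its own column.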