December 02, 2007

Automating web scraping and archiving

I recently needed to scrape the contents of a large number of non-standard HTML pages and output the results in a different format. That required pulling each page, locating specific DOM elements, and saving the results to a new file. I posed the problem to the A2B3 mailing list and got this series of detailed responses:

Ed V., Mark R., and Brian K. suggested Beautiful Soup, a Python library that did the job quite nicely. The language just makes sense, and the library allowed me to write the script with minimal effort. From the Beautiful Soup website:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
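
For the curious, here is a minimal sketch of the kind of script this turned into. The URLs and the "content" element id are placeholders, not the actual pages I scraped, and it uses the current bs4 package and Python 3 rather than the 2007-era imports:

    # Minimal sketch: pull each page, locate a DOM element, save its text.
    # The URLs and the "content" id are hypothetical placeholders.
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    urls = [
        "http://example.com/page1.html",
        "http://example.com/page2.html",
    ]

    for i, url in enumerate(urls):
        html = urlopen(url).read()                  # pull the page
        soup = BeautifulSoup(html, "html.parser")   # parse even sloppy markup
        target = soup.find("div", id="content")     # locate the element we care about
        if target is None:
            continue                                # skip pages missing the element
        with open("page-%d.txt" % i, "w", encoding="utf-8") as out:
            out.write(target.get_text())            # save the extracted text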

Several other tools in different languages were forwarded to me, none of which I have yet tried:

DCB recommended Snoopy, a PHP class for automating content retrieval.

JHI recommended Hpricot, a Ruby parser. Mechanize, another library, can fill out and submit forms. A third library, scRUBYt, combines the features of Hpricot and Mechanize. The list of use-cases for scRUBYt is impressive:

  • scraping on-line bookstores for price comparison
  • monitoring eBay or any other web shop for a specific item and price
  • automatically gathering news, events, weather, and other information
  • metasearch
  • checking if a website has changed and scraping the new information
  • creating mashups
  • saving on-line data to a database
