December 02, 2007
Automating web scraping and archiving
I recently needed to scrape the contents of a large number of non-standard HTML pages and output the results in a different format. That meant pulling each page, locating specific DOM elements, and saving the results to a new file. I posed the problem to the A2B3 mailing list and got a series of detailed responses:
Ed V., Mark R., and Brian K. suggested BeautifulSoup, a Python library that did the job quite nicely. The language just makes sense, and the library allowed me to write the script with minimal effort. From the Beautiful Soup website:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
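To give a flavor of those idioms: the sketch below follows the shape of my script, but my actual pages aren't shown here, so the markup and the `entry` class name are made up. It uses the modern bs4 package; the 2007-era import was `from BeautifulSoup import BeautifulSoup`, but the navigation idioms are the same.

```python
# A minimal sketch of the scraping workflow, using the modern
# BeautifulSoup package (bs4). The markup and class names are
# illustrative, not the real pages I scraped.
from bs4 import BeautifulSoup

# Deliberately sloppy markup: unquoted attributes, unclosed <div> tags.
html = """
<html><body>
<div class=entry><h2>First post</h2><p>Hello, world</p>
<div class=entry><h2>Second post</h2><p>Goodbye</p>
</body></html>
"""

# Beautiful Soup builds a parse tree even from this bad markup.
soup = BeautifulSoup(html, "html.parser")

# Locate the DOM elements of interest and collect (title, body) pairs
# that could then be written out to a new file in any other format.
entries = [(div.h2.get_text(), div.p.get_text())
           for div in soup.find_all("div", class_="entry")]
print(entries)
```

Note that `find_all` still returns both entries even though the unclosed tags leave one `div` nested inside the other.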
Several other tools in different languages were forwarded to me, none of which I have yet tried:
DCB recommended Snoopy, a PHP class for automating content retrieval.
JHI recommended Hpricot, a Ruby parser. Mechanize, another library, can fill out and submit forms. A third library, scRUBYt, combines the features of Hpricot and Mechanize. The list of use-cases for scRUBYt is impressive:
- scraping on-line bookstores for price comparison
- monitoring eBay or any other web shop for a specific item and price
- automatically gathering news, events, weather, and similar information
- checking whether a website has changed and scraping the new information
- creating mashups
- saving on-line data to a database
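scRUBYt itself is Ruby, which I haven't tried yet, but the change-detection use-case above is simple enough to sketch in Python with nothing beyond the standard library. This is my own rough version, not scRUBYt's API: compare a hash of the page between fetches.

```python
# A rough sketch of the "has the website changed?" use-case in Python
# rather than Ruby; this is not scRUBYt's API. Comparing a digest of
# the fetched page between runs is the simplest version.
import hashlib

def fingerprint(html: str) -> str:
    """A cheap, stable digest of the page content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def has_changed(previous_html: str, current_html: str) -> bool:
    """True if the page differs from the last fetch."""
    return fingerprint(previous_html) != fingerprint(current_html)

print(has_changed("<p>old</p>", "<p>new</p>"))
```

In practice you would store the fingerprint between runs and only re-scrape when it differs.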
December 2, 2007 07:50 PM