December 12, 2007
University of Michigan IT jobs mailing list
Ronald D. Loveless and Ruth A write:
I speak on behalf of the IT Commons Stewards. We have created a joinable e-mail group titled firstname.lastname@example.org. The purpose of this e-mail group is to advertise IT job opportunities across the UM campus. Posting to this e-mail group is optional and in addition to what units post to the eMploy site.
The IT Commons stewards believe that we should take advantage of the communication network here at the University of Michigan to better broadcast IT job opportunities. The eMploy system is useful for those actively seeking employment. But it is not so useful for "getting the word" out about IT job opportunities that occur across the campus. Additionally, many of us are connected to various IT professional associations and contacts outside of the UM. It is anticipated that some of you may forward these job postings, thus promoting the recruitment potential for existing job opportunities here at Michigan.
I hope you will find this a useful way to become aware of IT job opportunities at Michigan, or to further advertise IT job opportunities that arise in your unit.
To subscribe, go to the directory entry and click "Join".
December 02, 2007
Automating web scraping and archiving
I recently needed to scrape the contents of a large number of non-standard HTML pages and output the results in a different format. That requires pulling each page, locating specific DOM elements, and saving the results to a new file. I posed the problem to the A2B3 mailing list and got this series of detailed responses:
Ed V., Mark R., and Brian K. suggested BeautifulSoup, a Python library that did the job quite nicely. The language just makes sense, and the library allowed me to write the script with minimal effort. From the Soup website:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
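To make the workflow concrete (load a page, locate specific DOM elements, save the results in a new format), here is a minimal sketch. It is not the script I actually wrote. It assumes the current bs4 package (imported as `from bs4 import BeautifulSoup`) rather than the older release that was current when this was posted, and the sample markup, the `content` id, the JSON output format, and the function names `extract` and `convert` are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup  # assumption: the modern bs4 package is installed

# Hypothetical page with sloppy markup: unquoted attribute, unclosed <p> and
# <li> tags, and no closing </body> or </html>.
SAMPLE_HTML = """
<html><body>
<div id=content>
<h1>Quarterly Report</h1>
<p>Intro text
<ul>
<li><a href="/a.html">Page A</a>
<li><a href="/b.html">Page B</a>
</ul>
</div>
"""


def extract(html):
    """Pull the heading and links out of the (hypothetical) content div."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="content")
    if content is None:
        raise ValueError("no element with id='content' found")
    heading = content.find("h1")
    return {
        "title": heading.get_text(strip=True) if heading else None,
        "links": [
            {"text": a.get_text(strip=True), "href": a["href"]}
            for a in content.find_all("a", href=True)
        ],
    }


def convert(html, out_path):
    """Extract the record and save it as JSON, the 'different format' here."""
    record = extract(html)
    Path(out_path).write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record


if __name__ == "__main__":
    # Write to a temporary directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        out_file = Path(tmp) / "report.json"
        convert(SAMPLE_HTML, out_file)
        print(out_file.read_text(encoding="utf-8"))
```

For a real batch, the same `convert` call would sit inside a loop over pages fetched from disk or over HTTP; fetching is left out here so the sketch runs without network access.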
Several other tools in different languages were forwarded to me, none of which I have yet tried:
DCB recommended Snoopy, a PHP class for automating content retrieval.
JHI recommended Hpricot, a Ruby parser. Mechanize, another library, can fill out and submit forms. A third library, scRUBYt, combines the features of Hpricot and Mechanize. The list of use-cases for scRUBYt is impressive:
- scraping on-line bookstores for price comparison
- monitoring eBay or any other web shop for a specific item and price
- automatically gathering news, events, weather etc. information
- checking if a website has changed and scraping the new information
- creating mashups
- saving on-line data to a database