
PySIG notes, 25-Oct-2007, Kent Johnson and Beautiful Soup sprint

Thirteen people elected to skip watching the second game of the World Series (Go Sox!) to attend the October meeting of the Python Special Interest Group (PySIG), held as usual at the Amoskeag Business Incubator in Manchester, New Hampshire, on the fourth Thursday of the month, 7 PM until… well, 10 PM last night!

The usual slew of announcements was made: PySIG won’t meet on its usual night in November because of the Thanksgiving holiday. A meeting might happen the week after, since the month has five Thursdays. Stay tuned for the official announcement. Other affiliated GNHLUG meetings are posted to http://www.gnhlug.org and all are welcome.

I had proposed a programming challenge to PySIG: following recent discussions on the GNHLUG mailing lists about attendance at meetings, Jim Kuzdrall had suggested we more closely analyze the attendance data that’s been posted to the GNHLUG wiki [1] for the past two years or so. The data is accessible from there, but the HTML format is not too easy to manipulate into an analyze-able format.

Enter BeautifulSoup. BS is a library written in Python that parses HTML, with a lot of tolerance for somewhat malformed HTML, and produces a parse tree that can be traversed, queried, or broken down into its individual elements. Kent S Johnson continued his great Kent’s Korner series with a presentation on the basics of using BeautifulSoup. Kent noted that the documentation on BS is remarkably good, with illustrative examples and exhaustive discussions. BS is in its third major version and continues to be supported by its original author.

After Kent’s Korner, Bill Sconce took the driver’s seat, set up BS on his machine, and we began with the kernel of source Kent had supplied to parse the page. The group participated, suggested, yelled at typos, experimented, threw out code, started over, changed the angle of attack, and successfully produced code that not only parsed the existing page, but generated a comma-separated-value file with proper escaping, thanks to the csv module. Along the way, we discussed issues of character conversion (since BS uses the aptly-named UnicodeDammit module and csv wants ASCII), escaping issues, coding styles, and more.
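For readers who want to try the parse-then-write-CSV idea, here is a minimal sketch. It is not the code written at the meeting: it swaps in the standard library's html.parser for BeautifulSoup so it runs with no extra installs, and the sample table (its columns and its one row) is made up for illustration, not copied from the wiki. The names TableRows, SAMPLE_HTML, and table_to_csv are my own. In Python 3 the csv module accepts Unicode text directly, so the ASCII conversion we wrestled with is not needed here.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up sample; the real wiki table layout is an assumption.
SAMPLE_HTML = """
<table>
<tr><th>Date</th><th>Group</th><th>Topic</th><th>Attendance</th></tr>
<tr><td>25 Oct 2007</td><td>PySIG</td><td>Kent's Korner, "Beautiful Soup"</td><td>13</td></tr>
</table>
"""


def table_to_csv(html):
    """Parse table rows out of html and return them as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    # csv.writer quotes any field containing a comma or a quote character
    # and doubles embedded quotes, which is the "proper escaping" part.
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML), end="")
```

The topic cell contains both a comma and double quotes, so the writer wraps it in quotes and doubles the embedded ones; reading the result back with csv.reader recovers the original text.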

At the end of the presentation, Kent got the projector again to show a somewhat different tack he had used to parse the HTML, with an emphasis on writing small functions to clean each column of the idiosyncrasies found in the data (a “Saturday” in the date field, a date field covering a two-day event, an approximate attendance of ~24, and so forth) and generate some results: which groups had the highest attendance for the year? No one was surprised that Nashua/MerriLUG was #1, but who knew that PySIG was #2? Woo-hoo! We noted that RubySIG was last, but there’s a real sampling problem: they had just started up early in the year, and a couple of attendance figures were missing.
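The one-small-function-per-column approach might look something like the sketch below. This is my own illustration, not Kent's code: the function names, the exact date formats, and the sample rows (with placeholder group names) are all assumptions, and ranking by mean attendance is just one way to read "highest attendance."

```python
import re
from collections import defaultdict

# Matches a full weekday name such as "Saturday".
WEEKDAYS = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"


def clean_date(text):
    """Strip a weekday name, and keep only the first day of a two-day range.

    Assumed input shapes: 'Saturday 13 Oct', '13-14 Oct', '25 Oct'.
    """
    text = re.sub(WEEKDAYS + r",?\s*", "", text, flags=re.IGNORECASE)
    text = re.sub(r"^(\d+)\s*-\s*\d+", r"\1", text.strip())
    return text.strip()


def clean_attendance(text):
    """Return the first number in text as an int ('~24' gives 24), or None."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def mean_attendance(rows):
    """Rank groups by mean attendance, highest first.

    rows is an iterable of (group, raw_attendance) pairs. Missing figures
    are skipped rather than counted as zero.
    """
    counts = defaultdict(list)
    for group, raw in rows:
        n = clean_attendance(raw)
        if n is not None:
            counts[group].append(n)
    means = {group: sum(v) / len(v) for group, v in counts.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder rows, not real GNHLUG figures.
    sample = [("GroupA", "~24"), ("GroupA", "20"),
              ("GroupB", "13"), ("GroupB", "")]
    print(clean_date("Saturday 13 Oct"))
    print(mean_attendance(sample))
```

Skipping missing figures keeps a group from being dragged down by blank cells, but it does nothing about a group with only a handful of meetings, which is the sampling problem noted above.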

To follow up from the meeting, we intended to merge Kent’s improvements into the group’s code and generate some CSV files that we can make available for download from the GNHLUG wiki for all to analyze, graph, visualize and study.

Thanks to Kent for preparing his Beautiful Soup presentation, to Bill Sconce and Alex Hewitt for arranging the meeting, to Bill again for having the patience to type while twelve people tsk’d at him, to the Amoskeag Business Incubator for providing the fine facilities, and to all for attending and vigorously participating in the meeting!

[1] which actually breaks down to:
http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007,
http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2006, and
http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2005

Adding a “skin=print.pattern” eliminates some of the “chrome”
surrounding the content.
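
If you want to build those URLs in a script, a small sketch follows. I am assuming the skin setting goes on as an ordinary query-string parameter; the helper name print_skin_url is mine, and the code only builds the string without fetching anything.

```python
from urllib.parse import urlencode

# Common prefix of the three PastEvents pages listed above.
BASE = "http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents"


def print_skin_url(year):
    """Return the PastEvents URL for year with the print skin requested.

    Assumption: the skin is selected with a ?skin=print.pattern query string.
    """
    query = urlencode({"skin": "print.pattern"})
    return f"{BASE}{year}?{query}"


if __name__ == "__main__":
    for year in (2007, 2006, 2005):
        print(print_skin_url(year))
```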


This work by Ted Roche is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States.