ElementTree on the come-up

I had a very small number of complaints related to basing Kid on ElementTree. These came in two forms:

  1. SAX and DOM are “standard” and while ElementTree is a drastically improved system for processing XML in Python, it doesn’t matter because everyone already knows SAX/DOM.

  2. “libxml2 is teh rawk!”

First, if Python’s W3C DOM standard-based xml.dom package were a movie, it would be called Elf, starring xml.dom. It’s the episode of Little House on the Prairie where Alien asks Michael Landon for permission to marry his daughter. It does not belong here!

Next, in terms of pythonicness, libxml2 is almost worse than xml.dom but you at least get something for it: they don’t even have a word to describe this level of “fast” and it comes along with XPath, RelaxNG, XSD, XML Base, XInclude, and XSLT. My issue with libxml2 is just that it’s a bad dependency for a project like Kid that wants to be able to run on cheap web space with bare-bones Python support. A lot of hosting providers aren’t going to have libxml2 installed or give you the option of compiling it from source.

I went with ElementTree because it’s simple, pythonic, and fast enough. I also had a feeling we’d be seeing more development around ElementTree, which brings us nicely to why I’m posting.
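To give a rough idea of what “simple and pythonic” means here, a minimal sketch follows. It assumes a Python version where ElementTree ships in the standard library as xml.etree.ElementTree (it was a separate download when this was written), and the feed/entry document is made up purely for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up document, just for illustration.
doc = ET.fromstring(
    "<feed>"
    "<entry id='1'><title>First</title></entry>"
    "<entry id='2'><title>Second</title></entry>"
    "</feed>"
)

# An element iterates over its children like a list...
titles = [entry.findtext("title") for entry in doc]

# ...and exposes its attributes through a dict-like get().
ids = [entry.get("id") for entry in doc.findall("entry")]

print(titles, ids)
```

No node types, no factories, no getElementsByTagName: elements act like lists of children with a dictionary of attributes attached.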

Fredrik Lundh announced cElementTree, a C implementation of his ElementTree XML processing library for Python. The initial numbers coming out of effland look excellent:

library                           time      space
xml.dom.minidom (Python 2.1)      6.3 s     80000k
xml.dom.minidom (Python 2.4)      1.4 s     53000k
ElementTree 1.2                   1.6 s     14500k
ElementTree 1.2.1/1.3             1.1 s     14500k
PyRXPU (C extension)              0.22 s    11500k
cElementTree 0.8 (C extension)    0.058 s   5700k
readlines (read as text)          0.032 s   5050k

This comes on the heels of a well-hidden announcement by Martijn Faassen on the lxml mailing list Saturday:

“The lxml.etree implementation of ElementTree, on top of libxml2, is getting there now. It features automatic memory management and quite a bit of ElementTree compatibility. Not all of the ElementTree API has been implemented yet, but enough for many use cases.”
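The practical upshot of several implementations sharing one API is that code can pick whichever is available at import time. The sketch below assumes lxml is importable under its usual name when installed, falls back to the standard library's xml.etree.ElementTree otherwise, and only touches a small subset of the API that both are assumed to support; the page/title document is hypothetical.

```python
try:
    # libxml2-backed implementation, if it happens to be installed.
    from lxml import etree as ET
except ImportError:
    # Fallback that needs nothing beyond a stock Python install.
    import xml.etree.ElementTree as ET

# Build a tiny tree, serialize it, and parse it back,
# using only calls common to both implementations.
root = ET.Element("page")
ET.SubElement(root, "title").text = "Hello"

serialized = ET.tostring(root)
reparsed = ET.fromstring(serialized)
title = reparsed.findtext("title")

print(title)
```

Code written this way runs on bare-bones hosting and picks up the faster backend wherever one is available.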

As everyone is already quite aware, libxml2 is fast. But as I mentioned, the Python bindings that ship with libxml2 are painful; many a hacker has been seduced by its performance only to be bitten later by monsters growing out of the large impedance mismatch it creates with the rest of your Python code.

This is all really great news, of course, but now there are questions to be asked and work to be done: