ElementTree on the come-up
I received a very small number of complaints about basing Kid on ElementTree. They came in two forms:
- SAX and DOM are “standard” and while ElementTree is a drastically improved system for processing XML in Python, it doesn’t matter because everyone already knows SAX/DOM.
- “libxml2 is teh rawk!”
First, if Python’s W3C DOM standards-based xml.dom package were a movie, it
would be called Elf, starring xml.dom. It’s the episode of Little House
on the Prairie where Alien asks Michael Landon for permission to
marry his daughter. It does not belong here!
Next, in terms of pythonicness, libxml2 is almost worse than xml.dom but you
at least get something for it: they don’t even have a word to describe this
level of “fast” and it comes along with XPath, RelaxNG, XSD, XML-Base,
XInclude, and XSLT. My issue with libxml2 is just that it’s a bad dependency
for a project like Kid that wants to be able to run on cheap web space with
bare-bones Python support. There are a lot of hosting providers that aren’t
going to have libxml2 or the option of compiling from source.
I went with ElementTree because it’s simple, pythonic, and fast enough. I also had a feeling we’d be seeing more development around ElementTree, which brings us nicely to why I’m posting.
Fredrik Lundh announced cElementTree, a C implementation of his ElementTree XML processing library for Python. The initial numbers coming out of effland look excellent:
| library | time | space |
|---|---|---|
| xml.dom.minidom (Python 2.1) | 6.3 s | 80000k |
| xml.dom.minidom (Python 2.4) | 1.4 s | 53000k |
| ElementTree 1.2 | 1.6 s | 14500k |
| ElementTree 1.2.1/1.3 | 1.1 s | 14500k |
| PyRXPU (C extension) | 0.22 s | 11500k |
| cElementTree 0.8 (C extension) | 0.058 s | 5700k |
| readlines (read as text) | 0.032 s | 5050k |
This comes on the heels of a well-hidden announcement by Martijn Faassen on the lxml mailing list Saturday:
> The lxml.etree implementation of ElementTree, on top of libxml2, is getting there now. It features automatic memory management and quite a bit of ElementTree compatibility. Not all of the ElementTree API has been implemented yet, but enough for many use cases.
As everyone is already quite aware, libxml2 is fast. But as I mentioned, the Python bindings that ship with libxml2 are painful; many a hacker has been seduced by its performance only to be bitten later by monsters growing out of the large impedance mismatch it creates with the rest of your Python code.
This is all really great news, of course, but now there are questions to be asked and work to be done:
- Will Fredrik and others collaborate to create a compatibility definition for these different ElementTree implementations? I’d like to see a definition of a mandatory ElementTree API. Ideally, whether to use cElementTree, lxml.etree, or ElementTree proper would be a decision based on what was available in a given environment, not a decision made when coding.
- I’d like to see libxml2 added to Fredrik’s comparison table. (Fredrik: ping)
- At some point in the future (Python 3000?), I’d like to see ElementTree or its equivalent rolled into the core library. This seems unlikely though, as I don’t think XML-SIG or the greater Python community wants the Python/XML waters any murkier. I partially agree, but the number of people looking outside of core Python’s XML support for functionality it is supposed to provide says that it isn’t getting the job done.
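To make the first question above a bit more concrete, here is a rough sketch of what "decide by environment, not at coding time" could look like: try each implementation in turn and use the first one that imports. The helper name `load_etree` and the preference order are my own assumptions for illustration. The module names `cElementTree` and `elementtree.ElementTree` are the standalone distributions, `lxml.etree` is the libxml2-backed one, and `xml.etree.ElementTree` is included as a fallback for Pythons that bundle ElementTree.

```python
import importlib

# Candidate modules, most preferred first. The ordering is an assumption
# for illustration; pick whatever suits your deployment.
CANDIDATES = (
    "lxml.etree",               # libxml2-backed implementation
    "cElementTree",             # standalone C implementation
    "elementtree.ElementTree",  # standalone pure-Python implementation
    "xml.etree.ElementTree",    # bundled with later Python releases
)


def load_etree(candidates=CANDIDATES):
    """Return the first importable ElementTree-style module.

    Hypothetical helper: it only checks that the module imports, not that
    it implements any particular subset of the ElementTree API.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed here; try the next candidate
    raise ImportError("no ElementTree implementation found")


etree = load_etree()

# Code below this point sticks to calls the implementations share.
root = etree.fromstring("<doc><item id='1'>hello</item></doc>")
print(etree.__name__, root.find("item").text)
```

This only holds up as long as the calling code stays inside whatever subset of the API the implementations really have in common, which is exactly what a compatibility definition would pin down.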