I’m in the process of learning how to use lxml, a fast, powerful XML parser for Python that relies on libxml2. I don’t know much about the details of XML, so I got stuck as soon as I got started.
I was passing Markdown-generated XHTML straight into lxml’s various XML parser methods and objects. All of them were dying with the same error. The only parser that worked was the HTML parser.
The XML parsers kept coughing up the following error:
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 1, column 222
I really had no clue as to what this might mean. Since the output was from Markdown, I reasoned there was little likelihood of malformed XHTML or some such. Besides, I knew that it rendered fine on web pages. Likely, there was some significant detail I was missing. Unfortunately, my lack of XML knowledge meant I didn’t have anything to fall back on for solving the problem.
Finally, I turned up a comment thread where I learned that XML documents require a “root element.” I had seen the term “root” but only had a vague notion of what it meant. Now I know exactly what is meant: a well-formed XML document has a single top-level element that contains everything else. Markdown output is a run of sibling elements with nothing wrapping them, so the parser treats everything after the first element as extra content.
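For anyone who wants to see the failure in isolation, here is a minimal sketch. The fragment string is a made-up stand-in for my Markdown output, and the fallback to the standard library's ElementTree is my own addition so the snippet still runs where lxml isn't installed; both parsers reject a document with more than one top-level element, though the wording of the error differs.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser is an acceptable stand-in here.
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for Markdown output: two sibling top-level elements.
fragment = "<p>First paragraph.</p><p>Second paragraph.</p>"

try:
    etree.fromstring(fragment)
    parsed = True
except SyntaxError as err:
    # lxml's XMLSyntaxError and ElementTree's ParseError both subclass SyntaxError.
    parsed = False
    print("parse failed:", err)
```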
Prior to passing the markdown string to the parser, I performed the following operation:
rootedhtml = "<post>%s</post>" % html
Here, html was the Markdown output. I then passed rootedhtml to the parser, and it no longer choked.
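The same idea can be wrapped up as a small helper. This is only a sketch: parse_fragment, its root_tag parameter, and the sample html string are names I made up for illustration, and the ElementTree fallback is again just there so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library parser is an acceptable stand-in here.
    import xml.etree.ElementTree as etree


def parse_fragment(html, root_tag="post"):
    """Wrap a rootless fragment in a single root element, then parse it."""
    rooted = "<%s>%s</%s>" % (root_tag, html, root_tag)
    return etree.fromstring(rooted)


# Hypothetical Markdown output with two top-level elements.
html = "<h1>Title</h1><p>Body text.</p>"
root = parse_fragment(html)

# The original elements are now children of the wrapper.
tags = [child.tag for child in root]
print(root.tag, tags)  # post ['h1', 'p']
```

The wrapper tag name is arbitrary; all that matters is that there is exactly one element at the top level.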
Now I can get back to solving my original problem.
2 replies on “Learned Something New”
Thanks for this, it’s saved me a lot of time figuring out the same problem.
Glad my frustrations were of use to someone else.