Consider the following:

from lxml import etree
from StringIO import StringIO

x = """<?xml version="1.0" encoding="utf-8"?>\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)

This would fail with:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 2, column 11

This is because resolve_entities=False doesn't make the parser ignore entities; it only stops the parser from resolving them, so an undeclared entity is still an error.

If I use etree.HTMLParser instead, it creates html and body tags, plus a lot of other special handling it tries to do for HTML.

What's the best way to get a &nbsp;&acirc; text child under the aa tag with lxml?


You can't ignore entities; they are part of the XML definition. A document that references an entity other than the five predefined ones (&amp;, &lt;, &gt;, &apos;, &quot;) is not well-formed if it has no DTD or declares standalone="yes", because nothing declares that entity. One workaround is to lie and claim your document is HTML.

https://mailman-mail5.webfaction.com/pipermail/lxml/2008-February/003398.html

You can try lying and putting an XHTML DOCTYPE on your document, e.g.:

from lxml import etree
from StringIO import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
etree.tostring(r) # '<aa>&nbsp;&acirc;</aa>'
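
If it is acceptable for the entities to be resolved into characters rather than kept as references, another route is to declare them yourself in an internal DTD subset, which makes the document well-formed without borrowing the XHTML DTD. The following is only a sketch under that assumption: `with_entity_declarations` is a hypothetical helper, and the standard library's ElementTree is used here so the example runs without third-party packages. It is written for Python 3, unlike the Python 2 snippets above.

```python
import xml.etree.ElementTree as ET


def with_entity_declarations(root_name, entities, body):
    """Prefix body with a DOCTYPE whose internal subset declares each entity.

    Hypothetical helper for illustration. `entities` maps an entity name to
    the code point of the character it should expand to.
    """
    declarations = "".join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(entities.items())
    )
    return "<!DOCTYPE %s [%s]>\n%s" % (root_name, declarations, body)


doc = with_entity_declarations(
    "aa", {"nbsp": 0xA0, "acirc": 0xE2}, "<aa>&nbsp;&acirc;</aa>"
)
root = ET.fromstring(doc)
print(repr(root.text))  # the two entities arrive already expanded
```

Note that this does not give a literal &nbsp;&acirc; text child: the parser expands the declared entities to U+00A0 and U+00E2.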

@Alex is right: your document is not well-formed XML, so XML parsers will not parse it. One option is to pre-process the text of the document, replacing the undeclared entities with the UTF-8 encoding of the characters they stand for:

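# Map each undeclared entity to the character it stands for.
entities = [
    ('&nbsp;', u'\u00a0'),
    ('&acirc;', u'\u00e2'),
    # ... one pair for every other entity that appears in the input
    ]

# x is a Python 2 byte string, so insert the UTF-8 encoding of each character.
for before, after in entities:
    x = x.replace(before, after.encode('utf8'))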

Of course, this can be broken by sufficiently weird "xml" also.
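
To avoid listing every entity by hand, the same replacement can be driven by the standard library's table of HTML entity names. This is a minimal sketch in Python 3, not part of the original answer; `replace_html_entities` is a hypothetical name, and the function works on text (str), so encode the result to bytes before parsing if the document carries an encoding declaration.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser already understands; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# A named reference such as &nbsp; (numeric references like &#160; never match).
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(text):
    """Replace named HTML entities with the characters they denote."""

    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined and unknown names untouched
        return chr(name2codepoint[name])

    return _NAMED_ENTITY.sub(substitute, text)


x = "<aa>&nbsp;&acirc; &amp; &bogus;</aa>"
print(repr(replace_html_entities(x)))
```

Like the loop above, this is plain text substitution, so it also rewrites entity-like text inside comments and CDATA sections.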

Your best bet is to fix your input XML documents to be well-formed XML.


When I was trying to do something similar, I just used x.replace('&', '&amp;') before parsing the string.

For this to work, the encoding must be one in which the & character denotes only itself (UTF-8 is one such encoding, as (…) ASCII bytes do not occur when encoding non-ASCII code points into UTF-8). However, it still won't work as written, because you have to preserve the & character in already existing XML escape sequences like &amp;. In addition, this will mangle plain & characters in comments. – Piotr Dobrogost Jun 6 at 5:55
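
One way to address the first objection is to escape only those ampersands that do not already begin a predefined XML entity or a numeric character reference. The sketch below assumes Python 3 and uses a hypothetical helper name, `escape_bare_ampersands`; the standard library's ElementTree stands in for the parser so that the example runs without third-party packages.

```python
import re
import xml.etree.ElementTree as ET

# Match "&" unless it already starts a predefined XML entity or a numeric
# character reference such as &#160; or &#xA0;.
_BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|apos|quot|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Escape only the ampersands that do not begin a valid XML reference."""
    return _BARE_AMP.sub("&amp;", text)


x = "<aa>&nbsp;&acirc; &amp; &#160;</aa>"
escaped = escape_bare_ampersands(x)
root = ET.fromstring(escaped)
print(escaped)
print(repr(root.text))
```

The second objection still applies: the substitution is not markup-aware, so ampersands inside comments and CDATA sections are rewritten as well.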
