What's the best way to handle &nbsp;-like entities in XML documents with lxml?

Question

Consider the following:

from lxml import etree
from StringIO import StringIO

x = """<?xml version="1.0" encoding="utf-8"?>\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)

This would fail with:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 2, column 11

This is because resolve_entities=False doesn't make the parser ignore entities; it only stops the parser from expanding them, so an undeclared entity is still an error.

If I use etree.HTMLParser instead, it creates html and body tags, plus a lot of other special handling it tries to do for HTML.

What's the best way to get a text child under the aa tag that holds the characters for &nbsp;&acirc; (a non-breaking space followed by â) with lxml?

twasbrillig · Accepted Answer · 2014-11-13 07:44:52Z

You can't ignore entities as they are part of the XML definition. Your document is not well-formed if it doesn't have a DTD or standalone="yes" or if it includes entities without an entity definition in the DTD. Lie and claim your document is HTML.

https://mailman-mail5.webfaction.com/pipermail/lxml/2008-February/003398.html

You can try lying and putting an XHTML DTD on your document. e.g.

from lxml import etree
from StringIO import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
etree.tostring(r) # '<aa>&nbsp;&acirc;</aa>'
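
A related option, not taken from the answer above, is to declare the needed entities yourself in an internal DTD subset rather than pointing at the XHTML DTD. The sketch below assumes Python 3, an input with no DOCTYPE or XML declaration of its own, and that only nbsp and acirc occur; the values 160 and 226 are the code points HTML assigns to those two names. The fallback import is there only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Internal DTD subset declaring only the entities this document uses.
# Assumption: the input has no DOCTYPE and no XML declaration of its own.
DOCTYPE = (
    b'<!DOCTYPE aa [\n'
    b'  <!ENTITY nbsp "&#160;">\n'
    b'  <!ENTITY acirc "&#226;">\n'
    b']>\n'
)

body = b'<aa>&nbsp;&acirc;</aa>'

# With the default parser settings the declared entities are expanded,
# so the element ends up with an ordinary text child.
root = etree.fromstring(DOCTYPE + body)
text = root.text
print(repr(text))
```

Unlike the resolve_entities=False example, this yields plain text rather than entity reference nodes, which may or may not be what you want.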

jedwards · Answer 2 · 2013-06-07 02:26:39Z

@Alex is right: your document is not well-formed XML, and so XML parsers will not parse it. One option is to pre-process the text of the document to replace bogus entities with their utf-8 characters:

entities = [
    ('&nbsp;', u'\u00a0'),
    ('&acirc;', u'\u00e2'),
    ...
    ]

for before, after in entities:
    x = x.replace(before, after.encode('utf8'))

Of course, this can be broken by sufficiently weird "xml" also.

Your best bet is to fix your input XML documents to be well-formed XML.
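
If fixing the source is not possible, the replacement table above can be generalized. The following is a sketch under assumptions rather than the answerer's code: it targets Python 3, uses the standard library's html.entities.name2codepoint table, and rewrites HTML entity names as numeric character references, which every XML parser accepts. The function name html_entities_to_char_refs is made up for illustration. It still scans the raw text, so it will also rewrite matches inside comments and CDATA sections.

```python
import re
from html.entities import name2codepoint

try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# The five entities XML itself predefines; these must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_char_refs(text):
    """Rewrite known HTML named entities as numeric character references."""
    def _replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep XML entities and unknown names as-is
        return "&#{};".format(name2codepoint[name])
    return _ENTITY_RE.sub(_replace, text)


x = "<aa>&nbsp;&acirc; &amp; &lt;</aa>"
fixed = html_entities_to_char_refs(x)
root = etree.fromstring(fixed)
print(fixed)
print(repr(root.text))
```

Unknown names are deliberately left untouched so that the parser still reports them instead of silently losing data.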

Michael Buckley · Answer 3 · 2015-03-29 16:51:59Z

When I was trying to do something similar, I just used x.replace('&', '&amp;') before parsing the string.

For this to work, the encoding must be one in which the & character denotes only itself (UTF-8 is one such encoding as (…) ASCII bytes do not occur when encoding non-ASCII code points into UTF-8). However, it still won't work as written, because you have to preserve the & character in already existing XML escape sequences like &amp;. In addition, this will mangle plain & characters in comments. – Piotr Dobrogost Jun 6 at 5:55
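
One way to address the first objection in the comment above is to escape only those ampersands that do not already start a valid XML escape. This is a minimal sketch assuming Python 3; the helper name escape_unknown_entities is hypothetical. It does not solve the second objection: ampersands inside comments and CDATA sections are still rewritten.

```python
import re

try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Match '&' unless it begins one of XML's five predefined entities
# or a decimal/hexadecimal character reference.
_BARE_AMP = re.compile(
    r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_unknown_entities(text):
    """Turn every '&' that is not a valid XML escape into '&amp;'."""
    return _BARE_AMP.sub("&amp;", text)


x = "<aa>&nbsp;&acirc; &amp; &#160;</aa>"
escaped = escape_unknown_entities(x)
root = etree.fromstring(escaped)
print(escaped)
print(repr(root.text))
```

Note that after this step the text child contains the literal characters "&nbsp;", not a non-breaking space, so the entity names survive parsing but are not interpreted.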

asked	5 years ago
viewed	10340 times
active	1 year ago
