html5charref
Python library for escaping/unescaping HTML5 Named Character References.
The Python 2 standard library includes the HTMLParser module, which can unescape HTML named entities and numeric (Unicode) character escapes. Unfortunately, it does not recognize the named character references added in HTML5. This library aims to fill that gap by escaping and unescaping the character references defined in HTML5.
Installation
This project is still under development, so you should install it via GitHub instead of PyPI:
pip install git+https://github.com/bpabel/html5charref.git
Usage
The main purpose of html5charref is to unescape HTML named entities. It will also handle HTML unicode character escapes.
import html5charref
html = u'This has &copy; and &lt; and &#169; symbols'
print(html5charref.unescape(html))
# u'This has \xa9 and < and \xa9 symbols'
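The unescaping step can be sketched with nothing but the Python 3 standard library. This is an illustration of the general technique, not html5charref's own implementation: the function name `unescape_sketch` is hypothetical, and the lookup table is assumed to be `html.entities.html5` (available since Python 3.3).

```python
import re
from html.entities import html5  # maps names such as 'copy;' to characters

# Matches &name; as well as decimal (&#169;) and hexadecimal (&#xa9;) references.
_CHARREF = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')

def unescape_sketch(text):
    """Replace character references in text; unknown ones are left untouched."""
    def replace(match):
        body = match.group(1)
        if body.startswith('#'):
            digits, base = body[1:], 10
            if digits[0] in 'xX':
                digits, base = digits[1:], 16
            try:
                return chr(int(digits, base))
            except (ValueError, OverflowError):
                return match.group(0)  # code point out of range
        # Named reference: look it up, falling back to the original text.
        return html5.get(body + ';', match.group(0))
    return _CHARREF.sub(replace, text)

print(unescape_sketch('This has &copy; and &lt; and &#169; symbols'))
```

Leaving unknown references untouched keeps the function safe to run on text that contains a literal ampersand followed by ordinary words.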
You can also use html5charref to find the HTML5 named entity for a given unicode character.
import html5charref
# The copyright character
print(html5charref.escape_char(u'\u00a9'))
# u'&copy;'
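A reverse lookup of this kind can be sketched by inverting a name-to-character table. The example below assumes Python 3 and uses `html.entities.html5` as the table. The helper `build_reverse_table` and its tie-breaking rule (shortest name, then lowercase) are illustrative choices, not necessarily what `escape_char` does.

```python
from html.entities import html5  # maps names such as 'copy;' to characters

def build_reverse_table():
    """Map each single character to one '&name;' reference."""
    best = {}
    for name, chars in html5.items():
        # Skip legacy names without a trailing semicolon and multi-character values.
        if not name.endswith(';') or len(chars) != 1:
            continue
        # Prefer the shorter name, then the all-lowercase spelling.
        key = (len(name), name != name.lower(), name)
        if chars not in best or key < best[chars][0]:
            best[chars] = (key, '&' + name)
    return {char: ref for char, (key, ref) in best.items()}

REVERSE = build_reverse_table()
print(REVERSE['\u00a9'])  # &copy;
```

Several names can map to the same character (for example `copy;` and `COPY;`), so any reverse table has to pick one of them deterministically.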
Updating Named Entity References
Additional named entity references may be added to the HTML5 spec. You can update the list maintained by html5charref with the update_charrefs() function, which fetches the latest named entity definitions from the W3C HTML5 site.
import html5charref
html5charref.update_charrefs()
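To give a sense of what such an update involves, here is a minimal sketch that parses the machine-readable entity list published alongside the HTML spec. The URL, the function names `parse_entities` and `fetch_entities`, and the choice of Python 3 are assumptions made for illustration. The README does not say which URL or format update_charrefs() actually uses.

```python
import json
from urllib.request import urlopen

# Assumed source of the definitions (not confirmed by this README).
ENTITIES_URL = 'https://html.spec.whatwg.org/entities.json'

def parse_entities(json_text):
    """Turn entities.json text into a {'name;': characters} mapping."""
    data = json.loads(json_text)
    # Keys look like '&copy;'; drop the leading ampersand.
    return {ref.lstrip('&'): info['characters'] for ref, info in data.items()}

def fetch_entities(url=ENTITIES_URL):
    """Download and parse the current definitions (requires network access)."""
    with urlopen(url) as response:
        return parse_entities(response.read().decode('utf-8'))

# A one-entry sample in the same shape as the published file.
sample = '{"&copy;": {"codepoints": [169], "characters": "\\u00A9"}}'
print(parse_entities(sample))
```

Keeping the parsing separate from the download makes the parsing step easy to check offline against a saved copy of the file.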
Licensing
This project is licensed under the MIT license.
Documentation
View the full documentation.