html5charref

Python library for escaping/unescaping HTML5 Named Character References.


License
MIT
Install
pip install html5charref==0.1.0


The Python standard library includes the HTMLParser module, which can unescape HTML named entities and numeric (Unicode) character escapes. Unfortunately, it does not recognize the named character references added in HTML5. This library aims to fill that gap by escaping and unescaping the character references defined in HTML5.
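The gap can be illustrated with the entity tables that ship with Python 3, used here as a stand-in for the Python 2 HTMLParser module mentioned above. This is only a minimal sketch: `html.entities.name2codepoint` holds the HTML 4 names and `html.entities.html5` holds the HTML5 names, and neither table is part of html5charref.

```python
from html.entities import html5, name2codepoint

# &copy; is an HTML 4 name, so the older table knows it.
copy_in_html4 = "copy" in name2codepoint

# &check; (U+2713) is an HTML5 name with no HTML 4 equivalent.
check_in_html4 = "check" in name2codepoint
check_in_html5 = "check;" in html5

print(copy_in_html4, check_in_html4, check_in_html5)
```

A parser that only consults the HTML 4 table would leave a reference such as `&check;` unresolved.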

Installation

This project is still under development, so you should install it via GitHub instead of PyPI:

pip install git+https://github.com/bpabel/html5charref.git

Usage

The main purpose of html5charref is to unescape HTML named entities. It will also handle HTML unicode character escapes.

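import html5charref

# Two named references (&copy;, &lt;) and one numeric reference (&#169;)
html = u'This has &copy; and &lt; and &#169; symbols'
print(html5charref.unescape(html))
# u'This has \xa9 and < and \xa9 symbols'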
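To show what unescaping involves, here is a rough sketch built on Python 3's `html.entities.html5` table. It is not html5charref's implementation; the function name `unescape_sketch` and the regular expression are assumptions made for illustration, and only references terminated by a semicolon are handled.

```python
import re
from html.entities import html5

# Matches &name;, &#169; and &#xa9; style references.
_CHARREF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_sketch(text):
    """Replace named and numeric character references with their characters."""
    def replace(match):
        body = match.group(1)
        if body[0] == "#":
            # Numeric reference: decimal (&#169;) or hexadecimal (&#xa9;).
            is_hex = body[1] in "xX"
            code = int(body[2:], 16) if is_hex else int(body[1:])
            if code > 0x10FFFF:
                return match.group(0)  # outside Unicode, leave as-is
            return chr(code)
        # Named reference: unknown names are left untouched.
        return html5.get(body + ";", match.group(0))

    return _CHARREF.sub(replace, text)


print(unescape_sketch("This has &copy; and &lt; and &#169; symbols"))
```

Leaving unknown names in place is a deliberate choice in this sketch, so that text which merely contains an ampersand is not damaged.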

You can also use html5charref to find the HTML5 named entity for a given unicode character.

import html5charref

# The copyright character
print(html5charref.escape_char(u'\u00a9'))
# u'&copy;'
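A reverse lookup like this can be sketched by inverting an entity table. The example below uses Python 3's `html.entities.html5` rather than html5charref's own data, and the names `build_reverse_table` and `escape_char_sketch` are hypothetical. Because several names can map to the same character (for example `copy;` and `COPY;`), the tie-breaking rule here is an assumption and the library may choose differently.

```python
from html.entities import html5


def build_reverse_table():
    """Map each character sequence to one preferred HTML5 entity name."""
    best = {}
    for name, chars in html5.items():
        if not name.endswith(";"):
            continue  # skip legacy names without the trailing semicolon
        # Prefer shorter names, then all-lowercase names, then alphabetical order.
        key = (len(name), name != name.lower(), name)
        if chars not in best or key < best[chars]:
            best[chars] = key
    return {chars: key[2] for chars, key in best.items()}


_REVERSE = build_reverse_table()


def escape_char_sketch(char):
    """Return '&name;' for a character, or the character itself if it has no name."""
    name = _REVERSE.get(char)
    return "&" + name if name else char


print(escape_char_sketch("\u00a9"))
```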

Updating Named Entity References

Additional named entity references may be added to the HTML5 spec in the future. You can refresh the list maintained by html5charref with the update_charrefs() function, which queries the latest named entity definitions from the W3C HTML5 site.

import html5charref
html5charref.update_charrefs()
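For readers curious what such an update might involve, the sketch below parses the JSON format in which the named character references are published. It is not the library's code: the URL shown is the WHATWG copy of `entities.json` and is an assumption (the library may query a different address), and `parse_entities` and `fetch_entities` are hypothetical names.

```python
import json
from urllib.request import urlopen

# Assumed source of the definitions; the library may use a different URL.
ENTITIES_URL = "https://html.spec.whatwg.org/entities.json"


def parse_entities(payload):
    """Turn entities.json text into a table mapping names to characters."""
    data = json.loads(payload)
    # Keys look like "&copy;"; drop the leading ampersand.
    return {ref[1:]: info["characters"] for ref, info in data.items()}


def fetch_entities(url=ENTITIES_URL):
    """Download and parse the current definitions (requires network access)."""
    with urlopen(url) as response:
        return parse_entities(response.read().decode("utf-8"))


# A one-entry sample in the published format, so no network call is needed here.
sample = '{"&copy;": {"codepoints": [169], "characters": "\\u00A9"}}'
print(parse_entities(sample))
```

Keeping the parsing separate from the download makes the table-building step easy to check offline.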

Licensing

This project is licensed under the MIT license.

Documentation

View the full documentation.