Excerpt HTML

This distribution provides a single function, excerpt_html, whose purpose is to extract leading portions of HTML text. This is useful, for example, for generating a summary of a blog post from the post body.

excerpt_html(html_text, min_words=50, cut_mark=r'(?i)\s*more\b')

The excerpt_html function expects, as input, HTML text, and returns a shortened version of that HTML text. The truncation point is found in one of two ways:

If an explicit cut-mark — an HTML comment whose text matches cut_mark — is found, the text will be truncated there.
If no explicit cut-mark is found, an attempt will be made to find a suitable implicit truncation point. Only points which are not within in-line markup are considered. The text will be truncated at the first such location found which preserves at least min_words (by default, 50) words of text.

In either case, the returned excerpt will always be a syntactically valid HTML fragment.
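The implicit rule above can be modelled with a toy sketch. This is not the library's implementation: the hypothetical helper blocks_to_keep only sees a list of per-block word counts (an assumption made purely for illustration), whereas the real function works on HTML and must also skip points that fall inside in-line markup.

```python
def blocks_to_keep(block_word_counts, min_words=50):
    """Toy model: how many leading blocks to keep, or None if no break fits.

    block_word_counts holds the word count of each consecutive block.
    The end of the text is not treated as a break, since cutting there
    would not shorten anything.
    """
    kept_words = 0
    # Candidate breaks sit between blocks, so the last block is skipped.
    for index, count in enumerate(block_word_counts[:-1]):
        kept_words += count
        if kept_words >= min_words:
            return index + 1  # number of blocks retained
    return None


print(blocks_to_keep([25, 29]))                 # None
print(blocks_to_keep([25, 29], min_words=10))   # 1
```

With word counts of roughly 25 and 29, as in the two paragraphs of the Example section, the default min_words=50 leaves no usable break, while min_words=10 selects the break after the first paragraph.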

Arguments:

html_text: The input text, a string containing an HTML fragment.
min_words: When finding a block-level truncation point, retain at least this many words of the original text. Pass None to disable block-level truncation.
cut_mark: A regular expression used to find an explicit truncation point. It is matched, using re.match(), against the contents of each HTML comment in html_text. This should be either a compiled regular expression or a string; pass None to disable cut-mark recognition.
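To see which comments the default cut_mark would accept, the pattern can be tried directly with re.match(). The helper is_cut_mark below is hypothetical and not part of the package; it assumes the comment text is passed exactly as it appears between the comment delimiters, including surrounding whitespace.

```python
import re

# Default pattern, copied from the function signature above.
DEFAULT_CUT_MARK = r'(?i)\s*more\b'


def is_cut_mark(comment_text, cut_mark=DEFAULT_CUT_MARK):
    """Hypothetical helper: would this comment text count as a cut-mark?"""
    if cut_mark is None:
        return False  # cut-mark recognition disabled
    # re.match() anchors at the start of the string and accepts either
    # a pattern string or a compiled regular expression.
    return re.match(cut_mark, comment_text) is not None


print(is_cut_mark(' more '))        # True
print(is_cut_mark('MORE'))          # True: (?i) ignores case
print(is_cut_mark(' moreover '))    # False: \b needs a word boundary
print(is_cut_mark(' read more '))   # False: match starts at the beginning
print(is_cut_mark(' cut here ', re.compile(r'\s*cut\b')))  # True
```

Because re.match() anchors only at the start, a comment such as "more: continue reading" would also match the default pattern.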

Returns:

If a truncation point was found, the excerpt is returned as a string containing a syntactically valid HTML fragment.

If no suitable truncation point was found, None is returned.
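Since the result may be None, callers will usually want a fallback. A minimal sketch, assuming that showing the full text is acceptable when no excerpt can be made; excerpt_or_full is a hypothetical wrapper, and the excerpt function is passed in as an argument so that the sketch does not depend on the package being installed.

```python
def excerpt_or_full(html_text, excerpt_func, **kwargs):
    """Return an excerpt of html_text, or html_text itself as a fallback.

    excerpt_func is expected to behave like excerpt_html: it returns a
    string when a truncation point is found and None otherwise.
    """
    excerpt = excerpt_func(html_text, **kwargs)
    if excerpt is None:
        # No suitable truncation point: fall back to the whole text.
        return html_text
    return excerpt
```

With the package installed, this could be called as excerpt_or_full(post_body, excerpt_html, min_words=10).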

Installation

The package is installable via pip.

pip install excerpt-html

Example

Here are two paragraphs' worth of HTML, with an explicit cut-mark in the middle of the first paragraph.

>>> from excerpt_html import excerpt_html

>>> post_body = '''
... <p>
... In a sense, the subject is interpolated into a neotextual
... narrative that includes culture as a paradox.
... <!-- more -->
... A number of deconceptualisms concerning substructural
... construction exist.
... </p>
... <p>
... However, the subject is contextualised into a postmaterial
... discourse that includes sexuality as a totality. Sontag uses
... the term ‘cultural narrative’ to denote not, in fact,
... deconstruction, but predeconstruction.
... </p>'''

By default, the text will be truncated at the cut mark:

>>> summary = excerpt_html(post_body)
>>> print(summary)
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
</p>

If we disable cut_mark recognition, there is no suitable implicit truncation point which will preserve at least 50 words (the default value of min_words):

>>> summary = excerpt_html(post_body, cut_mark=None)

>>> summary is None
True

If we specify a lower value for min_words, the break between paragraphs will be selected as the truncation point:

>>> summary = excerpt_html(post_body, min_words=10, cut_mark=None)

>>> print(summary)          # doctest: +NORMALIZE_WHITESPACE
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
<!-- more -->
A number of deconceptualisms concerning substructural
construction exist.
</p>

Links

Development takes place at GitHub. Releases may be downloaded from PyPI.

Author

Jeff Dairiki <dairiki@dairiki.org>

excerpt-html
Release 0.2.0
