unfluff

HTML content extraction - remove the fluff


Keywords: html content extraction
License: BSD-3-Clause
Install: pip install unfluff==0.2

Documentation

Unfluff

A statistical content extraction tool written in Python. It removes the useless fluff from arbitrary HTML pages.

Based on methods discussed (and implemented) in various places, but most directly:

An experiment / work in progress.

Usage:

The command-line tool takes either a file path or, with -u, a URL to extract from. It prints the content tree to stdout:

unfluff /path/to/something.html

or

unfluff -u 'http://some-website.com/interesting-article.html'

The unfluff library has a few functions, which all do the same thing and differ only in where the input comes from (a URL, a file, or a string):

import unfluff
unfluff.from_url('http://whatever/')
unfluff.from_file('/tmp/input.html')
unfluff.from_string("<html>inline content</html>")
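
Since the three entry points differ only in the kind of input they accept, a thin wrapper can choose between them. What follows is a minimal sketch and not part of unfluff: pick_loader and extract are hypothetical helper names, and the rule used to tell markup, URLs and file paths apart is an assumption. The result is passed through untouched, because the exact return type is not described here.

```python
from urllib.parse import urlparse


def pick_loader(source):
    """Return the name of the unfluff function suited to `source`."""
    text = source.lstrip()
    if text.startswith("<"):
        # Assumption: anything starting with "<" is inline markup.
        return "from_string"
    if urlparse(text).scheme in ("http", "https"):
        return "from_url"
    # Everything else is treated as a filesystem path.
    return "from_file"


def extract(source, lib=None):
    """Dispatch `source` to the matching unfluff function.

    `lib` exists only so a stand-in object can be supplied when unfluff
    is not installed; by default the real library is imported.
    """
    if lib is None:
        import unfluff as lib
    return getattr(lib, pick_loader(source))(source)
```

With unfluff installed, extract('/tmp/input.html') would then be equivalent to calling unfluff.from_file('/tmp/input.html') directly.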

Requirements:

  • lxml (fancy HTML parsing)
  • scipy (fancy maths)

Both of these are native (C) extensions, which means you're best off looking for them in your friendly neighborhood package manager.
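
Before installing, it can help to check whether both dependencies are already importable. A minimal sketch using only the standard library; missing_requirements is a hypothetical helper name, not something unfluff provides:

```python
import importlib.util


def missing_requirements(names=("lxml", "scipy")):
    """Return the entries of `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_requirements()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("lxml and scipy are both available.")
```

Anything reported as missing is a candidate for installing through the package manager rather than building from source.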

Licence:

BSD (3-clause)