A statistical content extraction tool written in Python - remove the useless fluff from arbitrary HTML pages.
Based on methods discussed (and implemented) in various places, but most directly:
- http://www2003.org/cdrom/papers/refereed/p583/p583-gupta.html
An experiment / work in progress.
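To give a rough feel for what "statistical" extraction can mean, here is a minimal sketch of one common heuristic: drop blocks whose text is mostly link text (navigation menus, link lists). This is an illustration only and is not taken from unfluff's source. The names `BlockCollector` and `low_link_density_blocks`, the set of block tags, and the 0.5 threshold are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "ul", "ol", "table"}
# Tags whose text content is never shown to the reader.
SKIP_TAGS = {"script", "style"}


def visible_chars(text):
    """Count non-whitespace characters."""
    return sum(1 for c in text if not c.isspace())


class BlockCollector(HTMLParser):
    """Splits a page into flat text blocks and counts link characters in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # (text, total_chars, link_chars) per block
        self._parts = []
        self._link_chars = 0
        self._in_link = 0
        self._skipping = 0

    def flush(self):
        # Close the current block and record it if it has any visible text.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, visible_chars(text), self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skipping += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skipping:
            self._skipping -= 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._skipping:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += visible_chars(data)


def low_link_density_blocks(html, max_ratio=0.5):
    """Return the text of blocks where at most max_ratio of the text is links."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return [
        text
        for text, total, linked in collector.blocks
        if linked / total <= max_ratio
    ]


sample = (
    '<div><ul><li><a href="/">Home</a></li>'
    '<li><a href="/about">About</a></li></ul>'
    '<p>This is the actual article text, with one '
    '<a href="/x">link</a> in it.</p></div>'
)
kept = low_link_density_blocks(sample)
```

In this sample the two menu items consist entirely of link text and are dropped, while the paragraph survives because only a small share of its characters sit inside a link.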
The command line tool can either take a file or a URL to extract. It prints the content tree to stdout:
unfluff -u 'http://some-website.com/interesting-article.html'
The unfluff library has a few functions, which all do pretty much the same thing for different kinds of input:
import unfluff
unfluff.from_url('http://whatever/')
unfluff.from_file('/tmp/input.html')
unfluff.from_string("<html>inline content</html>")
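One way to get "the same thing via different formats" is to have the file and URL entry points read their input and hand it to a single string-based function. The sketch below shows that pattern with the standard library only. It is not unfluff's code: the `sketch_*` names are hypothetical, and the stand-in extractor just strips tags rather than doing any real content extraction.

```python
import urllib.request
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Stand-in extractor: keeps text, drops tags. Not unfluff's algorithm."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def sketch_from_string(html):
    """All the real work happens here; the other entry points delegate to it."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def sketch_from_file(path, encoding="utf-8"):
    """Read a local file and delegate to the string version."""
    with open(path, encoding=encoding) as f:
        return sketch_from_string(f.read())


def sketch_from_url(url):
    """Fetch a URL and delegate to the string version."""
    with urllib.request.urlopen(url) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        body = resp.read().decode(charset, errors="replace")
    return sketch_from_string(body)
```

Keeping the logic in one function means any fix to the extraction step applies to all three input types at once.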
unfluff depends on:
- lxml (fancy HTML parsing)
- scipy (fancy maths)
Both of these are native (C) extensions, which means you're best off looking for them in your friendly neighborhood package manager.