unfluff
Release 0.2

HTML content extraction - remove the fluff

Keywords: html, content, extraction
License: BSD-3-Clause
Install: pip install unfluff==0.2

Documentation

Unfluff

A statistical content extraction tool written in Python. It removes the useless fluff from arbitrary HTML pages.

Based on methods discussed (and implemented) in various places, but most directly:

An experiment / work in progress.

Usage:

The command line tool can either take a file or a URL to extract. It prints the content tree to stdout:

unfluff /path/to/something.html

or

unfluff -u 'http://some-website.com/interesting-article.html'

The unfluff library has a few functions, which all do the same thing for different input sources:

import unfluff
unfluff.from_url('http://whatever/')
unfluff.from_file('/tmp/input.html')
unfluff.from_string("<html>inline content</html>")
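
Since the three functions differ only in where the HTML comes from, a caller may want a single dispatch point. The following is a minimal sketch, not part of unfluff: `pick_entry_point` and `extract` are hypothetical helper names, and the rule used to tell a URL from a path from inline markup is an assumption. The `api` argument is expected to be the `unfluff` module (or any stand-in object exposing the same three function names), and the sketch makes no assumption about what those functions return.

```python
from urllib.parse import urlparse


def pick_entry_point(source):
    """Return the name of the unfluff function that matches `source`.

    Hypothetical helper: the classification rule below is an assumption,
    not something unfluff itself defines.
    """
    # Check for markup first, because inline HTML may itself contain URLs.
    if source.lstrip().startswith("<"):
        return "from_string"
    # Treat only http(s) sources as URLs; everything else is a file path.
    if urlparse(source).scheme in ("http", "https"):
        return "from_url"
    return "from_file"


def extract(source, api):
    """Call the matching entry point on `api` and return its result as-is."""
    return getattr(api, pick_entry_point(source))(source)
```

With unfluff installed, usage would look like `import unfluff` followed by `extract('/tmp/input.html', unfluff)`. Passing the module in as an argument keeps the helper easy to exercise with a stand-in object.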

Requirements:

lxml (fancy HTML parsing)
scipy (fancy maths)

Both of these are native (C) extensions, so you're best off installing them from your friendly neighborhood package manager.
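
Because both requirements are compiled extensions, it can help to confirm that they are importable before running unfluff. The check below is a small sketch using only the standard library; `missing_requirements` and `REQUIRED` are hypothetical names introduced for illustration, and the check only looks for the modules without verifying their versions.

```python
from importlib.util import find_spec

# Top-level module names taken from the requirements list above.
REQUIRED = ("lxml", "scipy")


def missing_requirements(names=REQUIRED):
    """Return the module names from `names` that cannot be located."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_requirements()
    if missing:
        print("Missing requirements: " + ", ".join(missing))
    else:
        print("All requirements found.")
```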

Licence:

BSD

Dependencies: 0
Dependent packages: 0
Dependent repositories: 0
Total releases: 2
Latest release: Jan 12, 2011
First release: Dec 24, 2009
Stars: 15
Forks: 0
Watchers: 1
Contributors: 1
Repository size: 105 KB
SourceRank: 7

Releases

0.2: Jan 12, 2011
0.1: Dec 24, 2009

Last synced: 2021-02-23 03:35:27 UTC
