boilerpipe3-fix
Release 1.1

Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages with Python 3 support

Keywords: boilerpipe
License: Apache-2.0
Install: pip install boilerpipe3-fix==1.1

Documentation

boilerpipe3

Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

Installation

You can install this library directly from the GitHub repository by running this command:

pip install git+ssh://git@github.com/slaveofcode/boilerpipe3@master

Or from the official PyPI index:

pip install boilerpipe3

Configuration

Dependencies: jpype, charade

The boilerpipe jar files will get fetched and included automatically when building the package.

Usage

Be sure to have set JAVA_HOME properly since jpype depends on this setting.
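
As a quick sanity check before importing the package, the setting can be inspected from Python. The sketch below uses only the standard library; the helper name find_java_home is hypothetical and not part of boilerpipe3, and it only confirms that JAVA_HOME names an existing directory, not that the directory holds a usable Java installation.

```python
import os


def find_java_home(environ=None):
    """Return JAVA_HOME if it is set and names an existing directory, else None.

    Hypothetical helper for illustration; not part of boilerpipe3 or jpype.
    """
    env = os.environ if environ is None else environ
    java_home = env.get("JAVA_HOME", "").strip()
    if java_home and os.path.isdir(java_home):
        return java_home
    return None


java_home = find_java_home()
if java_home is None:
    print("JAVA_HOME is unset or does not point to a directory")
else:
    print("Using JAVA_HOME =", java_home)
```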

The constructor takes a keyword argument extractor, which must be one of the available boilerpipe extractor types:

DefaultExtractor
ArticleExtractor
ArticleSentencesExtractor
KeepEverythingExtractor
KeepEverythingWithMinKWordsExtractor
LargestContentExtractor
NumWordsRulesExtractor
CanolaExtractor

If no extractor is passed, the DefaultExtractor is used. The content to process is supplied through one further keyword argument: either html for HTML text or url.

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract relevant content:

extracted_text = extractor.getText()

extracted_html = extractor.getHTML()
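
The same calls also work with the html keyword instead of url. The following is a minimal sketch under a few assumptions: the helper names build_extractor_kwargs and extract_text are hypothetical, the set of extractor names is copied from the list above, and treating "both html and url" or "neither" as an error is a choice made for this sketch rather than documented library behavior.

```python
# Extractor type names as listed in this README.
KNOWN_EXTRACTORS = {
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
}


def build_extractor_kwargs(extractor="DefaultExtractor", html=None, url=None):
    """Hypothetical helper: validate arguments before constructing Extractor."""
    if extractor not in KNOWN_EXTRACTORS:
        raise ValueError("unknown extractor type: %r" % (extractor,))
    # Assumption of this sketch: exactly one content source must be given.
    if (html is None) == (url is None):
        raise ValueError("pass exactly one of html or url")
    kwargs = {"extractor": extractor}
    if html is not None:
        kwargs["html"] = html
    else:
        kwargs["url"] = url
    return kwargs


def extract_text(**options):
    """Validate the options, run boilerpipe, and return the extracted text."""
    kwargs = build_extractor_kwargs(**options)
    # Imported lazily so the validation helper works without Java installed.
    from boilerpipe.extract import Extractor
    return Extractor(**kwargs).getText()
```

With the package installed and JAVA_HOME set, a call such as extract_text(extractor='ArticleExtractor', html=page_source) would return the extracted text, where page_source is a placeholder for an HTML string you already hold.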

Dependencies: 0
Dependent packages: 0
Dependent repositories: 0
Total releases: 1
Latest release: Oct 4, 2018
First release: Oct 4, 2018
Stars: 38
Forks: 13
Watchers: 1
Contributors: 4
Repository size: 7.23 MB
SourceRank: 8

Releases

1.1: Oct 4, 2018

Last synced: 2023-11-30 16:18:41 UTC
