vellichor

A succinct article text extractor.


License
MIT
Install
pip install vellichor==0.0.2

Documentation

Vellichor: a succinct article text extractor

Vellichor (n): the strange wistfulness of used bookstores

Vellichor's aims aren't ambitious. It does its duty relatively well, living a simple package's life, sustaining itself solely on URL or HTML strings. Provide it with these basic comforts and you shall receive a lean, healthy block of article text.

Quickstart

Dependencies

Despite its simple purpose, Vellichor has a few dependencies, as it uses a random forest model to classify candidate HTML nodes as relevant or not. These will be installed automatically if you don't already have them: urlvalidator, requests, commonregex, lxml, beautifulsoup4, scipy, scikit-learn, numpy. The library has been tested with Python 3.6 only.

Installation

Using virtualenv is a good idea, since you may want to keep existing versions of those dependencies untouched elsewhere:

virtualenv test_env --python=python3.6

You can use pip to install Vellichor:

pip install vellichor

Usage

Vellichor extracts relevant text from an article URL or HTML string. To begin, import the Extract class:

from vellichor.extract import Extract

You can then create an instance of Extract and feed a URL or HTML string to several methods:

url = "http://www.example.com/you-wont-believe-these-examples"
html = "<html><p>Example</p></html>"

extract = Extract()

# Main method
article_text = extract.article_text_from(url)
# OR extract.article_text_from(html=html)

# Extract raw text directly from the retrieved HTML
raw_text = extract.raw_text_from(url)

# Extract the HTML only (accepts a URL parameter only)
html_only = extract.html_from(url)

# Output a Beautiful Soup object from the retrieved HTML
soup = extract.soup_from(url)

To extract text from a sea of article URLs, be sure to instantiate a fresh Extract for every new URL.
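For example, a batch run might look like this (a sketch, not verbatim from the docs; the URLs are placeholders, and a fresh Extract is built per iteration so state such as html and soup doesn't leak between articles):

```python
from vellichor.extract import Extract

urls = [
    "http://www.example.com/first-example",
    "http://www.example.com/second-example",
]

texts = {}
for url in urls:
    extractor = Extract()  # fresh instance per URL
    texts[url] = extractor.article_text_from(url)
```

Note that this requires network access, since each call fetches and parses the page behind the URL.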

Not satisfied with just a clean block of text? Vellichor comes with a few methods for extracting some basic details:

extract.article_details()

# outputs a list of author candidates: ["Dr. Exampleton"]
extract.author

# outputs the site name: "Example"
extract.site_name

# outputs the article title: "You Won't Believe these Examples!"
extract.article_title

A few things to note. First, running the article_text_from() method on an instance of Extract automatically gives access to the following class attributes: html, article_text, soup, and soup_blocks (a collection of candidate nodes, i.e. <p> tags, that were used to decide the final output text).
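To illustrate what such candidate blocks look like, here is a stand-alone sketch using only the standard library. This is not Vellichor's actual implementation (which uses Beautiful Soup plus a random forest classifier); it simply collects the text of every <p> tag, the same kind of nodes Vellichor scores:

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # collected <p> texts (candidate blocks)
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._buffer).strip()
            if text:                 # skip empty paragraphs
                self.blocks.append(text)

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

collector = ParagraphCollector()
collector.feed("<html><p>First candidate.</p><div>noise</div><p>Second.</p></html>")
print(collector.blocks)  # ['First candidate.', 'Second.']
```

Vellichor's real pipeline goes further: each candidate block is scored by the classifier, and only the winning blocks are joined into the final article text.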

Second, there is a bit of hierarchy built in. Running the get_soup_blocks() method also gives access to the soup and html attributes. Running get_soup() on your instance also gets you the html attribute.

raw_text is only available after the raw_text_from() method is called on an instance of Extract (a URL or HTML argument is required if this is the first method you call).

That's all folks.

...

I have always imagined that Paradise will be a kind of library.