Vellichor: a succinct article text extractor
Vellichor (n): the strange wistfulness of used bookstores
Vellichor's aims aren't ambitious. It does its duty relatively well, living a simple package's life, sustaining itself solely on URL or HTML strings. Provide it with these basic comforts and you shall receive a lean, healthy block of article text.
Quickstart
Dependencies
Despite its simple purpose, Vellichor has a few dependencies, because it uses a random forest model to classify each candidate HTML node as relevant or not. These will be installed automatically if you don't already have them: urlvalidator, requests, commonregex, lxml, beautifulsoup4, scipy, scikit-learn, numpy. The library was tested with Python 3.6 only.
Installation
Of course, a virtualenv would be a nice idea, considering you may want your existing versions of those dependencies left untouched:
virtualenv test_env --python=python3.6
You can use pip to install Vellichor:
pip install vellichor
Usage
Vellichor extracts relevant text from an article URL or HTML string. To begin, import the Extract class:
from vellichor.extract import Extract
You can then create an instance of Extract and feed a URL or HTML string to several methods:
    url = "http://www.example.com/you-wont-believe-these-examples"
    html = "<html><p>Example</p></html>"

    extract = Extract()

    # Main method
    article_text = extract.article_text_from(url)
    # OR: extract.article_text_from(html=html)

    # Extract raw text directly from the retrieved HTML
    raw_text = extract.raw_text_from(url)

    # Extract the HTML only - URL parameter only
    html_only = extract.html_from(url)

    # Outputs a Beautiful Soup object from the retrieved HTML
    soup = extract.soup_from(url)
To extract text from a sea of article URLs, be sure to instantiate Extract for every new URL.
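A minimal sketch of that advice follows. The helper name extract_many and its make_extractor parameter are made up for illustration and are not part of Vellichor; the only library call relied on is article_text_from(url), as shown above. Building a new instance per URL presumably matters because each instance keeps what it retrieved as attributes.

```python
def extract_many(urls, make_extractor):
    """Map each URL to its article text, using a fresh extractor per URL.

    make_extractor is any zero-argument callable that returns an object with
    an article_text_from(url) method. With Vellichor installed, the Extract
    class itself can be passed in.
    """
    texts = {}
    for url in urls:
        extractor = make_extractor()  # one new instance for every URL
        texts[url] = extractor.article_text_from(url)
    return texts


# With Vellichor installed (not executed here):
#   from vellichor.extract import Extract
#   texts = extract_many(["http://www.example.com/a",
#                         "http://www.example.com/b"], Extract)
```

Passing the class in as a factory keeps the helper free of network access when you want to try it with a stand-in object.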
Not satisfied with just a clean block of text? Vellichor comes with a few methods for extracting some basic details:
    extract.article_details()

    # outputs a list of author candidates: ["Dr. Exampleton"]
    extract.author

    # outputs the site name: "Example"
    extract.site_name

    # outputs the article title: "You Won't Believe these Examples!"
    extract.article_title
A few things to note. First, running the article_text_from() method on an instance of Extract automatically gives access to the following attributes: html, article_text, soup, and soup_blocks (a collection of candidate nodes, or <p> tags, that were used for deciding the final output text).
Second, there is a bit of hierarchy built in. Running the get_soup_blocks() method also gives access to the soup and html attributes. Running get_soup() on your instance also gets you the html attribute.
Finally, raw_text is only available once the raw_text_from() method has been called on an instance of Extract (the URL or HTML parameter is required if this is the first method you call).
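The hierarchy in the notes above can be pictured with a toy stand-in. The sketch below is not Vellichor's source: the ToyExtract class, the get_html() method, the method signatures, and the regex-based "parsing" are all assumptions made for illustration. Only the names get_soup(), get_soup_blocks(), html, soup, and soup_blocks come from the text. It shows one way each step could populate the attributes of the steps beneath it.

```python
import re


class ToyExtract:
    """Hypothetical stand-in that mimics the cascade described above."""

    def __init__(self):
        self.html = None
        self.soup = None
        self.soup_blocks = None

    def get_html(self, html):
        # Lowest level: just remember the markup (hypothetical method).
        self.html = html
        return self.html

    def get_soup(self, html):
        # Stand-in for parsing: whitespace-normalised markup.
        # Calling this also sets self.html.
        self.soup = " ".join(self.get_html(html).split())
        return self.soup

    def get_soup_blocks(self, html):
        # Stand-in for candidate selection: inner text of each <p> element.
        # Calling this also sets self.soup and self.html.
        self.soup_blocks = re.findall(r"<p>(.*?)</p>", self.get_soup(html))
        return self.soup_blocks


toy = ToyExtract()
blocks = toy.get_soup_blocks("<html><p>Example</p>\n<p>Two</p></html>")
print(blocks)                 # ['Example', 'Two']
print(toy.soup is not None)   # True
print(toy.html is not None)   # True
```

Calling only get_soup() on a fresh ToyExtract would leave soup_blocks as None, which mirrors the one-directional dependency described in the notes.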
That's all, folks.
...
I have always imagined that Paradise will be a kind of library.