tech-parser
Parses articles from 34 sites and outputs them as HTML. It also works as a simple RSS reader.
You can see it in action here. And here's a template repository for deploying to Heroku.
Current list of sites
- habrahabr.ru (Russian)
- venturebeat.com
- engadget.com
- techrepublic.com
- techcrunch.com
- smashingmagazine.com
- theverge.com
- slashdot.org
- gizmodo.com
- androidcentral.com
- topdesignmag.com
- flowa.fi
- it.toolbox.com
- dzone.com
- codeproject.com
- news.ycombinator.com
- mashable.com
- maketecheasier.com
- digg.com
- wired.com
- medium.com
- planet.clojure.in
- reddit.com
- mobile-review.com (Russian)
- ixbt.ru (Russian)
- readwrite.com
- trashbox.ru (Russian)
- droider.ru (Russian)
- redroid.ru (Russian)
- 3dnews.ru (Russian)
- helpix.ru (Russian)
- recode.net
- zdnet.com
- geektimes.ru (Russian)
One awesome feature
New in 1.4.0
Before you scroll away, I want you to know about one awesome feature that TechParser has.
I'm talking about ranking.
Every time you click the like button below an article, TechParser adds that article to the database.
The next time it parses articles, it sorts them according to the articles stored in that database.
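To illustrate the idea, here is a minimal sketch of ranking new articles by their similarity to liked ones. It is not TechParser's actual implementation: the function names and the simple word-overlap score are assumptions made purely for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a title and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_articles(titles, liked_titles):
    """Sort titles so that those sharing words with liked titles come first."""
    liked_words = Counter()
    for title in liked_titles:
        liked_words.update(tokenize(title))

    def score(title):
        # Hypothetical score: how often the title's words appear in liked titles.
        return sum(liked_words[word] for word in tokenize(title))

    # sorted() is stable, so titles with equal scores keep their original order.
    return sorted(titles, key=score, reverse=True)

liked = ["Python 3 tips", "Async Python in practice"]
fresh = ["New phone released", "Python packaging guide", "CSS grid basics"]
ranked = rank_articles(fresh, liked)
```

Here the article mentioning "Python" moves to the top because that word appears in both liked titles.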
Installation
Requirements
Mako
Bottle
Grab
Daemo
All these modules can be installed with pip or easy_install.
How to install
TechParser works on both Python 2.X and 3.X, although I still recommend using Python 3.X.
You can install TechParser by running
pip install TechParser
or
python setup.py install
How to use
Run python -m TechParser start
to start the server.
Then open localhost:8080 in your browser.
python -m TechParser stop
to stop the server.
python -m TechParser update
to manually update the list of articles.
python -m TechParser run HOST:PORT
to run the server without starting the daemon.
python -m TechParser lock
to disallow updating articles.
python -m TechParser unlock
to allow updating articles (run this if you can't update articles).
python -m TechParser locked?
to check whether updating articles is allowed.
python -m TechParser train
to train the classifier (useful after changing ngrams).
python -m TechParser rerank
to rank articles again.
python -m TechParser -h
to show help.
python -m TechParser <action> --config <path to configuration file>
to set the path to the configuration file.
Run python -m TechParser --help
for more info.
To make usage easier, I recommend creating an alias like this:
alias tech-parser="python -m TechParser"
on *nix-based systems, or
doskey tech-parser=python -m TechParser $*
on Windows.
After that you will be able to run tech-parser
instead of python -m TechParser
.
Configuring
Don't forget to check out TechParser/parser_config.py
after updating.
Changing configuration in browser
New in 1.8.3
By default you have json_config=True
in ~/.tech-parser/user_parser_config.py
.
That allows you to edit the configuration right in your browser (click the Edit config
link).
Note that when you save your configuration in the browser, you update ~/.tech-parser/user_parser_config.json
, not ~/.tech-parser/user_parser_config.py
. To disable this, set json_config=False
in ~/.tech-parser/user_parser_config.py
and restart the parser.
Enabling/disabling parsers
To enable or disable site parsers, edit ~/.tech-parser/user_parser_config.py
.
If you can't find the file, run python -m TechParser
and then search again.
In that file, find the line with sites_to_parse
and disable the sites you don't want to see articles from.
For example, if you don't want to see articles from Habrahabr (it's in Russian only), find this fragment of code:
"Habrahabr": { # habrahabr.ru
"module": habrahabr,
"kwargs": {},
"enabled": True
},
and make it look like this:
"Habrahabr": { # habrahabr.ru
"module": habrahabr,
"kwargs": {},
"enabled": False
},
All you need to do is to set enabled
to False
.
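As a quick sanity check after editing, a small helper can list which parsers are still enabled. This is only a sketch: the dictionary below is a stand-in for sites_to_parse with placeholder strings instead of real parser modules, the "VentureBeat" key is a hypothetical entry, and treating a missing enabled key as True is an assumption.

```python
# Stand-in for sites_to_parse; real entries reference parser modules, not strings.
sites_to_parse = {
    "Habrahabr": {"module": "habrahabr", "kwargs": {}, "enabled": False},
    "VentureBeat": {"module": "venturebeat", "kwargs": {}, "enabled": True},  # hypothetical entry
}

def enabled_sites(sites):
    """Return the sorted names of sites whose 'enabled' flag is truthy."""
    # Assumption: an entry without an 'enabled' key counts as enabled.
    return sorted(name for name, options in sites.items()
                  if options.get("enabled", True))

print(enabled_sites(sites_to_parse))
```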
Setting password
New in 1.8.2
You can set a password in your configuration like this:
password = 'your password'
or
password = os.environ.get('TechParser_PASSWORD', '')
In the latter case you need to set the environment variable TechParser_PASSWORD
to your password.
After that, when you open TechParser in your browser, it will ask you to enter the password.
The session expires after a year.
Adding RSS feeds
New in 1.7.0
Find the following line in your configuration:
rss_feeds = {}
Each RSS feed should contain its name, URL, short name (without spaces and similar characters), icon URL and title color. Example feeds:
rss_feeds = {'CSS-tricks': {
'short-name': 'css-tricks',
'url': 'http://feeds.feedburner.com/CssTricks?format=xml',
'icon': 'http://css-tricks.com/favicon.ico',
'color': '#DA8817'
},
'The Next Web': {
'url': 'http://feeds2.feedburner.com/thenextweb',
'short-name': 'nextweb',
'icon': 'http://thenextweb.com/favicon.ico',
'color': '#F15A2F'
}
}
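Before restarting the parser, it can help to check that every feed has all the fields listed above. The helper below is a hedged sketch, not part of TechParser: the name find_feed_problems and the exact set of required keys are assumptions derived from the example.

```python
# Keys every feed in the example above provides (assumed to be required).
REQUIRED_KEYS = {"short-name", "url", "icon", "color"}

def find_feed_problems(feeds):
    """Return a list of human-readable problems found in an rss_feeds dict."""
    problems = []
    for name, feed in feeds.items():
        missing = REQUIRED_KEYS - set(feed)
        if missing:
            problems.append("{}: missing {}".format(name, ", ".join(sorted(missing))))
        if " " in feed.get("short-name", ""):
            problems.append("{}: short-name must not contain spaces".format(name))
    return problems

rss_feeds = {
    "CSS-tricks": {
        "short-name": "css-tricks",
        "url": "http://feeds.feedburner.com/CssTricks?format=xml",
        "icon": "http://css-tricks.com/favicon.ico",
        "color": "#DA8817",
    }
}
print(find_feed_problems(rss_feeds))
```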
Asynchronous parsing
New in 1.7.0
You can set the number of threads available for parsing.
To do that, set num_threads
in your configuration.
Example:
num_threads = 4
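The sketch below shows how a thread count like this is typically used to parse several sites in parallel. It assumes Python 3 and uses the standard concurrent.futures module; parse_site and parse_all are hypothetical placeholders, and TechParser's internal mechanism may differ.

```python
from concurrent.futures import ThreadPoolExecutor

num_threads = 4

def parse_site(name):
    """Placeholder for a real site parser; returns a list of article dicts."""
    return [{"source": name, "title": "Example article from " + name}]

def parse_all(site_names, workers):
    """Parse all sites using at most `workers` threads and flatten the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as site_names.
        results = pool.map(parse_site, site_names)
        return [article for articles in results for article in articles]

articles = parse_all(["site-a", "site-b", "site-c"], num_threads)
```

Because most of the time is spent waiting for network responses, more threads usually means a faster update, up to the number of enabled sites.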
Word lists
New in 1.7.5
Articles can also be sorted by words you find interesting and boring.
To do that you can set variables interesting_words
and boring_words
.
Example:
interesting_words = {'word1', 'word2', 'word3'}
boring_words = {'word4', 'word5', 'word6'}
You can also set priority for each word:
interesting_words = [['python', 5.0], ['fortran', 3.0], 'css', 'html', ['google', 1.5]]
boring_words = [['pascal', 10.0], 'delphi']
The default priority for each word is 1.
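To make the two formats concrete, here is a small sketch that turns such a list into a word-to-priority mapping and scores a title with it. It assumes Python 3; the helper names and the scoring formula (interesting priorities minus boring priorities) are illustrative assumptions, not TechParser's actual ranking code.

```python
def normalize_words(words, default_priority=1.0):
    """Convert a word list (plain words or [word, priority] pairs) into a dict."""
    weights = {}
    for entry in words:
        if isinstance(entry, str):
            # A bare word gets the default priority.
            weights[entry.lower()] = default_priority
        else:
            word, priority = entry
            weights[word.lower()] = float(priority)
    return weights

def word_score(title, interesting, boring):
    """Hypothetical score: sum of interesting priorities minus boring ones."""
    good = normalize_words(interesting)
    bad = normalize_words(boring)
    words = title.lower().split()
    return sum(good.get(w, 0.0) for w in words) - sum(bad.get(w, 0.0) for w in words)

interesting_words = [['python', 5.0], ['fortran', 3.0], 'css', 'html', ['google', 1.5]]
boring_words = [['pascal', 10.0], 'delphi']

print(word_score("Python and CSS tricks", interesting_words, boring_words))
```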
Update interval
Find the line of code in user_parser_config.py
like this:
update_interval = 1800
and set update_interval
to any number of seconds you want.
For example, if update_interval
is set to 3600
, TechParser will update data every hour.
Note that this hour is not counted from server start.
It means that every time the epoch time is divisible by 3600
TechParser will update articles.
With this interval TechParser will update articles at:
00:00
01:00
02:00
...
13:00
14:00
...and so on.
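The schedule above follows from simple arithmetic on the epoch time. The function below is a sketch of that calculation; next_update_time is a hypothetical helper name, not part of TechParser.

```python
import time

def next_update_time(update_interval, now=None):
    """Return the first epoch timestamp after `now` divisible by update_interval."""
    if now is None:
        now = time.time()
    # Round down to the previous multiple, then step one interval forward.
    return (int(now) // update_interval + 1) * update_interval

# With a one-hour interval, 01:30:00 (5400 s after the epoch) maps to 02:00:00.
print(next_update_time(3600, now=5400))
```

If now already falls exactly on a multiple of the interval, the function returns the following multiple, since the update for the current one is due immediately.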
Custom host and port
In ~/.tech-parser/user_parser_config.py
find two variables: host
and port
and set them equal to whatever host and port you want.
Example:
host="0.0.0.0"
port="8081"