Lets you parse articles from various IT-related sites.


Keywords
parser, article, web, ranking, rss
License
Other
Install
pip install TechParser==1.9.0

Documentation

tech-parser

Parses articles from 34 sites and outputs them as HTML. It also works as a kind of RSS reader.

You can see it in action here. There is also a template repository for deploying to Heroku.

Table of contents

  1. Current list of sites
  2. One awesome feature
  3. Installation
  4. How to use
  5. Configuring

Current list of sites

  1. habrahabr.ru (Russian)
  2. venturebeat.com
  3. engadget.com
  4. techrepublic.com
  5. techcrunch.com
  6. smashingmagazine.com
  7. theverge.com
  8. slashdot.org
  9. gizmodo.com
  10. androidcentral.com
  11. topdesignmag.com
  12. flowa.fi
  13. it.toolbox.com
  14. dzone.com
  15. codeproject.com
  16. news.ycombinator.com
  17. mashable.com
  18. maketecheasier.com
  19. digg.com
  20. wired.com
  21. medium.com
  22. planet.clojure.in
  23. reddit.com
  24. mobile-review.com (Russian)
  25. ixbt.ru (Russian)
  26. readwrite.com
  27. trashbox.ru (Russian)
  28. droider.ru (Russian)
  29. redroid.ru (Russian)
  30. 3dnews.ru (Russian)
  31. helpix.ru (Russian)
  32. recode.net
  33. zdnet.com
  34. geektimes.ru (Russian)

One awesome feature

New in 1.4.0
Before you scroll away, I want you to know about one awesome feature of TechParser: ranking.

Every time you click the like button below an article, TechParser adds that article to the database.
The next time it parses articles, it sorts them according to the articles stored in that database.

Installation

Requirements

Mako
Bottle
Grab
Daemo

All these modules can be installed with pip or easy_install.

How to install

TechParser works on both Python 2.X and 3.X, although I still recommend using Python 3.X.

You can install TechParser by running
pip install TechParser
or
python setup.py install

How to use

Run python -m TechParser start to start the server, then open localhost:8080 in your browser.
python -m TechParser stop to stop the server.
python -m TechParser update to manually update the list of articles.
python -m TechParser run HOST:PORT to run the server without starting the daemon.
python -m TechParser lock to disallow updating articles.
python -m TechParser unlock to allow updating articles (run this if you can't update articles).
python -m TechParser locked? to check whether updating articles is allowed.
python -m TechParser train to train the classifier (useful after changing ngrams).
python -m TechParser rerank to rank articles again.
python -m TechParser -h to show help.
python -m TechParser <action> --config <path to configuration file> to set the path to the configuration file.

Run python -m TechParser --help for more info.

To make usage easier, I recommend creating an alias like this:
alias tech-parser="python -m TechParser" on *nix-based systems, or
doskey tech-parser=python -m TechParser $* on Windows.
After that you will be able to run tech-parser instead of python -m TechParser.

Configuring

Don't forget to check out TechParser/parser_config.py after updating.

Changing configuration in browser

New in 1.8.3
By default, json_config=True is set in ~/.tech-parser/user_parser_config.py. This lets you edit the configuration right in your browser (click the Edit config link). Note that saving the configuration in the browser updates ~/.tech-parser/user_parser_config.json, not ~/.tech-parser/user_parser_config.py. To disable this, set json_config=False in ~/.tech-parser/user_parser_config.py and restart the parser.

Enabling/disabling parsers

To enable or disable site parsers, edit ~/.tech-parser/user_parser_config.py.
If you can't find the file, run python -m TechParser, then look again.
Find the line with sites_to_parse and disable the sites whose articles you don't want to see.

For example, if you don't want to see articles from Habrahabr (it's in Russian only), find this fragment of code:

		"Habrahabr": { # habrahabr.ru
			"module": habrahabr,
			"kwargs": {},
			"enabled": True
		},

and make it look like this:

		"Habrahabr": { # habrahabr.ru
			"module": habrahabr,
			"kwargs": {},
			"enabled": False
		},

All you need to do is set enabled to False.
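
As a quick check of which parsers are still active after editing, the enabled flags can be listed with a short script. This is only a sketch: the dictionary below is a hypothetical stand-in that mirrors the structure shown above, with module names written as plain strings so it runs without TechParser installed, and enabled_sites is an illustrative helper, not part of TechParser.

```python
# Hypothetical stand-in for sites_to_parse; the real configuration
# references parser modules instead of strings.
sites_to_parse = {
    "Habrahabr": {"module": "habrahabr", "kwargs": {}, "enabled": False},
    "VentureBeat": {"module": "venturebeat", "kwargs": {}, "enabled": True},
    "Engadget": {"module": "engadget", "kwargs": {}, "enabled": True},
}


def enabled_sites(sites):
    """Return the sorted names of sites whose "enabled" flag is true."""
    return sorted(name for name, cfg in sites.items() if cfg.get("enabled"))


print(enabled_sites(sites_to_parse))
```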

Setting password

New in 1.8.2

You can set a password in your configuration like this:

password = 'your password'

or

password = os.environ.get('TechParser_PASSWORD', '')

In the latter case, you need to set the environment variable TechParser_PASSWORD to your password (os.environ is only available if the configuration file imports os).
After that, TechParser will ask you to enter the password when you open it in your browser.
The session expires after a year.

Adding RSS feeds

New in 1.7.0
Find the following line in your configuration:

rss_feeds = {}

Each RSS feed should have a name, a URL, a short name (without spaces and the like), an icon URL, and a title color. Example feeds:

rss_feeds = {'CSS-tricks': {
		'short-name': 'css-tricks',
		'url': 'http://feeds.feedburner.com/CssTricks?format=xml',
		'icon': 'http://css-tricks.com/favicon.ico',
		'color': '#DA8817'
	},
	
	'The Next Web':	{
		'url': 'http://feeds2.feedburner.com/thenextweb',
		'short-name': 'nextweb',
		'icon': 'http://thenextweb.com/favicon.ico',
		'color': '#F15A2F'
	}
}
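
Before restarting the parser, it can help to check that every feed entry has the keys listed above. The following is a minimal sketch under the assumption that all four keys are required; REQUIRED_KEYS and find_feed_problems are hypothetical names used for illustration and are not part of TechParser.

```python
# Keys each feed is expected to define, taken from the description above.
REQUIRED_KEYS = {"short-name", "url", "icon", "color"}


def find_feed_problems(feeds):
    """Return a list of human-readable problems found in an rss_feeds dict."""
    problems = []
    for name, feed in feeds.items():
        missing = REQUIRED_KEYS - set(feed)
        if missing:
            problems.append("%s: missing %s" % (name, ", ".join(sorted(missing))))
        # The short name should not contain whitespace.
        if any(ch.isspace() for ch in feed.get("short-name", "")):
            problems.append("%s: short-name contains whitespace" % name)
    return problems


rss_feeds = {
    "CSS-tricks": {
        "short-name": "css-tricks",
        "url": "http://feeds.feedburner.com/CssTricks?format=xml",
        "icon": "http://css-tricks.com/favicon.ico",
        "color": "#DA8817",
    },
}

print(find_feed_problems(rss_feeds))
```

An empty list means no problems were found by these two checks; it does not guarantee that the feed URL itself is reachable.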

Asynchronous parsing

New in 1.7.0
You can set the number of threads available for parsing.
To do that, set num_threads in your configuration.
Example:

num_threads = 4

Word lists

New in 1.7.5
Articles can also be sorted by words you find interesting and boring. To do that, set the variables interesting_words and boring_words. Example:

interesting_words = {'word1', 'word2', 'word3'}
boring_words = {'word4', 'word5', 'word6'}

You can also set priority for each word:

interesting_words = [['python', 5.0], ['fortran', 3.0], 'css', 'html', ['google', 1.5]]
boring_words = [['pascal', 10.0], 'delphi']

The default priority for each word is 1.
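
To see how the two notations relate, the mixed list can be expanded into a plain word-to-priority mapping. This is an illustrative sketch only: word_priorities is a hypothetical helper, and how TechParser reads these lists internally is not shown here.

```python
def word_priorities(words, default=1.0):
    """Map each word to its priority; bare words get the default priority."""
    priorities = {}
    for item in words:
        if isinstance(item, str):
            # A bare word such as 'css' uses the default priority.
            priorities[item] = default
        else:
            # A [word, priority] pair such as ['python', 5.0].
            word, priority = item
            priorities[word] = float(priority)
    return priorities


interesting_words = [['python', 5.0], ['fortran', 3.0], 'css', 'html', ['google', 1.5]]
print(word_priorities(interesting_words))
```

The same helper also accepts the set form, in which every word ends up with the default priority.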

Update interval

Find a line like this in user_parser_config.py:

update_interval = 1800

and set update_interval to the number of seconds you want.

For example, if update_interval is set to 3600, the data is updated every hour.
Note that the hour is not counted from the server start.
Instead, TechParser updates articles every time the epoch time is divisible by 3600. With this interval TechParser will update articles at:
00:00
01:00
02:00
...
13:00
14:00
...and so on.
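
The alignment rule described above can be sketched as a small calculation. This is not TechParser's own scheduling code; seconds_until_next_update and next_update_time are hypothetical helpers that only illustrate the "epoch time divisible by update_interval" rule.

```python
import time


def seconds_until_next_update(update_interval, now=None):
    """Seconds from `now` until the next epoch time divisible by update_interval."""
    if now is None:
        now = time.time()
    return update_interval - (now % update_interval)


def next_update_time(update_interval, now=None):
    """Epoch time of the next update strictly after `now`."""
    if now is None:
        now = time.time()
    return now + seconds_until_next_update(update_interval, now)


# With a one-hour interval and a server started at epoch second 5400
# (half past the hour), the next update is 1800 seconds away, not 3600.
print(seconds_until_next_update(3600, now=5400))
```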

Custom host and port

In ~/.tech-parser/user_parser_config.py, find the two variables host and port and set them to the host and port you want.
Example:

host="0.0.0.0"
port="8081"