GotchaTwitter
A Python Twitter crawler.
- Supports crawling the timeline of a target user [within a given date range].
- (Developing) Support for crawling threads.
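To illustrate what restricting a timeline to a date range means, here is a minimal Python 3 sketch. It assumes tweets are already available as dicts with a `created_at` date; the function name `in_date_range` and the record layout are hypothetical and do not reflect GotchaTwitter's internals.

```python
from datetime import date


def in_date_range(tweets, since=None, until=None):
    """Keep tweets whose 'created_at' date lies in [since, until] (inclusive).

    A bound of None leaves that side of the range open.
    """
    kept = []
    for tweet in tweets:
        created = tweet["created_at"]
        if since is not None and created < since:
            continue
        if until is not None and created > until:
            continue
        kept.append(tweet)
    return kept


# Hypothetical records, for illustration only.
timeline = [
    {"id": 1, "created_at": date(2016, 1, 5)},
    {"id": 2, "created_at": date(2016, 3, 20)},
    {"id": 3, "created_at": date(2016, 7, 1)},
]
selected = in_date_range(timeline, since=date(2016, 2, 1), until=date(2016, 6, 30))
```

Leaving both bounds as `None` returns the whole timeline, which matches the optional nature of the date range.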
Warning
Using TwitterAPI to get user information by uid or screen_name is much faster and safer than the web-scraping method.
Dependencies
- bs4: Beautiful Soup.
- lxml: HTML parser for Beautiful Soup (requires a special installation method on Amazon EC2).
- tqdm: progress bar in the terminal.
- requestsplus: self-modified requests package with max retries and a sleep interval between requests.
- pushbullet.py: (optional) notifier for when crawling is finished.
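The "max retries and sleeping time between requests" behaviour described for requestsplus can be sketched generically. The Python 3 helper below is an illustration under stated assumptions, not the requestsplus implementation: the name `call_with_retries` and its parameters are hypothetical, and it wraps any callable rather than a specific HTTP client.

```python
import time


def call_with_retries(func, max_retries=3, sleep_seconds=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed.

    Waits sleep_seconds after each failed attempt (except the last one)
    and re-raises the final error once all attempts are used up.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(sleep_seconds)
```

With the requests package, a call might look like `call_with_retries(lambda: requests.get(url, timeout=10), max_retries=5, sleep_seconds=2)`. The `sleep` parameter exists only so the waiting can be replaced in tests.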
Example
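# Screen names of the users whose timelines should be crawled
screen_names = ['<screen_name_1>', '<screen_name_2>']
access_token = '<pushbullet_token>'  # only needed for the Pushbullet notifier
output_fp = '<filepath you want to save>'
with GotchaTwitter('timeline', screen_names, output_fp) as gt:
    # Configure the output and the optional notifier, then start crawling
    gt = gt.set_output(save_mode='w', has_header=True) \
           .set_notifier('pushbullet', access_token=access_token)
    gt.crawl()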
Notifier Setting
PushBullet
- Register a Pushbullet account and create an access token in your account settings.
- Download and install the Pushbullet app on your device (tested on iOS).
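Once the access token exists, sending a notification with pushbullet.py takes only a few lines. The sketch below uses that package's `Pushbullet(api_key)` constructor and `push_note(title, body)` method. The helper names `make_pushbullet_client` and `notify_finished` and the message wording are hypothetical, and this is not necessarily how GotchaTwitter's own notifier is written.

```python
def make_pushbullet_client(access_token):
    """Create a Pushbullet client (pushbullet.py is imported lazily)."""
    from pushbullet import Pushbullet  # provided by the pushbullet.py package
    return Pushbullet(access_token)


def notify_finished(client, job_name, n_items):
    """Push a short 'crawl finished' note through client.push_note()."""
    title = "Crawl finished: {}".format(job_name)
    body = "{} items saved.".format(n_items)
    client.push_note(title, body)
    return title, body
```

A call such as `notify_finished(make_pushbullet_client('<pushbullet_token>'), 'timeline', 120)` should then show up as a note on the device where the app is installed.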
Install lxml on Amazon Linux AMI (2016.03.3)
sudo yum install libxml2-devel libxslt-devel python-devel gcc
sudo pip install --upgrade setuptools
sudo /usr/local/bin/easy_install lxml
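After installation, a quick check can confirm that the required modules are importable. This is a minimal Python 3 sketch using only the standard library. The helper `missing_modules` is hypothetical, and the import names in `REQUIRED` are assumed from the dependency list.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed from the dependency list.
REQUIRED = ["bs4", "lxml", "tqdm"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If lxml is reported as missing on Amazon Linux, the build packages installed by the `yum` command above are the first thing to verify.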