What is blockedonweibo?
This Python package lets you automate tests that check whether a keyword is censored on Sina Weibo, a Chinese social media site. It is an updated version of the script used to detect blocked keywords for http://blockedonweibo.com. It handles interrupted tests, multiple tests per day, and storing results to a database, among other features that simplify testing Weibo censorship at scale. For one-off tests, the researcher merely has to feed the script a list of words. For recurring tests, simply wrap the script with a scheduler.
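The scheduler wrapping suggested above could be sketched with nothing but the standard library. Note that `run_on_schedule` and its `job` argument are hypothetical helpers, not part of the package; in practice the job would invoke `weibo.run(...)` with your keyword DataFrame and the interval would be something like 86400 seconds (one day):

```python
import time

def run_on_schedule(job, times=3, interval_seconds=0.01):
    """Invoke `job` repeatedly, sleeping between runs. A real daily test
    would use interval_seconds=86400 and a `job` that calls weibo.run(...)
    with a test_number argument."""
    results = []
    for test_number in range(1, times + 1):
        results.append(job(test_number))
        time.sleep(interval_seconds)
    return results

# Stand-in job; in practice this would call weibo.run(..., test_number=n)
runs = run_on_schedule(lambda n: "test %d" % n)
print(runs)  # ['test 1', 'test 2', 'test 3']
```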
Install the blockedonweibo package
The github repo for this Weibo keyword testing script is located at https://github.com/jasonqng/blocked-on-weibo.
To begin using this Python package, run the following in your terminal:
git clone https://github.com/jasonqng/blocked-on-weibo.git
or
git clone git@github.com:jasonqng/blocked-on-weibo.git
(if you prefer ssh).
Then cd into the downloaded directory and install the requirements, followed by the package itself:
pip install -r requirements.txt
python setup.py install
To confirm the installation works, open a Python shell (run python from your terminal) and try importing the package:
import blockedonweibo
If you don't get any errors, things have installed successfully. If not, you may need to fiddle with your python paths and settings to ensure it's being installed to the correct location.
Adjust your settings
Your Python script only requires the following imports; everything else is handled by the package.
from blockedonweibo import weibo
import pandas as pd
You have the option of saving your test results to a file. You'll need to pass a path to a file which will store the results in sqlite format. It can be helpful to set this at the top of your script and pass the variable each time you run the test.
sqlite_file = 'results.sqlite' # name of sqlite file to read from/write to
If you want to erase any existing data you have in the sqlite file defined above, just pass overwrite=True to the create_database
function. Otherwise any new results will be appended to the end of the database.
weibo.create_database(sqlite_file, overwrite=True)
This testing script is enhanced if you allow it to log into Weibo, which increases your rate limit threshold and also returns the number of results a search says it has. The script will work without your supplying credentials, but supplying them is highly recommended. To do so, edit weibo_credentials.py with your email address and password. The file is gitignored and will not be uploaded by default when you push commits to GitHub. You can inspect the code to verify that the credentials don't go anywhere except to Weibo.
Using those credentials, the script logs you in and fetches a cookie for the user session you create. This cookie can be saved to a file by passing the write_cookie
parameter in the user_login
function.
session = weibo.user_login(write_cookie=True)
There is a helper function to verify that the cookie actually works:
cookie = session.cookies.get_dict()
print(weibo.verify_cookies_work(cookie))
True
If you already have the cookie written to disk, you don't need to perform another user_login; instead, use the load_cookies function to fetch the cookie from the file. Again, you can verify that it works. Just store the cookie's contents (a dictionary) in a variable and pass it to the run function below if you want to test as if you were logged in. Otherwise, it will emulate a search by a logged-out user.
cookie = weibo.load_cookies()
print(weibo.verify_cookies_work(cookie))
True
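The two paths above (reusing a cached cookie versus logging in afresh) could be combined into one helper. This is a sketch, not part of the package: `get_working_cookie` is hypothetical, and in practice you would pass in `weibo.load_cookies`, `weibo.verify_cookies_work`, and a wrapper around `weibo.user_login`:

```python
def get_working_cookie(load_cookies, verify, login):
    """Reuse a cached cookie when it still verifies; otherwise log in again.
    The three arguments are injected so the fallback logic is testable."""
    try:
        cookie = load_cookies()
        if verify(cookie):
            return cookie
    except IOError:  # no cookie file on disk yet
        pass
    return login()

# Demonstration with stand-ins (the real functions come from weibo):
cookie = get_working_cookie(lambda: {"stale": True},   # cached cookie
                            lambda c: False,           # ...that fails verification
                            lambda: {"fresh": True})   # fresh login wins
print(cookie)  # {'fresh': True}
```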
Let's start testing!
Pass a DataFrame of keywords to start testing
sample_keywords_df = pd.DataFrame(
[{'keyword':'hello','source':'my dataframe'},
{'keyword':'lxb','source':'my dataframe'},
{'keyword':u'习胞子','source':'my dataframe'}
])
sample_keywords_df
| | keyword | source |
|---|---|---|
0 | hello | my dataframe |
1 | lxb | my dataframe |
2 | 习胞子 | my dataframe |
weibo.run(sample_keywords_df,insert=False,return_df=True)
(0, u'hello', 'has_results')
(1, u'lxb', 'censored')
(2, u'\u4e60\u80de\u5b50', 'no_results')
| | date | datetime | keyword | num_results | result | source | test_number |
|---|---|---|---|---|---|---|---|
0 | 2017-09-25 | 2017-09-25 10:12:45.280812 | hello | [] | has_results | my dataframe | 1 |
0 | 2017-09-25 | 2017-09-25 10:13:00.191900 | lxb | None | censored | my dataframe | 1 |
0 | 2017-09-25 | 2017-09-25 10:13:16.356805 | 习胞子 | None | no_results | my dataframe | 1 |
Pass in cookies so you can also get the number of results. Pass in sqlite_file to save the results to disk so you can load them later.
weibo.run(sample_keywords_df,sqlite_file=sqlite_file,cookies=cookie)
(0, u'hello', 'has_results')
(1, u'lxb', 'censored')
(2, u'\u4e60\u80de\u5b50', 'no_results')
weibo.sqlite_to_df(sqlite_file)
| | id | date | datetime_logged | test_number | keyword | censored | no_results | reset | result | source | num_results | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2017-09-25 | 2017-09-25 10:13:37.816720 | 1 | hello | 0 | 0 | 0 | has_results | my dataframe | 80454701.0 | None |
1 | 1 | 2017-09-25 | 2017-09-25 10:13:54.356722 | 1 | lxb | 0 | 0 | 0 | censored | my dataframe | NaN | None |
2 | 2 | 2017-09-25 | 2017-09-25 10:14:11.489530 | 1 | 习胞子 | 0 | 0 | 0 | no_results | my dataframe | NaN | None |
If your test gets interrupted or you add more keywords, you can pick up where you left off
Let's pretend I wanted to test four total keywords, but I was only able to complete the first three above. I'll go ahead and add one more keyword to the test list to simulate an interrupted test.
sample_keywords_df.loc[len(sample_keywords_df.index)] = ['刘晓波','my dataframe']
sample_keywords_df
| | keyword | source |
|---|---|---|
0 | hello | my dataframe |
1 | lxb | my dataframe |
2 | 习胞子 | my dataframe |
3 | 刘晓波 | my dataframe |
weibo.run(sample_keywords_df,sqlite_file=sqlite_file,cookies=cookie)
(3, u'\u5218\u6653\u6ce2', 'censored')
Neat-o, it was smart enough to start right at that new keyword and not start all over again!
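The resume behavior might work roughly like this (an assumption about the internals: keywords already recorded in the sqlite file for the current test_number are presumably filtered out before testing begins):

```python
# Keywords already recorded in results.sqlite for this test_number...
already_tested = ["hello", "lxb", u"习胞子"]
# ...versus the full input list:
keywords = ["hello", "lxb", u"习胞子", u"刘晓波"]

# Only the untested remainder gets queried against Weibo.
remaining = [k for k in keywords if k not in already_tested]
print(remaining)  # ['刘晓波']
```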
weibo.sqlite_to_df(sqlite_file)
| | id | date | datetime_logged | test_number | keyword | censored | no_results | reset | result | source | num_results | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2017-09-25 | 2017-09-25 10:13:37.816720 | 1 | hello | 0 | 0 | 0 | has_results | my dataframe | 80454701.0 | None |
1 | 1 | 2017-09-25 | 2017-09-25 10:13:54.356722 | 1 | lxb | 0 | 0 | 0 | censored | my dataframe | NaN | None |
2 | 2 | 2017-09-25 | 2017-09-25 10:14:11.489530 | 1 | 习胞子 | 0 | 0 | 0 | no_results | my dataframe | NaN | None |
3 | 3 | 2017-09-25 | 2017-09-25 10:14:29.667395 | 1 | 刘晓波 | 0 | 0 | 0 | censored | my dataframe | NaN | None |
You can attach notes or categorizations to your keywords for easy querying and analysis later
new_keywords_df = pd.DataFrame(
[{'keyword':'pokemon','source':'my dataframe',"notes":"pop culture"},
{'keyword':'jay chou','source':'my dataframe',"notes":"pop culture"},
{'keyword':u'weibo','source':'my dataframe',"notes":"social media"}
])
merged_keywords_df = pd.concat([sample_keywords_df,new_keywords_df]).reset_index(drop=True)
merged_keywords_df
| | keyword | notes | source |
|---|---|---|---|
0 | hello | NaN | my dataframe |
1 | lxb | NaN | my dataframe |
2 | 习胞子 | NaN | my dataframe |
3 | 刘晓波 | NaN | my dataframe |
4 | pokemon | pop culture | my dataframe |
5 | jay chou | pop culture | my dataframe |
6 | weibo | social media | my dataframe |
weibo.run(merged_keywords_df,sqlite_file=sqlite_file,cookies=cookie)
(4, u'pokemon', 'has_results')
(5, u'jay chou', 'has_results')
(6, u'weibo', 'has_results')
weibo.sqlite_to_df(sqlite_file)
| | id | date | datetime_logged | test_number | keyword | censored | no_results | reset | result | source | num_results | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2017-09-25 | 2017-09-25 10:13:37.816720 | 1 | hello | 0 | 0 | 0 | has_results | my dataframe | 80454701.0 | None |
1 | 1 | 2017-09-25 | 2017-09-25 10:13:54.356722 | 1 | lxb | 0 | 0 | 0 | censored | my dataframe | NaN | None |
2 | 2 | 2017-09-25 | 2017-09-25 10:14:11.489530 | 1 | 习胞子 | 0 | 0 | 0 | no_results | my dataframe | NaN | None |
3 | 3 | 2017-09-25 | 2017-09-25 10:14:29.667395 | 1 | 刘晓波 | 0 | 0 | 0 | censored | my dataframe | NaN | None |
4 | 4 | 2017-09-25 | 2017-09-25 10:14:49.107078 | 1 | pokemon | 0 | 0 | 0 | has_results | my dataframe | 5705260.0 | pop culture |
5 | 5 | 2017-09-25 | 2017-09-25 10:15:09.762484 | 1 | jay chou | 0 | 0 | 0 | has_results | my dataframe | 881.0 | pop culture |
6 | 6 | 2017-09-25 | 2017-09-25 10:15:28.100418 | 1 | weibo | 0 | 0 | 0 | has_results | my dataframe | 63401495.0 | social media |
results = weibo.sqlite_to_df(sqlite_file)
results.query("notes=='pop culture'")
| | id | date | datetime_logged | test_number | keyword | censored | no_results | reset | result | source | num_results | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 4 | 2017-09-25 | 2017-09-25 10:14:49.107078 | 1 | pokemon | 0 | 0 | 0 | has_results | my dataframe | 5705260.0 | pop culture |
5 | 5 | 2017-09-25 | 2017-09-25 10:15:09.762484 | 1 | jay chou | 0 | 0 | 0 | has_results | my dataframe | 881.0 | pop culture |
results.query("notes=='pop culture'").num_results.mean()
2853070.5
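Because the results live in a plain sqlite file, the same aggregation could also be done in SQL with the standard library. The in-memory table below only mirrors the columns shown above, and the actual table name inside results.sqlite is an assumption:

```python
import sqlite3

# Build an in-memory table mirroring the results schema shown above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (keyword TEXT, result TEXT, num_results REAL, notes TEXT)"
)
rows = [
    ("pokemon", "has_results", 5705260.0, "pop culture"),
    ("jay chou", "has_results", 881.0, "pop culture"),
    ("weibo", "has_results", 63401495.0, "social media"),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

# Average result count for the 'pop culture' keywords, as in the
# pandas query above.
avg, = conn.execute(
    "SELECT AVG(num_results) FROM results WHERE notes = 'pop culture'"
).fetchone()
print(avg)  # 2853070.5
```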
If you want to test multiple times a day, just pass in the test_number param
You can turn off verbose output in case you don't need to troubleshoot anything...
weibo.run(sample_keywords_df,sqlite_file=sqlite_file,cookies=cookie,verbose='none',test_number=2)
weibo.sqlite_to_df(sqlite_file)
| | id | date | datetime_logged | test_number | keyword | censored | no_results | reset | result | source | num_results | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2017-09-25 | 2017-09-25 10:13:37.816720 | 1 | hello | 0 | 0 | 0 | has_results | my dataframe | 80454701.0 | None |
1 | 1 | 2017-09-25 | 2017-09-25 10:13:54.356722 | 1 | lxb | 0 | 0 | 0 | censored | my dataframe | NaN | None |
2 | 2 | 2017-09-25 | 2017-09-25 10:14:11.489530 | 1 | 习胞子 | 0 | 0 | 0 | no_results | my dataframe | NaN | None |
3 | 3 | 2017-09-25 | 2017-09-25 10:14:29.667395 | 1 | 刘晓波 | 0 | 0 | 0 | censored | my dataframe | NaN | None |
4 | 4 | 2017-09-25 | 2017-09-25 10:14:49.107078 | 1 | pokemon | 0 | 0 | 0 | has_results | my dataframe | 5705260.0 | pop culture |
5 | 5 | 2017-09-25 | 2017-09-25 10:15:09.762484 | 1 | jay chou | 0 | 0 | 0 | has_results | my dataframe | 881.0 | pop culture |
6 | 6 | 2017-09-25 | 2017-09-25 10:15:28.100418 | 1 | weibo | 0 | 0 | 0 | has_results | my dataframe | 63401495.0 | social media |
7 | 7 | 2017-09-25 | 2017-09-25 10:15:46.214464 | 2 | hello | 0 | 0 | 0 | has_results | my dataframe | 80454634.0 | None |
8 | 8 | 2017-09-25 | 2017-09-25 10:16:03.274804 | 2 | lxb | 0 | 0 | 0 | censored | my dataframe | NaN | None |
9 | 9 | 2017-09-25 | 2017-09-25 10:16:19.035805 | 2 | 习胞子 | 0 | 0 | 0 | no_results | my dataframe | NaN | None |
10 | 10 | 2017-09-25 | 2017-09-25 10:16:36.021837 | 2 | 刘晓波 | 0 | 0 | 0 | censored | my dataframe | NaN | None |
It can skip redundant keywords
more_keywords_df = pd.DataFrame(
[{'keyword':'zhongnanhai','source':'my dataframe2',"notes":"location"},
{'keyword':'cats','source':'my dataframe2',"notes":"pop culture"},
{'keyword':'zhongnanhai','source':'my dataframe2',"notes":"location"}
])
more_keywords_df
| | keyword | notes | source |
|---|---|---|---|
0 | zhongnanhai | location | my dataframe2 |
1 | cats | pop culture | my dataframe2 |
2 | zhongnanhai | location | my dataframe2 |
weibo.run(more_keywords_df,sqlite_file=sqlite_file,cookies=cookie)
(0, u'zhongnanhai', 'has_results')
(1, u'cats', 'has_results')
weibo.sqlite_to_df(sqlite_file)
| | id | date | datetime_logged | test_number | keyword | censored | no_results | reset | result | source | num_results | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2017-09-25 | 2017-09-25 10:13:37.816720 | 1 | hello | 0 | 0 | 0 | has_results | my dataframe | 80454701.0 | None |
1 | 1 | 2017-09-25 | 2017-09-25 10:13:54.356722 | 1 | lxb | 0 | 0 | 0 | censored | my dataframe | NaN | None |
2 | 2 | 2017-09-25 | 2017-09-25 10:14:11.489530 | 1 | 习胞子 | 0 | 0 | 0 | no_results | my dataframe | NaN | None |
3 | 3 | 2017-09-25 | 2017-09-25 10:14:29.667395 | 1 | 刘晓波 | 0 | 0 | 0 | censored | my dataframe | NaN | None |
4 | 4 | 2017-09-25 | 2017-09-25 10:14:49.107078 | 1 | pokemon | 0 | 0 | 0 | has_results | my dataframe | 5705260.0 | pop culture |
5 | 5 | 2017-09-25 | 2017-09-25 10:15:09.762484 | 1 | jay chou | 0 | 0 | 0 | has_results | my dataframe | 881.0 | pop culture |
6 | 6 | 2017-09-25 | 2017-09-25 10:15:28.100418 | 1 | weibo | 0 | 0 | 0 | has_results | my dataframe | 63401495.0 | social media |
7 | 7 | 2017-09-25 | 2017-09-25 10:15:46.214464 | 2 | hello | 0 | 0 | 0 | has_results | my dataframe | 80454634.0 | None |
8 | 8 | 2017-09-25 | 2017-09-25 10:16:03.274804 | 2 | lxb | 0 | 0 | 0 | censored | my dataframe | NaN | None |
9 | 9 | 2017-09-25 | 2017-09-25 10:16:19.035805 | 2 | 习胞子 | 0 | 0 | 0 | no_results | my dataframe | NaN | None |
10 | 10 | 2017-09-25 | 2017-09-25 10:16:36.021837 | 2 | 刘晓波 | 0 | 0 | 0 | censored | my dataframe | NaN | None |
11 | 11 | 2017-09-25 | 2017-09-25 10:16:53.766351 | 1 | zhongnanhai | 0 | 0 | 0 | has_results | my dataframe2 | 109.0 | location |
12 | 12 | 2017-09-25 | 2017-09-25 10:17:14.124440 | 1 | cats | 0 | 0 | 0 | has_results | my dataframe2 | 648313.0 | pop culture |
You can also pass in plain lists if you prefer (you can't attach notes this way, and the source is recorded as 'list')
sample_keywords_list = ["cats",'yes','自由亚洲电台','刘晓波','dhfjkdashfjkasdsf87']
See below how it handles connection reset errors (it waits a little extra to make sure your connection clears before continuing testing)
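The wait-and-continue behavior might look roughly like this (`search_with_backoff` and `fake_search` are hypothetical stand-ins; the package's actual internals may differ):

```python
import time

def search_with_backoff(search, keyword, wait_seconds=95):
    """On a connection reset, sleep a little extra so the connection
    clears, record the keyword as 'reset', and let testing continue."""
    try:
        return search(keyword)
    except ConnectionResetError:
        time.sleep(wait_seconds)
        return "reset"

def fake_search(keyword):
    raise ConnectionResetError  # simulate Weibo dropping the connection

print(search_with_backoff(fake_search, u"自由亚洲电台", wait_seconds=0))  # reset
```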
weibo.run(sample_keywords_list,sqlite_file=sqlite_file,cookies=cookie)
(0, u'cats', 'has_results')
(1, u'yes', 'has_results')
自由亚洲电台 caused connection reset, waiting 95
(2, u'\u81ea\u7531\u4e9a\u6d32\u7535\u53f0', 'reset')
(3, u'\u5218\u6653\u6ce2', 'censored')
(4, u'dhfjkdashfjkasdsf87', 'no_results')
weibo.sqlite_to_df(sqlite_file)
| | id | date | datetime_logged | test_number | keyword | censored | no_results | reset | result | source | num_results | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2017-09-25 | 2017-09-25 10:13:37.816720 | 1 | hello | 0 | 0 | 0 | has_results | my dataframe | 80454701.0 | None |
1 | 1 | 2017-09-25 | 2017-09-25 10:13:54.356722 | 1 | lxb | 0 | 0 | 0 | censored | my dataframe | NaN | None |
2 | 2 | 2017-09-25 | 2017-09-25 10:14:11.489530 | 1 | 习胞子 | 0 | 0 | 0 | no_results | my dataframe | NaN | None |
3 | 3 | 2017-09-25 | 2017-09-25 10:14:29.667395 | 1 | 刘晓波 | 0 | 0 | 0 | censored | my dataframe | NaN | None |
4 | 4 | 2017-09-25 | 2017-09-25 10:14:49.107078 | 1 | pokemon | 0 | 0 | 0 | has_results | my dataframe | 5705260.0 | pop culture |
5 | 5 | 2017-09-25 | 2017-09-25 10:15:09.762484 | 1 | jay chou | 0 | 0 | 0 | has_results | my dataframe | 881.0 | pop culture |
6 | 6 | 2017-09-25 | 2017-09-25 10:15:28.100418 | 1 | weibo | 0 | 0 | 0 | has_results | my dataframe | 63401495.0 | social media |
7 | 7 | 2017-09-25 | 2017-09-25 10:15:46.214464 | 2 | hello | 0 | 0 | 0 | has_results | my dataframe | 80454634.0 | None |
8 | 8 | 2017-09-25 | 2017-09-25 10:16:03.274804 | 2 | lxb | 0 | 0 | 0 | censored | my dataframe | NaN | None |
9 | 9 | 2017-09-25 | 2017-09-25 10:16:19.035805 | 2 | 习胞子 | 0 | 0 | 0 | no_results | my dataframe | NaN | None |
10 | 10 | 2017-09-25 | 2017-09-25 10:16:36.021837 | 2 | 刘晓波 | 0 | 0 | 0 | censored | my dataframe | NaN | None |
11 | 11 | 2017-09-25 | 2017-09-25 10:16:53.766351 | 1 | zhongnanhai | 0 | 0 | 0 | has_results | my dataframe2 | 109.0 | location |
12 | 12 | 2017-09-25 | 2017-09-25 10:17:14.124440 | 1 | cats | 0 | 0 | 0 | has_results | my dataframe2 | 648313.0 | pop culture |
13 | 13 | 2017-09-25 | 2017-09-25 10:17:36.205255 | 1 | cats | 0 | 0 | 0 | has_results | list | 648313.0 | None |
14 | 14 | 2017-09-25 | 2017-09-25 10:17:54.330039 | 1 | yes | 0 | 0 | 0 | has_results | list | 28413048.0 | None |
15 | 15 | 2017-09-25 | 2017-09-25 10:19:47.007930 | 1 | 自由亚洲电台 | 0 | 0 | 0 | reset | list | NaN | None |
16 | 16 | 2017-09-25 | 2017-09-25 10:20:03.491231 | 1 | 刘晓波 | 0 | 0 | 0 | censored | list | NaN | None |
17 | 17 | 2017-09-25 | 2017-09-25 10:20:18.747414 | 1 | dhfjkdashfjkasdsf87 | 0 | 0 | 0 | no_results | list | NaN | None |