birdspotter: A tool to measure social attributes of Twitter users
birdspotter is a Python package providing a toolkit to measure the social influence and botness of Twitter users. It takes a Twitter dump in
jsonl format as input and produces measures for:
- Social Influence: The relative amount that one user can cause another user to adopt a behaviour, such as retweeting.
- Botness: The amount that a user appears automated.
Rizoiu, M.A., Graham, T., Zhang, R., Zhang, Y., Ackland, R. and Xie, L. #DebateNight: The Role and Influence of Socialbots on Twitter During the 1st 2016 US Presidential Debate. In Twelfth International AAAI Conference on Web and Social Media (ICWSM'18), 2018. https://arxiv.org/abs/1802.09808
Ram, R., & Rizoiu, M.-A. A social science-grounded approach for quantifying online social influence. In Australian Social Network Analysis Conference (ASNAC'19) (p. 2). Adelaide, Australia, 2019.
pip3 install birdspotter
birdspotter requires a python version
To use birdspotter on your own Twitter dump, replace './example.jsonl' with the path to your dump, for example './path/to/tweet/dump.json'. This example uses a bespoke dataset found in this repository. It can be downloaded here.
from birdspotter import BirdSpotter
bs = BirdSpotter('./example.jsonl')
# This may take a few minutes, go grab a coffee...
labeledUsers = bs.getLabeledUsers(out='./output.csv')
After extracting the tweets, getLabeledDataFrame() returns a pandas dataframe with the influence and botness labels of users, and writes a csv file if a path is specified through the out argument.
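Once the csv has been written, a quick look at its header and first few rows confirms that the run produced output. The following is a minimal sketch using only the standard library; preview_csv is a hypothetical helper (not part of birdspotter), and it makes no assumption about which columns birdspotter writes.

```python
import csv


def preview_csv(path, n_rows=5):
    """Return (header, rows) holding the first n_rows data rows of a csv file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # an empty file yields an empty header
        # zip stops after n_rows items without reading the rest of the file
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

For example, header, rows = preview_csv('./output.csv') reads the file written by the call above. pandas.read_csv('./output.csv') is an equally reasonable choice when pandas is already loaded.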
birdspotter relies on the Fasttext word embeddings wiki-news-300d-1M.vec, which will automatically be downloaded if not available in the current directory (./) or a relative data folder.
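Since the extraction can take a few minutes, it may be worth confirming beforehand that a dump really is in jsonl format, meaning one JSON object per line. The sketch below is a standalone check built on the standard library; check_jsonl is a hypothetical helper, not a birdspotter function, and it does not verify that each object is a well-formed tweet.

```python
import json


def check_jsonl(path, max_errors=5):
    """Return (n_valid, errors); errors is a list of (line_number, message)."""
    n_valid = 0
    errors = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((line_number, str(exc)))
            else:
                if isinstance(record, dict):
                    n_valid += 1
                else:
                    errors.append((line_number, "not a JSON object"))
            if len(errors) >= max_errors:
                break  # stop early on a clearly malformed file
    return n_valid, errors
```

Calling n_valid, errors = check_jsonl('./example.jsonl') and finding errors empty suggests the file is at least structurally suitable as input.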
Get Cascades Data
After extracting the tweets, the retweet cascades are accessible by using:
cascades = bs.getCascadesDataFrame()
This dataframe includes the expected structure of the retweet cascade, as given by Rizoiu et al. (2018), in the column expected_parent.
Adding more influence metrics
birdspotter provides DebateNight influence as standard when getLabeledUsers is run. To generate spatial-decay influence, run:
bs.getInfluenceScores(time_decay = -0.000068, alpha = 0.15, beta = 1.0)
This returns the updated featureDataframe with the influence scores appended as a column.
Training with your own botness data
birdspotter provides functionality for training the botness detector with your own training data. To generate a csv to be annotated, run:
Once annotated, the botness detector can be trained with:
Defining your own word embeddings
birdspotter provides functionality for defining your own word embeddings. For example:
customEmbedding # A mapping such as a dict() representing word embeddings
bs = BirdSpotter('./example.jsonl', embeddings=customEmbedding)
Embeddings can be set through several methods; refer to setWord2VecEmbeddings.
Note that the default bot training data uses the wiki-news-300d-1M.vec embeddings, and as such the bot detector would need to be retrained for alternative word embeddings.
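One way to build such a mapping is to read a fastText-style .vec text file, whose first line gives the vocabulary size and vector dimension and whose remaining lines each hold a word followed by its space-separated values. The sketch below is illustrative only: load_vec is a hypothetical helper, and whether birdspotter accepts plain Python lists as vectors (rather than, say, numpy arrays) is an assumption that should be checked against setWord2VecEmbeddings.

```python
def load_vec(path, limit=None):
    """Read a fastText-style .vec text file into {word: [float, ...]}."""
    embeddings = {}
    with open(path, encoding="utf-8", errors="ignore") as handle:
        header = handle.readline().split()
        dim = int(header[1])  # header line is "<vocabulary size> <dimension>"
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip lines that do not match the declared dimension
            embeddings[parts[0]] = [float(value) for value in parts[1:]]
            if limit is not None and len(embeddings) >= limit:
                break  # optionally keep only the first `limit` words
    return embeddings
```

The limit argument keeps memory use manageable when experimenting, since a full embedding file can hold on the order of a million vectors.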
Alternatives to python
birdspotter can be accessed through the command line to return a csv, with the recipe below:
birdspotter ./path/to/twitter/dump.json ./path/to/output/directory/
birdspotter functionality can be accessed in R via the reticulate package. reticulate still requires a Python installation on your system and birdspotter to be installed. The following produces the same results as the standard usage.
install.packages("reticulate")
library(reticulate)
use_python(Sys.which("python3"))
birdspotter <- import("birdspotter")
bs <- birdspotter$BirdSpotter("./example.jsonl")
bs$getLabeledDataFrame(out = './output.csv')
The development of this package was partially supported through a UTS Data Science Institute seed grant.