# We the People analysis toolkit

During the Obama administration, the “We the People” petitioning system was established to let US residents engage with issues they care about; the project gathered over 20 million signatures on almost 4,000 petitions over five years (2011 – 2016). This makes it possible to group states not by party affiliation, but by their levels of engagement on actual issues (petitions).

This toolkit provides tools to visualize states by their level of similarity/difference, to create novel clusterings of the electoral map, and to generate insights about those clusters. It is built with Python 2.7 and uses Spark, Hive, SciPy, scikit-learn and Matplotlib.

## Installation

The toolkit is available as a Python package:

```
pip install wethepeopletoolkit
```

#### Prerequisites

The following **won't** be installed with the toolkit, but need to be installed for everything to work smoothly:

- Apache Spark 2+
- Hive
- PyQt4

## Examples

2-D projection of states, colored by party affiliation:

```
$ wethepeopletoolkit projection --show-party-affiliation --z-score-exclude 1
```

K-means clustering of states with 6 clusters, z-score > 1 exclusion during initial centroid clustering, and PCA applied as a pre-processing step:

```
$ wethepeopletoolkit cluster -n 6 --model-type kmeans --pca --z-score-exclude 1
```

Evaluating the effectiveness of spectral clustering models with 2-8 clusters, using the Silhouette score and Euclidean distance:

```
$ wethepeopletoolkit cluster-evaluation -m spectral --range 2 8 --evaluation-metric silhouette
```

The top 10 most signed petitions for Utah:

```
$ wethepeopletoolkit top-petitions 27 -n 10
```

Topic extraction for the 500 most signed petitions for Washington, Oregon and Colorado:

```
$ wethepeopletoolkit topic-extraction 8y7nF3D1
```

## Usage

```
$ wethepeopletoolkit
Usage: wethepeopletoolkit [OPTIONS] COMMAND [ARGS]...

Options:
  -d, --data-directory PATH  Path to data (./data/ by default).
  -S, --spark-home PATH      Path to Spark installation (automatically
                             discovered by default).
  --help                     Show this message and exit.

Commands:
  fetch-data          Download and preprocess the neccessary data.
  projection          Create a 2-D projection of states w/ PCA.
  cluster             Performs clustering on states based on their...
  cluster-evaluation  Shows comparisons of performance as number of...
  top-petitions       Displays the top N most signed petitions for...
  topic-extraction    Performs topic extraction on the top N most...
```

#### Fetching data

The toolkit can automatically fetch and process the data necessary for clustering, topic extraction etc.

```
$ wethepeopletoolkit fetch-data --help
Usage: wethepeopletoolkit fetch-data [OPTIONS]

  Download and preprocess the neccessary data. By default, files will be
  downloaded to the directory ./data/

Options:
  --keep-files  Don't delete files after they've been extracted, converted and
                processes.
  --force       Recreate Hive tables, even if they already exist
  --help        Show this message and exit.
```

#### 2-D projection visualization

Generates a 2-D projection of the states, based on Principal Component Analysis of a 50 x 3892 matrix that describes the number of signatures per 1,000 residents for every combination of state and petition. States with similar patterns of engagement towards petitions will be closer together; those with dissimilar patterns will be further apart.

```
$ wethepeopletoolkit projection --help
Usage: wethepeopletoolkit projection [OPTIONS]

  Create a 2-D projection of states w/ PCA. States which react more
  similarly to petitions will be closer together.

Options:
  -p, --show-party-affiliation  Color states based on their affiliation to
                                Republicans/Democrats. Based on the 2014 Cook
                                Partisan Voting Index.
  --show-points                 Show points next to state labels.
  -z, --z-score-exclude FLOAT   Don't show points with a z-score higher than
                                this value. For example, -z 3.0 would exclude
                                points more than 3 standard deviations from
                                the mean. If the value is 0, no points are
                                excluded.
  --help                        Show this message and exit.
```
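
The projection described above can be approximated with a few lines of NumPy. This is a minimal sketch rather than the toolkit's actual implementation: the function names are hypothetical, the data is randomly generated, and the way the z-score filter is applied (per axis, on the projected points) is an assumption based on the `--z-score-exclude` help text.

```python
import numpy as np


def signatures_per_thousand(signatures, populations):
    """Convert raw counts (states x petitions) to signatures per 1,000 residents."""
    signatures = np.asarray(signatures, dtype=float)
    populations = np.asarray(populations, dtype=float)
    return signatures / populations[:, np.newaxis] * 1000.0


def project_2d(rates):
    """Project each row (state) onto the first two principal components."""
    centered = rates - rates.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered.dot(vt[:2].T)


def exclude_outliers(points, max_z):
    """Return a boolean mask of points to keep (assumed interpretation of -z)."""
    if max_z == 0:
        # A value of 0 means no points are excluded.
        return np.ones(len(points), dtype=bool)
    z = np.abs((points - points.mean(axis=0)) / points.std(axis=0))
    return (z <= max_z).all(axis=1)


# Synthetic stand-in for the real 50 x 3892 matrix (200 petitions here).
rng = np.random.RandomState(0)
populations = rng.randint(500000, 40000000, size=50)
signatures = rng.poisson(20, size=(50, 200))

points = project_2d(signatures_per_thousand(signatures, populations))
keep = exclude_outliers(points, max_z=3.0)
print(points.shape, int(keep.sum()))
```

Normalizing by population before running PCA keeps large states from dominating the projection purely because they have more residents.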

#### Clustering

Clusters the states based on the similarity of their signature engagement, using scikit-learn under the hood.

```
$ wethepeopletoolkit cluster --help
Usage: wethepeopletoolkit cluster [OPTIONS]

  Performs clustering on states based on their similar reactions to
  petitions.

Options:
  -n, --number-of-clusters INTEGER RANGE
                                  The number of clusters to generate. Must be
                                  between 2 and 50.
  -m, --model-type [kmeans|spectral]
                                  The type of clustering model to use. Valid
                                  values:
                                  kmeans: K-means clustering,
                                  spectral: spectral clustering
  --pca                           Performs PCA (dimensionality reduction) to
                                  reduce the data to two dimensions before
                                  clustering.
  -z, --z-score-exclude FLOAT     Don't show points with a z-score higher than
                                  this value. For example, -z 3.0 would
                                  exclude points more than 3 standard
                                  deviations from the mean. If the value is 0,
                                  no points are excluded. This can only be
                                  used in conjunction with K-means clustering.
  --seed INTEGER                  Sets the random seed for clustering.
  --help                          Show this message and exit.
```
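
To give a feel for what the options above correspond to in scikit-learn, here is a rough sketch. The `cluster_states` function and its parameters are hypothetical, the input is synthetic, and the estimator settings (such as the `nearest_neighbors` affinity for spectral clustering) are assumptions, not necessarily what the toolkit uses.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.decomposition import PCA


def cluster_states(rates, n_clusters, model_type="kmeans", use_pca=False, seed=None):
    """Assign each row (state) of `rates` to one of `n_clusters` clusters."""
    if not 2 <= n_clusters <= 50:
        raise ValueError("n_clusters must be between 2 and 50")
    # Optionally reduce to two dimensions first, mirroring the --pca flag.
    data = PCA(n_components=2).fit_transform(rates) if use_pca else rates
    if model_type == "kmeans":
        model = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    elif model_type == "spectral":
        # Affinity choice is an assumption made for this sketch.
        model = SpectralClustering(
            n_clusters=n_clusters, random_state=seed, affinity="nearest_neighbors"
        )
    else:
        raise ValueError("model_type must be 'kmeans' or 'spectral'")
    return model.fit_predict(data)


# Synthetic data: 50 "states" drawn around three well-separated centers.
rng = np.random.RandomState(0)
centers = rng.normal(scale=5.0, size=(3, 20))
rates = np.vstack(
    [c + rng.normal(size=(n, 20)) for c, n in zip(centers, (20, 15, 15))]
)

labels = cluster_states(rates, n_clusters=3, model_type="kmeans", use_pca=True, seed=0)
print(labels)
```

Fixing the seed makes K-means runs repeatable, which matters when cluster assignments are passed on to later commands.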

#### Cluster evaluation

Evaluates the effectiveness of different cluster numbers and plots the results. It can use the Silhouette score, the Calinski-Harabasz score (`calinski_harabaz` on the command line) or inertia (K-means clustering only). The Silhouette score can be used in conjunction with any distance measure supported by `sklearn.metrics.pairwise.pairwise_distances` (specified with the `--distance` option).

```
$ wethepeopletoolkit cluster-evaluation --help
Usage: wethepeopletoolkit cluster-evaluation [OPTIONS]

  Shows comparisons of performance as number of clusters is varied.

Options:
  -r, --range INTEGER RANGE...    The beginning and end of the range of
                                  cluster numbers to test. For example, -r 2 5
                                  would evaluate four models with 2, 3, 4 and
                                  5 clusters. Both numbers must be between 2
                                  and 50.
  -e, --evaluation-metric [silhouette|calinski_harabaz|inertia]
  --distance [cityblock|cosine|euclidean|l1|l2|manhattan|braycurtis|canberra|chebyshev|correlation|dice|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule]
                                  The type of distance measure to used to
                                  calculate the silhouette score.
  -m, --model-type [kmeans|spectral]
                                  The type of clustering model to use. Valid
                                  values:
                                  kmeans: K-means clustering,
                                  spectral: spectral clustering
  --pca                           Performs PCA (dimensionality reduction) to
                                  reduce the data to two dimensions before
                                  clustering.
  -z, --z-score-exclude FLOAT     Don't show points with a z-score higher than
                                  this value. For example, -z 3.0 would
                                  exclude points more than 3 standard
                                  deviations from the mean. If the value is 0,
                                  no points are excluded. This can only be
                                  used in conjunction with K-means clustering.
  --seed INTEGER                  Sets the random seed for clustering.
  --help                          Show this message and exit.
```
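
The kind of comparison this command produces can be sketched with scikit-learn directly. The example below is illustrative only: `evaluate_cluster_counts` is a hypothetical helper, the data is synthetic, and it covers just the silhouette and inertia metrics for K-means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def evaluate_cluster_counts(data, first, last, distance="euclidean", seed=None):
    """Fit K-means for each cluster count in [first, last] and score the result."""
    scores = {}
    for k in range(first, last + 1):
        model = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(data)
        scores[k] = {
            # Higher is better; `distance` is passed through as the metric.
            "silhouette": silhouette_score(data, model.labels_, metric=distance),
            # Within-cluster sum of squares; lower is better (K-means only).
            "inertia": model.inertia_,
        }
    return scores


# Synthetic data with three well-separated groups of "states".
rng = np.random.RandomState(0)
centers = rng.normal(scale=5.0, size=(3, 20))
data = np.vstack(
    [c + rng.normal(size=(n, 20)) for c, n in zip(centers, (20, 15, 15))]
)

scores = evaluate_cluster_counts(data, 2, 5, distance="euclidean", seed=0)
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
for k in sorted(scores):
    print(k, round(scores[k]["silhouette"], 3), round(scores[k]["inertia"], 1))
print("best by silhouette:", best_k)
```

Inertia generally keeps falling as clusters are added, so it is usually read by looking for an "elbow", whereas the silhouette score can simply be maximized.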

#### Top petitions

Displays the top petitions for a given cluster ID (provided by the `cluster` command). Useful for understanding the most important issues for a given cluster.
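
Cluster IDs are Base58 encoded, so it may help to see how a Base58 string maps to an integer. The sketch below assumes the commonly used Bitcoin Base58 alphabet and assumes the ID encodes a single non-negative integer; how the toolkit derives that integer from a set of states is not documented here, so that step is left out.

```python
# Bitcoin-style Base58 alphabet (assumed): no 0, O, I or l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode_int(number):
    """Encode a non-negative integer as a Base58 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 58)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def b58decode_int(text):
    """Decode a Base58 string back to an integer."""
    number = 0
    for char in text:
        # str.index raises ValueError for characters outside the alphabet.
        number = number * 58 + ALPHABET.index(char)
    return number


print(b58decode_int("27"))
print(b58encode_int(b58decode_int("8y7nF3D1")))
```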

```
$ wethepeopletoolkit top-petitions --help
Usage: wethepeopletoolkit top-petitions [OPTIONS] CLUSTER_ID

  Displays the top N most signed petitions for a given cluster. Defaults to
  the top 10. CLUSTER_ID is the Base58 encoded cluster ID (as provided by
  the 'cluster' command).

Options:
  -n, --top-n INTEGER RANGE  Dictates what number of the top petitions (by
                             number of signatures) are displayed.
  --no-truncation            Always show the entire petition titles.
  -b, --show-body            Additionally show the body of the petitions.
  --help                     Show this message and exit.
```
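
Conceptually, ranking petitions for a cluster means summing signatures across the cluster's states and taking the largest totals. The following sketch shows that idea in plain Python; the `top_petitions` function, the data layout and the petition titles are made up for illustration, and the toolkit itself presumably queries its Hive tables instead.

```python
from collections import Counter


def top_petitions(signatures_by_state, cluster_states, top_n=10):
    """Return the `top_n` (petition, total signatures) pairs for a cluster."""
    totals = Counter()
    for state in cluster_states:
        # Counter.update adds counts rather than replacing them.
        totals.update(signatures_by_state.get(state, {}))
    return totals.most_common(top_n)


# Made-up signature counts, keyed by state and then by petition title.
signatures_by_state = {
    "WA": {"Petition A": 120, "Petition B": 30},
    "OR": {"Petition A": 40, "Petition B": 95, "Petition C": 10},
    "UT": {"Petition C": 300},
}

for title, total in top_petitions(signatures_by_state, ["WA", "OR"], top_n=2):
    print(title, total)
```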

#### Topic extraction

The topic extractor takes one or more cluster IDs, selects the top N most signed petitions for each cluster (500 by default) and extracts the most important topics present in that corpus, using either latent Dirichlet allocation (LDA) or non-negative matrix factorization (NMF). Useful for comparing and contrasting the key themes in the most signed petitions for each cluster.

```
$ wethepeopletoolkit topic-extraction --help
Usage: wethepeopletoolkit topic-extraction [OPTIONS] [CLUSTER_IDS]...

  Performs topic extraction on the top N most signed petitions for given
  cluster(s). Uses the top 500 petitions by default, and constructs 10
  topics of 10 words. Extraction can be performed with latent Dirichlet
  allocation (LDA) or non-negative matrix factorization (NMF). CLUSTER_IDS
  are the Base58 encoded cluster IDs (as provided by the 'cluster' command)
  that you want to display/compare.

Options:
  -m, --extraction-method [lda|nmf]
                                  The type of topic extraction model to use.
                                  Valid values:
                                  lda: latent Dirichlet
                                  allocation, nmf: non-negative matrix
                                  factorization
  -P, --petition-sample-size INTEGER RANGE
                                  Dictates what number of the top petitions
                                  (by number of signatures) are used as the
                                  data for topic extraction (default 500).
  -n, --number-of-topics INTEGER RANGE
                                  How many topics to extract (1 - 100, default
                                  10).
  -w, --words-per-topic INTEGER RANGE
                                  How many words should be in each topic (1 -
                                  100, default 10).
  --help                          Show this message and exit.
```
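
For readers unfamiliar with LDA and NMF, the sketch below shows one common way to extract topics with scikit-learn. It is an illustration under assumptions, not the toolkit's code: `extract_topics` is a hypothetical helper, the vectorizer settings are guesses, and the petition titles are invented.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def extract_topics(documents, method="nmf", n_topics=10, words_per_topic=10, seed=0):
    """Return a list of topics, each a list of its highest-weighted words."""
    if method == "nmf":
        # NMF is commonly paired with TF-IDF features.
        vectorizer = TfidfVectorizer(stop_words="english")
        model = NMF(n_components=n_topics, random_state=seed)
    elif method == "lda":
        # LDA models raw term counts.
        vectorizer = CountVectorizer(stop_words="english")
        model = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    else:
        raise ValueError("method must be 'lda' or 'nmf'")
    matrix = vectorizer.fit_transform(documents)
    model.fit(matrix)
    # vocabulary_ maps each term to its column index; sort terms by that index.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return [
        [terms[i] for i in weights.argsort()[::-1][:words_per_topic]]
        for weights in model.components_
    ]


# Invented petition titles standing in for a cluster's most signed petitions.
documents = [
    "legalize hemp farming and hemp products",
    "allow industrial hemp farming nationwide",
    "support hemp farming research",
    "reform student loan interest rates",
    "forgive student loan debt for graduates",
    "lower student loan interest",
]

for topic in extract_topics(documents, method="nmf", n_topics=2, words_per_topic=3):
    print(" ".join(topic))
```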

## Development

To develop the toolkit, first clone this repository:

```
git clone git@github.com:alexpeattie/wethepeopletoolkit.git
cd wethepeopletoolkit
```

If you don't have virtualenv installed, install it:

```
pip install virtualenv
```

Next create a new virtual environment:

```
virtualenv venv --system-site-packages
. venv/bin/activate
```

Then install the package in editable mode:

```
pip install --editable .
```

## Contributing

Pull requests are very welcome! Please try to follow these simple rules if applicable:

- Fork it (https://github.com/alexpeattie/wethepeopletoolkit/fork)
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request

## License

All code is released under the MIT license. (See License.md)

## Author

Alex Peattie / alexpeattie.com / @alexpeattie