This is a set of utilities for analyzing the American Community Survey's Public Use Microdata Sample files (ACS PUMS), mostly following
Now requires Python 3.6 or higher.
pip install pummeler; it will then be available via
import pummeler and install a
If you prefer, you can also check out the source directory, which should work as long as you put the
pummel directory on your
sys.path (or start Python from the root of the checkout). In that case you should use the
pummel script at the top level of the checkout.
Getting the census data
First, download the data from the Census site. You probably want the "csv_pus.zip" file from whatever distribution you're using. The currently supported options are:
- 2006-10 (2.1 GB); uses 2000 PUMAs.
- 2007-11 (2.1 GB); uses 2000 PUMAs.
- The 2012-14 subset of the 2010-14 file (2.3 GB); this is the subset using 2010 PUMAs. (Pass
- 2015 (595 MB); uses 2010 PUMAs.
- The 2012-15 subset of the 2011-15 file (2.4 GB); this is the subset using 2010 PUMAs. (Pass
- 2012-16 (2.3 GB); uses 2010 PUMAs.
- 2013-17 (2.3 GB); uses 2010 PUMAs.
- 2014-18 (2.1 GB); uses 2010 PUMAs.
It's relatively easy to add support for new versions; see the
VERSIONS dictionary in
Picking regions of analysis
Election results are generally reported by counties; PUMS data are in their own special Public Use Microdata Areas, which are related but not the same. This module ships with regions that merge all overlapping blockgroups / counties, found with the MABLE/Geocorr tool, in Pandas dataframes stored in
Regions are named like
AL_00_01, which means Alabama's region number 01 in the 2000 geography, or
WY_10_02, which is Wyoming's second region in the 2010 geography. There are also "superregions" which merge 2000 and 2010 geographies, named like
Note: Alaskan electoral districts are weird. For now, I just lumped all of Alaska into one region.
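If you need to split region names back into their parts, here is a minimal sketch based only on the naming pattern described above (state abbreviation, two-digit geography year, region number). The `parse_region_name` helper and `Region` tuple are hypothetical and not part of pummeler; superregion names follow a different pattern and are rejected here.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    state: str      # two-letter state abbreviation, e.g. "AL"
    geography: int  # PUMA geography year: 2000 or 2010
    number: int     # region number within the state


# Matches names like "AL_00_01" or "WY_10_02".
_REGION_RE = re.compile(r"^([A-Z]{2})_(00|10)_(\d+)$")


def parse_region_name(name):
    """Split a region name into (state, geography year, region number)."""
    match = _REGION_RE.match(name)
    if match is None:
        raise ValueError(f"not a plain region name: {name!r}")
    state, geo, num = match.groups()
    return Region(state, 2000 + int(geo), int(num))


print(parse_region_name("AL_00_01"))
print(parse_region_name("WY_10_02"))
```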
TODO: Could switch to precinct-level results, which should end up with more regions in the end. 2012 results are available here, including shapefiles if you go into the state-by-state section, so it shouldn't be too much work there. I haven't found national precinct-level results for the 2016 election yet, but maybe somebody's done it.
First, we need to sort the features by region, and collect statistics about them so we can do the featurization later.
pummel sort --version 2006-10 --voters-only -z csv_pus.zip SORT_DIR. (A few extra options are shown if you pass
--help.) This will:
Make a bunch of files in
feats_AL_00_01.h5, which contain basically the original features (except with the
ADJINC adjustment applied to fields that need it to account for inflation) grouped by region. These are stored in HDF5 format with pandas, because it's much faster and takes less disk space than CSVs. (If you only want state-level analysis,
--region-type state will make one file per state;
--region-type puma will split per PUMA instead of the default
Make a file
SORT_DIR/stats.h5 containing means and standard deviations of the real-valued features, counts of the different values for the categorical features, and a random sample of all the features.
This will take a while (15 minutes to 2 hours, depending on machine and what processing you're doing) and produce about 4 GB of temp data (for the 2006-10 files). Luckily you should only need to do it once per ACS file.
--voters-only is simpler if you're directly replicating the Flaxman et al. paper.
--all-people is the default; you can get the same effect as --voters-only in
featurize by adding
--subsets 'AGEP >= 18 & CIT != 5'. (If you want to do multiple subsets, add that to each one as appropriate.)
pummel featurize SORT_DIR. (Again, you have a couple of options shown by
--help.) This will get both linear embeddings (i.e. means) and random Fourier feature embeddings for each region, saving the output in
You can also get features for demographic subsets with e.g.
--subsets 'SEX == 2 & AGEP > 45, SEX == 1 & PINCP < 20000'.
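To see what rows a subset expression picks out, here is a small sketch using `DataFrame.query` on a made-up table. It assumes the expressions behave like pandas query strings and that subsets are separated by commas, as in the example above; whether pummeler evaluates them exactly this way is an assumption. The data is invented, using the PUMS columns AGEP (age), SEX, CIT (citizenship; 5 means not a citizen) and PINCP (personal income).

```python
import pandas as pd

# Invented rows for illustration only; real PUMS files have many more columns.
people = pd.DataFrame(
    {
        "AGEP": [17, 30, 52, 64],
        "SEX": [1, 2, 2, 1],
        "CIT": [1, 5, 1, 4],
        "PINCP": [0, 40000, 15000, 18000],
    }
)

# The voters-only filter: adults who are citizens.
voters = people.query("AGEP >= 18 & CIT != 5")

# Several subsets in one string, split on commas (assumed format).
subsets = "SEX == 2 & AGEP > 45, SEX == 1 & PINCP < 20000"
selected = {
    expr.strip(): people.query(expr.strip()) for expr in subsets.split(",")
}

print(voters.index.tolist())
for expr, rows in selected.items():
    print(expr, "->", rows.index.tolist())
```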
NOTE: As it turns out, with this featurization, linear embeddings seem to be comparable to random Fourier feature embeddings. You can save yourself a bunch of time and the world a smidgen of global warming if you skip them with
On my laptop (with a quad-core Haswell i7), doing it with random Fourier features takes about an hour; the only-linear version takes about ten minutes. Make sure you're using a numpy linked to a fast multithreaded BLAS (like MKL or OpenBLAS; the easiest way to do this is to use the Anaconda Python distribution, which includes MKL by default); otherwise, this step will be much slower.
If it's using too much memory, decrease
The original paper used Fastfood transforms instead of the default random Fourier features used here, which with a good implementation will be faster. I'm not currently aware of a high-quality, easily-available Python-friendly implementation. A GPU implementation of regular random Fourier features could also help.
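For intuition about what the two kinds of embeddings are, here is a rough NumPy sketch of a linear (mean) embedding and a random Fourier feature mean embedding for one region. It is not pummeler's implementation: the `rff_mean_embedding` helper, the Gaussian frequencies scaled by `1 / bandwidth`, the `1 / sqrt(n_freq)` normalization and the random stand-in data are all assumptions made for illustration.

```python
import numpy as np


def rff_mean_embedding(X, freqs, weights=None):
    """Average random Fourier features over the rows (people) of X.

    X: (n_people, n_feats) features; freqs: (n_feats, n_freq) frequencies.
    Returns a vector of length 2 * n_freq.
    """
    proj = X @ freqs  # (n_people, n_freq)
    feats = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(freqs.shape[1])
    # Optional per-person weights (e.g. survey weights) go into the average.
    return np.average(feats, axis=0, weights=weights)


rng = np.random.default_rng(0)
n_feats, n_freq, bandwidth = 5, 64, 2.0

# Frequencies for a Gaussian kernel with the given bandwidth (assumed form).
freqs = rng.normal(scale=1 / bandwidth, size=(n_feats, n_freq))

# Stand-in for one region's standardized features.
X = rng.normal(size=(100, n_feats))

emb_lin = X.mean(axis=0)                # linear embedding: feature means
emb_rff = rff_mean_embedding(X, freqs)  # random Fourier feature embedding

print(emb_lin.shape, emb_rff.shape)
```

The matrix product `X @ freqs` dominates the cost, which is why a fast multithreaded BLAS matters for this step.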
SORT_DIR/embeddings.npz, which you can load with
np.load, will then have:
n_regions x n_feats array of feature means.
n_regions x (2 * n_freq) array of random Fourier feature embeddings.
region_names: the names corresponding to the first axis of the embeddings.
feature_names: the names for each used feature.
n_feats x n_freq array of random frequencies for the random Fourier features.
bandwidth: the bandwidth used for selecting the
(If you did
bandwidth won't be present.)
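A quick way to check what ended up in the file is to list each stored array and its shape. The sketch below uses only `np.load` and the `.files` attribute of the loaded archive; the `describe_npz` helper is hypothetical, and the file written here is a tiny stand-in containing just the `region_names` and `feature_names` keys so the example runs on its own. Point it at your real `SORT_DIR/embeddings.npz` instead.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Return {array name: shape} for every array stored in an .npz file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}


# Write a small stand-in file so this runs without a real SORT_DIR.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.npz")
    np.savez(
        path,
        region_names=np.array(["AL_00_01", "WY_10_02"]),
        feature_names=np.array(["AGEP", "SEX", "PINCP"]),
    )
    shapes = describe_npz(path)

print(shapes)
```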
Getting the election data
There doesn't seem to be a good publicly-available county-level election results resource for years prior to 2012. If you get some, follow that notebook to get results in a similar format. (You might have an institutional subscription to CQ Press's election data, for example. That source, though, doesn't use FIPS codes, so it'll be a little more annoying to line up; I might do that at some point.)
TODO: add 2016 election data.
For a basic replication of the model from the paper, see