fedregs

Text Analysis of the US Code of Federal Regulations


License: GPL-3.0


Project Status: Active - The project has reached a stable, usable state and is being actively developed.

The goal of fedregs is to allow for easy exploration and analysis of the Code of Federal Regulations.

Installation

You can install fedregs using:

install.packages("fedregs")
# Or: devtools::install_github("slarge/fedregs")

Example

The Code of Federal Regulations is organized according to a consistent hierarchy: title, chapter, part, subpart, section, and subsection. Each title is (somewhat haphazardly) divided into volumes, and over time a given chapter doesn't consistently fall in the same volume. cfr_text() is the main function in the package: it returns the text for a specified part, including the associated subparts and sections. Behind the scenes, cfr_text() and its helper functions gather the volumes for a given title/year combination and parse the XML to determine the chapters, parts, and subparts associated with each volume. Next, the text is extracted for each subpart. The return_tidytext = TRUE argument returns a tibble with the text in tidytext format. If ngrams are your game, set token = "ngrams" and specify n (see the bigram sketch after the example below).

library(fedregs)
library(dplyr)
library(tidyr)
library(ggplot2)
library(quanteda)

regs <- cfr_text(year = 2017,
                 title_number = 50,
                 chapter = 6,
                 part = 648,
                 #token = "ngrams", # uncomment for ngrams of length 2
                 #n = 2, # uncomment for ngrams of length 2
                 return_tidytext = TRUE,
                 verbose = FALSE)
head(regs)
## # A tibble: 6 x 6
## # Groups:   subpart, year, title_number, chapter, part [6]
##   subpart                      year title_number chapter  part         data
##   <chr>                       <dbl>        <dbl> <chr>   <dbl> <list<df[,4]>
## 1 Subpart A—General Provisio~  2017           50 VI        648 [84,596 x 4]
## 2 Subpart B—Management Measu~  2017           50 VI        648 [10,384 x 4]
## 3 Subpart C—Management Measu~  2017           50 VI        648    [696 x 4]
## 4 Subpart D—Management Measu~  2017           50 VI        648 [25,567 x 4]
## 5 Subpart E—Management Measu~  2017           50 VI        648  [6,994 x 4]
## 6 Subpart F—Management Measu~  2017           50 VI        648 [97,477 x 4]
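
If bigrams are what you're after, the same call works with the commented-out arguments switched on. A minimal sketch (not run here), reusing the arguments from the example above:

# Bigram version of the call above; token = "ngrams" and n = 2 come
# from the commented-out arguments in the example
bigrams <- cfr_text(year = 2017,
                    title_number = 50,
                    chapter = 6,
                    part = 648,
                    token = "ngrams",
                    n = 2,
                    return_tidytext = TRUE,
                    verbose = FALSE)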

Now, we can unnest the tibble and take a peek at what we have to play with.

regs %>%
  unnest(cols = c(data)) %>% head(20) %>% pull(word)
##  [1] "c"          "a"          "this"       "part"       "implements"
##  [6] "the"        "fishery"    "management" "plans"      "fmps"      
## [11] "for"        "the"        "atlantic"   "mackerel"   "squid"     
## [16] "and"        "butterfish" "fisheries"  "atlantic"   "mackerel"

Not entirely unexpected, but there are quite a few common words that don't mean anything on their own. These "stop words" typically don't carry meaningful information and are routinely filtered out of search queries.

head(stopwords("english"))
## [1] "i"      "me"     "my"     "myself" "we"     "our"

There are some other messes like punctuation, numbers, ordinals (e.g., 15th), Roman numerals, web addresses, and stray single letters (probably from indexed lists) that can be removed with some simple regex-ing. We can also convert the raw words to word stems to further aggregate our data.
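
Stemming collapses inflected forms onto a common root. A quick illustration with quanteda (the output noted in the comment is approximate):

# tokens_wordstem() maps inflected forms to a common stem,
# e.g., "fisheries" and "fishery" both become "fisheri"
as.character(tokens_wordstem(tokens(c("fisheries", "fishery", "management"))))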

stop_words <- tibble(word = stopwords("english"))

clean_words <- regs %>%
  unnest(cols = c(data)) %>%
  mutate(word = gsub("[[:punct:]]", "", word),       # remove any remaining punctuation
         word = gsub("^[[:digit:]]*", "", word)) %>% # strip leading digits (e.g., 1st, 1881a, 15th)
  anti_join(stop_words, by = "word") %>%             # remove "stop words"
  filter(is.na(as.numeric(word)),                    # drop purely numeric tokens
         !grepl("^m{0,4}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
                word),                               # adios Roman numerals
         !grepl("\\b[a-z]{1}\\b", word),             # get rid of one-letter words
         !grepl("www\\.", word)) %>%                 # get rid of web addresses
  mutate(word = tokens(word),                        # tokenize, then stem each word
         word = as.character(tokens_wordstem(word)))
## Warning in ~is.na(as.numeric(word)): NAs introduced by coercion
head(clean_words)
## # A tibble: 6 x 9
## # Groups:   subpart, year, title_number, chapter, part [1]
##   subpart  year title_number chapter  part SECTION_NAME SECTION_NUMBER
##   <chr>   <dbl>        <dbl> <chr>   <dbl> <chr>        <chr>         
## 1 Subpar~  2017           50 VI        648 Purpose and~ § 648.1
## 2 Subpar~  2017           50 VI        648 Purpose and~ § 648.1
## 3 Subpar~  2017           50 VI        648 Purpose and~ § 648.1
## 4 Subpar~  2017           50 VI        648 Purpose and~ § 648.1
## 5 Subpar~  2017           50 VI        648 Purpose and~ § 648.1
## 6 Subpar~  2017           50 VI        648 Purpose and~ § 648.1
## # ... with 2 more variables: values <chr>, word <chr>
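
The coercion warnings above are harmless, but if you'd rather avoid them, a regex test can stand in for the as.numeric() check. A sketch of the same filter, assuming a purely numeric token is one made up of digits and decimal points:

# Same intent as filter(is.na(as.numeric(word))), but without coercing
# to numeric, so no "NAs introduced by coercion" warnings
regs %>%
  unnest(cols = c(data)) %>%
  filter(!grepl("^[0-9.]+$", word))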

Now we can count the words and plot the most frequent ones.

count_words <- clean_words %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(-n) %>%
  top_n(n = 50, wt = n) %>%
  mutate(word = reorder(word, n))

ggplot(count_words, aes(word, n)) +
  geom_col() +
  labs(x = NULL,
       title = "Code of Federal Regulations",
       subtitle = "Title 50, Chapter VI, Part 648",
       caption = sprintf("Data accessed on %s from:\n https://www.gpo.gov/fdsys/browse/collectionCfr.action?collectionCode=CFR",
                         format(Sys.Date(), "%d %B %Y"))) +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.direction = "horizontal",
        legend.position = "bottom",
        text = element_text(size = 8))