fara_principals

A web scraper designed to collect Foreign Principal information from fara.gov


Keywords
foreign, principals, fara, gov, scraper, scrapy, python
License
Other
Install
pip install fara_principals==0.0.7

Documentation

About Fara Principals

This project is a web scraper that collects Foreign Principals from the website fara.gov. Most of the tricks the site uses to evade scrapers have been discovered and are abstracted in the core package. Work on the core is usually done on the core branch. There is also a scraper branch, where work on the scraper is done.

Note:

For a better overview of the code, please read the ARCHITECTURE_REDAME file, which explains some of the components of this project.

Installation

It is considered very good practice to set up a virtual environment before installing apps like this, so please do that. This project targets Python 2, so please make sure at least Python 2.7 is installed on your system. You can download or clone this repo to install the scraper. At your terminal, enter the following commands. You can literally copy and paste these commands into your terminal to get this working.

git clone https://github.com/tandalf/fara_principals.git
cd fara_principals
python setup.py install
pip install -r requirements.txt

This installs the scraper and its dependencies.

Scraping Active Principals

After the installation has completed successfully, you can start collecting principal data by running the command below from the project's directory.

scrapy crawl active_principals -o outputfile.json

where outputfile.json is the path to the file in which the principal JSON lines will be stored.
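Once the crawl has finished, the output file can be loaded back into Python for inspection. The sketch below is only an illustration and is not part of the package: load_principals is a hypothetical helper name, and because the exact output layout depends on how the feed export is configured, it accepts either a single JSON array or one JSON object per line. It is written to run on both Python 2.7 and Python 3.

```python
import io
import json


def load_principals(path):
    """Return a list of scraped items from a JSON array or JSON-lines file.

    Hypothetical helper for illustration; not provided by fara_principals.
    """
    with io.open(path, encoding="utf-8") as handle:
        text = handle.read()
    try:
        # Works when the whole file is one JSON document (for example an array).
        data = json.loads(text)
    except ValueError:
        # Otherwise assume one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A file holding a single object is returned as a one-item list.
    return data if isinstance(data, list) else [data]
```

Calling load_principals("outputfile.json") would then return the scraped principals as a list of dictionaries, whichever of the two layouts the file uses.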

Running tests

Good test coverage is encouraged for this code base. To run the tests and measure coverage for the core components, enter the following commands from the base directory:

coverage run --source=fara_principals.core -m unittest discover tests/unit
coverage report -m

Requested JSON file

The requested JSON file for the project is located at the project root and is named principals.json.

Warning

Although the Active Principals page claims to have up to 510 active principals, it actually contains 495 UNIQUE principals, and about 520 in total if duplicate requests are permitted. I do not think this problem comes from my code (it is probably a logic error in their listing), but I'm still very open to scrutiny of my project to see what I might be missing.
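For anyone who wants to check these counts against their own crawl, the sketch below shows one possible way to compare total and unique records. It is an assumption-laden illustration, not the method used to obtain the figures above: count_unique is a hypothetical helper, and it treats two records as duplicates only when every field matches, since this README does not say which field identifies a principal.

```python
import json


def count_unique(records):
    """Return (total, unique) for a list of scraped records.

    Hypothetical helper: records count as duplicates only if all fields match.
    """
    seen = set()
    for record in records:
        # Serialise with sorted keys so key order does not affect the comparison.
        seen.add(json.dumps(record, sort_keys=True))
    return len(records), len(seen)
```

Given the list of dictionaries parsed from the crawl output, count_unique(records) returns the pair (total, unique). If the two numbers differ, the output contains repeated records.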

Extra

Due to a tight schedule, I might not find time to elaborate MORE on the project in areas including, but not limited to, testing, documentation, and deployment. However, the basics for all of these have been provided.