wiki-table-scrape

Scrape HTML tables from a Wikipedia page into CSV format.

wikitablescrape can be used as a shell command or imported as a Python package.
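
This README does not describe the package's importable API, so the sketch below avoids it and drives the documented shell command from Python instead. It uses only the `--url` and `--header` flags that appear in the examples in this document. The helper names `build_command`, `parse_csv`, and `scrape_table` are hypothetical and not part of wikitablescrape.

```python
import csv
import io
import subprocess


def build_command(url, header):
    """Build the CLI argument list, using the flags shown in this README."""
    return ["wikitablescrape", f"--url={url}", f"--header={header}"]


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def scrape_table(url, header):
    """Run the CLI and return the table rows.

    Assumes wikitablescrape is installed and on PATH, and that the
    machine has network access. Raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_command(url, header),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_csv(result.stdout)
```

Calling `scrape_table(url, "films by year")` should then give one dict per table row, with every value kept as a string.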

Why?

This tool makes it easy to download any Wikipedia table via CLI in a format ready for text processing.

This is especially useful when combined with a CSV command-line toolkit like xsv, as in the two examples below.

Year Distribution of Costliest Atlantic Hurricanes

wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_costliest_Atlantic_hurricanes' --header='costliest' | xsv select "Season" | xsv stats --median | xsv select field,min,max,median,mean,stddev | xsv table

field   min   max   median  mean                stddev
Season  1965  2018  2002    1999.1228070175441  12.900523823770502
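
If xsv is not available, roughly the same summary can be computed with the Python standard library. This is a minimal sketch, not a reimplementation of xsv: the function name `column_stats` is hypothetical, the sample rows are made up for illustration, and it uses the population standard deviation, which may or may not match the definition xsv uses.

```python
import csv
import io
import statistics


def column_stats(csv_text, column):
    """Summarize one integer column of CSV text, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [int(row[column]) for row in reader if row[column]]
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        # Population standard deviation (an assumption, see the note above).
        "stddev": statistics.pstdev(values),
    }


# Hypothetical sample data, not real scraper output.
sample = '"Name","Season"\n"A","2000"\n"B","2002"\n"C","2004"\n'
print(column_stats(sample, "Season"))
```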

Country / Market Distribution of Best-selling Music Artists

wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_best-selling_music_artists' --header='100 million' | xsv select 'Country / Market' | xsv frequency | xsv table

field             value                         count
Country / Market  United States                 26
Country / Market  United Kingdom                10
Country / Market  United Kingdom United States  1
Country / Market  Australia                     1
Country / Market  Spain                         1
Country / Market  Japan                         1
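
The frequency count above can also be sketched in Python with `collections.Counter`. The function name `column_frequency` and the sample rows are hypothetical. Ties may come out in a different order than in xsv.

```python
import csv
import io
from collections import Counter


def column_frequency(csv_text, column):
    """Count how often each value appears in a column, most common first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader).most_common()


# Hypothetical sample data, not real scraper output.
sample = (
    '"Artist","Country / Market"\n'
    '"A","United States"\n'
    '"B","United Kingdom"\n'
    '"C","United States"\n'
)
for value, count in column_frequency(sample, "Country / Market"):
    print(value, count)
```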

Installation

You can download the package from PyPI or build from source using Python 3.

As a system-level Python package

python3 -m pip install wikitablescrape
wikitablescrape --help

In a virtual environment

python3 -m venv venv
. venv/bin/activate
pip install wikitablescrape
wikitablescrape --help

Build from source

git clone https://github.com/rocheio/wiki-table-scrape
cd ./wiki-table-scrape
python3 -m venv venv
. venv/bin/activate
pip install .
wikitablescrape --help

Sample Commands

Write a single table to stdout

wikitablescrape --url="https://en.wikipedia.org/wiki/List_of_highest-grossing_films" --header="films by year" | tee >(head -1) >(tail -5) >/dev/null

"Year","Title","Worldwide gross","Budget","Reference(s)"
"2015","Star Wars: The Force Awakens","$2,068,223,624","$245,000,000",""
"2016","Captain America: Civil War","$1,153,304,495","$250,000,000",""
"2017","Star Wars: The Last Jedi","$1,332,539,889","$200,000,000",""
"2018","Avengers: Infinity War","$2,048,359,754","$316,000,000–400,000,000",""
"2019","Avengers: Endgame","$2,796,255,086","$356,000,000",""

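Every field in the output above is a quoted string, so numeric columns need converting before any arithmetic. A minimal sketch, assuming the "Worldwide gross" column always holds a single dollar amount as in the rows shown; the helper names are hypothetical. The "Budget" column is left alone because it can hold a range.

```python
import csv
import io

# Two rows copied from the sample output above.
SAMPLE = '''"Year","Title","Worldwide gross","Budget","Reference(s)"
"2018","Avengers: Infinity War","$2,048,359,754","$316,000,000–400,000,000",""
"2019","Avengers: Endgame","$2,796,255,086","$356,000,000",""
'''


def parse_dollars(value):
    """Convert a string like '$2,796,255,086' to an integer."""
    return int(value.replace("$", "").replace(",", ""))


def gross_by_year(csv_text):
    """Map each year to its worldwide gross as an integer."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["Year"]): parse_dollars(row["Worldwide gross"]) for row in reader}


print(gross_by_year(SAMPLE))
```
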
Download all tables on a page into a folder of CSV files

wikitablescrape --url="https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list" --output-folder="/tmp/scrape"

Parsing all tables from 'https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list' into '/tmp/scrape'
Writing table 1 to /tmp/scrape/table_1_top_100_list.csv
Writing table 2 to /tmp/scrape/table_2_countries.csv
Writing table 3 to /tmp/scrape/table_3_cities.csv
Writing table 4 to /tmp/scrape/table_4_buildings_&_structures_&_statues.csv
Writing table 5 to /tmp/scrape/table_5_people.csv
Writing table 6 to /tmp/scrape/table_6_people_singers.csv
Writing table 7 to /tmp/scrape/table_7_people_actors.csv
Writing table 8 to /tmp/scrape/table_8_people_romantic_actors.csv
Writing table 9 to /tmp/scrape/table_9_people_athletes.csv
Writing table 10 to /tmp/scrape/table_10_people_modern_political_leaders.csv
Writing table 11 to /tmp/scrape/table_11_people_pre_modern_people.csv
Writing table 12 to /tmp/scrape/table_12_people_3rd_millennium_people.csv
Writing table 13 to /tmp/scrape/table_13_progression_of_the_most_viewed_millennial_persons_on_wikipedia.csv
Writing table 14 to /tmp/scrape/table_14_music_bands_historical_most_viewed_3rd_millennium_persons.csv
Writing table 15 to /tmp/scrape/table_15_sport_teams_historical_most_viewed_3rd_millennium_persons.csv
Writing table 16 to /tmp/scrape/table_16_films_and_tv_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 17 to /tmp/scrape/table_17_albums_historical_most_viewed_3rd_millennium_persons.csv
Writing table 18 to /tmp/scrape/table_18_books_and_book_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 19 to /tmp/scrape/table_19_books_and_book_series_pre_modern_books_and_texts.csv

head -5 /tmp/scrape/table_3_cities.csv

"Rank","Page","Continent","Views in millions"
"1","New York City","North America","75"
"2","Singapore","Asia","63"
"3","London","Europe","61"
"4","Hong Kong","Asia","50"

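Once the folder is populated, the CSV files can be loaded back in one pass. A minimal sketch using only the standard library; `load_tables` is a hypothetical helper, and the only assumption about the files is that each has a header row, as in the output above.

```python
import csv
from pathlib import Path


def load_tables(folder):
    """Read every CSV file in a folder into a dict keyed by file stem."""
    tables = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # newline="" lets the csv module handle line endings itself.
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.stem] = list(csv.DictReader(handle))
    return tables
```

For the run above, `load_tables("/tmp/scrape")["table_3_cities"]` would hold one dict per city, keyed by "Rank", "Page", "Continent", and "Views in millions".
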
Testing

./scripts/test.sh

# Show coverage data in a browser (open is the macOS launcher; on Linux use xdg-open)
coverage html && open htmlcov/index.html

Sample Articles for Scraping

Contributing

If you would like to contribute to this module, please open an issue or pull request.

More Information

If you'd like to read more about this module, please check out my blog post from the initial release.

wikitablescrape
Release 1.0.0


