wikiscraper

A simple scraper that extracts data from a Wikipedia article, given its URL slug


Keywords
python, web, scraping, wikipedia, slug
License
CC-BY-4.0
Install
pip install wikiscraper==1.1.9

wikiscraper

A simple scraper that extracts data from a Wikipedia article, given its URL slug: title, images, summary, section paragraphs, and sidebar info.

Developed by Alexandre MEYER

This work is licensed under a Creative Commons Attribution 4.0 International License.

Installation

$ pip install wikiscraper

Initialization

Import

import wikiscraper as ws

Main request

# Set the Wikipedia language edition for the query
# (ISO 639-1 code; defaults to "en" for English)
ws.lang("fr")
# Search for the article and get its content by URL slug
# (Example: https://fr.wikipedia.org/wiki/Paris)
result = ws.searchBySlug("Paris")
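
To make the relationship between the language code, the slug, and the article address concrete, here is a minimal sketch that rebuilds the URL from the example above. The helper `build_article_url` is hypothetical and is not part of wikiscraper; it only uses the Python standard library and assumes the usual `https://<lang>.wikipedia.org/wiki/<slug>` pattern.

```python
from urllib.parse import quote


def build_article_url(slug: str, lang: str = "en") -> str:
    """Hypothetical helper (not part of wikiscraper): build an article URL."""
    # Wikipedia slugs use underscores in place of spaces.
    normalized = slug.strip().replace(" ", "_")
    # Percent-encode unsafe characters, keeping common slug punctuation as-is.
    encoded = quote(normalized, safe="_(),:")
    return f"https://{lang}.wikipedia.org/wiki/{encoded}"


print(build_article_url("Paris", lang="fr"))
```

Opening the resulting URL in a browser is a quick way to confirm that a slug exists in the chosen language before querying it.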

Examples

Title H1 & URL

# Get article's title
result.getTitle()
# Get article's URL
result.getURL()

Sidebar

# Get the value for a given label in the sidebar
result.getSideInfo("Gentilé")

Abstract

# Get all paragraphs of the abstract
print(result.getAbstract())
# Get the second paragraph of the abstract
print(result.getAbstract()[1])
# Optional: get only the first x paragraphs (here, x = 2)
print(result.getAbstract(2))

Images

# Get all illustration images
img = result.getImage()
# Get a specific image by its position on the page
print(img[0]) # Main image

Sections

# Get the table of contents
# First-level headings only
print(result.getContentsTable())
# All headings (first and second levels)
print(result.getContentsTable(subcontents=True))
# Get the paragraphs of a specific section by its parent heading titles
# All args are optional: .getSection(h2Title, h3Title, h4Title)
# Example: https://fr.wikipedia.org/wiki/Paris#Politique_et_administration
print(result.getSection('Politique et administration', 'Statut et organisation administrative', 'Historique')[0])
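
The example URL above carries the section as a fragment (`Politique_et_administration`), while `getSection` is called with the readable heading text. The sketch below shows one way to go from the fragment to the heading text. The helper `anchor_to_title` is hypothetical and not part of wikiscraper, and it assumes that the heading text is simply the fragment with underscores turned back into spaces, which may not hold for every heading.

```python
from urllib.parse import unquote, urlparse


def anchor_to_title(anchor: str) -> str:
    """Hypothetical helper (not part of wikiscraper): fragment -> heading text."""
    # Drop a leading '#', decode percent-escapes, then restore the spaces.
    return unquote(anchor.lstrip("#")).replace("_", " ")


url = "https://fr.wikipedia.org/wiki/Paris#Politique_et_administration"
fragment = urlparse(url).fragment
print(anchor_to_title(fragment))
```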

Errors

"Unable to find the requested query: please check the spelling of the slug"

  • Check that the slug is spelled correctly
  • Check that the article exists
  • Check that the language set for the query matches the slug (by default, the search targets English articles)
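
One way to run through the checks above is to copy the article's address from the browser and split it into its language code and slug. The following is a minimal sketch using only the standard library; `split_article_url` is a hypothetical helper, not part of wikiscraper, and it assumes the usual `https://<lang>.wikipedia.org/wiki/<slug>` URL pattern.

```python
from typing import Tuple
from urllib.parse import unquote, urlparse


def split_article_url(url: str) -> Tuple[str, str]:
    """Hypothetical helper (not part of wikiscraper): return (lang, slug)."""
    parsed = urlparse(url)
    host_parts = parsed.netloc.split(".")
    prefix = "/wiki/"
    is_wikipedia = len(host_parts) >= 3 and host_parts[-2:] == ["wikipedia", "org"]
    if not is_wikipedia or not parsed.path.startswith(prefix):
        raise ValueError(f"Not a Wikipedia article URL: {url}")
    lang = host_parts[0]  # e.g. "fr" in fr.wikipedia.org
    slug = unquote(parsed.path[len(prefix):])  # e.g. "Paris"
    return lang, slug


lang, slug = split_article_url("https://fr.wikipedia.org/wiki/Paris")
print(lang, slug)
```

The two values can then be passed to `ws.lang(...)` and `ws.searchBySlug(...)` respectively, so the language and the slug always come from the same address.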

Versions

  • 1.1.0 = Error Handling
  • 1.0.0 = init