Wiki Faces
Figure 1: Joko Widodo's Wikipedia page, which includes an image of his face. The cropped image on the right is downloaded into a directory named "Joko_Widodo".
TLDR
This project downloads images that contain human faces from a wiki, specifically images associated with given Wikipedia categories.
Installation
Pip Installation Procedure:
From PIP:
pip install wikifaces
From Repo:
git clone git@github.com:tford9/Wiki-Faces-Downloader.git
cd Wiki-Faces-Downloader
pip install .
Usage
Command-Line Example
python downloader -i "indonesian engineers" -o ../data/ -d
Package Example
from wikifaces.downloader import WikiFace
wikiface_obj = WikiFace()
wikiface_obj.download(categories=['facebook'], depth=2, output_location='../data/')
The following structure is output:
facebook/
    cached_1_people_pages_d2.pkl
    cached_pages_d2.pkl
    alan_rushbridger/
        Alan_Rusbridger_01.jpg-p0.jpg
        ...
    mark_zuckerberg/
        MarkZuckerbergcrop.jpg-p1.jpg
        ...
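As a rough illustration of how the output could be consumed, the sketch below indexes one category folder into a mapping from person directory to image file names. It assumes that each person has their own subdirectory inside the category folder and that images end in .jpg, .jpeg, or .png; `index_faces` and `IMAGE_SUFFIXES` are hypothetical names and are not part of the wikifaces package.

```python
import tempfile
from pathlib import Path

# Assumed image extensions; adjust if the downloader writes other formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_faces(category_dir):
    """Map each person subdirectory name to a sorted list of its image files."""
    index = {}
    for person_dir in sorted(Path(category_dir).iterdir()):
        if not person_dir.is_dir():
            continue  # skip plain files such as the cached .pkl files
        images = sorted(
            p.name for p in person_dir.iterdir()
            if p.suffix.lower() in IMAGE_SUFFIXES
        )
        if images:
            index[person_dir.name] = images
    return index


# Build a small throwaway tree that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "facebook"
    (root / "mark_zuckerberg").mkdir(parents=True)
    (root / "mark_zuckerberg" / "MarkZuckerbergcrop.jpg-p1.jpg").touch()
    (root / "cached_pages_d2.pkl").touch()
    demo_index = index_faces(root)
    print(demo_index)
```

Against a real download, the same function would be pointed at the category folder inside the chosen output location.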
The process is carried out as follows:
- Given a wiki category, collect n pages that belong to that category and to at least one category with "people" in its title.
- Crawl the categories of those pages and collect y further pages that belong to those categories and to at least one "people" category.
- For each collected page, download its primary image and check whether it shows a human face using lightweight face detection.
- Capture all images on the wiki whose file names contain the page name (for a person's page, the file name contains their name).
- Use the captured name and images to create a dataset for that face.
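Two of the checks described above, the "people" category filter and the file-name match, can be sketched in a few lines. This is only an illustration under assumptions: the helper names (`has_people_category`, `normalize`, `image_matches_page`) and the sample data are hypothetical, and the actual package may match names differently.

```python
import re


def has_people_category(categories):
    """Return True if any category title contains the word 'people'."""
    return any("people" in c.lower() for c in categories)


def normalize(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def image_matches_page(filename, page_title):
    """Return True if every word of the page title appears in the file name."""
    name = normalize(filename)
    return all(normalize(word) in name for word in page_title.split())


# Hypothetical sample data, not taken from a real crawl.
page_title = "Mark Zuckerberg"
page_categories = ["Facebook employees", "American technology people"]
image_files = ["MarkZuckerbergcrop.jpg", "Facebook_logo.svg"]

matches = []
if has_people_category(page_categories):
    matches = [f for f in image_files if image_matches_page(f, page_title)]
print(matches)
```

Normalizing both sides before comparing lets a title such as "Mark Zuckerberg" match file names written with underscores, camel case, or no separators at all.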
TODOs:
- Part of this process currently uses recursive calls to collect all related pages; it may be possible to linearize or parallelize it.
- Only images whose titles contain the person's name and that show exactly one visible face are currently pulled; all other images are ignored. A voting system should be added to identify the most represented face across multiple images.
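One possible way to linearize the recursive crawl mentioned in the first TODO is a breadth-first traversal with an explicit queue. The sketch below runs on in-memory dictionaries rather than a live wiki. The function name `crawl_categories`, its parameters, and the interpretation of depth (1 means the seed category only) are assumptions for illustration, not the package's actual behavior.

```python
from collections import deque


def crawl_categories(seed, pages_in_category, categories_of_page, max_depth):
    """Breadth-first crawl from a seed category; returns the people pages found."""
    seen_categories = {seed}
    people_pages = set()
    queue = deque([(seed, 0)])  # (category, depth of that category)
    while queue:
        category, depth = queue.popleft()
        for page in pages_in_category.get(category, []):
            cats = categories_of_page.get(page, [])
            # Keep only pages that carry at least one "people" category.
            if not any("people" in c.lower() for c in cats):
                continue
            people_pages.add(page)
            if depth + 1 >= max_depth:
                continue  # depth limit reached; do not expand further
            for c in cats:
                if c not in seen_categories:
                    seen_categories.add(c)
                    queue.append((c, depth + 1))
    return people_pages


# Hypothetical toy wiki used only to exercise the function.
pages_in_category = {
    "facebook": ["Mark Zuckerberg", "Facebook Platform"],
    "american technology people": ["Mark Zuckerberg", "Sheryl Sandberg"],
}
categories_of_page = {
    "Mark Zuckerberg": ["facebook", "american technology people"],
    "Facebook Platform": ["facebook"],
    "Sheryl Sandberg": ["american technology people"],
}

shallow = crawl_categories("facebook", pages_in_category, categories_of_page, 1)
deep = crawl_categories("facebook", pages_in_category, categories_of_page, 2)
print(sorted(shallow), sorted(deep))
```

Because the pending categories sit in a plain queue, the work for each category is independent once dequeued, which is also a natural starting point for fetching pages in parallel.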