archiveorg

An Internet Archive search tool.


License
MPL-2.0
Install
pip install archiveorg==0.2.0

Documentation

Archiveorg

This project is not affiliated with the Internet Archive.

The Internet Archive is a non-profit project that provides a public service. If you use this package (or the Internet Archive generally) on an ongoing basis, please consider donating to them.

Installation

Archiveorg requires Python 3.6+.

To install:

pip install archiveorg

Usage

Archiveorg contains a single object, Search.

from archiveorg import Search

Pass in all search parameters when creating the search object. See the Internet Archive's search API for details on which parameters exist:

search = Search(mediatype='image', collection='maps_usgs')

When the search is created, it does an initial check of how many results exist:

>>> search.num_items
10000
>>> search.num_pages
10

NOTE: By default, each page holds 1,000 items; you can change this with the rows parameter. Per the Internet Archive's API, a paginated search returns at most 10,000 items. You can set rows as high as 100,000,000, but in that case the results are not paginated and are not sorted in any way.
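
The page count shown above follows from the item count and the page size. A minimal sketch of that arithmetic, assuming the page count is simply the item count divided by rows and rounded up (expected_pages is a hypothetical helper, not part of Archiveorg, and the package's own calculation may differ):

```python
import math


def expected_pages(num_items: int, rows: int = 1000) -> int:
    """Return how many pages are needed to hold num_items at rows per page."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    # Ceiling division: a partially filled last page still counts as a page.
    return math.ceil(num_items / rows)


print(expected_pages(10000))       # 10,000 items at the default 1,000 per page
print(expected_pages(10000, 300))  # smaller pages mean more requests
```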

Once your search has been created, you can iterate over results:

for result in search:
    print(result['identifier'])

The result will be a dictionary representing the JSON search output.
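
Because each result is a plain dictionary, standard library tools can work on the search directly. The sketch below collects the first few identifiers. first_identifiers is a hypothetical helper, and it accepts any iterable of dictionaries so it can be tried without network access:

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def first_identifiers(results: Iterable[Dict[str, Any]], limit: int) -> List[str]:
    """Collect the 'identifier' field from at most `limit` results."""
    # islice stops early, so a large search is not consumed in full.
    return [result["identifier"] for result in islice(results, limit)]


# Stand-in data shaped like search results (made-up identifiers).
sample = [{"identifier": "item-a"}, {"identifier": "item-b"}, {"identifier": "item-c"}]
print(first_identifiers(sample, 2))
```

With the real package, the Search object itself would presumably be passed in place of sample, for example first_identifiers(search, 5).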

You can also use the explicit iterate method if you want to start from an offset, or if you want to limit results to items that have files of a certain format:

for result in search.iterate(offset=100, format_regex=r'^TIFF$'):
    ...  # only results with .tiff files, starting from the 101st object.
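
The anchors in the pattern matter. Assuming format_regex is applied to each file's format name with Python's re module (this README does not say whether the package uses search or match, or whether matching is case-sensitive), the sketch below shows which names an anchored and an unanchored pattern would accept. The format names are illustrative and are not fetched from the Internet Archive:

```python
import re

# Illustrative format names only.
formats = ["TIFF", "Single Page Processed TIFF", "JPEG", "Metadata"]

anchored = re.compile(r"^TIFF$")  # the whole name must be exactly "TIFF"
unanchored = re.compile(r"TIFF")  # "TIFF" may appear anywhere in the name

exact_matches = [name for name in formats if anchored.search(name)]
loose_matches = [name for name in formats if unanchored.search(name)]

print(exact_matches)
print(loose_matches)
```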

You can download files using the get_files method:

from pathlib import Path

for result in search:
    # Save each result's files in a directory named after its identifier.
    directory = Path(result['identifier'])

    file_list = search.get_files(result, directory)

Each result may have multiple files. You can use the format_regex parameter to filter files based on file format.
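
The download loop and the format filter can be combined in one helper. This is a sketch under assumptions: download_matching is a hypothetical name, and it assumes get_files accepts format_regex as a keyword argument, which this README implies but does not show. The search argument is duck-typed, so any object that iterates over result dictionaries and has a get_files method will work:

```python
from pathlib import Path
from typing import Any, Dict, Optional


def download_matching(
    search: Any, base_dir: Path, format_regex: Optional[str] = None
) -> Dict[str, Any]:
    """Call get_files for every result, using base_dir/<identifier> as the target."""
    downloaded: Dict[str, Any] = {}
    for result in search:
        identifier = result["identifier"]
        directory = base_dir / identifier
        if format_regex is None:
            downloaded[identifier] = search.get_files(result, directory)
        else:
            # Assumption: format_regex is passed to get_files by keyword.
            downloaded[identifier] = search.get_files(
                result, directory, format_regex=format_regex
            )
    return downloaded
```

A call might look like download_matching(search, Path('downloads'), format_regex=r'^TIFF$'), which returns whatever get_files returned for each identifier.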

Random access

You can use the random_item method to return a single random result. You can use the format_regex parameter to ensure the result contains the file type you want.

The method returns None if no result can be found. You can use the max_attempts parameter to adjust how many attempts are made to find a matching result.
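
Since random_item can return None, callers should check the result before indexing into it. A minimal sketch, assuming random_item takes format_regex and max_attempts as keyword arguments and returns the same kind of dictionary that iteration yields. random_identifier is a hypothetical helper, and its default of 10 attempts is chosen here for illustration, not taken from the package:

```python
from typing import Any, Dict, Optional


def random_identifier(
    search: Any, format_regex: Optional[str] = None, max_attempts: int = 10
) -> Optional[str]:
    """Return the identifier of one random result, or None if none was found."""
    kwargs: Dict[str, Any] = {"max_attempts": max_attempts}
    if format_regex is not None:
        kwargs["format_regex"] = format_regex
    result = search.random_item(**kwargs)
    # None signals that no matching result was found within max_attempts.
    if result is None:
        return None
    return result["identifier"]
```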

License

Archiveorg is available under an AGPL License.