gallipy

A python tool to query the French National Library portal gallica.bnf.fr


Licenses
AGPL-3.0/GPL-3.0+
Install
pip install gallipy==0.1b1

Documentation

Gallipy: yet another python wrapper for the Gallica APIs

Gallipy provides a simple access to gallica.bnf.fr Document and IIIF. The Search API is not yet implemented. APIs are wrapped in a single class Resource, which is basically the 'R' in 'Archival Resouce Key'.

Why a new package instead of forking Pyllica or PyGallica? I know, "thou should not reinvent the wheel"...but you can't tell me what to do! Also I wanted to play with pythonic monades from this awesome article by Alexey Karasev!

Example

Retrieve the first issue of the periodical journal Le Journal de Toto for the year 1937, then save this document as a PDF file.

def retrieve_first_issue(issues):
  arkname = issues['issues']['issue'][0]['@ark']
  issue = Resource('ark:/12148/{}'.format(arkname))

  f = issue.content(mode='pdf') # Fetch the content of issue. f : Future[Either[Exception Binary]] 
  # If fetch succeeded, write the binary content to a file.
  bs_to_file = lambda bs : open('lejournaldetoto.pdf','wb').write(bs)
  f.map(lambda x : x.map(bs_to_file))

# Fetch the resource metadata with the service Issues and get the ARK id of the first issue in 1937
my_resource = Resource('https://gallica.bnf.fr/ark:/12148/cb32798952c/date')
issues = my_resource.issues(date=1937)  # issues: Future[Resource]
issues.map(retrieve_first_issue)  # retrieve_first_issue is the callback function

Getting started

Installation

pip install gallipy

Gallipy requires Python 3.5 or higher. If you want to use it with Python 2.7.x, do not hesitate to contribute to this project.

Overview

Document and IIIF are available from instance methods. The constructor of Resource accepts Ark objects (see section "Parsing ARKs") or any valid ARK string of the form [scheme://naming_authority/]ark:/name_assigning_authority_number/name[/qualifier].

Which means you can do things like:

from gallipy import Resource, Ark

# Full ARK string
my_resource = Resource('https://gallica.bnf.fr/ark:/12148/cb32798952c/date')

# ARK ID
my_other_resource = Resource('ark:/12148/btv1b6930733g/f1n200')

# Parse an ARK, build a Resource if the ARK is valid, otherwise return an Exception
ark = Ark.parse('https://gallica.bnf.fr/ark://12148/bpt6k5738219s')
my_third_resource = ark.map(Resource)
if my_third_resource.is_left:
  # Something went wrong
else:
  # Ready to query gallica.bnf.fr!

Synchronous, asynchronous calls and monades

Sync/async calls

Gallipy allows for synchronous or asynchronous queries:

# Asynchronous call
my_resource.issues(date=1937)  # issues: Resource -> Future[Either[Exception Dict]]

# Synchronous call
my_resource.issues_sync(date=1937) # issues_sync: Resource -> Either[Exception Dict]

Pythonic monades

All methods available from Resource return Either objects for synchronous methods and Future objects for asynchronous ones. The classes Either and Future are implementation of the Either and Future Monades used in functional programming. Here, a Monade is just some kind of wrapper that takes some object of type a and put it inside an object of type M[a] where M is a subclass of Monade. The class Monade defines three functions :

  • pure : x -> M[y]: which accepts an object of type x and returns an object of monadic type M[x].
  • map: (x -> y) -> (M[x] -> M[y]): takes some function f: x -> y and 'converts' it to a new function that applies to some M[x] and returns a M[y]. Basically, it applies f to the object wrapped inside a monade object and returns another monade object that wraps the result of f.
  • flat_map: (x -> M[y]) -> (M[x] -> M[y]): Let say there is f: String -> Monade[String]. If you map f on some x: Monade[String] you'll get a y: Monade[Monade[String]]). Not very cool, right? But with flat_map you'll end with a simple y: Monade[String]. For an in-depth explanation, read this and this

Either monade

The Either monade is a very elegant way to deal with Exceptions. It allows to wrap them inside a stable object that won't ever break your code. Then, you decide where to unwrap and deal with the exception. Either objects can be of two types: Right[x] if x is 'valid' (whatever it means) and Left[x] otherwise. In Gallipy Left is only used to wrap exceptions.

Here is an example with the Gallica service Pagination:

r = Resource('ark:/12148/bpt6k5738219s')

# Resource -> Either[Exception Dict]
either = r.pagination_sync()

# ( Dict -> None ) -> (Either[Exception Dict] -> Either[Exception None] )
if either.map(print).is_left:
  raise either.value

Future monade

Futures are kinda similar to Javascript Promises. They let you execute a function asynchronously in a light, elegant way. In Gallipy, all synchronous functions return some x: Future[Either[...]]:

r = Resource('ark:/12148/bpt6kzzzzz5738219s')

def callback(either):
  # Unwrap either and print x: X its content if either: Right[X]
  # Otherwise either holds an exception, so we raise it.
  if e.map(print).is_left:
    raise e.value
  
r.pagination().map(callback)

Document API

See the official documentation for more details.

Issues

Retrieve metadata about a periodical journal. The optional parameter date will return metadata about all the issues that are available for a specific year.

r = Resource('ark:/12148/cb32798952c/date')
# Resource -> Future[Either[Exception Dict]]
r.issues(date=1937)
# Resource -> Either[Exception Dict]
r.issues_sync()

OAIRecord

Retrieve the OAI record of a given document.

r = Resource('ark:/12148/bpt6k5738219s')
# Resource -> Future[Either[Exception Dict]]
r.oairecord()
# Resource -> Either[Exception Dict]
r.oairecord_sync()

Pagination

Get paging informations about a document.

r = Resource('ark:/12148/bpt6k5738219s')
# Resource -> Future[Either[Exception Dict]]
r.oairecord()
# Resource -> Either[Exception Dict]
r.oairecord_sync()

Table of content

Get the ToC of a document, in HTML.

r = Resource('ark:/12148/bpt6k83037p/f143')
# Resource -> Future[Either[Exception HTMLString]]
r.toc()
# Resource -> Either[Exception HTMLString]
r.toc_sync()

Full-text search

Execute search queries on the text of a document.

r = Resource('ark:/12148/btv1b6930733g')
# Resource -> Future[Either[Exception Dict]]
r.fulltextsearch('hugo') # Search for 'hugo' in the whole document
# Resource -> Either[Exception Dict]
r.fulltextsearch_sync(query='hugo',page=10, startResult=1)  # Search for 'hugo' at page 10 and return all results in one 'page'.<

Content retrieve

Retrieve the content of a document. This is how you get the full PDFs.

Optional parameter mode can be 'pdf' or 'texteBrut' ('texteImage' is not supported). Default is 'pdf.'

r = Resource('ark:/12148/btv1b693073')
# Resource -> Future[Either[Exception Binary]]
r.content() # Get all pages from r as a PDF

# Get pages 10 to 20 and save them as an html file.
e = r.content_sync(startPage=10, nPages=10, mode='textBrut')  
e.map(lambda x : open('myresource.html','wb').write(x))

The parameters startPage and nPages have precedence over the resource's ARK qualifier. Wich means that Resource('ark:/12148/btv1b693073/f1n10.textBrut').content(startPage=10, nPages=10, mode=pdf) will return pages 10 to 20 of resource ark:/12148/btv1b693073 in PDF. mode will always be appended to the end of the qualifier.

IIIF API

Document and image metadata

Retrieve metadata from an image or a whole document in JSON.

r = Resource('ark:/12148/btv1b90017179')

r.iiif_info(image='f15')  # Get metadata of page 15.
r.iiif_info_sync()  # Get metadata of the document  'ark:/12148/btv1b90017179'

Image retrieval

Retrieve an image using the IIIF API.

Parameters are detailed in http://api.bnf.fr/api-iiif-de-recuperation-des-images-de-gallica. region is a 4-elements object of any iterable type.

The ARK qualifier has precedence over the parameter image, which means that image will be ignored if the resource's ARK is qualified.

r = Resource('ark:/12148/btv1b90017179')

r.iiif_data(image='f15', imgtype='png')
e = r.iiif_data_sync(image='f15', region=(0, 0, 2400, 3898), imgtype='png')

Usage

Let's retrieve the first image of a document in native resolution

ark = Ark.parse('https://gallica.bnf.fr/ark:/12148/btv1b90017179').value
r = Resource(ark)

metadata = r.iiif_info_sync().value
width = metadata['sequences'][0]['canvases'][0]['width']
height = metadata['sequences'][0]['canvases'][0]['height']

with open(ark.name+'_f1.png','bw') as o:
  o.write(r.iiif_data_sync(image='f1',fileformat='png', region=(0,0,width, height)).value)

Parsing ARKs

Gallipy provides a parser for ARK urls and ARK ids. The parser uses rfc for the optional non-id part of an ARK and Lark for the actual ARK id.

Buitl-in methods __repr__ and __str__ come in handy to handle ARK in a smooth way :

ark = Ark.parse('https://gallica.bnf.fr/ark:/12148/cb32798952c/date') # Parse the ark

ark.map(print)
# > https://gallica.bnf.fr/ark:/12148/cb32798952c/date

ark.map(lambda x : print(repr(x)))
# > {'scheme': 'https', 'authority': 'gallica.bnf.fr', 'naan': '12148', 'name': 'cb32798952c', 'qualifier': 'date'}

ark.map(lambda x : print(x.arkid))
# > ark:/12148/cb32798952c/date

ark.map(lambda x : print(repr(x.arkid)))
# > {'scheme': 'ark', 'authority': None, 'naan': '12148', 'name': 'cb32798952c', 'qualifier': 'date'}

Todo

  • Implement the Search API.
  • Implement the OCR retrieval.
  • Provide an object representation of API response rather than OrderedDict