citextract

CiteXtract - Bringing structure to the papers on ArXiv.


License
MIT
Install
pip install citextract==0.0.4

Documentation

CiteXtract


Getting started

To install CiteXtract, run the following command:

pip install citextract

Extracting references

Once installed, references can be extracted from a text with the RefXtract model:

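from citextract.models.refxtract import RefXtractor

# Instantiate the extractor and load the trained model
refxtractor = RefXtractor().load()
text = """This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
# Calling the extractor on a text returns the references it finds
refs = refxtractor(text)
print(refs)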

It gives the following output:

['[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal.']

Under the hood, a trained neural network predicts where each reference starts and ends, and the references are then cut out of the text at these boundaries.
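
To make the boundary idea concrete, here is a minimal sketch of the second step only. It assumes the network's output has already been turned into a list of (start, end) character offsets. The name `slice_references` and the offset format are hypothetical illustrations, not CiteXtract's internal API.

```python
from typing import List, Tuple


def slice_references(text: str, boundaries: List[Tuple[int, int]]) -> List[str]:
    """Cut reference strings out of `text` given (start, end) character offsets.

    Hypothetical helper: the offset format is an assumption made for
    illustration, not the representation CiteXtract uses internally.
    """
    refs = []
    for start, end in sorted(boundaries):
        # Skip spans that are empty, inverted, or out of range.
        if 0 <= start < end <= len(text):
            refs.append(text[start:end].strip())
    return refs


text = "This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."
start = text.index("[1]")
print(slice_references(text, [(start, len(text))]))
```

Keeping the slicing separate from the prediction means the boundary model can be swapped or retrained without touching the code that produces the final strings.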

Extracting titles

Given an extracted reference, its title can be extracted with the TitleXtract model:

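from citextract.models.titlextract import TitleXtractor

# Instantiate the extractor and load the trained model
titlextractor = TitleXtractor().load()
ref = """[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."""
# Calling the extractor on a single reference returns its title
title = titlextractor(ref)
print(title)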

It gives the following output:

'This is a test title.'

Here, a trained neural network extracts the title from the given reference.
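
The two models can be chained: first extract the references, then extract a title from each one. The sketch below shows that wiring with the extractors passed in as plain callables, so it runs without the trained models. `titles_from_text` and the two `fake_*` stand-ins are hypothetical names introduced for this example, and the stand-ins are toy string rules, not how CiteXtract works.

```python
from typing import Callable, Dict, List


def titles_from_text(
    text: str,
    extract_refs: Callable[[str], List[str]],
    extract_title: Callable[[str], str],
) -> Dict[str, str]:
    """Map each reference found in `text` to the title extracted from it."""
    return {ref: extract_title(ref) for ref in extract_refs(text)}


def fake_extract_refs(text: str) -> List[str]:
    # Toy stand-in: treat every line starting with "[" as a reference.
    return [line for line in text.splitlines() if line.startswith("[")]


def fake_extract_title(ref: str) -> str:
    # Toy stand-in: take the third ". "-separated chunk of the reference.
    return ref.split(". ")[2] + "."


text = "This is a test sentence.\n[1] Jacobs, K. 2019. This is a test title. In Proceedings of Some Journal."
result = titles_from_text(text, fake_extract_refs, fake_extract_title)
print(result)
```

Since both extractors are called like functions in the examples above, passing `RefXtractor().load()` and `TitleXtractor().load()` in place of the stand-ins should work the same way, although this has not been verified here.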

Converting an arXiv PDF to text

A utility is also available that takes the URL of an arXiv PDF and converts the PDF to text:

from citextract.utils.pdf import convert_pdf_url_to_text

pdf_url = 'https://arxiv.org/pdf/some_file.pdf'
text = convert_pdf_url_to_text(pdf_url)
print(text)
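
When only an arXiv identifier is at hand, the PDF URL has to be built first. The following is a minimal sketch, assuming new-style identifiers such as `1706.03762` and the `https://arxiv.org/pdf/<id>.pdf` pattern used in the example above. `arxiv_pdf_url` is a hypothetical helper, not part of CiteXtract.

```python
import re

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optionally with a version suffix.
_ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build a PDF URL from a new-style arXiv identifier (hypothetical helper)."""
    arxiv_id = arxiv_id.strip()
    if not _ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf"


print(arxiv_pdf_url("1706.03762"))
```

The resulting string can then be passed to `convert_pdf_url_to_text` as in the example above.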