ChemistryPaperParser

Parsing HTML chemistry papers from certain publishers into plain text


Keywords
data-mining, natural-language-processing, nlp, parser, chemistry, python
License
MIT
Install
pip install ChemistryPaperParser==0.1.1

Documentation

Chemistry Paper Parser

Convert HTML/XML Chemistry/Material Science articles into plain text.


1. Install

Requirements

The current version of Chemistry Paper Parser is built for Python >= 3.9. Please check requirements.txt for other dependencies.

Install package

Chemistry Paper Parser is hosted on PyPI. You can install it with

pip install ChemistryPaperParser

Once installed, you can import the package as chempp in Python:

from chempp import parse_html, parse_xml

html_article, _ = parse_html(path_to_my_local_html)
xml_article, _ = parse_xml(path_to_my_local_xml)

Supported publishers:

Currently, Chemistry Paper Parser supports the following publishers and file types.

Publisher         Supports HTML   Supports XML
RSC               ✓               ✗
Springer          ✓               ✗
Nature            ✓               ✗
Wiley             ✓               ✗
AIP               ✓               ✗
ACS               ✓               ✓
Elsevier          ✓               ✓
AAAS (Science)    ✓               ✗

In addition, table parsing is not supported for all publishers.

For figures, only captions will be parsed and saved in the current version.

2. Example

The open-access ACS article Toland et al. (2023) is used here as an example to demonstrate the article parsing process. The offline file is provided at ./examples/Toland.et.al.2023.html. For online HTML files, you can either download the HTML files manually and load them as demonstrated below, or use the provided chempp.crawler.load_online_html function (requires external dependencies).

To parse the example article, run the following command in your shell.

PYTHONPATH="." python ./examples/process_articles.py --input_dir ./examples/ --output_dir ./output/ --output_format pt

The --input_dir argument can be either a file path or a directory. If it is a directory, the program will try to read and parse all HTML and XML files in that folder. --output_format defines the output format of the parsed file. pt retains all structural information within the Article class. jsonl saves the file as a Doccano-compatible jsonl file for easy annotation. html saves the file as simplified HTML that displays the annotated sentences and tokens, which is also a good way to inspect the quality of the parsed article.
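The file-or-directory behaviour described above can be sketched as follows. This is a minimal illustration, not the code in ./examples/process_articles.py: the helper names collect_article_files and parse_all are hypothetical, the directory scan is assumed to be non-recursive, and the parsers are passed in as plain callables so the sketch runs without chempp installed.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def collect_article_files(input_path: str, suffixes=(".html", ".xml")) -> List[Path]:
    """Return the article files found at input_path (hypothetical helper).

    input_path may be a single file or a directory. Assumption: a directory
    is scanned one level deep only, and files are matched by extension.
    """
    path = Path(input_path)
    if path.is_file():
        return [path] if path.suffix.lower() in suffixes else []
    return sorted(
        p for p in path.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


def parse_all(
    input_path: str, parsers: Dict[str, Callable[[str], Any]]
) -> List[Tuple[Path, Any]]:
    """Dispatch every collected file to the parser registered for its extension."""
    results = []
    for file_path in collect_article_files(input_path, tuple(parsers)):
        parser = parsers[file_path.suffix.lower()]
        results.append((file_path, parser(str(file_path))))
    return results


# With chempp installed, the parsers could be wired up like this
# (parse_html and parse_xml return a tuple whose first item is the article):
#   from chempp import parse_html, parse_xml
#   parsers = {".html": lambda p: parse_html(p)[0],
#              ".xml": lambda p: parse_xml(p)[0]}
#   articles = parse_all("./examples/", parsers)
```

Keeping the extension-to-parser mapping in a dictionary makes it straightforward to skip a file type or to plug in a stub parser when testing.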

Note that ./examples/process_articles.py is only an incomplete demonstration of the chempp APIs and their usage. The notebook ./examples/example.ipynb demonstrates the structure of the parsed Article object and some possible use cases. You can find more details regarding Chemistry Paper Parser and its application in my blog. I'll provide a more comprehensive API introduction in the future if needed.

3. Known issues

Due to the variety of HTML/XML documents, not all documents can be parsed successfully. Reporting failed cases in the Issue section would help us improve the package.

  • HTML highlighting may fail when multiple entities start at the same position, due to incorrect text span alignment.
  • Section extraction may fail for Elsevier articles when section ids are s[\d]+ instead of sec[\d]+, as mentioned in this issue.
  • Abstract extraction fails for RSC articles due to an updated HTML format, as mentioned in this issue.
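To illustrate the Elsevier section id issue, here is a minimal sketch of a pattern that accepts both id styles. It is not the package's actual implementation or a confirmed fix: the names SECTION_ID_PATTERN and is_section_id are hypothetical, and real Elsevier documents may use further id variants.

```python
import re

# Accepts "sec" + digits as well as "s" + digits (assumed id formats,
# taken from the two patterns named in the known-issues list).
SECTION_ID_PATTERN = re.compile(r"s(?:ec)?\d+")


def is_section_id(element_id: str) -> bool:
    """Return True if the whole id looks like a section id (hypothetical helper)."""
    return SECTION_ID_PATTERN.fullmatch(element_id) is not None


print(is_section_id("sec12"))  # True
print(is_section_id("s3"))     # True
print(is_section_id("abs1"))   # False
```

Using fullmatch rather than match keeps ids such as "s3.1" or "section" from being accepted by accident.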

Citation

Please consider citing the following article if you find our package useful. Although the article does not mention it, Chemistry Paper Parser is part of that project.

@article{toland.2023.accelerated.scheme,
  author = {Toland, Aubrey and Tran, Huan and Chen, Lihua and Li, Yinghao and Zhang, Chao and Gutekunst, Will and Ramprasad, Rampi},
  title = {Accelerated Scheme to Predict Ring-Opening Polymerization Enthalpy: Simulation-Experimental Data Fusion and Multitask Machine Learning},
  journal = {The Journal of Physical Chemistry A},
  volume = {127},
  number = {50},
  pages = {10709-10716},
  year = {2023},
  doi = {10.1021/acs.jpca.3c05870},
  note ={PMID: 38055927},
  URL = {https://doi.org/10.1021/acs.jpca.3c05870},
  eprint = {https://doi.org/10.1021/acs.jpca.3c05870}
}