subhaashita

A package for curating subhaashita-s (quotes) in various languages.


Keywords
quotes, subhAShita, subhasita
License
MIT
Install
pip install subhaashita==0.0.1

Documentation

^Build status Documentation Status PyPI version

doc curation

A package for curating doc file collections. Prominent features:

  • Scrape texts off various sites, such as Wikisource. See example here. (PS: Consider contributing to raw_etexts repo. )
  • OCR some pdf with google drive. Automatically splits into 25 page bits and ocrs them individually. See usage example here, function here.

For users

Installation or upgrade

  • For stable version pip install doc_curation -U -e.[all]
  • For latest code pip install git+https://github.com/sanskrit-coders/doc_curation/@master -U -e.[all]
  • Web.

Usage

Google Drive API wrapper

  • Enable Google Drive API and download service account key file having Google Driver API access. (See details in split_and_ocr_on_drive function documentation (eg. github source).)
from doc_curation.pdf import drive_ocr
pdf_file = '/home/file.pdf'
key_file = '/home/key.json'
drive_ocr.split_and_ocr_on_drive(pdf_path=pdf_file, google_key=key_file, small_pdf_pages=5)

Command line invocation:

# For help and details - 
/usr/bin/python3 -m doc_curation.pdf.drive_ocr --help
/usr/bin/python3 -m doc_curation.pdf.drive_ocr --input_path=/some/Dir/Or/File --google_key=/some/path/service_account_key.json --small_pdf_pages=5

Usage for the google_vision_pdf.py to OCR pdf to txt files.

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
  • Invoke the script passing in the input file. Eg:
python3 google_vision_pdf.py --input-file <input.pdf>
/usr/bin/python3 -m doc_curation.pdf.google_vision_pdf  --input-file <input.pdf>

For contributors

Contact

Have a problem or question? Please head to github.

Packaging

  • ~/.pypirc should have your pypi login credentials.
python setup.py bdist_wheel
twine upload dist/* --skip-existing

Build documentation

  • sphinx html docs can be generated with cd docs; make html

Testing

Run pytest in the root directory.

Auxiliary tools