parsa

A multiformat text parser


Keywords: parsa
License: MIT
Install: pip install parsa==1.1.5

A text parser that doesn't care about your file extensions

Parsa is a textract-based CLI text parser that supports multiple file extensions. It takes any number of input files and writes the text of each one to a .txt file in a directory of your choice, preserving the structure of the original text.
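To make the idea concrete, here is a minimal sketch of the parse-and-save step for a single file. It is not parsa's actual source: `parse_to_txt` and its `extract` parameter are hypothetical names, and the only library call assumed is `textract.process`, which returns the extracted text as bytes.

```python
import os


def parse_to_txt(input_path, output_dir, extract=None):
    """Extract text from `input_path` and write it to a .txt file in `output_dir`.

    Hypothetical helper for illustration only; not part of parsa's API.
    """
    if extract is None:
        # Imported lazily so the sketch can be tried without textract installed.
        import textract
        extract = textract.process  # returns the extracted text as bytes

    text = extract(input_path)

    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)

    # report.pdf -> <output_dir>/report.txt
    stem = os.path.splitext(os.path.basename(input_path))[0]
    output_path = os.path.join(output_dir, stem + ".txt")
    with open(output_path, "wb") as handle:
        handle.write(text)
    return output_path
```

Note that this sketch overwrites an existing `report.txt`, whereas parsa itself avoids overwriting files.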

Key features

  • Extends textract's functionality to work with multiple inputs and to save the output automatically
  • Takes an arbitrary number of inputs of different filetypes, and processes every supported file in the same way
  • Outputs the parsed text from the input files individually to corresponding .txt files, with the option of selecting a custom output path
  • Includes a naming system that never overwrites existing files; when a name is already taken, the new file gets a distinct, simple name instead
  • Supports over 20 of the most common formats (see Supported formats for more)
  • Preserves the structure of document file formats (.docx, .pdf, ...)
  • Supports audio formats (.wav, .mp3, ...) via the speech recognition tools sox, SpeechRecognition and pocketsphinx
  • Supports image formats (.jpg, .png, ...), via the optical character recognition (OCR) tool tesseract-ocr
  • Prompts the user for an input file's extension if it's not explicitly present; this feature can be turned off via --noprompt
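The non-overwriting naming feature listed above can be sketched as follows. The README does not spell out parsa's actual naming scheme, so the `_1`, `_2`, ... suffix used here is an assumption, and `unique_path` is a hypothetical helper name.

```python
import os


def unique_path(path):
    """Return `path` if it is unused, else the first free `stem_N.ext` variant.

    Hypothetical helper; the numeric suffix scheme is an assumption.
    """
    if not os.path.exists(path):
        return path

    stem, ext = os.path.splitext(path)
    counter = 1
    # Try stem_1.ext, stem_2.ext, ... until an unused name is found.
    while os.path.exists("{}_{}{}".format(stem, counter, ext)):
        counter += 1
    return "{}_{}{}".format(stem, counter, ext)
```

With a scheme like this, parsing the same input twice leaves the first output untouched and writes the second one under a new name.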

Supported formats

See this page from textract's documentation for a full list of the supported formats and their linked dependencies.

Installation

System requirements

  • Linux
  • Python 2.7/3.x (any Python 3 version)

Linux

Via pip:

$ pip install parsa

Or, if you prefer, you can install it from source:

# Clone the repository
$ git clone https://github.com/rdimaio/parsa

# Go into the parsa folder
$ cd parsa

# Install parsa
$ python setup.py install

Tests

$ python -m unittest discover tests

Usage

Single input

# Basic usage
$ parsa path/to/input_file
# The output will be saved inside the input file's parent folder.

Multi input

# Basic usage
$ parsa path/to/input_folder
# The output will be saved inside a folder named `parsaoutput` in the input folder.

Optional: custom output folder

# Save the output to a custom folder
$ parsa path/to/input -o path/to/output_folder
# Works with both single and multi input.
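The output-folder rules described so far (the parent folder for a single file, `parsaoutput` inside the input folder for a folder, or whatever is passed via `-o`) can be summarised in a short sketch. `resolve_output_dir` is a hypothetical name used for illustration, not a function taken from parsa's source.

```python
import os


def resolve_output_dir(input_path, output=None):
    """Pick the folder where the .txt files should go.

    Hypothetical helper mirroring the defaults documented in this README.
    """
    if output:
        # An explicit -o/--output value always wins.
        return output
    if os.path.isdir(input_path):
        # Folder input: a 'parsaoutput' folder inside the input folder.
        return os.path.join(input_path, "parsaoutput")
    # File input: the input file's parent folder.
    return os.path.dirname(os.path.abspath(input_path))
```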

Optional: ignore files without an explicit extension

# Skip files that have no extension instead of prompting for one
$ parsa --noprompt path/to/input
# Useful for situations where your input includes log/system files without an extension.
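As a rough illustration of what `--noprompt` implies for a folder input, the sketch below walks a folder recursively and drops files without an extension. `collect_inputs` is a hypothetical name, and the code is an assumption about the behaviour rather than parsa's implementation; in particular, it treats dotfiles such as `.bashrc` as having no extension, because that is what `os.path.splitext` reports.

```python
import os


def collect_inputs(folder, noprompt=False):
    """Recursively list files under `folder`.

    With noprompt=True, files without an extension are skipped.
    Hypothetical helper for illustration only.
    """
    selected = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            has_extension = bool(os.path.splitext(name)[1])
            if noprompt and not has_extension:
                continue  # e.g. 'syslog' or 'README'
            selected.append(os.path.join(root, name))
    return selected
```

Without `noprompt`, the extensionless files stay in the list, which is the point at which parsa would ask the user for their extension.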

Full help message

$ parsa --help
usage: parsa [-h] [--noprompt] [--output [OUTPUT]] input

Textract-based text parser that supports most text file extensions. Parsa can
parse multiple formats at once, writing them to .txt files in the directory of
choice.

positional arguments:
  input                 input file or folder; if a folder is passed as input,
                        parsa will scan every file inside it recursively
                        (scanning subfolders as well)

optional arguments:
  -h, --help            show this help message and exit
  --noprompt, -n        ignore files without an extension and don't prompt the
                        user to input their extension
  --output [OUTPUT], -o [OUTPUT]
                        folder where the output files will be stored. The default folder is:
                        (a) the input file's parent folder, if the input is a file, or
                        (b) a folder named 'parsaoutput' located in the input folder, if the input is a folder.
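For readers who want to script around the CLI, the interface in the help message can be approximated with `argparse`. This is a reconstruction from the help text above, not parsa's own code; `build_parser` is a hypothetical name and the help strings are shortened.

```python
import argparse


def build_parser():
    """Build a parser that accepts the same arguments as the help message above."""
    parser = argparse.ArgumentParser(
        prog="parsa",
        description="Textract-based text parser that writes parsed text to .txt files.",
    )
    parser.add_argument("input", help="input file or folder (folders are scanned recursively)")
    parser.add_argument(
        "--noprompt", "-n",
        action="store_true",
        help="ignore files without an extension instead of prompting for one",
    )
    parser.add_argument(
        "--output", "-o",
        nargs="?",      # shown as [OUTPUT] in the usage line
        default=None,   # None means: fall back to the default output folder
        help="folder where the output files will be stored",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.noprompt, args.output)
```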

Related projects

  • parsa-gui - Graphical version of parsa (WIP)
  • xparsa - Extended parsa, enhanced with statistics about the parsed files (WIP)
  • xparsa-gui - GUI for xparsa (WIP)

Contributing

Pull requests are welcome! If you would like to add, remove, or change a major feature, please open an issue first.

License

This project is licensed under the MIT License - see the LICENSE file for details.