keycollator

Compares text in a file to reference/glossary/key-items/dictionary file.


License: MIT
Install: pip install keycollator==0.0.4


┬┌─┌─┐┬ ┬┌─┐┌─┐┬  ┬  ┌─┐┌┬┐┌─┐┬─┐
├┴┐├┤ └┬┘│  │ ││  │  ├─┤ │ │ │├┬┘
┴ ┴└─┘ ┴ └─┘└─┘┴─┘┴─┘┴ ┴ ┴ └─┘┴└─

Compares text in a file to reference/glossary/key-items/dictionary.

🧱 Built by David Rush, fueled by ☕️ ℹ️ info

https://pypi.org/project/keycollator/0.0.3/


🗂️ Structure

.
│
├── assets
│   └── images
│       └── coverage.svg
│
├── docs
│   ├── cli.md
│   └── index.md
│
├── src
│   ├── __init__.py
│   ├── cli.py
│   ├── keycollator.py
│   ├── test_keycollator.py
│   ├── extractonator.py
│   ├── requirements.txt
│   └── data
│       ├── (placeholder)
│       └── (placeholder)
│
├── tests
│   └── test_keycollator
│       ├── __init__.py
│       └── test_keycollator.py
│
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── make-venv.sh
├── Makefile
├── pyproject.toml
├── README.md
├── README.rst
├── setup.cfg
└── setup.py

🚀 Features

  • Extract text from file to dictionary
  • Extract keys from file to dictionary
  • Find matches of keys in text file
  • Apply fuzzy matching

🧰 Installation

🖥️ Install from PyPI using pip3

📦 https://pypi.org/project/keycollator/

pip3 install keycollator

📄 Documentation

Official documentation can be found here:

https://github.com/davidprush/keycollator/tree/main/docs

💪 Supported File Formats

  • TXT/CSV files (Mac/Linux/Win)
  • Plans to add PDF and JSON

📝 Usage

🖥️ Import keycollator into Python projects

from keycollator import ZTimer, KeyKrawler

🖥️ CLI

Use command-line options to override keycollator's default parameters and behavior:

python3 src/keycollator.py --help                         
Usage: keycollator.py [OPTIONS] COMMAND [ARGS]...

  keycollator is an app that finds occurances of keys in a text file

Options:
  -t, --text-file PATH            Path/file name of the text to be searched
                                  for against items in the key file
  -k, --key-file PATH             Path/file name of the key file containing a
                                  dictionary, key items, glossary, or
                                  reference list used to search the text file
  -O, --output-file PATH          Path/file name of the output file that
                                  will contain the results (CSV or TXT)
  -R, --limit-results INTEGER     Limit the number of results
  -f, --fuzzy-matching INTEGER RANGE
                                  Set the level of fuzzy matching (default=99)
                                  to validate matches using
                                  approximations/edit distances, uses
                                  acceptance ratios with integer values from 0
                                  to 99, where 99 is nearly identical and 0 is
                                  not similar  [0<=x<=99]
  -U, --ubound-limit INTEGER RANGE
                                  Ignores items from the results with matches
                                  greater than the upper boundary (upper-
                                  limit); reduce eroneous matches
                                  [1<=x<=99999]
  -L, --lbound-limit INTEGER RANGE
                                  Ignores items from the results with matches
                                  less than the lower boundary (lower-limit);
                                  reduce eroneous matches  [0<=x<=99999]
  -v, --set-verbose               Turn on verbose
  -l, --set-logging               Turn on logging
  -Z, --log-file PATH             Path/file name to be used for the log file
  --help                          Show this message and exit.

🖥️ Turn on verbose output

Verbose output currently has only one level; future versions will implement multiple levels (DEBUG, INFO, WARN, etc.).

keycollator --set-verbose

🖥️ Apply fuzzy matching

Fuzzy matching accepts approximate matches based on edit distance. The level is an acceptance ratio from 0 to 99: 0 is the least strict and accepts nearly anything as a match, while 99 is the strictest and accepts only nearly identical matches. By default, the app applies level 99, and only when regular matching finds no matches.

keycollator --fuzzy-matching=[0-99]
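
As a rough illustration of how a 0 to 99 acceptance ratio can gate matches, the sketch below uses Python's standard-library difflib. This is an assumption for illustration only: the source does not say which fuzzy-matching library keycollator uses, and the names similarity and is_match are hypothetical, not part of keycollator's API.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score for two strings (case-insensitive)."""
    # SequenceMatcher.ratio() returns a float in [0.0, 1.0].
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def is_match(key: str, item: str, fuzzy_level: int = 99) -> bool:
    """Try a regular (substring) match first, then fall back to fuzzy matching.

    fuzzy_level mirrors the 0-99 range of --fuzzy-matching; the fallback
    order is an assumption based on the description above.
    """
    if key.lower() in item.lower():
        return True
    return similarity(key, item) >= fuzzy_level


if __name__ == "__main__":
    print(is_match("manage", "Manage projects"))    # regular match
    print(is_match("analysis", "analysys", 80))     # fuzzy match at a lenient level
    print(is_match("analysis", "analysys", 99))     # rejected at the strictest level
```

Lower levels admit more near misses, which is why the upper and lower bound options are useful for pruning noisy results.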

🖥️ Set the key file

Each line of the key file is one key, which is matched against the items in the text file.

keycollator --key-file="/path/to/key/file/keys.txt"

🖥️ Set the text file

Each line of the text file is one item, which is compared with the keys in the key file.

keycollator --text-file="/path/to/text/file/text.txt"
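
The line-per-key and line-per-item layout described above can be sketched as follows. This is a minimal illustration, not keycollator's actual code: read_items and count_key_matches are hypothetical helpers, and lower-casing plus substring matching are assumptions about how a "regular" match might work.

```python
from collections import Counter
from pathlib import Path


def read_items(path):
    """Return the non-empty lines of a file, stripped and lower-cased."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip().lower() for line in lines if line.strip()]


def count_key_matches(keys, text_items):
    """Count, for each key, how many text items contain it as a substring."""
    counts = Counter()
    for key in keys:
        hits = sum(1 for item in text_items if key in item)
        if hits:
            counts[key] = hits
    return counts


if __name__ == "__main__":
    # In-memory stand-ins for read_items("keys.txt") and read_items("text.txt").
    keys = ["manage", "report"]
    text_items = ["manage the team", "weekly report", "manage budgets"]
    print(count_key_matches(keys, text_items).most_common())
    # prints [('manage', 2), ('report', 1)]
```

Every key is compared with every text item, so the number of comparisons grows with the product of the two file sizes.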

🖥️ Specify the output file

Results are currently written as CSV; additional file formats (PDF/JSON/DOCX) are planned for future releases.

keycollator --output-file="/path/to/results/result.csv"
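
For readers who want to produce a similar results file themselves, here is a small sketch using Python's csv module. The column names (No., Key, Count) are borrowed from the console table in the example output; whether keycollator's CSV uses the same columns is an assumption, and results_to_csv is a hypothetical helper.

```python
import csv
import io


def results_to_csv(counts):
    """Render a {key: count} mapping as CSV text, highest count first."""
    rows = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["No.", "Key", "Count"])
    for number, (key, count) in enumerate(rows, start=1):
        writer.writerow([number, key, count])
    return buffer.getvalue()


if __name__ == "__main__":
    print(results_to_csv({"develop": 62, "manage": 73}), end="")
```

The returned text can then be written to whatever path was passed as the output file.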

🖥️ Limit results for console and output file

Limits the number of results shown in the console and written to the output file.

keycollator --limit-results=30

🖥️ Set upper bound limit

Rejects items whose match count exceeds the given integer value; this helps reduce erroneous matches when using fuzzy matching.

keycollator --ubound-limit=[1-99999]
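
The combined effect of the result limit and the upper and lower bounds can be sketched as a simple filter over match counts. This is an illustrative assumption about the behavior described in the help text, not keycollator's implementation; filter_results and its default values are hypothetical.

```python
def filter_results(counts, lbound=0, ubound=99999, limit=None):
    """Keep keys whose count lies within [lbound, ubound], sorted by count.

    lbound and ubound mirror --lbound-limit and --ubound-limit; limit mirrors
    --limit-results. Defaults here are placeholders, not the tool's defaults.
    """
    kept = [(key, n) for key, n in counts.items() if lbound <= n <= ubound]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if limit is None else kept[:limit]


if __name__ == "__main__":
    counts = {"manage": 73, "develop": 62, "the": 5000, "rare": 1}
    # Drop the suspiciously frequent "the" and the one-off "rare".
    print(filter_results(counts, lbound=2, ubound=100))
    # prints [('manage', 73), ('develop', 62)]
```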

🖥️ Turn on logging

Turns on logging; if no log file is supplied, one is created using the default name log.log.

keycollator --set-logging

🖥️ Create a log file

Sets the name of the log file used by logging.

keycollator --log-file="/path/to/log/file/log.log"

Example Output

python3 src/keycollator.py --set-logging --limit-results=30
✔ Extracted text.txt items.[[0.16]seconds]
✔ Extracted keys.txt items.[[0.25]seconds]
✔ Matched keys.txt items to text.txt items.[[76.45]seconds]
✔ results.csv Complete.[[76.52]seconds]
╭─────┬───────────────┬───────╮
│ No. │ Key           │ Count │
├─────┼───────────────┼───────┤
│  1  │ manage        │  73   │
├─────┼───────────────┼───────┤
│  2  │ develop       │  62   │
├─────┼───────────────┼───────┤
│  3  │ report        │  58   │
├─────┼───────────────┼───────┤
│  4  │ support       │  46   │
├─────┼───────────────┼───────┤
│  5  │ process       │  43   │
├─────┼───────────────┼───────┤
│  6  │ analysis      │  36   │
├─────┼───────────────┼───────┤
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
├─────┼───────────────┼───────┤
│ 28  │ dashboards    │  11   │
├─────┼───────────────┼───────┤
│ 29  │ sales         │  10   │
├─────┼───────────────┼───────┤
│ 30  │ create        │  10   │
╰─────┴───────────────┴───────╯
╭─────────────┬────────╮
│ Statistic   │ Total  │
├─────────────┼────────┤
│ Keys        │  701   │
├─────────────┼────────┤
│ Text        │  695   │
├─────────────┼────────┤
│ Matches     │  1207  │
├─────────────┼────────┤
│ Comparisons │ 376855 │
├─────────────┼────────┤
│ Logs        │   0    │
├─────────────┼────────┤
│ Runtime     │ 76.60  │
╰─────────────┴────────╯

🎯 Todo 📌

    ❌ Update requirements.txt
    ❌ Add proper error handling
    ❌ Add CHANGELOG.md
    ❌ Add functions/methods to handle STOP_WORDS
    ❌ Verify python3 -m nltk.downloader punkt is properly imported
    ✅ Separate project into multiple files
    ✅ Add progress indicator using halo when extracting and comparing
    ✅ Create a logger class (for some reason logging is broken)
    ✅ Fix broken KeyKrawler matching
    ✅ Update README.md(.rst) with correct CLI
    ❌ Add a method to KeyKrawler to select and create missing files
    ❌ Update CODE_OF_CONDUCT.md
    ❌ Update CONTRIBUTING.md
    ✅ Format KeyKrawler console results as a table
    ❌ Create ZLog class in extractonator.py (parse out __logit method)
    ❌ Clean up verbose output (conflicts with halo)
    ❌ Update all comments
    ❌ Migrate click functionality to cli.py
    ✅ Refactor all methods and functions
    ❌ Test ALL CLI options

👔 Project Resource Acknowledgements

  1. Creating a Python Package
  2. javiertejero

💼 Deployment Features

📈 Releases

Current stage: testing

🛡 License

This project is licensed under the terms of the MIT license. See LICENSE for more details.

@misc{keycollator,
  author = {David Rush},
  title = {Compares text in a file to reference/glossary/key-items/dictionary file.},
  year = {2022},
  publisher = {Rush Solutions, LLC},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/davidprush/keycollator}}
}

Additional Information

  1. The latest version of this document can be found here; if you are viewing it there (via HTTPS), you can download the Markdown/reStructuredText source here.
  2. You can contact the author via e-mail.