German tf-idf module.


Keywords
tf, idf, lemmatize, tokenize, german, nlp, tf-idf
License
MIT
Install
pip install tfidfDE==0.7.1

Documentation

tf-idf

This script implements TF-IDF term relevance scoring as described in the Wikipedia article on tf-idf.
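As a rough illustration of the scoring idea, the sketch below computes one common tf-idf variant (relative term frequency times the natural log of inverse document frequency) on already-tokenized documents. The function name `tf_idf` and the exact weighting are assumptions made for this example; the module's own formula may differ.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    ["der", "hund", "bellt"],
    ["der", "vogel", "singt"],
]
scores = tf_idf(docs)
```

In this variant a term that occurs in every document (here "der") scores zero, while terms unique to one document score highest.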

Its purpose is to generate TF-IDF ratings for a collection of German-language documents. It also tokenizes the input files to extract words, removing punctuation.
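A minimal sketch of what punctuation-stripping tokenization can look like for German text follows. The `tokenize` function and its regular expression are assumptions for illustration, not the module's actual tokenizer; case is preserved here, and hyphenated compounds are split at the hyphen.

```python
import re

# [^\W\d_] matches any Unicode letter in Python 3 string patterns,
# so umlauts and ß stay inside words while punctuation and digits are dropped.
WORD_RE = re.compile(r"[^\W\d_]+")

def tokenize(text):
    """Return the alphabetic words of text in order, without punctuation."""
    return WORD_RE.findall(text)

tokens = tokenize("Der Hund läuft über die Straße, oder?")
```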

This script no longer supports Python 2.7. For an older, compatible but less-maintained version of this tool, check out this branch.

Usage

Build your own script

  • Install and import the tfidfDE module
pip install tfidfDE
import tfidfDE
tfidfDE.analyze(documents, resultsPerDocument, preferNouns, ranking, files, verbose)
Parameter           Type     Description
documents           List     List of documents (texts).
resultsPerDocument  Integer  [optional] Number of highest-rated words to output per document.
preferNouns         Boolean  [optional] If True, the algorithm favours nouns in the generated rankings. Default value: False.
ranking             Boolean  [optional] If False, the script outputs only the words, sorted by rank, without the ranking score for each word. Default value: True.
files               Boolean  [optional] If True, documents contains a list of file paths instead of the document texts themselves, and the output is also written to files. Default value: False.
verbose             Boolean  [optional] If True, console logging is enabled. Default value: False.
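The following sketch shows one way to pass the arguments in the positional order listed above. The wrapper `top_terms` is a hypothetical helper, not part of tfidfDE; it takes the analyze function as a parameter so it can be tried without the package installed. The shape of the return value is not documented here, so the sketch simply passes it through.

```python
def top_terms(analyze, documents, results_per_document=5):
    """Call an analyze-style function with arguments in the documented order."""
    return analyze(
        documents,
        results_per_document,  # resultsPerDocument
        True,                  # preferNouns: favour nouns in the rankings
        True,                  # ranking: include the score for each word
        False,                 # files: documents are texts, not file paths
        False,                 # verbose: no console logging
    )
```

With the package installed, this would presumably be used as `top_terms(tfidfDE.analyze, ["Erster Text ...", "Zweiter Text ..."])` after `import tfidfDE`.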

Use the example script

  • Create a file listing the paths of all your documents (in the example shown: example_data/input.txt)
  • Make sure the file contains the full path of each document, one per line
  • For now, documents must be plain text; HTML, XML, RDF and other formats are not supported
  • Run the example script with your input file as its only argument (or use -h for a full list of options), for example:
python3 example.py example_data/input.txt
  • The script generates one new file per input file, named with the prefix tfidf_, which lists each term with its tf-idf score on a separate line (default behaviour of the files option)
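The input file described in the steps above can be generated rather than written by hand. The sketch below is one possible approach; `write_input_list` and the `*.txt` pattern are assumptions for illustration and are not part of the example script.

```python
from pathlib import Path

def write_input_list(document_dir, list_path, pattern="*.txt"):
    """Write the full path of every matching document to list_path, one per line."""
    list_path = Path(list_path)
    # Collect absolute paths, skipping the list file itself if it matches the pattern.
    paths = sorted(
        p.resolve()
        for p in Path(document_dir).glob(pattern)
        if p.resolve() != list_path.resolve()
    )
    list_path.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return paths
```

For instance, `write_input_list("my_documents", "example_data/input.txt")` would produce a file suitable for passing to example.py, assuming a hypothetical directory my_documents that holds the plain-text documents.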

This script is based on Yasser Elsayed's TF-IDF-Generator module, still in active development and currently in alpha status.

Contributing

If you find a bug, want to propose a feature, or feel the urge to complain about your life, feel free to visit the issues page.