Jason Young's Tools.


License
Apache-2.0
Install
pip install young-tools==0.0.2a7

Documentation

Young Tools

This package contains several useful tools, some of which address problems in Natural Language Processing.

Installation

  1. Through pip
pip install young-tools
  2. Clone the repository and build it locally
git clone https://github.com/Jason-Young-NLP/YoungTools.git
cd YoungTools
python setup.py build develop

Main Framework

Executable Modules

Every executable module is run with the command young-tools-{module_name}.

So far, young-tools provides three executable modules:

  • young-tools-corpus
  • young-tools-levenshtein
  • young-tools-xml

Compilers

Corpus

It is a corpus compiler that is run with young-tools-corpus. The command takes a single argument, -p or --configuration-path, which points to a configuration file containing all the parameters you set. The configuration file is written in a basic configuration language with a structure similar to that of Microsoft Windows INI files.

You must provide a main section in which you configure:

  • pipeline
  • corpus_directory
  • corpora_names
  • languages
  • encodings

Before each run, young-tools-corpus reads the file at configuration-path and parses the main section. young-tools-corpus can process multiple corpora with different settings in a single run. In the main section, the settings for different corpora are separated by the separator |.

pipeline specifies the order in which the sub-corpus-compiler modules run. Module names are separated by the separator &. If another instance of a module needs a different configuration, define a new section whose name carries the suffix _{index}, such as module_name_10. module_name must be the name of one of the sub-corpus-compiler modules.

corpus_directory specifies where the raw and compiled corpora are stored.

Each corpus_directory may contain several corpora (corpora_names), and each corpus may have several languages (languages); the encodings of the compiled files are determined by encodings.
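
The layout described above can be sketched with Python's standard configparser. Only the section name main, the five key names, and the | and & separators come from the text; every value below (paths, corpus names, lowercase module names) is a made-up placeholder, and the function parse_main is hypothetical, not part of young-tools. Because the text does not say how several corpora names or languages are separated within one corpus, each of those fields holds a single value here.

```python
import configparser

# Hypothetical configuration: key names come from the description,
# all values are placeholders chosen for illustration.
EXAMPLE = """
[main]
pipeline = cleaner&tokenizer | cleaner
corpus_directory = /data/news | /data/wiki
corpora_names = train | train
languages = en | zh
encodings = utf-8 | utf-8
"""


def parse_main(text):
    """Read the [main] section and split it into per-corpus settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # One field per corpus: fields are separated by '|'.
    settings = {
        key: [field.strip() for field in value.split("|")]
        for key, value in parser["main"].items()
    }
    # Within one corpus, pipeline module names are separated by '&'.
    settings["pipeline"] = [
        [name.strip() for name in field.split("&")]
        for field in settings["pipeline"]
    ]
    return settings


settings = parse_main(EXAMPLE)
print(settings["pipeline"])
print(settings["corpus_directory"])
```

In this sketch the first corpus runs two modules in order and the second runs only one, which mirrors the idea of handling corpora with different settings in one run.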

young-tools-corpus has 5 sub-corpus-compiler-modules:

  1. Cleaner

    Removes duplicate lines (remove_dumplicate_lines) and lowercases the corpora (lowercase). granularity can be set to sentence or document. When granularity is document, the document index, which marks the start of each document in the corpora, is written to corpora_names+document_index_suffix.

  2. Normalizer

    Normalizes the punctuation of the corpora.

  3. Segmenter

    Segments Chinese sentences using THULAC. If you need POS tagging, set part_of_speech_tagging to true. traditional_to_simplified may be useful in some situations.

  4. Tokenizer

    Tokenizes sentences in different languages. To convert the hyphen character - to @-@, set split_aggressive_hyphen to True.

  5. Subword

    This is a thin wrapper around subword-nmt. learn_file_index and apply_file_indices give the indices, within corpora_names, of the corpora on which BPE is learned and applied, respectively, and subword_indices specifies which languages of the corpora are processed with BPE. symbols_number is the number of merge operations, and joint_learn controls whether BPE is learned jointly over the languages in subword_indices for the corpus at learn_file_index.

Normalizer and Tokenizer are reimplementations of the corresponding mosesdecoder scripts.
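
The hyphen conversion mentioned for the Tokenizer can be illustrated with a short regular expression. This is a minimal sketch of the idea, not the package's code: the function below is hypothetical (it only borrows the option's name), and \w is used as an approximation of "alphanumeric", so it also treats the underscore as a word character.

```python
import re


def split_aggressive_hyphen(sentence):
    """Replace a hyphen between two word characters with ' @-@ '."""
    # The lookahead leaves the following character unconsumed, so chains
    # such as a-b-c are split at every hyphen in a single pass.
    return re.sub(r"(\w)-(?=\w)", r"\1 @-@ ", sentence)


print(split_aggressive_hyphen("a state-of-the-art system"))
```

A hyphen that is not surrounded by word characters on both sides, such as a dash set off by spaces, is left untouched.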

Levenshtein

It can generate the manipulation sequences between corpus hypotheses and references by calculating the Levenshtein distance, and it can synthesize hypotheses for references by extracting rules from aligned hypotheses and references. These functions are run through young-tools-levenshtein with one of the subcommands get-rules, apply-rules, or gen-seqs.
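
To make the notion of a manipulation sequence concrete, here is a minimal sketch of the standard dynamic-programming Levenshtein computation with a backtrace. It illustrates the technique only: the function name and the operation labels (keep, substitute, delete, insert) are assumptions for this example and are not the output format of young-tools.

```python
def edit_operations(hypothesis, reference):
    """Return (distance, operations) that turn hypothesis into reference."""
    rows, cols = len(hypothesis) + 1, len(reference) + 1
    # dist[i][j] is the edit distance between hypothesis[:i] and reference[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # keep or substitute
    # Walk back from the bottom-right corner to recover one optimal sequence.
    operations = []
    i, j = len(hypothesis), len(reference)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and hypothesis[i - 1] == reference[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            operations.append(("keep", hypothesis[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            operations.append(("substitute", hypothesis[i - 1], reference[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            operations.append(("delete", hypothesis[i - 1]))
            i -= 1
        else:
            operations.append(("insert", reference[j - 1]))
            j -= 1
    operations.reverse()
    return dist[-1][-1], operations


distance, operations = edit_operations("the cat sat".split(),
                                       "the cat sat down".split())
print(distance, operations)
```

The function works on any sequences, so it can align tokens, as above, or individual characters.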

XML

young-tools-xml can convert an XML file into a plain-text file, or escape/unescape a file, by specifying the subcommand xml2plain or scape respectively.
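
The two operations can be illustrated with Python's standard library. This is a sketch of the underlying ideas, not of how young-tools-xml is implemented; xml_to_plain is a hypothetical helper, and the sample strings are invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape


def xml_to_plain(xml_text):
    """Concatenate all text nodes of an XML document, dropping the tags."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


raw = "AT&T <rocks>"
escaped = escape(raw)        # replaces &, < and > with XML entities
restored = unescape(escaped)  # reverses the replacement

plain = xml_to_plain("<doc><seg>Hello</seg> <seg>world &amp; all</seg></doc>")
print(escaped)
print(plain)
```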

Metrics

To be done.

Pedestal Modules

Use them by simply importing the pedestal package:

import young_tools.pedestal as pedestal

The usage of each module in the pedestal package is described as follows:

Timer

Timer records the system/process elapsed time.
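
The difference between the two kinds of elapsed time can be shown with the standard time module. This sketch assumes that "system" time means wall-clock time and "process" time means CPU time used by the current process; SimpleTimer is a hypothetical stand-in and does not reflect the interface of the pedestal Timer.

```python
import time


class SimpleTimer:
    """Hypothetical context manager measuring wall-clock and CPU time."""

    def __enter__(self):
        self._wall_start = time.perf_counter()  # wall-clock reference
        self._cpu_start = time.process_time()   # CPU time of this process
        return self

    def __exit__(self, *exc_info):
        self.wall_elapsed = time.perf_counter() - self._wall_start
        self.cpu_elapsed = time.process_time() - self._cpu_start
        return False  # never swallow exceptions


with SimpleTimer() as timer:
    time.sleep(0.05)  # sleeping costs wall-clock time but almost no CPU time

print(timer.wall_elapsed, timer.cpu_elapsed)
```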

Constant

Constant is a class that can store any number of constants.

InstancesChecker

InstancesChecker is a basic decorator that checks whether the parameters passed to a method are legal.
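
A decorator of this kind can be sketched as follows. The name check_instances, its keyword-argument interface, and the choice of isinstance as the legality test are all assumptions made for illustration; the actual InstancesChecker may work differently.

```python
import functools
import inspect


def check_instances(**expected_types):
    """Hypothetical decorator: check named arguments with isinstance."""
    def decorator(function):
        signature = inspect.signature(function)

        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            # Map positional and keyword arguments to parameter names.
            bound = signature.bind(*args, **kwargs)
            for name, value in bound.arguments.items():
                expected = expected_types.get(name)
                if expected is not None and not isinstance(value, expected):
                    raise TypeError(
                        f"{name} must be an instance of {expected}, "
                        f"got {type(value)}"
                    )
            return function(*args, **kwargs)
        return wrapper
    return decorator


@check_instances(text=str, times=int)
def repeat(text, times=2):
    return text * times


print(repeat("ab", 3))
```

Calling repeat(1, 2) in this sketch raises a TypeError before the function body runs.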

ANSIFormatter

ANSIFormatter builds ANSI color strings. Use this class to format strings for terminal output.
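
For readers unfamiliar with ANSI escape sequences, the following sketch shows how such formatting works in general. The function ansi_format and the CODES table are hypothetical and say nothing about the methods ANSIFormatter actually exposes; the numeric codes themselves are the standard SGR codes.

```python
RESET = "\033[0m"  # SGR code 0 resets all attributes
CODES = {"bold": "1", "red": "31", "green": "32", "yellow": "33"}


def ansi_format(text, *styles):
    """Wrap text in the ANSI escape sequence for the given style names."""
    if not styles:
        return text
    prefix = "\033[" + ";".join(CODES[style] for style in styles) + "m"
    return prefix + text + RESET


print(ansi_format("done", "bold", "green"))
```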

Logger

Logger records the log messages of the process and sends them to a log file or the terminal.

Argument

Argument is a simple wrapper around argparse.

Configurator

Configurator is a simple wrapper around configparser; unlike configparser's default behavior, Configurator is case sensitive.
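
The case-sensitivity point is easiest to see with plain configparser, which lowercases option names unless optionxform is overridden. Whether Configurator achieves case sensitivity this way is an assumption; the option name CorpusDirectory is invented for the example.

```python
import configparser

TEXT = "[main]\nCorpusDirectory = /tmp/corpus\n"

# Default behavior: option names are converted to lowercase.
default_parser = configparser.ConfigParser()
default_parser.read_string(TEXT)

# Overriding optionxform keeps option names exactly as written.
case_sensitive = configparser.ConfigParser()
case_sensitive.optionxform = str
case_sensitive.read_string(TEXT)

default_keys = list(default_parser["main"])
preserved_keys = list(case_sensitive["main"])
print(default_keys, preserved_keys)
```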

UnicodeHandler

UnicodeHandler provides several methods for handling Unicode strings and detecting encodings.

RedirectStream

A simple class that can redirect the stdout/stderr stream to a file.
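
The same effect can be demonstrated with the standard library's contextlib.redirect_stdout. This sketch shows the general technique only; it does not use RedirectStream, whose interface is not described here, and the log file path is a temporary placeholder.

```python
import contextlib
import os
import tempfile

# Temporary placeholder path for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "output.log")

with open(path, "w", encoding="utf-8") as log_file:
    with contextlib.redirect_stdout(log_file):
        print("written to the file, not the terminal")

with open(path, encoding="utf-8") as log_file:
    content = log_file.read()

print(repr(content))
```

contextlib.redirect_stderr works the same way for the stderr stream.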