Young Tools
This package contains several useful tools, some of which address problems in Natural Language Processing.
Installation
- Through pip
pip install young-tools
- Clone the repository locally
git clone https://github.com/Jason-Young-NLP/YoungTools.git
cd YoungTools
python setup.py build develop
Main Framework
Executable Modules
Every executable module is run with the command young-tools-{module_name}.
So far, young-tools provides three executable modules:
- young-tools-corpus
- young-tools-levenshtein
- young-tools-xml
Compilers
Corpus
It is a corpus compiler that is run with young-tools-corpus. The command takes a single argument, -p or --configuration-path, which points to a configuration file containing all the parameters you set. The configuration file is written in a basic configuration language with a structure similar to that of Microsoft Windows INI files.
You must provide the main section, in which you should configure:
- pipeline
- corpus_directory
- corpora_names
- languages
- encodings
Before each run, young-tools-corpus reads the file given by configuration-path and parses the main section. young-tools-corpus can process multiple corpora with different settings in a single run. In the main section, the settings of different corpora are separated by the separator |.
pipeline indicates the running order of the sub-corpus-compiler modules. Module names are separated by the separator &. If another instance of a module needs a different configuration, define a new section whose name is the module name with the suffix _{index} appended, such as module_name_10. module_name must be the name of one of the sub-corpus-compiler modules.
corpus_directory specifies where the raw and compiled corpora are located.
Each corpus_directory may contain several corpora (corpora_names), and each corpus may cover several languages (languages), whose compiled-file encodings are determined by encodings.
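As a rough illustration of the two separators, the sketch below parses a hypothetical configuration with Python's standard configparser. The section and option names are the ones listed above; every value (directories, corpus names, languages) is invented for illustration, and the way young-tools-corpus itself reads the file may differ.

```python
import configparser

# Hypothetical configuration text; all values are made up for illustration.
EXAMPLE = """
[main]
pipeline = Cleaner&Tokenizer|Cleaner&Segmenter
corpus_directory = /data/en-de|/data/zh-en
corpora_names = train|train
languages = en|zh
encodings = utf-8|utf-8
"""


def parse_main(text):
    """Split the main section into one settings dict per corpus."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    main = parser["main"]
    # '|' separates the settings of different corpora.
    columns = {key: [part.strip() for part in value.split("|")]
               for key, value in main.items()}
    corpora = []
    for index in range(len(columns["pipeline"])):
        settings = {key: values[index] for key, values in columns.items()}
        # '&' separates the module names inside one pipeline.
        settings["pipeline"] = [name.strip()
                                for name in settings["pipeline"].split("&")]
        corpora.append(settings)
    return corpora


for corpus in parse_main(EXAMPLE):
    print(corpus["corpus_directory"], corpus["pipeline"])
```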
young-tools-corpus has 5 sub-corpus-compiler modules:
- Cleaner
  Removes duplicate lines (remove_dumplicate_lines) and lowercases the corpora (lowercase). granularity can be set to sentence or document. When granularity is document, the document index, which marks the start point of each document in the corpora, is written to corpora_names + document_index_suffix.
- Normalizer
  Normalizes the punctuation of the corpora.
- Segmenter
  Segments Chinese sentences using THULAC. If you need POS tagging, set part_of_speech_tagging to true. traditional_to_simplified may be useful in some situations.
- Tokenizer
  Tokenizes sentences in different languages. To convert the hyphen character - to @-@, set split_aggressive_hyphen to True.
- Subword
  A simple wrapper around subword-nmt. learn_file_index and apply_file_indices give the indices, within corpora_names, of the corpus to learn BPE from and of the corpora to apply it to. subword_indices indicates which languages of the corpora BPE is applied to. symbols_number is the number of merge operations, and joint_learn controls whether BPE is learned jointly across the subword_indices languages at learn_file_index of corpora_names.
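To make the Cleaner options concrete, here is a minimal sketch of sentence-level duplicate removal and lowercasing. The function name and its parameters are hypothetical, and lowercasing before the duplicate check is an assumption; the actual Cleaner may order the steps differently.

```python
def clean_lines(lines, remove_duplicates=True, lowercase=False):
    """Hypothetical helper: drop repeated lines and optionally lowercase them."""
    seen = set()
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if lowercase:
            line = line.lower()
        if remove_duplicates:
            if line in seen:
                continue  # keep only the first occurrence
            seen.add(line)
        cleaned.append(line)
    return cleaned


print(clean_lines(["A b\n", "a b\n", "A b\n"], lowercase=True))
```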
Normalizer and Tokenizer are reimplementations of the corresponding mosesdecoder scripts.
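The hyphen conversion mentioned for the Tokenizer can be sketched with a single regular expression, modelled on the aggressive hyphen splitting of the Moses tokenizer. The function below is an illustration only (its name simply mirrors the option) and is not the package's own code.

```python
import re


def split_aggressive_hyphen(text):
    """Replace a hyphen between two alphanumeric characters with ' @-@ '."""
    # [^\W_] matches one letter or digit; the lookahead keeps the next
    # character available for the following match.
    return re.sub(r"([^\W_])-(?=[^\W_])", r"\1 @-@ ", text)


print(split_aggressive_hyphen("a state-of-the-art system"))
```

A hyphen surrounded by spaces is left untouched, since neither neighbour is alphanumeric.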
Levenshtein
It generates the manipulation sequences between corpus hypotheses and references by calculating the Levenshtein distance, and synthesizes hypotheses for the references by applying rules extracted from aligned hypotheses and references. These functions are run through young-tools-levenshtein with one of the subcommands get-rules, apply-rules, or gen-seqs.
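The idea of a manipulation sequence can be sketched with the textbook dynamic-programming table for the Levenshtein distance followed by a backtrace. The operation labels (keep, substitute, delete, insert) and the tuple layout are hypothetical and are not the output format of gen-seqs.

```python
def edit_operations(hypothesis, reference):
    """Return (operations, distance) that turn hypothesis into reference."""
    rows, cols = len(hypothesis) + 1, len(reference) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every hypothesis token
    for j in range(cols):
        dist[0][j] = j  # insert every reference token
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # keep or substitute
    # Walk back from the bottom-right corner to recover one optimal path.
    operations = []
    i, j = len(hypothesis), len(reference)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                label = "keep" if cost == 0 else "substitute"
                operations.append((label, hypothesis[i - 1], reference[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            operations.append(("delete", hypothesis[i - 1], None))
            i -= 1
        else:
            operations.append(("insert", None, reference[j - 1]))
            j -= 1
    operations.reverse()
    return operations, dist[-1][-1]


ops, distance = edit_operations("the cat sat".split(), "the cat sat down".split())
print(distance, ops)
```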
XML
young-tools-xml converts an XML file into a plain-text file, or escapes/unescapes a file, when the subcommand is xml2plain or scape, respectively.
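For readers unfamiliar with the two operations, the following sketch shows what they amount to using only the Python standard library. It is an assumption about the general behaviour, not the code behind young-tools-xml; the helper name xml_to_plain and the sample document are invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape


def xml_to_plain(xml_text):
    """Hypothetical helper: collect the non-empty text segments of an XML document."""
    root = ET.fromstring(xml_text)
    return [segment.strip() for segment in root.itertext() if segment.strip()]


sample = "<doc><seg id='1'>Hello &amp; welcome</seg><seg id='2'>Bye</seg></doc>"
print(xml_to_plain(sample))

# Escaping replaces &, < and > with entities; unescaping reverses it.
escaped = escape("1 < 2 & 3")
print(escaped)
print(unescape(escaped))
```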
Metrics
To be done.
Pedestal Modules
Use them by importing the pedestal package:
import young_tools.pedestal as pedestal
The usage of each module in the pedestal package is described below:
Timer
Timer records the system/process elapsed time.
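The interface of Timer is not described here, so the following is only a standard-library sketch of the same idea, assuming "system" means wall-clock time and "process" means CPU time of the current process. The class name SimpleTimer is hypothetical.

```python
import time


class SimpleTimer:
    """Hypothetical context manager measuring wall-clock and process CPU time."""

    def __enter__(self):
        self._wall_start = time.perf_counter()
        self._cpu_start = time.process_time()
        return self

    def __exit__(self, *exc_info):
        self.wall_elapsed = time.perf_counter() - self._wall_start
        self.process_elapsed = time.process_time() - self._cpu_start
        return False  # do not swallow exceptions


with SimpleTimer() as timer:
    total = sum(range(100000))
print(f"wall: {timer.wall_elapsed:.6f}s, process: {timer.process_elapsed:.6f}s")
```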
Constant
Constant is a class that stores an unlimited number of constants.
InstancesChecker
InstancesChecker is a basic decorator that checks whether the parameters passed to a method are legal.
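A decorator of this kind can be sketched as follows. This is not the InstancesChecker implementation; the name check_instances and its keyword-based interface are assumptions, and "legal" is taken to mean "of the expected type".

```python
import functools
import inspect


def check_instances(**expected_types):
    """Hypothetical decorator: raise TypeError when an argument has the wrong type."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Map positional and keyword arguments to parameter names.
            bound = signature.bind(*args, **kwargs)
            for name, value in bound.arguments.items():
                expected = expected_types.get(name)
                if expected is not None and not isinstance(value, expected):
                    raise TypeError(
                        f"{name} must be {expected.__name__}, "
                        f"got {type(value).__name__}"
                    )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@check_instances(text=str, count=int)
def repeat(text, count):
    return text * count


print(repeat("ab", 2))
```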
ANSIFormatter
ANSIFormatter controls ANSI color strings. Use this class to format terminal output strings.
Logger
Logger records the log messages of the process and sends them to a log file or the terminal.
Argument
Argument is a simple wrapper around argparse.
Configurator
Configurator is a simple wrapper around configparser, but unlike the default configparser behaviour, Configurator is case sensitive.
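The case-sensitivity point matters because the standard configparser lowercases option names by default. The sketch below shows one common way to keep the original case with plain configparser; whether Configurator does it the same way is an assumption, and the option name CorpusName is invented.

```python
import configparser

TEXT = "[main]\nCorpusName = train\n"

# Default behaviour: option names are lowercased on read.
default_parser = configparser.ConfigParser()
default_parser.read_string(TEXT)
print(list(default_parser["main"]))

# Case-sensitive behaviour: keep option names exactly as written.
sensitive_parser = configparser.ConfigParser()
sensitive_parser.optionxform = str
sensitive_parser.read_string(TEXT)
print(list(sensitive_parser["main"]))
```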
UnicodeHandler
UnicodeHandler has several methods that deal with Unicode strings and detect the encoding type.
RedirectStream
A simple class that redirects the stdout/stderr stream to a file.
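The same effect can be obtained with the standard library alone, which may help when reading code that uses RedirectStream. The helper below is a hypothetical stand-in built on contextlib.redirect_stdout, not the class's actual interface.

```python
import contextlib


def run_with_stdout_redirected(path, func):
    """Hypothetical helper: call func() while stdout is written to the file at path."""
    with open(path, "w", encoding="utf-8") as handle:
        with contextlib.redirect_stdout(handle):
            func()
```

For stderr, contextlib.redirect_stderr can be used in the same way.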