eKorpkit provides a flexible interface for NLP and ML research pipelines such as extraction, transformation, tokenization, training, and visualization.


Keywords: corpus, nlp, pipeline, python
License: MIT
Install: pip install ekorpkit==0.1.35

ekorpkit 【iːkɔːkɪt】 : eKonomic Research Python Toolkit

eKorpkit provides a flexible interface for NLP and ML research pipelines such as extraction, transformation, tokenization, training, and visualization. Its powerful config composition is backed by Hydra.

Warning: This is a work in progress

This project is still under development. The API is subject to change. Until the first stable release, the version number will be 0.x.x. Please use it at your own risk. If you have any questions or suggestions, please feel free to contact me.

In particular, some core parts of the configuration interface will be carved out and moved to a separate package named hyfi (Hydra Fast Interface). Image generation and visualization will also be moved to a separate package named ekaros (from Íkaros [Icarus] in Greek mythology).

Key features

Easy Configuration

  • You can compose your configuration dynamically, so you can easily get the right configuration for each research project.
  • You can override everything from the command line, which makes experimentation fast and removes the need to maintain multiple similar configuration files.
  • With the help of the eKonf class, it is also easy to compose configurations in a Jupyter notebook environment.
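
The idea of composing a configuration from parts and then overriding individual values can be illustrated with a small sketch that uses only the standard library. It does not use Hydra or the eKonf class, and it is not how eKorpkit implements composition; the function names `merge` and `apply_overrides` and all config keys are hypothetical.

```python
import copy


def merge(base: dict, extra: dict) -> dict:
    """Recursively merge `extra` into a copy of `base` (later values win)."""
    result = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply `a.b=value` style overrides, as they might arrive from a command line."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = raw_value  # values stay strings in this sketch
    return result


# Hypothetical config fragments, for illustration only.
defaults = {"corpus": {"name": "sample", "lang": "en"}, "tokenizer": {"lowercase": True}}
experiment = {"corpus": {"lang": "ko"}}

config = apply_overrides(merge(defaults, experiment), ["tokenizer.model=simple"])
print(config)
```

Because both helpers work on copies, the `defaults` fragment can be reused across experiments without being modified.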

No Boilerplate

  • eKorpkit lets you focus on the problem at hand instead of spending time on boilerplate code such as command line flags, configuration file loading, and logging.

Workflows

  • A workflow is a configurable automated process that will run one or more jobs.
  • You can divide your research into several unit jobs (tasks), then combine those jobs into one workflow.
  • You can have multiple workflows, each of which can perform a different set of tasks.
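
The relationship between unit jobs and workflows can be pictured with a plain-Python sketch. This is only a conceptual illustration under the assumption that each job takes the previous job's output; `run_workflow` and the three example jobs are hypothetical names, not part of the eKorpkit API.

```python
from typing import Any, Callable

Job = Callable[[Any], Any]


def run_workflow(jobs: list[tuple[str, Job]], data: Any) -> Any:
    """Run each named job in order, feeding each result to the next job."""
    for name, job in jobs:
        data = job(data)
        print(f"finished job: {name}")
    return data


# Hypothetical unit jobs, for illustration only.
def extract(text: str) -> list[str]:
    return text.splitlines()


def tokenize(lines: list[str]) -> list[list[str]]:
    return [line.lower().split() for line in lines]


def count_tokens(docs: list[list[str]]) -> int:
    return sum(len(doc) for doc in docs)


# Two workflows that share the same unit jobs.
preprocess = [("extract", extract), ("tokenize", tokenize)]
stats = preprocess + [("count", count_tokens)]

tokens = run_workflow(preprocess, "Hello world\nGood morning everyone")
total = run_workflow(stats, "Hello world\nGood morning everyone")
```

Here `preprocess` and `stats` are separate workflows built from the same jobs, which mirrors the idea of combining tasks in different ways.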

Sharable and Reproducible

  • With eKorpkit, you can easily share your datasets and models.
  • Sharing configs along with datasets and models makes your research reproducible.
  • You can share individual unit jobs or an entire workflow.

Pluggable Architecture

  • eKorpkit has a pluggable architecture, so you can combine it with your own implementations.
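
One common way to make a toolkit pluggable is to let a config entry name any importable callable and build the object from it. The sketch below shows that general pattern with the standard library only. It assumes a hypothetical `target` key and a hypothetical `instantiate` helper, and it is not a description of how eKorpkit resolves plug-ins.

```python
import importlib
from typing import Any


def instantiate(spec: dict) -> Any:
    """Import the callable named by spec["target"] and call it with the remaining keys."""
    params = dict(spec)  # copy so the caller's spec is left untouched
    module_name, _, attr = params.pop("target").rpartition(".")
    target = getattr(importlib.import_module(module_name), attr)
    return target(**params)


# The entry could point at any importable callable, including your own code.
# A standard-library class is used here so the example runs as-is.
wrapper = instantiate({"target": "textwrap.TextWrapper", "width": 20})
print(wrapper.wrap("eKorpkit has a pluggable architecture."))
```

Swapping in your own implementation then only requires changing the dotted path in the config entry.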

Tutorials

Tutorials for the ekorpkit package can be found at https://ekorpkit.entelecheia.ai.

Installation

Install the latest version of ekorpkit:

pip install ekorpkit

To install all extra dependencies:

pip install "ekorpkit[all]"
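
Since the API may change between 0.x releases, it can help to confirm which version is installed. A minimal check using only the standard library might look like this; `installed_version` is a hypothetical helper name, not something the package provides.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if ekorpkit is not installed.
print(installed_version("ekorpkit"))
```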

The eKorpkit Corpus

The eKorpkit Corpus is a large, diverse, bilingual (ko/en) language modelling dataset.

Citation

@software{lee_2022_6497226,
  author       = {Young Joon Lee},
  title        = {eKorpkit: eKonomic Research Python Toolkit},
  month        = apr,
  year         = 2022,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.6497226},
  url          = {https://doi.org/10.5281/zenodo.6497226}
}
@software{lee_2022_ekorpkit,
  author       = {Young Joon Lee},
  title        = {eKorpkit: eKonomic Research Python Toolkit},
  month        = apr,
  year         = 2022,
  publisher    = {GitHub},
  url          = {https://github.com/entelecheia/ekorpkit}
}

Changelog

See the CHANGELOG for more information.

Contributing

Contributions are welcome! Please see the contributing guidelines for more information.

License

  • This project is released under the MIT License.
  • Each corpus adheres to its own license policy. Please check the license of the corpus before using it!