Short Text Mining in Python
shorttext is a Python package that facilitates supervised and unsupervised
learning for short text categorization. Because short texts contain few words
and carry little information on their own, an intermediate
representation of the texts and documents is needed before they are passed to
any classification algorithm. This package supports various types
of these representations, including topic modeling and word-embedding algorithms.
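The sparseness problem can be seen with a plain bag-of-words representation. The following is a minimal sketch that uses only the Python standard library and none of the shorttext API; the helper names `bag_of_words` and `cosine_similarity` and the sample texts are made up for illustration.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Map a text to a sparse word-count vector (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = sqrt(sum(c * c for c in vec_a.values()))
    norm_b = sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two related short texts that share no words at all.
related_a = bag_of_words("heart attack symptoms")
related_b = bag_of_words("signs of cardiac arrest")
# An unrelated text that happens to share one word with related_a.
unrelated = bag_of_words("attack on titan")

print(cosine_similarity(related_a, related_b))
print(cosine_similarity(related_a, unrelated))
```

With raw word counts, the two related texts look entirely dissimilar while the unrelated one scores higher, which is the kind of failure that topic-model or word-embedding representations are meant to reduce.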
Since release 1.0.0, shorttext runs on Python 2.7, 3.5, and 3.6.
Since release 1.0.7, it runs on Python 3.7 as well, but the backend for keras cannot be TensorFlow.
Since release 1.0.8, it runs on Python 3.7 with TensorFlow being the backend for keras.
- example data provided (including subject keywords and NIH RePORT);
- text preprocessing;
- pre-trained word-embedding support;
- gensim topic models (LDA, LSI, Random Projections) and autoencoder;
- topic model representation supported for supervised learning using
- cosine distance classification;
- neural network classification (including ConvNet and C-LSTM);
- maximum entropy classification;
- metrics of phrase differences, including soft Jaccard score (using Damerau-Levenshtein distance) and Word Mover's distance (WMD);
- character-level sequence-to-sequence (seq2seq) learning; and
- spell correction.
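To give a feel for the soft Jaccard score mentioned in the list above, here is a minimal standard-library sketch. It is not the package's implementation: the optimal-string-alignment variant of the Damerau-Levenshtein distance, the normalisation by the longer token, and the greedy token matching are all assumptions chosen for brevity, and the function names are hypothetical.

```python
def damerau_levenshtein(a, b):
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus the distance scaled by the longer token."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - damerau_levenshtein(a, b) / float(longest)


def soft_jaccard(tokens1, tokens2):
    """Jaccard score where tokens match partially instead of exactly."""
    remaining = list(tokens2)
    intersection = 0.0
    for token in tokens1:
        if not remaining:
            break
        # Greedily pair each token with its most similar unused counterpart.
        best = max(remaining, key=lambda other: similarity(token, other))
        intersection += similarity(token, best)
        remaining.remove(best)
    union = len(tokens1) + len(tokens2) - intersection
    return intersection / union if union > 0 else 1.0


print(soft_jaccard(["diabetes", "mellitus"], ["diabtees", "melitus"]))
```

An exact-match Jaccard score would be zero for the misspelled pair in the last line, whereas the soft version still credits the near matches.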
Documentation and tutorials for shorttext can be found here: http://shorttext.rtfd.io/.
To install it, in a console, use
>>> pip install -U shorttext
or, if you want the most recent development version on GitHub, type
>>> pip install -U git+https://github.com/stephenhky/PyShortTextCategorization@master
Developers are advised to make sure that Keras >=2 is installed. Users are advised to install the backend, TensorFlow (preferred) or Theano, in advance. It is also desirable to have Cython installed beforehand.
Before using the package, check that the spaCy language model has been installed or updated by running:
>>> python -m spacy download en
See installation guide for more details.
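As a quick sanity check after installation, a short script can report which of the packages mentioned above are importable in the current environment. This is only a sketch for Python 3: the helper `check_dependencies` is hypothetical, and the import names listed (for example `Cython` with a capital C) are assumptions about how each package is imported.

```python
from importlib.util import find_spec


def check_dependencies(module_names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # Assumed import names for the packages named in this section.
    modules = ["shorttext", "keras", "tensorflow", "theano", "Cython", "spacy"]
    for name, found in check_dependencies(modules).items():
        print("{:12s} {}".format(name, "found" if found else "MISSING"))
```

Only one of the two backends needs to be present, so a single "MISSING" next to either `tensorflow` or `theano` is not necessarily a problem. The script does not check whether the spaCy language model itself has been downloaded.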
To report any issues, go to the Issues tab of the GitHub page and start a thread. Developers are welcome to submit pull requests to fix any errors.
If you would like to contribute, feel free to submit pull requests. You can contact me in advance by e-mail or through the Issues page.
- Documentation: http://shorttext.readthedocs.io
- Github: https://github.com/stephenhky/PyShortTextCategorization
- PyPI: https://pypi.org/project/shorttext/
- "Package shorttext 1.0.0 released," Medium
- "Python Package for Short Text Mining", WordPress
- "Document-Term Matrix: Text Mining in R and Python," WordPress
- An earlier version of this repository is a demonstration of the following blog post: Short Text Categorization using Deep Neural Networks and Word-Embedding Models
- 09/02/2017: end of GSoC project. (Report)
- 05/30/2017: GSoC project (Chinmaya Pancholi, with gensim)
Possible Future Updates
- word2vec-api for faster loading (especially on Cloud);
- More scalability using
- Including BERT models;
- Use of DASK;
- Dividing components to other packages;
- More available corpora.