TopMost covers the complete topic modeling lifecycle: datasets, preprocessing, models, training, and evaluation. It supports the most popular topic modeling scenarios, including basic, dynamic, hierarchical, and cross-lingual topic modeling.
This is our survey paper on neural topic models: A Survey on Neural Topic Models: Methods, Applications, and Challenges.
If you use our toolkit, please cite:
@article{wu2023topmost,
title={Towards the TopMost: A Topic Modeling System Toolkit},
author={Wu, Xiaobao and Pan, Fengjun and Luu, Anh Tuan},
journal={arXiv preprint arXiv:2309.06908},
year={2023}
}
@article{wu2023survey,
title={A Survey on Neural Topic Models: Methods, Applications, and Challenges},
author={Wu, Xiaobao and Nguyen, Thong and Luu, Anh Tuan},
journal={Artificial Intelligence Review},
url={https://doi.org/10.1007/s10462-023-10661-7},
year={2024},
publisher={Springer}
}
TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:
Scenario | Model | Evaluation Metric | Datasets |
---|---|---|---|
Basic Topic Modeling | | TC, TD, Clustering, Classification | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
Hierarchical Topic Modeling | HDP, SawETM, HyperMiner, ProGBN, TraCo | TC over levels, TD over levels, Clustering over levels, Classification over levels | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
Dynamic Topic Modeling | DTM, DETM | TC over time slices, TD over time slices, Clustering, Classification | NeurIPS, ACL, NYT |
Cross-lingual Topic Modeling | NMTM, InfoCTM | TC (CNPMI), TD over languages, Classification (Intra and Cross-lingual) | ECNews, Amazon Review, Rakuten |
Install topmost with pip:
$ pip install topmost
We can get the top words of discovered topics, topic_top_words, and the topic distributions of documents, doc_topic_dist. The preprocessing steps are configurable; see our documentation.
import topmost
from topmost.preprocessing import Preprocessing
# Your own documents
docs = [
"This is a document about space, including words like space, satellite, launch, orbit.",
"This is a document about Microsoft Windows, including words like windows, files, dos.",
# more documents...
]
device = 'cuda' # or 'cpu'
preprocessing = Preprocessing()
dataset = topmost.data.RawDatasetHandler(docs, preprocessing, device=device, as_tensor=True)
model = topmost.models.ProdLDA(dataset.vocab_size, num_topics=2)
model = model.to(device)
trainer = topmost.trainers.BasicTrainer(model)
topic_top_words, doc_topic_dist = trainer.fit_transform(dataset, num_top_words=15, verbose=False)
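To inspect the two outputs together, the sketch below pairs each document with its highest-probability topic. It assumes that topic_top_words is a list holding one string of space-separated words per topic and that doc_topic_dist is a 2-D array-like of shape (number of documents, number of topics). The helper name `dominant_topics` and the toy values are hypothetical and not part of TopMost.

```python
def dominant_topics(doc_topic_dist, topic_top_words):
    """Return (topic index, probability, top words) for each document.

    Assumes one row per document and one column per topic.
    """
    results = []
    for row in doc_topic_dist:
        probs = [float(p) for p in row]
        # Index of the largest probability in this row.
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((best, probs[best], topic_top_words[best]))
    return results


# Toy stand-ins for the outputs of fit_transform (made-up values).
toy_top_words = ["space satellite launch orbit", "windows files dos microsoft"]
toy_doc_topic_dist = [[0.9, 0.1], [0.2, 0.8]]

for doc_id, (topic, prob, words) in enumerate(
    dominant_topics(toy_doc_topic_dist, toy_top_words)
):
    print(f"doc {doc_id}: topic {topic} (p={prob:.2f}) -> {words}")
```

With real outputs, the same call would be `dominant_topics(doc_topic_dist, topic_top_words)`, provided the assumed shapes hold.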
import topmost
from topmost.data import download_dataset
download_dataset('20NG', cache_path='./datasets')
device = "cuda" # or "cpu"
# load a preprocessed dataset
dataset = topmost.data.BasicDatasetHandler("./datasets/20NG", device=device, read_labels=True, as_tensor=True)
# create a model
model = topmost.models.ProdLDA(dataset.vocab_size)
model = model.to(device)
# create a trainer
trainer = topmost.trainers.BasicTrainer(model)
# train the model
trainer.train(dataset)
# get theta (doc-topic distributions)
train_theta, test_theta = trainer.export_theta(dataset)
# get top words of topics
topic_top_words = trainer.export_top_words(dataset.vocab)
# evaluate topic diversity
TD = topmost.evaluations.compute_topic_diversity(topic_top_words)
# evaluate clustering
clustering_results = topmost.evaluations.evaluate_clustering(test_theta, dataset.test_labels)
# evaluate classification
classification_results = topmost.evaluations.evaluate_classification(train_theta, test_theta, dataset.train_labels, dataset.test_labels)
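For intuition about what a topic diversity score measures, here is a minimal sketch of one common definition: the fraction of unique words among the top words of all topics. This is an illustrative assumption and not necessarily the variant that `compute_topic_diversity` implements; the function name `topic_diversity` and the toy topics are hypothetical.

```python
def topic_diversity(top_words):
    """Fraction of unique words among the top words of all topics.

    `top_words` is assumed to be a list with one string of
    space-separated words per topic.
    """
    words = [w for topic in top_words for w in topic.split()]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Toy topics: 9 words in total, 7 of them distinct
# ("space" and "files" each appear in two topics).
toy_topics = [
    "space satellite launch",
    "windows files dos",
    "space files orbit",
]
print(topic_diversity(toy_topics))
```

A score of 1.0 means no word is shared between topics; lower scores indicate overlapping, more redundant topics.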
import torch
from topmost.preprocessing import Preprocessing
new_docs = [
"This is a new document about space, including words like space, satellite, launch, orbit.",
"This is a new document about Microsoft Windows, including words like windows, files, dos."
]
preprocessing = Preprocessing()
parsed_new_docs, new_bow = preprocessing.parse(new_docs, vocab=dataset.vocab)
new_doc_topic_dist = trainer.test(torch.as_tensor(new_bow, device=device).float())
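New documents must be encoded with the vocabulary of the training dataset so that each column of the bag-of-words matrix lines up with the model's input; words outside that vocabulary are dropped. The sketch below illustrates this idea in plain Python. It is not TopMost's preprocessing code: the function `bow_with_vocab`, the simple regex tokenizer, and the toy vocabulary are assumptions made for illustration.

```python
import re


def bow_with_vocab(docs, vocab):
    """Count occurrences of each vocabulary word in each document.

    Tokens that are not in `vocab` are ignored, so every row has
    exactly len(vocab) columns in the vocabulary's order.
    """
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        # Simplified tokenizer: lowercase alphanumeric runs.
        for token in re.findall(r"[a-z0-9]+", doc.lower()):
            if token in index:
                row[index[token]] += 1
        rows.append(row)
    return rows


toy_vocab = ["space", "orbit", "windows", "files"]
toy_docs = ["Space probes orbit in space.", "Windows stores files."]
print(bow_with_vocab(toy_docs, toy_vocab))  # [[2, 1, 0, 0], [0, 0, 1, 1]]
```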
To install TopMost, run this command in your terminal:
$ pip install topmost
This is the preferred method to install TopMost, as it will always install the most recent stable release.
The sources for TopMost can be downloaded from the GitHub repository. You can clone the public repository with
$ git clone https://github.com/BobXWu/TopMost.git
Then install TopMost with
$ python setup.py install
We provide tutorials for different usage scenarios.
This library includes some datasets for demonstration. If you are a dataset owner who wants to exclude your dataset from this library, please contact Xiaobao Wu.
Xiaobao Wu | Fengjun Pan |
- If you want to add any models to this package, we welcome your pull requests.
- If you encounter any problems, please either contact Xiaobao Wu directly or open an issue in the GitHub repo.
- Icon by Flat-icons-com.