Visualization of Topic Modeling Results


Keywords
data, science, analytics, data-science, data-visualization, machine-learning, plotting, python, topic-modeling, visualization
License
Other
Install
pip install tmplot==0.1.2

Documentation

tmplot

Codacy coverage Codacy grade GitHub Workflow Status Documentation Status Downloads PyPI Issues

tmplot is a Python package for analysis and visualization of topic modeling results. It provides the interactive report interface that borrows much from LDAvis/pyLDAvis and builds upon it offering a number of metrics for calculating topic distances and a number of algorithms for calculating scatter coordinates of topics. It can be used to select closest and stable topics across multiple models.

Plots

Features

  • Supported models:

    • tomotopy: LDAModel, LLDAModel, CTModel, DMRModel, HDPModel, PTModel, SLDAModel, GDMRModel
    • gensim: LdaModel, LdaMulticore
    • bitermplus: BTM
  • Supported distance metrics:

    • Kullback-Leibler (symmetric and non-symmetric) divergence
    • Jenson-Shannon divergence
    • Jeffrey's divergence
    • Hellinger distance
    • Bhattacharyya distance
    • Total variation distance
    • Jaccard inversed index
  • Supported algorithms for calculating topics scatter coordinates:

    • t-SNE
    • SpectralEmbedding
    • MDS
    • LocallyLinearEmbedding
    • Isomap

Installation

The package can be installed from PyPi:

pip install tmplot

Or directly from this repository:

pip install git+https://github.com/maximtrp/tmplot.git

Dependencies

  • numpy
  • scipy
  • scikit-learn
  • pandas
  • altair
  • ipywidgets
  • tomotopy, gensim, and bitermplus (optional)

Quick example

# Importing packages
import tmplot as tmp
import pickle as pkl
import pandas as pd

# Reading a model from a file
with open('data/model.pkl', 'rb') as file:
    model = pkl.load(file)

# Reading documents from a file
docs = pd.read_csv('data/docs.txt.gz', header=None).values.ravel()

# Plotting topics as a scatter plot
topics_coords = tmp.prepare_coords(model)
tmp.plot_scatter_topics(topics_coords, size_col='size', label_col='label')

# Plotting terms probabilities
terms_probs = tmp.calc_terms_probs_ratio(phi, topic=0, lambda_=1)
tmp.plot_terms(terms_probs)

# Running report interface
tmp.report(model, docs=docs, width=250)

You can find more examples in the tutorial.