corpusshow

Corpus-Show makes it easier and faster to visualize a corpus through its sentence embeddings.


Keywords
nlp, pypi-package, visualization
License
Apache-2.0
Install
pip install corpusshow==0.1.8

Documentation

Corpus-Show


Corpus-Show helps you understand how corpus data is distributed by visualizing the embeddings generated by Sentence Transformers. (It is not an elaborate package; it simply makes visualization more convenient.)

  • Corpus-Show performs sentence embedding via Sentence Transformers, a Python framework for state-of-the-art sentence, text and image embeddings. [Paper] [Document] [Huggingface model]
  • You can visualize the embedded sentences of each document generated from SentenceTransformers.
  • Corpus-Show can also cluster the array of sentence embeddings with Scikit-Learn KMeans.
  • The sentence transformer model is downloaded through the Hugging Face interface. The default model is paraphrase-xlm-r-multilingual-v1, which supports multiple languages, but you can pass your own custom model name through the same interface. Models are also easy to fine-tune via SBERT. For more models, please see this page.
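
The steps listed above (embed, cluster, reduce for plotting) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Corpus-Show's actual implementation: the function names embed_sentences, cluster_embeddings and project_embeddings are hypothetical, and only the default model name and the use of Sentence Transformers and Scikit-Learn KMeans come from the description above. The PCA and t-SNE reducers are the Scikit-Learn ones.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def embed_sentences(sentences, model_name="paraphrase-xlm-r-multilingual-v1"):
    """Encode sentences into one dense vector per sentence (hypothetical helper)."""
    # Imported lazily: the model is downloaded from Hugging Face on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(list(sentences)))


def cluster_embeddings(embeddings, num_cluster=4, seed=0):
    """Assign each embedding to one of num_cluster k-means clusters."""
    kmeans = KMeans(n_clusters=num_cluster, n_init=10, random_state=seed)
    return kmeans.fit_predict(embeddings)


def project_embeddings(embeddings, method="pca", n_components=2, seed=0):
    """Reduce embeddings to 2 or 3 dimensions so they can be plotted."""
    if method == "pca":
        reducer = PCA(n_components=n_components, random_state=seed)
    elif method == "tsne":
        # t-SNE requires the perplexity to be smaller than the number of samples.
        perplexity = min(30.0, len(embeddings) - 1)
        reducer = TSNE(n_components=n_components, perplexity=perplexity,
                       random_state=seed)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return reducer.fit_transform(embeddings)
```

With real text, the array returned by embed_sentences (for example on the 'news' column of a dataframe) would be passed to the other two functions; the cluster labels and the 2D or 3D coordinates are then enough to draw a scatter plot.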

Installation

pip install corpusshow
  • This package may not work properly on M1/M2 macOS. If you are using macOS, please use the git repository as a submodule instead; this is practical because the package consists of simple functions (it is not highly encapsulated). [issue#1]

Tutorial

I will provide tutorial notebooks for all the features of this package. I plan to add docstrings and further documentation starting with the official release (major version 1 or higher).

  1. Main tutorial: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials/corpusshow_tutorial.ipynb
  2. Tutorial folder: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials

Main Feature

Given a dataframe and its column names as input, Corpus-Show creates simple but useful plots. The example here uses the BBC sample dataset in ./data/bbc_news_dataset.csv, which looks like this:

     news                                      topic
0    Oil rebounds from weather effect (...)    business
1    Indonesia 'declines debt freeze' (...)    business
...  ...                                       ...
601  EU software patent law faces axe (...)    tech

1. CorpusCluster

Contains 1 static method. You can create great pictures with:

from corpusshow import CorpusCluster

# Class arguments
csv_file_path = '../data/bbc_news_dataset.csv'
sentence_transformer_model_name = 'paraphrase-xlm-r-multilingual-v1'
target_col = 'news'
num_cluster = 4

# Get class object
cc = CorpusCluster(csv_file_path, sentence_transformer_model_name, target_col, num_cluster)

# 1. quick_corpus_show method: 
# Show figures without k-means clustering
cc.quick_corpus_show('topic', 'tsne2d', False, 'fig1.png')
cc.quick_corpus_show('topic', 'tsne3d', False, 'fig2.png')
cc.quick_corpus_show('topic', 'pca2d', False, 'fig3.png')
cc.quick_corpus_show('topic', 'pca3d', False, 'fig4.png')

# 2. quick_cluster_show method:
# Show figures with k-means clustering
df_returned = cc.quick_cluster_show('tsne2d', False, 'fig5.png')
df_returned = cc.quick_cluster_show('tsne3d', False, 'fig6.png')
df_returned = cc.quick_cluster_show('pca2d', False, 'fig7.png')
df_returned = cc.quick_cluster_show('pca3d', False, 'fig8.png')

  • If you want to change the design of the plot, adjust matplotlib's rcParams before plotting, or draw your own plot from the returned dataframe.
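
Both options can be sketched as below. Treat this as an illustration only: the column names 'x', 'y' and 'cluster' and the helper plot_clusters are hypothetical, since the layout of the dataframe returned by quick_cluster_show is not documented here, so check the actual columns of df_returned before adapting it. The rcParams keys themselves are standard matplotlib settings.

```python
import matplotlib

matplotlib.use("Agg")  # render to files; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Global style overrides; they affect every figure created afterwards.
plt.rcParams.update({
    "figure.figsize": (8, 6),
    "font.size": 12,
    "axes.grid": True,
})


def plot_clusters(df, x_col, y_col, label_col, out_path):
    """Scatter-plot 2D points from df, one color per label, and save the image."""
    fig, ax = plt.subplots()
    for label, group in df.groupby(label_col):
        ax.scatter(group[x_col], group[y_col], s=20, label=str(label))
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.legend(title=label_col)
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path


# Hypothetical stand-in for the dataframe returned by quick_cluster_show.
demo_df = pd.DataFrame({
    "x": [0.1, 0.2, 3.0, 3.1],
    "y": [1.0, 1.1, -2.0, -2.2],
    "cluster": [0, 0, 1, 1],
})
```

Calling plot_clusters(demo_df, 'x', 'y', 'cluster', 'custom.png') writes a restyled scatter plot; swapping in the real dataframe only requires passing its actual column names.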

References

[1] Scikit-Learn https://scikit-learn.org
[2] Matplotlib https://matplotlib.org/
[3] Huggingface Sentence Transformer https://huggingface.co/sentence-transformers
[4] SBERT https://www.sbert.net/


Use Case

[1] Korean-news-topic-classification-using-KO-BERT: all plots were created through Corpus-Show and Quick-Show.

Contacts

Maintainer: Daniel Park, South Korea
E-mail: parkminwoo1991@gmail.com