Corpus-Show

Corpus-Show helps you understand how a corpus is distributed by visualizing sentence embeddings generated with Sentence Transformers. (It is not a sophisticated package; it simply makes this kind of visualization convenient.)

Corpus-Show performs sentence embedding via Sentence Transformers, a Python framework for state-of-the-art sentence, text and image embeddings. [Paper] [Document] [Huggingface model]
You can visualize the embedded sentences of each document generated by Sentence Transformers.
Corpus-Show can also cluster the array of sentence embeddings with Scikit-Learn KMeans.
The sentence transformer model is downloaded through the Hugging Face interface. The default model is paraphrase-xlm-r-multilingual-v1, which supports multiple languages, but you can pass any other sentence transformer model available through the Hugging Face interface, including a custom one. Models can also be fine-tuned with SBERT. For more models, please see this page.
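The steps above (embed, optionally cluster, then project for plotting) can be sketched roughly as follows. This is not Corpus-Show's actual implementation: the helper names `embed_sentences`, `cluster_embeddings`, and `project_2d` are hypothetical, and only `SentenceTransformer.encode`, `KMeans`, and `PCA` are real library calls. A random array stands in for real embeddings so the sketch runs without downloading a model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def embed_sentences(sentences, model_name="paraphrase-xlm-r-multilingual-v1"):
    """Embed sentences with a Sentence Transformers model (downloaded on first use)."""
    # Imported lazily so the rest of the sketch runs without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(sentences))


def cluster_embeddings(embeddings, num_cluster=4, seed=0):
    """Assign each embedding to one of num_cluster k-means clusters."""
    kmeans = KMeans(n_clusters=num_cluster, n_init=10, random_state=seed)
    return kmeans.fit_predict(embeddings)


def project_2d(embeddings, seed=0):
    """Reduce embeddings to two dimensions with PCA for plotting."""
    return PCA(n_components=2, random_state=seed).fit_transform(embeddings)


# Stand-in for real embeddings: 40 fake "sentences"; the dimension is arbitrary.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(40, 32))

labels = cluster_embeddings(fake_embeddings, num_cluster=4)
points = project_2d(fake_embeddings)
```

With real data, `fake_embeddings` would be replaced by the output of `embed_sentences(list_of_texts)`, optionally with a different `model_name`.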

Installation

pip install corpusshow

This package may not work properly on M1/M2 macOS. If you are using macOS, please use the git repository as a submodule instead; this is practical because the package consists of simple functions that are not highly encapsulated. [issue#1]

Tutorial

I will provide tutorial notebooks for all the features the package offers. Additional docstrings and documentation are planned for the official release (major version 1 or higher).

Main-tutorials: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials/corpusshow_tutorial.ipynb
Sub-tutorial-folder: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials

Main Feature

Given a dataframe and column names as input, Corpus-Show creates simple but useful plots. The examples here use the BBC sample dataset in ./data/bbc_news_dataset.csv:

	news	topic
0	Oil rebounds from weather effect (...)	business
1	Indonesia 'declines debt freeze' (...)	business
...	...	...
601	EU software patent law faces axe (...)	tech
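Before handing your own data to Corpus-Show, it can help to confirm that the CSV has the shape shown above. The following is a minimal sketch with pandas, assuming only that the file needs a text column and, for coloring by label, a category column. The helper `check_corpus_csv` is hypothetical and not part of the package.

```python
import io

import pandas as pd


def check_corpus_csv(csv_source, text_col, label_col=None):
    """Load a CSV and verify that the columns needed for plotting are present."""
    df = pd.read_csv(csv_source)
    required = [text_col] + ([label_col] if label_col else [])
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Drop rows whose text is missing so they are not passed to the embedding model.
    return df.dropna(subset=[text_col]).reset_index(drop=True)


# Tiny in-memory stand-in for ./data/bbc_news_dataset.csv (texts abbreviated).
sample = io.StringIO(
    "news,topic\n"
    "Oil rebounds from weather effect,business\n"
    "Indonesia 'declines debt freeze',business\n"
    "EU software patent law faces axe,tech\n"
)
df = check_corpus_csv(sample, text_col="news", label_col="topic")
```

For a real file, pass its path instead of the `io.StringIO` object.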

1. `CorpusCluster`

Contains 1 static method. You can create great pictures with:

from corpusshow import CorpusCluster

# Class arguments
csv_file_path = '../data/bbc_news_dataset.csv'
sentence_transformer_model_name = 'paraphrase-xlm-r-multilingual-v1'
target_col = 'news'
num_cluster = 4

# Get class object
cc = CorpusCluster(csv_file_path, sentence_transformer_model_name, target_col, num_cluster)

# 1. quick_corpus_show method: 
# Show figures without k-means clustering
cc.quick_corpus_show('topic', 'tsne2d', False, 'fig1.png')
cc.quick_corpus_show('topic', 'tsne3d', False, 'fig2.png')
cc.quick_corpus_show('topic', 'pca2d', False, 'fig3.png')
cc.quick_corpus_show('topic', 'pca3d', False, 'fig4.png')

# 2. quick_cluster_show method:
# Show figures with k-means clustering
df_returned = cc.quick_cluster_show('tsne2d', False, 'fig5.png')
df_returned = cc.quick_cluster_show('tsne3d', False, 'fig6.png')
df_returned = cc.quick_cluster_show('pca2d', False, 'fig7.png')
df_returned = cc.quick_cluster_show('pca3d', False, 'fig8.png')

If you want to change the design of the plot, use matplotlib's rcParams or work with the returned dataframe.
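For example, global matplotlib settings can be adjusted before calling the plotting methods. This sketch assumes Corpus-Show draws with matplotlib's current settings and does not override them internally, which the source does not confirm; the specific values are only illustrative.

```python
import matplotlib

# Use a non-interactive backend so the sketch also runs on headless machines.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Update several rcParams at once; they apply to figures created afterwards.
plt.rcParams.update({
    "figure.figsize": (8, 6),
    "figure.dpi": 150,
    "font.size": 12,
    "axes.titlesize": 14,
})

# The defaults can be restored later with plt.rcdefaults().
```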

References

[1] Scikit-Learn https://scikit-learn.org
[2] Matplotlib https://matplotlib.org/
[3] Huggingface Sentence Transformer https://huggingface.co/sentence-transformers
[4] SBERT https://www.sbert.net/

Use Case

[1] Korean-news-topic-classification-using-KO-BERT: all plots were created through Corpus-Show and Quick-Show.

Contacts

Maintainer: Daniel Park, South Korea. E-mail: parkminwoo1991@gmail.com

corpusshow
Release 0.1.8
