MassiveWordVec

A package of classes to split extremely large sets of docs into multiple files and convert them to word vector embeddings


Keywords
embedding-python, embedding-vectors, large-dataset, pymagnitude, python3, wordvectors
License
Other
Install
pip install MassiveWordVec==11

Documentation

NOTE: The documentation below is outdated and deprecated. New functionality has been added and usage has changed; I will update it when I get time.

MassiveWordVec

massvecpy is a package for converting a large corpus into word embeddings using pre-trained word vectors. Corpora that consist of an array, list, or DataFrame of tokenized docs with shape (num_docs,) are converted into an array of shape (num_docs, max_doc_length, vector_dimensions).
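To make the shapes concrete, here is a minimal sketch with illustrative sizes (the numbers are placeholders, not package defaults):

import numpy as np

#Illustrative sizes only: 1,000 docs, padded to 50 tokens, 200-dim vectors
num_docs, max_doc_length, vector_dimensions = 1000, 50, 200

#Input: a shape-(num_docs,) object array of tokenized docs
docs = np.empty(num_docs, dtype=object)

#Output: a dense shape-(num_docs, max_doc_length, vector_dimensions) array
embeddings = np.zeros((num_docs, max_doc_length, vector_dimensions), dtype=np.float32)
print(embeddings.shape)  #(1000, 50, 200)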

Corpora whose full embedding matrix is too large to hold in memory can be split into a series of n 'slices' of shape (num_docs/num_slices, max_doc_length, vector_dimensions). These slices are saved to the local hard drive to conserve memory and loaded into your program individually (or several at a time, if memory allows) for use as mini-batches when training models.
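A quick back-of-the-envelope calculation shows why slicing matters (all numbers here are hypothetical):

#A float32 embedding matrix for 1,000,000 docs x 100 tokens x 200 dims:
num_docs, max_doc_length, vector_dimensions = 1_000_000, 100, 200
full_bytes = num_docs * max_doc_length * vector_dimensions * 4  #4 bytes per float32
print(full_bytes / 1e9)  #~80 GB, too large for most machines

#Split into 40 slices and each slice is a manageable ~2 GB
number_of_slices = 40
docs_per_slice = num_docs // number_of_slices
print(docs_per_slice * max_doc_length * vector_dimensions * 4 / 1e9)  #~2 GB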

Corpora can be sliced, converted, and saved with or without training labels. When labels are provided, the corpus and labels are shuffled together and sliced into corresponding matrices that are automatically saved and loaded as a pair.
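For intuition, the joint shuffle works roughly like this (a minimal sketch with made-up toy data; the package handles this internally):

import numpy as np

#Toy shape-(3,) object array of tokenized docs and matching labels
docs = np.empty(3, dtype=object)
docs[:] = [['good', 'movie'], ['bad', 'film'], ['great', 'plot', 'twist']]
labels = np.array([1, 0, 1])

#One permutation applied to both keeps doc i aligned with label i
rng = np.random.RandomState(42)  #analogous to the random_state argument
perm = rng.permutation(len(docs))
shuffled_docs, shuffled_labels = docs[perm], labels[perm]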

This package currently only supports pymagnitude pre-trained word vectors. If you would like support for word vectors from other packages, please let me know.
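For reference, the pymagnitude calls this package builds on look like this (the file path is an example and must point at a .magnitude file you have downloaded):

from pymagnitude import Magnitude

vectors = Magnitude('glove.6B.200d.magnitude')
print(vectors.dim)  #200 for this file

#Query a single word or a whole tokenized doc
vec = vectors.query('language')                 #shape-(200,) vector
mat = vectors.query(['a', 'tokenized', 'doc'])  #shape-(3, 200) matrix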

Installation

Use the package manager pip to install MassiveWordVec.

pip install massvecpy

Usage

import massvecpy
import pymagnitude

#Load your pretrained word vectors from pymagnitude
vector_directory = '~/Word Embeddings/'
vector_dict = pymagnitude.Magnitude(vector_directory + 'glove.6B.200d.magnitude')

#Define the corpus to split and convert.
#If entire corpus embedding matrix can fit in memory then leave
#the number of slices at 1.
corpus = massvecpy.DocVectorizer(corpus_name, tokenized_corpus, labels, vector_dict,
         vector_dimension, number_of_slices, file_directory, random_state)

#convert slice 0 to word embedding matrix with associated labels
x, y = corpus.fit(0, verbose = True)

#save slice 0 to harddrive
corpus.save()

#define and fit a model (can be anything, most useful to use model that allows mini-batch training)
clf = Model()
clf.fit(x, y)

#convert the rest of the slices and save them to their own files.  Provide updates to track progress.
corpus.fit_and_save(range(1, number_of_slices), verbose = True)

#clear currently stored embeddings from memory
corpus.clear_memory()

#************

corpus = massvecpy.LoadVectorizedDoc(corpus_name, file_directory)

#load first slice (0) into local arrays
x, y = corpus.load(0)

#figure out the dimensions of our slices
num_docs_in_slice = x.shape[0]
max_length = x.shape[1]
vec_dims = x.shape[2]

#alternatively, process all slices in the current file directory with name corpus_name
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log')

for i in corpus.all_indices_available:
     x, y = corpus.load(i)
     #flatten each doc's embedding matrix into a single feature vector
     x_f = np.reshape(x, (x.shape[0], x.shape[1] * x.shape[2]))
     clf.partial_fit(x_f, y, classes=[0, 1])

Future Plans

In the future I will build support for generating a word vector lookup dictionary for all (or some specified number of) words in the corpus. Additionally, I plan to use this dictionary to generate embedding matrices in the format Keras uses to load weights into an embedding layer, as sketched below.
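As a rough sketch of that planned format, assuming a hypothetical word-to-vector lookup dict, this is the standard way such a matrix is handed to a Keras embedding layer:

import numpy as np
from tensorflow.keras.layers import Embedding

#Hypothetical word -> vector dict and vocabulary index (index 0 = padding)
lookup = {'good': np.random.rand(200), 'movie': np.random.rand(200)}
word_index = {'good': 1, 'movie': 2}

#Stack the vectors into a (vocab_size + 1, vector_dimensions) weight matrix
embedding_matrix = np.zeros((len(word_index) + 1, 200))
for word, i in word_index.items():
    embedding_matrix[i] = lookup[word]

#Load the matrix as frozen weights in an Embedding layer
layer = Embedding(input_dim=embedding_matrix.shape[0], output_dim=200,
                  weights=[embedding_matrix], trainable=False)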

License

None brah