vecdb

VecDB is a tool that helps you store, manage, and compare embedding vectors for machine learning applications.


Keywords

deep-learning, embedding-vectors, hacktoberfest, latent-space, machine-learning, pypi, siamese-network

Install

pip install vecdb==0.0.0

Documentation

Embeddings

Machine learning applications often need to store a large number of embedding vectors and compare them using functions such as cosine similarity or Euclidean distance. VecDB does this quickly and efficiently by storing the vectors in HDF5 files.
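
As a rough illustration of the two comparison functions mentioned above, here is a minimal NumPy sketch. The helper names cosine_sim and euclidean_dist are made up for this example and are not part of VecDB.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    """Straight-line distance between a and b: 0.0 means identical."""
    return float(np.linalg.norm(a - b))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

sim = cosine_sim(a, b)       # higher means more similar
dist = euclidean_dist(a, b)  # lower means more similar
```

Note the opposite orientation of the two measures: a similarity is best when it is large, a distance when it is small.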

Dependencies

Usage

Inside the HDF5 file, the data can be split into datasets. This lets you group your embeddings into categories, such as separate albums in a face recognition system. If you don't need this, omit the dataset name and the main dataset will be used.

Initialize

The class constructor creates the file that holds the data. If the file already exists, it is loaded instead.

from vecdb import VecDB

# Load or create h5 file
db = VecDB(emb_dim=512, filepath='data.h5')
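
To make the idea of named datasets inside a single HDF5 file more concrete, the following sketch uses h5py directly. It is not VecDB's implementation: the helper append_rows, the dataset names album_a and album_b, and the float32 layout are all assumptions chosen for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

def append_rows(path, dataset, rows):
    """Append rows to a resizable dataset, creating file and dataset if needed.

    Hypothetical helper. Returns the index of the last row written.
    """
    rows = np.atleast_2d(rows).astype("float32")
    with h5py.File(path, "a") as f:  # "a": open read/write, create if missing
        if dataset not in f:
            f.create_dataset(
                dataset,
                shape=(0, rows.shape[1]),
                maxshape=(None, rows.shape[1]),  # unlimited number of rows
                dtype="float32",
            )
        ds = f[dataset]
        start = ds.shape[0]
        ds.resize(start + rows.shape[0], axis=0)
        ds[start:] = rows
        return ds.shape[0] - 1

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
last_a = append_rows(path, "album_a", np.random.randn(5, 8))
last_b = append_rows(path, "album_b", np.random.randn(3, 8))
last_a2 = append_rows(path, "album_a", np.random.randn(8))  # a single vector

with h5py.File(path, "r") as f:
    names = sorted(f.keys())
    shape_a = f["album_a"].shape
```

Each dataset keeps its own row numbering, so an index is only meaningful together with the dataset it belongs to.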

Store new data

To add new entries to the database, use the store method. It returns the index of the last inserted entry, which is useful for referencing the embedding vector from a separate database, such as SQL or NoSQL.

import numpy as np

# One or more vectors (here, 100,000 vectors of dimension 512)
embeddings = np.random.randn(100000, 512)

# Store in the main dataset
last_index = db.store(embeddings)

# Store in a specific dataset
last_index = db.store(embeddings, 'my_dataset')
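
One way to use the returned index is to keep it next to the rest of a record in a relational table. The sketch below uses an in-memory SQLite database. The table people, the helpers batch_indices and name_for, and the value of last_index are hypothetical, and the code assumes that a batch is stored at contiguous indices ending at the returned value, which the description above does not confirm.

```python
import sqlite3

def batch_indices(last_index, batch_size):
    """Indices of a stored batch, ASSUMING entries are appended contiguously."""
    first = last_index - batch_size + 1
    return list(range(first, last_index + 1))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, embedding_index INTEGER)")

names = ["alice", "bob", "carol"]
last_index = 2  # stand-in for the value returned by db.store(...)

# One row per person, pointing at that person's embedding vector
rows = list(zip(names, batch_indices(last_index, len(names))))
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

def name_for(index):
    """Look up the person whose embedding is stored at the given index."""
    row = conn.execute(
        "SELECT name FROM people WHERE embedding_index = ?", (index,)
    ).fetchone()
    return row[0] if row else None
```

With this mapping in place, an index returned by a comparison can be translated back into a record with name_for.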

Update entries

To update a specific embedding vector, use the update method with the entry's index and the new vector.

# New embedding vector
new_embedding = np.random.randn(512)

# Update the vector with index 100 in the main dataset
db.update(100, new_embedding)

# Update the vector with index 100 in a specific dataset
db.update(100, new_embedding, 'my_dataset')

Delete entries

Entries can be deleted with the delete method by specifying the index of the entry to delete.

# Delete entry in the main dataset
db.delete(200)

# Delete entry in a specific dataset
db.delete(200, 'my_dataset')

Compare

To compare a new embedding vector with all entries in a dataset, use the comparison method, called most in the example below. Its func argument specifies the function used to compare the entries.

import numpy as np
# Note: scikit-learn names the distance function euclidean_distances (plural)
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Embedding to be compared
embedding = np.random.randn(512)

# Calculates the cosine similarity with all vectors in the main dataset
similar = db.most(
  embedding,
  func=cosine_similarity,
  n=3,
  desc=True
)

# Calculates the euclidean distance with all vectors in a specific dataset
similar = db.most(
  embedding,
  func=euclidean_distances,
  n=3,
  desc=True,
  dataset='my_dataset'
)
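
For readers curious about what such a comparison involves, here is a minimal NumPy sketch of a top-n search. It is not VecDB's code: top_n, cosine_pairwise and euclidean_pairwise are hypothetical helpers, written to follow the same (query matrix, data matrix) calling convention as scikit-learn's pairwise functions, and the meaning given to n and desc is an assumption based on the call above.

```python
import numpy as np

def cosine_pairwise(X, Y):
    """Cosine similarity between rows of X and rows of Y, shape (len(X), len(Y))."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def euclidean_pairwise(X, Y):
    """Euclidean distance between rows of X and rows of Y, shape (len(X), len(Y))."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def top_n(embedding, vectors, func, n=3, desc=True):
    """Score one embedding against every row and return n (index, score) pairs."""
    scores = func(embedding.reshape(1, -1), vectors)[0]
    order = np.argsort(scores)  # ascending by score
    if desc:
        order = order[::-1]     # largest score first
    return [(int(i), float(scores[i])) for i in order[:n]]

vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])

most_similar = top_n(query, vectors, cosine_pairwise, n=2, desc=True)
nearest = top_n(query, vectors, euclidean_pairwise, n=2, desc=False)
```

In this sketch a similarity is sorted in descending order to find the closest matches, while a distance is sorted in ascending order. Sorting distances in descending order would return the farthest entries instead.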

License

MIT © jrmiranda