Interface to finalfusion written in (almost) pure Python.

ffp supports reading from various embedding formats and allows more liberal construction of embeddings from components than the other finalfusion interfaces. Lots of pretrained finalfrontier embeddings are available here; fastText embeddings converted to finalfusion can be found here.

Documentation can be found at https://ffp.readthedocs.io/.

This is an early version of ffp; feedback is very much appreciated!
ffp supports reading the most widely used embedding formats, including finalfusion, text(-dims), word2vec binary, and fastText. All finalfusion chunks are supported by ffp, including quantized storages.
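Quantized storages keep per-subquantizer codebooks and store each embedding as a short code vector; the toy numpy sketch below illustrates the reconstruction idea only (the sizes and variable names are illustrative and not ffp's API):

```python
import numpy as np

rng = np.random.default_rng(0)
n_subquantizers, n_centroids, sub_dim = 4, 256, 25  # toy sizes

# one codebook of centroids per subquantizer
codebooks = rng.random((n_subquantizers, n_centroids, sub_dim), dtype=np.float32)
# a quantized vector stores one centroid index per subquantizer
codes = rng.integers(0, n_centroids, size=n_subquantizers)
# reconstruction concatenates the selected centroids
vector = np.concatenate([codebooks[i, codes[i]] for i in range(n_subquantizers)])
print(vector.shape)  # (100,)
```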
ffp provides construction, reading, and writing of single finalfusion chunks, i.e., vocabularies, storages, norms, etc. can be read from or written to a finalfusion file in any combination. The only assumption about what constitutes a finalfusion file is that it contains at least a single chunk.
ffp integrates directly with numpy, as the NdArray storage is a subclass of numpy.ndarray. All common numpy operations are available for this storage type.
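Because NdArray subclasses numpy.ndarray, a loaded storage can be used wherever a plain array works. A small sketch of what that enables, using a random matrix as a stand-in for a real storage loaded from a file:

```python
import numpy as np

# stand-in for an NdArray storage (a real one would be loaded from a file)
storage = np.random.rand(5, 300).astype(np.float32)

# common numpy operations apply directly, e.g. row-wise l2 normalization
norms = np.linalg.norm(storage, axis=1)
unit_rows = storage / norms[:, np.newaxis]
print(np.allclose(np.linalg.norm(unit_rows, axis=1), 1.0))  # True
```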
Currently supported file formats:

- finalfusion
- text(-dims)
- word2vec binary
- fastText

Currently supported finalfusion chunks:

- NdArray (mmap, in-memory)
- QuantizedStorage (mmap, in-memory)
- all vocabulary types
Installation:

- from PyPI:

  ```bash
  pip install ffp
  ```

- from source:

  ```bash
  git clone https://github.com/sebpuetz/ffp
  cd ffp
  pip install cython
  python setup.py install
  ```
- ...read embeddings from a file in finalfusion format and query for an embedding:

  ```python
  import ffp
  import numpy as np

  embeddings = ffp.load_finalfusion("path/to/file.fifu")
  res = embeddings["Query"]
  # reading into an output array
  in_vocab_embeddings = np.zeros((len(embeddings.vocab), embeddings.storage.shape[1]))
  for i, word in enumerate(embeddings.vocab):
      # Embeddings.embedding also returns `out`
      out = embeddings.embedding(word, out=in_vocab_embeddings[i])
      assert np.allclose(in_vocab_embeddings[i], out)
  ```
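Once the in-vocabulary matrix is filled, further queries reduce to plain numpy operations. For instance, a cosine-similarity ranking could be sketched like this (random stand-in data in place of a real embedding matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.random((100, 300), dtype=np.float32)  # stand-in for in_vocab_embeddings
query = matrix[42]

# cosine similarity of the query against every row
sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
print(sims.argmax())  # the query's own row ranks first
```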
- ...read the vocabulary from a file in finalfusion format:

  ```python
  import ffp

  vocab = ffp.vocab.load_vocab("path/to/file.fifu")
  ```
- ...construct an ExplicitVocab from a corpus and write it to a file:

  ```python
  import ffp

  # discard all ngrams appearing less than 30 times in the corpus
  ngram_cutoff = ffp.vocab.Cutoff(30, "min_freq")
  # keep less than 500,000 tokens in the vocabulary, setting the cutoff at
  # the next frequency boundary
  token_cutoff = ffp.vocab.Cutoff(500000, "target_size")
  # extract ngrams in range 3 to 6 (including 6)
  ngram_range = (3, 6)
  vocab, token_counts, ngram_counts = ffp.vocab.ExplicitVocab.from_corpus(
      "whitespace-tokenized-corpus.txt", ngram_range, token_cutoff, ngram_cutoff)
  vocab.write("explicit_vocab.fifu")
  ```
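The ngram range above controls fastText-style character ngram extraction. A minimal sketch of that scheme, assuming the usual bracketing of tokens with `<` and `>` (this is an illustration, not ffp's internal code):

```python
def char_ngrams(word, min_n=3, max_n=6):
    # the token is bracketed before ngrams are taken, so "to" becomes "<to>"
    bracketed = f"<{word}>"
    return [bracketed[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(bracketed) - n + 1)]

print(char_ngrams("to"))  # ['<to', 'to>', '<to>']
```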
- ...construct Embeddings from a SimpleVocab extracted from a corpus and a randomly initialized matrix:

  ```python
  import ffp
  import numpy as np

  # keep less than 500,000 tokens in the vocabulary, setting the cutoff at
  # the next frequency boundary
  token_cutoff = ffp.vocab.Cutoff(500000, "target_size")
  vocab, _ = ffp.vocab.SimpleVocab.from_corpus("whitespace-tokenized-corpus.txt", token_cutoff)
  rand_matrix = np.float32(np.random.rand(vocab.idx_bound, 300))
  storage = ffp.storage.NdArray(rand_matrix)
  e = ffp.embeddings.Embeddings(vocab=vocab, storage=storage)
  ```