jsonllm

Turn silly json into intelligent embeddings


License
Apache-2.0
Install
pip install jsonllm==0.1.0a2

Documentation

Tools for working with LLMs on JSON data

Usage

Usage: jsonllm [OPTIONS] COMMAND [ARGS]...

  Tools for working with LLMs on JSON data

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  embed  Turn a JSON of content into a JSON of embeddings.

Usage: jsonllm embed [OPTIONS]

  Turn a JSON of content into a JSON of embeddings.

Options:
  -i, --input PATH  File to embed
  -m, --model TEXT  Embedding model(s) to use
                    
                    Issue `llm embed-models list` to list available models.
                    
                    Currently installed are: ['3-large', '3-large-1024',
                    '3-large-256', '3-small', '3-small-512', 'ada-002',
                    'clip', 'jina-embeddings-v2-base-en', 'jina-
                    embeddings-v2-large-en', 'jina-embeddings-v2-small-en',
                    'onnx-bge-base', 'onnx-bge-large', 'onnx-bge-micro',
                    'onnx-bge-small', 'onnx-gte-tiny', 'onnx-minilm-l12',
                    'onnx-minilm-l6', 'sentence-transformers/all-MiniLM-L6-v2']
                    
                    You can install more via `llm install ...`
                    
                    You can find available models here: https://llm.datasette.io/en/stable/plugins/directory.html#embedding-models
  -j, --jq TEXT     Embed only the keys that satisfy the given jq filter
                    expression
  --in-arrays       Embed text appearing in arrays too
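  --help            Show this message and exit.

Example: embed the name field of generated records and load the result into Postgres.

CREATE TABLE people (data JSONB);

python tests/gen_people.py 100 |\
jsonllm embed -m clip -j '.name' |\
psql -c "\COPY people(data) FROM stdin"

Example: embed a single JSON object read from stdin.

echo '{"hello": "world"}' | jsonllm embed -m clip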

Installation

pip install jsonllm

Available Models

Available embedding models are those provided and installed via the llm package.

llm install llm-sentence-transformers
llm install llm-clip
llm install llm-embed-jina
llm install llm-embed-onnx

For an up-to-date list, see the llm plugin directory: https://llm.datasette.io/en/stable/plugins/directory.html#embedding-models

Why

Plenty of tools can now produce embeddings from a corpus of text. Some can even generate embeddings from JSON documents, but they treat the JSON as plain text too.

JSON is rarely just text, though: JSON documents have structure, and their semantics depend on the application that uses them. Most importantly, JSON is a data exchange format and a data aggregation tool, aggregation in the sense of getting data from A to B.

In my case, point A was a JSON object created by an SQL query against a Postgres database, piped through jsonllm, and pushed into another Postgres instance set up specifically for AI-related experiments.

How

jsonllm traverses a JSON object recursively and replaces each text value with its embedding array.

Other data types are not modified at all and the overall object structure is not changed.
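
The traversal described above can be sketched in a few lines of Python. This is an illustration of the idea, not jsonllm's actual implementation: the names embed_json and toy_embed are hypothetical, the embedder is a stand-in that returns a one-element vector, and the in_arrays handling is only an assumption about how the --in-arrays flag might behave (strings sitting directly in an array are left alone unless it is set).

```python
from typing import Any, Callable, List

# Any callable that maps a string to a vector can serve as the embedder.
Embedder = Callable[[str], List[float]]


def embed_json(obj: Any, embed: Embedder, in_arrays: bool = False,
               _in_array: bool = False) -> Any:
    """Return a copy of obj with text values replaced by their embeddings.

    Hypothetical sketch: objects and arrays are walked recursively, strings
    are passed to embed, and every other value is returned unchanged.
    """
    if isinstance(obj, dict):
        # Keys are kept as-is; only the values are visited.
        return {key: embed_json(value, embed, in_arrays)
                for key, value in obj.items()}
    if isinstance(obj, list):
        # Remember that the children sit directly inside an array.
        return [embed_json(item, embed, in_arrays, _in_array=True)
                for item in obj]
    if isinstance(obj, str):
        # Assumption: array members are embedded only when in_arrays is set.
        if _in_array and not in_arrays:
            return obj
        return embed(obj)
    # Numbers, booleans and null pass through untouched.
    return obj


def toy_embed(text: str) -> List[float]:
    """Stand-in embedder for the example: a one-dimensional 'vector'."""
    return [float(len(text))]


if __name__ == "__main__":
    doc = {"name": "Ada", "age": 36, "tags": ["x", "yz"]}
    print(embed_json(doc, toy_embed))
    print(embed_json(doc, toy_embed, in_arrays=True))
```

Taking the embedder as a plain callable keeps the traversal independent of any particular model. With the llm package installed, something like llm.get_embedding_model("clip").embed could presumably be passed in place of toy_embed, though that wiring is not taken from jsonllm itself.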

Development

pip install -e '.[test]'
pytest