LLM testing on steroids

pip install redlite==0.3.9




An opinionated toolset for testing Conversational Language Models.




Usage

  1. Install required dependencies

    pip install "redlite[all]"
  2. Generate several runs with a Python script (see the examples and the Python API section below)

  3. Review and compare runs

    redlite server --port <PORT>
  4. Optionally, upload to Zeno

    ZENO_API_KEY=zen_XXXX redlite upload

Python API

import os
from redlite import run, load_dataset
from redlite.model.openai_model import OpenAIModel
from redlite.metric import MatchMetric

model = OpenAIModel(api_key=os.environ["OPENAI_API_KEY"])
dataset = load_dataset("hf:innodatalabs/rt-gsm8k-gaia")
metric = MatchMetric(ignore_case=True, ignore_punct=True, strategy='prefix')

run(model=model, dataset=dataset, metric=metric)

Note: the code above calls an OpenAI model through the OpenAI API. You will need to register with OpenAI, obtain an API access key, and set it in the environment as OPENAI_API_KEY.
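
To make the metric settings in the example above more concrete, here is a minimal, self-contained sketch of what a case- and punctuation-insensitive prefix match could compute. This is not redlite's implementation: the helper names `normalize` and `prefix_match` are hypothetical, and the sketch assumes that a prefix strategy scores 1.0 when the model's answer starts with the expected answer and 0.0 otherwise.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str, ignore_case: bool = True, ignore_punct: bool = True) -> str:
    """Lower-case, strip punctuation, and collapse whitespace (hypothetical helper)."""
    if ignore_case:
        text = text.lower()
    if ignore_punct:
        text = text.translate(_PUNCT_TABLE)
    return " ".join(text.split())


def prefix_match(expected: str, actual: str) -> float:
    """Return 1.0 if the normalized answer starts with the normalized expected text."""
    return 1.0 if normalize(actual).startswith(normalize(expected)) else 0.0


print(prefix_match("42", "42. The answer follows from the sum."))  # 1.0
print(prefix_match("Paris", "I think it is Paris"))                # 0.0
```

A prefix comparison of this kind rewards models that state the answer first and explain afterwards, which is why the second call scores zero even though the correct word appears later in the response.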


Goals

  • simple, easy-to-learn API
  • lightweight
  • only necessary dependencies
  • framework-agnostic (PyTorch, TensorFlow, Keras, Flax, JAX)
  • basic analytic tools included


Development

python -m venv .venv
. .venv/bin/activate
pip install -e ".[dev,all]"

Make commands:

  • test
  • test-server
  • lint
  • wheel
  • docs
  • docs-server
  • black

Zeno <zenoml.com> integration

Benchmarks can be uploaded to the Zeno interactive AI evaluation platform <hub.zenoml.com>:

redlite upload --project my-cool-project

All tasks will be concatenated and uploaded as a single dataset, with extra fields:

  • task_id
  • dataset
  • metric

All models will be uploaded. If a model was not tested on a specific task, a simulated zero-score dataframe is used in its place.

Use task_id (or dataset as appropriate) to create task slices. Slices can be used to navigate data or create charts.
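
The concatenation described above can be pictured with a small sketch. It uses plain dictionaries in place of dataframes, and everything in it (the function `build_upload_rows`, the input layout, the `id` column, the task and model names) is hypothetical and chosen for illustration only; it does not reproduce what `redlite upload` does internally.

```python
from typing import Dict, List


def build_upload_rows(
    tasks: Dict[str, dict],
    scores: Dict[str, Dict[str, List[float]]],
) -> List[dict]:
    """Concatenate all tasks into one table with task_id, dataset and metric columns.

    tasks:  task_id -> {"dataset": str, "metric": str, "items": [item ids]}
    scores: model name -> task_id -> per-item scores (same order as "items")
    """
    rows = []
    for task_id, task in tasks.items():
        for index, item_id in enumerate(task["items"]):
            row = {
                "id": f"{task_id}/{item_id}",
                "task_id": task_id,
                "dataset": task["dataset"],
                "metric": task["metric"],
            }
            for model, per_task in scores.items():
                task_scores = per_task.get(task_id)
                # A model that never ran this task gets a simulated zero score.
                row[model] = task_scores[index] if task_scores is not None else 0.0
            rows.append(row)
    return rows


tasks = {
    "task-a": {"dataset": "ds-a", "metric": "match", "items": ["q1", "q2"]},
    "task-b": {"dataset": "ds-b", "metric": "match", "items": ["q1"]},
}
scores = {
    "model-1": {"task-a": [1.0, 0.0], "task-b": [1.0]},
    "model-2": {"task-a": [0.0, 1.0]},  # model-2 was not tested on task-b
}

rows = build_upload_rows(tasks, scores)

# A "slice" is then just a filter on task_id (or dataset).
task_a_slice = [row for row in rows if row["task_id"] == "task-a"]
print(len(rows), len(task_a_slice))  # 3 2
```

Filling untested tasks with zeros keeps the table rectangular, so every model has a value in every row, at the cost of lowering the aggregate score of models that skipped a task.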

Serving as a static website

UI server data and code can be exported to a local directory that can then be served statically.

This is useful for publishing as a static website on cloud storage (S3, Google Cloud Storage).

redlite server-freeze /tmp/my-server
gsutil -m rsync -R /tmp/my-server gs://{your GS bucket}

Note that you have to configure the cloud bucket so that the cloud provider serves it as a website. How to do this depends on the cloud provider.
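
Before uploading, it can help to check the exported directory locally. The sketch below is one way to do that using only the Python standard library; it is not part of redlite, the helper name `make_static_server` is hypothetical, and it assumes the export can be served by a plain file server with no special routing.

```python
import functools
import http.server


def make_static_server(directory: str, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Create a local file server rooted at `directory` (hypothetical helper).

    With port=0 the operating system picks a free port; read it back from
    `server.server_address[1]`.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


# Example usage (blocks until interrupted with Ctrl+C):
#
#     server = make_static_server("/tmp/my-server", port=8000)
#     print(f"Serving on http://127.0.0.1:{server.server_address[1]}")
#     server.serve_forever()
```

The server binds to 127.0.0.1 only, so the preview is not reachable from other machines.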


TODO

  • deps cleanup (randomname!)
  • review/improve module structure
  • automate CI/CD
  • write docs
  • publish docs automatically (CI/CD)
  • web UI styling
  • better test server
  • tests
  • Integrate HF models
  • Integrate OpenAI models
  • Integrate Anthropic models
  • Integrate AWS Bedrock models
  • Integrate vLLM models
  • Fix data format in HF datasets (innodatalabs/rt-* ones) to match standard
  • more robust backend API (future-proof)
  • better error handling for missing deps
  • document which deps we need when
  • export to CSV
  • Upload to Zeno