An opinionated toolset for testing Conversational Language Models.
https://innodatalabs.github.io/redlite/
- Install required dependencies:
  pip install redlite[all]
- Generate several runs (using Python scripting; see the examples and the script below).
- Review and compare runs:
  redlite server --port <PORT>
- Optionally, upload to Zeno:
  ZENO_API_KEY=zen_XXXX redlite upload
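import os

from redlite import run, load_dataset
from redlite.model.openai_model import OpenAIModel
from redlite.metric import MatchMetric

# Model under test: OpenAI, authenticated with the key from OPENAI_API_KEY
model = OpenAIModel(api_key=os.environ["OPENAI_API_KEY"])
# Benchmark dataset (the "hf:" prefix points at the Hugging Face hub)
dataset = load_dataset("hf:innodatalabs/rt-gsm8k-gaia")
# Scoring: match that ignores case and punctuation, using the 'prefix' strategy
metric = MatchMetric(ignore_case=True, ignore_punct=True, strategy='prefix')

run(model=model, dataset=dataset, metric=metric)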
Note: the code above uses an OpenAI model via the OpenAI API.
You will need to register with OpenAI, obtain an API access key, and set it in the environment as OPENAI_API_KEY.
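If the key is missing, `os.environ["OPENAI_API_KEY"]` fails with a bare `KeyError`. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper written for this example, not part of redlite:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a readable error.

    Hypothetical helper for illustration; redlite does not ship it.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running the script."
        )
    return value
```

In the script above, one could then write `api_key = require_env("OPENAI_API_KEY")` and pass the result to `OpenAIModel`.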
- simple, easy-to-learn API
- lightweight
- only necessary dependencies
- framework-agnostic (PyTorch, TensorFlow, Keras, Flax, JAX)
- basic analytic tools included
python -m venv .venv
. .venv/bin/activate
pip install -e .[dev,all]
Make commands:
- test
- test-server
- lint
- wheel
- docs
- docs-server
- black
Benchmarks can be uploaded to the Zeno interactive AI evaluation platform <hub.zenoml.com>:
redlite upload --project my-cool-project
All tasks will be concatenated and uploaded as a single dataset, with extra fields:
- task_id
- dataset
- metric
All models will be uploaded. If a model was not tested on a specific task, a simulated zero-score dataframe is used in its place.
Use task_id (or dataset, as appropriate) to create task slices. Slices can be used to navigate data or create charts.
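To make the zero-score fill and the task slices concrete, here is a simplified sketch in plain Python. It assumes one aggregate row per (model, task) pair, whereas the real upload works on dataframes; the function names and the `model`/`score` fields are hypothetical and not taken from redlite or Zeno:

```python
from collections import defaultdict
from itertools import product


def fill_missing_scores(rows):
    """Add a zero-score row for every (model, task_id) pair absent from `rows`."""
    models = sorted({r["model"] for r in rows})
    tasks = sorted({r["task_id"] for r in rows})
    seen = {(r["model"], r["task_id"]) for r in rows}
    filled = list(rows)
    for model, task_id in product(models, tasks):
        if (model, task_id) not in seen:
            # Simulated result for a model that never ran this task
            filled.append({"model": model, "task_id": task_id, "score": 0.0})
    return filled


def slice_by_task(rows):
    """Group rows by task_id, similar in spirit to a task slice."""
    slices = defaultdict(list)
    for row in rows:
        slices[row["task_id"]].append(row)
    return dict(slices)


rows = [
    {"model": "model-a", "task_id": "task-1", "score": 0.8},
    {"model": "model-a", "task_id": "task-2", "score": 0.6},
    {"model": "model-b", "task_id": "task-1", "score": 0.7},
]
filled = fill_missing_scores(rows)  # adds a zero-score row for model-b on task-2
slices = slice_by_task(filled)      # {"task-1": [...], "task-2": [...]}
```

Filling the gaps keeps every model present in every slice, so per-task charts compare the same set of models.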
The UI server's data and code can be exported to a local directory that can then be served statically.
This is useful for publishing the results as a static website on cloud storage (S3, Google Cloud Storage).
redlite server-freeze /tmp/my-server
gsutil -m rsync -R /tmp/my-server gs://{your GS bucket}
Note that the bucket must be configured so that the cloud provider serves it as a static website. How to do this depends on the cloud provider.
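Before syncing to a bucket, it can help to check that the exported directory works when served as plain static files. The sketch below uses only the Python standard library; `serve_directory` is a hypothetical helper, and the `/tmp/my-server` path simply reuses the example above:

```python
import functools
import http.server
import threading


def serve_directory(directory: str, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP on localhost in a background thread.

    With port=0 the OS picks a free port; read it from server.server_address[1].
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Usage (assumed paths): server = serve_directory("/tmp/my-server", port=8000)
# Open http://127.0.0.1:8000/ in a browser, then call server.shutdown() when done.
```

This is only a local preview; it does not replace the provider-specific website configuration of the bucket.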
- deps cleanup (randomname!)
- review/improve module structure
- automate CI/CD
- write docs
- publish docs automatically (CI/CD)
- web UI styling
- better test server
- tests
- Integrate HF models
- Integrate OpenAI models
- Integrate Anthropic models
- Integrate AWS Bedrock models
- Integrate vLLM models
- Fix data format in HF datasets (innodatalabs/rt-* ones) to match standard
- more robust backend API (future-proof)
- better error handling for missing deps
- document which deps we need when
- export to CSV
- Upload to Zeno