walledeval

An open-source toolkit to test LLMs against jailbreaks and unprecedented harms.


Keywords
NLP, deep learning, transformer, language model, jailbreaking, red-teaming
License
MIT
Install
pip install walledeval==0.1.0

Documentation

walledeval

Test LLMs against jailbreaks and unprecedented harms

WalledEval is a simple library to test LLM safety by identifying whether text generated by an LLM is indeed safe. We purposefully run benchmarks containing harmful information and toxic prompts to see whether the LLM flags such malicious prompts.

Install

pip install walledeval 

Basic Usage

LLMs (walledeval.llm)

We support the following LLM types:

| Class | LLM Type |
| ----- | -------- |
| HF_LLM(id, system_prompt = "") | Any HuggingFace LLM that supports Text Generation, specified with the id parameter |
| Claude(api_key, system_prompt = "") | Claude 3 Opus |

Usage is as follows:

>>> from walledeval.llm import HF_LLM, Claude

>>> hf_llm = HF_LLM("<insert llm identifier>")
>>> hf_llm.generate("How are you?")
# <output>

>>> claude = Claude("INSERT_API_KEY")
>>> claude.generate("How are you?")
# <output>

A custom abstract llm.LLM class is also defined to support other LLMs. Its constructor takes the model identifier name and an optional system prompt system_prompt, and subclasses implement the abstract method generate(text: str) -> str.
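
For example, a wrapper around another backend might look like the sketch below. This is a minimal sketch, not part of the library: EchoLLM is a hypothetical class, and the base-class constructor signature is assumed from the description above.

from walledeval.llm import LLM

class EchoLLM(LLM):
    # Hypothetical LLM that echoes its input; for illustration only.
    def __init__(self, system_prompt: str = ""):
        # Assumed signature: the base class takes the model identifier
        # name and an optional system prompt, per the description above.
        super().__init__("echo-llm", system_prompt)

    def generate(self, text: str) -> str:
        # A real subclass would call a model or an external API here.
        return f"Echo: {text}"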

Judges (walledeval.judge)

Judges are used to identify whether outputs are malignant. We currently support one judge, ClaudeJudge, which uses Claude 3 Opus and a custom-defined taxonomy to test whether outputs are malignant. Its check method returns False if the output is malignant (i.e. the output did not pass the test).

Usage is as follows:

>>> from walledeval.judge import ClaudeJudge

>>> judge = ClaudeJudge("INSERT_API_KEY")
>>> judge.check("<insert output>")
# <boolean output>

A custom abstract judge.Judge class is also defined to support other possible judges. Its constructor takes the judge identifier name, and subclasses implement the abstract method check(text: str) -> bool.
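
For instance, a trivial keyword-based judge might look like the following sketch. KeywordJudge and its keyword list are hypothetical, and the base-class constructor signature is assumed from the description above.

from walledeval.judge import Judge

class KeywordJudge(Judge):
    # Hypothetical judge that fails any output containing a banned keyword.
    def __init__(self, banned=("<insert keyword 1>", "<insert keyword 2>")):
        # Assumed signature: the base class takes the judge identifier name.
        super().__init__("keyword-judge")
        self.banned = banned

    def check(self, text: str) -> bool:
        # Mirror ClaudeJudge's convention: False means malignant (failed).
        lowered = text.lower()
        return not any(word in lowered for word in self.banned)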

Benchmarks (walledeval.benchmark)

Benchmarks provide datasets for testing both LLMs and judges. We currently support the following benchmark:

| Benchmark Name | Class |
| -------------- | ----- |
| WMDP Benchmark | WMDP |

Usage is as follows:

>>> from walledeval.benchmark import WMDP

>>> wmdp = WMDP()

>>> wmdp.test(llm, judge)
# <logs>
# generator[logs]
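
Since test returns a generator of logs, they can be consumed lazily. A minimal sketch, assuming each yielded log is printable:

>>> for log in wmdp.test(llm, judge):
...     print(log)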

A custom abstract benchmark.Benchmark class is also defined so you can write your own benchmarks. We recommend reading the codebase, in particular the WMDP implementation, to understand the general flow.
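
As a rough starting point, a custom benchmark might look like the sketch below. SimpleBenchmark and its prompt list are hypothetical, and the assumption that test(llm, judge) yields one log per prompt simply mirrors the generator behaviour shown for WMDP above; check the Benchmark base class for the actual interface.

from walledeval.benchmark import Benchmark

class SimpleBenchmark(Benchmark):
    # Hypothetical benchmark over a fixed list of toxic prompts.
    PROMPTS = [
        "<insert toxic prompt 1>",
        "<insert toxic prompt 2>",
    ]

    def test(self, llm, judge):
        # Assumed contract: yield one log entry per prompt, as WMDP does.
        for prompt in self.PROMPTS:
            output = llm.generate(prompt)
            passed = judge.check(output)
            yield {"prompt": prompt, "output": output, "passed": passed}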