walledeval

An open-source toolkit to test LLMs against jailbreaks and unprecedented harms.


Keywords
NLP, deep learning, transformer, language model, jailbreaking, red-teaming
License
MIT
Install
pip install walledeval==0.1.0

Documentation

walledeval

Test LLMs against jailbreaks and unprecedented harms

WalledEval is a simple library to test LLM safety by identifying whether text generated by an LLM is indeed safe. We purposefully run benchmarks containing harmful information and toxic prompts to see whether the LLM flags such malicious prompts.

Install

pip install walledeval 

Basic Usage

LLMs (walledeval.llm)

We support the following LLM types:

| Class | LLM Type |
| ----- | -------- |
| HF_LLM(id, system_prompt = "") | Any HuggingFace LLM that supports Text Generation, specified with the id parameter |
| Claude(api_key, system_prompt = "") | Claude 3 Opus |

Usage is as follows:

>>> from walledeval.llm import HF_LLM, Claude

>>> hf_llm = HF_LLM("<insert llm identifier>")
>>> hf_llm.generate("How are you?")
# <output>

>>> claude = Claude("INSERT_API_KEY")
>>> claude.generate("How are you?")
# <output>

A custom abstract llm.LLM class is also defined to support other LLMs. Its constructor takes the model identifier name and an optional system prompt system_prompt, and subclasses implement the abstract method generate(text: str) -> str.
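
For example, a wrapper around another backend might look like the sketch below. This is a minimal sketch, not part of the library: EchoLLM is a hypothetical class, and the base-class constructor signature is assumed from the description above.

from walledeval.llm import LLM

class EchoLLM(LLM):
    # Hypothetical LLM that echoes its input; for illustration only.
    def __init__(self, system_prompt: str = ""):
        # Assumed signature: the base class takes the model identifier
        # name and an optional system prompt, per the description above.
        super().__init__("echo-llm", system_prompt)

    def generate(self, text: str) -> str:
        # A real subclass would call a model or an external API here.
        return f"Echo: {text}"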

Judges (walledeval.judge)

Judges are used to identify whether outputs are malignant. We currently support one judge, ClaudeJudge, which uses Claude 3 Opus and a custom-defined taxonomy to test whether outputs are malignant. Its check method returns False if the output is malignant (i.e. the output did not pass the test).

Usage is as follows:

>>> from walledeval.judge import ClaudeJudge

>>> judge = ClaudeJudge("INSERT_API_KEY")
>>> judge.check("<insert output>")
# <boolean output>

A custom abstract judge.Judge class is also defined to support other possible judges. Its constructor takes the judge identifier name, and subclasses implement the abstract method check(text: str) -> bool.
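
For instance, a trivial keyword-based judge might look like the following sketch. KeywordJudge and its keyword list are hypothetical, and the base-class constructor signature is assumed from the description above.

from walledeval.judge import Judge

class KeywordJudge(Judge):
    # Hypothetical judge that fails any output containing a banned keyword.
    def __init__(self, banned=("<insert keyword 1>", "<insert keyword 2>")):
        # Assumed signature: the base class takes the judge identifier name.
        super().__init__("keyword-judge")
        self.banned = banned

    def check(self, text: str) -> bool:
        # Mirror ClaudeJudge's convention: False means malignant (failed).
        lowered = text.lower()
        return not any(word in lowered for word in self.banned)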

Benchmarks (walledeval.benchmark)

Benchmarks provide datasets for testing both LLMs and judges. We currently support the following benchmark:

| Benchmark Name | Class |
| -------------- | ----- |
| WMDP Benchmark | WMDP |

Usage is as follows:

>>> from walledeval.benchmark import WMDP

>>> wmdp = WMDP()

>>> wmdp.test(llm, judge)
# <logs>
# generator[logs]
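
Since test returns a generator of logs, they can be consumed lazily. A minimal sketch, assuming each yielded log is printable:

>>> for log in wmdp.test(llm, judge):
...     print(log)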

A custom abstract benchmark.Benchmark class is also defined so you can write your own benchmarks. We recommend reading the codebase, in particular the WMDP implementation, to understand the general flow.
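
As a rough starting point, a custom benchmark might look like the sketch below. SimpleBenchmark and its prompt list are hypothetical, and the assumption that test(llm, judge) yields one log per prompt simply mirrors the generator behaviour shown for WMDP above; check the Benchmark base class for the actual interface.

from walledeval.benchmark import Benchmark

class SimpleBenchmark(Benchmark):
    # Hypothetical benchmark over a fixed list of toxic prompts.
    PROMPTS = [
        "<insert toxic prompt 1>",
        "<insert toxic prompt 2>",
    ]

    def test(self, llm, judge):
        # Assumed contract: yield one log entry per prompt, as WMDP does.
        for prompt in self.PROMPTS:
            output = llm.generate(prompt)
            passed = judge.check(output)
            yield {"prompt": prompt, "output": output, "passed": passed}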