hf_clean_benchmarks
This repository is heavily inspired by the BigCode repository and is mostly a refactoring of their code. Most of the original code was written by Chenghao Mou (awesome work!).
Install
pip install hf_clean_benchmarks
How to use
Using the API
First you need to specify which benchmarks you want to clean your data of. You do this by creating a list of dictionaries, one per benchmark, giving the benchmark's name in Hugging Face's datasets repository, the splits to check, and the columns that contain the benchmark text. For example, if you want to clean your data of the HumanEval and LAMBADA benchmarks, you would do the following:
# Benchmarks to clean
benchmarks = [
    {
        "name": "openai_humaneval",
        "splits": ["test"],
        "columns": ["prompt", "canonical_solution", "test"],
    },
    {
        "name": "lambada",
        "splits": ["test"],
        "columns": ["text"],
    },
]
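Each entry simply points at a public dataset on the Hugging Face Hub. Purely as an illustration of what the name, splits, and columns fields refer to (the cleaner downloads these datasets internally, so you never need to do this yourself), you could inspect them with plain datasets calls:

from datasets import load_dataset

# Peek at the benchmark data each config entry points to (illustration only).
for bench in benchmarks:
    for split in bench["splits"]:
        ds = load_dataset(bench["name"], split=split)
        print(bench["name"], split, ds.num_rows, "columns used:", bench["columns"])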
You then pass this list to the BenchmarkCleaner class. This class downloads the benchmarks and constructs the suffix array for each one. You can then use the clean method to clean a Hugging Face dataset. For example:
from datasets import load_dataset
from hf_clean_benchmarks.core import BenchmarkCleaner
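# Instantiate the cleaner. (threshold and num_perm are forwarded to the
# underlying near-duplicate detection; in MinHash-based deduplication these
# usually mean the similarity threshold and the number of hash permutations.)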
cleaner = BenchmarkCleaner(benchmarks, threshold=0.1, num_perm=128)
# load your dataset
dataset = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")
# clean the dataset
cleaned_dataset = cleaner.clean(dataset, column="content")
Checking for false positives...: 100%|██████████| 8780/8780 [00:34<00:00, 251.05it/s]
Checking for false positives...: 100%|██████████| 8805/8805 [07:34<00:00, 19.39it/s]
[11/06/22 10:34:43] INFO Data Number : 10000 core.py:210
INFO Duplicate Number : 4033 core.py:211
INFO Duplicate Rate : 40.33% core.py:212
INFO Total Time : 493.73 seconds core.py:213
cleaned_dataset
Dataset({
    features: ['content', 'avg_line_length', 'max_line_length', 'alphanum_fraction', 'licenses', 'repository_name', 'path', 'size', 'lang', '__id__'],
    num_rows: 5967
})
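The return value is a regular datasets.Dataset, so you can persist it with the usual export methods; for example (the output paths below are just placeholders):

# Save the decontaminated data as JSON Lines, or in the datasets on-disk format.
cleaned_dataset.to_json("the-stack-smol-python.cleaned.jsonl")
cleaned_dataset.save_to_disk("the-stack-smol-python-cleaned")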
Using the CLI
First you need to specify which benchmarks you want to clean your data of. You do this by creating a JSON file containing a list of objects, one per benchmark, giving the benchmark's name in Hugging Face's datasets repository, the splits to check, and the columns that contain the benchmark text. For example, if you want to clean your data of the HumanEval and LAMBADA benchmarks, you would do the following:
file: benchmarks.json
[
    {
        "name": "openai_humaneval",
        "splits": ["test"],
        "columns": ["prompt", "canonical_solution", "test"]
    },
    {
        "name": "lambada",
        "splits": ["test"],
        "columns": ["text"]
    }
]
You then pass this JSON file to the clean_dataset command, which downloads the benchmarks, constructs the suffix array for each one, and cleans the given Hugging Face dataset. For example:
clean_dataset \
    --dataset_name bigcode/the-stack-smol \
    --column_name content \
    --benchmark_configs_path benchmarks.json \
    --output_path /tmp/test.jsonl \
    --data_dir data/python \
    --dataset_split train \
    --save_json
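The cleaned rows are written to --output_path. Assuming --save_json produces a JSON Lines file, as the name suggests, you can load the result back with the standard datasets JSON loader:

from datasets import load_dataset

# Reload the cleaned data written by the CLI.
cleaned = load_dataset("json", data_files="/tmp/test.jsonl", split="train")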