llm-dataset-converter

Python3 library for converting between various LLM dataset formats.


Keywords
conversion, dataset, llm, python3
License
MIT
Install
pip install llm-dataset-converter==0.2.4

Documentation

llm-dataset-converter

For converting large language model (LLM) datasets from one format into another. Filters can be supplied as well, e.g., for cleaning up the data.

Installation

Via PyPI:

pip install llm-dataset-converter

The latest code straight from the repository:

pip install git+https://github.com/waikato-llm/llm-dataset-converter.git

Docker

Docker images are available from:

Datasets

The following repository contains a curated list of datasets for LLMs:

https://github.com/Zjh-819/LLMDataHub

The Hugging Face Hub has an abundance of datasets as well:

https://huggingface.co/datasets

Dataset formats

The following dataset formats are supported:

Domain          Format         Read                Write             Compression
classification  CSV            from-csv-cl         to-csv-cl         Y
classification  Jsonlines      from-jsonlines-cl   to-jsonlines-cl   Y
classification  Parquet        from-parquet-cl     to-parquet-cl     N
classification  TSV            from-tsv-cl         to-tsv-cl         Y
pairs           Alpaca         from-alpaca         to-alpaca         Y
pairs           CSV            from-csv-pr         to-csv-pr         Y
pairs           Jsonlines      from-jsonlines-pr   to-jsonlines-pr   Y
pairs           Parquet        from-parquet-pr     to-parquet-pr     N
pairs           TSV            from-tsv-pr         to-tsv-pr         Y
pairs           XTuner         from-xtuner         to-xtuner         Y
pretrain        CSV            from-csv-pt         to-csv-pt         Y
pretrain        Jsonlines      from-jsonlines-pt   to-jsonlines-pt   Y
pretrain        Parquet        from-parquet-pt     to-parquet-pt     N
pretrain        TSV            from-tsv-pt         to-tsv-pt         Y
pretrain        TXT            from-txt-pt         to-txt-pt         Y [1]
translation     CSV            from-csv-t9n        to-csv-t9n        Y
translation     Jsonlines [2]  from-jsonlines-t9n  to-jsonlines-t9n  Y
translation     Parquet [3]    from-parquet-t9n    to-parquet-t9n    N
translation     TSV            from-tsv-t9n        to-tsv-t9n        Y
translation     TXT            from-txt-t9n        to-txt-t9n        Y [1]

[1] Compression is not available when concatenating the content into a single file.

[2] Format defined here.

[3] The translation data itself is stored as a JSON dictionary.

Compression formats

If a format supports compression, the following compression formats are automatically supported for loading/saving files:

  • bzip2
  • gzip
  • xz
  • zstd
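
For example, a compressed file can usually be consumed directly by a reader. A hedged sketch, assuming compression is detected from the file extension and that the reader/writer accept -i/--input and -o/--output options (check the plugin help for the actual options):

llm-convert \
  from-jsonlines-pr -i data.jsonl.gz \
  to-csv-pr -o data.csv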

File encodings

Most readers offer the --encoding option to override the automatically determined file encoding, which can be wrong because only a fixed number of bytes is inspected during detection. The number of bytes inspected can be controlled via the following environment variable:

LDC_ENCODING_MAX_CHECK_LENGTH

A value of -1 means the complete file is inspected. However, that can be very slow, so a smaller value (less than 1MB) is recommended.
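
For example, to limit the encoding detection to the first 64kB of each file (the value is illustrative):

export LDC_ENCODING_MAX_CHECK_LENGTH=65536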

Tools

Dataset conversion

usage: llm-convert [-h|--help|--help-all|-help-plugin NAME] [-u INTERVAL]
                   [-c {None,bz2,gz,xz,zstd}]
                   [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                   reader
                   [filter [filter [...]]]
                   [writer]

Tool for converting between large language model (LLM) dataset formats.

readers (20):
   from-alpaca, from-csv-cl, from-csv-pr, from-csv-pt, from-csv-t9n, 
   from-jsonlines-cl, from-jsonlines-pr, from-jsonlines-pt, 
   from-jsonlines-t9n, from-parquet-cl, from-parquet-pr, 
   from-parquet-pt, from-parquet-t9n, from-tsv-cl, from-tsv-pr, 
   from-tsv-pt, from-tsv-t9n, from-txt-pt, from-txt-t9n, from-xtuner
filters (38):
   assemble-sentences, change-case, classification-label-map, 
   file-filter, find-substr, inspect, keyword, language, 
   llama2-to-pairs, max-length-pt, max-records, metadata, 
   metadata-from-name, pairs-to-llama2, pairs-to-pretrain, 
   pretrain-sentences-to-classification, pretrain-sentences-to-pairs, 
   randomize-records, record-files, record-window, remove-blocks, 
   remove-empty, remove-patterns, replace-patterns, require-languages, 
   reset-ids, sentences-pt, skip-duplicate-ids, skip-duplicate-text, 
   split-pt, split-records, tee, text-length, text-stats, 
   to-llama2-format, translation-to-pairs, translation-to-pretrain, 
   update-pair-data
writers (20):
   to-alpaca, to-csv-cl, to-csv-pr, to-csv-pt, to-csv-t9n, 
   to-jsonlines-cl, to-jsonlines-pr, to-jsonlines-pt, to-jsonlines-t9n, 
   to-parquet-cl, to-parquet-pr, to-parquet-pt, to-parquet-t9n, 
   to-tsv-cl, to-tsv-pr, to-tsv-pt, to-tsv-t9n, to-txt-pt, to-txt-t9n, 
   to-xtuner

optional arguments:
  -h, --help              show basic help message and exit
  --help-all              show basic help message plus help on all plugins and exit
  --help-plugin NAME      show help message for plugin NAME and exit
  -u INTERVAL, --update_interval INTERVAL
                          outputs the progress every INTERVAL records (default: 1000)
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                          the logging level to use (default: WARN)
  -c {None,bz2,gz,xz,zstd}, --compression {None,bz2,gz,xz,zstd}
                          the type of compression to use when only providing an output
                          directory to the writer (default: None)
  -b, --force_batch       processes the data in batches
  -U, --unescape_unicode  unescape unicode characters in the command-line
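
The following is a sketch of a conversion pipeline, turning an Alpaca file into gzip-compressed pairs in JSON Lines format while dropping empty records. The per-plugin options (-i/--input on the reader, -o/--output on the writer) and the file names are assumptions for illustration only; use --help-all or --help-plugin to see the actual options of each plugin.

llm-convert \
  -l INFO -c gz \
  from-alpaca -i ./alpaca_data.json \
  remove-empty \
  to-jsonlines-pr -o ./output/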

Download

usage: llm-download [-h|--help|--help-all|-help-plugin NAME]
                    downloader

Tool for downloading data for large language models (LLMs).

downloaders:
   huggingface

optional arguments:
  -h, --help            show basic help message and exit
  --help-all            show basic help message plus help on all plugins and exit
  --help-plugin NAME    show help message for plugin NAME and exit
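
Downloaders define their own options; for example, to display what the huggingface downloader accepts:

llm-download --help-plugin huggingface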

Combining multiple files (one-after-the-other)

usage: llm-append [-h] [-i [INPUT [INPUT ...]]]
                  [-I [INPUT_LIST [INPUT_LIST ...]]]
                  [-t {csv,json,jsonlines,plain-text,tsv}] [-o FILE] [-p]
                  [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Tool for combining multiple text files by appending them.

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Path to the text file(s) to append; glob syntax is
                        supported (default: None)
  -I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
                        Path to the text file(s) listing the data files to
                        append (default: None)
  -t {csv,json,jsonlines,plain-text,tsv}, --file_type {csv,json,jsonlines,plain-text,tsv}
                        The type of files that are being processed. (default:
                        plain-text)
  -o FILE, --output FILE
                        The path of the file to store the combined data in;
                        outputs it to stdout if omitted or a directory
                        (default: None)
  -p, --pretty_print    Whether to output the JSON in more human-readable
                        format. (default: False)
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        The logging level to use. (default: WARN)
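
For example, appending several JSON Lines files into a single file (file names are illustrative):

llm-append \
  -i train-part1.jsonl train-part2.jsonl train-part3.jsonl \
  -t jsonlines \
  -o train-all.jsonl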

Combining multiple files (side-by-side)

usage: llm-paste [-h] [-i [INPUT [INPUT ...]]]
                 [-I [INPUT_LIST [INPUT_LIST ...]]] [-o FILE]
                 [-s [SEP [SEP ...]]] [-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]

Tool for combining multiple text files by placing them side-by-side.

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Path to the text file(s) to combine; glob syntax is
                        supported (default: None)
  -I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
                        Path to the text file(s) listing the data files to
                        combine (default: None)
  -o FILE, --output FILE
                        The path of the file to store the combined data in;
                        outputs it to stdout if omitted or a directory
                        (default: None)
  -s [SEP [SEP ...]], --separator [SEP [SEP ...]]
                        The separators to use between the files; uses TAB if
                        not supplied; use '{T}' as placeholder for tab
                        (default: None)
  -l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
                        The logging level to use (default: WARN)
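
For example, combining two text files column-wise with a tab separator (file names are illustrative):

llm-paste \
  -i english.txt german.txt \
  -s "{T}" \
  -o english-german.tsv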

File encodings

The following tool allows you to determine the encoding of text files.

usage: llm-file-encoding [-h] [-i [INPUT [INPUT ...]]]
                         [-I [INPUT_LIST [INPUT_LIST ...]]]
                         [-m MAX_CHECK_LENGTH] [-o FILE]
                         [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Tool for determining the file encoding of text files.

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Path to the text file(s) to check; glob syntax is
                        supported (default: None)
  -I [INPUT_LIST [INPUT_LIST ...]], --input_list [INPUT_LIST [INPUT_LIST ...]]
                        Path to the text file(s) listing the actual files to
                        check (default: None)
  -m MAX_CHECK_LENGTH, --max_check_length MAX_CHECK_LENGTH
                        The maximum number of bytes to use for checking
                        (default: None)
  -o FILE, --output FILE
                        The path of the file to store the determined encodings
                        in; outputs it to stdout if omitted or a directory
                        (default: None)
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        The logging level to use. (default: WARN)
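
For example, determining the encodings of all text files in a directory, inspecting at most 10kB per file (values are illustrative):

llm-file-encoding \
  -i "data/*.txt" \
  -m 10240 \
  -o encodings.txt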

Locating files

Readers typically support reading their input file names from file lists; the llm-find tool can be used to generate such lists.

usage: llm-find [-h] -i DIR [DIR ...] [-r] -o FILE [-m [REGEXP [REGEXP ...]]]
                [-n [REGEXP [REGEXP ...]]]
                [--split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]]
                [--split_names [SPLIT_NAMES [SPLIT_NAMES ...]]]
                [--split_name_separator SPLIT_NAME_SEPARATOR]
                [-l {DEBUG,INFO,WARN,ERROR,CRITICAL}]

Tool for locating files in directories that match certain patterns and store
them in files.

optional arguments:
  -h, --help            show this help message and exit
  -i DIR [DIR ...], --input DIR [DIR ...]
                        The dir(s) to scan for files. (default: None)
  -r, --recursive       Whether to search the directories recursively
                        (default: False)
  -o FILE, --output FILE
                        The file to store the located file names in (default:
                        None)
  -m [REGEXP [REGEXP ...]], --match [REGEXP [REGEXP ...]]
                        The regular expression that the (full) file names must
                        match to be included (default: None)
  -n [REGEXP [REGEXP ...]], --not-match [REGEXP [REGEXP ...]]
                        The regular expression that the (full) file names must
                        match to be excluded (default: None)
  --split_ratios [SPLIT_RATIOS [SPLIT_RATIOS ...]]
                        The split ratios to use for generating the splits
                        (int; must sum up to 100) (default: None)
  --split_names [SPLIT_NAMES [SPLIT_NAMES ...]]
                        The split names to use as filename suffixes for the
                        generated splits (before .ext) (default: None)
  --split_name_separator SPLIT_NAME_SEPARATOR
                        The separator to use between file name and split name
                        (default: -)
  -l {DEBUG,INFO,WARN,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARN,ERROR,CRITICAL}
                        The logging level to use (default: WARN)
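
For example, recursively locating all .txt files and splitting them into 80/10/10 train/validation/test file lists (paths and patterns are illustrative):

llm-find \
  -i ./corpus -r \
  -m ".*\.txt$" \
  --split_ratios 80 10 10 \
  --split_names train val test \
  -o corpus-files.txt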

Generating help screens for plugins

usage: llm-help [-h] [-c [PACKAGE [PACKAGE ...]]] [-e EXCLUDED_CLASS_LISTERS]
                [-p NAME] [-f FORMAT] [-L INT] [-o PATH] [-i FILE] [-t TITLE]
                [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Tool for outputting help for plugins in various formats.

optional arguments:
  -h, --help            show this help message and exit
  -c [PACKAGE [PACKAGE ...]], --custom_class_listers [PACKAGE [PACKAGE ...]]
                        The names of the custom class listers, uses the
                        default ones if not provided. (default: None)
  -e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
                        (default: None)
  -p NAME, --plugin_name NAME
                        The name of the plugin to generate the help for,
                        generates it for all if not specified (default: None)
  -f FORMAT, --help_format FORMAT
                        The output format to generate (default: text)
  -L INT, --heading_level INT
                        The level to use for the heading (default: 1)
  -o PATH, --output PATH
                        The directory or file to store the help in; outputs it
                        to stdout if not supplied; if pointing to a directory,
                        automatically generates file name from plugin name and
                        help format (default: None)
  -i FILE, --index_file FILE
                        The file in the output directory to generate with an
                        overview of all plugins, grouped by type (in markdown
                        format, links them to the other generated files)
                        (default: None)
  -t TITLE, --index_title TITLE
                        The title to use in the index file (default: llm-
                        dataset-converter plugins)
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        The logging level to use. (default: WARN)
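
For example, generating help files for all plugins in a directory, together with an index page (the markdown value for -f is an assumption based on the index file description; the default format is text):

llm-help \
  -f markdown \
  -o docs/plugins/ \
  -i index.md \
  -t "llm-dataset-converter plugins"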

Plugin registry

usage: llm-registry [-h] [-c CUSTOM_CLASS_LISTERS] [-e EXCLUDED_CLASS_LISTERS]
                    [-l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}]

For inspecting/querying the registry.

optional arguments:
  -h, --help            show this help message and exit
  -c CUSTOM_CLASS_LISTERS, --custom_class_listers CUSTOM_CLASS_LISTERS
                        The comma-separated list of custom class listers to
                        use. (default: None)
  -e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
                        The comma-separated list of class listers to exclude.
                        (default: None)
  -l {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}, --list {plugins,custom-class-listers,env-class-listers,downloaders,readers,filters,writers}
                        For outputting various lists on stdout. (default:
                        None)
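
For example, listing all available filters on stdout:

llm-registry -l filters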

Plugins

See here for an overview of all plugins.

Examples

You can find examples for using the library (command-line and code) here:

https://waikato-llm.github.io/llm-dataset-converter-examples/

Additional libraries

Class listers

The llm-dataset-converter uses the class lister registry provided by the seppl library.

Each module defines a function, typically called list_classes, that returns a dictionary mapping superclass names to the lists of modules that should be scanned for derived classes. Here is an example:

from typing import List, Dict


def list_classes() -> Dict[str, List[str]]:
    return {
        "ldc.api.Downloader": [
            "mod.ule1",
        ],
        "ldc.api.Reader": [
            "mod.ule2",
            "mod.ule3",
        ],
        "ldc.api.Filter": [
            "mod.ule4",
        ],
        "seppl.io.Writer": [
            "mod.ule5",
        ],
    }

Such a class lister gets referenced in the entry_points section of the setup.py file:

    entry_points={
        "class_lister": [
            "unique_string=module_name:function_name",
        ],
    },

The :function_name part can be omitted if the function is called list_classes.

The following environment variables can be used to influence the class listers:

  • LDC_CLASS_LISTERS
  • LDC_CLASS_LISTERS_EXCL

Each variable takes a comma-separated list of module_name:function_name references, defining the class listers to use (LDC_CLASS_LISTERS) or to exclude (LDC_CLASS_LISTERS_EXCL).
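
A hypothetical sketch of setting these variables; the module names are placeholders:

export LDC_CLASS_LISTERS=my_package.class_lister:list_classes
export LDC_CLASS_LISTERS_EXCL=other_package.class_lister:list_classes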