zeroshot-classifier

code and data for the Findings of ACL'23 paper Label Agnostic Pre-training for Zero-shot Text Classification


Keywords
python, nlp, machine-learning, deep-learning, text-classification, zero-shot-classification
License
MIT
Install
pip install zeroshot-classifier==0.1.2

Documentation

Label Agnostic Pre-training for Zero-shot Text Classification

This repository contains the code and data for the Findings of ACL'23 paper Label Agnostic Pre-training for Zero-shot Text Classification by Christopher Clarke, Yuzhao Heng, Yiping Kang, Krisztian Flautner, Lingjia Tang and Jason Mars.

In this paper, we investigate the task of zero-shot text classification with the aim of improving the ability of PLMs to generalize to both seen and unseen data across domains without the need for additional training. We introduce two new, simple yet effective training strategies, Implicit training & Explicit pre-training, which specifically inject aspect-level understanding into the model at train time. To evaluate this, we release UTCD, a new benchmark dataset for evaluating text classification in zero-shot settings. Trained models and the UTCD data are available on HuggingFace; see the User's Guide sections below.

Universal Text Classification Dataset (UTCD)

UTCD is a compilation of 18 classification datasets spanning 3 categories: Sentiment, Intent/Dialogue, and Topic classification. UTCD focuses on the task of zero-shot text classification where the candidate labels are descriptive of the text being classified. UTCD consists of ~6M/800K train/test examples.

UTCD Datasets & Principles:

In order to make NLP models more broadly useful, zero-shot techniques need to be capable of label, domain & aspect transfer. As such, in the construction of UTCD we enforce the following principles:

  • Textual labels: In UTCD, we mandate the use of textual labels. While numerical label values are often used in classification tasks, descriptive textual labels such as those present in the datasets across UTCD enable the development of techniques that can leverage the class name, which is instrumental in providing zero-shot support. As such, for each of the compiled datasets, labels are standardized so that they describe the text in natural language (see the illustrative example after this list).
  • Diverse domains and sequence lengths: In addition to broad coverage of aspects, UTCD compiles diverse data across several domains such as Banking, Finance, and Legal, each comprising sequences of varied length (long and short). The datasets are listed above.
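For example, a record that would carry an opaque numeric class id in its original release is stored in UTCD with a natural-language label instead. The snippet below is purely illustrative; the field names are not the exact UTCD schema.

# Illustrative only: the field names below are not the exact UTCD schema.
original_style = {"text": "I can't believe they cancelled the show.", "label": 2}
utcd_style = {"text": "I can't believe they cancelled the show.", "label": "anger"}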

User’s Guide (HuggingFace)

The UTCD dataset and trained models are available on HuggingFace. Please refer to the instructions there.
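As a minimal sketch of how the released assets could be consumed, the snippet below loads data with the datasets library and runs a released checkpoint through the standard transformers zero-shot-classification pipeline. The hub identifiers are placeholders, not confirmed names; substitute the actual dataset and model names from the HuggingFace pages, and note that the pipeline call assumes the checkpoint exposes a standard zero-shot-classification interface.

from datasets import load_dataset
from transformers import pipeline

# Placeholder hub identifiers; replace with the names listed on the HuggingFace pages.
utcd = load_dataset("<hf-username>/UTCD")
print(utcd["test"][0])

# Assumes the released checkpoint is compatible with the standard zero-shot pipeline.
classifier = pipeline("zero-shot-classification", model="<hf-username>/<model-name>")
print(classifier("I am so happy with this purchase!", candidate_labels=["joy", "anger", "sadness"]))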

User’s Guide (Local)

Setup environment

OS: UNIX; Python version 3.8.10; CUDA version 11.6.

Create conda environment:

conda create -n zs-cls python=3.8.10 pip

Move to project root directory, install python packages:

pip3 install -r requirements.txt

Add current directory for python to look for our local package:

export PYTHONPATH=$PYTHONPATH:`pwd`

Note

With the package root at system path <BASE_PATH>/zero-shot-text-classification, all trained models are saved to <BASE_PATH>/models and all evaluation CSV files are saved to <BASE_PATH>/eval.

Below we include command line arguments and example train/eval commands for models in our paper.

BERT Sequence Classifier

Arguments

  • dataset: Dataset to train/evaluate the model on, pass all for all datasets
  • domain: One of [in, out], the domain of dataset(s) to train/evaluate on
  • normalize_aspect: If true, datasets are normalized by aspect
  • learning_rate: Learning rate for training
  • batch_size: Batch size for training/evaluation
  • epochs: #epochs for training
  • model_name_or_path: File system path or HuggingFace model name for model evaluation

Train

  • Train solely on in-domain dataset go_emotion

    • python zeroshot_classifier/models/bert.py train --domain in --dataset go_emotion
  • Train solely on out-of-domain dataset consumer_finance

    • python zeroshot_classifier/models/bert.py train --domain out --dataset consumer_finance
  • Train on all in-domain datasets

    • python zeroshot_classifier/models/bert.py train --domain in --dataset all

Eval

  • Evaluate a local model on out-of-domain dataset multi_eurlex

    • python zeroshot_classifier/models/bert.py test --domain out --dataset multi_eurlex --model_name_or_path models/2022-06-15_21-23-57_BERT-Seq-CLS-out-multi_eurlex/trained
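For quick inspection outside the CLI, a trained checkpoint can in principle be loaded directly with transformers, assuming it was saved via save_pretrained as a standard sequence-classification model. This is a minimal sketch under that assumption; the path reuses the example run above, and the label mapping depends on the dataset the model was trained on.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumes the checkpoint under models/<run>/trained is a standard transformers
# sequence-classification model saved with save_pretrained.
path = "models/2022-06-15_21-23-57_BERT-Seq-CLS-out-multi_eurlex/trained"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path)

inputs = tokenizer("Regulation on the protection of personal data", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # index of the predicted class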

Binary & Dual Encoding Zero-shot Classification

Arguments

  • mode: Training strategy, one of [vanilla, implicit-on-text-encode-sep, explicit]
  • normalize_aspect: If true, datasets are normalized by aspect
  • learning_rate: Learning rate for training
  • batch_size: Batch size for training/evaluation
  • epochs: #epochs for training
  • init_model_name_or_path: File system path or HuggingFace model name to initialize model weights for explicit training
  • output_dir: Directory name postfix for trained model
  • domain: One of [in, out], the domain of datasets to evaluate on
  • model_name_or_path: Directory name or HuggingFace model name for evaluation

Train

  • Vanilla training on Binary BERT

    • python zeroshot_classifier/models/binary_bert.py train --mode vanilla --batch_size 32 --epochs 8 --learning_rate 2e-5 --output_dir '{a=2e-5}'
  • Explicit training on Bi-Encoder

    • python zeroshot_classifier/models/bi-encoder.py train --mode explicit --model_init '2022-11-21_18-58-54_Aspect-Pretrain-Binary-BERT_{md=exp, na=T}_{a=3e-05}/trained'

Eval

  • Evaluate implicitly-trained model on all in-domain datasets

    • python zeroshot_classifier/models/binary_bert.py test --mode implicit-on-text-encode-sep --domain in --model_dir_nm 2022-10-12_01-21-08_Binary-BERT-implicit-on-text-encode-sep-rand-aspect-norm
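Conceptually, the binary formulation scores each (text, candidate label) pair with a binary classifier and predicts the highest-scoring label; in the implicit mode the aspect is additionally encoded into the text. The sketch below only illustrates the pair-scoring idea with plain transformers; it is not the repository's actual inference API, the checkpoint path is a placeholder, and the assumption that logit index 1 is the "match" class may not hold for the released models.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical sketch of (text, label) pair scoring; not the repository's inference API.
path = "models/<binary-bert-run>/trained"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path)

text = "Please transfer $100 to my savings account."
labels = ["transfer money", "check balance", "report fraud"]

# Encode each (text, label) pair; the tokenizer joins the pair with [SEP].
enc = tokenizer([text] * len(labels), labels, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    probs = model(**enc).logits.softmax(dim=-1)
scores = probs[:, 1]  # assumes class index 1 means "label matches text"
print(labels[int(scores.argmax())])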

Explicit Pretraining

Arguments

  • output_dir: Directory name postfix for trained model
  • normalize_aspect: If true, datasets are normalized by aspect
  • learning_rate: Learning rate for training
  • batch_size: Batch size for training/evaluation
  • epochs: #epochs for training

Train

  • Train with learning rate 2e-5

    • python zeroshot_classifier/models/explicit/binary_bert_pretrain.py --learning_rate 2e-5 --output_dir '{a=2e-5}'

Generative Classification

Arguments

  • mode: Training strategy, one of [vanilla, implicit, explicit]
  • normalize_aspect: If true, datasets are normalized by aspect
  • learning_rate: Learning rate for training
  • batch_size: Batch size for training/evaluation
  • gradient_accumulation_steps: #gradient accumulation steps for training
  • epochs: #epochs for training
  • ddp: DDP training flag, intended for proper logging during training
  • init_model_name_or_path: File system path or HuggingFace model name to initialize model weights for explicit training
  • output_dir: Directory name postfix for trained model
  • model_name_or_path: Directory name for model evaluation

Train

  • Implicit training on GPT with DDP

    • torchrun --nproc_per_node=4 zeroshot_classifier/models/gpt2.py train --mode implicit
  • Explicit training on GPT

    • python zeroshot_classifier/models/gpt2.py train --mode explicit --model_init '2022-11-27_17-39-06_Aspect-Pretrain-NVIDIA-GPT2_{md=exp, na=T}_{a=2e-05}'

Eval

  • Evaluate model with vanilla training on all out-of-domain datasets

    • python zeroshot_classifier/models/gpt2.py test --mode vanilla --model_dir_nm '2022-11-29_19-37-13_NVIDIA-GPT2_{md=van, na=T}_{a=3e-05}'
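Conceptually, the generative classifier conditions GPT-2 on the input text and the candidate labels and decodes the predicted label as text. The sketch below only illustrates that idea with a vanilla gpt2 checkpoint from the hub; the prompt template is hypothetical and does not match the exact format used in the paper's code.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative prompt only; not the exact template used by zeroshot_classifier.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "Text: The flight was delayed for three hours and nobody helped us.\n"
    "Possible labels: anger, joy, neutral\n"
    "Label:"
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))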

Explicit Pretraining

Arguments

  • output_dir: Directory name postfix for trained model
  • normalize_aspect: If true, datasets are normalized by aspect
  • learning_rate: Learning rate for training
  • batch_size: Batch size for training/evaluation
  • gradient_accumulation_steps: #gradient accumulation steps for training
  • epochs: #epochs for training

Train

  • Train with learning rate 4e-5

    • python zeroshot_classifier/models/explicit/gpt2_pretrain.py --learning_rate 4e-5 --output_dir '{a=4e-5}'