Label Agnostic Pre-training for Zero-shot Text Classification
This repository contains the code and data for the Findings of ACL'23 paper Label Agnostic Pre-training for Zero-shot Text Classification by Christopher Clarke, Yuzhao Heng, Yiping Kang, Krisztian Flautner, Lingjia Tang and Jason Mars.
In this paper, we investigate the task of zero-shot text classification with the aim of improving the ability of PLMs to generalize to both seen and unseen data across domains without the need for additional training. We introduce two new simple yet effective training strategies, Implicit training and Explicit pre-training, which specifically inject aspect-level understanding into the model at train time. To evaluate this, we release UTCD, a new benchmark dataset for evaluating text classification in zero-shot settings.
Universal Text Classification Dataset (UTCD)
UTCD is a compilation of 18 classification datasets spanning 3 categories: Sentiment, Intent/Dialogue, and Topic classification. UTCD focuses on the task of zero-shot text classification where the candidate labels are descriptive of the text being classified. UTCD consists of ~6M/800K train/test examples.
UTCD Datasets & Principles:
- Sentiment
    - GoEmotions introduced in GoEmotions: A Dataset of Fine-Grained Emotions
    - TweetEval introduced in TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification (Sentiment subset)
    - Emotion introduced in CARER: Contextualized Affect Representations for Emotion Recognition
    - Amazon Polarity introduced in Character-level Convolutional Networks for Text Classification
    - Finance Phrasebank introduced in Good debt or bad debt: Detecting semantic orientations in economic texts
    - Yelp introduced in Character-level Convolutional Networks for Text Classification
- Intent/Dialogue
    - Schema-Guided Dialogue introduced in Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset
    - Clinc-150 introduced in An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction
    - SLURP SLU introduced in SLURP: A Spoken Language Understanding Resource Package
    - Banking77 introduced in Efficient Intent Detection with Dual Sentence Encoders
    - Snips introduced in Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces
    - NLU Evaluation introduced in Benchmarking Natural Language Understanding Services for building Conversational Agents
- Topic
    - AG News introduced in Character-level Convolutional Networks for Text Classification
    - DBpedia 14 introduced in DBpedia: A Nucleus for a Web of Open Data
    - Yahoo Answer Topics introduced in Character-level Convolutional Networks for Text Classification
    - MultiEurlex introduced in MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer
    - BigPatent introduced in BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization
    - Consumer Finance introduced in the Consumer Complaint Database
In order to make NLP models more broadly useful, zero-shot techniques need to be capable of label, domain & aspect transfer. As such, in the construction of UTCD we enforce the following principles:
- Textual labels: In UTCD, we mandate the use of textual labels. While numerical label values are often used in classification tasks, descriptive textual labels such as those present in the datasets across UTCD enable the development of techniques that can leverage the class name, which is instrumental in providing zero-shot support. As such, for each of the compiled datasets, labels are standardized such that they describe the text in natural language (see the sketch after this list).
- Diverse domains and sequence lengths: In addition to broad coverage of aspects, UTCD compiles diverse data across several domains such as Banking, Finance, and Legal, each comprising sequences of varied length (long and short). The datasets are listed above.
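As an illustration of the textual-label principle, the sketch below converts AG News' numeric class ids into descriptive natural-language labels. This is not the exact UTCD preprocessing code, and the label strings shown are illustrative; the ones used in UTCD may differ.

```python
# Illustrative sketch only, not the exact UTCD preprocessing code:
# map AG News' numeric class ids to descriptive natural-language labels.
from datasets import load_dataset

# Hypothetical textual labels; the strings used in UTCD may differ
ID2TEXT = {0: "world", 1: "sports", 2: "business", 3: "science & technology"}

def add_textual_label(example):
    example["label_text"] = ID2TEXT[example["label"]]
    return example

ag_news = load_dataset("ag_news", split="train")
ag_news = ag_news.map(add_textual_label)
print(ag_news[0]["text"][:80], "->", ag_news[0]["label_text"])
```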
User’s Guide (HuggingFace)
The UTCD dataset and trained models are available on HuggingFace. Please refer to the instructions there.
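For reference, below is a minimal inference sketch. It assumes the released checkpoint is compatible with the transformers `zero-shot-classification` pipeline; the model identifier is a placeholder, so substitute the actual name from the HuggingFace page and follow the model card if it prescribes a different loading procedure.

```python
# Minimal sketch, assuming a pipeline-compatible checkpoint; the model name
# below is a placeholder, use the identifier listed on the HuggingFace page.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="<huggingface-model-name>",  # placeholder
)

result = classifier(
    "I want to transfer money to my savings account",
    candidate_labels=["transfer money", "check balance", "report lost card"],
)
print(result["labels"][0])  # highest-scoring candidate label
```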
User’s Guide (Local)
Setup environment
OS: UNIX; Python version 3.8.10; CUDA version 11.6.
Create conda environment:
conda create -n zs-cls python=3.8.10 pip
Move to the project root directory and install the Python packages:
pip3 install -r requirements.txt
Add the current directory to PYTHONPATH so Python can find our local package:
export PYTHONPATH=$PYTHONPATH:`pwd`
Note: Denoting the package directory at system path `<BASE_PATH>/zero-shot-text-classification`, all trained models will be saved to `<BASE_PATH>/models` and all evaluation CSV files to `<BASE_PATH>/eval`.
Below we include command line arguments and example train/eval commands for models in our paper.
BERT Sequence Classifier
Arguments
- `dataset`: Dataset to train/evaluate the model on; pass `all` for all datasets
- `domain`: One of [`in`, `out`], the domain of dataset(s) to train/evaluate on
- `normalize_aspect`: If true, datasets are normalized by aspect, ==TODO add==
- `learning_rate`: Learning rate for training
- `batch_size`: Batch size for training/evaluation
- `epochs`: Number of epochs for training
- `model_name_or_path`: File system path or HuggingFace model name for model evaluation, ==TODO test==
Train
- Train solely on in-domain dataset `go_emotion`:
    - `python zeroshot_classifier/models/bert.py train --domain in --dataset go_emotion`
- Train solely on out-of-domain dataset `consumer_finance`:
    - `python zeroshot_classifier/models/bert.py train --domain out --dataset consumer_finance`
- Train on all in-domain datasets:
    - `python zeroshot_classifier/models/bert.py train --domain in --dataset all`
Eval
- Evaluate a local model on out-of-domain dataset `multi_eurlex` (a sketch of loading such a checkpoint directly in Python follows below):
    - `python zeroshot_classifier/models/bert.py test --domain out --dataset multi_eurlex --model_name_or_path models/2022-06-15_21-23-57_BERT-Seq-CLS-out-multi_eurlex/trained`
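Beyond the CLI above, a trained checkpoint can presumably also be loaded directly with transformers, assuming it is saved as a standard HuggingFace sequence-classification checkpoint with tokenizer files alongside it. The path below is a placeholder and the label mapping depends on the training run.

```python
# Minimal sketch, assuming the saved checkpoint is a standard HuggingFace
# sequence-classification checkpoint; the path below is a placeholder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "models/<timestamp>_BERT-Seq-CLS-out-multi_eurlex/trained"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

inputs = tokenizer("Regulation on the common organisation of agricultural markets", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(model.config.id2label.get(pred, pred))  # predicted label; mapping depends on the run
```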
Binary & Dual Encoding Zero-shot Classification
Arguments
- `mode`: Training strategy, one of [`vanilla`, `implicit-on-text-encode-sep`, `explicit`]
- `normalize_aspect`: If true, datasets are normalized by aspect, ==TODO add==
- `learning_rate`: Learning rate for training
- `batch_size`: Batch size for training/evaluation
- `epochs`: Number of epochs for training
- `init_model_name_or_path`: File system path or HuggingFace model name to initialize model weights for explicit training, ==TODO test==
- `output_dir`: Directory name postfix for the trained model
- `domain`: One of [`in`, `out`], the domain of datasets to evaluate on
- `model_name_or_path`: Directory name or HuggingFace model name for evaluation
Train
- Vanilla training on Binary BERT:
    - `python zeroshot_classifier/models/binary_bert.py train --mode vanilla --batch_size 32 --epochs 8 --learning_rate 2e-5 --output_dir '{a=2e-5}'`
- Explicit training on Bi-Encoder:
    - `python zeroshot_classifier/models/bi-encoder.py train --mode explicit --model_init '2022-11-21_18-58-54_Aspect-Pretrain-Binary-BERT_{md=exp, na=T}_{a=3e-05}/trained'`
Eval
- Evaluate an implicitly-trained model on all in-domain datasets:
    - `python zeroshot_classifier/models/binary_bert.py test --mode implicit-on-text-encode-sep --domain in --model_dir_nm 2022-10-12_01-21-08_Binary-BERT-implicit-on-text-encode-sep-rand-aspect-norm`
Explicit Pretraining
Arguments
- `output_dir`: Directory name postfix for the trained model
- `normalize_aspect`: If true, datasets are normalized by aspect
- `learning_rate`: Learning rate for training
- `batch_size`: Batch size for training/evaluation
- `epochs`: Number of epochs for training
Train
- Train with learning rate 2e-5, ==TODO verify working==:
    - `python zeroshot_classifier/models/explicit/binary_bert_pretrain.py --learning_rate 2e-5 --output_dir '{a=2e-5}'`
Generative Classification
Arguments
- `mode`: Training strategy, one of [`vanilla`, `implicit`, `explicit`]
- `normalize_aspect`: If true, datasets are normalized by aspect
- `learning_rate`: Learning rate for training
- `batch_size`: Batch size for training/evaluation
- `gradient_accumulation_steps`: Number of gradient accumulation steps for training
- `epochs`: Number of epochs for training
- `ddp`: DDP training flag, intended for proper logging during training
- `init_model_name_or_path`: File system path or HuggingFace model name to initialize model weights for explicit training, ==TODO verify working==
- `output_dir`: Directory name postfix for the trained model
- `model_name_or_path`: Directory name for model evaluation
==TODO, verify command args==
Train
- Implicit training on GPT with DDP:
    - `torchrun --nproc_per_node=4 zeroshot_classifier/models/gpt2.py train --mode implicit`
- Explicit training on GPT:
    - `python zeroshot_classifier/models/gpt2.py train --mode explicit --model_init '2022-11-27_17-39-06_Aspect-Pretrain-NVIDIA-GPT2_{md=exp, na=T}_{a=2e-05}'`
Eval
- Evaluate a model with vanilla training on all out-of-domain datasets:
    - `python zeroshot_classifier/models/gpt2.py test --mode vanilla --model_dir_nm '2022-11-29_19-37-13_NVIDIA-GPT2_{md=van, na=T}_{a=3e-05}'`
Explicit Pretraining
Arguments
- `output_dir`: Directory name postfix for the trained model
- `normalize_aspect`: If true, datasets are normalized by aspect
- `learning_rate`: Learning rate for training
- `batch_size`: Batch size for training/evaluation
- `gradient_accumulation_steps`: Number of gradient accumulation steps for training
- `epochs`: Number of epochs for training
Train
- Train with learning rate 4e-5, ==TODO verify working==:
    - `python zeroshot_classifier/models/explicit/gpt2_pretrain.py --learning_rate 4e-5 --output_dir '{a=4e-5}'`