torch_text_similarity

Implementations of models and metrics for semantic text similarity. Includes fine-tuning and prediction of models.


License
MIT
Install
pip install torch_text_similarity==1.0.4

Documentation

torch-text-similarity

Implementations of models and metrics for semantic text similarity. Includes fine-tuning and prediction of models. Thanks to @Andriy Mulyar for his elegant implementations; he has published a lot of useful code.

Installation

Install with pip:

pip install torch-text-similarity

Use

The learner maps batches of sentence pairs to real-valued similarity scores in the range [0, 5]:

import torch

from torch_text_similarity import TextSimilarityLearner
from torch_text_similarity.data import train_eval_sts_a_dataset

learner = TextSimilarityLearner(batch_size=10,
                                model_name='web-bert-similarity',
                                loss_func=torch.nn.MSELoss(),
                                learning_rate=5e-5,
                                weight_decay=0,
                                device=torch.device('cuda:0'))

train_dataset, eval_dataset = train_eval_sts_a_dataset(learner.bert_tokenizer, path='./data/train.csv')

learner.load_train_data(train_dataset)
learner.train(epoch=1)

predictions = learner.predict([('The patient is sick.', 'Grass is green.'),
                               ('A prescription of acetaminophen 325 mg was given.', 'The patient was given Tylenol.')])

print(predictions)
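
The returned predictions are plain floating-point scores on the same [0, 5] scale as the training labels. If a downstream task expects similarities in [0, 1], a simple rescaling step is enough; the snippet below is a sketch, not part of the library API, and assumes `predictions` is a sequence of floats as returned above.

# Sketch: clamp to [0, 5] and rescale to [0, 1].
normalized = [min(max(score, 0.0), 5.0) / 5.0 for score in predictions]
print(normalized)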

Make a submission to a semantic text similarity competition

import torch
import pandas as pd

from torch_text_similarity import TextSimilarityLearner
from torch_text_similarity.data import train_eval_sts_a_dataset

learner = TextSimilarityLearner(batch_size=10,
                                model_name='web-bert-similarity',
                                loss_func=torch.nn.MSELoss(),
                                learning_rate=5e-5,
                                weight_decay=0,
                                device=torch.device('cuda:0'))

train_dataset, eval_dataset = train_eval_sts_a_dataset(learner.bert_tokenizer, path='./data/train.csv')

learner.load_train_data(train_dataset)
learner.train(epoch=1)

test_data = pd.read_csv('./data/test.csv')
preds_list = []
for i, row in test_data.iterrows():
    text_a = row['text_a']
    text_b = row['text_b']
    preds = learner.predict([(text_a, text_b)])[0]
    preds_list.append(preds)

submission = pd.DataFrame({"id": range(len(preds_list)), "label": preds_list})
submission.to_csv('./submission.csv', index=False, header=False)
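
The loop above calls predict once per row. Since predict already accepts a list of sentence pairs (as in the first example), an alternative is to collect all pairs and score them in a single call, letting the configured batch_size do the chunking. A sketch, assuming predict returns one score per input pair:

# Sketch: batched prediction over the whole test set in one call.
pairs = list(zip(test_data['text_a'], test_data['text_b']))
preds_list = learner.predict(pairs)

submission = pd.DataFrame({"id": range(len(preds_list)), "label": preds_list})
submission.to_csv('./submission.csv', index=False, header=False)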

Ensemble two pre-trained models and make a submission

import torch
import pandas as pd

from torch_text_similarity import TextSimilarityLearner
from torch_text_similarity.data import train_eval_sts_a_dataset

learner_1 = TextSimilarityLearner(model_name='web-bert-similarity', device=torch.device('cuda:0'))
learner_2 = TextSimilarityLearner(model_name='clinical-bert-similarity', device=torch.device('cuda:0'))


test_data = pd.read_csv('./data/test.csv')
preds_list = []
for i, row in test_data.iterrows():
    text_a = row['text_a']
    text_b = row['text_b']
    preds = learner_1.predict([(text_a, text_b)])[0] * 0.55 + learner_2.predict([(text_a, text_b)])[0] * 0.45
    preds_list.append(preds)

submission = pd.DataFrame({"id": range(len(preds_list)), "label": preds_list})
submission.to_csv('./submission.csv', index=False, header=False)
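
The same weighted-average idea extends to any number of pre-trained learners. The sketch below is illustrative: the learners and weights lists and the ensemble_score helper are user-chosen names, not part of the library.

# Sketch: weighted ensemble of N learners for a single sentence pair.
learners = [learner_1, learner_2]   # any number of TextSimilarityLearner instances
weights = [0.55, 0.45]              # user-chosen weights, summing to 1.0

def ensemble_score(text_a, text_b):
    return sum(w * l.predict([(text_a, text_b)])[0]
               for l, w in zip(learners, weights))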

More examples.

Data

The data sets used in the examples can be found in Google Cloud Drive: