Project Demo Slides
Package: new release coming soon.
OnPoint: A Question Answering Service leveraging user reviews
OnPoint is a question answering service that leverages product user reviews. When you are looking for a specific product detail, OnPoint saves time by returning a short answer in seconds.
This repository explores applying XLNet to a question answering service based on user reviews. The base model and algorithm are inspired by and based on the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding (link) and the renatoviolin/xlnet repo (link).
The directory structure of this repo is the following:
- onpoint : contains all the source code
- tst : contains all the unit tests
- data : contains data for unit test
- configs : contains config files for hyperparameters during finetuning and evaluation
Setup
Installation
git clone https://github.com/hairong-wang/OnPoint.git
cd OnPoint
Requisites
- tensorflow-gpu==1.15.0-rc1
- absl-py==0.8.0
- Flask==1.1.1
- pip
Environment setup
export CUDA_VISIBLE_DEVICES=0
Steps to run
Step1: Configuration
Every script in OnPoint/onpoint/bin needs to be configured before use. For example, set the data paths as follows:
SQUAD_DATA_S3_BUCKET='squad-data'
SQUAD_DATA_TRAIN_S3_KEY='squad2.0/train-v2.0.json'
SQUAD_DATA_DEV_S3_KEY='squad2.0/dev-v2.0.json'
LOCAL_SQUAD_DATA_TRAIN_PATH=./squad2.0_train.json
LOCAL_SQUAD_DATA_DEV_PATH=./squad2.0_dev.json
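The variable names above suggest that the scripts copy an S3 object to a local path. As a rough Python sketch of that step, assuming this is indeed what the bin scripts do: the helper name `fetch_squad_file` is hypothetical, and `boto3` is an assumption since it is not listed under Requisites.

```python
def fetch_squad_file(bucket, key, local_path, s3_client=None):
    """Copy one object from S3 to a local file and return the local path.

    Hypothetical helper for illustration; not part of this repo.
    """
    if s3_client is None:
        # Assumption: boto3 is installed and AWS credentials are configured.
        import boto3
        s3_client = boto3.client("s3")
    # boto3's S3 client exposes download_file(Bucket, Key, Filename).
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

With the values from the example configuration, the call would look like `fetch_squad_file('squad-data', 'squad2.0/train-v2.0.json', './squad2.0_train.json')`.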
Step2: Prepare and Preprocess
- Download dataset
Download the dataset you want to use for finetuning. The datasets used in this project are:
- SQuAD 2.0.
- A manually sampled, labeled, and preprocessed AmazonQA dataset, available at Google Cloud Storage Buckets/xlnet_squad2/data/amazon; you can access the bucket from here.
- Download model checkpoints
- The model checkpoints are available at Google Cloud Storage Buckets/xlnet_squad2/experiment/squad_and_amazon_8000steps_1000warmup; you can access the bucket from here. So far, the best-performing checkpoint is 'model.ckpt-4000'.
- Convert dataset to SQuAD format (optional)
If you want to try another dataset, convert it to SQuAD format first.
# Change the INFILE and OUTFILE path
python3 squad_converter.py
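For reference, the target layout is the SQuAD 2.0 JSON structure: a top-level `version` and `data` list, where each entry holds `paragraphs`, and each paragraph holds a `context` plus a list of `qas`. The sketch below builds one such record from a review; the function names are hypothetical and this is not necessarily how squad_converter.py works.

```python
import json


def to_squad_record(title, context, question, answer_text, qa_id):
    """Build one SQuAD 2.0 style entry from a single review and question.

    Hypothetical helper for illustration; the repo's converter may differ.
    """
    # The answer must be a literal span of the context; otherwise the
    # question is marked as unanswerable, as SQuAD 2.0 allows.
    start = context.find(answer_text) if answer_text else -1
    is_impossible = start < 0
    answers = [] if is_impossible else [{"text": answer_text, "answer_start": start}]
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qa_id,
                        "question": question,
                        "answers": answers,
                        "is_impossible": is_impossible,
                    }
                ],
            }
        ],
    }


def to_squad_dataset(records, version="v2.0"):
    """Wrap records in the top-level SQuAD container."""
    return {"version": version, "data": list(records)}


record = to_squad_record(
    title="example-product",
    context="The battery lasts about 10 hours on a full charge.",
    question="How long does the battery last?",
    answer_text="about 10 hours",
    qa_id="example-0001",
)
squad_json = json.dumps(to_squad_dataset([record]), indent=2)
```

The `answer_start` field is a character offset into `context`, so the answer text has to appear verbatim in the review.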
- Preprocess data
Multiprocessing is available: set 'NUM_PROC=' to the number of cores you want to use.
cd onpoint
bash bin/data_processing
Step3: Train model
bash bin/model_building
Step4: Evaluate model
bash bin/model_analysis
Step5: Run inference
bash bin/model_inference
Step6: Run the Flask app on your local machine
python3 app.py
Analysis
Final result:
Model | Finetune Dataset | Validation Dataset | AmazonQA Sample Coverage | F1 |
---|---|---|---|---|
BERT-Large | SQuAD 2.0 | Augmented AmazonQA | 30% | 67.34 |
XLNet-Large | SQuAD 2.0 | Augmented AmazonQA | 40% | 66.20 |
XLNet-Large | Augmented AmazonQA | Augmented AmazonQA | 0% | 66.67 |
XLNet-Large | SQuAD 2.0 + Augmented AmazonQA | Augmented AmazonQA | 50% | 69.27 |
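The README does not say how the F1 column is computed. Assuming it is the usual SQuAD-style token-overlap F1 (averaged over questions and scaled to 0-100), a minimal sketch of the per-question score looks like this; the function names are illustrative and not taken from this repo.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: full credit only when both answers are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, a prediction of "the battery lasts 10 hours" against a reference of "10 hours" has precision 2/4 and recall 2/2, so it earns partial credit rather than zero.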