stroll-srl

Graph based semantic role labeler


License
Apache-2.0
Install
pip install stroll-srl==0.5.1

Documentation

Graph based Semantic Role Labeller

This is a semantic role laballer based on a graph convolutional network. The goal is to make something reasonably state-of-the-art for the Dutch language.

This is work in progress.

Training data

SoNaR contains a annotated dataset of 500K words. The annotation of the frames is quite basic:

  • the semantic annotations were made on top of a syntactic dependency tree
  • only (some) verbs are marked
  • no linking to FrameNet or any frame identification is made

The original annotations were made on AlpinoXML; the files are converted to conllu using a script from Groningen. Conversion scripts from Alpino and Lassy to the UD format available here.

I am planning to use universal dependencies schema for the syntax, and assume input files are in conllu format. This way the labeller can be used as part of the Newsreader annotation pipeline.

Approach

The training set was made as annotations on top of a syntactic tree. We'll use the same tree, as given by the HEAD and DEPREL fields from the conll files. Words form the nodes, and we add three kinds of edges:

  • from dependent to head (weighted by the number of dependents)
  • from the head to the dependent
  • from the node to itself

At each node, we add a GRU cell. The initial state is made using one-hot encoding of a number of features. The output of the cell is then passed to two classifiers One to indicate if this word is a frame, and one to indicate if the word is the head of an arguments (and which argument).

As node features we can use the information in the conllu file. We also added pre-trained word vectors form either fasttext, or BERTje. Finally, we add a positional encoding (ie. 'first descendant').

Installation

Setup a virtual env with python3, and install dependencies:

python3 -m venv env
. env/bin/activate
pip install -r requirements.txt

Plans

  • Train frames and roles separately, to see effect of multi-task learning;
  • Reconsider how we use the BERT wordvectors. We now first encode a whole sentence, and sum the resulting vectors per bert-token to our tokeninzation. Alternatively, per word we find the sentence part it is the head of, and encode that.

Best model until now

Model layers

Net(
  (embedding): Sequential(
    (0): Linear(in_features=189, out_features=100, bias=True)
    (1): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (gru): RGRUGCN(
    in_feats=100, out_feats=100
    (gru): GRU(100, 100)
    (batchnorm): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (task_a): MLP(
    (fc): Sequential(
      (0): Linear(in_features=100, out_features=100, bias=True)
      (1): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): Linear(in_features=100, out_features=2, bias=True)
    )
  )
  (task_b): MLP(
    (fc): Sequential(
      (0): Linear(in_features=100, out_features=100, bias=True)
      (1): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): Linear(in_features=100, out_features=21, bias=True)
    )
  )
)

Training settings

  • Features used UPOS, DEPREL, FEATS, RID, FastText100
  • Size of hidden state 100
  • Number of GRU iterations 4
  • Activation functions relu
  • Batch size 50 sentences
  • Adam optimizer ADAM
  • Initial learning rate 1e-02
  • Learning rate scheduler StepLR, gamma=0.9
  • Focal Loss, gamma=5.
  • The two loss functions were added, both with weight 1

Results

2 classes are so rare, they are not in our 10% evaluation set, and were not predicted by the model. (Arg5 and ArgM-STR).

The confustion matrix and statistics (classification_report) were made with scikit-learn.

The best model was after 6628920 words, or 15 epochs.

Frames

_ rel
_ 45602 152
rel 164 3561
precision recall f1-score support
_ 1.00 1.00 1.00 45754
rel 0.96 0.96 0.96 3725
accuracy 0.99 49479
macro avg 0.98 0.98 0.98 49479
weighted avg 0.99 0.99 0.99 49479

Roles

Arg0 Arg1 Arg2 Arg3 Arg4 ArgM-ADV ArgM-CAU ArgM-DIR ArgM-DIS ArgM-EXT ArgM-LOC ArgM-MNR ArgM-MOD ArgM-NEG ArgM-PNC ArgM-PRD ArgM-REC ArgM-TMP _
Arg0 1650 113 9 0 0 2 2 1 0 0 3 0 0 0 2 0 0 0 32
Arg1 232 2596 111 0 6 4 3 3 0 5 41 17 1 6 7 2 2 2 100
Arg2 16 154 346 2 8 1 0 10 0 3 91 27 0 1 16 6 2 6 18
Arg3 0 9 26 5 0 0 0 1 0 0 7 3 0 0 8 1 0 4 1
Arg4 0 5 11 0 28 0 0 3 0 0 7 0 0 0 0 0 0 1 0
ArgM-ADV 0 8 13 0 0 312 10 2 27 5 20 32 0 0 4 6 0 45 44
ArgM-CAU 9 5 5 0 0 13 96 1 1 0 0 20 0 0 1 0 0 7 12
ArgM-DIR 0 0 10 0 4 0 0 30 0 0 4 5 0 0 0 1 0 2 2
ArgM-DIS 0 0 0 0 0 27 3 0 322 2 5 13 0 0 0 0 0 14 157
ArgM-EXT 0 4 4 1 0 6 0 0 0 47 2 8 0 1 0 0 0 9 5
ArgM-LOC 0 12 49 0 3 8 1 7 0 1 519 6 0 1 4 2 1 26 36
ArgM-MNR 3 25 35 0 0 38 6 8 4 13 18 313 1 1 0 10 0 14 27
ArgM-MOD 0 2 0 0 0 1 0 0 0 0 0 1 598 0 0 0 0 0 54
ArgM-NEG 1 2 1 0 0 2 0 0 0 0 0 2 1 266 0 0 0 3 13
ArgM-PNC 0 9 20 0 0 2 7 0 0 0 4 4 0 0 119 0 0 5 11
ArgM-PRD 0 9 7 0 0 9 0 0 0 2 3 13 0 0 2 74 2 1 12
ArgM-REC 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 105 0 0
ArgM-TMP 0 4 13 1 1 45 7 0 3 13 31 24 1 3 4 0 0 862 32
_ 81 138 24 0 1 37 9 2 13 3 27 28 50 15 8 7 0 30 38234
precision recall f1-score support
Arg0 0.83 0.91 0.87 1814
Arg1 0.84 0.83 0.83 3138
Arg2 0.51 0.49 0.50 707
Arg3 0.56 0.08 0.14 65
Arg4 0.55 0.51 0.53 55
ArgM-ADV 0.62 0.59 0.60 528
ArgM-CAU 0.67 0.56 0.61 170
ArgM-DIR 0.44 0.52 0.48 58
ArgM-DIS 0.87 0.59 0.71 543
ArgM-EXT 0.50 0.54 0.52 87
ArgM-LOC 0.66 0.77 0.71 676
ArgM-MNR 0.61 0.61 0.61 516
ArgM-MOD 0.92 0.91 0.91 656
ArgM-NEG 0.90 0.91 0.91 291
ArgM-PNC 0.68 0.66 0.67 181
ArgM-PRD 0.67 0.55 0.61 134
ArgM-REC 0.94 0.96 0.95 109
ArgM-TMP 0.84 0.83 0.83 1044
_ 0.99 0.99 0.99 38707
accuracy 0.94 49479
macro avg 0.71 0.67 0.68 49479
weighted avg 0.94 0.94 0.94 49479

Ensemble prediction

We trained several models, most of which are similarly good (or bad). An ensemble is a method to further increase the model performance. For this we need a way to combine the indenpendent classifications in a single classifier. A simple way is to use marjority voting: the label with the most votes wins. But this does not make the best use of the separate classifiers. We'll combine the individual label distributions in a single distribution by multiplying the distributions. This is known as 'conflation' (in physics), or 'product-of-experts', or a logarithmic opinion pool.

unweighted

First we'll try an unweighted product.

weighted

We are also looking into using weight factors, following the references below. The idea is to minimize the KL divergence of ensemble.

References

  1. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling
  2. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
  3. Focal Loss for Dense Object Detection
  4. GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs
  5. Look Again at the Syntax: Relational Graph Convolutional Network for Gendered Ambiguous Pronoun Resolution code
  6. Deep Graph Library
  7. The CoNLL-2009 shared task: syntactic and semantic dependencies in multiple languages
  8. 2009 Shared task evaluation script
  9. Super-Convergence: Very Fast Training of NeuralNetworks Using Large Learning Rates
  10. Label Distribution Learning
  11. On Loss Functions for Deep Neural Networks in Classification
  12. Training Products of Experts by Minimizing Contrastive Divergence
  13. Selecting weighting factors in logarithmic opinion pools.pdf

best srl:

runs_srl/ADAM_1e-02_FL1.50e+00cst_50b_100d_2l_reluUPOS_FEATS_DEPREL_WVEC_FT100/model_005925154.pt.eval

best mention:

runs_mentions/ADAMv3.32_1e-03_150b_HL_64d_2lUPOS_FEATS_DEPREL_WVEC_FT50/model_030325091.pt

best coref:

On the test set:

|------|------|------|---------|

R P F
muc 47.06 55.12 50.08
b3 82.19 86.48 84.02
ceafe 85.51 81.38 83.18
conll 72.43
------ ------ ------ ---------

(runs_coref/ADAMv4.11_1e-03_50b_HL_100d_2l_WVEC_DEPREL_FT50/model_005840868.pt)

best entity:

On the test set:

|------|------|------|---------|

R P F
muc 47.34 60.30 51.63
b3 82.87 88.95 85.47
ceafe 86.77 80.74 83.36
conll 73.49
------ ------ ------ ---------

(runs_entity/entity_v0.52_30_stat/model_000022850.pt)

try it:

python3 run_stanza.py --output jenj.conll jip_en_janneke.txt python3 postprocess_srl.py --output jenj_srl.conll jenj.conll python3 run_mentions.py --output jenj_mentions.conll jenj_srl.conll python3 run_coref.py --html coref.html --output jenj_coref.conll jenj_mentions.conll python3 run_entity.py --html entity.html --output jenj_entity.conll jenj_mentions.conll google-chrome coref.html entity.html