seqnereval

Evaluate NER models faster and more easily.


Keywords
error-analysis, evaluator, metrics, named-entity-recognition, nlp, semeval, semeval-2013
License
GPL-3.0
Install
pip install seqnereval==0.0.1

Documentation

seqnereval: NER Model Evaluator


seqnereval is a Python module that allows you to efficiently perform extensive error analysis on your NER models. It allows you to:

  • Check what types of errors were made by the model.
  • Find the exact entities that were misclassified or missed.
  • Get the context of these errors.

One of the key motivations behind writing this module was to provide an easier and more efficient way of evaluating NER models. It was inspired by existing NER model evaluation packages and was designed with performance in mind, so you can get your results faster than with most existing NER evaluation packages.

Installation

To install, simply execute:

pip install seqnereval

Usage

from seqnereval import NERTagListEvaluator

# list of lists of tokens for different docs
tokens_lists = [
    ['The', 'John', 'Doe\'s', 'Basketball', 'Club'], # Doc 1
    ['The', 'Canada', 'Place', 'is', 'best', '.'], # Doc 2
    ['Other', 'John', 'is', 'a', 'good', 'person', '.'], # Doc 3
    ['John', 'Doe', 'Jenny', 'Doe', '_', '_'], # Doc 4
]

# list of lists of predicted tags for different docs
predicted_tag_lists = [
    ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"], # Doc 1
    ["O", "B-LOC", "I-LOC", "O", "O", "O"], # Doc 2
    ["O", "U-PER", "O", "O", "O", "O", "O"], # Doc 3
    ["B-PER", "I-PER", "B-PER", "I-PER", "O", "O"], # Doc 4
]

# list of lists of golden/true tags for different docs
gold_tag_lists = [
    ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"], # Doc 1
    ["O", "B-LOC", "I-LOC", "O", "O", "O"], # Doc 2
    ["O", "U-PER", "O", "O", "O", "O", "O"], # Doc 3
    ["B-PER", "I-PER", "B-PER", "I-PER", "O", "O"], # Doc 4
]

    
evaluator = NERTagListEvaluator(tokens_lists, gold_tag_lists, predicted_tag_lists, 2)
result, results_by_tags = evaluator.evaluate()
# Refer to the next section (Extracting and Understanding the Results) to see
# how to use the returned result object to get more information.

# For example, the results can be summarized as follows:
print(result.summarize_result())
"""
OUTPUT: 

{'strict_match': {'correct': 34926,
  'incorrect': 23323,
  'partial': 0,
  'missed': 7319,
  'spurious': 6002,
  'possible': 65568,
  'actual': 64251,
  'precision': 0.5435868702432647,
  'recall': 0.5326683748169839,
  'f1': 0.5380722390405103},
 'type_match': {'correct': 42283,
  'incorrect': 15966,
  'partial': 0,
    .
    .
    .
 'partial_match': {'correct': 41668,
    .
    .
    .
 'bounds_match': {'correct': 41668,
    .
    .
    .
}
"""

Extracting and Understanding the Results

seqnereval identifies the errors made by an NER model while tagging the entities in a sequence and classifies them into the following 6 categories (a tag-level sketch of one of these cases follows the tables below):

Type 1. Entity Type and Span match

| Token | Gold | Prediction |
|---|---|---|
| Vancouver | B-LOC | B-LOC |
| Island | I-LOC | I-LOC |
| is | O | O |
| the | O | O |

Type 2. Predicted entity is not an entity according to the golden dataset

| Token | Gold | Prediction |
|---|---|---|
| is | O | O |
| an | O | B-PER |
| extremely | O | I-PER |
| desirable | O | O |

Type 3. Entity is not predicted by the system

| Token | Gold | Prediction |
|---|---|---|
| Vancouver | B-LOC | O |
| Island | I-LOC | O |
| is | O | O |
| the | O | O |

Type 4. Entity type is wrong but the span is correct

| Token | Gold | Prediction |
|---|---|---|
| I | O | O |
| live | O | O |
| in | O | O |
| Palo | B-LOC | B-ORG |
| Alto | I-LOC | I-ORG |
| , | O | O |

Type 5. System gets the boundaries of the surface string wrong

| Token | Gold | Prediction |
|---|---|---|
| Unless | O | B-PER |
| Karl | B-PER | I-PER |
| Smith | I-PER | I-PER |
| resigns | O | O |

Type 6. System gets the boundaries and entity type wrong

| Token | Gold | Prediction |
|---|---|---|
| Unless | O | B-ORG |
| Karl | B-PER | I-ORG |
| Smith | I-PER | I-ORG |
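
For concreteness, here is a minimal sketch of how one of these cases (Type 5, a boundary error) maps onto the token and tag lists that the evaluator consumes; the tags are taken from the table above, and the resulting span pair would show up under results.type_match_span_partial (see below).

from seqnereval import NERTagListEvaluator

# Type 5 from the table above: the entity type is right, but the predicted
# span wrongly includes the token "Unless".
tokens = [['Unless', 'Karl', 'Smith', 'resigns']]
gold_tags = [['O', 'B-PER', 'I-PER', 'O']]
predicted_tags = [['B-PER', 'I-PER', 'I-PER', 'O']]

evaluator = NERTagListEvaluator(tokens, gold_tags, predicted_tags)
results, results_by_tags = evaluator.evaluate()

# This gold/predicted pair should appear in results.type_match_span_partial.
print(results.type_match_span_partial)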

Predicted entities and their corresponding gold/true entities (if applicable) that fall into each of these categories can be obtained as follows:

...
...
evaluator = NERTagListEvaluator(
                    # list of lists of tokens,
                    # e.g. [[tokens for doc 1..], [tokens for doc 2..], ...]
                    list_of_token_lists,
                    # list of lists of gold tags,
                    # e.g. [[gold tags for doc 1..], [gold tags for doc 2..], ...]
                    list_of_gold_tag_lists,
                    # list of lists of predicted tags,
                    # e.g. [[predicted tags for doc 1..], [predicted tags for doc 2..], ...]
                    list_of_predicted_tag_lists
                )
results, results_by_tags = evaluator.evaluate()

print(results.type_match_span_match)

"""
OUTPUT:
  
[
    {Gold: (Entity Type: "T103", Token Span IDX:(0, 1), Tokens:['Nonylphenol', 'diethoxylate'], Context:['Nonylphenol', 'diethoxylate', 'inhibits', 'apoptosis']), 
    Predicted: (Entity Type: "T103", Token Span IDX:(0, 1), Tokens:['Nonylphenol', 'diethoxylate'], Context:['Nonylphenol', 'diethoxylate', 'inhibits', 'apoptosis'])}, 
    
    {Gold: (Entity Type: "T038", Token Span IDX:(3, 3), Tokens:['apoptosis'], Context:['diethoxylate', 'inhibits', 'apoptosis', 'induced', 'in']), 
    Predicted: (Entity Type: "T038", Token Span IDX:(3, 3), Tokens:['apoptosis'], Context:['diethoxylate', 'inhibits', 'apoptosis', 'induced', 'in'])}, 
    
    {Gold: (Entity Type: "T169", Token Span IDX:(4, 4), Tokens:['induced'], Context:['inhibits', 'apoptosis', 'induced', 'in', 'PC12']), 
    Predicted: (Entity Type: "T169", Token Span IDX:(4, 4), Tokens:['induced'], Context:['inhibits', 'apoptosis', 'induced', 'in', 'PC12'])}
    .
    .
    .
]
"""

# Similarly, the entities in the other categories can be accessed as follows:

print(results.unecessary_predicted_entity) # Type 2
print(results.missed_gold_entity) # Type 3
print(results.type_mismatch_span_match) # Type 4
print(results.type_match_span_partial) # Type 5
print(results.type_mismatch_span_partial) # Type 6
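
Since each of these attributes prints as a list of gold/predicted span pairs (see the output above), a quick per-category error count can be obtained with len(). This is a minimal sketch that assumes the attributes are ordinary Python sequences:

# Count the errors in each category (assumes the attributes are list-like,
# as the printed output above suggests).
error_counts = {
    "Type 1 (type and span match)": len(results.type_match_span_match),
    "Type 2 (spurious prediction)": len(results.unecessary_predicted_entity),
    "Type 3 (missed gold entity)": len(results.missed_gold_entity),
    "Type 4 (wrong type, right span)": len(results.type_mismatch_span_match),
    "Type 5 (right type, partial span)": len(results.type_match_span_partial),
    "Type 6 (wrong type, partial span)": len(results.type_mismatch_span_partial),
}
print(error_counts)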

The following five error categories are used to classify the differences between the system output and the golden annotation:

| Error type | Explanation |
|---|---|
| Correct (COR) | the system output and the golden annotation are the same |
| Incorrect (INC) | the output of the system and the golden annotation don't match |
| Partial (PAR) | the system output and the golden annotation are somewhat "similar" but not the same |
| Missing (MIS) | a golden annotation is not captured by the system |
| Spurious (SPU) | the system produces a response which doesn't exist in the golden annotation |

These categories are measured under the following four evaluation schemas:

| Evaluation schema | Explanation |
|---|---|
| Strict Match | exact boundary match over the surface string and exact entity type match |
| Bounds Match | exact boundary match over the surface string, regardless of the type |
| Partial Match | partial boundary match over the surface string, regardless of the type |
| Type Match | some overlap between the system-tagged entity and the gold annotation is required, and the entity type must match |

These five error categories and four evaluation schemas interact in the following ways:

| Scenario | Gold entity | Gold string | Pred entity | Pred string | Type Match | Partial Match | Bounds Match | Strict Match |
|---|---|---|---|---|---|---|---|---|
| I | PER | John | PER | John | COR | COR | COR | COR |
| II | | | LOC | extreme | SPU | SPU | SPU | SPU |
| III | LOC | Germany | | | MIS | MIS | MIS | MIS |
| IV | LOC | vancouver island | ORG | vancouver island | INC | COR | COR | INC |
| V | LOC | Detroit | LOC | in Detroit | COR | PAR | INC | INC |
| VI | LOC | Detroit | ORG | in Detroit | INC | PAR | INC | INC |
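
To make the table above concrete, here is a small, library-independent sketch of how a single gold/predicted span pair maps onto the categories under each schema. The function and names below are illustrative only and are not part of the seqnereval API; it assumes overlapping pairs have already been matched, with unmatched gold and predicted spans passed in as None.

# Illustrative sketch only; not part of the seqnereval API.
# gold/pred are (entity_type, (start, end)) tuples, or None when absent.
def classify(gold, pred):
    schemas = ["strict_match", "bounds_match", "partial_match", "type_match"]
    if gold is None:   # Scenario II: prediction with no gold counterpart
        return dict.fromkeys(schemas, "SPU")
    if pred is None:   # Scenario III: gold entity that was never predicted
        return dict.fromkeys(schemas, "MIS")

    (gold_type, gold_span), (pred_type, pred_span) = gold, pred
    exact_span = gold_span == pred_span
    same_type = gold_type == pred_type

    return {
        "strict_match": "COR" if exact_span and same_type else "INC",
        "bounds_match": "COR" if exact_span else "INC",
        "partial_match": "COR" if exact_span else "PAR",
        "type_match": "COR" if same_type else "INC",
    }

# Scenario V: gold LOC "Detroit", predicted LOC "in Detroit" (partial overlap).
print(classify(("LOC", (1, 1)), ("LOC", (0, 1))))
# {'strict_match': 'INC', 'bounds_match': 'INC', 'partial_match': 'PAR', 'type_match': 'COR'}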

The entity spans falling into each of these categories can be obtained as follows:

...
...
evaluator = NERTagListEvaluator(
                    # list of lists of tokens,
                    # e.g. [[tokens for doc 1..], [tokens for doc 2..], ...]
                    list_of_token_lists,
                    # list of lists of gold tags,
                    # e.g. [[gold tags for doc 1..], [gold tags for doc 2..], ...]
                    list_of_gold_tag_lists,
                    # list of lists of predicted tags,
                    # e.g. [[predicted tags for doc 1..], [predicted tags for doc 2..], ...]
                    list_of_predicted_tag_lists
                )

results, results_by_tags = evaluator.evaluate()

# Strict Match
print(results.strict_match["correct"])
print(results.strict_match["incorrect"])
print(results.strict_match["missed"])
print(results.strict_match["spurious"])

print(results.strict_match["precision"])
print(results.strict_match["recall"])
print(results.strict_match["f1"])

# Type Match
print(results.type_match["correct"])
print(results.type_match["incorrect"])
print(results.type_match["missed"])
print(results.type_match["spurious"])

print(results.type_match["precision"])
print(results.type_match["recall"])
print(results.type_match["f1"])

# Partial Match
print(results.partial_match["correct"])
print(results.partial_match["incorrect"])
print(results.partial_match["missed"])
print(results.partial_match["spurious"])

print(results.partial_match["precision"])
print(results.partial_match["recall"])
print(results.partial_match["f1"])

# Bounds/Exact Match
print(results.bounds_match["correct"])
print(results.bounds_match["incorrect"])
print(results.bounds_match["missed"])
print(results.bounds_match["spurious"])

print(results.bounds_match["precision"])
print(results.bounds_match["recall"])
print(results.bounds_match["f1"])

Precision, recall, and F1-score are calculated for each evaluation schema as follows:

For Strict Match and Bounds Match

Precision = (COR / ACT) = TP / (TP + FP)
Recall = (COR / POS) = TP / (TP+FN)

For Partial Match and Type Match

Precision = (COR + 0.5 × PAR) / ACT
Recall = (COR + 0.5 × PAR) / POS

where:

POSSIBLE (POS) = COR + INC + PAR + MIS = TP + FN
ACTUAL (ACT) = COR + INC + PAR + SPU = TP + FP
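
As a quick sanity check, the strict-match numbers from the sample summary output earlier in this README can be reproduced directly from these formulas (F1 is computed as the usual harmonic mean of precision and recall):

# Strict-match counts from the sample summarize_result() output above.
cor, inc, par, mis, spu = 34926, 23323, 0, 7319, 6002

possible = cor + inc + par + mis              # 65568
actual = cor + inc + par + spu                # 64251

precision = cor / actual                      # 0.5435868702432647
recall = cor / possible                       # 0.5326683748169839
f1 = 2 * precision * recall / (precision + recall)  # ~0.53807

print(possible, actual, precision, recall, f1)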

References

seqnereval draws heavily on Segura-Bedmar, I., & Martínez, P. (2013). SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). SemEval, 2(DDIExtraction), 341–350. It was inspired by nerevaluate and is designed to be significantly faster, easier to understand and extend, and to provide more granular insights into the nature of the errors made by the model.