seqnereval: NER Model Evaluator

seqnereval is a Python module that allows you to efficiently perform extensive error analysis on your NER models. It allows you to:

Check what types of errors were made by the model.
Find the exact entities that were misclassified or missed.
Get the context of these errors.

A key motivation for writing this module was to provide an easier and faster way of evaluating NER models. It was inspired by existing NER evaluation packages and was designed with performance in mind, so you can get your results faster than with most of them.

Installation

To install, run:

pip install seqnereval

Usage

from seqnereval import NERTagListEvaluator

# list of lists of tokens for different docs
tokens_lists = [
    ['The', 'John', 'Doe\'s', 'Basketball', 'Club'], # Doc 1
    ['The', 'Canada', 'Place', 'is', 'best', '.'], # Doc 2
    ['Other', 'John', 'is', 'a', 'good', 'person', '.'], # Doc 3
    ['John', 'Doe', 'Jenny', 'Doe', '_', '_'], # Doc 4
]

# list of lists of predicted tags for different docs
predicted_tag_lists = [
    ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"], # Doc 1
    ["O", "B-LOC", "I-LOC", "O", "O", "O"], # Doc 2
    ["O", "U-PER", "O", "O", "O", "O", "O"], # Doc 3
    ["B-PER", "I-PER", "B-PER", "I-PER", "O", "O"], # Doc 4
]

# list of lists of golden/true tags for different docs
gold_tag_lists = [
    ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"], # Doc 1
    ["O", "B-LOC", "I-LOC", "O", "O", "O"], # Doc 2
    ["O", "U-PER", "O", "O", "O", "O", "O"], # Doc 3
    ["B-PER", "I-PER", "B-PER", "I-PER", "O", "O"], # Doc 4
]

    
evaluator = NERTagListEvaluator(tokens_lists, gold_tag_lists, predicted_tag_lists, 2)
result, results_by_tags = evaluator.evaluate()
# See the next section (Extracting and Understanding the Results) for how to
# use the result object to get more information.

# For example, the results can be summarized as follows:
print(result.summarize_result())
"""
OUTPUT (illustrative counts from a larger dataset, not from the four documents above):

{'strict_match': {'correct': 34926,
  'incorrect': 23323,
  'partial': 0,
  'missed': 7319,
  'spurious': 6002,
  'possible': 65568,
  'actual': 64251,
  'precision': 0.5435868702432647,
  'recall': 0.5326683748169839,
  'f1': 0.5380722390405103},
 'type_match': {'correct': 42283,
  'incorrect': 15966,
  'partial': 0,
    .
    .
    .
 'partial_match': {'correct': 41668,
    .
    .
    .
 'bounds_match': {'correct': 41668,
    .
    .
    .
}
"""
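The tag lists above use `B-`, `I-` and `U-` prefixes, and the results in the next section are reported as token spans with inclusive start and end indices. As a rough illustration of how a tag list maps to such spans, here is a minimal sketch. `tags_to_spans` is a hypothetical helper written for this example, not part of seqnereval, and its handling of `L-` tags and of inconsistent `I-` tags (which it drops) is an assumption rather than the library's documented behaviour.

```python
from typing import List, Optional, Tuple

# (entity type, start index, end index); the end index is inclusive
Span = Tuple[str, int, int]


def tags_to_spans(tags: List[str]) -> List[Span]:
    """Collect entity spans from one document's tag list (hypothetical helper)."""
    spans: List[Span] = []
    start: Optional[int] = None   # start index of the currently open entity
    label: Optional[str] = None   # type of the currently open entity

    def close(end: int) -> None:
        # Store the open entity (if any) as ending at `end`, then reset.
        nonlocal start, label
        if start is not None:
            spans.append((label, start, end))
        start, label = None, None

    for i, tag in enumerate(tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "U":                      # single-token entity
            close(i - 1)
            spans.append((entity, i, i))
        elif prefix == "B":                    # a new entity starts here
            close(i - 1)
            start, label = i, entity
        elif prefix in ("I", "L") and start is not None and entity == label:
            if prefix == "L":                  # explicit last token of the entity
                close(i)
        else:                                  # "O" or an inconsistent continuation
            close(i - 1)
    close(len(tags) - 1)
    return spans


# Doc 1 and Doc 3 from the usage example above
print(tags_to_spans(["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]))
print(tags_to_spans(["O", "U-PER", "O", "O", "O", "O", "O"]))
```

Keeping the end index inclusive matches the `Token Span IDX` values shown in the outputs later in this document.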

Extracting and Understanding the Results

seqnereval identifies the errors made by an NER model while tagging the entities in a sequence and sorts each gold/predicted entity pair into one of the following six categories:

Type 1. Entity Type and Span match

Token	Gold	Prediction
Vancouver	B-LOC	B-LOC
Island	I-LOC	I-LOC
is	O	O
the	O	O

Type 2. Predicted entity is not an entity according to the gold dataset

Token	Gold	Prediction
is	O	O
an	O	B-PER
extremely	O	I-PER
desireable	O	O

Type 3. Entity is not predicted by the system

Token	Gold	Prediction
Vancouver	B-LOC	O
Island	I-LOC	O
is	O	O
the	O	O

Type 4. Entity type is wrong but the span is correct

Token	Gold	Prediction
I	O	O
live	O	O
in	O	O
Palo	B-LOC	B-ORG
Alto	I-LOC	I-ORG
,	O	O

Type 5. System gets the boundaries of the surface string wrong

Token	Gold	Prediction
Unless	O	B-PER
Karl	B-PER	I-PER
Smith	I-PER	I-PER
resigns	O	O

Type 6. System gets the boundaries and entity type wrong

Token	Gold	Prediction
Unless	O	B-ORG
Karl	B-PER	I-ORG
Smith	I-PER	I-ORG
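The six categories above can be summarized as a decision on two questions: do the entity types match, and do the spans match exactly or only overlap? The following is a minimal sketch of that decision, not seqnereval's implementation. `categorize` is a hypothetical function, spans are assumed to be `(entity type, start, end)` tuples with inclusive token indices, and the returned strings reuse the attribute names shown later in this section.

```python
from typing import Optional, Tuple

# (entity type, start index, end index); the end index is inclusive
Span = Tuple[str, int, int]


def categorize(gold: Optional[Span], pred: Optional[Span]) -> str:
    """Name the category for one gold/predicted pair (hypothetical helper)."""
    if gold is None and pred is None:
        raise ValueError("at least one of gold or pred is required")
    if gold is None:
        return "unecessary_predicted_entity"      # Type 2
    if pred is None:
        return "missed_gold_entity"               # Type 3

    same_type = gold[0] == pred[0]
    same_span = gold[1:] == pred[1:]
    overlap = gold[1] <= pred[2] and pred[1] <= gold[2]
    if not overlap:
        # Assumption: non-overlapping spans are not a pair at all; they would
        # be counted separately as one missed and one unnecessary entity.
        raise ValueError("spans do not overlap")

    if same_span:
        # Type 1 or Type 4
        return "type_match_span_match" if same_type else "type_mismatch_span_match"
    # Type 5 or Type 6
    return "type_match_span_partial" if same_type else "type_mismatch_span_partial"


# "Unless Karl Smith resigns": gold PER covers tokens 1-2, prediction covers 0-2
print(categorize(("PER", 1, 2), ("PER", 0, 2)))  # Type 5
print(categorize(("PER", 1, 2), ("ORG", 0, 2)))  # Type 6
```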

The predicted entities that fall into each of these categories, together with their corresponding gold entities where applicable, can be obtained as follows:

...
...
evaluator = NERTagListEvaluator(
    # list of lists of tokens,
    # e.g. [[tokens for doc 1..], [tokens for doc 2..], ...]
    list_of_token_lists,
    # list of lists of gold tags,
    # e.g. [[gold tags for doc 1..], [gold tags for doc 2..], ...]
    list_of_gold_tag_lists,
    # list of lists of predicted tags,
    # e.g. [[predicted tags for doc 1..], [predicted tags for doc 2..], ...]
    list_of_predicted_tag_lists,
)
results, results_by_tags = evaluator.evaluate()

print(results.type_match_span_match)

"""
OUTPUT:
  
[
    {Gold: (Entity Type: "T103", Token Span IDX:(0, 1), Tokens:['Nonylphenol', 'diethoxylate'], Context:['Nonylphenol', 'diethoxylate', 'inhibits', 'apoptosis']), 
    Predicted: (Entity Type: "T103", Token Span IDX:(0, 1), Tokens:['Nonylphenol', 'diethoxylate'], Context:['Nonylphenol', 'diethoxylate', 'inhibits', 'apoptosis'])}, 
    
    {Gold: (Entity Type: "T038", Token Span IDX:(3, 3), Tokens:['apoptosis'], Context:['diethoxylate', 'inhibits', 'apoptosis', 'induced', 'in']), 
    Predicted: (Entity Type: "T038", Token Span IDX:(3, 3), Tokens:['apoptosis'], Context:['diethoxylate', 'inhibits', 'apoptosis', 'induced', 'in'])}, 
    
    {Gold: (Entity Type: "T169", Token Span IDX:(4, 4), Tokens:['induced'], Context:['inhibits', 'apoptosis', 'induced', 'in', 'PC12']), 
    Predicted: (Entity Type: "T169", Token Span IDX:(4, 4), Tokens:['induced'], Context:['inhibits', 'apoptosis', 'induced', 'in', 'PC12'])}
    .
    .
    .
]
"""

# Entities in the other categories can be accessed in the same way:

print(results.unecessary_predicted_entity) # Type 2
print(results.missed_gold_entity) # Type 3
print(results.type_mismatch_span_match) # Type 4
print(results.type_match_span_partial) # Type 5
print(results.type_mismatch_span_partial) # Type 6

The following five metrics are used to account for the different categories of errors:

Error type	Explanation
Correct (COR)	both are the same
Incorrect (INC)	the output of a system and the golden annotation don’t match
Partial (PAR)	system and the golden annotation are somewhat “similar” but not the same
Missing (MIS)	a golden annotation is not captured by a system
Spurious (SPU)	system produces a response which doesn’t exist in the golden annotation

These metrics are measured under the following four evaluation schemas:

Evaluation schema	Explanation
Strict Match	exact boundary surface string match and entity type
Bounds Match	exact boundary match over the surface string, regardless of the type
Partial Match	partial boundary match over the surface string, regardless of the type
Type Match	some overlap between the system tagged entity and the gold annotation is required

These five metrics and four evaluation schemas interact in the following ways:

Scenario	Gold entity	Gold string	Pred entity	Pred string	Type Match	Partial Match	Bounds Match	Strict Match
I	PER	John	PER	John	COR	COR	COR	COR
II			LOC	extreme	SPU	SPU	SPU	SPU
III	LOC	Germany			MIS	MIS	MIS	MIS
IV	LOC	vancouver island	ORG	vancouver island	INC	COR	COR	INC
V	LOC	Detroit	LOC	in Detroit	COR	PAR	INC	INC
VI	LOC	Detroit	ORG	in Detroit	INC	PAR	INC	INC
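The scenario table above can be read as a small rule set: a missing side gives SPU or MIS under every schema, and otherwise each schema looks only at the type, only at the boundaries, or at both. The sketch below encodes those rules for a single pair. `score_pair` is a hypothetical function, not seqnereval's API; it assumes spans are `(entity type, start, end)` tuples with inclusive token indices and that a gold/predicted pair overlaps by at least one token.

```python
from typing import Dict, Optional, Tuple

# (entity type, start index, end index); the end index is inclusive
Span = Tuple[str, int, int]

SCHEMAS = ("type_match", "partial_match", "bounds_match", "strict_match")


def score_pair(gold: Optional[Span], pred: Optional[Span]) -> Dict[str, str]:
    """Return COR/INC/PAR/MIS/SPU per schema for one pair (hypothetical helper)."""
    if gold is None and pred is None:
        raise ValueError("at least one of gold or pred is required")
    if gold is None:                       # scenario II: nothing in gold
        return dict.fromkeys(SCHEMAS, "SPU")
    if pred is None:                       # scenario III: nothing predicted
        return dict.fromkeys(SCHEMAS, "MIS")
    if not (gold[1] <= pred[2] and pred[1] <= gold[2]):
        raise ValueError("spans do not overlap")  # assumption: not a pair

    same_type = gold[0] == pred[0]
    same_span = gold[1:] == pred[1:]
    return {
        # the type must agree; overlapping boundaries are enough
        "type_match": "COR" if same_type else "INC",
        # the type is ignored; inexact boundaries earn partial credit
        "partial_match": "COR" if same_span else "PAR",
        # the type is ignored; boundaries must be exact
        "bounds_match": "COR" if same_span else "INC",
        # both the type and the boundaries must be exact
        "strict_match": "COR" if same_type and same_span else "INC",
    }


# Scenario V: gold LOC "Detroit" (token 1), predicted LOC "in Detroit" (tokens 0-1)
print(score_pair(("LOC", 1, 1), ("LOC", 0, 1)))
```

Summing these outcomes over all pairs in a corpus gives the per-schema counts of correct, incorrect, partial, missed and spurious entities.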

The entity spans falling into each of these categories can be obtained as follows:

...
...
evaluator = NERTagListEvaluator(
    # list of lists of tokens,
    # e.g. [[tokens for doc 1..], [tokens for doc 2..], ...]
    list_of_token_lists,
    # list of lists of gold tags,
    # e.g. [[gold tags for doc 1..], [gold tags for doc 2..], ...]
    list_of_gold_tag_lists,
    # list of lists of predicted tags,
    # e.g. [[predicted tags for doc 1..], [predicted tags for doc 2..], ...]
    list_of_predicted_tag_lists,
)

results, results_by_tags = evaluator.evaluate()

# Strict Match
print(results.strict_match["correct"])
print(results.strict_match["incorrect"])
print(results.strict_match["missed"])
print(results.strict_match["spurious"])

print(results.strict_match["precision"])
print(results.strict_match["recall"])
print(results.strict_match["f1"])

# Type Match
print(results.type_match["correct"])
print(results.type_match["incorrect"])
print(results.type_match["missed"])
print(results.type_match["spurious"])

print(results.type_match["precision"])
print(results.type_match["recall"])
print(results.type_match["f1"])

# Partial Match
print(results.partial_match["correct"])
print(results.partial_match["incorrect"])
print(results.partial_match["missed"])
print(results.partial_match["spurious"])

print(results.partial_match["precision"])
print(results.partial_match["recall"])
print(results.partial_match["f1"])

# Bounds/Exact Match
print(results.bounds_match["correct"])
print(results.bounds_match["incorrect"])
print(results.bounds_match["missed"])
print(results.bounds_match["spurious"])

print(results.bounds_match["precision"])
print(results.bounds_match["recall"])
print(results.bounds_match["f1"])

Precision, recall and F1 score are calculated for each evaluation schema as follows:

For Strict Match and Bounds Match

Precision = (COR / ACT) = TP / (TP + FP)
Recall = (COR / POS) = TP / (TP+FN)

For Partial Match and Type Match

Precision = (COR + 0.5 × PAR) / ACT = TP / (TP + FP)
Recall = (COR + 0.5 × PAR) / POS = TP / (TP + FN)

where:

POSSIBLE (POS) = COR + INC + PAR + MIS = TP + FN
ACTUAL (ACT) = COR + INC + PAR + SPU = TP + FP
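As a worked example of the formulas above, the following sketch recomputes precision, recall and F1 from raw counts. `precision_recall_f1` is a hypothetical helper, not part of seqnereval, and it assumes F1 is the usual harmonic mean of precision and recall. The counts are the `strict_match` figures from the summary output in the Usage section, and the results agree with the values printed there to within floating-point rounding.

```python
from typing import Dict


def precision_recall_f1(
    correct: int,
    incorrect: int,
    partial: int,
    missed: int,
    spurious: int,
    partial_credit: bool = False,
) -> Dict[str, float]:
    """Compute metrics from outcome counts (hypothetical helper).

    Set partial_credit=True for the Partial Match and Type Match schemas,
    where each partial match counts as half a correct match.
    """
    possible = correct + incorrect + partial + missed    # POS = TP + FN
    actual = correct + incorrect + partial + spurious    # ACT = TP + FP
    credit = correct + (0.5 * partial if partial_credit else 0.0)

    precision = credit / actual if actual else 0.0
    recall = credit / possible if possible else 0.0
    total = precision + recall
    # Assumption: F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "possible": possible,
        "actual": actual,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# strict_match counts taken from the summary output in the Usage section
strict = precision_recall_f1(
    correct=34926, incorrect=23323, partial=0, missed=7319, spurious=6002
)
print(strict)
```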

References

seqnereval draws heavily on Segura-Bedmar, I., Martínez, P., & Herrero-Zazo, M. (2013). SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). SemEval, 2, 341–350. It was inspired by nerevaluate and is designed to be significantly faster, easier to understand and extend, and to provide more granular insight into the nature of the errors made by the model.

seqnereval
Release 0.0.1
