Knowledge Graph Extraction
We designed a pipeline that extracts a specific kind of knowledge graph: it recognizes a person's name and links that person to their rank, role, title and organization. The pipeline is not expected to be perfect, that is, to recognize every relevant person and exclude every irrelevant one. Rather, it is meant as a first step that reduces the manual work of extracting such knowledge by combing through a large number of documents.
This pipeline consists of two major components: Named Entity Recognition (NER) and Relation Extraction (RE). The NER component uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles and organizations in raw text files. The RE component then relates each name to the corresponding rank, role, title or organization.
Dependencies
TensorFlow 2.2.0
spaCy
NumPy
Install
Package: https://pypi.org/project/extract-sfm/
$ pip install extract_sfm
Usage
Method 1
Create a Python file containing:
import extract_sfm
extract_sfm.extract("/PATH/TO/DIRECTORY/OF/INPUT/FILES")
Then run the Python file. This may take a while to finish.
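A minimal sketch of a wrapper script for Method 1. It assumes that the absolute-path note given for Method 2 applies here as well, which the package documentation does not state. The helper names resolve_input_dir and run_extraction are hypothetical; only extract_sfm.extract comes from the usage above.

```python
from pathlib import Path


def resolve_input_dir(raw_path):
    """Return an absolute path string for an existing directory, or raise."""
    path = Path(raw_path).expanduser().resolve()
    if not path.is_dir():
        raise NotADirectoryError(f"Not a directory: {path}")
    return str(path)


def run_extraction(raw_path):
    """Hypothetical wrapper: validate the directory, then call the package."""
    # Imported here so the helper above can be used without the package installed.
    import extract_sfm

    extract_sfm.extract(resolve_input_dir(raw_path))


# Example call (commented out because it starts the full pipeline):
# run_extraction("/PATH/TO/DIRECTORY/OF/INPUT/FILES")
```

Checking the directory before calling the package makes a mistyped path fail immediately instead of after the models have loaded.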
Method 2
Download this GitHub repository. Under the project root directory, run the Python script:
$ python pipeline.py /PATH/TO/DIRECTORY/OF/INPUT/FILES
Note: Use an absolute path.
Website
- Copy NER_v2, RE, pipeline.py into the "SERVER/KGE" directory
- Install npm dependencies under the "SERVER" directory: express, path, multer
$ npm install <package name>
- Run the server by typing in:
$ node server.js
Note: The path to the website's root directory must not contain any space characters.
NER Documentation
TRAINING
Dataset:
1. SFM starter dataset: https://github.com/security-force-monitor/nlp_starter_dataset
2. CoNLL-2003: https://github.com/guillaumegenthial/tf_ner/tree/master/data/example
3. A set of known organizations from the starter dataset
Note: Title and role were collapsed into one class
Usage:
1) Prepare data
$ python process.py
$ cd SFM_STARTER
$ python build_vocab.py
$ python build_glove.py
$ cd ..
2) Train model
$ python train.py
3) Make predictions
$ python pred.py
4) Evaluate model
$ python eval.py
$ python eval_class.py
Files:
process.py: 1) preprocesses the dataset by recording information in dicts,
which are saved in two pickle files: dataset_labels.pickle, dataset_sentences.pickle
2) converts the SFM starter dataset to a format that the model can use,
written to the files {}.words.txt and {}.tags.txt, where {} is train, valid or test.
pred.py: generates predictions using the trained model
eval.py: evaluates the predictions made by the model, which are generated by running pred.py
eval_class.py: computes precision, recall and F1 score for each class
Other files are from https://github.com/guillaumegenthial/tf_ner
train.py, tf_metrics.py, SFM_STARTER/build_vocab.py, SFM_STARTER/build_glove.py
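As an illustration of what per-class precision, recall and F1 mean, here is a minimal sketch. It is not the code in eval_class.py; it assumes entities are compared as exact (start, end, label) matches, and the function name per_class_scores and the label strings are hypothetical.

```python
from collections import Counter


def per_class_scores(gold, pred):
    """Compute {label: (precision, recall, f1)} from two sets of
    (start, end, label) tuples, counting only exact matches as correct."""
    true_pos = Counter(label for (_, _, label) in gold & pred)
    n_gold = Counter(label for (_, _, label) in gold)
    n_pred = Counter(label for (_, _, label) in pred)
    scores = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        p = true_pos[label] / n_pred[label] if n_pred[label] else 0.0
        r = true_pos[label] / n_gold[label] if n_gold[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores


gold = {(0, 7, "Rank"), (8, 18, "Person"), (20, 30, "Organization"), (40, 50, "Organization")}
pred = {(8, 18, "Person"), (20, 30, "Organization"), (60, 70, "Organization")}
for label, (p, r, f1) in per_class_scores(gold, pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```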
PREDICTING
Usage:
$ python ner.py <doc_id>.txt
File:
ner.py: writes the predictions for a text file in BRAT format.
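For readers unfamiliar with BRAT, its standoff format stores each text-bound entity in the .ann file as a tab-separated line: an ID such as T1, then the label with start and end character offsets, then the covered text. The sketch below shows that layout only. It is not taken from ner.py, and the function name to_brat_lines and the label strings are hypothetical.

```python
def to_brat_lines(text, spans):
    """Format (start, end, label) character spans as BRAT entity lines."""
    lines = []
    for i, (start, end, label) in enumerate(sorted(spans), start=1):
        # ID <TAB> label start end <TAB> covered text
        lines.append(f"T{i}\t{label} {start} {end}\t{text[start:end]}")
    return lines


text = "Colonel John Smith commands the 3rd Brigade."
spans = [(8, 18, "Person"), (0, 7, "Rank")]
for line in to_brat_lines(text, spans):
    print(line)
```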
RE Documentation
jPTDP:
Before running any of the three methods below, run the dependency parser first; some of the methods rely on its output.
Usage: Go to the jPTDP directory and run
$ python fast_parse.py <path_to_txt>.txt
The output is placed alongside the input text file, in a directory with the same name as the text file.
--- METHOD 1: nearest person:
Assign each non-person named entity to the nearest person that is behind that entity.
Usage:
1. To extract relations from a single text file:
(extracted relations will be appended to the .ann file)
$ python relation_np.py <doc_id>.txt <doc_id>.ann
2. To generate annotations for a set of text files under <directory>:
Set "output_dir" in pipeline.sh to <directory> and run:
$ source pipeline.sh
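The nearest-person heuristic can be sketched as follows. This is not the code in relation_np.py. It assumes one reading of "behind", namely the closest person that starts at or after the end of the entity; if the opposite direction is intended, the comparison would be reversed. The function name and label strings are hypothetical.

```python
def nearest_following_person(entities):
    """Pair each non-person entity with the closest person that follows it.

    entities: list of (start, end, label) character spans.
    Returns a list of (entity, person) pairs; entities with no following
    person are left unassigned.
    """
    persons = sorted(e for e in entities if e[2] == "Person")
    pairs = []
    for ent in entities:
        if ent[2] == "Person":
            continue
        following = [p for p in persons if p[0] >= ent[1]]
        if following:
            # persons is sorted by start offset, so the first one is nearest
            pairs.append((ent, following[0]))
    return pairs


entities = [
    (0, 7, "Rank"),
    (8, 18, "Person"),
    (32, 43, "Organization"),
    (50, 60, "Person"),
    (70, 80, "Organization"),
]
for ent, person in nearest_following_person(entities):
    print(ent, "->", person)
```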
--- METHOD 2: dependency parsing
Assign each non-person named entity to the closest person, where distance is the length of the dependency path between the entity and the person.
Constraint: Restricting the choice to the two persons that appear immediately to the left and right of the entity could improve the results, but it also has obvious drawbacks.
Usage:
1. To extract relations from a single text file:
(extracted relations will be appended to the .ann file)
$ python relation_dep.py <jPTDP_buffer_path> <doc_id>.txt <doc_id>.ann
2. To generate annotations for a set of text files under <directory>:
$ source pipeline.sh <directory>
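The distance used by this method can be illustrated with a short sketch. It is not the code in relation_dep.py. It assumes the parse is available as a list of 0-based head indices with -1 marking the single root, which may differ from the layout the parser actually writes. Each entity is reduced to one token index for simplicity, and the function names are hypothetical.

```python
def path_length(heads, a, b):
    """Number of edges between tokens a and b in a dependency tree.

    heads[i] is the index of token i's head, or -1 for the root.
    Assumes a single-rooted tree, so the two head chains always meet.
    """
    depth_from_a = {}
    node, depth = a, 0
    while node != -1:  # record every ancestor of a with its distance
        depth_from_a[node] = depth
        node, depth = heads[node], depth + 1
    node, depth = b, 0
    while node not in depth_from_a:  # climb from b to the first shared ancestor
        node, depth = heads[node], depth + 1
    return depth_from_a[node] + depth


def closest_person(heads, entity_idx, person_idxs):
    """Return the person token with the shortest dependency path to the entity."""
    return min(person_idxs, key=lambda p: path_length(heads, entity_idx, p))


# Tokens: 0 Colonel, 1 Smith, 2 met, 3 General, 4 Jones
heads = [1, 2, -1, 4, 2]
print(closest_person(heads, 0, [1, 4]))  # person for "Colonel"
print(closest_person(heads, 3, [1, 4]))  # person for "General"
```

The constraint mentioned above would correspond to passing only the two neighbouring persons as person_idxs.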
--- METHOD 3: neural networks
Use the dependency path and its length as features to predict which person in the sentence is the best match.
The best model is saved in "model_86.h5"
Usage:
Predictions are made on the files in "pred_path" and are written in place. "pred_path" can be set in config.py.
$ python pred.py