
Knowledge Graph Extraction for SFM dataset

pip install extract-sfm==1.0.1


Knowledge Graph Extraction

We designed a pipeline that extracts a special kind of knowledge graph: it recognizes a person's name and relates that person's rank, role, title and organization to it. The pipeline is not expected to be perfect, that is, to recognize every relevant person and exclude every irrelevant one. Rather, it is a first step toward reducing the workload of manually extracting such knowledge by combing through a large number of documents.

This pipeline consists of two major components: Named Entity Recognition and Relation Extraction. Named Entity Recognition uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles and organizations in raw text files. Relation Extraction then relates each name to the corresponding rank, role, title or organization.

Example: Example


TensorFlow 2.2.0


$ pip install extract_sfm


Method 1

Create a Python file and write:

import extract_sfm


Then run the Python file. This may take a while to finish.

Method 2

Download this GitHub repository. Under the project root directory, run the Python script:

$ python pipeline.py /PATH/TO/DIRECTORY/OF/INPUT/FILES

Note: Use an absolute path.
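
Because the script expects an absolute path, a small wrapper can resolve the path before calling it. The sketch below is not part of the repository: the helper name `build_pipeline_command` and the example directory are hypothetical, and it assumes pipeline.py is in the current working directory.

```python
import sys
from pathlib import Path


def build_pipeline_command(input_dir, script="pipeline.py"):
    """Return the argument list for running pipeline.py on input_dir.

    The directory is converted to an absolute path first.
    """
    absolute_dir = Path(input_dir).expanduser().resolve()
    return [sys.executable, script, str(absolute_dir)]


if __name__ == "__main__":
    # Hypothetical relative input directory; replace it with your own.
    print(build_pipeline_command("data/input_files"))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the project root directory.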


  1. Copy NER_v2, RE, pipeline.py into the "SERVER/KGE" directory
  2. Install npm dependencies under the "SERVER" directory: express, path, multer
     $ npm install <package name>
  3. Run the server by typing in:
     $ node server.js

Note: The path to the website's root directory must not contain any space characters.


NER Documentation

    1. SFM starter dataset: https://github.com/security-force-monitor/nlp_starter_dataset
    2. CONLL2003: https://github.com/guillaumegenthial/tf_ner/tree/master/data/example
    3. A set of known organizations from the starter dataset
    Note: Title and role were collapsed into one class

    1) Prepare data
      $ python process.py
      $ cd SFM_STARTER
      $ python build_vocab.py
      $ python build_glove.py
      $ cd ..

    2) Train model
      $ python train.py

    3) Make predictions
      $ python pred.py

    4) Evaluate model
      $ python eval.py
      $ python eval_class.py

    process.py: 1) preprocesses the dataset by recording information in dicts,
                      which are saved in two pickle files: dataset_labels.pickle and dataset_sentences.pickle
                2) converts the SFM starter dataset to a format that can be used by the model,
                      stored in the files {}.words.txt and {}.tags.txt, where {} is train, valid or test
    pred.py: generates predictions using the trained model
    eval.py: evaluates the predictions generated by pred.py
    eval_class.py: computes precision, recall and F1 score for each class
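
For reference, per-class precision, recall and F1 can be computed from gold and predicted tags roughly as follows. This is an illustrative token-level sketch, not the code in eval_class.py; the function name and the tag values used to exercise it are assumptions.

```python
from collections import Counter


def per_class_scores(gold_tags, pred_tags):
    """Compute token-level precision, recall and F1 for each class."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted sequences must have equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_tags, pred_tags):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # the predicted class was wrong
            fn[gold] += 1  # the gold class was missed
    scores = {}
    for label in set(gold_tags) | set(pred_tags):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Entity-level evaluation, where a multi-token span counts only if every token matches, would be stricter than this token-level version.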

    Other files are from https://github.com/guillaumegenthial/tf_ner
      train.py, tf_metrics.py, SFM_STARTER/build_vocab.py, SFM_STARTER/build_glove.py

    $ python ner.py <doc_id>.txt

    ner.py: produces BRAT-format predictions for a text file.
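
In BRAT standoff format, each entity is a line consisting of an ID, a tab, the type with start and end character offsets, a tab, and the covered text. The sketch below shows how such lines can be produced; it is not taken from ner.py, and the function name and the labels "Rank" and "Person" are assumptions used for illustration.

```python
def to_brat_entities(text, spans):
    """Format (start, end, label) character spans as BRAT entity lines."""
    lines = []
    for index, (start, end, label) in enumerate(spans, start=1):
        mention = text[start:end]
        # ID <TAB> TYPE START END <TAB> covered text
        lines.append(f"T{index}\t{label} {start} {end}\t{mention}")
    return lines


if __name__ == "__main__":
    sentence = "Colonel John Smith commands the battalion."
    for line in to_brat_entities(sentence, [(0, 7, "Rank"), (8, 18, "Person")]):
        print(line)
```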

RE Documentation

  Before running any of the following three methods, run the dependency parser first, since some of the methods rely on its output.
  Usage: Go to the jPTDP directory and run
    $ python fast_parse.py <path_to_txt>.txt
  The output is placed alongside the input text file, in a directory with the same name as the text file.

--- METHOD 1: nearest person:
    Assign each non-person named entity to the nearest person that is behind it.

      1. To extract relations from a single text file:
        (extracted relations will be appended to the .ann file)
        $ python relation_np.py <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>:
        Set "output_dir" in pipeline.sh to <directory> and run:
        $ source pipeline.sh
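
The nearest-person heuristic can be sketched as follows. This is not the code in relation_np.py: it assumes that "behind" means the person that follows the entity in the text (flip the comparison if the opposite reading is intended), and the function name, tuple layout and labels are hypothetical.

```python
def assign_to_nearest_following_person(entities):
    """Link each non-person entity to the closest person that starts after it.

    entities: list of (start, end, label, text) tuples with character offsets.
    Returns a list of (entity_text, label, person_text) triples.
    """
    persons = sorted(e for e in entities if e[2] == "Person")
    relations = []
    for start, end, label, text in entities:
        if label == "Person":
            continue
        # Assumption: only persons that begin at or after the entity's end count.
        following = [p for p in persons if p[0] >= end]
        if following:
            nearest = min(following, key=lambda p: p[0] - end)
            relations.append((text, label, nearest[3]))
    return relations
```

Entities with no person after them are left unassigned in this sketch.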

--- METHOD 2: dependency parsing
    Assign each non-person named entity to the closest person, where distance is the length of the dependency path between the named entity and the person.
    Constraint: Choosing only between the two persons that appear immediately to the left and right of the entity could improve the results, but the drawbacks are also obvious.

      1. To extract relations from a single text file:
        (extracted relations will be appended to the .ann file)
        $ python relation_dep.py <jPTDP_buffer_path> <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>:
        $ source pipeline.sh <directory>
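
The dependency-path distance can be computed with a breadth-first search over the parse tree. The sketch below is an illustration, not the code in relation_dep.py: it assumes 0-based token indices with -1 marking the root, which may differ from the parser's output format, and both function names are hypothetical.

```python
from collections import deque


def dependency_distance(heads, source, target):
    """Return the number of edges between two tokens in a dependency tree.

    heads[i] is the index of token i's head, or -1 for the root.
    """
    # Build an undirected adjacency list from the head pointers.
    neighbours = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbours[child].add(head)
            neighbours[head].add(child)
    # Breadth-first search finds the shortest path in an unweighted tree.
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two tokens are not connected


def closest_person(heads, entity_index, person_indices):
    """Pick the person token with the shortest dependency path to the entity."""
    candidates = []
    for person in person_indices:
        dist = dependency_distance(heads, entity_index, person)
        if dist is not None:
            candidates.append((dist, person))
    return min(candidates)[1] if candidates else None
```

For the tokens "General Smith met Khan" with heads `[1, 2, -1, 2]`, the entity "General" (index 0) is one edge from "Smith" and three edges from "Khan", so "Smith" is selected.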

--- METHOD 3: neural networks
    Use the dependency path and its length as features to predict which person in the sentence is the best option.
    The best model is saved in "model_86.h5"

      Predictions are made on the files in "pred_path" and are written in place. "pred_path" can be set in config.py.
      $ python pred.py