extract-sfm

Knowledge Graph Extraction for SFM dataset


License
MIT
Install
pip install extract-sfm==2.0

Documentation

Knowledge Graph Extraction

We designed a pipeline that extracts a special kind of knowledge graph: it recognizes a person's name and relates that person's rank, role, title and organization to the name. The pipeline is not expected to be perfect, that is, to recognize every relevant person and exclude every irrelevant one. Rather, it is a first step that reduces the workload of manually extracting such knowledge by combing through a large number of documents.

This pipeline consists of two major components: Named Entity Recognition and Relation Extraction. Named Entity Recognition uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles and organizations in raw text files. Relation Extraction then relates each name to its corresponding rank, role, title or organization.

Example: Example

Dependencies

TensorFlow 2.2.0
spaCy
NumPy

Install

Package: https://pypi.org/project/extract-sfm/

$ pip install extract_sfm

Usage

Method 1

Create a Python file and write:

import extract_sfm

extract_sfm.extract("/PATH/TO/DIRECTORY/OF/INPUT/FILES")

Then run the Python file. This may take a while to finish.
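
A slightly more defensive variant of the script above, as a sketch: it resolves the input directory to an absolute path and checks that it exists before calling extract_sfm.extract. The helper names resolve_input_dir and run_extraction are illustrative additions and are not part of the package.

```python
from pathlib import Path


def resolve_input_dir(raw_path: str) -> str:
    """Return an absolute path to the input directory, or raise if it is missing."""
    path = Path(raw_path).expanduser().resolve()
    if not path.is_dir():
        raise NotADirectoryError(f"Input directory not found: {path}")
    return str(path)


def run_extraction(raw_path: str) -> None:
    """Hypothetical wrapper: validate the path, then hand it to extract_sfm."""
    input_dir = resolve_input_dir(raw_path)
    # Imported here so that a bad path is reported before the package is loaded.
    import extract_sfm

    extract_sfm.extract(input_dir)
```

To use it, call run_extraction("/PATH/TO/DIRECTORY/OF/INPUT/FILES") at the bottom of the file.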

Method 2

Download this GitHub repository. Under the project root directory, run the Python script:

$ python pipeline.py /PATH/TO/DIRECTORY/OF/INPUT/FILES

Note: Use an absolute path.

Website

  1. Copy NER_v2, RE, pipeline.py into the "SERVER/KGE" directory
  2. Install npm dependencies under the "SERVER" directory: express, path, multer
  $ npm install <package name>
  3. Run the server by typing in:
  $ node server.js

Note: The path to the website's root directory must not contain a space character.

Example

NER Documentation

TRAINING
  Dataset:
    1. SFM starter dataset: https://github.com/security-force-monitor/nlp_starter_dataset
    2. CONLL2003: https://github.com/guillaumegenthial/tf_ner/tree/master/data/example
    3. A set of known organizations from the starter dataset
    Note: Title and role were collapsed into one class

  Usage:
    1) Prepare data
      $ python process.py
      $ cd SFM_STARTER
      $ python build_vocab.py
      $ python build_glove.py
      $ cd ..

    2) Train model
      $ python train.py

    3) Make predictions
      $ python pred.py

    4) Evaluate model
      $ python eval.py
      $ python eval_class.py
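
For readers unfamiliar with per-class metrics of the kind eval_class.py reports, the following is a minimal sketch of how precision, recall and F1 can be computed for each entity class. It is not the code in eval_class.py: the (start, end, label) tuples, the label names and the function name per_class_scores are assumptions made for illustration, and the sketch counts a prediction as correct only on an exact span and label match.

```python
from collections import defaultdict


def per_class_scores(gold, pred):
    """Compute precision, recall and F1 per label from (start, end, label) tuples."""
    gold, pred = set(gold), set(pred)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for ent in pred:
        # A predicted entity is a true positive only if it matches gold exactly.
        counts[ent[2]]["tp" if ent in gold else "fp"] += 1
    for ent in gold - pred:
        counts[ent[2]]["fn"] += 1

    scores = {}
    for label, c in counts.items():
        predicted = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        p = c["tp"] / predicted if predicted else 0.0
        r = c["tp"] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores


# Hypothetical gold and predicted entities for one document.
gold = [(0, 5, "Person"), (10, 17, "Rank"), (20, 30, "Organization")]
pred = [(0, 5, "Person"), (10, 17, "Organization"), (20, 30, "Organization")]
scores = per_class_scores(gold, pred)
```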

  Files:
    process.py: 1) preprocesses the dataset by recording info in dicts,
                      which are saved in two pickle files: dataset_labels.pickle, dataset_sentences.pickle
                2) converts the SFM starter dataset to a format that can be used by the model,
                      stored in the files {}.words.txt and {}.tags.txt, where {} is train, valid or test.
    pred.py: generates predictions using the trained model
    eval.py: evaluates the predictions made by the model, which are generated by running pred.py
    eval_class.py: computes precision, recall and F1 score for each class

    Other files are from https://github.com/guillaumegenthial/tf_ner
      train.py, tf_metrics.py, SFM_STARTER/build_vocab.py, SFM_STARTER/build_glove.py
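
The {}.words.txt / {}.tags.txt layout expected by the tf_ner training code is one sentence per line, with words (or tags) separated by single spaces and the two files kept line-aligned. The sketch below writes such a pair of files. It is not the code in process.py: the function name write_split, the example tag names and the output directory are assumptions for illustration.

```python
from pathlib import Path


def write_split(sentences, out_dir, split):
    """Write line-aligned <split>.words.txt and <split>.tags.txt files.

    sentences: list of sentences, each a list of (word, tag) pairs.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    words_path = out_dir / f"{split}.words.txt"
    tags_path = out_dir / f"{split}.tags.txt"
    with open(words_path, "w", encoding="utf-8") as fw, \
            open(tags_path, "w", encoding="utf-8") as ft:
        for sentence in sentences:
            if not sentence:
                continue  # skip empty sentences so the files stay aligned
            words, tags = zip(*sentence)
            fw.write(" ".join(words) + "\n")
            ft.write(" ".join(tags) + "\n")
    return words_path, tags_path


# Hypothetical tagged sentence; the tag names are illustrative only.
example = [[("General", "B-RANK"), ("John", "B-PER"), ("Doe", "I-PER")]]
```

Calling write_split(example, "SFM_STARTER", "train") would then produce train.words.txt and train.tags.txt in that directory.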

PREDICTING
  Usage:
    $ python ner.py <doc_id>.txt

  File:
    ner.py: generates BRAT-format predictions for a text file.
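
BRAT standoff annotations are stored in an .ann file in which each text-bound entity is a tab-separated line of the form "ID<TAB>Type start end<TAB>text". As a rough sketch of how such output could be read back, the parser below collects the entity lines and ignores everything else. The class and function names, the label names and the relation line in the sample are assumptions for illustration and may not match what ner.py emits.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_entities(ann_text):
    """Parse text-bound ('T') annotations from BRAT standoff text."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and blank lines
        ann_id, type_and_span, text = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans are written "s1 e1;s2 e2"; keep the outer bounds.
        fragments = [frag.split() for frag in span.split(";")]
        start, end = int(fragments[0][0]), int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


# Hypothetical .ann content for the text "General John Doe".
sample = (
    "T1\tRank 0 7\tGeneral\n"
    "T2\tPerson 8 16\tJohn Doe\n"
    "R1\tis_rank Arg1:T2 Arg2:T1\n"
)
entities = parse_brat_entities(sample)
```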

RE Documentation

jPTDP:
  Before running any of the following three methods, you need to run a dependency parser, which some of the methods rely on.
  Usage: Go to the jPTDP directory and run
    $ python fast_parse.py <path_to_txt>.txt
  The output is placed alongside the input text file, in a directory with the same name as the text file.



--- METHOD 1: nearest person:
    Assign each non-person named entity to the nearest person that is behind the named entity.

    Usage:
      1. To extract relations from a single text file:
        (extracted relations will be appended to the .ann file)
        $ python relation_np.py <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>:
        Set "output_dir" in pipeline.sh to <directory> and run:
        $ source pipeline.sh
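
The nearest-person heuristic of Method 1 can be sketched as follows. This is not the code in relation_np.py. It assumes that "behind" means the person appears after the entity in the text, that entities are given as (id, label, start, end) tuples with character offsets, and that persons carry the label "Person"; all of these are illustrative assumptions.

```python
def assign_to_nearest_person(entities, person_label="Person"):
    """Link each non-person entity to the closest person that starts after it.

    entities: list of (entity_id, label, start, end) tuples from one document.
    Returns a list of (entity_id, person_id) pairs.
    """
    persons = [e for e in entities if e[1] == person_label]
    links = []
    for ent_id, label, start, end in entities:
        if label == person_label:
            continue
        # Assumption: "behind" is read as "appearing after the entity".
        following = [p for p in persons if p[2] >= end]
        if following:
            nearest = min(following, key=lambda p: p[2] - end)
            links.append((ent_id, nearest[0]))
    return links


# Hypothetical entities for:
# "General John Doe of the 5th Brigade met Colonel Jane Roe"
entities = [
    ("T1", "Rank", 0, 7),
    ("T2", "Person", 8, 16),
    ("T3", "Organization", 24, 35),
    ("T4", "Rank", 40, 47),
    ("T5", "Person", 48, 56),
]
links = assign_to_nearest_person(entities)
```

Under these assumptions the organization T3 is linked to T5 rather than T2, which illustrates the kind of error a purely positional heuristic can make.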



--- METHOD 2: dependency parsing
    Assign each non-person named entity to the closest person, where distance is the length of the dependency path between the named entity and the person.
    Constraint: Choosing only between the two persons that appear immediately to the left and right of the entity can improve the results, but it has an obvious drawback: a person farther away can never be selected.

    Usage:
      1. To extract relations from a single text file:
        (extracted relations will be appended to the .ann file)
        $ python relation_dep.py <jPTDP_buffer_path> <doc_id>.txt <doc_id>.ann
      2. To generate annotations for a set of text files under <directory>:
        $ source pipeline.sh <directory>
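
The distance used in Method 2 can be illustrated with a small sketch that counts the edges between two tokens in a dependency tree. This is not the code in relation_dep.py. It assumes that each entity is represented by a single token, that the parse is available as a dict mapping 1-based token indices to head indices with 0 marking the root, and the function names are hypothetical.

```python
def path_to_root(token, heads):
    """Return the chain token -> head -> ... -> root token."""
    chain = [token]
    while heads[token] != 0:
        token = heads[token]
        chain.append(token)
    return chain


def dependency_distance(a, b, heads):
    """Number of edges on the tree path between tokens a and b."""
    depth_from_a = {tok: d for d, tok in enumerate(path_to_root(a, heads))}
    for steps_from_b, tok in enumerate(path_to_root(b, heads)):
        if tok in depth_from_a:  # first shared ancestor on the way up
            return depth_from_a[tok] + steps_from_b
    raise ValueError("tokens are not in the same tree")


def closest_person(entity_token, person_tokens, heads):
    """Pick the person token with the shortest dependency path to the entity."""
    return min(person_tokens,
               key=lambda p: dependency_distance(entity_token, p, heads))


# Hypothetical parse of "General Smith met Colonel Jones" (tokens 1..5):
# General -> Smith, Smith -> met, met -> root, Colonel -> Jones, Jones -> met
heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}
```

With this parse, closest_person(1, [2, 5], heads) selects token 2 (Smith) for "General" and closest_person(4, [2, 5], heads) selects token 5 (Jones) for "Colonel".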



--- METHOD 3: neural networks
    Use the dependency path and its length as features to predict which person in the sentence is the best match
    The best model is saved in "model_86.h5"

    Usage:
      Predictions are made on the files in "pred_path" and are written in place; "pred_path" can be set in config.py
      $ python pred.py