cyberspacy

spaCy pipeline component for adding cybersecurity metadata to Doc, Token and Span objects.


License
MIT
Install
pip install cyberspacy==1.1.1

Documentation

cyberspacy: cybersecurity datatypes for spaCy

spaCy v2.0 extension and pipeline component for tagging IP addresses, email addresses, URLs, and Windows command lines. Heavily inspired by spacymoji.

Installation

cyberspacy requires spaCy v2.0.0 or higher.

pip

pip install cyberspacy

Parsing Windows command lines

You can use cyberspacy to tokenize, tag, and normalize Windows command lines from endpoint telemetry.

from cyberspacy import WindowsCommandlineProcessor

processor = WindowsCommandlineProcessor()
cmd_line = r'"C:\Program Files\MyProgram.exe" /d C:\Users\Alice\file.txt --file C:\test.py'

assert processor.get_args(cmd_line) == ["/d", "--file"]
assert processor.get_paths(cmd_line) == ['"C:\\Program Files\\MyProgram.exe"', 'C:\\Users\\Alice\\file.txt', 'C:\\test.py']
assert processor.get_normalized_paths(cmd_line) == ['"?pf64\\myprogram.exe"', '?usr\\file.txt', '?c\\test.py']
assert processor.normalize(cmd_line) == '"?pf64\\myprogram.exe" /d ?usr\\file.txt --file ?c\\test.py'
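
To make the idea of path normalization concrete, here is a minimal standalone sketch that reproduces the three substitutions visible in the example output above. It is not the library's implementation: the rule list, the function name normalize_path, and the reading of the ?pf64, ?usr and ?c placeholders are assumptions inferred only from that output.

```python
import re

# Hypothetical placeholder rules, inferred only from the example output above.
# Order matters: the most specific prefix has to be tried first.
PLACEHOLDER_RULES = [
    (re.compile(r"^c:\\program files\\"), "?pf64\\"),
    (re.compile(r"^c:\\users\\[^\\]+\\"), "?usr\\"),
    (re.compile(r"^c:\\"), "?c\\"),
]


def normalize_path(path: str) -> str:
    """Lowercase a Windows path and swap a known prefix for a placeholder."""
    # Remember surrounding double quotes so they can be restored afterwards.
    quoted = len(path) >= 2 and path[0] == path[-1] == '"'
    body = (path[1:-1] if quoted else path).lower()
    for pattern, placeholder in PLACEHOLDER_RULES:
        match = pattern.match(body)
        if match:
            # Slice rather than use re.sub, so the trailing backslash in the
            # placeholder is never treated as a replacement escape.
            body = placeholder + body[match.end():]
            break
    return f'"{body}"' if quoted else body
```

Under these assumed rules, normalize_path(r'C:\Users\Alice\file.txt') returns the string ?usr\file.txt, and anything that does not start with a known prefix is only lowercased.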

Tagging documents

Import the component and initialise it with the shared nlp object (i.e. an instance of Language). The component uses the shared vocab to initialise its PhraseMatcher and to create the match patterns. Then add the component anywhere in your pipeline.

from spacy.lang.en import English
from cyberspacy import IPTagger

# A blank English pipeline is enough: the tagger only looks at token text.
nlp = English()
ip_tagger = IPTagger(nlp)
# spaCy v2 style: the component object itself is passed to add_pipe.
nlp.add_pipe(ip_tagger, first=True)

doc = nlp(u'This is a sentence which contains 2.3.4.5 as an IP address')
assert doc._.has_ipv4
assert not doc[0]._.is_ipv4
assert doc[6]._.is_ipv4
assert len(doc._.ipv4) == 1
# Doc._.ipv4 holds (index, token) tuples.
idx, ipv4_token = doc._.ipv4[0]
assert idx == 6
assert ipv4_token.text == '2.3.4.5'

cyberspacy only cares about the token text, so you can use it on a blank Language instance (it should work for all available languages!), or in a pipeline with a loaded model.

Available attributes

The extension sets attributes on the Doc, Span and Token. You can change the attribute names on initialisation of the extension. For more details on custom components and attributes, see the processing pipelines documentation.

The attributes provided by the IPTagger class are:

Token._.is_ipv4 bool Whether the token is an IPv4 address.
Doc._.has_ipv4 bool Whether the document contains an IPv4 address.
Doc._.ipv4 list (index, token) tuples of the document's IPv4 addresses.
Span._.has_ipv4 bool Whether the span contains IPv4 addresses.
Span._.ipv4 list (index, token) tuples of the span's IPv4 addresses.
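
The relationship between the is_, has_ and list attributes can be illustrated without spaCy. The sketch below is an assumption-laden stand-in, not how IPTagger works internally: it splits on whitespace instead of using spaCy's tokenizer, validates tokens with the standard-library ipaddress module, and the helper names is_ipv4 and find_ipv4 are made up for this example.

```python
import ipaddress


def is_ipv4(text: str) -> bool:
    """Return True if the text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        return False
    return True


def find_ipv4(tokens):
    """Return (index, token) tuples for every token that is an IPv4 address."""
    return [(i, tok) for i, tok in enumerate(tokens) if is_ipv4(tok)]


# Whitespace splitting stands in for real tokenization (an assumption).
tokens = "This is a sentence which contains 2.3.4.5 as an IP address".split()
matches = find_ipv4(tokens)   # plays the role of Doc._.ipv4
has_ipv4 = bool(matches)      # plays the role of Doc._.has_ipv4
```

Here the has_ flag is derived from the list being non-empty, and each list entry pairs a token with its position, mirroring the (index, token) shape described in the table above.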

The attributes provided by the URLTagger class are:

Token._.is_url bool Whether the token is a URL.
Doc._.has_url bool Whether the document contains a URL.
Doc._.url list (index, token) tuples of the document's URLs.
Span._.has_url bool Whether the span contains a URL.
Span._.url list (index, token) tuples of the span's URLs.

The attributes provided by the EmailTagger class are:

Token._.is_email_addr bool Whether the token is an email address.
Doc._.has_email_addr bool Whether the document contains an email address.
Doc._.email_addr list (index, token) tuples of the document's email addresses.
Span._.has_email_addr bool Whether the span contains an email address.
Span._.email_addr list (index, token) tuples of the span's email addresses.

The attributes provided by the CommandLineTagger class are:

Token._.is_path bool Whether the token is a path.
Token._.is_arg bool Whether the token is an argument/flag.
Token._.is_val bool Whether the token is a value for an argument.
Token._.is_cmd bool Whether the token is a nested command.
Doc._.normalize str A normalized version of the command line.
Doc._.has_path bool Whether the document contains a path.
Doc._.path list (index, token) tuples of the document's paths.
Doc._.has_arg bool Whether the document contains an argument/flag.
Doc._.arg list (index, token) tuples of the document's args.
Doc._.has_val bool Whether the document contains a value for an argument.
Doc._.val list (index, token) tuples of the document's values.
Doc._.has_cmd bool Whether the document contains a nested command.
Doc._.cmd list (index, token) tuples of the document's subcommands.
Span._.has_path bool Whether the span contains a path.
Span._.path list (index, token) tuples of the span's paths.
Span._.has_arg bool Whether the span contains an argument/flag.
Span._.arg list (index, token) tuples of the span's args.
Span._.has_val bool Whether the span contains a value for an argument.
Span._.val list (index, token) tuples of the span's values.
Span._.has_cmd bool Whether the span contains a nested command.
Span._.cmd list (index, token) tuples of the span's subcommands.
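
As a rough illustration of what path and argument tagging involves, the following sketch classifies the tokens of a command line with two regular expressions and collects (index, token) tuples per category. The patterns, the classify function, and the whitespace-plus-quotes tokenization are all hypothetical choices for this example; the patterns CommandLineTagger actually uses may differ, and values and nested commands are left out.

```python
import re

# Hypothetical rules for this sketch only; the library's patterns may differ.
TOKEN_RE = re.compile(r'"[^"]*"|\S+')             # keep quoted strings whole
PATH_RE = re.compile(r'^"?[A-Za-z]:\\')           # drive-letter paths, maybe quoted
ARG_RE = re.compile(r'^(--?|/)[A-Za-z][\w-]*$')   # -x, --long-flag, /d


def classify(cmd_line: str):
    """Split a command line and collect (index, token) tuples per category."""
    tokens = TOKEN_RE.findall(cmd_line)
    found = {"path": [], "arg": []}
    for i, tok in enumerate(tokens):
        if PATH_RE.match(tok):
            found["path"].append((i, tok))
        elif ARG_RE.match(tok):
            found["arg"].append((i, tok))
    return found


cmd_line = r'"C:\Program Files\MyProgram.exe" /d C:\Users\Alice\file.txt --file C:\test.py'
result = classify(cmd_line)
has_path = bool(result["path"])
has_arg = bool(result["arg"])
```

With these assumed rules, the two flags land at indices 1 and 3 and the three paths at indices 0, 2 and 4.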