cyberspacy: cybersecurity datatypes for spaCy
spaCy v2.0 extension and pipeline component for tagging IPs, email addresses, URLs, and Windows commandlines. Heavily inspired by spacymoji.
Installation
cyberspacy
requires spacy
v2.0.0 or higher.
pip
pip install cyberspacy
Parsing Windows commandlines
You can use cyberspacy to tokenize, tag, and normalize Windows command lines from endpoint telemetry.
from cyberspacy import WindowsCommandlineProcessor
processor = WindowsCommandlineProcessor()
cmd_line = r'"C:\Program Files\MyProgram.exe" /d C:\Users\Alice\file.txt --file C:\test.py'
assert processor.get_args(cmd_line) == ["/d", "--file"]
assert processor.get_paths(cmd_line) == ['"C:\\Program Files\\MyProgram.exe"', 'C:\\Users\\Alice\\file.txt', 'C:\\test.py']
assert processor.get_normalized_paths(cmd_line) == ['"?pf64\\myprogram.exe"', '?usr\\file.txt', '?c\\test.py']
assert processor.normalize(cmd_line) == '"?pf64\\myprogram.exe" /d ?usr\\file.txt --file ?c\\test.py'
Tagging documents
Import the component and initialise it with the shared nlp
object (i.e. an
instance of Language
), which is used to initialise the PhraseMatcher
with the shared vocab, and create the match patterns. Then add the component
anywhere in your pipeline.
import spacy
from spacy.lang.en import English
from cyberspacy import IPTagger
nlp = English()
ip_Tagger = IPTagger(nlp)
nlp.add_pipe(ip_Tagger, first=True)
doc = nlp(u'This is a sentence which contains 2.3.4.5 as an IP address')
assert doc._.has_ipv4 == True
assert doc[0]._.is_ipv4 == False
assert doc[6]._.is_ipv4 == True
assert len(doc._.ipv4) == 1
idx, ipv4_token = doc._.ipv4[0]
assert idx == 6
assert ipv4_token.text == '2.3.4.5'
cyberspacy
only cares about the token text, so you can use it on a blank
Language
instance (it should work for all
available languages!), or in
a pipeline with a loaded model.
Available attributes
The extension sets attributes on the Doc
, Span
and Token
. You can
change the attribute names on initialisation of the extension. For more details
on custom components and attributes, see the
processing pipelines documentation.
The attributes provided by the IPTagger class are:
Token._.is_ipv4 |
bool | Whether the token is an IPv4 address. |
Doc._.has_ipv4 |
bool | Whether the document contains an IPv4 address. |
Doc._.ipv4 |
list |
(index, token) tuples of the document's IPv4 addresses. |
Span._.has_ipv4 |
bool | Whether the span contains IPv4 addresses. |
Span._.ipv4 |
list |
(index, token) tuples of the span's IPv4 addresses. |
The attributes provided by the URLTagger class are:
Token._.is_url |
bool | Whether the token is a URL. |
Doc._.has_url |
bool | Whether the document contains a URL. |
Doc._.url |
list |
(index, token) tuples of the document's URLs. |
Span._.has_url |
bool | Whether the span contains a URL. |
Span._.url |
list |
(index, token) tuples of the span's URLs. |
The attributes provided by the EmailTagger class are:
Token._.is_email_addr |
bool | Whether the token is an email address. |
Doc._.has_email_addr |
bool | Whether the document contains an email address. |
Doc._.email_addr |
list |
(index, token) tuples of the document's email addresses. |
Span._.has_email_addr |
bool | Whether the span contains an email address. |
Span._.email_addr |
list |
(index, token) tuples of the span's email addresses. |
The attributes provided by the CommandLineTagger class are:
Token._.is_path |
bool | Whether the token is a path. |
Token._.is_arg |
bool | Whether the token is an argument/flag. |
Token._.is_val |
bool | Whether the token is a value for an argument. |
Token._.is_cmd |
bool | Whether the token is a nested command. |
Doc._.normalize |
str | Returns a normalized version of the commandline |
Doc._.has_path |
bool | Whether the document contains a path. |
Doc._.path |
list |
(index, token) tuples of the document's paths. |
Doc._.has_arg |
bool | Whether the document contains an argument/flag. |
Doc._.arg |
list |
(index, token) tuples of the document's args. |
Doc._.has_val |
bool | Whether the document contains a value for an argument. |
Doc._.val |
list |
(index, token) tuples of the document's values. |
Doc._.has_cmd |
bool | Whether the document contains a nested command. |
Doc._.cmd |
list |
(index, token) tuples of the document's subcommands. |
Span._.has_path |
bool | Whether the span contains a path. |
Span._.path |
list |
(index, token) tuples of the span's paths. |
Span._.has_arg |
bool | Whether the span contains an argument/flag. |
Span._.arg |
list |
(index, token) tuples of the span's args. |
Span._.has_val |
bool | Whether the span contains a value for an argument. |
Span._.val |
list |
(index, token) tuples of the span's values. |
Span._.has_cmd |
bool | Whether the span contains a nested command. |
Span._.cmd |
list |
(index, token) tuples of the span's subcommands. |