UDPipe Parser
This parser takes a sentence, performs syntactic analysis with a UDPipe model, and returns a structure that is easy to use in common NLP/NLU tasks.
Quickstart
import udpipe_parser
P = udpipe_parser.UDPipe_Parser()
# Input: "I want financial aid! How do I get it?"
exps = P.run("Я хочу материальную помощь! Как её получить?", solve_anaphora=True, logging=False)
for exp in exps:
    print(exp, '\n')
subj : я,
pred : хотеть,
obj : помощь материальный,
params :

subj :
pred : получить,
obj : помощь материальный,
params : как,
exps[0].obj
[{'pos': 'NOUN', 'polarity': 'affirmative', 'case': 'Acc', 'numb': 'Sing', 'form': 'помощь материальный', 'prep': None, 'dets': []}]
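Because a slot is a list of attribute dictionaries (as the exps[0].obj output shows), a small helper can flatten an expression into plain strings. The sketch below is an illustration and not part of udpipe_parser: slot_forms and as_triple are hypothetical names, and it assumes that subj and pred have the same list-of-dicts shape as obj. A SimpleNamespace stands in for a real Expression so that the snippet runs without the model.

```python
from types import SimpleNamespace


def slot_forms(items):
    """Return the 'form' string of every item in one slot."""
    return [item["form"] for item in items]


def as_triple(exp):
    """Collapse an expression into (subject, predicate, object) strings."""
    return tuple(
        " ".join(slot_forms(getattr(exp, name)))
        for name in ("subj", "pred", "obj")
    )


# Stand-in that mirrors the first expression of the Quickstart output
# (assumption: every slot is a list of dicts with a 'form' key).
exp = SimpleNamespace(
    subj=[{"form": "я"}],
    pred=[{"form": "хотеть"}],
    obj=[{"form": "помощь материальный"}],
    params=[],
)
print(as_triple(exp))
```

With a real parser instance, the same helper could be applied to each element of the list returned by P.run, provided the assumption about the slot shape holds.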
Metrics
To evaluate the accuracy of the parser, a dataset of questions was collected and annotated with predicates, subjects, objects, and parameters. Examples:
Documentation
Expression
Expression is a class that represents the result of an analysis. It contains four items, each of which has attributes.
class Expression:
    def __init__(self,subj=[],pred=[],obj=[],parameters=[]):
        self.subj = [] # SUBJECT
        self.pred = [] # PREDICATE
        self.obj = [] # OBJECT
        self.params = [] # PARAMETERS
PREDICATE - the action or property of the subject
SUBJECT - the person or thing performing the action expressed by the predicate
OBJECT - the person or thing that receives the action expressed by the predicate
PARAMETERS - specifications of the predicate (e.g. быстро "quickly", в спешке "in a hurry", не выходя из дома "without leaving home")
Each item can have some of these attributes:
attributes:
'pos' : Part of speech
'polarity': Affirmative/Negative
'numb': Number
'case': Case
'tense': Tense
'voice': Active/Passive
'aspect': Imperfective/Perfective
'form': Word form
'prep': Preposition attached to the item
'dets': Determiner
'modality' : Modal Verb
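Since an item carries only some of these attributes, code that inspects items should not assume that a key is present. The following sketch shows one way to filter items by attribute values; select is a hypothetical helper, not a function of udpipe_parser, and the sample item is copied from the Quickstart output.

```python
def select(items, **wanted):
    """Keep the items whose attributes equal every requested value.

    A missing attribute is read as None, so it never matches a
    non-None requested value.
    """
    return [
        item for item in items
        if all(item.get(key) == value for key, value in wanted.items())
    ]


objects = [
    {"pos": "NOUN", "polarity": "affirmative", "case": "Acc", "numb": "Sing",
     "form": "помощь материальный", "prep": None, "dets": []},
]

accusative_nouns = select(objects, pos="NOUN", case="Acc")  # one match
negated = select(objects, polarity="negative")              # no match
past_tense = select(objects, tense="Past")                  # 'tense' is absent
print(len(accusative_nouns), len(negated), len(past_tense))
```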
UDPipe_Parser
An instance of this class performs the analysis of text data. It takes a text, preprocesses it, builds CoNLL-U trees, resolves anaphora (if requested), and parses the trees. The result is a list of Expression instances, which represent trees or subtrees. The constructor also accepts a list of words or expressions that should be treated as proper names (and can therefore be subjects or objects). For more stable performance, it is advisable to fill this list with out-of-dictionary words such as acronyms, jargon, and slang.
my_propn_nouns = ["рувд","мгу","ростех","роснефть","Π²ΡΡΠΈ"]
P = udpipe_parser.UDPipe_Parser(propn_nouns=my_propn_nouns)
An abbreviation dictionary can also be supplied to expand abbreviations into full word forms:
abbrev_dict = {'абс.' : 'абсолютный', 'град.':'градус'}
P = udpipe_parser.UDPipe_Parser(abbrev_dict=abbrev_dict)
# Input: "what is the abs. value in deg. Celsius?"
exps = P.run('какое абс. значение в град. Цельсия?')
print(exps[0])
subj : значение абсолютный в градус цельсий,
pred : быть,
obj :
params :
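The expansion step can be pictured as a plain text substitution that runs before parsing. The sketch below is not the library's implementation, which is not shown here; expand_abbreviations is a hypothetical name, and the case-insensitive matching and the word-boundary check are assumptions made for the illustration.

```python
import re


def expand_abbreviations(text, abbrev_dict):
    """Replace every known abbreviation in text with its full form."""
    if not abbrev_dict:
        return text
    lowered = {abbr.lower(): full for abbr, full in abbrev_dict.items()}
    # Longest keys first, so a longer abbreviation wins over its own prefix.
    alternatives = "|".join(
        re.escape(abbr) for abbr in sorted(lowered, key=len, reverse=True)
    )
    # (?<!\w) keeps a match from starting in the middle of a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + ")", flags=re.IGNORECASE)
    return pattern.sub(lambda match: lowered[match.group(0).lower()], text)


abbrev_dict = {"абс.": "абсолютный", "град.": "градус"}
print(expand_abbreviations("какое абс. значение в град. Цельсия?", abbrev_dict))
```

The replacement inserts the dictionary value as written, without agreeing it with the surrounding words, which matches the lemmatized look of the parser output above.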
Solving Anaphora
The parser can optionally try to resolve anaphora:
>>> P = udpipe_parser.UDPipe_Parser()
>>> # Input: "I ordered a credit card. Where can I pick it up?"
>>> exps = P.run('Я заказывал кредитку. Где я могу её забрать?', solve_anaphora=True)
>>> for exp in exps:
...     print(exp)
...
subj : я,
pred : заказывать,
obj : кредитка,
params :
subj : я,
pred : забрать,
obj : кредитка,
params : где,
Logging
To trace the parsing process, set the 'logging' flag:
P = udpipe_parser.UDPipe_Parser()
# Input: "Mathematics mediates between spirit and matter."
exps = P.run('Между духом и материей посредничает математика.', logging=True)
sent: Между духом и материей посредничает математика .
child of посредничает : духом {'id': 2, 'form': 'духом', 'lemma': 'дух', 'upos': 'NOUN', 'xpos': None, 'feats': {'Animacy': 'Inan', 'Case': 'Ins', 'Gender': 'Masc', 'Number': 'Sing'}, 'head': 5, 'deprel': 'obl', 'deps': None, 'misc': None}
child of посредничает : математика {'id': 6, 'form': 'математика', 'lemma': 'математика', 'upos': 'NOUN', 'xpos': None, 'feats': {'Animacy': 'Inan', 'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}, 'head': 5, 'deprel': 'nsubj', 'deps': None, 'misc': None}
child of посредничает : . {'id': 7, 'form': '.', 'lemma': '.', 'upos': 'PUNCT', 'xpos': None, 'feats': None, 'head': 5, 'deprel': 'punct', 'deps': None, 'misc': {'SpaceAfter': 'No'}}
[UDPipe Parser] elapsed time: 0.4530816078186035
print(exps[0])
subj : математика,
pred : посредничать,
obj :
params : между дух, материя,
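The log makes it possible to see how the children of a predicate end up in Expression slots: the nsubj child becomes the subject, the obl child lands in params, and punctuation is dropped. The sketch below only illustrates that idea. ROLE_BY_DEPREL and group_children are hypothetical, the mapping is inferred from the examples in this README plus standard Universal Dependencies relation names, and the parser's real rules may differ.

```python
# Assumed mapping from Universal Dependencies relations to Expression slots.
ROLE_BY_DEPREL = {
    "nsubj": "subj",
    "obj": "obj",
    "iobj": "obj",
    "obl": "params",
    "advmod": "params",
}


def group_children(children):
    """Group the lemmas of a predicate's children by their assumed slot."""
    slots = {"subj": [], "obj": [], "params": []}
    for token in children:
        role = ROLE_BY_DEPREL.get(token["deprel"])
        if role is not None:  # relations such as 'punct' are ignored
            slots[role].append(token["lemma"])
    return slots


# Children of the predicate, abridged from the log output.
children = [
    {"id": 2, "form": "духом", "lemma": "дух", "upos": "NOUN", "head": 5, "deprel": "obl"},
    {"id": 6, "form": "математика", "lemma": "математика", "upos": "NOUN", "head": 5, "deprel": "nsubj"},
    {"id": 7, "form": ".", "lemma": ".", "upos": "PUNCT", "head": 5, "deprel": "punct"},
]
print(group_children(children))
```

Unlike the real output, this sketch does not attach the preposition or the coordinated noun to the parameter, because it looks only at the direct children of the predicate.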