udpipe-parser

A UDPipe-based parser that converts Universal Dependencies trees into a more practical form.


Keywords
parser, Universal, Dependencies, NLP, russian, syntax
License
MIT
Install
pip install udpipe-parser==0.4

UDPipe Parser

This parser takes a sentence, performs syntactic analysis with a UDPipe model and returns a structure that is easy to use in common NLP/NLU tasks.

Quickstart

import udpipe_parser
P = udpipe_parser.UDPipe_Parser()
# English gloss of the input: "I want financial assistance! How do I get it?"
exps = P.run("Я хочу материальную помощь! Как её получить?", solve_anaphora=True, logging=False)
for exp in exps:
    print(exp, '\n')

subj : я,
pred : хотеть,
obj : помощь материальный,
params :

subj :
pred : получить,
obj : помощь материальный,
params : как,

exps[0].obj
[{'pos': 'NOUN', 'polarity': 'affirmative', 'case': 'Acc', 'numb': 'Sing', 'form': 'помощь материальный', 'prep': None, 'dets': []}]

Metrics

To evaluate the accuracy of the parser, a dataset of questions was collected and annotated with predicates, subjects, objects and parameters. Examples:

[Figure: examples from the annotated dataset]

Documentation

Expression

Expression is the class that represents the result of the analysis. It contains four items (subj, pred, obj and params), each of which has attributes.

class Expression:
  def __init__(self,subj=[],pred=[],obj=[],parameters=[]):
    self.subj = [] # SUBJECT
    self.pred = [] # PREDICATE
    self.obj = [] # OBJECT
    self.params = [] # PARAMETERS

PREDICATE - the action or property attributed to the subject

SUBJECT - the person or thing performing the action expressed by the predicate

OBJECT - the person or thing that receives the action expressed by the predicate

PARAMETERS - modifiers that specify the predicate (e.g. быстро "quickly", в спешке "in a hurry", не выходя из дома "without leaving the house")
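The four slots can be pictured with a small stand-in container. This is only a sketch: ExpressionSketch and its forms helper are hypothetical names, not part of udpipe_parser, and the sample values other than the obj form (taken from the Quickstart output) are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

Item = Dict[str, Any]  # one analysed entry, e.g. {'pos': 'NOUN', 'form': '...'}


@dataclass
class ExpressionSketch:
    """Illustrative stand-in for Expression; NOT the library's class."""
    subj: List[Item] = field(default_factory=list)    # SUBJECT entries
    pred: List[Item] = field(default_factory=list)    # PREDICATE entries
    obj: List[Item] = field(default_factory=list)     # OBJECT entries
    params: List[Item] = field(default_factory=list)  # PARAMETERS entries

    def forms(self, slot: str) -> List[str]:
        """Return the 'form' strings stored in one of the four slots."""
        return [item["form"] for item in getattr(self, slot)]


# Hypothetical sample values; only the obj form comes from the Quickstart output.
exp = ExpressionSketch(
    subj=[{"pos": "PRON", "form": "я"}],
    pred=[{"pos": "VERB", "form": "хотеть"}],
    obj=[{"pos": "NOUN", "form": "помощь материальный"}],
)
print(exp.forms("obj"))
```

Using default_factory gives every instance its own empty lists, so an expression without parameters simply has params == [].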

Each item can have some of these attributes:

'pos': Part of speech

'polarity': Affirmative/Negative

'numb': Number

'case': Case

'tense': Tense

'voice': Active/Passive

'aspect': Imperfective/Perfective

'form': Word form

'prep': Preposition attached to the item

'dets': Determiners

'modality': Modal verb
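Because each entry is a plain dictionary that carries only some of these attributes, reading them with dict.get is a safe default. The sketch below filters entries by attribute values; the select helper is a hypothetical name introduced here, and the sample entry is copied from the Quickstart output.

```python
from typing import Any, Dict, List

# Entry copied from the Quickstart output (exps[0].obj).
items: List[Dict[str, Any]] = [
    {'pos': 'NOUN', 'polarity': 'affirmative', 'case': 'Acc', 'numb': 'Sing',
     'form': 'помощь материальный', 'prep': None, 'dets': []},
]


def select(items: List[Dict[str, Any]], **wanted: Any) -> List[Dict[str, Any]]:
    """Keep entries whose attributes equal all requested values.

    An attribute that is absent from an entry is treated as None,
    since an entry carries only some of the attributes.
    """
    return [it for it in items
            if all(it.get(key) == value for key, value in wanted.items())]


accusative_nouns = select(items, pos='NOUN', case='Acc')
print([it['form'] for it in accusative_nouns])
```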

UDPipe_Parser

An instance of this class performs the analysis of text data. It takes a text, preprocesses it, builds CoNLL-U trees, resolves anaphora (if requested) and parses the trees. The result is a list of Expression instances that represent trees or subtrees. The constructor accepts a list of words or expressions (propn_nouns) that should be treated as proper nouns, so that they can act as subjects or objects. For more stable performance, it is advisable to fill this list with out-of-vocabulary words such as acronyms, jargon and slang.


my_propn_nouns = ["рувд", "мгу", "ростех", "роснефть", "вычи"]
P = udpipe_parser.UDPipe_Parser(propn_nouns=my_propn_nouns)

An abbreviation dictionary (abbrev_dict) can also be supplied to expand abbreviations into full word forms:

abbrev_dict = {'абс.': 'абсолютный', 'град.': 'градус'}
P = udpipe_parser.UDPipe_Parser(abbrev_dict=abbrev_dict)
# English gloss of the input: "what is the abs. value in deg. Celsius?"
exps = P.run('какое абс. значение в град. Цельсия?')
print(exps[0])

subj : значение абсолютный в градус цельсий,
pred : быть,
obj :
params :
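Conceptually, abbreviation expansion is a substitution step applied to the text before parsing. The sketch below shows one way such a step could look; expand_abbreviations is a hypothetical helper, and the library's own preprocessing may work differently.

```python
import re
from typing import Dict


def expand_abbreviations(text: str, abbrev_dict: Dict[str, str]) -> str:
    """Replace each abbreviation that starts a token with its full form."""
    if not abbrev_dict:
        return text
    # Longest keys first, so a longer abbreviation wins over its own prefix.
    keys = sorted(abbrev_dict, key=len, reverse=True)
    # (?<!\w) prevents a key from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + "|".join(map(re.escape, keys)) + r")")
    return pattern.sub(lambda m: abbrev_dict[m.group(0)], text)


abbrev_dict = {'абс.': 'абсолютный', 'град.': 'градус'}
print(expand_abbreviations('какое абс. значение в град. Цельсия?', abbrev_dict))
# prints: какое абсолютный значение в градус Цельсия?
```

The expanded words are not inflected to agree with their context; in the parser output above they appear in dictionary form anyway.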

Solving Anaphora

The parser can optionally try to resolve anaphora, replacing a pronoun with the noun it refers to:

>>> P = udpipe_parser.UDPipe_Parser()
>>> # English gloss of the input: "I ordered a credit card. Where can I pick it up?"
>>> exps = P.run('Я заказывал кредитку. Где я могу её забрать?', solve_anaphora=True)
>>> for exp in exps:
...     print(exp)
...
subj : я,
pred : заказывать,
obj : кредитка,
params :

subj : я,
pred : забрать,
obj : кредитка,
params : где,

Logging

To trace the parsing process, pass the flag logging=True to run:

P = udpipe_parser.UDPipe_Parser()
# English gloss of the input: "Mathematics mediates between spirit and matter."
exps = P.run('Между духом и материей посредничает математика.', logging=True)

sent: Между духом и материей посредничает математика .
child of  посредничает : духом {'id': 2, 'form': 'духом', 'lemma': 'дух', 'upos': 'NOUN', 'xpos': None, 'feats': {'Animacy': 'Inan', 'Case': 'Ins', 'Gender': 'Masc', 'Number': 'Sing'}, 'head': 5, 'deprel': 'obl', 'deps': None, 'misc': None}
child of  посредничает : математика {'id': 6, 'form': 'математика', 'lemma': 'математика', 'upos': 'NOUN', 'xpos': None, 'feats': {'Animacy': 'Inan', 'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}, 'head': 5, 'deprel': 'nsubj', 'deps': None, 'misc': None}
child of  посредничает : . {'id': 7, 'form': '.', 'lemma': '.', 'upos': 'PUNCT', 'xpos': None, 'feats': None, 'head': 5, 'deprel': 'punct', 'deps': None, 'misc': {'SpaceAfter': 'No'}}
[UDPipe Parser] elapsed time: 0.4530816078186035

print(exps[0])
subj : математика,
pred : посредничать,
obj :
params : между дух, материя,
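The token dictionaries in the log use CoNLL-U field names (id, form, lemma, upos, feats, head, deprel). The sketch below shows how such dictionaries can be queried by dependency relation; children_with_relation is a hypothetical helper, and the parser's actual rules for filling subj, obj and params are not shown in the log.

```python
from typing import Any, Dict, List

# Two token dictionaries from the log above, trimmed to the fields used here.
tokens: List[Dict[str, Any]] = [
    {'id': 2, 'form': 'духом', 'lemma': 'дух', 'upos': 'NOUN',
     'feats': {'Case': 'Ins', 'Number': 'Sing'}, 'head': 5, 'deprel': 'obl'},
    {'id': 6, 'form': 'математика', 'lemma': 'математика', 'upos': 'NOUN',
     'feats': {'Case': 'Nom', 'Number': 'Sing'}, 'head': 5, 'deprel': 'nsubj'},
]


def children_with_relation(tokens: List[Dict[str, Any]],
                           head_id: int, deprel: str) -> List[str]:
    """Lemmas of the tokens attached to `head_id` by the relation `deprel`."""
    return [tok['lemma'] for tok in tokens
            if tok['head'] == head_id and tok['deprel'] == deprel]


print(children_with_relation(tokens, head_id=5, deprel='nsubj'))  # ['математика']
print(children_with_relation(tokens, head_id=5, deprel='obl'))    # ['дух']
```

In the logged sentence the nsubj child of the verb ends up as the subject, while the obl child appears among the parameters.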