otazkovac

Question generator for Slovak language


Keywords
question, generation, nlp
License
GPL-3.0
Install
pip install otazkovac==0.0.1

Documentation

otazkovac

otazkovac is a very simple (web|command line) application capable of generating questions from unstructured[1] Slovak text, provided it is fed with appropriate data. It is intended to be used as a submodule of Multimedialna Citanka, a web application that helps children in the first three grades of Slovak primary schools learn how to read with the help of a computer. otazkovac shall help teachers automatically generate questions from the text their pupils read and thereby test how much detail from the text they were able to comprehend.

This project is licensed under the terms of the GNU General Public License (v2 or later).

Question generation

While generating questions from text might sound like an interesting problem, the scarcity of academic publications on the topic would suggest otherwise. The most comprehensive work on it currently seems to be Michael Heilman's dissertation[2], which describes many approaches to generating questions from text.

In our case, however, we went for an 'easy win' in terms of question-generation techniques, since the use case of otazkovac suggests that the generated questions do not need to be difficult.

otazkovac therefore functions on the following premise: if a sentence starts with a preposition, it can easily be converted into a question by replacing its prepositional phrase with a question word (in the case of otazkovac either 'kedy' (Slovak for when) or 'kde' (Slovak for where)).

In order for otazkovac to find sentences that could potentially be turned into questions, two stages are required: splitting the text into sentences and detecting whether a sentence starts with a preposition. Luckily, both of these tasks can be performed by MorphoDiTa: Morphological Dictionary and Tagger[3], provided that it gets a pre-trained model as an input. Thankfully, we were provided with such a model by the Slovak National Corpus[4] under the terms of the GNU GPLv2 license.

It is important to note that while most available POS taggers use the Penn Treebank POS tags[5], the Slovak National Corpus uses a specific set of tags[6] that reflects the nature of the Slovak language and provides more information that might be of use in later stages of processing. For the purposes of our discussion it shall be noted that preposition tags start with E. An example of a tagged sentence:

E---6----------u- - Po
AAis6----x------- - dobrom
SSis6------------ - kúpeli
R---------------- - sa
V-ms---cL-A-d---- - rozlúčil
E---7----------u- - s
SSis7------------ - mesiačikom
O---------------- - a
SSfp7------------ - hviezdičkami
O---------------- - a
V-ms---cL-A-d---- - ľahol
R---------------- - si
V-------I-A-e---- - spať
Z - .

As we can see in the example, the first word's tag starts with E and the word is therefore tagged as a preposition. Note that the next two words share the number 6 in their tags in the same position as the first word. This position holds the grammatical case of the word, and the tagger thinks that these three words are all in the 6th Slovak case, which turns out to be the locative. As any Slovak speaker can also see, these three words can be replaced with 'Kedy' and the full stop with a question mark, and we get a sentence that might constitute a fairly good question:

Po dobrom kúpeli sa rozlúčil s mesiačikom a hviezdičkami a ľahol si spať.

then becomes

Kedy sa rozlúčil s mesiačikom a hviezdičkami a ľahol si spať?
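The premise above can be sketched in a few lines of Python. This is a toy illustration, not the project's actual code; the tag layout, with the part of speech at index 0 and the case digit at index 4, is inferred from the tagged example in this section:

```python
# Toy sketch of the premise: strip the leading prepositional phrase of a
# tagged sentence and prepend a question word. Tag layout (inferred from
# the example above): index 0 is the part of speech ('E' = preposition,
# 'Z' = punctuation), index 4 is the grammatical case.

def make_question(tagged, question_word):
    """tagged is a list of (tag, word) pairs; returns (question, answer)."""
    first_tag = tagged[0][0]
    if not first_tag.startswith('E'):
        return None  # the premise does not apply to this sentence
    case = first_tag[4]
    # The prepositional phrase extends over the following words that
    # agree with the preposition in case.
    end = 1
    while (end < len(tagged) and len(tagged[end][0]) > 4
           and tagged[end][0][4] == case):
        end += 1
    answer = ' '.join(word for _, word in tagged[:end])
    rest = [word for tag, word in tagged[end:] if not tag.startswith('Z')]
    question = question_word.capitalize() + ' ' + ' '.join(rest) + '?'
    return question, answer

example = [
    ('E---6----------u-', 'Po'), ('AAis6----x-------', 'dobrom'),
    ('SSis6------------', 'kúpeli'), ('R----------------', 'sa'),
    ('V-ms---cL-A-d----', 'rozlúčil'), ('E---7----------u-', 's'),
    ('SSis7------------', 'mesiačikom'), ('O----------------', 'a'),
    ('SSfp7------------', 'hviezdičkami'), ('O----------------', 'a'),
    ('V-ms---cL-A-d----', 'ľahol'), ('R----------------', 'si'),
    ('V-------I-A-e----', 'spať'), ('Z', '.'),
]
question, answer = make_question(example, 'kedy')
```

Run on the tagged sentence above, the sketch yields the question from the example together with 'Po dobrom kúpeli' as the answer.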

Question type detection

Let us consider another example of a sentence with tags for each word:

E---6----------u- - V
SSfs6------------ - chate
V--p---aK-A-e---- - sme
R---------------- - sa
V-hp---aL-A-d---- - stretli
E---7----------u- - s
AAms7----x------- - ďalším
SSms7------------ - poľovníkom
Z - .

As we can see, the first two words in this example are of a similar type to the words in the first example. However, replacing them with 'Kedy' is not an option: in Slovak, if the preposition 'v' is followed by an object that is not itself a time reference (such as a weekday or the name of a month), the prepositional phrase is most probably associated with a place, not a time. 'Kde' would therefore be far more appropriate here than 'Kedy'.

Just from these two examples it is obvious that in order to create correct and relevant questions, otazkovac needs to be able to detect what type of question can be generated from a given sentence (if any).

To do so we gathered a dataset of sentences that could possibly be transformed into questions as described above, taken from all stories available on Multimedialna Citanka. Since MorphoDiTa models provide the lemma along with each tag, we included this information in the dataset too in order to make the detection more robust against variation in natural language.

In the end the dataset consists of 695 tagged sentences. Unfortunately, the premise from above does not hold in general (thanks to the variability of natural language), and the dataset contains sentences such as 'O zvieratách sa dočítam v encyklopédii Svet zvierat' that fall into neither the 'Kde' (marked P for Place in the dataset) nor the 'Kedy' (marked T for Time in the dataset) category. These sentences should be Ignored and are therefore marked I in the dataset. The numbers of sentences belonging to the respective categories are as follows:

 I: 24
 T: 240
 P: 431

The dataset is stored in the form of a .csv file where the first column is the sentence, the second is the list of lemmatized/tagged words, and the third is one of the types described above. For example:

V tíme vyhrávajú buď všetci alebo nikto .	v/E---6----------u- tím/SSis6------------	P
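A row in this format can be read with a few lines of Python. This is a sketch of ours, not the project's own loader; it only assumes the tab-separated layout described above:

```python
# Sketch: parse one dataset row into its three columns. The second
# column holds whitespace-separated lemma/tag pairs for the
# preposition-initial prefix of the sentence.

def parse_row(line):
    sentence, tagged, qtype = line.rstrip('\n').split('\t')
    # Each token looks like lemma/TAG; split only on the first slash.
    pairs = [tuple(token.split('/', 1)) for token in tagged.split()]
    return sentence, pairs, qtype

row = ('V tíme vyhrávajú buď všetci alebo nikto .\t'
       'v/E---6----------u- tím/SSis6------------\tP')
sentence, pairs, qtype = parse_row(row)
```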

Feature engineering

A natural choice of features in a scenario like this would be n-grams over the list of lemmatized words. A slightly better alternative might be to treat the POS tags themselves as words. The motivation is that a sentence of type P is more likely to have the preposition 'na' followed by some sort of noun (represented by a tag such as SSis6------------) than by one specific noun. Since our dataset is very small, this setup should help us capture more of the variability in the data.

One last improvement that might help even more is the addition of concatenated bigrams from the beginning and the end of the list, so that the list v/E---6----------u- posledný/NAfs6------------ zákruta/SSfs6------------ is also represented by a feature similar to v/E---6----------u- zákruta/SSfs6------------, since the middle word does not change the type.
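The feature extraction described above can be sketched as follows (a rough illustration with names of our own choosing, not the project's actual code): plain word n-grams over the lemma/tag tokens, plus one extra bigram gluing the first and last token together.

```python
# Sketch of the feature extraction: n-grams over lemma/tag tokens plus
# one concatenated first/last "boundary bigram".

def ngrams(tokens, n_min, n_max):
    """All contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(' '.join(tokens[i:i + n]))
    return feats

def extract_features(tokens, n_min=2, n_max=3):
    feats = ngrams(tokens, n_min, n_max)
    if len(tokens) > 2:
        # The concatenated bigram from the beginning and end of the list,
        # so the middle words do not have to match exactly.
        feats.append(tokens[0] + ' ' + tokens[-1])
    return feats

tokens = ['v/E---6----------u-',
          'posledný/NAfs6------------',
          'zákruta/SSfs6------------']
features = extract_features(tokens)
```

For the three-token example above this produces two bigrams, one trigram, and the boundary bigram v/E---6----------u- zákruta/SSfs6------------.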

Model selection

There are multiple models to choose from when it comes to text classification. We might use a multinomial Naive Bayes classifier (NB) as a baseline, a random forest classifier (RF) as an example of a model that tends not to overfit, and an SVM, which is one of the recommended models for text classification on small datasets.

All of the models were tested in combination with the features described above using 10-fold cross-validation. The resulting accuracies are given below:

              NB     RF     SVM
2-3 normal    87.36  83.90  85.90
2-4 normal    88.21  85.76  86.19
1-4 special   89.49  85.62  89.06
2-4 reversed  89.49  87.06  89.06
2-3 reversed  89.49  88.78  90.64

The numbers are the n-gram ranges used (2-3 means that bigrams and trigrams were used), normal denotes the setup with just lemmatized words, special is the setup described in the last paragraph of the section above, and reversed is the setup in which POS tags are treated as words.

As it turns out, our special handcrafted features perform essentially the same as POS tags with 4-grams.
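For illustration, the baseline setup can be sketched with scikit-learn roughly as follows. This is a minimal sketch on made-up toy rows, not the project's actual classify.py; documents are the whitespace-separated lemma/tag tokens, CountVectorizer builds the 2-3 gram features, and multinomial Naive Bayes is the baseline model.

```python
# Minimal sketch of the classification setup: bigram/trigram features
# over lemma/tag tokens fed into a multinomial Naive Bayes baseline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# A handful of made-up training rows in the spirit of the dataset.
docs = ['v/E---6----------u- chata/SSfs6------------',
        'na/E---6----------u- koniec/SSis6------------',
        'po/E---6----------u- obed/SSis6------------',
        'o/E---6----------u- polnoc/SSfs6------------']
labels = ['P', 'P', 'T', 'T']

pipeline = Pipeline([
    # str.split keeps the lemma/tag tokens intact; lowercase=False
    # preserves the uppercase tag letters.
    ('vect', CountVectorizer(tokenizer=str.split, lowercase=False,
                             ngram_range=(2, 3))),
    ('clf', MultinomialNB()),
])
pipeline.fit(docs, labels)
```

On a real run one would evaluate this with 10-fold cross-validation (e.g. sklearn's cross_val_score) as described above, rather than on the training data.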

When we train the best model on the whole dataset we get an accuracy of 0.98273381295. This is what the classification report looks like for the respective classes:

Classification report:
             precision    recall  f1-score   support

          I       1.00      0.71      0.83        24
          P       0.99      0.99      0.99       431
          T       0.97      1.00      0.98       240

avg / total       0.98      0.98      0.98       695

And the confusion matrix:

         I    P    T
    I   17    4    3
    P    0  427    4
    T    0    1  239

It might be interesting to see in which cases the model failed to predict the correct class. Here are a few:

Na konci mesta si našiel lietajúcu motorku a ukradol ju.

Unfortunately, in this case (and in many others) the MorphoDiTa model decided that the third word ('mesta') is in a different case than the two words before it. This is not true, but given our premise the classifier received only the first two lemmatized words instead of the first three, which greatly affected the result.

Na Havaj sa teším.

In this case the MorphoDiTa model thinks that Havaj is an abbreviation or a special entity of sorts that our premise is not ready for. However, we can also see that in this case a completely different type of question could be generated (namely with 'Kam'), which shows potential for future improvement.

Usage

Once we have all the models ready we can actually test otazkovac ourselves.

First of all, otazkovac needs to be installed by running

$ python setup.py install

Note that it is written in Python 2.7. The command above will create a new command, otazkovac, which we can then use on the command line. Assuming that the MorphoDiTa model and the prediction model generated via classify.py are present in the current directory, we can test otazkovac by running the following command:

$ echo "Po dobrom kúpeli sa rozlúčil s mesiačikom a hviezdičkami a ľahol si spať." | otazkovac --morpho-model tagger_model.sk --pipeline pipeline.pkl
<Question('Kedy sa rozlúčil s mesiačikom a hviezdičkami a ľahol si spať?', answer='Po dobrom kúpeli')>

Web server (JSON REST API)

otazkovac can also be used as a REST API endpoint. In order to turn it on you need to add the --server flag to its command. For example:

$ otazkovac --morpho-model tagger_model.sk --pipeline pipeline.pkl --server
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

As the debug message suggests, otazkovac will respond on http://127.0.0.1:5000/.

Note that the API endpoint lives at /questions and accepts POST data in the field text. If you want to see whether the endpoint works, you can use curl to send a POST request, for example like this:

$ curl -d 'text=Po dobrom obede pôjdeme spať.' http://127.0.0.1:5000/questions
{
  "questions": {
    "count": 1,
    "entries": [
      {
        "answer": "Po dobrom obede",
        "question": "Kedy p\u00f4jdeme spa\u0165?"
      }
    ]
  }
}

If the "questions" field is not present in the response object then the "error" field has to be (along with the description of the error):

$ curl -d 'fero=jano' http://127.0.0.1:5000/questions
{
  "error": "Missing text parameter"
}
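The same request can be made from Python using only the standard library. This is a sketch of ours (Python 3), equivalent to the curl call above; actually sending the request of course requires the otazkovac server to be running.

```python
# Sketch: build the POST request for the /questions endpoint using only
# the standard library. Only the request is constructed here; the
# commented lines show how it would be sent and decoded.
import json
import urllib.parse
import urllib.request

def build_request(text, url='http://127.0.0.1:5000/questions'):
    # The endpoint expects form-encoded POST data in the 'text' field.
    data = urllib.parse.urlencode({'text': text}).encode('utf-8')
    return urllib.request.Request(url, data=data, method='POST')

req = build_request('Po dobrom obede pôjdeme spať.')
# with urllib.request.urlopen(req) as response:
#     entries = json.load(response)['questions']['entries']
```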

Conclusion and Future Work

We present a simple yet quite powerful method for generating simple questions from unstructured Slovak text. This method incorporates POS tagging as well as supervised learning of 'question types' for sentences that start with a preposition.

While this work focuses on just one possible way of generating questions, the approach might be reusable in other contexts thanks to possible sentence transformations[7].


[1] By which we mean that no special structure of the text is required. It is expected, however, that the text is grammatically correct and follows standard stylistic conventions, such as using question marks, exclamation marks and full stops to mark the end of a sentence.

[2] Heilman, Michael. Automatic factual question generation from text. Diss. Carnegie Mellon University, 2011.

[3] MorphoDiTa: Morphological Dictionary and Tagger: http://ufal.mff.cuni.cz/morphodita

[4] Slovak National Corpus - Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences: http://korpus.juls.savba.sk/index_en.html

[5] http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

[6] http://korpus.juls.savba.sk/attachments/morpho_en/tagset-www.pdf

[7] Systems that generate questions from English text (such as Heilman, Michael, and Noah A. Smith. Question generation via overgenerating transformations and ranking. Technical Report CMU-LTI-09-013, Carnegie Mellon University, Language Technologies Institute, 2009.) usually use Tregex and Tsurgeon expressions that take care of many cases of sentence transformation: http://nlp.stanford.edu/software/tregex.shtml.