Junk Not-Junk Detector

pip install junkdetect==0.1.2



This tool does one simple task: it tells junk text apart from not-junk text across a variety of languages. Think of the famous hotdog/not-hotdog classifier, but applied to natural language text. It is particularly useful for testing tools that extract, decompress, and/or decrypt natural language text.


Uses fairseq

# Optionally create a brand new conda environment for this
#conda create -n junkdetect python=3.7
#conda activate junkdetect

# Install: use only one of these methods
# 1. from pypi; recommended
pip install junkdetect

# 2. latest master branch
pip install git+https://github.com/thammegowda/junkdetect

# 3. for development
git clone https://github.com/thammegowda/junkdetect \
     && cd junkdetect \
     && pip install --editable .

How to use

Once installed via pip, the tool can be invoked from the command line as either junkdetect or python -m junkdetect:

printf "This is a good sentence. \nT6785*&^T is 747658 you T&*^\n" | junkdetect
0.999824	This is a good sentence.
0.0747487	T6785*&^T is 747658 you T&*^

The output has one line per input line, with two columns separated by a tab (\t). The first column is a perplexity-based score: a lower value (i.e., close to 0.0) means junk and a higher value (close to 1.0) means not-junk. The second column is the input text. If you don't want the input sentences echoed back in the output, cut them out: junkdetect | cut -f1 > scores.txt
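The two-column output described above is easy to post-process in Python. The following is a minimal sketch, not part of junkdetect itself: the helper names (parse_scores, run_junkdetect, keep_not_junk) and the 0.5 cutoff are assumptions made for illustration, and a sensible threshold will depend on your data.

```python
import subprocess
from typing import Iterable, List, Tuple


def parse_scores(output: str) -> List[Tuple[float, str]]:
    """Parse 'score<TAB>text' lines into (score, text) pairs."""
    pairs = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text survive.
        score, _, text = line.partition("\t")
        pairs.append((float(score), text))
    return pairs


def run_junkdetect(lines: Iterable[str], cmd=("junkdetect",)) -> List[Tuple[float, str]]:
    """Pipe lines through the junkdetect CLI (one input per line) and parse stdout.

    Hypothetical wrapper: it only relies on the stdin/stdout behaviour shown
    in the usage example, and requires junkdetect to be installed.
    """
    proc = subprocess.run(
        list(cmd),
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_scores(proc.stdout)


def keep_not_junk(pairs: List[Tuple[float, str]], threshold: float = 0.5) -> List[str]:
    """Return texts whose score is at or above the (assumed) threshold."""
    return [text for score, text in pairs if score >= threshold]


# Demo on the sample output from the usage example, without calling the CLI.
sample = "0.999824\tThis is a good sentence.\n0.0747487\tT6785*&^T is 747658 you T&*^\n"
for text in keep_not_junk(parse_scores(sample)):
    print(text)
```

With junkdetect installed, run_junkdetect(["some text", "more text"]) would return the same kind of (score, text) pairs directly from the command-line tool.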

How does this work

junkdetect looks like only a few lines of Python code, but it hides a great deal of complexity under the hood.
It uses perplexity from neural (masked/auto-regressive) language models that were trained on terabytes of web text in 100 languages.
Specifically, it uses Facebook Research's XLM-R, retrieved from torch.hub. Quoting the original developers of XLM-R and their paper (see Table 6):

XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
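To make the perplexity idea from this section concrete, here is a small sketch of how per-token log-probabilities from a language model can be turned into a score between 0 and 1. It is an illustration only: the token probabilities below are made-up numbers rather than XLM-R output, the function names are hypothetical, and whether junkdetect reports exactly this quantity (inverse perplexity) is an assumption.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def fluency_score(token_log_probs: Sequence[float]) -> float:
    """Inverse perplexity: the geometric mean of token probabilities, in (0, 1].

    Assumed scoring rule for illustration; higher means the model finds
    the text more predictable.
    """
    return 1.0 / perplexity(token_log_probs)


# Hypothetical per-token probabilities a model might assign.
fluent = [math.log(p) for p in (0.9, 0.8, 0.95)]   # predictable text
junk = [math.log(p) for p in (0.05, 0.1, 0.02)]    # surprising text

print("fluent:", fluency_score(fluent))
print("junk:  ", fluency_score(junk))
```

Text the model finds predictable has low perplexity and therefore a score near 1.0, while text the model finds surprising has high perplexity and a score near 0.0, which matches the direction of the scores in the usage example.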

Back Story and Acknowledgements: