Find the English indefinite article ('a' or 'an') for a word. Based on real usage patterns extracted from the Wikipedia text dump, so it can handle tricky edge cases such as acronyms (FIAT vs. FAA, NASA vs. NSA) and odd symbols. (Requires .NET Core 1.0 or .NET 4.5)


Keywords
english, NLP, text-processing, library, indeterminate-article, AvsAn, an, a-vs-an
License
Apache-2.0
Install
Install-Package AvsAn -Version 3.2.0

Documentation

a-vs-an

Find the English indefinite article ("a" or "an") for a word. Based on real usage patterns extracted from the Wikipedia text dump, so it can handle tricky edge cases such as acronyms (FIAT vs. FAA, NASA vs. NSA) and odd symbols.

The implementations (C# and JavaScript) in this project determine whether "a" or "an" should precede a word. They are efficient and accurate (using the method described in this Stack Overflow answer).

You can try the JavaScript implementation of this library online: A-vs-An.

The dataset is based on the Wikipedia article-text dump of July 2014. Additional preprocessing removed as much wiki markup as possible and used regular expressions to extract only text vaguely resembling sentences. If the word following 'a' or 'an' started with a quote or parenthesis, that initial quote or parenthesis was ignored. The resulting prefix list, together with the code to query it, is less than 10KB in size; excluding the actual counts would reduce the size further.
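To make the prefix-list idea concrete, here is a minimal JavaScript sketch of a longest-prefix lookup. It is not the library's code or API: the table `PREFIX_COUNTS`, the function name `articleFor`, and every count in it are hypothetical and invented for illustration, and a plain `Map` stands in for whatever compact structure the real implementations use.

```javascript
// Hypothetical prefix table: prefix -> how often "a" vs. "an" preceded it.
// All counts are invented for this sketch; they are NOT the library's dataset.
const PREFIX_COUNTS = new Map([
  ["",     { a: 10, an: 1 }],  // fallback: most words take "a"
  ["a",    { a: 1, an: 10 }],
  ["e",    { a: 1, an: 10 }],
  ["i",    { a: 1, an: 10 }],
  ["o",    { a: 1, an: 10 }],
  ["u",    { a: 1, an: 10 }],
  ["univ", { a: 10, an: 1 }],  // "a university" overrides the "u" rule
  ["hour", { a: 1, an: 10 }],  // "an hour" overrides the fallback
  ["FAA",  { a: 1, an: 10 }],  // spelled-out acronym: "an FAA ..."
  ["NSA",  { a: 1, an: 10 }],  // spelled-out acronym: "an NSA ..."
]);

// Return "a" or "an" using the longest prefix of the word found in the table.
// The sketch is case-sensitive and only knows lowercase vowel prefixes.
function articleFor(word) {
  // Ignore leading quotes or parentheses, as the preprocessing step does.
  const cleaned = word.replace(/^["'(\u201C\u2018]+/, "");
  for (let len = cleaned.length; len >= 0; len--) {
    const entry = PREFIX_COUNTS.get(cleaned.slice(0, len));
    if (entry) return entry.an > entry.a ? "an" : "a";
  }
  return "a"; // only reached if the empty-prefix fallback is removed
}
```

With this toy table, `articleFor("hour")` and `articleFor("FAA")` pick "an", while `articleFor("university")` and `articleFor("FIAT")` pick "a". Because longer prefixes override shorter ones, exceptions can be stored as a handful of extra entries rather than a full word list.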

The implementations are efficient: on a single thread of a 4.1 GHz i7-4770K, a benchmark classifying all words of an English dictionary (archived local copy: 354984si.ngl) achieves about 17 million words a second; that's just 60ns per word. The JavaScript implementations were benchmarked on Chrome 84 (80ns per lookup), Firefox 32.0a1 (2014-05-22), IE 11, and Opera (12 and 21), and are all about 7-10 times slower, at approximately 4-5 million classifications per second.

Contributing

Contributions welcome! Feel free to make a suggestion or create a pull request with improvements. Contributed code should be Apache 2.0 licensed, as a-vs-an is.

Thanks in particular to @lukespice for adding .NET Core support!