stopwordsiso

Collection of stopwords for multiple languages, keyed by ISO 639-1 language code.


Keywords
stopwords, language
License
MIT
Install
pip install stopwordsiso==0.6.1

Documentation

Stopwords ISO

The most comprehensive collection of stopwords for multiple languages.

The collection follows the ISO 639-1 language code.

If you only need stopwords for a specific language, there is a separate collection for each.

Usage

The collection is in JSON format. You are free to use this collection any way you like.
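Because the collection is plain JSON, any JSON parser can read it. The sketch below assumes the layout implied by the Node example in this README: a single object keyed by ISO 639-1 code, with a list of words as each value. The inline sample is a tiny hand-written stand-in, not the real data, and the helper name load_stopwords is hypothetical.

```python
import json

# Tiny hand-written stand-in for the real JSON file (illustrative only).
SAMPLE_JSON = '{"en": ["a", "the", "of"], "de": ["der", "die", "das"]}'


def load_stopwords(json_text):
    """Parse JSON text into a dict mapping language code -> set of words."""
    raw = json.loads(json_text)
    return {lang: set(words) for lang, words in raw.items()}


collection = load_stopwords(SAMPLE_JSON)
english = collection.get("en", set())  # empty set if the code is absent
print(sorted(english))
```

Converting each list to a set makes membership tests cheap, which is usually what stopword filtering needs.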

It is currently published only on npm, Bower, and PyPI (installable with pip).

Node/JavaScript

$ npm install stopwords-iso
$ bower install stopwords-iso
// Node
const stopwords = require('stopwords-iso');  // object of stopwords for multiple languages
const english = stopwords.en;  // English stopwords

Python

$ pip install stopwordsiso
# Python
import stopwordsiso as stopwords

stopwords.has_lang("th")  # check whether stopwords exist for the language
stopwords.langs()  # return a set of all the supported languages
stopwords.stopwords("en")  # English stopwords
stopwords.stopwords(["de", "id", "zh"])  # German, Indonesian, and Chinese stopwords
stopwords.stopwords("xxx")  # returns an empty set for an unknown language
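
A common use of these sets is filtering tokens before further text processing. The following is a minimal sketch, not part of the package: remove_stopwords is a hypothetical helper, the whitespace tokenization is a deliberate simplification, and the fallback set exists only so the snippet runs when stopwordsiso is not installed.

```python
def remove_stopwords(text, stopword_set):
    """Return the whitespace-separated tokens of `text` that are not stopwords.

    Matching is case-insensitive; kept tokens retain their original casing.
    """
    return [tok for tok in text.split() if tok.lower() not in stopword_set]


try:
    import stopwordsiso
    english = stopwordsiso.stopwords("en")
except ImportError:
    # Hand-written placeholder, NOT the package's real English list.
    english = {"the", "is", "on", "a"}

print(remove_stopwords("The cat is on the mat", english))
```

Real text usually needs punctuation stripping and language-aware tokenization before this step, particularly for languages written without spaces between words.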

Contributing

If you wish to remove or update some of the stopwords, please file an issue before sending a PR to the repo of the specific language.

If you would like to add a stopword or a new set of stopwords, please add them as a new text file on the repo of the corresponding language.

Credits

All stopwords sources are listed here.

List of Included Languages

This table lists the entire set of ISO 639-1:2002 codes, with a check mark indicating those language codes that are found in stopwords-iso.json.

The list of codes itself is from www.loc.gov, which is the official "language codes list" and is linked to from www.iso.org.

ISO 639-1 Code Language Included Here
aa Afar
ab Abkhazian
af Afrikaans ✓
ak Akan
sq Albanian
am Amharic
ar Arabic ✓
an Aragonese
hy Armenian ✓
as Assamese
av Avaric
ae Avestan
ay Aymara
az Azerbaijani
ba Bashkir
bm Bambara
eu Basque ✓
be Belarusian
bn Bengali ✓
bh Bihari languages
bi Bislama
bo Tibetan
bs Bosnian
br Breton ✓
bg Bulgarian ✓
my Burmese
ca Catalan; Valencian ✓
cs Czech ✓
ch Chamorro
ce Chechen
zh Chinese ✓
cu Church Slavic; Old Slavonic; Church Slavonic; Old Bulgarian; Old Church Slavonic
cv Chuvash
kw Cornish
co Corsican
cr Cree
cy Welsh
da Danish ✓
de German ✓
dv Divehi; Dhivehi; Maldivian
nl Dutch; Flemish ✓
dz Dzongkha
el Greek, Modern (1453-) ✓
en English ✓
eo Esperanto ✓
et Estonian ✓
ee Ewe
fo Faroese
fa Persian ✓
fj Fijian
fi Finnish ✓
fr French ✓
fy Western Frisian
ff Fulah
ka Georgian
gd Gaelic; Scottish Gaelic
ga Irish ✓
gl Galician ✓
gv Manx
gn Guarani
gu Gujarati ✓
ht Haitian; Haitian Creole
ha Hausa ✓
he Hebrew ✓
hz Herero
hi Hindi ✓
ho Hiri Motu
hr Croatian ✓
hu Hungarian ✓
ig Igbo
is Icelandic
io Ido
ii Sichuan Yi; Nuosu
iu Inuktitut
ie Interlingue; Occidental
ia Interlingua (International Auxiliary Language Association)
id Indonesian ✓
ik Inupiaq
it Italian ✓
jv Javanese
ja Japanese ✓
kl Kalaallisut; Greenlandic
kn Kannada
ks Kashmiri
kr Kanuri
kk Kazakh
km Central Khmer
ki Kikuyu; Gikuyu
rw Kinyarwanda
ky Kirghiz; Kyrgyz
kv Komi
kg Kongo
ko Korean ✓
kj Kuanyama; Kwanyama
ku Kurdish ✓
lo Lao
la Latin ✓
lv Latvian ✓
li Limburgan; Limburger; Limburgish
ln Lingala
lt Lithuanian ✓
lb Luxembourgish; Letzeburgesch
lu Luba-Katanga
lg Ganda
mk Macedonian
mh Marshallese
ml Malayalam
mi Maori
mr Marathi ✓
ms Malay ✓
mg Malagasy
mt Maltese
mn Mongolian
na Nauru
nv Navajo; Navaho
nr Ndebele, South; South Ndebele
nd Ndebele, North; North Ndebele
ng Ndonga
ne Nepali
nn Norwegian Nynorsk; Nynorsk, Norwegian
nb Bokmål, Norwegian; Norwegian Bokmål
no Norwegian ✓
ny Chichewa; Chewa; Nyanja
oc Occitan (post 1500)
oj Ojibwa
or Oriya
om Oromo
os Ossetian; Ossetic
pa Panjabi; Punjabi
pi Pali
pl Polish ✓
pt Portuguese ✓
ps Pushto; Pashto
qu Quechua
rm Romansh
ro Romanian; Moldavian; Moldovan ✓
rn Rundi
ru Russian ✓
sg Sango
sa Sanskrit
si Sinhala; Sinhalese
sk Slovak ✓
sl Slovenian ✓
se Northern Sami
sm Samoan
sn Shona
sd Sindhi
so Somali ✓
st Sotho, Southern ✓
es Spanish; Castilian ✓
sc Sardinian
sr Serbian
ss Swati
su Sundanese
sw Swahili ✓
sv Swedish ✓
ty Tahitian
ta Tamil
tt Tatar
te Telugu
tg Tajik
tl Tagalog ✓
th Thai ✓
ti Tigrinya
to Tonga (Tonga Islands)
tn Tswana
ts Tsonga
tk Turkmen
tr Turkish ✓
tw Twi
ug Uighur; Uyghur
uk Ukrainian ✓
ur Urdu ✓
uz Uzbek
ve Venda
vi Vietnamese ✓
vo Volapük
wa Walloon
wo Wolof
xh Xhosa
yi Yiddish
yo Yoruba ✓
za Zhuang; Chuang
zu Zulu ✓
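
The "Included Here" column can be rechecked programmatically by comparing a list of ISO 639-1 codes against the codes for which stopwords exist. This is a minimal sketch, assuming the available codes are supplied as a set (for example, the result of stopwords.langs() from the Python usage section). mark_included is a hypothetical helper, and the two small inputs are samples, not the full lists.

```python
def mark_included(iso_codes, available):
    """Pair each ISO 639-1 code with True/False for presence in `available`."""
    return [(code, code in available) for code in iso_codes]


# Small illustrative samples; in practice `available` could be the set
# returned by stopwordsiso.langs().
iso_codes = ["aa", "ab", "af", "ak"]
available = {"af", "ar", "en"}

for code, included in mark_included(iso_codes, available):
    print(code, "✓" if included else "")
```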