The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package, or are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.
You can install the not yet released version of textdata from CRAN with:
install.packages("textdata")
And the development version from GitHub with:
# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")
The first time you use one of the functions for accessing an included
text dataset, such as
dataset_sentence_polarity(), the function will prompt you to agree to
download the dataset to your computer.
After the first use, each time you use a function like
lexicon_afinn(), the function will load the dataset from disk.
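The first-use and later-use behavior described above can be tried with a short session like the following. This is a minimal sketch, assuming textdata is installed and the machine has network access; the variable name `afinn` is chosen here for illustration. The calls are guarded with `interactive()` because the first call asks for confirmation at the console.

```r
library(textdata)

if (interactive()) {
  # First call: prompts for agreement, then downloads the dataset
  # and stores it on disk.
  afinn <- lexicon_afinn()

  # Later calls: no prompt and no download, the stored copy is loaded.
  afinn <- lexicon_afinn()

  # Inspect the first few rows of the loaded lexicon.
  head(afinn)
}
```

The same pattern applies to the other access functions, such as dataset_sentence_polarity().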
Included text datasets
As of today, the datasets included in textdata are:
- v1.0 sentence polarity dataset
- AFINN-111 sentiment lexicon
- Hu and Liu’s opinion lexicon
- NRC word-emotion association lexicon
- NRC Emotion Intensity Lexicon
- The NRC Valence, Arousal, and Dominance Lexicon
- Loughran and McDonald’s opinion lexicon for financial documents
- Trec-6 and Trec-50
- IMDb Large Movie Review Dataset
Check out each function’s documentation for detailed information (including citations) for the relevant dataset.
Note that this project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support in the GitHub repository (EmilHvitfeldt/textdata). For details on how to add a new dataset to this package, check out the package vignette!