Tools to Create, Modify and Manage Corpora for the Corpus Workbench (CWB)
The Corpus Workbench (CWB) is a classic indexing and query engine to efficiently work with large, linguistically annotated corpora. The cwbtools package offers a set of tools to conveniently create, modify and manage CWB indexed corpora from within R. It complements R packages that use the CWB as a backend for text mining with R, namely the RcppCWB package for low-level access to CWB indexed corpora, and polmineR as a toolset to implement common text mining workflows.
The package is available via CRAN. The development version can be installed from GitHub as follows on Windows, macOS and Linux.
    # Make sure the remotes package is present
    if (!"remotes" %in% installed.packages()[, "Package"]) install.packages("remotes")
    Sys.setenv(R_REMOTES_STANDALONE = "true")
    remotes::install_github("PolMine/cwbtools", ref = "dev", force = TRUE)
The default approach to installing the development version of cwbtools from GitHub would be devtools::install_github("PolMine/cwbtools", ref = "dev"). However, because both devtools and cwbtools depend on the curl package, this may cause nerve-racking problems whenever curl can be updated: if a newer version of curl is available, the user will be asked whether to install it. Most users will agree. The update will then fail, because curl has already been loaded by devtools and the dynamic library of a loaded package cannot be deleted or replaced.
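Whether a session is affected can be checked before starting an installation. The following is a minimal sketch using base R only; the helper name curl_is_loaded() is hypothetical and not part of cwbtools, devtools or remotes.

```r
# Hypothetical helper (not part of any package): returns TRUE if the
# namespace of the curl package is loaded in the current R session.
curl_is_loaded <- function() "curl" %in% loadedNamespaces()

if (curl_is_loaded()) {
  # The dynamic library of curl is in use, so an update is likely to fail.
  message("curl is loaded: restart R and update curl before loading devtools.")
} else {
  # Nothing holds on to the installed files of curl in this session.
  message("curl is not loaded: it can be updated in this session.")
}
```

Running the check in a fresh R session, before any other package has been attached, gives the most reliable answer.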
To avoid having to perform manual updates in the correct order, using the original install_github() function of the remotes package is recommended. When the environment variable R_REMOTES_STANDALONE is set to "true", the remotes package will rely on a minimal set of additional packages. This avoids the situation described above, which may make the installation of cwbtools difficult for most users.
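If the environment variable should not stay set for the rest of the session, it can be set for a single call only. The sketch below assumes that remotes reads R_REMOTES_STANDALONE at the time install_github() is called; the wrapper with_standalone_remotes() is a hypothetical helper, not a function of the remotes package.

```r
# Hypothetical wrapper: evaluate `code` with R_REMOTES_STANDALONE set to
# "true" and restore the previous state of the variable afterwards.
with_standalone_remotes <- function(code) {
  old <- Sys.getenv("R_REMOTES_STANDALONE", unset = NA)
  Sys.setenv(R_REMOTES_STANDALONE = "true")
  on.exit({
    if (is.na(old)) {
      Sys.unsetenv("R_REMOTES_STANDALONE")
    } else {
      Sys.setenv(R_REMOTES_STANDALONE = old)
    }
  }, add = TRUE)
  # Arguments are evaluated lazily, so `code` runs only here, after the
  # variable has been set.
  force(code)
}

# Usage (not run here, as it needs network access):
# with_standalone_remotes(
#   remotes::install_github("PolMine/cwbtools", ref = "dev")
# )
```

Because the variable is restored on exit, later calls in the same session behave as they did before.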
The CWB is a classic indexing and query engine. Its character as an open source project is of great value to the community working with corpora. The enduring effort of the CWB developers is gratefully acknowledged!