capesR

Access to CAPES Data


Licenses
CNRI-Python-GPL-Compatible/CNRI-Python-GPL-Compatible



capesR is an R package designed to facilitate access to and manipulation of data from the Catalog of Theses and Dissertations maintained by the Brazilian Coordination for the Improvement of Higher Education Personnel (CAPES). This catalog contains information about theses and dissertations defended at higher education institutions (HEIs) in Brazil.

The original CAPES data is available at dadosabertos.capes.gov.br.

The data used in this package is available in a repository on the Open Science Framework (OSF).

Installation

You can install this package from CRAN with:

# Install capesR from CRAN
install.packages('capesR')

Functions

Download Data

The download_capes_data function allows you to download CAPES data files hosted on OSF. You can specify the desired years, and the corresponding files will be saved locally.

Example 1

Download data using the temporary directory (default):

library(capesR)
library(dplyr)

# Download data for the years 1987 and 1990
capes_files <- download_capes_data(c(1987, 1990))

# View the list of downloaded files
capes_files %>% glimpse()

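In this case, the files are saved in the session's temporary directory, which R removes when the session ends, so the data will not persist for future use.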

Example 2 - Reusing Data

It is recommended to define a persistent directory to store the downloaded data instead of using the default temporary directory (tempdir()). This allows you to reuse the data later.

# Define the directory where the data will be stored
# (a folder in the user's home directory, which is normally writable)
data_directory <- path.expand("~/capes_data")

# Download data for 1987 and 1990 using a persistent directory
capes_files <- download_capes_data(
  c(1987, 1990),
  destination = data_directory)

In this case, data will only be downloaded once. Future calls will identify which files already exist and return their paths.
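
The reuse behavior described above can be pictured with a small sketch. This is not the package's implementation: `fetch_once`, `fake_fetch`, and the `capes_<year>.dat` file names are hypothetical stand-ins, used only to show the "download only what is missing, then return the paths" pattern.

```r
# Hypothetical helper: fetch each year's file only if it is not already present.
# `fetch_fun` stands in for a real downloader; file names are invented.
fetch_once <- function(years, destination, fetch_fun) {
  dir.create(destination, recursive = TRUE, showWarnings = FALSE)
  paths <- file.path(destination, paste0("capes_", years, ".dat"))
  is_missing <- !file.exists(paths)
  for (i in which(is_missing)) {
    fetch_fun(years[i], paths[i])
  }
  # Return the paths either way, flagging which files were fetched in this call
  data.frame(year = years, file = paths, downloaded_now = is_missing)
}

# Stand-in downloader that only writes a placeholder file
fake_fetch <- function(year, path) writeLines(paste("data for", year), path)

cache_dir <- file.path(tempdir(), "capes_cache_demo")
unlink(cache_dir, recursive = TRUE)  # start from an empty cache

first_call  <- fetch_once(c(1987, 1990), cache_dir, fake_fetch)
second_call <- fetch_once(c(1987, 1990), cache_dir, fake_fetch)
```

In this sketch the first call fetches both files, while the second call finds them on disk and only returns their paths.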

Combine Data

Use the read_capes_data function to combine the downloaded files. It accepts the list of files returned by download_capes_data or one created manually.

Example 1 - Combine Data Without Filters

# Combine all selected data without filters
combined_data <- read_capes_data(capes_files)

# View the combined data
combined_data %>% glimpse()

Example 2 - Combine Data with Exact Filters

Filters are applied before reading the data, improving performance.

# Create a filter object
exact_filter <- list(
  ano_base = c(2021, 2022),
  uf = c("PE", "CE")
)

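# Combine filtered data
# (capes_files should cover the filtered years, here 2021 and 2022;
# otherwise no rows will match)
filtered_data <- read_capes_data(capes_files, exact_filter)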

# View the filtered data
filtered_data %>% glimpse()
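
To make the meaning of the filter list concrete, here is a minimal sketch on a toy data frame. It assumes the usual reading of such a list: values within one column are alternatives, and all columns must match. The helper `apply_exact_filter` and the toy rows are hypothetical and are not part of capesR.

```r
# Toy records; column names follow the filter example, values are invented
toy <- data.frame(
  ano_base = c(2020, 2021, 2021, 2022),
  uf       = c("PE", "PE", "SP", "CE"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where every filtered column
# holds one of the allowed values for that column
apply_exact_filter <- function(df, filters) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(filters)) {
    keep <- keep & df[[col]] %in% filters[[col]]
  }
  df[keep, , drop = FALSE]
}

exact_filter <- list(ano_base = c(2021, 2022), uf = c("PE", "CE"))
toy_filtered <- apply_exact_filter(toy, exact_filter)
```

Under this reading, only the 2021/PE and 2022/CE rows remain; the 2020 row and the SP row are dropped.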

Example 3 - Combine Data with Text Filters

A filter list can mix exact and text filters. The exact filters (here ano_base and uf) are applied before reading for performance, and the text filter (here titulo) is optimized for quick searches.

# Create a filter object
text_filter <- list(
  ano_base = c(2018, 2019, 2020, 2021, 2022),
  uf = c("PE", "CE"),
  titulo = "Educação"
)

# Combine filtered data
# (capes_files should cover the filtered years, here 2018 to 2022)
text_filtered_data <- read_capes_data(capes_files, text_filter)

# View the filtered data
text_filtered_data %>% glimpse()

Search Text

To search for text in already combined data, use the search_capes_text function, specifying the search term and the text field to search (e.g., title, abstract, author, or advisor). The field is given by its column name, as with "titulo" in the example.

Example:

results <- search_capes_text(
  data = combined_data,
  term = "Educação",
  field = "titulo"
)
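
As a rough illustration of what a text search over one field does, the sketch below filters a toy data frame with base R. It assumes a case-insensitive, literal substring match; whether search_capes_text treats case or accents this way is not stated here, and `search_text_sketch` and the toy titles are hypothetical.

```r
# Toy titles (invented); the column name follows the example above
toy_titles <- data.frame(
  titulo = c("Ensino rural no Nordeste",
             "Redes neurais profundas",
             "O ensino de jovens e adultos"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive literal match of `term` in column `field`
search_text_sketch <- function(data, term, field) {
  hits <- grepl(tolower(term), tolower(data[[field]]), fixed = TRUE)
  data[hits, , drop = FALSE]
}

toy_hits <- search_text_sketch(toy_titles, term = "Ensino", field = "titulo")
```

Here the first and third titles match, because the comparison ignores case.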

Data

Synthetic Data

The package also provides a synthetic dataset, capes_synthetic_df. Here "synthetic" means summarized: the dataset contains aggregated counts from the CAPES Catalog of Theses and Dissertations. It supports quick analyses and prototyping without downloading and processing the full data.

Data Structure

The synthetic data includes the following columns:

  • base_year: Reference year of the data.
  • institution: Higher Education Institution.
  • area: Area of Concentration.
  • program_name: Graduate Program Name.
  • type: Type of work (e.g., Master's, Doctorate).
  • region: Region of Brazil.
  • state: Federative Unit (state).
  • n: Total number of works.

Loading the Data

The synthetic data is available directly in the package and can be loaded with:

data(capes_synthetic_df)

# View the first rows of the data
head(capes_synthetic_df)

Example Usage

You can use the synthetic data for quick exploratory analyses or charts:

# Load the data
data(capes_synthetic_df)

# Example: Count by year and type of work
library(dplyr)
capes_synthetic_df %>%
  group_by(base_year, type) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  arrange(desc(total))
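
Because each row already aggregates several works, totals come from summing n rather than counting rows. The sketch below shows this on a toy data frame that reuses some of the documented column names; the values, including the labels in type and region, are invented and may not match those in capes_synthetic_df.

```r
# Toy stand-in for the aggregated data (values are invented)
toy_synthetic <- data.frame(
  base_year = c(2020, 2020, 2021, 2021),
  type      = c("Master's", "Doctorate", "Master's", "Doctorate"),
  region    = c("Northeast", "Northeast", "Southeast", "Northeast"),
  n         = c(10, 4, 12, 5),
  stringsAsFactors = FALSE
)

# Total works per region: sum the pre-aggregated counts
totals_by_region <- aggregate(n ~ region, data = toy_synthetic, FUN = sum)

# Year-by-type table of totals, convenient for a quick look or a bar chart
year_by_type <- xtabs(n ~ base_year + type, data = toy_synthetic)
```

Counting rows instead would report three records for the Northeast, while the summed total is 19 works.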