If R makes complex things simple, it can sometimes make simple things
difficult. This is why tabxplor
tries to make it easy to deal with
multiple cross-tables: to create and manipulate them, but also to read
them, using color helpers to highlight important informations
(differences from totals, comparisons between lines or columns,
contributions to variance, margins of error, etc.). It would love to
enhance your data exploration experience with simple yet powerful tools.
All functions are propelled by tidyverse
, pipe-friendly, and render
tibble
data frames which can be easily manipulated with dplyr
. In
the same time, time-taking operations are done with data.table
to go
faster with big dataframes. Tables can be exported to Excel and in html
with formats and colors.
You can install tabxplor from CRAN with:
install.packages("tabxplor")
Or from github with :
# install.packages(devtools)
devtools::install_github("BriceNocenti/tabxplor")
The main functions are made to be user-friendly and time-saving in data analysis workflows.
tab
makes a simple cross-table:
library(tabxplor)
tab(forcats::gss_cat, marital, race)
#> # A tabxplor tab: 7 Ă— 5
#> marital Other Black White Total
#> <fct> <n> <n> <n> <n>
#> 1 No answer 2 2 13 17
#> 2 Never married 633 1 305 3 478 5 416
#> 3 Separated 110 196 437 743
#> 4 Divorced 212 495 2 676 3 383
#> 5 Widowed 70 262 1 475 1 807
#> 6 Married 932 869 8 316 10 117
#> 7 Total 1 959 3 129 16 395 21 483
When one of the row or column variables is numeric, tab
calculates
means by category of the other variable.
tab
comes with options to weight the table, print percentages, manage
totals, digits and missing values, add legends, gather rare categories
in a “Others” level.
tab(forcats::gss_cat, marital, race, pct = "row", na = "drop",
other_if_less_than = 1000, other_level = "Custom_other_level_name")
When a third variable is provided, tab
makes a table with as many
subtables as it has levels. With several tab_vars
, it makes a subtable
for each combination of their levels. The result is grouped: in dplyr,
operations like sum()
or all()
are done within each subtable, and
not for the whole dataframe.
Colors may be added to highlight over-represented and under-represented
cells, and therefore help the user read the table. By default, with
color = "diff"
, colors are based on the differences between a cell and
it’s related total (which only works with means and row or col pct).
When a percentage is superior to the average percentage of the line or
column, it appears with shades of green (or blue). When it’s inferior,
it appears with shades of red/orange. A color legend is added below the
table. In RStudio colors are adapted to the theme, light or dark.
data <- forcats::gss_cat |>
dplyr::filter(year %in% c(2000, 2006, 2012), !marital %in% c("No answer", "Widowed"))
gss <- "Source: General social survey 2000-2014"
gss2 <- "Source: General social survey 2000, 2006 and 2012"
tab(data, race, marital, year, subtext = gss2, pct = "row", color = "diff")
The sup_cols
argument adds supplementary column variables to the
table. With numeric variables, it calculates the mean for each category
or the row variable. With text variables, only the first level is kept
(you can choose which one to use by placing it first with
forcats::fct_relevel
). Use tab_many
to keep all levels.
tab(dplyr::storms, category, status, sup_cols = c("pressure", "wind"))
#> # A tabxplor tab: 7 Ă— 13
#> category disturbance extratropical hurricane `other low`
#> <fct> <n> <n> <n> <n>
#> 1 1 0 0 2 548 0
#> 2 2 0 0 993 0
#> 3 3 0 0 593 0
#> 4 4 0 0 553 0
#> 5 5 0 0 116 0
#> 6 NA 171 2 151 0 1 453
#> 7 Total 171 2 151 4 803 1 453
#> # â„ą 8 more variables: `subtropical depression` <n>, `subtropical storm` <n>,
#> # `tropical depression` <n>, `tropical storm` <n>, `tropical wave` <n>,
#> # Total <n>, pressure <mean>, wind <mean>
By default, to calculate colors, each cell is compared to the subtable’s related total.
When a third variable or more are provided, it’s possible to compare
with the general total line instead, by setting comp = "all"
. Here,
only the last total row is highlighted (TOTAL ENSEMBLE appears in white
but other total rows in grey).
tab(data, race, marital, year, subtext = gss2, pct = "row", color = "diff", comp = "all")
With ref = "first"
, each row (or column) is compared to the first row
(or column), which is particularly helpful to highlight historical
evolutions. The first rows then appears in white (while rows totals are
themselves colored like normal lines).
data <- data |> dplyr::mutate(year = as.factor(year))
tab(data, year, marital, race, pct = "row", color = "diff", ref = "first", tot = "col",
totaltab = "table")
When ref
is a number, the nth row (or column) is used for comparison.
tab(data, year, marital, race, pct = "row", color = "diff", ref = 3)
Finally, when ref
is a string, it it used as a regular expression, to
match with the names of the rows (or columns).
tab(data, year, marital, race, pct = "col", tot = "row", color = "diff", ref = "Married")
It it possible to print confidence intervals for each cell:
tab(forcats::gss_cat, race, marital, pct = "row", ci = "cell")
#> # A tabxplor tab: 4 Ă— 9
#> race `No answer` `Never married` Separated Divorced Widowed Married Total
#> <fct> <row%> <row%> <row%> <row%> <row%> <row%> <row%>
#> 1 Other 0% [30;34]% [5;7]% [9;12]% [3;4]% [45;50]% 100%
#> 2 Black 0% [40;43]% [5;7]% [14;17]% [7;9]% [26;29]% 100%
#> 3 White 0% [21;22]% [2;3]% [16;17]% 9% [50;51]% 100%
#> 4 Total 0% 25% 3% 16% 8% 47% 100%
#> # â„ą 1 more variable: n <n>
It is also possible to use confidence intervals to enhance colors
helpers. With color = "diff_ci"
, the cells are only colored if the
confidence interval of the difference between them and their reference
cell (in total or first row/col) is superior to the difference itself.
Otherwise, it means the cell is not significantly different from it’s
reference in the total (or first) row: it turns grey, and the reader is
not anymore tempted to over-interpret the difference.
tab(forcats::gss_cat, race, marital, pct = "row", color = "diff_ci")
Finally, another calculation appears helpful: the difference between the
cell and the total, minus the confidence interval of this difference (or
in other word, what remains of the difference after having subtracted
the confidence interval). ci = "after_ci"
highligths all the cells
whose value is significantly different from the relative total (or first
cell). This is particularly useful when working on small samples : we
can see at a glance which numbers we have right to read and interpret.
tab(forcats::gss_cat, race, marital, subtext = gss, pct = "row", color = "after_ci")
chi2 = TRUE
add summary statistics made in the chi2 metric: degrees of
freedom (df), unweighted count, pvalue and (sub)table’s variance. Chi2
pvalue is colored in green when inferior to 5%, and in red when superior
or equal to 5%, meaning that the table is not significantly different
from the independent hypothesis (the two variables may be independent).
tab(forcats::gss_cat, race, marital, chi2 = TRUE)
#> chi2 stats marital
#> df 10
#> variance 0.0464
#> chi2 997
#> pvalue <0.01%
#> count 21 483
#>
#> # A tabxplor tab: 4 Ă— 8
#> race `No answer` `Never married` Separated Divorced Widowed Married Total
#> <fct> <n> <n> <n> <n> <n> <n> <n>
#> 1 Other 2 633 110 212 70 932 1 959
#> 2 Black 2 1 305 196 495 262 869 3 129
#> 3 White 13 3 478 437 2 676 1 475 8 316 16 395
#> 4 Total 17 5 416 743 3 383 1 807 10 117 21 483
Chi2 stats can also be used to color cells based on their contributions
to the variance of the (sub)table, with color = "contrib"
. By default,
only the cells whose contribution is superior to the mean contribution
are colored. It highlights the cells which would stand out in a
correspondence analysis (the two related categories would be located at
the edges of the first axes ; here, being black is associated with never
married and being separated).
tab(forcats::gss_cat, race, marital, color = "contrib")
The result of tab
is a tibble::tibble
data frame with class tab
.
It gets it’s own printing methods but, in the same time, can be
transformed using most dplyr
verbs, like a normal tibble
.
library(dplyr)
tab(storms, category, status, sup_cols = c("pressure", "wind")) |>
filter(category != "-1") |>
dplyr::select(-`tropical depression`)
arrange(is_totrow(.), desc(category)) # use is_totrow to keep total rows order
With dplyr::arrange
, don’t forget to keep the order of tab variables
and total rows:
tab(data, race, marital, year, pct = "row") |>
arrange(year, is_totrow(.), desc(Married))
tab
is a wrapper around the more powerful function tab_many
, which
can be used to customize your tables.
It’s possible, for example, to make a summary table of as many columns variables as you want (showing all levels, or showing only one specific level like here):
first_lvs <- c("Married", "$25000 or more", "Strong republican", "Protestant")
data <- forcats::gss_cat |> mutate(across(
where(is.factor),
~ forcats::fct_relevel(., first_lvs[first_lvs %in% levels(.)])
))
tab_many(data, race, c(marital, rincome, partyid, relig, age, tvhours),
levels = "first", pct = "row", chi2 = TRUE, color = "auto")
Using tab
or tab_many
with purrr::map
and tibble::tribble
, you
can program several tables with different parameters all at once, in a
readable way:
tabs <-
purrr::pmap(
tibble::tribble(
~row_var, ~col_vars , ~pct , ~filter , ~subtext ,
"race" , "marital" , "no" , NULL , "Source: GSS 2000-2014",
"race" , "marital" , "row", NULL , "Source: GSS 2000-2014",
"race" , "marital" , "col", NULL , "Source: GSS 2000-2014",
"relig" , c("race", "age"), "row", "year %in% 2000:2010", "Source: GSS 2000-2010",
"relig" , c("race", "age"), "row", "year %in% 2010:2014", "Source: GSS 2010-2014",
NA_character_, "race" , "no" , NULL , "Source: GSS 2000-2014",
),
.f = tab_many,
data = forcats::gss_cat, color = "auto", chi2 = TRUE)
To export a table to html with colors, tabxplor uses knitr::kable
and
kableExtra
. In this format differences from totals, confidence
intervals, contribution to variance, and unweighted counts, are
available in a tooltip at cells hover.
tabs <- tab(forcats::gss_cat, race, marital, subtext = "Source: GSS 2000-2014",
pct = "row", color = "diff")
tabs |> tab_kable()
To print an html table by default (for example, in RStudio viewer), use tabxplor options:
options(tabxplor.print = "kable") # default to options(tabxplor.print = "console")
tab_xl
exports any table or list of tables to Excel, with all colors,
chi2 stats and formatting. On Excel, it is still possible to do
calculations on raw numbers (display is rounded but, below, decimals are
kept).
tabs |> tab_xl(replace = TRUE, sheets = "unique")
tab_plot
exports any table as a plot image.
tabs |> tab_plot()
When not doing data analysis but writing functions, you can use the
sub-functions of tab_many
step by step to attain more flexibility or
speed. That way, it’s possible to write new functions to customize your
tables even more.
data <- dplyr::starwars |>
tab_prepare(sex, hair_color, gender, other_if_less_than = 5,
na_drop_all = sex)
data |>
tab_plain(sex, hair_color, gender, tot = c("row", "col"), pct = "row", comp = "all") |>
tab_ci("diff", color = "after_ci") |>
tab_chi2(calc = "p")
The whole architecture of tabxplor
is powered by a special vector
class, named tabxplor_fmt
for formatted numbers. As a vctrs::record
,
it stores behind the scenes all the data necessary to calculate printed
results, formats and colors. A set of functions are available to access
or transform this data. ?fmt
to get more information.
The simple way to recover the underlying numbers as numeric vectors is
get_num
, which extract the currently displayed field whatever it is :
tabs <- tab(forcats::gss_cat, race, marital, pct = "row")
tabs |> dplyr::mutate(across(where(is_fmt), get_num))
#> # A tabxplor tab: 4 Ă— 9
#> race `No answer` `Never married` Separated Divorced Widowed Married Total
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Other 0.00102 0.323 0.0562 0.108 0.0357 0.476 1
#> 2 Black 0.000639 0.417 0.0626 0.158 0.0837 0.278 1
#> 3 White 0.000793 0.212 0.0267 0.163 0.0900 0.507 1
#> 4 Total 0.000791 0.252 0.0346 0.157 0.0841 0.471 1
#> # â„ą 1 more variable: n <dbl>
To render character vectors (without colors), use format
:
tabs |> mutate(across(where(is_fmt), format))
The following fields compose any fmt
column (though many can be NA
if not calculated) :
-
display
: name of the field to display, customisable for each cell (character) -
n
: raw count (integer) -
wn
: weighted count -
pct
: percentages -
diff
: differences from totals or reference cells -
digits
: digits to display, customisable for each cell (integer) -
ctr
: contributions of cells to variance (withcolor = "contrib"
) -
mean
: means (for numeric column variables) -
var
: variance (for numeric column variables ; Chi2 variance withpct
) -
ci
: confidence intervals -
rr
: relative risks, needed to calculate odds ratio -
or
: odds ratios (or relative risks ratios) -
in_totrow
:TRUE
if the cell is part of a total row,FALSE
otherwise (logical) -
in_tottab
:TRUE
if the cell is part of a total table,FALSE
otherwise (logical) -
in_refrow
:TRUE
if the cell is part of a reference row,FALSE
otherwise (logical)
vctrs::vec_data(tabs$Married)
#> n display digits wn pct mean diff ctr var ci rr or
#> 1 932 pct 0 NA 0.4757529 1.0102402 0.004822432 NA NA NA NA NA
#> 2 869 pct 0 NA 0.2777245 0.5897357 -0.193205991 NA NA NA NA NA
#> 3 8316 pct 0 NA 0.5072278 1.0770757 0.036297310 NA NA NA NA NA
#> 4 10117 pct 0 NA 0.4709305 1.0000000 0.000000000 NA NA NA NA NA
#> in_totrow in_tottab in_refrow
#> 1 FALSE FALSE FALSE
#> 2 FALSE FALSE FALSE
#> 3 FALSE FALSE FALSE
#> 4 TRUE FALSE FALSE
To get those underlying fields you can either use vctrs::fields
or,
more simply, $
:
tabs |> mutate(across(where(is_fmt), ~ vctrs::field(., "pct") ))
tabs$Married$pct
tabs$Married$n
tabs |> mutate(across(where(is_fmt), ~ .$n))
To change the field currently displayed, for the whole table or a single
vector, you can use set_display()
:
tabs |> set_display("diff")
tabs |> mutate(across(where(is_fmt), ~ set_display(., "diff")))
To modify a field, you can use vctrs
field<-
. For example, to change
the number of digits :
tab(forcats::gss_cat, race, marital, pct = "row") |>
mutate(across(where(is_fmt), ~ vctrs::`field<-`(., "digits", rep(2L, length(.)))))
Faster to write and easier to read, you can also use dplyr::mutate()
on an fmt
vector. For example, to create a new column with standards
deviations and display it with decimals :
tab_num(data, race, c(age, tvhours), marital, digits = 1L, comp = "all") |>
dplyr::mutate(dplyr::across( #Mutate over the whole table.
c(age, tvhours),
~ dplyr::mutate(., #Mutate over each fmt vector's underlying data.frame.
var = sqrt(var),
display = "var",
digits = 2L) |>
set_color("no"),
.names = "{.col}_sd"
))
Some helper functions exists for total rows, total tables and reference
rows (is_totrow()
/ as_totrow()
, is_tottab()
/ as_tottab()
,
is_refrow()
/ as_refrow()
) :
tab(data, race, marital, year, pct = "row") |>
dplyr::mutate(across(
where(is_fmt),
~ dplyr::if_else(is_totrow(.),
true = mutate(., digits = 1L),
false = mutate(., digits = 2L))
))
Each fmt
column have attributes, which you can access or modify with
get_
and set_
functions :
-
type /
get_type()
/set_type()
: the type of thefmt
vector, amongc("n", "mean", "row", "col", "all", "all_tabs")
; it determines which calculations are done withintab_
functions. -
totcol /
is_totcol()
/as_totcol()
:TRUE
if the column is a total column,FALSE
otherwise (logical) -
refcol /
is_refcol()
/as_refcol()
:TRUE
if the column is a reference column for comparison,FALSE
otherwise (logical) -
color /
get_color()
/set_color()
: the calculation to make to print colors ; amongc("", "no", "diff", "diff_ci", "after_ci", "contrib")
-
col_var /
get_col_var()
/set_col_var()
: the name of the column variable (there can be many in one single table) -
comp_all /
get_comp_all
/set_comp_all()
: when there aretab_vars
, is the reference for comparison the subtable (FALSE
), or the total table (TRUE
) ? -
ref /
get_ref_type()
/set_diff_type()
: the type of difference calculated, either"no"
,"tot"
for totals, an index, or a regular expression. -
ci_type /
get_ci_type()
/set_ci_type()
: the type of confidence interval, either"cell"
or"diff"
For example, to print the number of observations of the total column :
tab(data, race, marital, year, pct = "row") |>
mutate(across(where(is_totcol), ~ mutate(., display = "n") ))
Note that, if tab_vars
are provided, the table is grouped and all
operations are made within groups. To remove grouping (for example when
it gives errors), use dplyr::ungroup()
.
If you only need the simplest table, with only numeric counts (no
fmt
), or even a base data.frame
(not a tibble
) :
tab_plain(data, race, marital, num = TRUE) # counts as a numeric vector
tab_plain(data, race, marital, df = TRUE) # same, with unique class = "data.frame"