(And a Few R Markdown/Quarto Tips!)
Karissa Whiting
Research Biostatistician
Memorial Sloan Kettering
{biostatR}
Intro & Installation
Component Packages
Extra Features
R Markdown & Quarto
{biostatR} is a set of packages to help you create and customize analysis reports
Installs and loads 3 useful packages
Also installs (without loading) additional packages to improve your workflow (meaning you’ll need to load these ‘extra packages’ explicitly with library()
).
Run the following installation code outside of an RStudio project (you only need to do this once!)
Run the following installation code outside of an RStudio project (you only need to do this once!)
install.packages("rstudio.prefs")
rstudio.prefs::use_rstudio_secondary_repo(
MSK_RSPM = "http://rspm.mskcc.org/MSKREPO/latest"
)
{gtsummary} - Tools for creating publication-ready statistical summary tables
{bstfun} - A miscellaneous collection of functions for our department
{gtsummary} - Tools for creating publication-ready statistical summary tables
{bstfun} - A miscellaneous collection of functions for our department
{lubridate} - Tools to work with date-times in R
Characteristic | N = 2001 |
---|---|
Age | 47 (38, 57) |
Unknown | 11 |
Grade | |
I | 68 (34%) |
II | 68 (34%) |
III | 64 (32%) |
Tumor Response | 61 (32%) |
Unknown | 7 |
1 Median (IQR); n (%) |
Four types of summaries: continuous
, continuous2
, categorical
, and dichotomous
Variables coded 0/1
, TRUE/FALSE
, Yes/No
treated as dichotomous
Statistics are median (IQR)
for continuous, n (%)
for categorical/dichotomous
Lists NA
values under “Unknown”
Label attributes are printed automatically
Characteristic | Drug A, N = 981 | Drug B, N = 1021 |
---|---|---|
Age | ||
Mean (SD) | 47.0 (14.7) | 47.4 (14.0) |
Range | 6.0, 78.0 | 9.0, 83.0 |
Unknown | 7 | 4 |
Pathologic tumor grade | ||
I | 35 (36%) | 33 (32%) |
II | 32 (33%) | 36 (35%) |
III | 31 (32%) | 33 (32%) |
Tumor Response | 28 / 95 (29%) | 33 / 98 (34%) |
Unknown | 3 | 4 |
1 n (%); n / N (%) |
Characteristic | Drug A, N = 981 | Drug B, N = 1021 |
---|---|---|
Age | ||
Mean (SD) | 47.0 (14.7) | 47.4 (14.0) |
Range | 6.0, 78.0 | 9.0, 83.0 |
Unknown | 7 | 4 |
Pathologic tumor grade | ||
I | 35 (36%) | 33 (32%) |
II | 32 (33%) | 36 (35%) |
III | 31 (32%) | 33 (32%) |
Tumor Response | 28 / 95 (29%) | 33 / 98 (34%) |
Unknown | 3 | 4 |
1 n (%); n / N (%) |
by
: specifies a column variable for cross-tabulation
type
: specifies the summary type
statistic
: customize the reported statistics
label
: change or customize variable labels
digits
: specify the number of decimal places for rounding
Summary tables can be further updated using helper functions:
add_*() add additional column of statistics or information, e.g. p-values, q-values, overall statistics, treatment differences, N obs., and more
modify_*() modify table headers, spanning headers, footnotes, and more
bold_()/italicize_() style labels, variable levels, significant p-values
Characteristic | Drug A, N = 981 | Drug B, N = 1021 | p-value2 | q-value3 |
---|---|---|---|---|
Age | 46 (37, 59) | 48 (39, 56) | 0.7 | 0.9 |
Unknown | 7 | 4 | ||
Grade | 0.9 | 0.9 | ||
I | 35 (36%) | 33 (32%) | ||
II | 32 (33%) | 36 (35%) | ||
III | 31 (32%) | 33 (32%) | ||
Tumor Response | 28 (29%) | 33 (34%) | 0.5 | 0.9 |
Unknown | 3 | 4 | ||
1 Median (IQR); n (%) | ||||
2 Wilcoxon rank sum test; Pearson’s Chi-squared test | ||||
3 False discovery rate correction for multiple testing |
add_p()
: adds a column of p-values
add_q()
: adds a column of p-values adjusted for multiple comparisons through a call to p.adjust()
Characteristic | N | Overall, N = 200 | Drug A, N = 98 | Drug B, N = 102 |
---|---|---|---|---|
Age, Median (IQR) | 189 | 47 (38, 57) | 46 (37, 59) | 48 (39, 56) |
Grade, No. (%) | 200 | |||
I | 68 (34%) | 35 (36%) | 33 (32%) | |
II | 68 (34%) | 32 (33%) | 36 (35%) | |
III | 64 (32%) | 31 (32%) | 33 (32%) | |
Tumor Response, No. (%) | 193 | 61 (32%) | 28 (29%) | 33 (34%) |
add_overall()
: adds a column of overall statisticsadd_n()
: adds a column with the sample sizeadd_stat_label()
: adds a description of the reported statisticCharacteristic | Drug A, N = 981 | Drug B, N = 1021 | p-value2 |
---|---|---|---|
Age | 46 (37, 59) | 48 (39, 56) | 0.7 |
Unknown | 7 | 4 | |
Grade | 0.9 | ||
I | 35 (36%) | 33 (32%) | |
II | 32 (33%) | 36 (35%) | |
III | 31 (32%) | 33 (32%) | |
Tumor Response | 28 (29%) | 33 (34%) | 0.5 |
Unknown | 3 | 4 | |
1 Median (IQR); n (%) | |||
2 Wilcoxon rank sum test; Pearson’s Chi-squared test |
bold_labels()
: bold the variable labelsitalicize_levels()
: italicize the variable levelsbold_p()
: bold p-values according a specified thresholdCharacteristic | Drug | |
---|---|---|
Group A1 | Group B1 | |
Age | 46 (37, 59) | 48 (39, 56) |
Grade | ||
I | 35 (36%) | 33 (32%) |
II | 32 (33%) | 36 (35%) |
III | 31 (32%) | 33 (32%) |
Tumor Response | 28 (29%) | 33 (34%) |
1 median (IQR) for continuous; n (%) for categorical |
show_header_names()
to see the internal header names available for use in modify_header()
Many more customization available!
See the documentation at http://www.danieldsjoberg.com/gtsummary/reference/index.html
And a detailed tbl_summary()
vignette at http://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html
1. Create new data frame (new_trial
) and select columns age
, stage
, response
, marker
2. Make a basic tbl_summary()
summarizing by response
variable and add the following customization:
For age
make statistic
report the"{mean} ({min}, {max})"
Use missing
arg to remove display of missing values
3. Now apply the following customization:
add a p-value
bold labels and p-values ≤ .05, italicize levels
add an ‘overall’ column
See documentation for help: https://www.danieldsjoberg.com/gtsummary/
1. Create new data frame (new_trial
) and select columns age
, stage
, response
, marker
2. Make a basic tbl_summary()
summarizing by response
variable and add the following customization:
For age
make statistic
report the"{mean} ({min}, {max})"
Use missing
arg to remove display of missing values
3. Now apply the following customization:
add a p-value
bold labels and p-values ≤ .05, italicize levels
add an ‘overall’ column
1. Create new data frame (new_trial
) and select columns age
, stage
, response
, marker
2. Make a basic tbl_summary()
summarizing by response
variable and add the following customization:
For age
make statistic
report the"{mean} ({min}, {max})"
Use missing
arg to remove display of missing values
3. Now apply the following customization:
add a p-value
bold labels and p-values ≤ .05, italicize levels
add an ‘overall’ column
new_trial <- select(trial, age, stage, response, marker)
new_trial %>%
tbl_summary(by = response,
statistic = age ~ "{mean} ({min}, {max})",
missing = "no")
Characteristic | 0, N = 1321 | 1, N = 611 |
---|---|---|
Age | 46 (6, 83) | 50 (9, 78) |
T Stage | ||
T1 | 34 (26%) | 18 (30%) |
T2 | 39 (30%) | 13 (21%) |
T3 | 25 (19%) | 15 (25%) |
T4 | 34 (26%) | 15 (25%) |
Marker Level (ng/mL) | 0.59 (0.21, 1.24) | 0.98 (0.31, 1.53) |
1 Mean (Range); n (%); Median (IQR) |
1. Create new data frame (new_trial
) and select columns age
, stage
, response
, marker
2. Make a basic tbl_summary()
summarizing by response
variable and add the following customization:
For age
make statistic
report the"{mean} ({min}, {max})"
Use missing
arg to remove display of missing values
3. Now apply the following customization:
add a p-value
bold labels and p-values ≤ .05, italicize levels
add an ‘overall’ column
new_trial <- select(trial, age, stage, response, marker)
new_trial %>%
tbl_summary(by = response,
statistic = age ~ "{mean} ({min}, {max})",
missing = "no") %>%
add_p() %>%
bold_labels() %>%
bold_p() %>%
italicize_levels() %>%
add_overall()
Characteristic | Overall, N = 1931 | 0, N = 1321 | 1, N = 611 | p-value2 |
---|---|---|---|---|
Age | 47 (6, 83) | 46 (6, 83) | 50 (9, 78) | 0.091 |
T Stage | 0.6 | |||
T1 | 52 (27%) | 34 (26%) | 18 (30%) | |
T2 | 52 (27%) | 39 (30%) | 13 (21%) | |
T3 | 40 (21%) | 25 (19%) | 15 (25%) | |
T4 | 49 (25%) | 34 (26%) | 15 (25%) | |
Marker Level (ng/mL) | 0.62 (0.22, 1.38) | 0.59 (0.21, 1.24) | 0.98 (0.31, 1.53) | 0.10 |
1 Mean (Range); n (%); Median (IQR) | ||||
2 Wilcoxon rank sum test; Pearson’s Chi-squared test |
library(survival)
fit <- survfit(Surv(ttdeath, death) ~ trt, trial)
tbl_survfit(
fit,
times = c(12, 24),
label_header = "**{time} Month**"
) %>%
add_p()
Characteristic | 12 Month | 24 Month | p-value1 |
---|---|---|---|
Chemotherapy Treatment | 0.2 | ||
Drug A | 91% (85%, 97%) | 47% (38%, 58%) | |
Drug B | 86% (80%, 93%) | 41% (33%, 52%) | |
1 Log-rank test |
m1 <- glm(
response ~ age + stage,
data = trial,
family = binomial(link = "logit")
)
tbl_regression(
m1,
exponentiate = TRUE
) %>%
add_global_p()
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
Age | 1.02 | 1.00, 1.04 | 0.087 |
T Stage | 0.6 | ||
T1 | — | — | |
T2 | 0.58 | 0.24, 1.37 | |
T3 | 0.94 | 0.39, 2.28 | |
T4 | 0.79 | 0.33, 1.90 | |
1 OR = Odds Ratio, CI = Confidence Interval |
tbl_uvreg <- sm_trial %>%
tbl_uvregression(
method = glm,
y = response,
method.args = list(family = binomial),
exponentiate = TRUE
) %>%
bold_labels()
tbl_uvreg
Characteristic | N | OR1 | 95% CI1 | p-value |
---|---|---|---|---|
Chemotherapy Treatment | 193 | |||
Drug A | — | — | ||
Drug B | 1.21 | 0.66, 2.24 | 0.5 | |
Age | 183 | 1.02 | 1.00, 1.04 | 0.10 |
Grade | 193 | |||
I | — | — | ||
II | 0.95 | 0.45, 2.00 | 0.9 | |
III | 1.10 | 0.52, 2.29 | 0.8 | |
1 OR = Odds Ratio, CI = Confidence Interval |
Specify model method
, method.args
, and the response
variable
Arguments and helper functions like exponentiate
, bold_*()
, add_global_p()
can also be used with tbl_uvregression()
A shared space for department members to add functions that may be useful to others
Houses individual member’s project templates and function to start projects (create_bst_project()
: will be discussed in further training)
Diverse functions for various analysis-related tasks, {bstfun} Reference Index: https://www.danieldsjoberg.com/bstfun/reference/index.html
style_tbl_compact()
Before:
trt | age | marker | stage | grade | response | death | ttdeath |
---|---|---|---|---|---|---|---|
Drug A | 23 | 0.160 | T1 | II | 0 | 0 | 24.00 |
Drug B | 9 | 1.107 | T2 | I | 1 | 0 | 24.00 |
Drug A | 31 | 0.277 | T1 | II | 0 | 0 | 24.00 |
Drug A | NA | 2.067 | T3 | III | 1 | 1 | 17.64 |
Drug A | 51 | 2.767 | T4 | III | 1 | 1 | 16.43 |
Drug B | 39 | 0.613 | T4 | I | 0 | 1 | 15.64 |
❗ Also see tbl_listing()
for this functionality: https://shannonpileggi.github.io/gtreg/reference/tbl_listing.html
clean_mrn()
MRNs follows specific formatting rules:
Must be character
Must contain only numeric components
Must be eight characters long and include leading zeros.
This function converts numeric MRNs to character and ensures it follows MRN conventions. Character MRNs can also be passed, and leading zeros will be appended and checked for consistency.
[1] “00000100” “00000100” “00000100”
use_varnames_as_labels()
Automatically add labels to your data based on column names
Before:
Characteristic | N = 321 |
---|---|
mpg | 19.2 (15.4, 22.8) |
cyl | |
4 | 11 (34%) |
6 | 7 (22%) |
8 | 14 (44%) |
vs | 14 (44%) |
am | 13 (41%) |
1 Median (IQR); n (%) |
[1] 10
[1] Mon Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
[1] Fri Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
Tables are important, but we often need to report results in-line in a report.
Any statistic reported in a {gtsummary} table can be extracted and reported in-line in an R Markdown document with the inline_text()
function.
The pattern of what is reported can be modified with the pattern =
argument.
Default is pattern = "{estimate} ({conf.level*100}% CI {conf.low}, {conf.high}; {p.value})"
Characteristic | N | OR1 | 95% CI1 | p-value |
---|---|---|---|---|
Chemotherapy Treatment | 193 | |||
Drug A | — | — | ||
Drug B | 1.21 | 0.66, 2.24 | 0.5 | |
Age | 183 | 1.02 | 1.00, 1.04 | 0.10 |
Grade | 193 | |||
I | — | — | ||
II | 0.95 | 0.45, 2.00 | 0.9 | |
III | 1.10 | 0.52, 2.29 | 0.8 | |
1 OR = Odds Ratio, CI = Confidence Interval |
In Code: The odds ratio for age is ‘inline_text(tbl_uvreg, variable = age)
’
In Report: The odds ratio for age is 1.02 (95% CI 1.00, 1.04; p=0.10)
Quarto is a multi-language, next generation version of R Markdown from RStudio
Like R Markdown, Quarto uses Knitr to execute R code, therefore it can render most existing Rmd files with few changes
The latest release of RStudio (v2022.07) includes support for editing and preview of Quarto documents.
You can now easily organize your report using various layouts including column formats
YAML Header now uses format
instead of output
and allows auto-complete to make editing the YAML more fool-proof.
Global chunk options are set in YAML header using execute
instead of in a knitr setup chunk.
Individual chunk options are also set in YAML in each chunk as needed, using the “hashpipe” (#|
).
R Markdown:
Quarto:
Questions?