Lesson 4: Statistical inference

Published

February 20, 2024

1 Prepare the R Environment

Load the libraries we’ll use below.

library(palmerpenguins)
library(ggplot2)
library(broom)
library(tibble)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)

We’ll be working with the palmerpenguins data Several columns in the penguins tibble are stored as factors. Recall, this is how R represents categorical data.

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

For illustrative purposes, we’re going to convert the ‘sex’ column to a character type (raw text).

mod_penguins <- 
    penguins |> 
    mutate(sex = as.character(sex))
mod_penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <int>

As we work through the example code below, we’ll see how different data types interact with some of R’s statistical analysis functions.

2 Statistical Inference in R

In the previous lessons, we’ve focused largely on visualization and data manipulation. We’ve also used summary stats, and qualitative assessments from our figures to compare data between groups. Those types of approaches are excellent first steps when exploring a dataset.

For example, let’s create a plot comparing the distributions of male and female bill lengths across all three penguin species.

mod_penguins |> 
    filter(!is.na(sex)) |> 
    select(species, sex, bill_length_mm) |> 
    ggplot(aes(x = sex,
               y = bill_length_mm)) +
    geom_boxplot() +
    facet_wrap(facets = vars(species))

These visuals suggest these penguin species exhibit sexual dimorphism across bill length, but we haven’t applied any sort of rigorous test. In this lesson, we’ll add statistical analyses to our R skill set.

3 T-Test

The T-test is one of the most commonly-used methods for testing for differences in means between two groups. It relies on the assumption that the data from each of the two groups follow a normal distribution, and they both have equal variances.

Looking at the boxplots in the previous section, let’s focus on the female vs male comparison of bill length in the Adelie penguins. This should make a good example dataset for the T-test.

First, we’ll generate a histogram of these data so we can get a better view of their distributions. Let’s complete this graph together. Note, we’ll need to transform the penguin data first, so we’re just looking at bill length data from Adelie penguins. Then we’ll need to construct the histogram.

mod_penguins |> 
    filter(species == "Adelie",
           !is.na(sex)) |> 
    select(species, bill_length_mm, sex) |> 
    ggplot(aes(x = bill_length_mm)) +
    geom_histogram() +
    facet_grid(rows = vars(sex))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distributions are encouraging. They appear to be separated, roughly normal, and might have equal variances.

Now we’ll compare these data using the base R t.test() function.

help("t.test")

Looking at the documentation for the t.test() function, we can see it supports a wide array of options.

We can run this function with vector input or data frame input. We’ll start by demonstrating with vector input:

# Create vector of female data
female_data <-
    mod_penguins |> 
    filter(species == "Adelie",
           sex == "female") |>
    pull(bill_length_mm)
# Create vector of male data
male_data <-
    mod_penguins |> 
    filter(species == "Adelie",
           sex == "male") |> 
    pull(bill_length_mm)
# Use both vectors as input to the t.test function
t.test(female_data, male_data)

    Welch Two Sample t-test

data:  female_data and male_data
t = -8.7765, df = 142.12, p-value = 4.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.838514 -2.427240
sample estimates:
mean of x mean of y 
 37.25753  40.39041 

What information can we pull from the output?

Now let’s run the t.test() function with data frame input:

input_for_ttest <- 
    mod_penguins |> 
    filter(species == "Adelie",
           !is.na(sex)) |> 
    select(sex, bill_length_mm)
t.test(bill_length_mm ~ sex,
       data = input_for_ttest)

    Welch Two Sample t-test

data:  bill_length_mm by sex
t = -8.7765, df = 142.12, p-value = 4.801e-15
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -3.838514 -2.427240
sample estimates:
mean in group female   mean in group male 
            37.25753             40.39041 

Are there any differences between the output from running t.test() in vector mode or in data frame mode?

Setting factor levels

Many R functions, particularly those involved in statistical analyses, represent categorical data as a factor. Specifically, we use factors in cases when we have a finite number of categories (e.g. multiple choice answers, satisfaction survey data).

We create factors from character data with the factor() function. We can specify all of the possible categories and their order using the ‘levels’ argument.

species_factor <-
    factor(mod_penguins$species,
           levels = c("Chinstrap", "Gentoo", "Adelie"))

When we provided the t.test() function a data frame with a character column defining our comparisons groups, it automatically converted the character data to factor data. In doing so, it assigned the factor levels in alphabetical order (“female”, “male”). If we wanted to make the male data our baseline group, we’d need to manually convert the character data to a factor and set the order of the levels.

Let’s modify the t.test() code we used above to turn the ‘sex’ column into a factor:

input_for_ttest <- 
    mod_penguins |> 
           filter(!is.na(sex),
                  species == "Adelie") |> 
           select(sex, bill_length_mm) |> 
    mutate(sex = factor(sex, levels = c("male", "female")))

t.test(bill_length_mm ~ sex, data = input_for_ttest)

    Welch Two Sample t-test

data:  bill_length_mm by sex
t = 8.7765, df = 142.12, p-value = 4.801e-15
alternative hypothesis: true difference in means between group male and group female is not equal to 0
95 percent confidence interval:
 2.427240 3.838514
sample estimates:
  mean in group male mean in group female 
            40.39041             37.25753 

3.1 Reformatting t.test() Output With The broom Package

The output from the t.test() function contains a lot of detailed information. What kind of problem might it present in an analysis pipeline?

The broom package has functions which are designed to “tidy” output from many of R’s statistical analysis functions. Let’s see how the tidy() function cleans up output from the t.test() function.

t.test(bill_length_mm ~ sex, data = input_for_ttest) |> 
    broom::tidy()
# A tibble: 1 × 10
  estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
1     3.13      40.4      37.3      8.78 4.80e-15      142.     2.43      3.84
# ℹ 2 more variables: method <chr>, alternative <chr>

How does the information contained in the tidied output compare to the original output from the t.test() function?

4 Mann-Whitney U Test

Earlier we discussed the assumptions underlying the T-test, once of which is the assumption that the data we’re testing were sampled from a normal distribution. Since the T-test is assuming something about the distribution of its sample population, it is part of a family of tests we call parametric test.

But what do we do if we don’t know anything about the sample population, or don’t feel comfortable assuming it follows a normal distribution? Luckily, there is another family of tests that don’t make these distributional assumptions. These are so-called non-parametric tests.

The non-parametric test that is most often used in place of the T-test is called the Mann-Whitney U Test (or the Wilcoxon rank-sum test). We’ll skip discussing the specifics of how this test is performed.

Let’s examine the function we use to perform a Mann-Whitney U test in R: the wilcox.test() function.

help("wilcox.test")

Are there any differences you notice between the arguments of the wilcox.test() function and the t.test() function? What do you think those differences mean?

Now let’s examine the function’s output.

wilcox.test(bill_length_mm ~ sex,
            data = input_for_ttest)

    Wilcoxon rank sum test with continuity correction

data:  bill_length_mm by sex
W = 4536, p-value = 2.403e-13
alternative hypothesis: true location shift is not equal to 0

Again, how does the output differ from the t.test() output?

4.1 Reformatting wilcox.test() Output With The broom Package

The functions from the broom package also work with output from the wilcox.test() function. Let’s see how the tidied output looks.

wilcox.test(bill_length_mm ~ sex, data = input_for_ttest) |> 
    broom::tidy()
# A tibble: 1 × 4
  statistic  p.value method                                          alternative
      <dbl>    <dbl> <chr>                                           <chr>      
1      4536 2.40e-13 Wilcoxon rank sum test with continuity correct… two.sided  

How does this compare to the untidied results we get from the the wilcox.test() function? How do they compare to the tidied results we get from the t.test() function?

5 Binding Data Frames

It would be very helpful if we had some means of concatenating the t.test() and wilcox.test() results so we could compare them side-by-side. We can accomplish this with the bind_rows() function from the dplyr package.

The bind_rows() function concatenates all the rows from one data frame, to the bottom of another. As part of this operation, it looks for columns in both data frames that share the same names and variable types. It uses this information to line up the two data frames and merges them.

Looking back at the tidied results from the t.test() and wilcox.test() functions we noted there were some similarities and differences.

t.test(bill_length_mm ~ sex, data = input_for_ttest) |> 
    broom::tidy()
# A tibble: 1 × 10
  estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
1     3.13      40.4      37.3      8.78 4.80e-15      142.     2.43      3.84
# ℹ 2 more variables: method <chr>, alternative <chr>
wilcox.test(bill_length_mm ~ sex, data = input_for_ttest) |> 
    broom::tidy()
# A tibble: 1 × 4
  statistic  p.value method                                          alternative
      <dbl>    <dbl> <chr>                                           <chr>      
1      4536 2.40e-13 Wilcoxon rank sum test with continuity correct… two.sided  

Comparing the column names in these two tables, what do you think it will look like if we attempt to concatenate them by row?

bind_rows(
    t.test(bill_length_mm ~ sex, data = input_for_ttest) |> 
        broom::tidy(),
    wilcox.test(bill_length_mm ~ sex, data = input_for_ttest) |> 
        broom::tidy()
)
# A tibble: 2 × 10
  estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
1     3.13      40.4      37.3      8.78 4.80e-15      142.     2.43      3.84
2    NA         NA        NA     4536    2.40e-13       NA     NA        NA   
# ℹ 2 more variables: method <chr>, alternative <chr>

There is also a bind_cols() function for cases where we want to concatenate data frames by column. The bind_cols() function requires that there are the same number of rows in all of the data frames we’re trying to bind. It also assumes they’re sorted in the same order, because it doesn’t check any row values when merging.

6 Iteration

In the previous sections, we performed statistical analyses and extracted results from individual tests. While this is certainly useful, we often want to apply tests to several metrics and aggregate the results.

In this case, we’d like to repeat the same operation on different input data. While we could manually write out all of the tests for each input dataset, that opens us up to copy-and-paste errors, and ultimately isn’t sustainable as the number of tests we want to run grows.

We can accomplish this task through iteration (aka looping). Iteration is a concept that’s fundamental to most, if not all programming languages.

Loops are notoriously slow in R

That’s right. Even though we’re using loops here to learn about iteration, we generally want to avoid using loops in our day-to-day programming. There are various ways of avoiding loops, but those often involve more complex programming constructs. We’ll get to those methods eventually.

All of that being said, sometimes you just need to loop.

We often want to repeat an operation on a series of input data (“iterate over the input data”). The easiest way to accomplish this type of iteration is with a for loop. Here’s an example:

for (counter in 1:10) {
    print(paste("Processing count", counter))
}
[1] "Processing count 1"
[1] "Processing count 2"
[1] "Processing count 3"
[1] "Processing count 4"
[1] "Processing count 5"
[1] "Processing count 6"
[1] "Processing count 7"
[1] "Processing count 8"
[1] "Processing count 9"
[1] "Processing count 10"

In this example, we’re iterating over the values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. We keep track of which value we’re on with a variable called “counter”. The R code located between the curly braces gets executed every time the for loop iterates. Note, we arbitrarily chose the name “counter” for our tracking variable.

So let’s return to our penguin data. Above, we performed a T-test comparing the bill lengths for male and female Adelie penguins. What if we want to perform the same test for all three penguin species? We can do that with a for loop.

Let’s start by creating a simple for loop that iterates over the three penguin species and prints the species name. Note: Let RStudio’s autocomplete help you make for loops.

# Manual definition of species
#penguin_species <- c("Adelie", "Gentoo", "Chinstrap")
# Extract unique values from a column
penguin_species <- 
    unique(penguins$species)
# Extract factor levels
# penguin_species <- 
#     levels(penguins$species)
for (species in penguin_species) {
    print(paste("Penguin species:", species))
}
[1] "Penguin species: Adelie"
[1] "Penguin species: Gentoo"
[1] "Penguin species: Chinstrap"

Now let’s combine this with the code we used previously. Here’s the code we used to perform one T-test and tidy the results:

input_for_ttest <- 
    mod_penguins |> 
           filter(!is.na(sex),
                  species == "Adelie") |> 
           select(sex, bill_length_mm)

t.test(bill_length_mm ~ sex, data = input_for_ttest) |> broom::tidy()
# A tibble: 1 × 10
  estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
1    -3.13      37.3      40.4     -8.78 4.80e-15      142.    -3.84     -2.43
# ℹ 2 more variables: method <chr>, alternative <chr>

Now, we need to insert this code into the for loop so it’s run on a different penguin species every time the loop repeats.

penguin_bill_data_by_species <- tibble()

for (peng_species in unique(mod_penguins$species)) {
    
    input_for_ttest <- 
        mod_penguins |> 
               filter(!is.na(sex),
                      species == peng_species) |> 
            select(sex, bill_length_mm)
    
    penguin_bill_data_by_species <- 
        bind_rows(
            penguin_bill_data_by_species,
            t.test(bill_length_mm ~ sex, data = input_for_ttest) |>
                broom::tidy() |> 
                mutate(species = peng_species,
                       .before = 1)
        )
}

Don’t forget to save the output of each loop iteration to an external variable.

penguin_bill_data_by_species
# A tibble: 3 × 11
  species   estimate estimate1 estimate2 statistic  p.value parameter conf.low
  <chr>        <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1 Adelie       -3.13      37.3      40.4     -8.78 4.80e-15     142.     -3.84
2 Gentoo       -3.91      45.6      49.5     -8.88 1.32e-14     111.     -4.78
3 Chinstrap    -4.52      46.6      51.1     -7.57 8.92e-10      48.7    -5.72
# ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>

7 Multiple testing

Now that we’ve learned a method for running statistical tests across many different variables in a dataset (e.g. genes from an RNA-Seq experiment), we need to start worrying about multiple testing.

In R, was us the p.adjust() function to apply multiple testing corrections to vectors of p-values. Let’s start by taking a look at its documentation.

help("p.adjust")

We’ll use the p.adjust() function to apply a Holm correction to our table of penguin species T-test results.

First, a refresher on what our merged T-test data look like.

penguin_bill_data_by_species
# A tibble: 3 × 11
  species   estimate estimate1 estimate2 statistic  p.value parameter conf.low
  <chr>        <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1 Adelie       -3.13      37.3      40.4     -8.78 4.80e-15     142.     -3.84
2 Gentoo       -3.91      45.6      49.5     -8.88 1.32e-14     111.     -4.78
3 Chinstrap    -4.52      46.6      51.1     -7.57 8.92e-10      48.7    -5.72
# ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>

So let’s apply the Holm correction to our p-values.

penguin_bill_data_by_species |> 
    select(species, p.value) |> 
    arrange(p.value) |> 
    mutate(Holm_p.value = p.adjust(p.value, method = "holm"))
# A tibble: 3 × 3
  species    p.value Holm_p.value
  <chr>        <dbl>        <dbl>
1 Adelie    4.80e-15     1.44e-14
2 Gentoo    1.32e-14     2.63e-14
3 Chinstrap 8.92e-10     8.92e-10

How do the corrected p-values compare to the original?

Let’s also add a column for the Hommel corrected p-value.

penguin_bill_data_by_species |> 
    select(species, p.value) |> 
    arrange(p.value) |> 
    mutate(Hommel_p.value = p.adjust(p.value, method = "hommel"))
# A tibble: 3 × 3
  species    p.value Hommel_p.value
  <chr>        <dbl>          <dbl>
1 Adelie    4.80e-15       1.44e-14
2 Gentoo    1.32e-14       2.63e-14
3 Chinstrap 8.92e-10       8.92e-10

How do the Hommel corrected p-values compare?

Even though this is a toy example where we’re correcting data from three p-values, we could use the exact same code to apply these corrections to 30,000 p-values.

8 Linear Regression

Let’s return to the bill length vs body mass comparison as a motivating example for linear regression.

mod_penguins |> 
    ggplot(aes(x = bill_length_mm, y = body_mass_g, color = species)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

The geom_smooth() function can apply a linear regression to our data to aid with visualizing patterns.

If we want to perform the regression on our own, we use the lm() function (aka “linear modeling”). Let’s check out the help docs.

help("lm")

Let’s fit a linear model to the body mass vs bill length data:

lm(formula = body_mass_g ~ bill_length_mm,
   data = mod_penguins) |> 
    summary() |> 
    broom::tidy()
# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       362.     283.        1.28 2.02e- 1
2 bill_length_mm     87.4      6.40     13.7  3.81e-34
lm(formula = body_mass_g ~ species + bill_length_mm + species:bill_length_mm,
   data = mod_penguins) |> 
    summary() |> 
    broom::tidy()
# A tibble: 6 × 5
  term                            estimate std.error statistic  p.value
  <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                         34.9     443.     0.0787 9.37e- 1
2 speciesChinstrap                   811.      800.     1.01   3.11e- 1
3 speciesGentoo                     -159.      683.    -0.232  8.16e- 1
4 bill_length_mm                      94.5      11.4    8.29   2.73e-15
5 speciesChinstrap:bill_length_mm    -35.4      17.7   -1.99   4.70e- 2
6 speciesGentoo:bill_length_mm        15.0      15.8    0.948  3.44e- 1
lm(formula = body_mass_g ~ bill_length_mm,
   data = mod_penguins) |> 
    broom::tidy()
# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       362.     283.        1.28 2.02e- 1
2 bill_length_mm     87.4      6.40     13.7  3.81e-34

9 Tidyverse cheatsheets

These are great reference materials for each of the main Tidyverse packages, including ggplot2. For day-to-day programming, it’s helpful to print some of these out and keep them at your workstation. These offer a great way to quickly look up the info you need to run a function.

10 R session information

Here we report the version number for R and the package versions we used to perform the analyses in this document.

sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_1.3.1          dplyr_1.1.4          tibble_3.2.1        
[4] broom_1.0.6          ggplot2_3.5.1        palmerpenguins_0.1.1

loaded via a namespace (and not attached):
 [1] Matrix_1.7-0      gtable_0.3.5      jsonlite_1.8.8    compiler_4.4.0   
 [5] tidyselect_1.2.1  splines_4.4.0     scales_1.3.0      yaml_2.3.8       
 [9] fastmap_1.2.0     lattice_0.22-6    R6_2.5.1          labeling_0.4.3   
[13] generics_0.1.3    knitr_1.47        htmlwidgets_1.6.4 backports_1.5.0  
[17] munsell_0.5.1     pillar_1.9.0      rlang_1.1.3       utf8_1.2.4       
[21] xfun_0.44         cli_3.6.2         withr_3.0.0       magrittr_2.0.3   
[25] mgcv_1.9-1        digest_0.6.35     grid_4.4.0        rstudioapi_0.16.0
[29] lifecycle_1.0.4   nlme_3.1-164      vctrs_0.6.5       evaluate_0.23    
[33] glue_1.7.0        farver_2.1.2      fansi_1.0.6       colorspace_2.1-0 
[37] rmarkdown_2.27    purrr_1.0.2       tools_4.4.0       pkgconfig_2.0.3  
[41] htmltools_0.5.8.1
Back to top