The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyr)
We’ll be working with the palmerpenguins data Several columns in the penguins tibble are stored as factors. Recall, this is how R represents categorical data.
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
For illustrative purposes, we’re going to convert the ‘sex’ column to a character type (raw text).
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <int>
As we work through the example code below, we’ll see how different data types interact with some of R’s statistical analysis functions.
2 Statistical Inference in R
In the previous lessons, we’ve focused largely on visualization and data manipulation. We’ve also used summary stats, and qualitative assessments from our figures to compare data between groups. Those types of approaches are excellent first steps when exploring a dataset.
For example, let’s create a plot comparing the distributions of male and female bill lengths across all three penguin species.
These visuals suggest these penguin species exhibit sexual dimorphism across bill length, but we haven’t applied any sort of rigorous test. In this lesson, we’ll add statistical analyses to our R skill set.
3 T-Test
The T-test is one of the most commonly-used methods for testing for differences in means between two groups. It relies on the assumption that the data from each of the two groups follow a normal distribution, and they both have equal variances.
Looking at the boxplots in the previous section, let’s focus on the female vs male comparison of bill length in the Adelie penguins. This should make a good example dataset for the T-test.
First, we’ll generate a histogram of these data so we can get a better view of their distributions. Let’s complete this graph together. Note, we’ll need to transform the penguin data first, so we’re just looking at bill length data from Adelie penguins. Then we’ll need to construct the histogram.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distributions are encouraging. They appear to be separated, roughly normal, and might have equal variances.
Now we’ll compare these data using the base R t.test() function.
help("t.test")
Looking at the documentation for the t.test() function, we can see it supports a wide array of options.
We can run this function with vector input or data frame input. We’ll start by demonstrating with vector input:
# Create vector of female datafemale_data <- mod_penguins |>filter(species =="Adelie", sex =="female") |>pull(bill_length_mm)# Create vector of male datamale_data <- mod_penguins |>filter(species =="Adelie", sex =="male") |>pull(bill_length_mm)# Use both vectors as input to the t.test functiont.test(female_data, male_data)
Welch Two Sample t-test
data: female_data and male_data
t = -8.7765, df = 142.12, p-value = 4.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.838514 -2.427240
sample estimates:
mean of x mean of y
37.25753 40.39041
What information can we pull from the output?
Now let’s run the t.test() function with data frame input:
Welch Two Sample t-test
data: bill_length_mm by sex
t = -8.7765, df = 142.12, p-value = 4.801e-15
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-3.838514 -2.427240
sample estimates:
mean in group female mean in group male
37.25753 40.39041
Are there any differences between the output from running t.test() in vector mode or in data frame mode?
Setting factor levels
Many R functions, particularly those involved in statistical analyses, represent categorical data as a factor. Specifically, we use factors in cases when we have a finite number of categories (e.g. multiple choice answers, satisfaction survey data).
We create factors from character data with the factor() function. We can specify all of the possible categories and their order using the ‘levels’ argument.
When we provided the t.test() function a data frame with a character column defining our comparisons groups, it automatically converted the character data to factor data. In doing so, it assigned the factor levels in alphabetical order (“female”, “male”). If we wanted to make the male data our baseline group, we’d need to manually convert the character data to a factor and set the order of the levels.
Let’s modify the t.test() code we used above to turn the ‘sex’ column into a factor:
input_for_ttest <- mod_penguins |>filter(!is.na(sex), species =="Adelie") |>select(sex, bill_length_mm) |>mutate(sex =factor(sex, levels =c("male", "female")))t.test(bill_length_mm ~ sex, data = input_for_ttest)
Welch Two Sample t-test
data: bill_length_mm by sex
t = 8.7765, df = 142.12, p-value = 4.801e-15
alternative hypothesis: true difference in means between group male and group female is not equal to 0
95 percent confidence interval:
2.427240 3.838514
sample estimates:
mean in group male mean in group female
40.39041 37.25753
3.1 Reformatting t.test() Output With The broom Package
The output from the t.test() function contains a lot of detailed information. What kind of problem might it present in an analysis pipeline?
The broom package has functions which are designed to “tidy” output from many of R’s statistical analysis functions. Let’s see how the tidy() function cleans up output from the t.test() function.
t.test(bill_length_mm ~ sex, data = input_for_ttest) |> broom::tidy()
How does the information contained in the tidied output compare to the original output from the t.test() function?
4 Mann-Whitney U Test
Earlier we discussed the assumptions underlying the T-test, once of which is the assumption that the data we’re testing were sampled from a normal distribution. Since the T-test is assuming something about the distribution of its sample population, it is part of a family of tests we call parametric test.
But what do we do if we don’t know anything about the sample population, or don’t feel comfortable assuming it follows a normal distribution? Luckily, there is another family of tests that don’t make these distributional assumptions. These are so-called non-parametric tests.
The non-parametric test that is most often used in place of the T-test is called the Mann-Whitney U Test (or the Wilcoxon rank-sum test). We’ll skip discussing the specifics of how this test is performed.
Let’s examine the function we use to perform a Mann-Whitney U test in R: the wilcox.test() function.
help("wilcox.test")
Are there any differences you notice between the arguments of the wilcox.test() function and the t.test() function? What do you think those differences mean?
Wilcoxon rank sum test with continuity correction
data: bill_length_mm by sex
W = 4536, p-value = 2.403e-13
alternative hypothesis: true location shift is not equal to 0
Again, how does the output differ from the t.test() output?
4.1 Reformatting wilcox.test() Output With The broom Package
The functions from the broom package also work with output from the wilcox.test() function. Let’s see how the tidied output looks.
wilcox.test(bill_length_mm ~ sex, data = input_for_ttest) |> broom::tidy()
# A tibble: 1 × 4
statistic p.value method alternative
<dbl> <dbl> <chr> <chr>
1 4536 2.40e-13 Wilcoxon rank sum test with continuity correct… two.sided
How does this compare to the untidied results we get from the the wilcox.test() function? How do they compare to the tidied results we get from the t.test() function?
5 Binding Data Frames
It would be very helpful if we had some means of concatenating the t.test() and wilcox.test() results so we could compare them side-by-side. We can accomplish this with the bind_rows() function from the dplyr package.
The bind_rows() function concatenates all the rows from one data frame, to the bottom of another. As part of this operation, it looks for columns in both data frames that share the same names and variable types. It uses this information to line up the two data frames and merges them.
Looking back at the tidied results from the t.test() and wilcox.test() functions we noted there were some similarities and differences.
t.test(bill_length_mm ~ sex, data = input_for_ttest) |> broom::tidy()
wilcox.test(bill_length_mm ~ sex, data = input_for_ttest) |> broom::tidy()
# A tibble: 1 × 4
statistic p.value method alternative
<dbl> <dbl> <chr> <chr>
1 4536 2.40e-13 Wilcoxon rank sum test with continuity correct… two.sided
Comparing the column names in these two tables, what do you think it will look like if we attempt to concatenate them by row?
bind_rows(t.test(bill_length_mm ~ sex, data = input_for_ttest) |> broom::tidy(),wilcox.test(bill_length_mm ~ sex, data = input_for_ttest) |> broom::tidy())
# A tibble: 2 × 10
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.13 40.4 37.3 8.78 4.80e-15 142. 2.43 3.84
2 NA NA NA 4536 2.40e-13 NA NA NA
# ℹ 2 more variables: method <chr>, alternative <chr>
There is also a bind_cols() function for cases where we want to concatenate data frames by column. The bind_cols() function requires that there are the same number of rows in all of the data frames we’re trying to bind. It also assumes they’re sorted in the same order, because it doesn’t check any row values when merging.
6 Iteration
In the previous sections, we performed statistical analyses and extracted results from individual tests. While this is certainly useful, we often want to apply tests to several metrics and aggregate the results.
In this case, we’d like to repeat the same operation on different input data. While we could manually write out all of the tests for each input dataset, that opens us up to copy-and-paste errors, and ultimately isn’t sustainable as the number of tests we want to run grows.
We can accomplish this task through iteration (aka looping). Iteration is a concept that’s fundamental to most, if not all programming languages.
Loops are notoriously slow in R
That’s right. Even though we’re using loops here to learn about iteration, we generally want to avoid using loops in our day-to-day programming. There are various ways of avoiding loops, but those often involve more complex programming constructs. We’ll get to those methods eventually.
All of that being said, sometimes you just need to loop.
We often want to repeat an operation on a series of input data (“iterate over the input data”). The easiest way to accomplish this type of iteration is with a for loop. Here’s an example:
for (counter in1:10) {print(paste("Processing count", counter))}
In this example, we’re iterating over the values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. We keep track of which value we’re on with a variable called “counter”. The R code located between the curly braces gets executed every time the for loop iterates. Note, we arbitrarily chose the name “counter” for our tracking variable.
So let’s return to our penguin data. Above, we performed a T-test comparing the bill lengths for male and female Adelie penguins. What if we want to perform the same test for all three penguin species? We can do that with a for loop.
Let’s start by creating a simple for loop that iterates over the three penguin species and prints the species name. Note: Let RStudio’s autocomplete help you make for loops.
# Manual definition of species#penguin_species <- c("Adelie", "Gentoo", "Chinstrap")# Extract unique values from a columnpenguin_species <-unique(penguins$species)# Extract factor levels# penguin_species <- # levels(penguins$species)for (species in penguin_species) {print(paste("Penguin species:", species))}
Now that we’ve learned a method for running statistical tests across many different variables in a dataset (e.g. genes from an RNA-Seq experiment), we need to start worrying about multiple testing.
In R, was us the p.adjust() function to apply multiple testing corrections to vectors of p-values. Let’s start by taking a look at its documentation.
help("p.adjust")
We’ll use the p.adjust() function to apply a Holm correction to our table of penguin species T-test results.
First, a refresher on what our merged T-test data look like.
Even though this is a toy example where we’re correcting data from three p-values, we could use the exact same code to apply these corrections to 30,000 p-values.
8 Linear Regression
Let’s return to the bill length vs body mass comparison as a motivating example for linear regression.
mod_penguins |>ggplot(aes(x = bill_length_mm, y = body_mass_g, color = species)) +geom_point() +geom_smooth(method ="lm", formula = y ~ x)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
The geom_smooth() function can apply a linear regression to our data to aid with visualizing patterns.
If we want to perform the regression on our own, we use the lm() function (aka “linear modeling”). Let’s check out the help docs.
help("lm")
Let’s fit a linear model to the body mass vs bill length data:
These are great reference materials for each of the main Tidyverse packages, including ggplot2. For day-to-day programming, it’s helpful to print some of these out and keep them at your workstation. These offer a great way to quickly look up the info you need to run a function.
10 R session information
Here we report the version number for R and the package versions we used to perform the analyses in this document.