Lesson 2: Working with tabular data
1 Prepare the R environment for this lesson
For this lesson we’ll need four packages:
- palmerpenguins
- here
- readr
- dplyr
First, we need to install the here package, using the install.packages() function.
install.packages("here")
Note, the readr and dplyr packages are part of the tidyverse, so if the tidyverse is already installed we don’t need to install them separately. Now we use the library() function to load all of these packages.
library(palmerpenguins)
library(here)
here() starts at C:/Users/nickopotamus/Projects/ITMAT_office_houRs
library(readr)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
The output from this code indicates R successfully loaded these packages. The message we get from the here package tells us which directory on our computers it’s using as the “root” directory (more on what this means below). The other messages explain that dplyr includes several functions (e.g. filter() and lag()) that have the same names as functions from the stats and base packages. R’s default behavior for resolving these naming conflicts is to use the version of the function from the package loaded most recently. The messages let us know that if we run any of these functions, R will use the versions from the dplyr package and not those from the stats or base packages.
The potential for naming conflicts like this is another reason why we only load the packages we need for the current analyses.
When we start to type a function’s name, and RStudio’s tab-completion prompt opens up, the package for each function is listed in curly braces to the right of the function’s name. As we’re writing code, we can use this to make sure we’re using the correct version of the function.
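When two loaded packages share a function name, we can also state exactly which version we mean with the package::function() form. The sketch below is a minimal illustration of that idea; the toy data frame and its values are made up for the example.

```r
library(dplyr)

# A toy data frame; the values are made up for illustration.
toy <- data.frame(x = 1:5)

# dplyr's filter(): keep the rows where the condition is TRUE.
kept_rows <- dplyr::filter(toy, x > 3)

# stats' filter(): applies a linear (moving-average style) filter
# to a numeric series. Same name, completely different job.
smoothed <- stats::filter(1:5, rep(1 / 2, 2))

nrow(kept_rows)   # number of rows that passed the condition
length(smoothed)  # one value per input element
```

Writing the package name out like this is optional once the package is loaded, but it removes any doubt about which function runs.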
2 Reading data from files
We have a file named penguins.csv in the DATA/ directory. This data file is in .csv (comma-separated values) format and contains the same information as the penguins data frame we used in the previous lesson. We can view the raw contents of this file by using the Files tab in the lower right pane of the RStudio window. Navigate to the DATA/ directory, click on the penguins.csv file, then select the “View File” option.
2.1 Base R
We can read the data from this file using the base R function read.csv() and store the data in a variable named penguin_data_from_base_r.
penguin_data_from_base_r <- read.csv(file = here::here("DATA/penguins.csv"))
In R, <- is called the assignment operator. The assignment operator takes the output from the function to its right (read.csv()) and assigns it to the variable to its left (penguin_data_from_base_r). Put differently, we’re storing the contents of the ‘penguins.csv’ file in ‘penguin_data_from_base_r’.
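The here::here() call inside the read.csv() command builds a file path that starts at the project’s root directory, so the same code works regardless of which folder R is currently running in. A minimal sketch, assuming the project has a DATA/ folder as in this lesson (csv_path and same_path are illustrative names, not part of the lesson’s code):

```r
library(here)

# here() joins its arguments onto the project root directory.
csv_path <- here("DATA", "penguins.csv")

# The single-string form used in this lesson should point to the same place.
same_path <- here("DATA/penguins.csv")

# Print the full path, then check whether the file is really there.
csv_path
file.exists(csv_path)
```

If file.exists() returns FALSE, the usual cause is that the project root reported when here was loaded is not the folder we expected.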
We can look at the contents of penguin_data_from_base_r by entering the variable name into the R console.
penguin_data_from_base_r
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42.0 20.2 190 4250
11 Adelie Torgersen 37.8 17.1 186 3300
12 Adelie Torgersen 37.8 17.3 180 3700
13 Adelie Torgersen 41.1 17.6 182 3200
14 Adelie Torgersen 38.6 21.2 191 3800
15 Adelie Torgersen 34.6 21.1 198 4400
16 Adelie Torgersen 36.6 17.8 185 3700
17 Adelie Torgersen 38.7 19.0 195 3450
18 Adelie Torgersen 42.5 20.7 197 4500
sex year
1 male 2007
2 female 2007
3 female 2007
4 <NA> 2007
5 female 2007
6 male 2007
7 female 2007
8 male 2007
9 <NA> 2007
10 <NA> 2007
11 <NA> 2007
12 <NA> 2007
13 female 2007
14 male 2007
15 male 2007
16 female 2007
17 female 2007
18 male 2007
[ reached 'max' / getOption("max.print") -- omitted 326 rows ]
Notice that “penguin_data_from_base_r” now appears in the Environment tab of the upper right pane of the RStudio window. We can use this tab to quickly check the data we’ve loaded into R.
While the read.csv() function gets the job done, it doesn’t do a lot to format the data.
2.2 The readr package
As the name implies, the readr package contains functions designed to help us read tabular data from text files. These functions have a lot of useful features to mark and handle problems we’re likely to encounter in real-world data (we’ll see some examples of this in later lessons).
To read this csv file, we’re going to use the read_csv() function from the readr package. The command is almost identical to base R, except we have an underscore in read_csv(), instead of a period.
penguin_data_from_readr <- read_csv(file = here::here("DATA/penguins.csv"))
Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The readr function includes some additional output that tells us about the guesses it’s making about the data in the file. Specifically, it determined the species, island, and sex columns contain text (or “characters”), while the remaining columns contain numbers that may have decimal points (or “doubles”).
Compare the contents of penguin_data_from_readr to the penguin_data_from_base_r data we loaded above:
penguin_data_from_readr
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
readr functions read data from files and store it as tibbles, a special version of a data frame. We saw an example of a tibble when we worked with the penguins data frame, loaded by the palmerpenguins package.
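To see the difference between the two readers without depending on the lesson’s data file, the sketch below writes a tiny, made-up CSV file to a temporary location and reads it both ways. The file contents and the variable names are assumptions made for the example.

```r
library(readr)

# Write a tiny, made-up CSV file to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,body_mass_g", "Adelie,3750", "Gentoo,NA"), tmp)

base_df  <- read.csv(tmp)                           # base R reader
readr_df <- read_csv(tmp, show_col_types = FALSE)   # readr reader, message silenced

class(base_df)                # a plain data frame
inherits(readr_df, "tbl_df")  # TRUE: read_csv() returns a tibble
class(base_df$body_mass_g)    # "integer"
class(readr_df$body_mass_g)   # "numeric" (stored as a double)

# Column types can also be stated up front instead of guessed.
typed_df <- read_csv(tmp,
                     col_types = cols(species = col_factor(),
                                      body_mass_g = col_integer()))
```

Stating the column types is more typing, but it means a surprising value in the file produces a visible parsing problem instead of a silently different column type.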
3 Transforming penguins
3.1 The dplyr Package
The dplyr package comes with many functions for manipulating and extracting information from tabular data. As we’ll see below, dplyr functions are named after verbs that describe what we’re doing to the input data, and the first argument of every function is the input data frame (or tibble).
For simplicity, we’ll return to using the penguins data frame for the remainder of this lesson. While we could use dplyr functions to work with the data we read from penguins.csv, the penguins data frame has some nicer formatting.
3.2 Filter
We can use the filter() function to grab rows from our data that contain specific information. Here, we extract just those rows containing measurements from Gentoo penguins.
filter(penguins, species == "Gentoo")
# A tibble: 124 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 46.1 13.2 211 4500
2 Gentoo Biscoe 50 16.3 230 5700
3 Gentoo Biscoe 48.7 14.1 210 4450
4 Gentoo Biscoe 50 15.2 218 5700
5 Gentoo Biscoe 47.6 14.5 215 5400
6 Gentoo Biscoe 46.5 13.5 210 4550
7 Gentoo Biscoe 45.4 14.6 211 4800
8 Gentoo Biscoe 46.7 15.3 219 5200
9 Gentoo Biscoe 43.3 13.4 209 4400
10 Gentoo Biscoe 46.8 15.4 215 5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Remember, the species information in our data frame is contained in the “species” column. This example code is telling R to search through every row in the penguin data, and return those rows that have “Gentoo” in the species column. From this example, we can see the general form for using the filter function: filter(dataset, comparison). We used “==” to indicate we want to find all rows where the species column matches the word “Gentoo”. This is an example of a relational operator.
R supports several operators that let us compare values:
- == : Check if two values are exactly equal. Many programming languages use the double equals sign to indicate comparisons, because they’re already using the single equal sign for something else (e.g. assignment).
- <, > : Less-than, and greater-than comparisons.
- <=, >= : Less-than or equal, and greater-than or equal comparisons.
- != : Check if two values are not equal.
These operators return a logical value: TRUE or FALSE.
"Gentoo" == "Gentoo"
[1] TRUE
121 < 43
[1] FALSE
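When one side of a comparison is a whole column, the operator compares every element and returns one logical value per row. A comparison with a missing value gives NA rather than FALSE, and filter() keeps only rows where the result is TRUE. A minimal sketch with made-up values:

```r
library(dplyr)

species <- c("Gentoo", "Adelie", NA, "Gentoo")

# == compares element by element; comparing with NA gives NA, not FALSE.
is_gentoo <- species == "Gentoo"
is_gentoo
# [1]  TRUE FALSE    NA  TRUE

# filter() keeps only the rows where the comparison is TRUE,
# so the row with the missing species is dropped as well.
toy <- data.frame(id = 1:4, species = species)
gentoo_rows <- filter(toy, species == "Gentoo")
nrow(gentoo_rows)
# [1] 2
```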
We can also combine multiple filtering conditions in the same command. In this example, we want to get the rows containing data from female Adelie penguins.
filter(penguins,
       sex == "female",
       species == "Adelie")
# A tibble: 73 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.5 17.4 186 3800
2 Adelie Torgersen 40.3 18 195 3250
3 Adelie Torgersen 36.7 19.3 193 3450
4 Adelie Torgersen 38.9 17.8 181 3625
5 Adelie Torgersen 41.1 17.6 182 3200
6 Adelie Torgersen 36.6 17.8 185 3700
7 Adelie Torgersen 38.7 19 195 3450
8 Adelie Torgersen 34.4 18.4 184 3325
9 Adelie Biscoe 37.8 18.3 174 3400
10 Adelie Biscoe 35.9 19.2 189 3800
# ℹ 63 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Alternatively, we can combine multiple conditions with the & symbol (meaning “and”) and the | symbol (meaning “or”). We can re-write the previous filter command using the & symbol:
filter(penguins,
       sex == "female" & species == "Adelie")
# A tibble: 73 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.5 17.4 186 3800
2 Adelie Torgersen 40.3 18 195 3250
3 Adelie Torgersen 36.7 19.3 193 3450
4 Adelie Torgersen 38.9 17.8 181 3625
5 Adelie Torgersen 41.1 17.6 182 3200
6 Adelie Torgersen 36.6 17.8 185 3700
7 Adelie Torgersen 38.7 19 195 3450
8 Adelie Torgersen 34.4 18.4 184 3325
9 Adelie Biscoe 37.8 18.3 174 3400
10 Adelie Biscoe 35.9 19.2 189 3800
# ℹ 63 more rows
# ℹ 2 more variables: sex <fct>, year <int>
We can use the | symbol to retrieve data from female penguins that are either Adelie or Chinstrap:
filter(penguins,
       sex == "female",
       species == "Adelie" | species == "Chinstrap")
# A tibble: 107 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.5 17.4 186 3800
2 Adelie Torgersen 40.3 18 195 3250
3 Adelie Torgersen 36.7 19.3 193 3450
4 Adelie Torgersen 38.9 17.8 181 3625
5 Adelie Torgersen 41.1 17.6 182 3200
6 Adelie Torgersen 36.6 17.8 185 3700
7 Adelie Torgersen 38.7 19 195 3450
8 Adelie Torgersen 34.4 18.4 184 3325
9 Adelie Biscoe 37.8 18.3 174 3400
10 Adelie Biscoe 35.9 19.2 189 3800
# ℹ 97 more rows
# ℹ 2 more variables: sex <fct>, year <int>
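When the “or” conditions all test the same column, the %in% operator is a common shorthand: it checks whether each value appears in a set of allowed values. This is an alternative to the | form, not something the lesson’s code uses. A minimal sketch with a made-up data frame:

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo", "Adelie"),
  sex     = c("female", "female", "female", "male")
)

# Two conditions on species joined with | ...
with_or <- filter(toy,
                  sex == "female",
                  species == "Adelie" | species == "Chinstrap")

# ... and the same request written with %in%.
with_in <- filter(toy,
                  sex == "female",
                  species %in% c("Adelie", "Chinstrap"))

nrow(with_or)
# [1] 2
nrow(with_in)
# [1] 2
```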
Let’s use the filter() function to create a new data frame that contains only the rows from Gentoo penguins.
gentoo_penguin_data <- filter(penguins, species == "Gentoo")
3.3 Select
With the filter() function, we can choose which rows we want to extract from our data. If we want to choose which columns to extract, we use the select() function. Here we extract the columns containing the species, flipper length, and body mass from each penguin in the dataset.
select(penguins, species, flipper_length_mm, body_mass_g)
# A tibble: 344 × 3
species flipper_length_mm body_mass_g
<fct> <int> <int>
1 Adelie 181 3750
2 Adelie 186 3800
3 Adelie 195 3250
4 Adelie NA NA
5 Adelie 193 3450
6 Adelie 190 3650
7 Adelie 181 3625
8 Adelie 195 4675
9 Adelie 193 3475
10 Adelie 190 4250
# ℹ 334 more rows
From this example code, we can see the general form of the select() function: select(dataset, column_name1, column_name2, ...). The select() function is very useful for reducing our dataset to just the columns we need for a particular calculation or analysis. This is critical when we’re working with input data that have hundreds of columns.
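With many columns, typing every name becomes tedious. select() also accepts helper functions and ranges; the sketch below shows three common forms on a made-up one-row data frame (the variable names are illustrative).

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(species = "Adelie",
                  bill_length_mm = 39.1,
                  bill_depth_mm = 18.7,
                  year = 2007)

# Columns whose names start with "bill".
bill_cols <- select(toy, starts_with("bill"))
names(bill_cols)
# [1] "bill_length_mm" "bill_depth_mm"

# Everything except year.
no_year <- select(toy, -year)
names(no_year)
# [1] "species"        "bill_length_mm" "bill_depth_mm"

# A run of neighbouring columns, from species through bill_depth_mm.
col_range <- select(toy, species:bill_depth_mm)
```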
Above, we created a data frame that only contains data from Gentoo penguins. Now let’s use the select() function on that data frame to extract the columns containing the species, flipper length, and body mass measurements. We’ll save the selected data frame in a new variable.
gentoo_body_and_flipper <- select(gentoo_penguin_data,
                                  species,
                                  flipper_length_mm, body_mass_g)
3.4 Mutate
If we want to add new columns to a data frame, we use the mutate() function. Here, we add a new column which contains the body mass of each penguin in kilograms (the “body_mass_g” column is in grams).
mutate(penguins,
body_mass_kg = body_mass_g / 1000)
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, body_mass_kg <dbl>
From this code, we can see the general form of the mutate() function: mutate(dataset, new_column_name = expression). In this example, we used the “/” operator to indicate we want to divide penguin body mass in grams by 1000, to calculate the body mass in kilograms. This is an example of an arithmetic operator.
R supports several operators that allow us to perform various mathematical operations:
- + addition
- - subtraction
- * multiplication
- / division
- ^ exponentiation
When we use these operators on a column of a data frame, they work separately on each value in the column (called an “element-wise” operation).
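A column is just a vector of values, so we can see the same behavior outside of a data frame. A minimal sketch with made-up body masses:

```r
# Three made-up body masses in grams, one of them missing.
body_mass_g <- c(3750, 3800, NA)

# The division is applied to each value separately;
# a missing input simply gives a missing output.
body_mass_kg <- body_mass_g / 1000
body_mass_kg   # 3.75, 3.80, and NA

# Operators can be combined; ^ is applied before / here.
(body_mass_g / 1000)^2
```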
Note that mutate() adds new columns to the right side of the data frame. If we want to add new columns in different locations, we can use the .before and .after arguments.
mutate(penguins,
body_mass_kg = body_mass_g / 1000,
.after = body_mass_g)
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_kg <dbl>, sex <fct>, year <int>
We can provide .after and .before with either a column name (like we did above), or a number referring to the position of an existing column. Here we insert the new column before the current second column:
mutate(penguins,
body_mass_kg = body_mass_g / 1000,
.before = 2)
# A tibble: 344 × 9
species body_mass_kg island bill_length_mm bill_depth_mm flipper_length_mm
<fct> <dbl> <fct> <dbl> <dbl> <int>
1 Adelie 3.75 Torgersen 39.1 18.7 181
2 Adelie 3.8 Torgersen 39.5 17.4 186
3 Adelie 3.25 Torgersen 40.3 18 195
4 Adelie NA Torgersen NA NA NA
5 Adelie 3.45 Torgersen 36.7 19.3 193
6 Adelie 3.65 Torgersen 39.3 20.6 190
7 Adelie 3.62 Torgersen 38.9 17.8 181
8 Adelie 4.68 Torgersen 39.2 19.6 195
9 Adelie 3.48 Torgersen 34.1 18.1 193
10 Adelie 4.25 Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
Let’s use the mutate() function to add a body mass (mg) column to the data frame of Gentoo data we’ve been working on so far. We’ll save this expanded data frame in a new variable.
gentoo_body_mg_and_flipper <- mutate(gentoo_body_and_flipper,
                                     body_mass_mg = body_mass_g * 1000,
                                     .after = body_mass_g)
3.5 Pipes in R
R lets us take the output of one function and provide it as input to another. The general name for this type of operation is “piping”. The pipe operator in R is |>. Here we use the filter() function, the R pipe (|>), and the head() function to view the first six rows returned by the filter function.
filter(penguins, species == "Gentoo") |> head()
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 46.1 13.2 211 4500
2 Gentoo Biscoe 50 16.3 230 5700
3 Gentoo Biscoe 48.7 14.1 210 4450
4 Gentoo Biscoe 50 15.2 218 5700
5 Gentoo Biscoe 47.6 14.5 215 5400
6 Gentoo Biscoe 46.5 13.5 210 4550
# ℹ 2 more variables: sex <fct>, year <int>
In this R code, we use the filter() command to extract all of the rows from the data containing measurements from Gentoo penguins. We then use the |> operator to send the output of the filter() function to the head() function. The head() function returns the first 6 rows from its input data frame.
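When working in RStudio, we can use the shortcut Ctrl/Cmd + Shift + M to enter the pipe. By default this shortcut inserts the older %>% pipe (described in the note below); to get |> instead, turn on the “Use native pipe operator, |>” setting in the Code section of RStudio’s Global Options.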
Over the last three sections we used the filter(), select(), and mutate() functions to create this data frame:
head(gentoo_body_mg_and_flipper)
# A tibble: 6 × 4
species flipper_length_mm body_mass_g body_mass_mg
<fct> <int> <int> <dbl>
1 Gentoo 211 4500 4500000
2 Gentoo 230 5700 5700000
3 Gentoo 210 4450 4450000
4 Gentoo 218 5700 5700000
5 Gentoo 215 5400 5400000
6 Gentoo 210 4550 4550000
We saved each of the intermediates to their own variables (take a look in the Environment tab to see the list of variables). Alternatively, we can use the pipe to generate the same data without saving any of the intermediate results (this time we use .before, so the new column lands to the left of body_mass_g):
penguins |>
  filter(species == "Gentoo") |>
  select(species, flipper_length_mm, body_mass_g) |>
  mutate(body_mass_mg = body_mass_g * 1000,
         .before = body_mass_g) |>
  head()
# A tibble: 6 × 4
species flipper_length_mm body_mass_mg body_mass_g
<fct> <int> <dbl> <int>
1 Gentoo 211 4500000 4500
2 Gentoo 230 5700000 5700
3 Gentoo 210 4450000 4450
4 Gentoo 218 5700000 5700
5 Gentoo 215 5400000 5400
6 Gentoo 210 4550000 4550
With the pipe operator, we can combine many simple R functions to create complex pipelines, all while keeping our code readable.
Under the hood, the |> operator takes the output of the expression on its left and feeds it into the first argument of the function on its right. All dplyr functions are fully compatible with the pipe operator (the first argument of every function is the input data frame).
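The rewriting that the pipe performs is easy to check on a small base R example, with no data frames involved. The sketch below uses made-up numbers and compares a nested call with its piped form (it needs R 4.1 or later, where |> is available).

```r
values <- c(4, 9, 16)

# Nested form: read from the inside out.
nested <- sum(sqrt(values))

# Piped form: read from left to right.
piped <- values |> sqrt() |> sum()

identical(nested, piped)
# [1] TRUE

# Extra arguments go after the piped value: x |> f(y) means f(x, y).
first_two <- values |> head(2)
first_two
# [1] 4 9
```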
%>%
The |> pipe operator is a relatively recent (May 2021) addition to base R. Before that, we needed to use the %>% operator, also called the “magrittr pipe.” This operator is still around and used in a lot of existing R code, but it requires us to load the magrittr package (or a package that re-exports it, such as dplyr). In these lessons, we’ll only use the native |> pipe operator, so we don’t need to load any extra packages.
3.6 Arrange
We can use the arrange() function to sort the rows in our data according to the values in one or more columns. Here we sort the penguins by bill length:
penguins |>
  arrange(bill_length_mm)
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 32.1 15.5 188 3050
2 Adelie Dream 33.1 16.1 178 2900
3 Adelie Torgersen 33.5 19 190 3600
4 Adelie Dream 34 17.1 185 3400
5 Adelie Torgersen 34.1 18.1 193 3475
6 Adelie Torgersen 34.4 18.4 184 3325
7 Adelie Biscoe 34.5 18.1 187 2900
8 Adelie Torgersen 34.6 21.1 198 4400
9 Adelie Torgersen 34.6 17.2 189 3200
10 Adelie Biscoe 35 17.9 190 3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
By default, arrange() sorts values from smallest to largest (ascending). We can use the desc() function inside arrange() to sort values from largest to smallest (descending).
arrange(penguins, desc(bill_depth_mm))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgers… 46 21.5 194 4200
2 Adelie Torgers… 38.6 21.2 191 3800
3 Adelie Dream 42.3 21.2 191 4150
4 Adelie Torgers… 34.6 21.1 198 4400
5 Adelie Dream 39.2 21.1 196 4150
6 Adelie Biscoe 41.3 21.1 195 4400
7 Chinstrap Dream 54.2 20.8 201 4300
8 Adelie Torgers… 42.5 20.7 197 4500
9 Adelie Biscoe 39.6 20.7 191 3900
10 Chinstrap Dream 52 20.7 210 4800
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Lastly, we can sort data based on multiple columns:
arrange(penguins, island, desc(bill_depth_mm))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 41.3 21.1 195 4400
2 Adelie Biscoe 39.6 20.7 191 3900
3 Adelie Biscoe 45.6 20.3 191 4600
4 Adelie Biscoe 41 20 203 4725
5 Adelie Biscoe 37.8 20 190 4250
6 Adelie Biscoe 38.2 20 190 3900
7 Adelie Biscoe 42 19.5 200 4050
8 Adelie Biscoe 42.2 19.5 197 4275
9 Adelie Biscoe 35.9 19.2 189 3800
10 Adelie Biscoe 37.6 19.1 194 3750
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
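One detail worth knowing when sorting real data: arrange() places rows with missing values at the end, whether we sort in ascending or descending order. A minimal sketch with made-up values (the id column is only there to make the row order easy to read):

```r
library(dplyr)

# Toy data; row 2 has a missing bill depth.
toy <- data.frame(id = 1:4,
                  bill_depth_mm = c(18.7, NA, 21.2, 17.4))

ascending  <- arrange(toy, bill_depth_mm)
descending <- arrange(toy, desc(bill_depth_mm))

ascending$id
# [1] 4 1 3 2
descending$id
# [1] 3 1 4 2
```

In both results the row with the NA (id 2) comes last.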
3.7 Distinct
The distinct() function returns all unique rows from the data frame. We can provide the names of the columns we want to search for unique combinations. Here we want to find all unique combinations of species and island.
distinct(penguins, island, species)
# A tibble: 5 × 2
island species
<fct> <fct>
1 Torgersen Adelie
2 Biscoe Adelie
3 Dream Adelie
4 Biscoe Gentoo
5 Dream Chinstrap
If we don’t specify any column names, the distinct() function will look for unique combinations across all columns.
distinct(penguins)
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
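By default, distinct() returns only the columns we name. If we want to keep the other columns too, the .keep_all argument retains the first row found for each unique combination. A minimal sketch with made-up values:

```r
library(dplyr)

# Toy data; the first two rows share the same species and island.
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Adelie"),
  island  = c("Biscoe", "Biscoe", "Biscoe", "Dream"),
  mass    = c(3400, 3800, 4500, 3250)
)

# Unique species/island pairs: only the two named columns are returned.
combos <- distinct(toy, species, island)
dim(combos)
# [1] 3 2

# Same pairs, but keeping every column from the first matching row.
first_rows <- distinct(toy, species, island, .keep_all = TRUE)
first_rows$mass
# [1] 3400 4500 3250
```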
3.8 Grouping and Summarizing Data
We started transforming these data because we wanted to extract some summary stats about the body mass and flipper length of the three penguin species in our data. With the functions we’ve used so far, we can extract the data we need for a specific penguin species. Here, we use the summarize() function to calculate the mean body mass and flipper length across all Gentoo penguins in our data.
penguins |>
  filter(species == "Gentoo") |>
  select(species, flipper_length_mm, body_mass_g) |>
  na.omit() |> # Filter out any rows containing NA values in any columns
  summarize(mean_body_mass_g = mean(body_mass_g),
            mean_flipper_length_mm = mean(flipper_length_mm))
# A tibble: 1 × 2
mean_body_mass_g mean_flipper_length_mm
<dbl> <dbl>
1 5076. 217.
In this example we use dplyr functions and the pipe operator to filter our data for Gentoo penguins and select our columns of interest (body_mass_g and flipper_length_mm). We then use the summarize() function (also from dplyr) to calculate the mean values across data in the body_mass_g and flipper_length_mm columns.
The first time we ran the code above, we didn’t use the na.omit() function and our mean body mass and flipper length calculations returned ‘NA’ values. ‘NA’ is one of the ways R represents missing data, and it turns out one of the Gentoo penguins in our dataset has ‘NA’ values for all of its measurements (you can find it by looking through the data with the View() function). Many functions that perform mathematical operations (like mean()) will return an ‘NA’ value if any of their inputs are ‘NA’. This is so we’re aware there are ‘NA’ values present in our data and can handle them accordingly. Once we realized there was a row of ‘NA’ values in our data, we excluded it using the na.omit() function.
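The way a single missing value spreads through a calculation can be reproduced with three made-up numbers:

```r
# Three made-up flipper lengths, one of them missing.
flipper_length_mm <- c(211, NA, 230)

# One NA among the inputs makes the whole result NA.
mean_with_na <- mean(flipper_length_mm)
mean_with_na
# [1] NA

# is.na() tells us which values are missing.
which(is.na(flipper_length_mm))
# [1] 2

# Dropping the missing value first gives a usable answer.
mean_without_na <- mean(flipper_length_mm[!is.na(flipper_length_mm)])
mean_without_na
# [1] 220.5
```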
From this code, we see the general form of the summarize() function: summarize(dataset, column_name = expression). This is quite similar to the mutate() function. However, while the mutate() function performs a calculation for each row in a data frame column, the summarize() function performs one calculation using all of the data in a column.
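The difference shows up most clearly in the number of rows each function returns. A minimal sketch with a made-up three-row data frame:

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(body_mass_g = c(4500, 5700, 4450))

# mutate(): one result per row, so the row count is unchanged.
per_row <- mutate(toy, body_mass_kg = body_mass_g / 1000)
nrow(per_row)
# [1] 3

# summarize(): one result for the whole column, so a single row comes back.
overall <- summarize(toy, mean_body_mass_g = mean(body_mass_g))
nrow(overall)
# [1] 1
```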
Using the summarize() function we can quickly calculate summary stats from the columns in a data frame. In order to get the same summary stats for the other penguin species, we’d need to repeat the same set of operations two more times (once for each remaining species). Ideally, we want to be able to work on data from all three penguin species at the same time.
We can do this with the group_by() function.
penguins |>
  select(species, body_mass_g, flipper_length_mm) |>
  group_by(species) |>
  # This time we're using arguments in the mean function to remove the NA values
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
            mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE))
# A tibble: 3 × 3
species mean_body_mass_g mean_flipper_length_mm
<fct> <dbl> <dbl>
1 Adelie 3701. 190.
2 Chinstrap 3733. 196.
3 Gentoo 5076. 217.
Here we use the group_by() function to group the data according to the species column, before using the summarize() function. This grouping causes the summarize() function to perform calculations across the data within each group (species, in this case), rather than across the entire data frame.
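It can help to look at what group_by() returns on its own: the rows and values are unchanged, and the result only carries a record of which column defines the groups. A minimal sketch with made-up values:

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(species = c("Adelie", "Adelie", "Gentoo"),
                  body_mass_g = c(3700, 3800, 5000))

# Grouping does not add, remove, or reorder rows.
grouped <- group_by(toy, species)
nrow(grouped)
# [1] 3
group_vars(grouped)
# [1] "species"

# summarize() then returns one row per group.
by_species <- summarize(grouped, mean_body_mass_g = mean(body_mass_g))
by_species$mean_body_mass_g
# [1] 3750 5000
```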
The means don’t give us the whole picture, so let’s calculate the standard deviations for each of these measurements, as well as the total number of penguins from each species.
penguins |>
  select(species, body_mass_g, flipper_length_mm) |>
  group_by(species) |>
  na.omit() |>
  summarize(mean_body_mass_g = mean(body_mass_g),
            sd_body_mass_g = sd(body_mass_g),
            mean_flipper_length_mm = mean(flipper_length_mm),
            sd_mean_flipper_length_mm = sd(flipper_length_mm),
            Total_animals = n())
It looks like the raw numbers agree with what we saw in the figure we generated in the previous lesson. Namely, the Gentoo penguins tend to have more mass and longer flippers than the other two species. And while the Chinstrap penguins have higher mean body mass and flipper length than the Adélie penguins, the standard deviations in these measurements are large enough that there probably isn’t a significant difference in size between the two. In the coming lessons, we’ll apply some statistical tests to these data to test our hypotheses.
So far, we’ve seen two ways of keeping ‘NA’ values from affecting our calculations:
1. The na.omit() function removes all rows from a data frame that contain ‘NA’ values in any column.
2. The mean() and sd() functions have an na.rm argument that excludes all ‘NA’ values from the mean / standard deviation calculations when we set it to TRUE (na.rm = TRUE).
We used the na.omit() function to exclude the ‘NA’ values before we used the summarize() function to calculate all of our summary statistics above. However, if we skip the na.omit() function and instead use the “na.rm” argument for mean() and sd() to exclude the ‘NA’ values, we get a slightly different result.
penguins |>
  select(species, body_mass_g, flipper_length_mm) |>
  group_by(species) |>
  # na.omit() |>
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
            sd_body_mass_g = sd(body_mass_g, na.rm = TRUE),
            mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
            sd_mean_flipper_length_mm = sd(flipper_length_mm, na.rm = TRUE),
            Total_animals = n())
# A tibble: 3 × 6
species mean_body_mass_g sd_body_mass_g mean_flipper_length_mm
<fct> <dbl> <dbl> <dbl>
1 Adelie 3701. 459. 190.
2 Chinstrap 3733. 384. 196.
3 Gentoo 5076. 504. 217.
# ℹ 2 more variables: sd_mean_flipper_length_mm <dbl>, Total_animals <int>
Compare these results to the previous code using na.omit(), paying close attention to the “Total_animals” column. If there are too many columns in the results to compare them easily, you could always use the select() function to grab just the “species” and “Total_animals” columns.
When we used the na.rm argument approach, we ended up with one extra penguin in the “Total_animals” column for the Adélie and Gentoo penguins. This is because the n() function counts rows, regardless of their contents (you can confirm there’s no na.rm argument for n() using the R docs). While there are many different ways to accomplish the same task, they are not all equivalent in all cases. If we weren’t also using the n() function to count the total number of penguins in each species, both of our methods for removing the ‘NA’ values would have produced the same result.
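If we want a count of the penguins that actually have a measurement, one option is to count the non-missing values directly with sum(!is.na(...)) next to n(). This is a sketch of one possible approach on made-up data, not the method used above; the column names n_rows and n_measured are illustrative.

```r
library(dplyr)

# Toy data; one penguin in each species is missing its body mass.
toy <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, NA, 5000, 5200, NA)
)

counts <- toy |>
  group_by(species) |>
  summarize(n_rows     = n(),                       # every row in the group
            n_measured = sum(!is.na(body_mass_g)))  # rows with a body mass

counts$n_rows
# [1] 2 3
counts$n_measured
# [1] 1 2
```

Reporting both numbers side by side makes it obvious how many animals contributed to each mean.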
Even though this is a toy example, we’ve created a flexible analysis pipeline by combining these dplyr functions. The code we’ve written will still work if we collect new data from different penguin species, add additional biological measurements beyond body mass and flipper length, or remove some of the rows from the original input data.
4 R session information
Here we report the version number for R and the package versions we used to perform the analyses in this document.
sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.4 readr_2.1.5 here_1.0.1
[4] palmerpenguins_0.1.1
loaded via a namespace (and not attached):
[1] crayon_1.5.2 vctrs_0.6.5 cli_3.6.2 knitr_1.47
[5] rlang_1.1.3 xfun_0.44 generics_0.1.3 jsonlite_1.8.8
[9] bit_4.0.5 glue_1.7.0 rprojroot_2.0.4 htmltools_0.5.8.1
[13] hms_1.1.3 fansi_1.0.6 rmarkdown_2.27 evaluate_0.23
[17] tibble_3.2.1 tzdb_0.4.0 fastmap_1.2.0 yaml_2.3.8
[21] lifecycle_1.0.4 compiler_4.4.0 htmlwidgets_1.6.4 pkgconfig_2.0.3
[25] rstudioapi_0.16.0 digest_0.6.35 R6_2.5.1 tidyselect_1.2.1
[29] utf8_1.2.4 parallel_4.4.0 vroom_1.6.5 pillar_1.9.0
[33] magrittr_2.0.3 withr_3.0.0 bit64_4.0.5 tools_4.4.0