Lesson 2: Working with tabular data

Published

October 26, 2023

1 Prepare the R environment for this lesson

For this lesson we’ll need four packages:

palmerpenguins
here
readr
dplyr

First, we need to install the here package, using the install.packages() function.

install.packages("here")

Note, the readr and dplyr packages are part of the tidyverse, so we don’t need to install them separately. Now we use the library() function to load all of these packages.

library(palmerpenguins)
library(here)

here() starts at C:/Users/nickopotamus/Projects/ITMAT_office_houRs

library(readr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

The output from this code indicates R successfully loaded these packages. The message we get from the here package tells us which directory on our computers it’s using as the “root” directory (more on what this means below). The other warning messages explains that dplyr includes several functions (e.g. filter() and lag()) that have the same names as functions from the stats and base packages. R’s default behavior for resolving these naming conflicts is to use the version of the function loaded most recently into memory. The warning lets us know that if we run any of these functions, R will use the version of these functions from the dplyr package and not the stats/base package.

The potential for naming conflicts like this is another reason why we only load the packages we need for the current analyses.

RStudio’s Tab-Complete List Includes Package Names for Each Function

When we start to type a function’s name, and RStudio’s tab-completion prompt opens up, the package for each function is listed in curly bracers to the right of the function’s name. As we’re writing code, we can use this to make sure we’re using the correct version of the function.

2 Reading data from files

We have a file named penguins.csv in the DATA/ directory. This data file is in .csv (comma-separated values) format and contains the same information as the penguins data frame we used in the previous lesson. We can view the raw contents of this file by using the Files tab in the lower right pane of the RStudio window. Navigate to the DATA/ directory, click on the penguins.csv file, then select the “View File” option.

2.1 Base R

We can read the data from this file using the base R function read.csv() and store the data in a variable named penguin_data_from_base_r.

penguin_data_from_base_r <- read.csv(file = here::here("DATA/penguins.csv"))

In R, <- is called the assignment operator. The assignment operator takes the output from the function to its right (read.csv()) and assigns it to the variable to its right (penguin_data_from_base_r). Put differently, we’re storing the contents of the ‘penguins.csv’ file in ‘penguin_data_from_base_r’.

The here package

The here() function is helping to point R to the specific location of the file. It allows us to define a file’s location relative to the project’s main directory (ITMAT_office_houRs, in this case). While the here() function is not required to read files, it can make our lives easier when we’re using RStudio’s ‘Projects’ to manage our work.

We can look at the contents of penguin_data_from_base_r by entering the variable name into the R console.

penguin_data_from_base_r

   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1   Adelie Torgersen           39.1          18.7               181        3750
2   Adelie Torgersen           39.5          17.4               186        3800
3   Adelie Torgersen           40.3          18.0               195        3250
4   Adelie Torgersen             NA            NA                NA          NA
5   Adelie Torgersen           36.7          19.3               193        3450
6   Adelie Torgersen           39.3          20.6               190        3650
7   Adelie Torgersen           38.9          17.8               181        3625
8   Adelie Torgersen           39.2          19.6               195        4675
9   Adelie Torgersen           34.1          18.1               193        3475
10  Adelie Torgersen           42.0          20.2               190        4250
11  Adelie Torgersen           37.8          17.1               186        3300
12  Adelie Torgersen           37.8          17.3               180        3700
13  Adelie Torgersen           41.1          17.6               182        3200
14  Adelie Torgersen           38.6          21.2               191        3800
15  Adelie Torgersen           34.6          21.1               198        4400
16  Adelie Torgersen           36.6          17.8               185        3700
17  Adelie Torgersen           38.7          19.0               195        3450
18  Adelie Torgersen           42.5          20.7               197        4500
      sex year
1    male 2007
2  female 2007
3  female 2007
4    <NA> 2007
5  female 2007
6    male 2007
7  female 2007
8    male 2007
9    <NA> 2007
10   <NA> 2007
11   <NA> 2007
12   <NA> 2007
13 female 2007
14   male 2007
15   male 2007
16 female 2007
17 female 2007
18   male 2007
 [ reached 'max' / getOption("max.print") -- omitted 326 rows ]

Notice, that “penguin_data_from_base_r” now appears in the Environment tab of the upper right pane of the RStudio window. We can use this tab to quickly check the data we’ve loaded into R.

While the read.csv() function gets the job done, it doesn’t do a lot to format the data.

2.2 The readr package

dplyr logo

As the name implies, the readr package contains functions designed to help us read tabular data from text files. These functions have a lot of useful features to mark and handle problems we’re likely to encounter in real-world data (we’ll see some examples of this in later lessons).

To read this csv file, we’re going to use the read_csv() function from the readr package. The command is almost identical to base R, except we have an underscore in read_csv(), instead of a period.

penguin_data_from_readr <- read_csv(file = here::here("DATA/penguins.csv"))

Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The readr function includes some additional output that tells us about guesses it’s making about data in the file. Specifically, it determined the species, island, and sex columns contain text (or “characters”), while the remaining columns contain numbers with decimal points (or “doubles”).

Compare the the contents of the penguin_data_from_readr to the penguin_data_from_base_r data we loaded above:

penguin_data_from_readr

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

readr functions read data from files and store them as tibbles, a special version of a data frame. We saw an example of a tibble when we worked with the penguins data frame, loaded by the palmerpenguins package.

3 Transforming penguins

3.1 The dplyr Package

dplyr logo

The dplyr package comes with many functions for manipulating and extracting information from tabular data. As we’ll see below, dplyr functions are named after verbs that describe what we’re doing to the input data, and the first argument of every function is the input data frame (or tibble).

For simplicity, we’ll return to using the penguins data frame for the remainder of this lesson. While we could use dplyr functions to work with the data we read from penguins.csv, the penguins data frame has some nicer formatting.

3.2 Filter

We can use the filter() function to grab rows from our data that contain specific information. Here, we extract just those rows containing measurements from Gentoo penguins.

filter(penguins, species == "Gentoo")

# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Remember, the species information in our data frame is contained in the “species” column. This example code is telling R to search through every row in the penguin data, and return those rows that have “Gentoo” in the species column. From this example, we can see the general form for using the filter function: filter(dataset, comparison). We used “==” to indicate we want to find all rows in species that match the word “Gentoo”. This is an example of a relational operator.

Compare Values With Relational Operators

R supports several operators that let us compare values:

== : Check if two values are exactly equal. Many programming languages use the double equals sign to indicate comparisons, because they’re already using the single equal sign for something else (e.g. assignment).
<, > : Less-than, and greater-than comparisons.
<=, >= : Less-than or equal, and greater-than or equal comparisons.
!= : Check if two values are not equal.

These operators return a logical value: TRUE or FALSE.

"Gentoo" == "Gentoo"

[1] TRUE

121 < 43

[1] FALSE

We can also combine multiple filtering conditions in the same command. In this example, we want to get the rows containing data from female Adelie penguins.

filter(penguins,
       sex == "female",
       species == "Adelie")

# A tibble: 73 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 63 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Alternatively, we can combine multiple conditions with the & symbol (meaning “and”) and the | symbol (meaning “or”). We can re-write the previous filter command using the & symbol:

filter(penguins,
       sex == "female" & species == "Adelie")

# A tibble: 73 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 63 more rows
# ℹ 2 more variables: sex <fct>, year <int>

We can use the | symbol to retrieve data from female penguins that are either Adelie or Chinstrap:

filter(penguins,
       sex == "female",
       species == "Adelie" | species == "Chinstrap")

# A tibble: 107 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           40.3          18                 195        3250
 3 Adelie  Torgersen           36.7          19.3               193        3450
 4 Adelie  Torgersen           38.9          17.8               181        3625
 5 Adelie  Torgersen           41.1          17.6               182        3200
 6 Adelie  Torgersen           36.6          17.8               185        3700
 7 Adelie  Torgersen           38.7          19                 195        3450
 8 Adelie  Torgersen           34.4          18.4               184        3325
 9 Adelie  Biscoe              37.8          18.3               174        3400
10 Adelie  Biscoe              35.9          19.2               189        3800
# ℹ 97 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Let’s use the filter function to create a new data frame that just contains rows from our penguin data from Gentoo penguins.

gentoo_penguin_data <- filter(penguins, species == "Gentoo")

3.3 Select

With the filter() function, we can choose which rows we want to extract from our data. If we want to choose which columns to extract, we use the select() function. Here we extract the columns containing the species, flipper length, and body mass from each penguin in the dataset.

select(penguins, species, flipper_length_mm, body_mass_g)

# A tibble: 344 × 3
   species flipper_length_mm body_mass_g
   <fct>               <int>       <int>
 1 Adelie                181        3750
 2 Adelie                186        3800
 3 Adelie                195        3250
 4 Adelie                 NA          NA
 5 Adelie                193        3450
 6 Adelie                190        3650
 7 Adelie                181        3625
 8 Adelie                195        4675
 9 Adelie                193        3475
10 Adelie                190        4250
# ℹ 334 more rows

From this example code, we can see the general form of the select() function: select(dataset, column_name1, column_name2, ...). The select() function is very useful for reducing our dataset to just the columns we need for a particular calculation or analysis. This is critical when we’re working with input data that have 100s of columns.

Above, we created a data frame that only contains data from Gentoo penguins. Now let’s use the select() function on that data frame to extract the columns containing the species, flipper length, and body mass measurements. We’ll save the selected data frame in a new variable.

gentoo_body_and_flipper <- select(gentoo_penguin_data,
                                  species,
                                  flipper_length_mm,
                                  body_mass_g)

3.4 Mutate

Cartoon representation of mutate operation — Artwork by @allison_horst

If we want to add new columns to a data frame, We use the mutate() function. Here, we add a new column which contains the body mass of each penguins in kilograms (the “body_mass_g” column is in grams).

mutate(penguins,
       body_mass_kg = body_mass_g / 1000)

# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, body_mass_kg <dbl>

From this code, we can see the general form of the mutate() function: mutate(dataset, new_column_name = expression). In this example, used the “/” operator to indicate we want to divide penguin body mass in grams by 1000, to calculate the body mass in kilograms. This is an example of an arithmetic operator.

Arithmetic Operators

R supports several operators that allow us to perform various mathematical operations:

+ addition
- subtraction
* multiplication
/ division
^ exponentiation

When we use these operators on the column of a data frame, they’re designed to work separately on each value in the column (called an “element wise” operation).

Note that mutate() adds new columns to the right side of the data frame. If we want to add new columns in different locations, we can use the .before and .after arguments.

mutate(penguins,
       body_mass_kg = body_mass_g / 1000,
       .after = body_mass_g)

# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_kg <dbl>, sex <fct>, year <int>

We can provide .after and .before with either a column name (like we did above), or a number referring to the position in the table we want to insert the new column. Here we insert the new column before the current second column:

mutate(penguins,
       body_mass_kg = body_mass_g / 1000,
       .before = 2)

# A tibble: 344 × 9
   species body_mass_kg island    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>          <dbl> <fct>              <dbl>         <dbl>             <int>
 1 Adelie          3.75 Torgersen           39.1          18.7               181
 2 Adelie          3.8  Torgersen           39.5          17.4               186
 3 Adelie          3.25 Torgersen           40.3          18                 195
 4 Adelie         NA    Torgersen           NA            NA                  NA
 5 Adelie          3.45 Torgersen           36.7          19.3               193
 6 Adelie          3.65 Torgersen           39.3          20.6               190
 7 Adelie          3.62 Torgersen           38.9          17.8               181
 8 Adelie          4.68 Torgersen           39.2          19.6               195
 9 Adelie          3.48 Torgersen           34.1          18.1               193
10 Adelie          4.25 Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

Let’s use the mutate() function to add a body mass (mg) column to the data frame of Gentoo data we’ve been working on so far. We’ll save this expanded data frame in a new variable.

gentoo_body_mg_and_flipper <- mutate(gentoo_body_and_flipper,
                                     body_mass_mg = body_mass_g * 1000,
                                     .after = body_mass_g)

3.5 Pipes in R

R has a functionality allowing us to take the output of one function and provide it as input to another. The general name for this type of operation is “piping”. The pipe operator in R is |>. Here we use the filter() function, the R pipe (|>), and the head() function to view the first six rows returned by the filter function.

filter(penguins, species == "Gentoo") |> head()

# A tibble: 6 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Gentoo  Biscoe           46.1          13.2               211        4500
2 Gentoo  Biscoe           50            16.3               230        5700
3 Gentoo  Biscoe           48.7          14.1               210        4450
4 Gentoo  Biscoe           50            15.2               218        5700
5 Gentoo  Biscoe           47.6          14.5               215        5400
6 Gentoo  Biscoe           46.5          13.5               210        4550
# ℹ 2 more variables: sex <fct>, year <int>

In this R code, we use the filter() command to extract all of the rows from the data containing measurements from Gentoo penguins. We then use the |> operator to send the output of the filter() function to the head() function. The head() function returns the first 6 rows from its input data frame.

RStudio’s pipe shortcut

When working in RStudio, we can use the shortcut Ctrl/Cmd + Shift + M to enter the pipe.

Over the last three sections we used the filter(), select(), and mutate() functions to create this data frame:

head(gentoo_body_mg_and_flipper)

# A tibble: 6 × 4
  species flipper_length_mm body_mass_g body_mass_mg
  <fct>               <int>       <int>        <dbl>
1 Gentoo                211        4500      4500000
2 Gentoo                230        5700      5700000
3 Gentoo                210        4450      4450000
4 Gentoo                218        5700      5700000
5 Gentoo                215        5400      5400000
6 Gentoo                210        4550      4550000

We saved each of the intermediates to their own variables (take a look in the Environment tab to see the list of variables). Alternatively, we can use the pipe to generate the same data frame without saving any of the intermediate results:

penguins |> 
    filter(species == "Gentoo") |> 
    select(species, flipper_length_mm, body_mass_g) |> 
    mutate(body_mass_mg = body_mass_g * 1000,
           .before = body_mass_g) |> 
    head()

# A tibble: 6 × 4
  species flipper_length_mm body_mass_mg body_mass_g
  <fct>               <int>        <dbl>       <int>
1 Gentoo                211      4500000        4500
2 Gentoo                230      5700000        5700
3 Gentoo                210      4450000        4450
4 Gentoo                218      5700000        5700
5 Gentoo                215      5400000        5400
6 Gentoo                210      4550000        4550

With the pipe operator, we can combine many simple R functions to create complex pipelines, all while keeping our code readable.

Under the hood, the |> operator is taking the output of the function on its left and feeding it into the first argument of the function on its right. All dplyr functions are fully compatible with the pipe operator (the first argument of every function is the input data frame).

Another Pipe Operator: %>%

The |> pipe operator is a relatively recent (May 2021) addition to base R. Before that, we needed to use the %>% operator, also called the “magrittr pipe.” This operator is still around and used in a lot of existing R code, but it requires us to load the magrittr package. In these lessons, we’ll only use the native |> pipe operator, so we don’t need to load any extra packages.

3.6 Arrange

We can use the arrange() function to sort the rows in our data according to the values in one or more columns. Here we sort the penguins by bill depth:

penguins |> 
    arrange(bill_length_mm)

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.1          18.1               193        3475
 6 Adelie  Torgersen           34.4          18.4               184        3325
 7 Adelie  Biscoe              34.5          18.1               187        2900
 8 Adelie  Torgersen           34.6          21.1               198        4400
 9 Adelie  Torgersen           34.6          17.2               189        3200
10 Adelie  Biscoe              35            17.9               190        3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

By default, arrange() sorts values from smallest to largest (ascending). We can use the desc() function inside arrange to sort values from largest to smallest (descending).

arrange(penguins, desc(bill_depth_mm))

# A tibble: 344 × 8
   species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
 1 Adelie    Torgers…           46            21.5               194        4200
 2 Adelie    Torgers…           38.6          21.2               191        3800
 3 Adelie    Dream              42.3          21.2               191        4150
 4 Adelie    Torgers…           34.6          21.1               198        4400
 5 Adelie    Dream              39.2          21.1               196        4150
 6 Adelie    Biscoe             41.3          21.1               195        4400
 7 Chinstrap Dream              54.2          20.8               201        4300
 8 Adelie    Torgers…           42.5          20.7               197        4500
 9 Adelie    Biscoe             39.6          20.7               191        3900
10 Chinstrap Dream              52            20.7               210        4800
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Lastly, we can sort data based on multiple columns:

arrange(penguins, island, desc(bill_depth_mm))

# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Biscoe           41.3          21.1               195        4400
 2 Adelie  Biscoe           39.6          20.7               191        3900
 3 Adelie  Biscoe           45.6          20.3               191        4600
 4 Adelie  Biscoe           41            20                 203        4725
 5 Adelie  Biscoe           37.8          20                 190        4250
 6 Adelie  Biscoe           38.2          20                 190        3900
 7 Adelie  Biscoe           42            19.5               200        4050
 8 Adelie  Biscoe           42.2          19.5               197        4275
 9 Adelie  Biscoe           35.9          19.2               189        3800
10 Adelie  Biscoe           37.6          19.1               194        3750
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

3.7 Distinct

The distinct() function returns all unique rows from the data frame. We can provide the names of the columns we want to search for unique combinations. Here we want to find all unique combinations of species and island.

distinct(penguins, island, species)

# A tibble: 5 × 2
  island    species  
  <fct>     <fct>    
1 Torgersen Adelie   
2 Biscoe    Adelie   
3 Dream     Adelie   
4 Biscoe    Gentoo   
5 Dream     Chinstrap

If we don’t specify any column names, the distinct() function will look for unique combinations across all columns.

distinct(penguins)

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

3.8 Grouping and Summarizing Data

We started transforming these data because we wanted to extract some summary stats about the body mass and flipper length of the three penguin species in our data. With the functions we’ve used so far, we can extract the data we need for a specific penguin species. Here, we use the summarize() function to calculate the mean mean body mass and flipper length across all Gentoo penguins in our data.

penguins |> 
    filter(species == "Gentoo") |> 
    select(species, flipper_length_mm, body_mass_g) |> 
    na.omit() |> # Filter out any rows containing NA values in any columns
    summarize(mean_body_mass_g = mean(body_mass_g),
              mean_flipper_length_mm = mean(flipper_length_mm))

# A tibble: 1 × 2
  mean_body_mass_g mean_flipper_length_mm
             <dbl>                  <dbl>
1            5076.                   217.

In this example we use dplyr functions and the pipe operator to filter our data for Gentoo penguins, select our columns of interest (body_mass_g and flipper_length_mm), and the summarize() function (also from dplyr) to calculate the mean values across data in the body_mass_g and flipper_length_mm columns.

Missing Data

The first time we ran the code above, we didn’t use the na.omit() function and our mean body mass and flipper length calculations returned ‘NA’ values. ‘NA’ is one of the ways R represents missing data, and it turns out one of the penguins in our dataset has ‘NA’ values for all of its measurements (you can find it by looking through the data with the View() function). Many function that perform mathematical operations (like mean), will return an ‘NA’ value if any of its inputs are ‘NA’. This is so we’re aware there are ‘NA’ values present in our data and can handle them accordingly. Once we realized there was a single ‘NA’ values in our data, we excluded it using the na.omit() function.

From this code, we see the general form of the summarize() function: summarize(dataset, column_name = expression). This is quite similar to the mutate() function. However, while the mutate() function performs a calculation for each row in a data frame column, the summarize() function performs one calculation using all of the data in a column.

Using the summarize() function we can quickly calculate summary stats from the columns in a data frame. In order to get the same summary stats for the other penguin data, we’d need to repeat the same set of operations two more times (one for each species). Ideally, we want to be able to work on data from all three penguin species at the same time.

We can do this with the group_by() function.

penguins |> 
    select(species, body_mass_g, flipper_length_mm) |> 
    group_by(species) |> 
    # This time we're using arguments in the mean function to remove the NA values
    summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
              mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE))

# A tibble: 3 × 3
  species   mean_body_mass_g mean_flipper_length_mm
  <fct>                <dbl>                  <dbl>
1 Adelie               3701.                   190.
2 Chinstrap            3733.                   196.
3 Gentoo               5076.                   217.

Here we use the group_by() function to group the data according to the species column, before using the summarize() function. This grouping causes the summarize() function to perform calculations across the data within each group (species, in this case), rather than across the entire data frame.

The means don’t give us the whole picture, so let’s calculate the standard deviations for each of these measurements, as well as the total number of penguins from each species.

penguins |> 
    select(species, body_mass_g, flipper_length_mm) |> 
    group_by(species) |> 
    na.omit() |>
    summarize(mean_body_mass_g = mean(body_mass_g),
              sd_body_mass_g = sd(body_mass_g),
              mean_flipper_length_mm = mean(flipper_length_mm),
              sd_mean_flipper_length_mm = sd(flipper_length_mm),
              Total_animals = n())

It looks like the raw numbers agree with what we saw in the figure we generated in the previous lesson. Namely, the Gentoo penguins tend to have more mass and longer flippers than the other two species. And while the Chinstrap penguins have higher mean body mass and flipper length than the Adélie penguins, the standard deviations in these measurements are large enough that there probably isn’t a significant difference in size between the two. In the coming lessons, we’ll apply some statistical tests to these data to test our hypotheses.

Not all operations are equivalent

So far, we’ve seen two ways of keeping ‘NA’ values from affecting our calculations: 1. The na.omit() function removes all rows from a data frame that contain ‘NA’ values in any column. 2. The mean() and sd() have an na.rm argument that excludes all ‘NA’ values from the mean / standard deviation calculations when we set it to TRUE (na.rm = TRUE).

We used the na.omit() function to exclude the ‘NA’ values before we used the summarize() function to calculated all of our summary statistics above. However, if we skip the na.omit() function and instead use the “na.rm” argument for mean() and sd() to exclude the ‘NA’ values, we get a slightly different result.

penguins |> 
    select(species, body_mass_g, flipper_length_mm) |> 
    group_by(species) |> 
    # na.omit() |>
    summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
              sd_body_mass_g = sd(body_mass_g, na.rm = TRUE),
              mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
              sd_mean_flipper_length_mm = sd(flipper_length_mm, na.rm = TRUE),
              Total_animals = n())

# A tibble: 3 × 6
  species   mean_body_mass_g sd_body_mass_g mean_flipper_length_mm
  <fct>                <dbl>          <dbl>                  <dbl>
1 Adelie               3701.           459.                   190.
2 Chinstrap            3733.           384.                   196.
3 Gentoo               5076.           504.                   217.
# ℹ 2 more variables: sd_mean_flipper_length_mm <dbl>, Total_animals <int>

Compare these results to the previous code using na.omit(), paying close attention to the “Total_animals” column. If there are too many columns in the results to compare them easily, you could always use the select() function to grab just the “species” and “Total_animals” columns.

When we used the na.rm argument approach, we ended with with one extra penguin in the “Total_animals” column for the Adélie and Gentoo penguins. This is because the n() function counts rows, regardless of their contents (you can confirm there’s no na.rm argument for n() using the R docs). While there are many different ways to accomplish the same task, they are not all equivalent in all cases. If we weren’t also using the n() function to count the total number of penguins in each species, both of our methods for removing the ‘NA’ values would have produced the same result.

Even though this is a toy example, we’ve created a flexible analysis pipeline by combining these dplyr functions, The code we’ve written will still work if we collect new data from different penguin species, add additional biological measurements beyond body mass and flipper length, or remove some of the rows from the original input data.

4 R session information

Here we report the version number for R and the package versions we used to perform the analyses in this document.

sessionInfo()

R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4          readr_2.1.5          here_1.0.1          
[4] palmerpenguins_0.1.1

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         knitr_1.47       
 [5] rlang_1.1.3       xfun_0.44         generics_0.1.3    jsonlite_1.8.8   
 [9] bit_4.0.5         glue_1.7.0        rprojroot_2.0.4   htmltools_0.5.8.1
[13] hms_1.1.3         fansi_1.0.6       rmarkdown_2.27    evaluate_0.23    
[17] tibble_3.2.1      tzdb_0.4.0        fastmap_1.2.0     yaml_2.3.8       
[21] lifecycle_1.0.4   compiler_4.4.0    htmlwidgets_1.6.4 pkgconfig_2.0.3  
[25] rstudioapi_0.16.0 digest_0.6.35     R6_2.5.1          tidyselect_1.2.1 
[29] utf8_1.2.4        parallel_4.4.0    vroom_1.6.5       pillar_1.9.0     
[33] magrittr_2.0.3    withr_3.0.0       bit64_4.0.5       tools_4.4.0