The following objects are masked from 'package:base':
date, intersect, setdiff, union
library(stringr)library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyr)
In addition to the tidyverse packages we loaded above, we’ll also need the janitor package. The janitor package contains functions to help us “clean” data after we’ve read it into R. Since we haven’t used this package yet, we’ll likely need to install it.
install.packages("janitor")
The ‘eval’ cell option
If you’re looking at the code in the Quarto document for this lesson (instead of the HTML version), you’ll notice the #| eval: false at the top of the code chunk, or “cell”. This tells quarto to skip this code cell (i.e. it will not evaluate the code cell). We only want to install the janitor package, so we don’t need to re-run the code cell whenever we render this quarto document.
Lastly, we need to load the janitor package before we can use its functions.
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
2 Reading the raw Palmer Penguins Data
We’ve worked with the Palmer Penguins dataset in several previous lessons. Here is a quick reminder of what this dataset looks like:
View(penguins)
This data table and its contents are formatted to work well with R’s conventions for visualization and analysis (as we’ve seen). However, original version of these data, as collected by the authors of the Palmer Penguins study, looked a little different. In this lesson, we will read the raw penguin data into R and work to transform it into the cleaned version we see in the penguins table.
The palmerpenguins package includes a csv file of this raw penguins data table. We can use the path_to_file() to find the location of this file on our computers:
In Lesson 2, we learned about the read_csv() function from the readr package. Here we will use this function to read the contents of this raw penguin data table into R. Don’t forget to refer to the help docs for read_csv() if you need a refresher on how it works.
Rows: 344 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
date (1): Date Egg
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Functions from the readr package produce a lot of output describing the files they read. Looking through these messages you can find information on the dimensions of this data table, as well as the guesses the read_csv() function made about the contents of each data column. Think about how this information compares to the structure of the cleaned penguins data.
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
The str() function
str() provides a compact, general overview of the structure of most R data structures. It’s useful when you need to quickly familiarize yourself with a data structure.
Based on any differences you note between the raw and cleaned penguin data, think about the types of changes you’ll need to make to the raw data to get it to look like the cleaned data.
3 Data cleaning
3.1 Column names
One of the first differences between the raw and cleaned data you may notice is that they have different numbers of columns, with different names. Let’s start by using the colnames() function.
Which columns in the raw data appear to correspond to those in the cleaned data? How do the names of the corresponding column different between the two tables?
Column naming contentions in R
In R, column names have the same naming requirements as variables:
May only contains letters, numbers, underscores, and period. No spaces.
Cannot start with a number.
Do any of the column names in the raw data table violate R’s naming conventions? What about the names of the columns in the cleaned data table?
Let’s examine the raw data table’s contents and column names. What do you notice about names of columns that violate R’s naming conventions?
We can use backticks to represent column names that violate R’s variable naming conventions. This is useful when we’re reading data from software that doesn’t follow the same conventions (e.g. excel spreadsheets, hand-made CSV files). While backticks give us greater flexibility in name columns, they can be unwieldy.
See what happens if we try to select the Sample Number column and forget the backticks.
raw_penguins_data |>select(Sample Number)
Error: <text>:2:19: unexpected symbol
1: raw_penguins_data |>
2: select(Sample Number
^
If there are columns we’ll need to refer to frequently in our code, it can make our lives easier to simlify their names and make sure they follow R’s naming conventions. In the next two sections, we’ll learn about two methods for modify column names.
3.1.1 The rename() function
The rename() function from the dplyr package allows us to rename individual columns using standard tidyverse conventions. In this example code, we rename the Sample Number column to remove the space.
From this code cell, you can see the general usage of the rename() function: rename(input_data, new_column_name = old_column_name). Note that we needed to enclose the original column name inside backticks, because it did not follow R’s naming conventions.
We can change the names of multiple columns in the same rename() function by separating each rename operation with a comma. Modify the code below so it also changes the name of the Culmen Length (mm) column so it matches R’s naming conventions, in addition to renaming the Sample Number column:
While we can use the rename() function to change the names of multiple columns in a data frame, this can be laborious if we need to correct many (or most) columns.
3.1.2 The janitor package
As the name implies, the janitor package includes several functions that can help us clean our data for use in R. Specifically, we’ll be using the clean_names() function. Run the following code cell to see how it affects our raw penguin data table.
How have the “cleaned” column names changed? How do they compare to column names in the cleaned penguin data table?
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
Use the clean_names function in combination with the select() function collect the columns from the raw data table that contain the same information as the columns in the clean data table. Recall, the select() allows us to grab or discard columns from a data frame by names (refer to the docs for examples).
Lastly, use the rename() function to change the culmen length/depth column names to “bill” length/depth. This matches the column names in the cleaned data.
While our partially cleaned data frame has a similar column structure compared to the full cleaned data frame, the contents of the columns are still different.
head(partially_cleaned_penguin_data)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Peng… Torge… 39.1 18.7 181 3750
2 Adelie Peng… Torge… 39.5 17.4 186 3800
3 Adelie Peng… Torge… 40.3 18 195 3250
4 Adelie Peng… Torge… NA NA NA NA
5 Adelie Peng… Torge… 36.7 19.3 193 3450
6 Adelie Peng… Torge… 39.3 20.6 190 3650
# ℹ 2 more variables: sex <chr>, date_egg <date>
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
What is differences do you notice between the contents of these two data tables? Pay special attention to the variable types in each column.
Recall from previous lessons we can use the mutate() function from the dplyr function to create new columns or modify the contents of existing columns. In the following sections we’ll learn about several packages and functions we can use in combination with mutate() to finish cleaning the penguin data.
3.2.1 The stringr package
Pay special attention to the contents of the species and sex columns. The species column in the partially cleaned data is quite a bit longer than the column in the full cleaned data. Specifically, it contains the word “Penguin”, as well the as the full genus and species name for each penguin. While the sex column contains the same information in the partially and fully cleaned data, the values in the partially cleaned data are all uppercase while those in the fully cleaned data are all lowercase.
Since the data in these two columns are character strings, we can use the stringr package. This package contains several functions designed to easily modify and manipulate character strings. For details on all functions in the stringr package, refer to the documentation or the Posit cheatsheet.
Let’s start by fixing the case discrepancy in the sex column. The str_to_lower() function from the stringr package takes a character string and converts all the letters in the string to lowercase.
Now let’s consider the contents of the species column. The fully cleaned penguin data contains just the common species name (Adelie, Chinstrap, Gentoo), while the partially cleaned data contains the common species name, the word “Penguin”, and the full genus and species names. Let’s examine all the unique values in the species column
# A tibble: 3 × 1
species
<fct>
1 Adelie
2 Gentoo
3 Chinstrap
Why might it be preferable to use just the common name? Think about what it would be like to use the species column to filter the data just for Gentoo penguins, or what figure legends would look like if different traces were colored by species.
Here we’ll use the str_remove() function to remove the portion of the strings in the species column that we don’t want.
The pattern argument in this function defines the portion of the string that we want to remove. This pattern is defined using what is called a “regular expression”.
Regular expressions
Regular expressions are a special kind of grammar that lets us loosely define a pattern in a character string that we’re looking for. While we’re not going to delve too deeply into writing regular expressions, here are some examples to help give you some intuition for how they work.
In their simplest forms, these patterns match portions of a string exactly.
There are special character combinations we can use in a regular expression to make a pattern more open-ended. For example, the period (“.”) will match any single character.
str_remove("Adelie Penguinssssss",pattern =".")
[1] "delie Penguinssssss"
Multiple periods will match multiple characters.
str_remove("Adelie Penguinssssss",pattern ="...")
[1] "lie Penguinssssss"
Lastly, we can use the plus sign (“+”) in combination with another character to match repeated occurrences of that character.
str_remove("Adelie Penguinssssss",pattern ="s+")
[1] "Adelie Penguin"
If we combine the period and plus signs, we match multiple occurrences of any combination of characters (i.e. it matches everything).
str_remove("Adelie Penguinssssss",pattern =".+")
[1] ""
Let’s review the regular expression we used above to remove everything from the species, except for the common name.
The pattern starts with ” Penguin”, so it will match any portion of a string where a space is followed by the word “Penguin”. The “.+” matches any combination of characters of any length. So taken together, this pattern matches a space, followed by the word “Penguin”, followed by any combination of characters of any length.
Let’s use the str_remove() function with the mutate() function to reduce the contents of the species column to just the common species name.
The partially cleaned data contains a date_egg column with the full calendar date on which the researhcers first observed one egg in each study nest. The cleaned penguin data contains a year column that only contains the year. While we could use stringr functions to try to extract the year component from each entry, we have a better option since this column is of date type.
The lubridate package contains a series of functions designed to make date/time objects easier to manage and manipulate. You can refer to the full lubridate documentation or the cheatsheet provided by Posit for details on the various functions in this package.
Here, we will use the year() function to extract just the year portion of a date object. Here’s an example:
ymd("2007-11-11")
[1] "2007-11-11"
year(ymd("2007-11-11"))
[1] 2007
Now use the year() function in combination with the mutate() function to create a new year column with just the year information from each study nest.
We are very close to producing a fully cleaned version of the penguin dataset. Comparing the displays of these two data tables, can you identify the remaining differences between them?
head(partially_cleaned_penguin_data)
# A tibble: 6 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 3 more variables: sex <chr>, date_egg <date>, year <dbl>
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
The species, island, and sex columns in the fully cleaned data are factors, while in the partially cleaned data they are character strings. Similarly, the flipper_length_mm and year columns in the fully cleaned data are integers, while in the partially cleaned data they are doubles.
We can use the mutate() function in combination with the as.factor() and as.integer() functions to correct these remaining differences.
Use these functions to convert the contents of the species and year columns to factors and integers, respectively.
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <chr>, date_egg <date>, year <int>
While this is an effective method for us to change individual columns, it can involve a lot of repeated code if we need to perform the same operation across multiple columns. This is where the across() function can help. As the name implies, this function is designed to make it easy for us to apply the same operation to multiple columns. Here’s how we’d use the across() function to convert the contents of the flipper_length_mm and year columns to integers.
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <chr>, date_egg <date>, year <int>
The .cols argument is where we provide a vector naming the columns we want to modify. The .fns argument is where we define the function that we want to use to modify each of the columns. Note, when we provide the as.intger function to the .fns argument, we don’t include any parentheses.
Using the across() function saves us the need to write repeated code for each of the columns we want to modify. However, we still need to provide the name of each column we want to modify, which could be a problem we want to work on many columns. Luckily, there are alternative ways we can define these columns. Here we use the where() function to identify all the columns that are currently of character type. This allows us to convert them all to factors.