Base R and Tidyverse syntax comparison for data wrangling

Author
Affiliation

Emi Tanaka

Australian National University

Published

January 12, 2025

Modified

January 12, 2025

1 Introduction

Tidyverse is a popular collection of R packages that provides a consistent interface for data science tasks, including data wrangling for tabular data, strings, factors, date and time, and lists (Wickham et al. 2019). This live document provides a comparison between Base R and tidyverse syntax for data wrangling. Note that this document includes many of the data wrangling tasks but not all and will be continously updated. The Base R approach is shown on the left and the Tidyverse approach is shown on the right.

For the following comparisons, each code cell is excecuted independently so the previous code cell do not affect the code cell thereafter1.

library(tidyverse)
Output
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2 Tabular data

We make the syntax comparison using a subset of the penguins data in the palmerpenguins dataset (Horst, Hill, and Gorman 2020) named penguins_small as shown in Table 1. The presentation of the penguins_small is modified by changing it to a tibble and data frame, named as tbl and df, respectively.

library(palmerpenguins)
penguins_small <- penguins |> 
  slice(seq(1, 300, 30)) |> 
  select(species, sex, bill_length_mm, bill_depth_mm, body_mass_g) |> 
  rename(mass = body_mass_g)
Table 1: The penguins_small data.
species sex bill_length_mm bill_depth_mm mass
Adelie male 39.1 18.7 3750
Adelie female 39.5 16.7 3250
Adelie female 35.7 16.9 3150
Adelie female 35.7 18.0 3550
Adelie female 36.2 17.2 3150
Adelie female 36.0 17.1 3700
Gentoo female 48.2 14.3 4600
Gentoo female 43.2 14.5 4450
Gentoo female 47.5 14.0 4875
Gentoo female 47.2 13.7 4925
tbl <- as_tibble(penguins_small)
df <- as.data.frame(penguins_small)

As an aside, we also show comparison of the syntax of similar operations using data.table package (Barrett et al. 2024). The penguins_small data is converted to a data.table object named dt.

library(data.table)
dt <- as.data.table(penguins_small)

2.1 Selecting a single column as a vector

Extract the sex column as a vector.

df$sex
Output
 [1] male   female female female female female female female female female
Levels: female male
df[["sex"]]
Output
 [1] male   female female female female female female female female female
Levels: female male
df[, "sex", drop = TRUE]
Output
 [1] male   female female female female female female female female female
Levels: female male
pull(tbl, sex)
Output
 [1] male   female female female female female female female female female
Levels: female male
dt[, .(sex)]
Output
       sex
    <fctr>
 1:   male
 2: female
 3: female
 4: female
 5: female
 6: female
 7: female
 8: female
 9: female
10: female

2.2 Selecting columns

2.2.1 By column name

Select the sex and species columns.

subset(df, 
       select = c(sex, species))
Output
      sex species
1    male  Adelie
2  female  Adelie
3  female  Adelie
4  female  Adelie
5  female  Adelie
6  female  Adelie
7  female  Gentoo
8  female  Gentoo
9  female  Gentoo
10 female  Gentoo
subset(df, 
       select = c("sex", "species"))
Output
      sex species
1    male  Adelie
2  female  Adelie
3  female  Adelie
4  female  Adelie
5  female  Adelie
6  female  Adelie
7  female  Gentoo
8  female  Gentoo
9  female  Gentoo
10 female  Gentoo
df[c("sex", "species")]
Output
      sex species
1    male  Adelie
2  female  Adelie
3  female  Adelie
4  female  Adelie
5  female  Adelie
6  female  Adelie
7  female  Gentoo
8  female  Gentoo
9  female  Gentoo
10 female  Gentoo
df[, c("sex", "species")]
Output
      sex species
1    male  Adelie
2  female  Adelie
3  female  Adelie
4  female  Adelie
5  female  Adelie
6  female  Adelie
7  female  Gentoo
8  female  Gentoo
9  female  Gentoo
10 female  Gentoo
select(tbl, sex, species)
Output
# A tibble: 10 × 2
   sex    species
   <fct>  <fct>  
 1 male   Adelie 
 2 female Adelie 
 3 female Adelie 
 4 female Adelie 
 5 female Adelie 
 6 female Adelie 
 7 female Gentoo 
 8 female Gentoo 
 9 female Gentoo 
10 female Gentoo 
select(tbl, c(sex, species))
Output
# A tibble: 10 × 2
   sex    species
   <fct>  <fct>  
 1 male   Adelie 
 2 female Adelie 
 3 female Adelie 
 4 female Adelie 
 5 female Adelie 
 6 female Adelie 
 7 female Gentoo 
 8 female Gentoo 
 9 female Gentoo 
10 female Gentoo 
select(tbl, c("sex", "species"))
Output
# A tibble: 10 × 2
   sex    species
   <fct>  <fct>  
 1 male   Adelie 
 2 female Adelie 
 3 female Adelie 
 4 female Adelie 
 5 female Adelie 
 6 female Adelie 
 7 female Gentoo 
 8 female Gentoo 
 9 female Gentoo 
10 female Gentoo 
dt[, .(sex, species)]
Output
       sex species
    <fctr>  <fctr>
 1:   male  Adelie
 2: female  Adelie
 3: female  Adelie
 4: female  Adelie
 5: female  Adelie
 6: female  Adelie
 7: female  Gentoo
 8: female  Gentoo
 9: female  Gentoo
10: female  Gentoo

2.2.2 By position

Select the first and second columns.

subset(df, select = c(1, 2))
Output
   species    sex
1   Adelie   male
2   Adelie female
3   Adelie female
4   Adelie female
5   Adelie female
6   Adelie female
7   Gentoo female
8   Gentoo female
9   Gentoo female
10  Gentoo female
df[c(1, 2)]
Output
   species    sex
1   Adelie   male
2   Adelie female
3   Adelie female
4   Adelie female
5   Adelie female
6   Adelie female
7   Gentoo female
8   Gentoo female
9   Gentoo female
10  Gentoo female
df[, c(1, 2)]
Output
   species    sex
1   Adelie   male
2   Adelie female
3   Adelie female
4   Adelie female
5   Adelie female
6   Adelie female
7   Gentoo female
8   Gentoo female
9   Gentoo female
10  Gentoo female
select(tbl, 1, 2)
Output
# A tibble: 10 × 2
   species sex   
   <fct>   <fct> 
 1 Adelie  male  
 2 Adelie  female
 3 Adelie  female
 4 Adelie  female
 5 Adelie  female
 6 Adelie  female
 7 Gentoo  female
 8 Gentoo  female
 9 Gentoo  female
10 Gentoo  female
select(tbl, c(1, 2))
Output
# A tibble: 10 × 2
   species sex   
   <fct>   <fct> 
 1 Adelie  male  
 2 Adelie  female
 3 Adelie  female
 4 Adelie  female
 5 Adelie  female
 6 Adelie  female
 7 Gentoo  female
 8 Gentoo  female
 9 Gentoo  female
10 Gentoo  female
dt[, c(1, 2)]
Output
    species    sex
     <fctr> <fctr>
 1:  Adelie   male
 2:  Adelie female
 3:  Adelie female
 4:  Adelie female
 5:  Adelie female
 6:  Adelie female
 7:  Gentoo female
 8:  Gentoo female
 9:  Gentoo female
10:  Gentoo female

2.2.3 By booleans

Select columns that are factors.

subset(df, select = sapply(df, is.factor))
Output
   species    sex
1   Adelie   male
2   Adelie female
3   Adelie female
4   Adelie female
5   Adelie female
6   Adelie female
7   Gentoo female
8   Gentoo female
9   Gentoo female
10  Gentoo female
df[sapply(df, is.factor)]
Output
   species    sex
1   Adelie   male
2   Adelie female
3   Adelie female
4   Adelie female
5   Adelie female
6   Adelie female
7   Gentoo female
8   Gentoo female
9   Gentoo female
10  Gentoo female
df[, sapply(df, is.factor)]
Output
   species    sex
1   Adelie   male
2   Adelie female
3   Adelie female
4   Adelie female
5   Adelie female
6   Adelie female
7   Gentoo female
8   Gentoo female
9   Gentoo female
10  Gentoo female
select(tbl, where(is.factor))
Output
# A tibble: 10 × 2
   species sex   
   <fct>   <fct> 
 1 Adelie  male  
 2 Adelie  female
 3 Adelie  female
 4 Adelie  female
 5 Adelie  female
 6 Adelie  female
 7 Gentoo  female
 8 Gentoo  female
 9 Gentoo  female
10 Gentoo  female
dt[, sapply(dt, is.factor), with = FALSE]
Output
    species    sex
     <fctr> <fctr>
 1:  Adelie   male
 2:  Adelie female
 3:  Adelie female
 4:  Adelie female
 5:  Adelie female
 6:  Adelie female
 7:  Gentoo female
 8:  Gentoo female
 9:  Gentoo female
10:  Gentoo female

2.3 Deleting one column

Delete the sex column.

subset(df, select = -sex)
Output
   species bill_length_mm bill_depth_mm mass
1   Adelie           39.1          18.7 3750
2   Adelie           39.5          16.7 3250
3   Adelie           35.7          16.9 3150
4   Adelie           35.7          18.0 3550
5   Adelie           36.2          17.2 3150
6   Adelie           36.0          17.1 3700
7   Gentoo           48.2          14.3 4600
8   Gentoo           43.2          14.5 4450
9   Gentoo           47.5          14.0 4875
10  Gentoo           47.2          13.7 4925
df$sex <- NULL
df
Output
   species bill_length_mm bill_depth_mm mass
1   Adelie           39.1          18.7 3750
2   Adelie           39.5          16.7 3250
3   Adelie           35.7          16.9 3150
4   Adelie           35.7          18.0 3550
5   Adelie           36.2          17.2 3150
6   Adelie           36.0          17.1 3700
7   Gentoo           48.2          14.3 4600
8   Gentoo           43.2          14.5 4450
9   Gentoo           47.5          14.0 4875
10  Gentoo           47.2          13.7 4925
select(tbl, -sex)
Output
# A tibble: 10 × 4
   species bill_length_mm bill_depth_mm  mass
   <fct>            <dbl>         <dbl> <int>
 1 Adelie            39.1          18.7  3750
 2 Adelie            39.5          16.7  3250
 3 Adelie            35.7          16.9  3150
 4 Adelie            35.7          18    3550
 5 Adelie            36.2          17.2  3150
 6 Adelie            36            17.1  3700
 7 Gentoo            48.2          14.3  4600
 8 Gentoo            43.2          14.5  4450
 9 Gentoo            47.5          14    4875
10 Gentoo            47.2          13.7  4925
mutate(tbl, sex = NULL)
Output
# A tibble: 10 × 4
   species bill_length_mm bill_depth_mm  mass
   <fct>            <dbl>         <dbl> <int>
 1 Adelie            39.1          18.7  3750
 2 Adelie            39.5          16.7  3250
 3 Adelie            35.7          16.9  3150
 4 Adelie            35.7          18    3550
 5 Adelie            36.2          17.2  3150
 6 Adelie            36            17.1  3700
 7 Gentoo            48.2          14.3  4600
 8 Gentoo            43.2          14.5  4450
 9 Gentoo            47.5          14    4875
10 Gentoo            47.2          13.7  4925
dt[, sex := NULL]
dt
Output
    species bill_length_mm bill_depth_mm  mass
     <fctr>          <num>         <num> <int>
 1:  Adelie           39.1          18.7  3750
 2:  Adelie           39.5          16.7  3250
 3:  Adelie           35.7          16.9  3150
 4:  Adelie           35.7          18.0  3550
 5:  Adelie           36.2          17.2  3150
 6:  Adelie           36.0          17.1  3700
 7:  Gentoo           48.2          14.3  4600
 8:  Gentoo           43.2          14.5  4450
 9:  Gentoo           47.5          14.0  4875
10:  Gentoo           47.2          13.7  4925

2.4 Deleting multiple column

Delete the sex and species columns.

subset(df, select = -c(sex, species))
Output
   bill_length_mm bill_depth_mm mass
1            39.1          18.7 3750
2            39.5          16.7 3250
3            35.7          16.9 3150
4            35.7          18.0 3550
5            36.2          17.2 3150
6            36.0          17.1 3700
7            48.2          14.3 4600
8            43.2          14.5 4450
9            47.5          14.0 4875
10           47.2          13.7 4925
df[c("sex", "species")] <- NULL
df
Output
   bill_length_mm bill_depth_mm mass
1            39.1          18.7 3750
2            39.5          16.7 3250
3            35.7          16.9 3150
4            35.7          18.0 3550
5            36.2          17.2 3150
6            36.0          17.1 3700
7            48.2          14.3 4600
8            43.2          14.5 4450
9            47.5          14.0 4875
10           47.2          13.7 4925
select(tbl, -c(sex, species))
Output
# A tibble: 10 × 3
   bill_length_mm bill_depth_mm  mass
            <dbl>         <dbl> <int>
 1           39.1          18.7  3750
 2           39.5          16.7  3250
 3           35.7          16.9  3150
 4           35.7          18    3550
 5           36.2          17.2  3150
 6           36            17.1  3700
 7           48.2          14.3  4600
 8           43.2          14.5  4450
 9           47.5          14    4875
10           47.2          13.7  4925
select(tbl, !c(sex, species))
Output
# A tibble: 10 × 3
   bill_length_mm bill_depth_mm  mass
            <dbl>         <dbl> <int>
 1           39.1          18.7  3750
 2           39.5          16.7  3250
 3           35.7          16.9  3150
 4           35.7          18    3550
 5           36.2          17.2  3150
 6           36            17.1  3700
 7           48.2          14.3  4600
 8           43.2          14.5  4450
 9           47.5          14    4875
10           47.2          13.7  4925
dt[, !c("sex", "species")]
Output
    bill_length_mm bill_depth_mm  mass
             <num>         <num> <int>
 1:           39.1          18.7  3750
 2:           39.5          16.7  3250
 3:           35.7          16.9  3150
 4:           35.7          18.0  3550
 5:           36.2          17.2  3150
 6:           36.0          17.1  3700
 7:           48.2          14.3  4600
 8:           43.2          14.5  4450
 9:           47.5          14.0  4875
10:           47.2          13.7  4925

2.5 Arrange columns

Order the columns such that sex is the first column.

cols <- colnames(df)
new_order <- c("sex", setdiff(cols, "sex"))
subset(df, select = new_order)
Output
      sex species bill_length_mm bill_depth_mm mass
1    male  Adelie           39.1          18.7 3750
2  female  Adelie           39.5          16.7 3250
3  female  Adelie           35.7          16.9 3150
4  female  Adelie           35.7          18.0 3550
5  female  Adelie           36.2          17.2 3150
6  female  Adelie           36.0          17.1 3700
7  female  Gentoo           48.2          14.3 4600
8  female  Gentoo           43.2          14.5 4450
9  female  Gentoo           47.5          14.0 4875
10 female  Gentoo           47.2          13.7 4925
df[new_order]
Output
      sex species bill_length_mm bill_depth_mm mass
1    male  Adelie           39.1          18.7 3750
2  female  Adelie           39.5          16.7 3250
3  female  Adelie           35.7          16.9 3150
4  female  Adelie           35.7          18.0 3550
5  female  Adelie           36.2          17.2 3150
6  female  Adelie           36.0          17.1 3700
7  female  Gentoo           48.2          14.3 4600
8  female  Gentoo           43.2          14.5 4450
9  female  Gentoo           47.5          14.0 4875
10 female  Gentoo           47.2          13.7 4925
df[, new_order]
Output
      sex species bill_length_mm bill_depth_mm mass
1    male  Adelie           39.1          18.7 3750
2  female  Adelie           39.5          16.7 3250
3  female  Adelie           35.7          16.9 3150
4  female  Adelie           35.7          18.0 3550
5  female  Adelie           36.2          17.2 3150
6  female  Adelie           36.0          17.1 3700
7  female  Gentoo           48.2          14.3 4600
8  female  Gentoo           43.2          14.5 4450
9  female  Gentoo           47.5          14.0 4875
10 female  Gentoo           47.2          13.7 4925
select(tbl, sex, everything())
Output
# A tibble: 10 × 5
   sex    species bill_length_mm bill_depth_mm  mass
   <fct>  <fct>            <dbl>         <dbl> <int>
 1 male   Adelie            39.1          18.7  3750
 2 female Adelie            39.5          16.7  3250
 3 female Adelie            35.7          16.9  3150
 4 female Adelie            35.7          18    3550
 5 female Adelie            36.2          17.2  3150
 6 female Adelie            36            17.1  3700
 7 female Gentoo            48.2          14.3  4600
 8 female Gentoo            43.2          14.5  4450
 9 female Gentoo            47.5          14    4875
10 female Gentoo            47.2          13.7  4925
relocate(tbl, sex, .before = species)
Output
# A tibble: 10 × 5
   sex    species bill_length_mm bill_depth_mm  mass
   <fct>  <fct>            <dbl>         <dbl> <int>
 1 male   Adelie            39.1          18.7  3750
 2 female Adelie            39.5          16.7  3250
 3 female Adelie            35.7          16.9  3150
 4 female Adelie            35.7          18    3550
 5 female Adelie            36.2          17.2  3150
 6 female Adelie            36            17.1  3700
 7 female Gentoo            48.2          14.3  4600
 8 female Gentoo            43.2          14.5  4450
 9 female Gentoo            47.5          14    4875
10 female Gentoo            47.2          13.7  4925
cols <- colnames(dt)
new_order <- c("sex", setdiff(cols, "sex"))
dt[, ..new_order]
Output
       sex species bill_length_mm bill_depth_mm  mass
    <fctr>  <fctr>          <num>         <num> <int>
 1:   male  Adelie           39.1          18.7  3750
 2: female  Adelie           39.5          16.7  3250
 3: female  Adelie           35.7          16.9  3150
 4: female  Adelie           35.7          18.0  3550
 5: female  Adelie           36.2          17.2  3150
 6: female  Adelie           36.0          17.1  3700
 7: female  Gentoo           48.2          14.3  4600
 8: female  Gentoo           43.2          14.5  4450
 9: female  Gentoo           47.5          14.0  4875
10: female  Gentoo           47.2          13.7  4925

2.6 Rename columns

2.6.1 Rename one column

Rename the bill_length_mm column to length.

colnames(df)[3] <- "length"
df
Output
   species    sex length bill_depth_mm mass
1   Adelie   male   39.1          18.7 3750
2   Adelie female   39.5          16.7 3250
3   Adelie female   35.7          16.9 3150
4   Adelie female   35.7          18.0 3550
5   Adelie female   36.2          17.2 3150
6   Adelie female   36.0          17.1 3700
7   Gentoo female   48.2          14.3 4600
8   Gentoo female   43.2          14.5 4450
9   Gentoo female   47.5          14.0 4875
10  Gentoo female   47.2          13.7 4925
names(df)[3] <- "length"
df
Output
   species    sex length bill_depth_mm mass
1   Adelie   male   39.1          18.7 3750
2   Adelie female   39.5          16.7 3250
3   Adelie female   35.7          16.9 3150
4   Adelie female   35.7          18.0 3550
5   Adelie female   36.2          17.2 3150
6   Adelie female   36.0          17.1 3700
7   Gentoo female   48.2          14.3 4600
8   Gentoo female   43.2          14.5 4450
9   Gentoo female   47.5          14.0 4875
10  Gentoo female   47.2          13.7 4925
rename(tbl, 
       length = bill_length_mm)
Output
# A tibble: 10 × 5
   species sex    length bill_depth_mm  mass
   <fct>   <fct>   <dbl>         <dbl> <int>
 1 Adelie  male     39.1          18.7  3750
 2 Adelie  female   39.5          16.7  3250
 3 Adelie  female   35.7          16.9  3150
 4 Adelie  female   35.7          18    3550
 5 Adelie  female   36.2          17.2  3150
 6 Adelie  female   36            17.1  3700
 7 Gentoo  female   48.2          14.3  4600
 8 Gentoo  female   43.2          14.5  4450
 9 Gentoo  female   47.5          14    4875
10 Gentoo  female   47.2          13.7  4925
# changes order at the same time though
select(tbl, 
       length = bill_length_mm,
       everything())
Output
# A tibble: 10 × 5
   length species sex    bill_depth_mm  mass
    <dbl> <fct>   <fct>          <dbl> <int>
 1   39.1 Adelie  male            18.7  3750
 2   39.5 Adelie  female          16.7  3250
 3   35.7 Adelie  female          16.9  3150
 4   35.7 Adelie  female          18    3550
 5   36.2 Adelie  female          17.2  3150
 6   36   Adelie  female          17.1  3700
 7   48.2 Gentoo  female          14.3  4600
 8   43.2 Gentoo  female          14.5  4450
 9   47.5 Gentoo  female          14    4875
10   47.2 Gentoo  female          13.7  4925
nm_dict <- c("length" = "bill_length_mm")
rename(tbl, any_of(nm_dict))
Output
# A tibble: 10 × 5
   species sex    length bill_depth_mm  mass
   <fct>   <fct>   <dbl>         <dbl> <int>
 1 Adelie  male     39.1          18.7  3750
 2 Adelie  female   39.5          16.7  3250
 3 Adelie  female   35.7          16.9  3150
 4 Adelie  female   35.7          18    3550
 5 Adelie  female   36.2          17.2  3150
 6 Adelie  female   36            17.1  3700
 7 Gentoo  female   48.2          14.3  4600
 8 Gentoo  female   43.2          14.5  4450
 9 Gentoo  female   47.5          14    4875
10 Gentoo  female   47.2          13.7  4925
setnames(dt, "bill_length_mm", "length")
dt
Output
    species    sex length bill_depth_mm  mass
     <fctr> <fctr>  <num>         <num> <int>
 1:  Adelie   male   39.1          18.7  3750
 2:  Adelie female   39.5          16.7  3250
 3:  Adelie female   35.7          16.9  3150
 4:  Adelie female   35.7          18.0  3550
 5:  Adelie female   36.2          17.2  3150
 6:  Adelie female   36.0          17.1  3700
 7:  Gentoo female   48.2          14.3  4600
 8:  Gentoo female   43.2          14.5  4450
 9:  Gentoo female   47.5          14.0  4875
10:  Gentoo female   47.2          13.7  4925

2.6.2 Rename multiple columns

Rename the bill_length_mm and bill_depth_mm columns to bill_length and bill_depth, respectively, by a function that removes the suffix “_mm”.

cols <- colnames(df)
colnames(df) <- sub("_mm", "", cols)
df
Output
   species    sex bill_length bill_depth mass
1   Adelie   male        39.1       18.7 3750
2   Adelie female        39.5       16.7 3250
3   Adelie female        35.7       16.9 3150
4   Adelie female        35.7       18.0 3550
5   Adelie female        36.2       17.2 3150
6   Adelie female        36.0       17.1 3700
7   Gentoo female        48.2       14.3 4600
8   Gentoo female        43.2       14.5 4450
9   Gentoo female        47.5       14.0 4875
10  Gentoo female        47.2       13.7 4925
names(df)[3:4] <- sub("_mm", "", cols[3:4])
df
Output
   species    sex bill_length bill_depth mass
1   Adelie   male        39.1       18.7 3750
2   Adelie female        39.5       16.7 3250
3   Adelie female        35.7       16.9 3150
4   Adelie female        35.7       18.0 3550
5   Adelie female        36.2       17.2 3150
6   Adelie female        36.0       17.1 3700
7   Gentoo female        48.2       14.3 4600
8   Gentoo female        43.2       14.5 4450
9   Gentoo female        47.5       14.0 4875
10  Gentoo female        47.2       13.7 4925
rename_with(tbl, 
            ~str_remove(., "_mm"))
Output
# A tibble: 10 × 5
   species sex    bill_length bill_depth  mass
   <fct>   <fct>        <dbl>      <dbl> <int>
 1 Adelie  male          39.1       18.7  3750
 2 Adelie  female        39.5       16.7  3250
 3 Adelie  female        35.7       16.9  3150
 4 Adelie  female        35.7       18    3550
 5 Adelie  female        36.2       17.2  3150
 6 Adelie  female        36         17.1  3700
 7 Gentoo  female        48.2       14.3  4600
 8 Gentoo  female        43.2       14.5  4450
 9 Gentoo  female        47.5       14    4875
10 Gentoo  female        47.2       13.7  4925
rename_with(tbl, 
            ~str_remove(., "_mm"), 3:4)
Output
# A tibble: 10 × 5
   species sex    bill_length bill_depth  mass
   <fct>   <fct>        <dbl>      <dbl> <int>
 1 Adelie  male          39.1       18.7  3750
 2 Adelie  female        39.5       16.7  3250
 3 Adelie  female        35.7       16.9  3150
 4 Adelie  female        35.7       18    3550
 5 Adelie  female        36.2       17.2  3150
 6 Adelie  female        36         17.1  3700
 7 Gentoo  female        48.2       14.3  4600
 8 Gentoo  female        43.2       14.5  4450
 9 Gentoo  female        47.5       14    4875
10 Gentoo  female        47.2       13.7  4925
cols <- colnames(dt)
setnames(dt, cols, sub("_mm", "", cols))
dt
Output
    species    sex bill_length bill_depth  mass
     <fctr> <fctr>       <num>      <num> <int>
 1:  Adelie   male        39.1       18.7  3750
 2:  Adelie female        39.5       16.7  3250
 3:  Adelie female        35.7       16.9  3150
 4:  Adelie female        35.7       18.0  3550
 5:  Adelie female        36.2       17.2  3150
 6:  Adelie female        36.0       17.1  3700
 7:  Gentoo female        48.2       14.3  4600
 8:  Gentoo female        43.2       14.5  4450
 9:  Gentoo female        47.5       14.0  4875
10:  Gentoo female        47.2       13.7  4925
setnames(dt, cols[3:4], sub("_mm", "", cols[3:4]))
dt
Output
    species    sex bill_length bill_depth  mass
     <fctr> <fctr>       <num>      <num> <int>
 1:  Adelie   male        39.1       18.7  3750
 2:  Adelie female        39.5       16.7  3250
 3:  Adelie female        35.7       16.9  3150
 4:  Adelie female        35.7       18.0  3550
 5:  Adelie female        36.2       17.2  3150
 6:  Adelie female        36.0       17.1  3700
 7:  Gentoo female        48.2       14.3  4600
 8:  Gentoo female        43.2       14.5  4450
 9:  Gentoo female        47.5       14.0  4875
10:  Gentoo female        47.2       13.7  4925

2.7 Subset rows

2.7.1 By position

Select the first two rows.

df[1:2, ]
Output
  species    sex bill_length_mm bill_depth_mm mass
1  Adelie   male           39.1          18.7 3750
2  Adelie female           39.5          16.7 3250
df[-c(1:2), ]
Output
   species    sex bill_length_mm bill_depth_mm mass
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
5   Adelie female           36.2          17.2 3150
6   Adelie female           36.0          17.1 3700
7   Gentoo female           48.2          14.3 4600
8   Gentoo female           43.2          14.5 4450
9   Gentoo female           47.5          14.0 4875
10  Gentoo female           47.2          13.7 4925
slice(tbl, 1:2)
Output
# A tibble: 2 × 5
  species sex    bill_length_mm bill_depth_mm  mass
  <fct>   <fct>           <dbl>         <dbl> <int>
1 Adelie  male             39.1          18.7  3750
2 Adelie  female           39.5          16.7  3250
slice(tbl, -c(1:2))
Output
# A tibble: 8 × 5
  species sex    bill_length_mm bill_depth_mm  mass
  <fct>   <fct>           <dbl>         <dbl> <int>
1 Adelie  female           35.7          16.9  3150
2 Adelie  female           35.7          18    3550
3 Adelie  female           36.2          17.2  3150
4 Adelie  female           36            17.1  3700
5 Gentoo  female           48.2          14.3  4600
6 Gentoo  female           43.2          14.5  4450
7 Gentoo  female           47.5          14    4875
8 Gentoo  female           47.2          13.7  4925
dt[1:2, ]
Output
   species    sex bill_length_mm bill_depth_mm  mass
    <fctr> <fctr>          <num>         <num> <int>
1:  Adelie   male           39.1          18.7  3750
2:  Adelie female           39.5          16.7  3250

2.7.2 By booleans

Select rows where the classification of sex is not missing.

df[!is.na(df$sex), ]
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7 3750
2   Adelie female           39.5          16.7 3250
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
5   Adelie female           36.2          17.2 3150
6   Adelie female           36.0          17.1 3700
7   Gentoo female           48.2          14.3 4600
8   Gentoo female           43.2          14.5 4450
9   Gentoo female           47.5          14.0 4875
10  Gentoo female           47.2          13.7 4925
subset(df, !is.na(sex))
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7 3750
2   Adelie female           39.5          16.7 3250
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
5   Adelie female           36.2          17.2 3150
6   Adelie female           36.0          17.1 3700
7   Gentoo female           48.2          14.3 4600
8   Gentoo female           43.2          14.5 4450
9   Gentoo female           47.5          14.0 4875
10  Gentoo female           47.2          13.7 4925
filter(tbl, !is.na(sex))
Output
# A tibble: 10 × 5
   species sex    bill_length_mm bill_depth_mm  mass
   <fct>   <fct>           <dbl>         <dbl> <int>
 1 Adelie  male             39.1          18.7  3750
 2 Adelie  female           39.5          16.7  3250
 3 Adelie  female           35.7          16.9  3150
 4 Adelie  female           35.7          18    3550
 5 Adelie  female           36.2          17.2  3150
 6 Adelie  female           36            17.1  3700
 7 Gentoo  female           48.2          14.3  4600
 8 Gentoo  female           43.2          14.5  4450
 9 Gentoo  female           47.5          14    4875
10 Gentoo  female           47.2          13.7  4925
dt[!is.na(sex), ]
Output
    species    sex bill_length_mm bill_depth_mm  mass
     <fctr> <fctr>          <num>         <num> <int>
 1:  Adelie   male           39.1          18.7  3750
 2:  Adelie female           39.5          16.7  3250
 3:  Adelie female           35.7          16.9  3150
 4:  Adelie female           35.7          18.0  3550
 5:  Adelie female           36.2          17.2  3150
 6:  Adelie female           36.0          17.1  3700
 7:  Gentoo female           48.2          14.3  4600
 8:  Gentoo female           43.2          14.5  4450
 9:  Gentoo female           47.5          14.0  4875
10:  Gentoo female           47.2          13.7  4925

2.8 Arrange rows

Order the rows by sex and bill_length_mm in ascending order.

df[order(df$sex, df$bill_length_mm), ]
Output
   species    sex bill_length_mm bill_depth_mm mass
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
6   Adelie female           36.0          17.1 3700
5   Adelie female           36.2          17.2 3150
2   Adelie female           39.5          16.7 3250
8   Gentoo female           43.2          14.5 4450
10  Gentoo female           47.2          13.7 4925
9   Gentoo female           47.5          14.0 4875
7   Gentoo female           48.2          14.3 4600
1   Adelie   male           39.1          18.7 3750
arrange(tbl, sex, bill_length_mm)
Output
# A tibble: 10 × 5
   species sex    bill_length_mm bill_depth_mm  mass
   <fct>   <fct>           <dbl>         <dbl> <int>
 1 Adelie  female           35.7          16.9  3150
 2 Adelie  female           35.7          18    3550
 3 Adelie  female           36            17.1  3700
 4 Adelie  female           36.2          17.2  3150
 5 Adelie  female           39.5          16.7  3250
 6 Gentoo  female           43.2          14.5  4450
 7 Gentoo  female           47.2          13.7  4925
 8 Gentoo  female           47.5          14    4875
 9 Gentoo  female           48.2          14.3  4600
10 Adelie  male             39.1          18.7  3750
dt[order(sex, bill_length_mm), ]
Output
    species    sex bill_length_mm bill_depth_mm  mass
     <fctr> <fctr>          <num>         <num> <int>
 1:  Adelie female           35.7          16.9  3150
 2:  Adelie female           35.7          18.0  3550
 3:  Adelie female           36.0          17.1  3700
 4:  Adelie female           36.2          17.2  3150
 5:  Adelie female           39.5          16.7  3250
 6:  Gentoo female           43.2          14.5  4450
 7:  Gentoo female           47.2          13.7  4925
 8:  Gentoo female           47.5          14.0  4875
 9:  Gentoo female           48.2          14.3  4600
10:  Adelie   male           39.1          18.7  3750

Order the rows by sex and bill_length_mm in descending order.

df[order(df$sex, 
         df$bill_length_mm,
         decreasing = TRUE), ]
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7 3750
7   Gentoo female           48.2          14.3 4600
9   Gentoo female           47.5          14.0 4875
10  Gentoo female           47.2          13.7 4925
8   Gentoo female           43.2          14.5 4450
2   Adelie female           39.5          16.7 3250
5   Adelie female           36.2          17.2 3150
6   Adelie female           36.0          17.1 3700
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
arrange(tbl, desc(sex), desc(bill_length_mm))
Output
# A tibble: 10 × 5
   species sex    bill_length_mm bill_depth_mm  mass
   <fct>   <fct>           <dbl>         <dbl> <int>
 1 Adelie  male             39.1          18.7  3750
 2 Gentoo  female           48.2          14.3  4600
 3 Gentoo  female           47.5          14    4875
 4 Gentoo  female           47.2          13.7  4925
 5 Gentoo  female           43.2          14.5  4450
 6 Adelie  female           39.5          16.7  3250
 7 Adelie  female           36.2          17.2  3150
 8 Adelie  female           36            17.1  3700
 9 Adelie  female           35.7          16.9  3150
10 Adelie  female           35.7          18    3550
dt[order(sex, bill_length_mm, decreasing = TRUE), ]
Output
    species    sex bill_length_mm bill_depth_mm  mass
     <fctr> <fctr>          <num>         <num> <int>
 1:  Adelie   male           39.1          18.7  3750
 2:  Gentoo female           48.2          14.3  4600
 3:  Gentoo female           47.5          14.0  4875
 4:  Gentoo female           47.2          13.7  4925
 5:  Gentoo female           43.2          14.5  4450
 6:  Adelie female           39.5          16.7  3250
 7:  Adelie female           36.2          17.2  3150
 8:  Adelie female           36.0          17.1  3700
 9:  Adelie female           35.7          16.9  3150
10:  Adelie female           35.7          18.0  3550

Order the rows by sex in descending order and bill_length_mm in ascending order.

df[order(factor(df$sex, 
                levels = rev(levels(df$sex))), 
         df$bill_length_mm), ]
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7 3750
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
6   Adelie female           36.0          17.1 3700
5   Adelie female           36.2          17.2 3150
2   Adelie female           39.5          16.7 3250
8   Gentoo female           43.2          14.5 4450
10  Gentoo female           47.2          13.7 4925
9   Gentoo female           47.5          14.0 4875
7   Gentoo female           48.2          14.3 4600
arrange(tbl, desc(sex), bill_length_mm)
Output
# A tibble: 10 × 5
   species sex    bill_length_mm bill_depth_mm  mass
   <fct>   <fct>           <dbl>         <dbl> <int>
 1 Adelie  male             39.1          18.7  3750
 2 Adelie  female           35.7          16.9  3150
 3 Adelie  female           35.7          18    3550
 4 Adelie  female           36            17.1  3700
 5 Adelie  female           36.2          17.2  3150
 6 Adelie  female           39.5          16.7  3250
 7 Gentoo  female           43.2          14.5  4450
 8 Gentoo  female           47.2          13.7  4925
 9 Gentoo  female           47.5          14    4875
10 Gentoo  female           48.2          14.3  4600
dt[order(factor(sex, 
                levels = rev(levels(sex))), 
         bill_length_mm), ]
Output
    species    sex bill_length_mm bill_depth_mm  mass
     <fctr> <fctr>          <num>         <num> <int>
 1:  Adelie   male           39.1          18.7  3750
 2:  Adelie female           35.7          16.9  3150
 3:  Adelie female           35.7          18.0  3550
 4:  Adelie female           36.0          17.1  3700
 5:  Adelie female           36.2          17.2  3150
 6:  Adelie female           39.5          16.7  3250
 7:  Gentoo female           43.2          14.5  4450
 8:  Gentoo female           47.2          13.7  4925
 9:  Gentoo female           47.5          14.0  4875
10:  Gentoo female           48.2          14.3  4600

2.9 Insert or update column

Add a new column mpbl that is the ratio of mass to bill_length_mm.

transform(df, mpbl = mass / bill_length_mm)
Output
   species    sex bill_length_mm bill_depth_mm mass      mpbl
1   Adelie   male           39.1          18.7 3750  95.90793
2   Adelie female           39.5          16.7 3250  82.27848
3   Adelie female           35.7          16.9 3150  88.23529
4   Adelie female           35.7          18.0 3550  99.43978
5   Adelie female           36.2          17.2 3150  87.01657
6   Adelie female           36.0          17.1 3700 102.77778
7   Gentoo female           48.2          14.3 4600  95.43568
8   Gentoo female           43.2          14.5 4450 103.00926
9   Gentoo female           47.5          14.0 4875 102.63158
10  Gentoo female           47.2          13.7 4925 104.34322
df$mpbl <- df$mass / df$bill_length_mm
df
Output
   species    sex bill_length_mm bill_depth_mm mass      mpbl
1   Adelie   male           39.1          18.7 3750  95.90793
2   Adelie female           39.5          16.7 3250  82.27848
3   Adelie female           35.7          16.9 3150  88.23529
4   Adelie female           35.7          18.0 3550  99.43978
5   Adelie female           36.2          17.2 3150  87.01657
6   Adelie female           36.0          17.1 3700 102.77778
7   Gentoo female           48.2          14.3 4600  95.43568
8   Gentoo female           43.2          14.5 4450 103.00926
9   Gentoo female           47.5          14.0 4875 102.63158
10  Gentoo female           47.2          13.7 4925 104.34322
df[["mpbl"]] <- df$mass / df$bill_length_mm
df
Output
   species    sex bill_length_mm bill_depth_mm mass      mpbl
1   Adelie   male           39.1          18.7 3750  95.90793
2   Adelie female           39.5          16.7 3250  82.27848
3   Adelie female           35.7          16.9 3150  88.23529
4   Adelie female           35.7          18.0 3550  99.43978
5   Adelie female           36.2          17.2 3150  87.01657
6   Adelie female           36.0          17.1 3700 102.77778
7   Gentoo female           48.2          14.3 4600  95.43568
8   Gentoo female           43.2          14.5 4450 103.00926
9   Gentoo female           47.5          14.0 4875 102.63158
10  Gentoo female           47.2          13.7 4925 104.34322
mutate(tbl, mpbl = mass / bill_length_mm)
Output
# A tibble: 10 × 6
   species sex    bill_length_mm bill_depth_mm  mass  mpbl
   <fct>   <fct>           <dbl>         <dbl> <int> <dbl>
 1 Adelie  male             39.1          18.7  3750  95.9
 2 Adelie  female           39.5          16.7  3250  82.3
 3 Adelie  female           35.7          16.9  3150  88.2
 4 Adelie  female           35.7          18    3550  99.4
 5 Adelie  female           36.2          17.2  3150  87.0
 6 Adelie  female           36            17.1  3700 103. 
 7 Gentoo  female           48.2          14.3  4600  95.4
 8 Gentoo  female           43.2          14.5  4450 103. 
 9 Gentoo  female           47.5          14    4875 103. 
10 Gentoo  female           47.2          13.7  4925 104. 
dt[, mpbl := mass / bill_length_mm]
dt
Output
    species    sex bill_length_mm bill_depth_mm  mass      mpbl
     <fctr> <fctr>          <num>         <num> <int>     <num>
 1:  Adelie   male           39.1          18.7  3750  95.90793
 2:  Adelie female           39.5          16.7  3250  82.27848
 3:  Adelie female           35.7          16.9  3150  88.23529
 4:  Adelie female           35.7          18.0  3550  99.43978
 5:  Adelie female           36.2          17.2  3150  87.01657
 6:  Adelie female           36.0          17.1  3700 102.77778
 7:  Gentoo female           48.2          14.3  4600  95.43568
 8:  Gentoo female           43.2          14.5  4450 103.00926
 9:  Gentoo female           47.5          14.0  4875 102.63158
10:  Gentoo female           47.2          13.7  4925 104.34322

2.10 Group operations

Calculate the average, standard deviation, and number of observations for mass grouped by sex and species.

aggregate(df, mass ~ sex + species, 
          \(x) c(avg = mean(x), 
                 sd = sd(x), 
                 n = length(x)))
Output
     sex species mass.avg  mass.sd   mass.n
1 female  Adelie 3360.000  250.998    5.000
2   male  Adelie 3750.000       NA    1.000
3 female  Gentoo 4712.500  225.924    4.000
aggregate(df$mass, 
          by = list(df$sex, df$species), 
          \(x) c(avg = mean(x), 
                 sd = sd(x), 
                 n = length(x)))
Output
  Group.1 Group.2    x.avg     x.sd      x.n
1  female  Adelie 3360.000  250.998    5.000
2    male  Adelie 3750.000       NA    1.000
3  female  Gentoo 4712.500  225.924    4.000

Below doesn’t work where there are missing combinations.

with(df, 
     tapply(
       mass, 
       list(sex = sex, species = species), 
       \(x) c(avg = mean(x), 
              sd = sd(x), 
              n = length(x))
       )) |>
  array2DF() 
Output
     sex   species                    Value
1 female    Adelie 3360.000, 250.998, 5.000
2   male    Adelie              3750, NA, 1
3 female Chinstrap                     NULL
4   male Chinstrap                     NULL
5 female    Gentoo 4712.500, 225.924, 4.000
6   male    Gentoo                     NULL
summarise(tbl, 
          avg = mean(mass), 
          sd = sd(mass), 
          n = n(),
          .by = c(sex, species))
Output
# A tibble: 3 × 5
  sex    species   avg    sd     n
  <fct>  <fct>   <dbl> <dbl> <int>
1 male   Adelie  3750    NA      1
2 female Adelie  3360   251.     5
3 female Gentoo  4712.  226.     4
summarise(group_by(tbl, sex, species), 
          avg = mean(mass), 
          sd = sd(mass),
          n = n())
Output
# A tibble: 3 × 5
# Groups:   sex [2]
  sex    species   avg    sd     n
  <fct>  <fct>   <dbl> <dbl> <int>
1 female Adelie  3360   251.     5
2 female Gentoo  4712.  226.     4
3 male   Adelie  3750    NA      1
dt[, 
   .(avg = mean(mass), 
     sd = sd(mass),
     n = .N), 
   by = .(sex, species)]
Output
      sex species    avg      sd     n
   <fctr>  <fctr>  <num>   <num> <int>
1:   male  Adelie 3750.0      NA     1
2: female  Adelie 3360.0 250.998     5
3: female  Gentoo 4712.5 225.924     4

2.11 Row operations

The maximum value of numeric columns for each row.

apply(df[, sapply(df, is.numeric)], 1, max)  
Output
 [1] 3750 3250 3150 3550 3150 3700 4600 4450 4875 4925
tbl |> 
  rowwise() |> 
  mutate(result = max(c_across(where(is.numeric)))) |> 
  pull(result)
Output
 [1] 3750 3250 3150 3550 3150 3700 4600 4450 4875 4925
dt[, apply(.SD, 1, max), .SDcol = is.numeric]
Output
 [1] 3750 3250 3150 3550 3150 3700 4600 4450 4875 4925

2.12 Column operations

The maximum value of numeric columns for each column.

apply(df[, sapply(df, is.numeric)], 2, max)  
Output
bill_length_mm  bill_depth_mm           mass 
          48.2           18.7         4925.0 
summarise(tbl, 
          across(where(is.numeric), max))
Output
# A tibble: 1 × 3
  bill_length_mm bill_depth_mm  mass
           <dbl>         <dbl> <int>
1           48.2          18.7  4925
dt[, lapply(.SD, max), .SDcol = is.numeric]
Output
   bill_length_mm bill_depth_mm  mass     mpbl
            <num>         <num> <int>    <num>
1:           48.2          18.7  4925 104.3432

2.13 Update specific cells in a column with a single value

2.13.1 By position

Modify the first two rows of the mass column to be 0.

df[1:2, "mass"] <- 0
df
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7    0
2   Adelie female           39.5          16.7    0
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
5   Adelie female           36.2          17.2 3150
6   Adelie female           36.0          17.1 3700
7   Gentoo female           48.2          14.3 4600
8   Gentoo female           43.2          14.5 4450
9   Gentoo female           47.5          14.0 4875
10  Gentoo female           47.2          13.7 4925
df$mass[1:2] <- 0
df
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7    0
2   Adelie female           39.5          16.7    0
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
5   Adelie female           36.2          17.2 3150
6   Adelie female           36.0          17.1 3700
7   Gentoo female           48.2          14.3 4600
8   Gentoo female           43.2          14.5 4450
9   Gentoo female           47.5          14.0 4875
10  Gentoo female           47.2          13.7 4925
mutate(tbl, 
       mass = if_else(row_number() %in% 1:2, 
                      0, 
                      mass))
Output
# A tibble: 10 × 5
   species sex    bill_length_mm bill_depth_mm  mass
   <fct>   <fct>           <dbl>         <dbl> <dbl>
 1 Adelie  male             39.1          18.7     0
 2 Adelie  female           39.5          16.7     0
 3 Adelie  female           35.7          16.9  3150
 4 Adelie  female           35.7          18    3550
 5 Adelie  female           36.2          17.2  3150
 6 Adelie  female           36            17.1  3700
 7 Gentoo  female           48.2          14.3  4600
 8 Gentoo  female           43.2          14.5  4450
 9 Gentoo  female           47.5          14    4875
10 Gentoo  female           47.2          13.7  4925
dt[, mass := fifelse(.I %in% 1:2, 0, mass)][]
Output
    species    sex bill_length_mm bill_depth_mm  mass
     <fctr> <fctr>          <num>         <num> <num>
 1:  Adelie   male           39.1          18.7     0
 2:  Adelie female           39.5          16.7     0
 3:  Adelie female           35.7          16.9  3150
 4:  Adelie female           35.7          18.0  3550
 5:  Adelie female           36.2          17.2  3150
 6:  Adelie female           36.0          17.1  3700
 7:  Gentoo female           48.2          14.3  4600
 8:  Gentoo female           43.2          14.5  4450
 9:  Gentoo female           47.5          14.0  4875
10:  Gentoo female           47.2          13.7  4925

2.13.2 By boolean

Modify the mass column to 0 for rows where sex is female.

df[df$sex == "female", "mass"] <- 0
df
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7 3750
2   Adelie female           39.5          16.7    0
3   Adelie female           35.7          16.9    0
4   Adelie female           35.7          18.0    0
5   Adelie female           36.2          17.2    0
6   Adelie female           36.0          17.1    0
7   Gentoo female           48.2          14.3    0
8   Gentoo female           43.2          14.5    0
9   Gentoo female           47.5          14.0    0
10  Gentoo female           47.2          13.7    0
df$mass[df$sex == "female"] <- 0
df
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7 3750
2   Adelie female           39.5          16.7    0
3   Adelie female           35.7          16.9    0
4   Adelie female           35.7          18.0    0
5   Adelie female           36.2          17.2    0
6   Adelie female           36.0          17.1    0
7   Gentoo female           48.2          14.3    0
8   Gentoo female           43.2          14.5    0
9   Gentoo female           47.5          14.0    0
10  Gentoo female           47.2          13.7    0
mutate(tbl, 
       mass = ifelse(sex == "female", 
                     0, 
                     mass))
Output
# A tibble: 10 × 5
   species sex    bill_length_mm bill_depth_mm  mass
   <fct>   <fct>           <dbl>         <dbl> <dbl>
 1 Adelie  male             39.1          18.7  3750
 2 Adelie  female           39.5          16.7     0
 3 Adelie  female           35.7          16.9     0
 4 Adelie  female           35.7          18       0
 5 Adelie  female           36.2          17.2     0
 6 Adelie  female           36            17.1     0
 7 Gentoo  female           48.2          14.3     0
 8 Gentoo  female           43.2          14.5     0
 9 Gentoo  female           47.5          14       0
10 Gentoo  female           47.2          13.7     0
dt[sex == "female", mass := 0][]
Output
    species    sex bill_length_mm bill_depth_mm  mass
     <fctr> <fctr>          <num>         <num> <int>
 1:  Adelie   male           39.1          18.7  3750
 2:  Adelie female           39.5          16.7     0
 3:  Adelie female           35.7          16.9     0
 4:  Adelie female           35.7          18.0     0
 5:  Adelie female           36.2          17.2     0
 6:  Adelie female           36.0          17.1     0
 7:  Gentoo female           48.2          14.3     0
 8:  Gentoo female           43.2          14.5     0
 9:  Gentoo female           47.5          14.0     0
10:  Gentoo female           47.2          13.7     0

2.14 Update specific cells in a column with a vector

2.14.1 By position

Modify the first two rows of the mass column to be 0 and 1.

df[1:2, "mass"] <- c(0, 1)
df
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7    0
2   Adelie female           39.5          16.7    1
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
5   Adelie female           36.2          17.2 3150
6   Adelie female           36.0          17.1 3700
7   Gentoo female           48.2          14.3 4600
8   Gentoo female           43.2          14.5 4450
9   Gentoo female           47.5          14.0 4875
10  Gentoo female           47.2          13.7 4925
df$mass[1:2] <- c(0, 1)
df
Output
   species    sex bill_length_mm bill_depth_mm mass
1   Adelie   male           39.1          18.7    0
2   Adelie female           39.5          16.7    1
3   Adelie female           35.7          16.9 3150
4   Adelie female           35.7          18.0 3550
5   Adelie female           36.2          17.2 3150
6   Adelie female           36.0          17.1 3700
7   Gentoo female           48.2          14.3 4600
8   Gentoo female           43.2          14.5 4450
9   Gentoo female           47.5          14.0 4875
10  Gentoo female           47.2          13.7 4925
mutate(tbl, 
       mass = ifelse(row_number() %in% 1:2, 
                     c(0, 1), 
                     mass))
Output
# A tibble: 10 × 5
   species sex    bill_length_mm bill_depth_mm  mass
   <fct>   <fct>           <dbl>         <dbl> <dbl>
 1 Adelie  male             39.1          18.7     0
 2 Adelie  female           39.5          16.7     1
 3 Adelie  female           35.7          16.9  3150
 4 Adelie  female           35.7          18    3550
 5 Adelie  female           36.2          17.2  3150
 6 Adelie  female           36            17.1  3700
 7 Gentoo  female           48.2          14.3  4600
 8 Gentoo  female           43.2          14.5  4450
 9 Gentoo  female           47.5          14    4875
10 Gentoo  female           47.2          13.7  4925
dt[1:2, mass := c(0, 1)][]
Output
    species    sex bill_length_mm bill_depth_mm  mass
     <fctr> <fctr>          <num>         <num> <int>
 1:  Adelie   male           39.1          18.7     0
 2:  Adelie female           39.5          16.7     1
 3:  Adelie female           35.7          16.9  3150
 4:  Adelie female           35.7          18.0  3550
 5:  Adelie female           36.2          17.2  3150
 6:  Adelie female           36.0          17.1  3700
 7:  Gentoo female           48.2          14.3  4600
 8:  Gentoo female           43.2          14.5  4450
 9:  Gentoo female           47.5          14.0  4875
10:  Gentoo female           47.2          13.7  4925

2.14.2 By boolean

Modify the mass column to be half of its value for rows where sex is female.

within(df, 
       mass[sex == "female"] <- mass[sex == "female"] / 2)
Output
   species    sex bill_length_mm bill_depth_mm   mass
1   Adelie   male           39.1          18.7 3750.0
2   Adelie female           39.5          16.7 1625.0
3   Adelie female           35.7          16.9 1575.0
4   Adelie female           35.7          18.0 1775.0
5   Adelie female           36.2          17.2 1575.0
6   Adelie female           36.0          17.1 1850.0
7   Gentoo female           48.2          14.3 2300.0
8   Gentoo female           43.2          14.5 2225.0
9   Gentoo female           47.5          14.0 2437.5
10  Gentoo female           47.2          13.7 2462.5
mutate(tbl, 
       mass = if_else(sex == "female",
                      mass / 2, 
                      mass))
Output
# A tibble: 10 × 5
   species sex    bill_length_mm bill_depth_mm  mass
   <fct>   <fct>           <dbl>         <dbl> <dbl>
 1 Adelie  male             39.1          18.7 3750 
 2 Adelie  female           39.5          16.7 1625 
 3 Adelie  female           35.7          16.9 1575 
 4 Adelie  female           35.7          18   1775 
 5 Adelie  female           36.2          17.2 1575 
 6 Adelie  female           36            17.1 1850 
 7 Gentoo  female           48.2          14.3 2300 
 8 Gentoo  female           43.2          14.5 2225 
 9 Gentoo  female           47.5          14   2438.
10 Gentoo  female           47.2          13.7 2462.
dt[sex == "female", mass := mass / 2][]
Output
Warning in `[.data.table`(dt, sex == "female", `:=`(mass, mass/2)): 2437.500000
(type 'double') at RHS position 8 out-of-range(NA) or truncated (precision
lost) when assigning to type 'integer' (column 5 named 'mass')
Output
    species    sex bill_length_mm bill_depth_mm  mass
     <fctr> <fctr>          <num>         <num> <int>
 1:  Adelie   male           39.1          18.7  3750
 2:  Adelie female           39.5          16.7  1625
 3:  Adelie female           35.7          16.9  1575
 4:  Adelie female           35.7          18.0  1775
 5:  Adelie female           36.2          17.2  1575
 6:  Adelie female           36.0          17.1  1850
 7:  Gentoo female           48.2          14.3  2300
 8:  Gentoo female           43.2          14.5  2225
 9:  Gentoo female           47.5          14.0  2437
10:  Gentoo female           47.2          13.7  2462

2.15 Update specific cells in a column with multiple cases

Create a new column size based on the following conditions: if sex is female and mass is greater than 4000, then size is large, if sex is male and mass is greater than 4100, then size is large, otherwise size is small.

df$size <- ifelse(df$sex == "female" & df$mass > 4000,
                  "large",
                  ifelse(df$sex == "male" & df$mass > 4100, 
                         "large", 
                         "small"))
df                
Output
   species    sex bill_length_mm bill_depth_mm mass  size
1   Adelie   male           39.1          18.7 3750 small
2   Adelie female           39.5          16.7 3250 small
3   Adelie female           35.7          16.9 3150 small
4   Adelie female           35.7          18.0 3550 small
5   Adelie female           36.2          17.2 3150 small
6   Adelie female           36.0          17.1 3700 small
7   Gentoo female           48.2          14.3 4600 large
8   Gentoo female           43.2          14.5 4450 large
9   Gentoo female           47.5          14.0 4875 large
10  Gentoo female           47.2          13.7 4925 large
mutate(tbl, 
       size = case_when(sex == "female" & mass > 4000 ~ "large",
                        sex == "male" & mass > 4100 ~ "large",
                        .default = "small"))
Output
# A tibble: 10 × 6
   species sex    bill_length_mm bill_depth_mm  mass size 
   <fct>   <fct>           <dbl>         <dbl> <int> <chr>
 1 Adelie  male             39.1          18.7  3750 small
 2 Adelie  female           39.5          16.7  3250 small
 3 Adelie  female           35.7          16.9  3150 small
 4 Adelie  female           35.7          18    3550 small
 5 Adelie  female           36.2          17.2  3150 small
 6 Adelie  female           36            17.1  3700 small
 7 Gentoo  female           48.2          14.3  4600 large
 8 Gentoo  female           43.2          14.5  4450 large
 9 Gentoo  female           47.5          14    4875 large
10 Gentoo  female           47.2          13.7  4925 large
dt[, size := "small"]
dt[sex == "female" & mass > 4000, size := "large"]
dt[sex == "male" & mass > 4100, size := "large"][]
Output
    species    sex bill_length_mm bill_depth_mm  mass   size
     <fctr> <fctr>          <num>         <num> <int> <char>
 1:  Adelie   male           39.1          18.7  3750  small
 2:  Adelie female           39.5          16.7  3250  small
 3:  Adelie female           35.7          16.9  3150  small
 4:  Adelie female           35.7          18.0  3550  small
 5:  Adelie female           36.2          17.2  3150  small
 6:  Adelie female           36.0          17.1  3700  small
 7:  Gentoo female           48.2          14.3  4600  large
 8:  Gentoo female           43.2          14.5  4450  large
 9:  Gentoo female           47.5          14.0  4875  large
10:  Gentoo female           47.2          13.7  4925  large

2.16 Merge or join two tabular data

We create a second data frame df2 and tibble tbl2 that contains the sex, species, and name columns.

df2 <- data.frame(sex = c("female", "male",  "female", "male"),
                  species = c("Adelie", "Adelie", "Chinstrap", "Chinstrap"),
                  name = c("A", "B", "C", "D"))
tbl2 <- as_tibble(df2)

We also create a second data table dt2 that contains the sex, species, and name columns.

dt2 <- as.data.table(df2)

2.16.1 Left join

Left join df and df2 by species and sex.

merge(df, df2, by = c("species", "sex"),
      all.x = TRUE)        
Output
   species    sex bill_length_mm bill_depth_mm mass name
1   Adelie female           39.5          16.7 3250    A
2   Adelie female           35.7          16.9 3150    A
3   Adelie female           35.7          18.0 3550    A
4   Adelie female           36.2          17.2 3150    A
5   Adelie female           36.0          17.1 3700    A
6   Adelie   male           39.1          18.7 3750    B
7   Gentoo female           47.5          14.0 4875 <NA>
8   Gentoo female           47.2          13.7 4925 <NA>
9   Gentoo female           48.2          14.3 4600 <NA>
10  Gentoo female           43.2          14.5 4450 <NA>
left_join(df, df2, join_by(species, sex))  
Output
   species    sex bill_length_mm bill_depth_mm mass name
1   Adelie   male           39.1          18.7 3750    B
2   Adelie female           39.5          16.7 3250    A
3   Adelie female           35.7          16.9 3150    A
4   Adelie female           35.7          18.0 3550    A
5   Adelie female           36.2          17.2 3150    A
6   Adelie female           36.0          17.1 3700    A
7   Gentoo female           48.2          14.3 4600 <NA>
8   Gentoo female           43.2          14.5 4450 <NA>
9   Gentoo female           47.5          14.0 4875 <NA>
10  Gentoo female           47.2          13.7 4925 <NA>
merge(dt, dt2, by = c("species", "sex"),
      all.x = TRUE) 
Output
Key: <species, sex>
    species    sex bill_length_mm bill_depth_mm  mass   name
     <char> <char>          <num>         <num> <int> <char>
 1:  Adelie female           39.5          16.7  3250      A
 2:  Adelie female           35.7          16.9  3150      A
 3:  Adelie female           35.7          18.0  3550      A
 4:  Adelie female           36.2          17.2  3150      A
 5:  Adelie female           36.0          17.1  3700      A
 6:  Adelie   male           39.1          18.7  3750      B
 7:  Gentoo female           48.2          14.3  4600   <NA>
 8:  Gentoo female           43.2          14.5  4450   <NA>
 9:  Gentoo female           47.5          14.0  4875   <NA>
10:  Gentoo female           47.2          13.7  4925   <NA>

2.16.2 Right join

Right join df and df2 by species and sex.

merge(df, df2, by = c("species", "sex"),
      all.y = TRUE)              
Output
    species    sex bill_length_mm bill_depth_mm mass name
1    Adelie female           39.5          16.7 3250    A
2    Adelie female           35.7          16.9 3150    A
3    Adelie female           35.7          18.0 3550    A
4    Adelie female           36.2          17.2 3150    A
5    Adelie female           36.0          17.1 3700    A
6    Adelie   male           39.1          18.7 3750    B
7 Chinstrap female             NA            NA   NA    C
8 Chinstrap   male             NA            NA   NA    D
right_join(df, df2, join_by(species, sex)) 
Output
    species    sex bill_length_mm bill_depth_mm mass name
1    Adelie   male           39.1          18.7 3750    B
2    Adelie female           39.5          16.7 3250    A
3    Adelie female           35.7          16.9 3150    A
4    Adelie female           35.7          18.0 3550    A
5    Adelie female           36.2          17.2 3150    A
6    Adelie female           36.0          17.1 3700    A
7 Chinstrap female             NA            NA   NA    C
8 Chinstrap   male             NA            NA   NA    D
merge(dt, dt2, by = c("species", "sex"),
      all.y = TRUE)
Output
Key: <species, sex>
     species    sex bill_length_mm bill_depth_mm  mass   name
      <char> <char>          <num>         <num> <int> <char>
1:    Adelie female           39.5          16.7  3250      A
2:    Adelie female           35.7          16.9  3150      A
3:    Adelie female           35.7          18.0  3550      A
4:    Adelie female           36.2          17.2  3150      A
5:    Adelie female           36.0          17.1  3700      A
6:    Adelie   male           39.1          18.7  3750      B
7: Chinstrap female             NA            NA    NA      C
8: Chinstrap   male             NA            NA    NA      D

2.16.3 Full join

Full join df and df2 by species and sex.

merge(df, df2, by = c("species", "sex"),
      all = TRUE)              
Output
     species    sex bill_length_mm bill_depth_mm mass name
1     Adelie female           39.5          16.7 3250    A
2     Adelie female           35.7          16.9 3150    A
3     Adelie female           35.7          18.0 3550    A
4     Adelie female           36.2          17.2 3150    A
5     Adelie female           36.0          17.1 3700    A
6     Adelie   male           39.1          18.7 3750    B
7  Chinstrap female             NA            NA   NA    C
8  Chinstrap   male             NA            NA   NA    D
9     Gentoo female           47.5          14.0 4875 <NA>
10    Gentoo female           47.2          13.7 4925 <NA>
11    Gentoo female           48.2          14.3 4600 <NA>
12    Gentoo female           43.2          14.5 4450 <NA>
full_join(df, df2, join_by(species, sex)) 
Output
     species    sex bill_length_mm bill_depth_mm mass name
1     Adelie   male           39.1          18.7 3750    B
2     Adelie female           39.5          16.7 3250    A
3     Adelie female           35.7          16.9 3150    A
4     Adelie female           35.7          18.0 3550    A
5     Adelie female           36.2          17.2 3150    A
6     Adelie female           36.0          17.1 3700    A
7     Gentoo female           48.2          14.3 4600 <NA>
8     Gentoo female           43.2          14.5 4450 <NA>
9     Gentoo female           47.5          14.0 4875 <NA>
10    Gentoo female           47.2          13.7 4925 <NA>
11 Chinstrap female             NA            NA   NA    C
12 Chinstrap   male             NA            NA   NA    D
merge(dt, dt2, by = c("species", "sex"),
      all = TRUE)
Output
Key: <species, sex>
      species    sex bill_length_mm bill_depth_mm  mass   name
       <char> <char>          <num>         <num> <int> <char>
 1:    Adelie female           39.5          16.7  3250      A
 2:    Adelie female           35.7          16.9  3150      A
 3:    Adelie female           35.7          18.0  3550      A
 4:    Adelie female           36.2          17.2  3150      A
 5:    Adelie female           36.0          17.1  3700      A
 6:    Adelie   male           39.1          18.7  3750      B
 7: Chinstrap female             NA            NA    NA      C
 8: Chinstrap   male             NA            NA    NA      D
 9:    Gentoo female           48.2          14.3  4600   <NA>
10:    Gentoo female           43.2          14.5  4450   <NA>
11:    Gentoo female           47.5          14.0  4875   <NA>
12:    Gentoo female           47.2          13.7  4925   <NA>

2.16.4 Inner join

Inner join df and df2 by species and sex.

merge(df, df2, by = c("species", "sex"))              
Output
  species    sex bill_length_mm bill_depth_mm mass name
1  Adelie female           39.5          16.7 3250    A
2  Adelie female           35.7          16.9 3150    A
3  Adelie female           35.7          18.0 3550    A
4  Adelie female           36.2          17.2 3150    A
5  Adelie female           36.0          17.1 3700    A
6  Adelie   male           39.1          18.7 3750    B
inner_join(df, df2, join_by(species, sex)) 
Output
  species    sex bill_length_mm bill_depth_mm mass name
1  Adelie   male           39.1          18.7 3750    B
2  Adelie female           39.5          16.7 3250    A
3  Adelie female           35.7          16.9 3150    A
4  Adelie female           35.7          18.0 3550    A
5  Adelie female           36.2          17.2 3150    A
6  Adelie female           36.0          17.1 3700    A
merge(dt, dt2, by = c("species", "sex"))
Output
Key: <species, sex>
   species    sex bill_length_mm bill_depth_mm  mass   name
    <char> <char>          <num>         <num> <int> <char>
1:  Adelie female           39.5          16.7  3250      A
2:  Adelie female           35.7          16.9  3150      A
3:  Adelie female           35.7          18.0  3550      A
4:  Adelie female           36.2          17.2  3150      A
5:  Adelie female           36.0          17.1  3700      A
6:  Adelie   male           39.1          18.7  3750      B

2.16.5 Cross join

Cross join df and df2.

merge(df, df2, by = NULL)              
Output
   species.x  sex.x bill_length_mm bill_depth_mm mass  sex.y species.y name
1     Adelie   male           39.1          18.7 3750 female    Adelie    A
2     Adelie female           39.5          16.7 3250 female    Adelie    A
3     Adelie female           35.7          16.9 3150 female    Adelie    A
4     Adelie female           35.7          18.0 3550 female    Adelie    A
5     Adelie female           36.2          17.2 3150 female    Adelie    A
6     Adelie female           36.0          17.1 3700 female    Adelie    A
7     Gentoo female           48.2          14.3 4600 female    Adelie    A
8     Gentoo female           43.2          14.5 4450 female    Adelie    A
9     Gentoo female           47.5          14.0 4875 female    Adelie    A
10    Gentoo female           47.2          13.7 4925 female    Adelie    A
11    Adelie   male           39.1          18.7 3750   male    Adelie    B
12    Adelie female           39.5          16.7 3250   male    Adelie    B
13    Adelie female           35.7          16.9 3150   male    Adelie    B
14    Adelie female           35.7          18.0 3550   male    Adelie    B
15    Adelie female           36.2          17.2 3150   male    Adelie    B
16    Adelie female           36.0          17.1 3700   male    Adelie    B
17    Gentoo female           48.2          14.3 4600   male    Adelie    B
18    Gentoo female           43.2          14.5 4450   male    Adelie    B
19    Gentoo female           47.5          14.0 4875   male    Adelie    B
20    Gentoo female           47.2          13.7 4925   male    Adelie    B
21    Adelie   male           39.1          18.7 3750 female Chinstrap    C
22    Adelie female           39.5          16.7 3250 female Chinstrap    C
23    Adelie female           35.7          16.9 3150 female Chinstrap    C
24    Adelie female           35.7          18.0 3550 female Chinstrap    C
25    Adelie female           36.2          17.2 3150 female Chinstrap    C
26    Adelie female           36.0          17.1 3700 female Chinstrap    C
27    Gentoo female           48.2          14.3 4600 female Chinstrap    C
28    Gentoo female           43.2          14.5 4450 female Chinstrap    C
29    Gentoo female           47.5          14.0 4875 female Chinstrap    C
30    Gentoo female           47.2          13.7 4925 female Chinstrap    C
31    Adelie   male           39.1          18.7 3750   male Chinstrap    D
32    Adelie female           39.5          16.7 3250   male Chinstrap    D
33    Adelie female           35.7          16.9 3150   male Chinstrap    D
34    Adelie female           35.7          18.0 3550   male Chinstrap    D
35    Adelie female           36.2          17.2 3150   male Chinstrap    D
36    Adelie female           36.0          17.1 3700   male Chinstrap    D
37    Gentoo female           48.2          14.3 4600   male Chinstrap    D
38    Gentoo female           43.2          14.5 4450   male Chinstrap    D
39    Gentoo female           47.5          14.0 4875   male Chinstrap    D
40    Gentoo female           47.2          13.7 4925   male Chinstrap    D
cross_join(df, df2) 
Output
   species.x  sex.x bill_length_mm bill_depth_mm mass  sex.y species.y name
1     Adelie   male           39.1          18.7 3750 female    Adelie    A
2     Adelie   male           39.1          18.7 3750   male    Adelie    B
3     Adelie   male           39.1          18.7 3750 female Chinstrap    C
4     Adelie   male           39.1          18.7 3750   male Chinstrap    D
5     Adelie female           39.5          16.7 3250 female    Adelie    A
6     Adelie female           39.5          16.7 3250   male    Adelie    B
7     Adelie female           39.5          16.7 3250 female Chinstrap    C
8     Adelie female           39.5          16.7 3250   male Chinstrap    D
9     Adelie female           35.7          16.9 3150 female    Adelie    A
10    Adelie female           35.7          16.9 3150   male    Adelie    B
11    Adelie female           35.7          16.9 3150 female Chinstrap    C
12    Adelie female           35.7          16.9 3150   male Chinstrap    D
13    Adelie female           35.7          18.0 3550 female    Adelie    A
14    Adelie female           35.7          18.0 3550   male    Adelie    B
15    Adelie female           35.7          18.0 3550 female Chinstrap    C
16    Adelie female           35.7          18.0 3550   male Chinstrap    D
17    Adelie female           36.2          17.2 3150 female    Adelie    A
18    Adelie female           36.2          17.2 3150   male    Adelie    B
19    Adelie female           36.2          17.2 3150 female Chinstrap    C
20    Adelie female           36.2          17.2 3150   male Chinstrap    D
21    Adelie female           36.0          17.1 3700 female    Adelie    A
22    Adelie female           36.0          17.1 3700   male    Adelie    B
23    Adelie female           36.0          17.1 3700 female Chinstrap    C
24    Adelie female           36.0          17.1 3700   male Chinstrap    D
25    Gentoo female           48.2          14.3 4600 female    Adelie    A
26    Gentoo female           48.2          14.3 4600   male    Adelie    B
27    Gentoo female           48.2          14.3 4600 female Chinstrap    C
28    Gentoo female           48.2          14.3 4600   male Chinstrap    D
29    Gentoo female           43.2          14.5 4450 female    Adelie    A
30    Gentoo female           43.2          14.5 4450   male    Adelie    B
31    Gentoo female           43.2          14.5 4450 female Chinstrap    C
32    Gentoo female           43.2          14.5 4450   male Chinstrap    D
33    Gentoo female           47.5          14.0 4875 female    Adelie    A
34    Gentoo female           47.5          14.0 4875   male    Adelie    B
35    Gentoo female           47.5          14.0 4875 female Chinstrap    C
36    Gentoo female           47.5          14.0 4875   male Chinstrap    D
37    Gentoo female           47.2          13.7 4925 female    Adelie    A
38    Gentoo female           47.2          13.7 4925   male    Adelie    B
39    Gentoo female           47.2          13.7 4925 female Chinstrap    C
40    Gentoo female           47.2          13.7 4925   male Chinstrap    D
merge(dt, dt2, by = NULL)
Output
Key: <species, sex>
   species    sex bill_length_mm bill_depth_mm  mass   name
    <char> <char>          <num>         <num> <int> <char>
1:  Adelie female           39.5          16.7  3250      A
2:  Adelie female           35.7          16.9  3150      A
3:  Adelie female           35.7          18.0  3550      A
4:  Adelie female           36.2          17.2  3150      A
5:  Adelie female           36.0          17.1  3700      A
6:  Adelie   male           39.1          18.7  3750      B

2.16.6 Anti join

Anti join df and df2 by species and sex.

df[!(df$sex %in% df2$sex & df$species %in% df2$species), ]        
Output
   species    sex bill_length_mm bill_depth_mm mass
7   Gentoo female           48.2          14.3 4600
8   Gentoo female           43.2          14.5 4450
9   Gentoo female           47.5          14.0 4875
10  Gentoo female           47.2          13.7 4925
anti_join(df, df2, join_by(sex, species)) 
Output
  species    sex bill_length_mm bill_depth_mm mass
1  Gentoo female           48.2          14.3 4600
2  Gentoo female           43.2          14.5 4450
3  Gentoo female           47.5          14.0 4875
4  Gentoo female           47.2          13.7 4925
setDT(dt)[!dt2, on = c("sex", "species")]
Output
   species    sex bill_length_mm bill_depth_mm  mass
    <fctr> <fctr>          <num>         <num> <int>
1:  Gentoo female           48.2          14.3  4600
2:  Gentoo female           43.2          14.5  4450
3:  Gentoo female           47.5          14.0  4875
4:  Gentoo female           47.2          13.7  4925

2.16.7 Bind columns

Bind the first two columns of df and the third and fourth columns of df.

cbind(df[1:2], df[3:4])
Output
   species    sex bill_length_mm bill_depth_mm
1   Adelie   male           39.1          18.7
2   Adelie female           39.5          16.7
3   Adelie female           35.7          16.9
4   Adelie female           35.7          18.0
5   Adelie female           36.2          17.2
6   Adelie female           36.0          17.1
7   Gentoo female           48.2          14.3
8   Gentoo female           43.2          14.5
9   Gentoo female           47.5          14.0
10  Gentoo female           47.2          13.7
bind_cols(df[1:2], df[3:4])
Output
   species    sex bill_length_mm bill_depth_mm
1   Adelie   male           39.1          18.7
2   Adelie female           39.5          16.7
3   Adelie female           35.7          16.9
4   Adelie female           35.7          18.0
5   Adelie female           36.2          17.2
6   Adelie female           36.0          17.1
7   Gentoo female           48.2          14.3
8   Gentoo female           43.2          14.5
9   Gentoo female           47.5          14.0
10  Gentoo female           47.2          13.7
setDT(c(dt[, 1:2], dt[, 3:4]))

2.16.8 Bind rows

Bind df and df2 by rows.

df2[setdiff(names(df), names(df2))] <- NA
df[setdiff(names(df2), names(df))] <- NA
rbind(df, df2)
Output
     species    sex bill_length_mm bill_depth_mm mass name
1     Adelie   male           39.1          18.7 3750 <NA>
2     Adelie female           39.5          16.7 3250 <NA>
3     Adelie female           35.7          16.9 3150 <NA>
4     Adelie female           35.7          18.0 3550 <NA>
5     Adelie female           36.2          17.2 3150 <NA>
6     Adelie female           36.0          17.1 3700 <NA>
7     Gentoo female           48.2          14.3 4600 <NA>
8     Gentoo female           43.2          14.5 4450 <NA>
9     Gentoo female           47.5          14.0 4875 <NA>
10    Gentoo female           47.2          13.7 4925 <NA>
11    Adelie female             NA            NA   NA    A
12    Adelie   male             NA            NA   NA    B
13 Chinstrap female             NA            NA   NA    C
14 Chinstrap   male             NA            NA   NA    D
bind_rows(tbl, tbl2)
Output
# A tibble: 14 × 6
   species   sex    bill_length_mm bill_depth_mm  mass name 
   <chr>     <chr>           <dbl>         <dbl> <int> <chr>
 1 Adelie    male             39.1          18.7  3750 <NA> 
 2 Adelie    female           39.5          16.7  3250 <NA> 
 3 Adelie    female           35.7          16.9  3150 <NA> 
 4 Adelie    female           35.7          18    3550 <NA> 
 5 Adelie    female           36.2          17.2  3150 <NA> 
 6 Adelie    female           36            17.1  3700 <NA> 
 7 Gentoo    female           48.2          14.3  4600 <NA> 
 8 Gentoo    female           43.2          14.5  4450 <NA> 
 9 Gentoo    female           47.5          14    4875 <NA> 
10 Gentoo    female           47.2          13.7  4925 <NA> 
11 Adelie    female           NA            NA      NA A    
12 Adelie    male             NA            NA      NA B    
13 Chinstrap female           NA            NA      NA C    
14 Chinstrap male             NA            NA      NA D    
rbindlist(list(dt, dt2), 
          fill = TRUE)
Output
      species    sex bill_length_mm bill_depth_mm  mass   name
       <fctr> <fctr>          <num>         <num> <int> <char>
 1:    Adelie   male           39.1          18.7  3750   <NA>
 2:    Adelie female           39.5          16.7  3250   <NA>
 3:    Adelie female           35.7          16.9  3150   <NA>
 4:    Adelie female           35.7          18.0  3550   <NA>
 5:    Adelie female           36.2          17.2  3150   <NA>
 6:    Adelie female           36.0          17.1  3700   <NA>
 7:    Gentoo female           48.2          14.3  4600   <NA>
 8:    Gentoo female           43.2          14.5  4450   <NA>
 9:    Gentoo female           47.5          14.0  4875   <NA>
10:    Gentoo female           47.2          13.7  4925   <NA>
11:    Adelie female             NA            NA    NA      A
12:    Adelie   male             NA            NA    NA      B
13: Chinstrap female             NA            NA    NA      C
14: Chinstrap   male             NA            NA    NA      D

2.17 Reshape data to longer format

Reshape the df data frame to a longer format where the bill_length_mm and bill_depth_mm columns are stacked into a single column named bill and a new column named type is created to indicate the type of measurement.

reshape(df, 
        varying = grep("^bill", colnames(df)), 
        v.names = "bill", 
        direction = "long",
        timevar = "type",
        times = c("length", "depth"))
Output
          species    sex mass name   type bill id
1.length   Adelie   male 3750   NA length 39.1  1
2.length   Adelie female 3250   NA length 39.5  2
3.length   Adelie female 3150   NA length 35.7  3
4.length   Adelie female 3550   NA length 35.7  4
5.length   Adelie female 3150   NA length 36.2  5
6.length   Adelie female 3700   NA length 36.0  6
7.length   Gentoo female 4600   NA length 48.2  7
8.length   Gentoo female 4450   NA length 43.2  8
9.length   Gentoo female 4875   NA length 47.5  9
10.length  Gentoo female 4925   NA length 47.2 10
1.depth    Adelie   male 3750   NA  depth 18.7  1
2.depth    Adelie female 3250   NA  depth 16.7  2
3.depth    Adelie female 3150   NA  depth 16.9  3
4.depth    Adelie female 3550   NA  depth 18.0  4
5.depth    Adelie female 3150   NA  depth 17.2  5
6.depth    Adelie female 3700   NA  depth 17.1  6
7.depth    Gentoo female 4600   NA  depth 14.3  7
8.depth    Gentoo female 4450   NA  depth 14.5  8
9.depth    Gentoo female 4875   NA  depth 14.0  9
10.depth   Gentoo female 4925   NA  depth 13.7 10
pivot_longer(tbl, cols = starts_with("bill"), names_to = "type", names_pattern = "bill_(.+)_mm")
Output
# A tibble: 20 × 5
   species sex     mass type   value
   <fct>   <fct>  <int> <chr>  <dbl>
 1 Adelie  male    3750 length  39.1
 2 Adelie  male    3750 depth   18.7
 3 Adelie  female  3250 length  39.5
 4 Adelie  female  3250 depth   16.7
 5 Adelie  female  3150 length  35.7
 6 Adelie  female  3150 depth   16.9
 7 Adelie  female  3550 length  35.7
 8 Adelie  female  3550 depth   18  
 9 Adelie  female  3150 length  36.2
10 Adelie  female  3150 depth   17.2
11 Adelie  female  3700 length  36  
12 Adelie  female  3700 depth   17.1
13 Gentoo  female  4600 length  48.2
14 Gentoo  female  4600 depth   14.3
15 Gentoo  female  4450 length  43.2
16 Gentoo  female  4450 depth   14.5
17 Gentoo  female  4875 length  47.5
18 Gentoo  female  4875 depth   14  
19 Gentoo  female  4925 length  47.2
20 Gentoo  female  4925 depth   13.7
melt(dt, 
     measure.vars = grep("^bill", colnames(dt)), 
     variable.name = "type",
     value.name = "bill")[
       , type := sub("^bill_", "", type)][
       , type := sub("_mm$", "", type)][]
Output
    species    sex  mass   type  bill
     <fctr> <fctr> <int> <char> <num>
 1:  Adelie   male  3750 length  39.1
 2:  Adelie female  3250 length  39.5
 3:  Adelie female  3150 length  35.7
 4:  Adelie female  3550 length  35.7
 5:  Adelie female  3150 length  36.2
 6:  Adelie female  3700 length  36.0
 7:  Gentoo female  4600 length  48.2
 8:  Gentoo female  4450 length  43.2
 9:  Gentoo female  4875 length  47.5
10:  Gentoo female  4925 length  47.2
11:  Adelie   male  3750  depth  18.7
12:  Adelie female  3250  depth  16.7
13:  Adelie female  3150  depth  16.9
14:  Adelie female  3550  depth  18.0
15:  Adelie female  3150  depth  17.2
16:  Adelie female  3700  depth  17.1
17:  Gentoo female  4600  depth  14.3
18:  Gentoo female  4450  depth  14.5
19:  Gentoo female  4875  depth  14.0
20:  Gentoo female  4925  depth  13.7
    species    sex  mass   type  bill

2.18 Reshape data to wider format

Reshape the first two rows of df data frame to a wider format where the sex column is used as the id variable and the species column is used as the time variable.

reshape(df[1:2, ],  
        idvar = "species",
        timevar = "sex",
        direction = "wide")
Output
  species bill_length_mm.male bill_depth_mm.male mass.male name.male
1  Adelie                39.1               18.7      3750        NA
  bill_length_mm.female bill_depth_mm.female mass.female name.female
1                  39.5                 16.7        3250          NA
pivot_wider(tbl[1:2, ],
            names_from = sex, 
            values_from = -c(sex, species))
Output
# A tibble: 1 × 7
  species bill_length_mm_male bill_length_mm_female bill_depth_mm_male
  <fct>                 <dbl>                 <dbl>              <dbl>
1 Adelie                 39.1                  39.5               18.7
# ℹ 3 more variables: bill_depth_mm_female <dbl>, mass_male <int>,
#   mass_female <int>
dcast(dt[1:2, ], 
      species ~ sex, 
      value.var = setdiff(colnames(dt), c("sex", "species")))
Output
Key: <species>
   species bill_length_mm_female bill_length_mm_male bill_depth_mm_female
    <fctr>                 <num>               <num>                <num>
1:  Adelie                  39.5                39.1                 16.7
   bill_depth_mm_male mass_female mass_male
                <num>       <int>     <int>
1:               18.7        3250      3750

3 Strings

The comparison of this syntax benefitted from comparison by Stoudt (2024).

x <- c("Banana", " citrus ", "app le ")

3.1 Detect a pattern

3.1.1 Presence/absence of a pattern

grepl("a", x)
Output
[1]  TRUE FALSE  TRUE
str_detect(x, "a")
Output
[1]  TRUE FALSE  TRUE

3.1.2 Match

regmatches(x, m = regexec("a", x))
Output
[[1]]
[1] "a"

[[2]]
character(0)

[[3]]
[1] "a"
str_match(x, "a")
Output
     [,1]
[1,] "a" 
[2,] NA  
[3,] "a" 

3.1.3 Locate (first occurence only)

regexpr("a", x)
Output
[1]  2 -1  1
attr(,"match.length")
[1]  1 -1  1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
str_locate(x, "a")
Output
     start end
[1,]     2   2
[2,]    NA  NA
[3,]     1   1

3.1.4 Locate (all occurences)

gregexpr("a", x)
Output
[[1]]
[1] 2 4 6
attr(,"match.length")
[1] 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 1
attr(,"match.length")
[1] 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
str_locate_all(x, "a")
Output
[[1]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[2]]
     start end

[[3]]
     start end
[1,]     1   1

3.1.5 Which

grep(x, "a")
Output
Warning in grep(x, "a"): argument 'pattern' has length > 1 and only the first
element will be used
Output
integer(0)
str_which(x, "a")
Output
[1] 1 3

3.2 Subset strings

3.2.1 Extract (first occurence only)

regmatches(x, m = regexpr("an", x))
Output
[1] "an"

Note that the above doesn’t not return any value for non-matching patterns.

str_extract(x, "an")
Output
[1] "an" NA   NA  

3.2.2 Extract (all occurences)

regmatches(x, m = gregexpr("an", x))
Output
[[1]]
[1] "an" "an"

[[2]]
character(0)

[[3]]
character(0)
str_extract_all(x, "an")
Output
[[1]]
[1] "an" "an"

[[2]]
character(0)

[[3]]
character(0)

3.2.3 Subset string

substr(x, 1, 2)
Output
[1] "Ba" " c" "ap"
str_sub(x, 1, 2)
Output
[1] "Ba" " c" "ap"

3.2.4 Subset string vector

grep(x,"a", value = TRUE)
Output
Warning in grep(x, "a", value = TRUE): argument 'pattern' has length > 1 and
only the first element will be used
Output
character(0)
str_subset(x, "a")
Output
[1] "Banana"  "app le "

3.3 Cleaning strings

3.3.1 Length

nchar(x)
Output
[1] 6 8 7
str_length(x)
Output
[1] 6 8 7

3.3.2 Trim white spaces

trimws(x)
Output
[1] "Banana" "citrus" "app le"
str_trim(x)
Output
[1] "Banana" "citrus" "app le"

3.3.3 Wrap

strwrap(x, 4)
Output
[1] "Banana" "citrus" "app"    "le"    
str_wrap(x, 4)
Output
[1] "Banana"  "citrus"  "app\nle"

3.4 Transform strings

3.4.1 Replace (first occurence only)

sub("a", "e", x)
Output
[1] "Benana"   " citrus " "epp le " 
str_replace(x, "a", "e")
Output
[1] "Benana"   " citrus " "epp le " 

3.4.2 Replace (all occurences)

gsub("a", "e", x)
Output
[1] "Benene"   " citrus " "epp le " 
str_replace_all(x, "a", "e")
Output
[1] "Benene"   " citrus " "epp le " 

3.4.3 Duplicate string

strrep(x, 2)
Output
[1] "BananaBanana"     " citrus  citrus " "app le app le "  
str_dup(x, 2)
Output
[1] "BananaBanana"     " citrus  citrus " "app le app le "  

3.4.4 Transform to lower case

tolower(x)
Output
[1] "banana"   " citrus " "app le " 
str_to_lower(x)
Output
[1] "banana"   " citrus " "app le " 

3.4.5 Transform to upper case

toupper(x)
Output
[1] "BANANA"   " CITRUS " "APP LE " 
str_to_upper(x)
Output
[1] "BANANA"   " CITRUS " "APP LE " 

3.4.6 Transform to title case

None.

str_to_title(x)
Output
[1] "Banana"   " Citrus " "App Le " 

3.5 Sort strings

3.5.1 Order

order(x)
Output
[1] 2 3 1
str_order(x)
Output
[1] 2 3 1

3.5.2 Sort

sort(x)
Output
[1] " citrus " "app le "  "Banana"  
str_sort(x)
Output
[1] " citrus " "app le "  "Banana"  

3.6 Split

strsplit(x, "a")
Output
[[1]]
[1] "B" "n" "n"

[[2]]
[1] " citrus "

[[3]]
[1] ""       "pp le "
str_split(x, "a")
Output
[[1]]
[1] "B" "n" "n" "" 

[[2]]
[1] " citrus "

[[3]]
[1] ""       "pp le "

4 Factors

set.seed(1)
f <- factor(c("A", "C", "A", "B", NA, "A", "C"))
x1 <- runif(length(f))
x2 <- runif(length(f))
f0 <- factor(rep(LETTERS[1:10], times = c(1, 3, 5, 16, 18, 20, 23, 27, 30, 31)))
f1 <- factor(c("C", "D", "D", "C", "E", "E"))
f2 <- factor(c("1", "101", "009", "009", "12", "12"))

4.1 Anonymise levels

new_lvls <- seq(nlevels(f))
levels(f) <- sample(new_lvls)
factor(f, new_lvls)
Output
[1] 1    3    1    2    <NA> 1    3   
Levels: 1 2 3
fct_anon(f)
Output
[1] 2    1    2    3    <NA> 2    1   
Levels: 1 2 3

4.2 Concatenate

c(f1, f2)
Output
 [1] C   D   D   C   E   E   1   101 009 009 12  12 
Levels: C D E 009 1 101 12
fct_c(f1, f2)
Output
 [1] C   D   D   C   E   E   1   101 009 009 12  12 
Levels: C D E 009 1 101 12

4.3 Collapse levels

levels(f)[levels(f) %in% c("A", "B")] <- "a"
f
Output
[1] a    C    a    a    <NA> a    C   
Levels: a C
fct_collapse(f, a = c("A", "B"))
Output
[1] a    C    a    a    <NA> a    C   
Levels: a C

4.4 Counting levels

as.data.frame(table(f))
Output
  f Freq
1 A    3
2 B    1
3 C    2
fct_count(f)
Output
# A tibble: 4 × 2
  f         n
  <fct> <int>
1 A         3
2 B         1
3 C         2
4 <NA>      1

4.5 Crossing levels

factor(paste(f1, f2, sep = ":"))
Output
[1] C:1   D:101 D:009 C:009 E:12  E:12 
Levels: C:009 C:1 D:009 D:101 E:12
fct_cross(f1, f2)
Output
[1] C:1   D:101 D:009 C:009 E:12  E:12 
Levels: C:009 D:009 C:1 D:101 E:12

4.6 Drop levels not in data

factor(f[-4])
Output
[1] A    C    A    <NA> A    C   
Levels: A C
fct_drop(f[-4])
Output
[1] A    C    A    <NA> A    C   
Levels: A C

4.7 Add additional levels

levels(f) <- c(levels(f), c("D", "E"))
f
Output
[1] A    C    A    B    <NA> A    C   
Levels: A B C D E
fct_expand(f, c("D", "E"))
Output
[1] A    C    A    B    <NA> A    C   
Levels: A B C D E

4.8 Reorder levels by frequency

factor(f, levels = rev(names(sort(table(f))))) 
Output
[1] A    C    A    B    <NA> A    C   
Levels: A C B
fct_infreq(f)
Output
[1] A    C    A    B    <NA> A    C   
Levels: A C B

4.9 Reorder levels by order of appearance in data

factor(f, levels = unique(f)) 
Output
[1] A    C    A    B    <NA> A    C   
Levels: A C B
fct_inorder(f)
Output
[1] A    C    A    B    <NA> A    C   
Levels: A C B

4.10 Reorder levels by numeric order

lvls <- levels(f2)
factor(f2, 
       levels = lvls[order(as.numeric(lvls))]) 
Output
[1] 1   101 009 009 12  12 
Levels: 1 009 12 101
fct_inseq(f2)
Output
[1] 1   101 009 009 12  12 
Levels: 1 009 12 101

4.11 Lump levels

lvls <- levels(f0)
tt <- sort(table(f0))
lvls_order <- names(rev(tt))
levels(f0)[lvls %in% setdiff(lvls, lvls_order[1:2])] <- "Other"
f0
Output
  [1] Other Other Other Other Other Other Other Other Other Other Other Other
 [13] Other Other Other Other Other Other Other Other Other Other Other Other
 [25] Other Other Other Other Other Other Other Other Other Other Other Other
 [37] Other Other Other Other Other Other Other Other Other Other Other Other
 [49] Other Other Other Other Other Other Other Other Other Other Other Other
 [61] Other Other Other Other Other Other Other Other Other Other Other Other
 [73] Other Other Other Other Other Other Other Other Other Other Other Other
 [85] Other Other Other Other Other Other Other Other Other Other Other Other
 [97] Other Other Other Other Other Other Other Other Other Other Other Other
[109] Other Other Other Other Other I     I     I     I     I     I     I    
[121] I     I     I     I     I     I     I     I     I     I     I     I    
[133] I     I     I     I     I     I     I     I     I     I     I     J    
[145] J     J     J     J     J     J     J     J     J     J     J     J    
[157] J     J     J     J     J     J     J     J     J     J     J     J    
[169] J     J     J     J     J     J    
Levels: Other I J
fct_lump_n(f, n = 2)
Output
[1] A     C     A     Other <NA>  A     C    
Levels: A C Other
n <- min(ceiling(0.2 * length(f0)), length(f0))
levels(f0)[lvls %in% setdiff(lvls, lvls_order[1:n])] <- "Other"
f0
Output
  [1] A B B B C C C C C D D D D D D D D D D D D D D D D E E E E E E E E E E E E
 [38] E E E E E E F F F F F F F F F F F F F F F F F F F F G G G G G G G G G G G
 [75] G G G G G G G G G G G G H H H H H H H H H H H H H H H H H H H H H H H H H
[112] H H I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I J J J J J
[149] J J J J J J J J J J J J J J J J J J J J J J J J J J
Levels: A B C D E F G H I J
fct_lump_prop(f0, prop = 0.2)
Output
  [1] Other Other Other Other Other Other Other Other Other Other Other Other
 [13] Other Other Other Other Other Other Other Other Other Other Other Other
 [25] Other Other Other Other Other Other Other Other Other Other Other Other
 [37] Other Other Other Other Other Other Other Other Other Other Other Other
 [49] Other Other Other Other Other Other Other Other Other Other Other Other
 [61] Other Other Other Other Other Other Other Other Other Other Other Other
 [73] Other Other Other Other Other Other Other Other Other Other Other Other
 [85] Other Other Other Other Other Other Other Other Other Other Other Other
 [97] Other Other Other Other Other Other Other Other Other Other Other Other
[109] Other Other Other Other Other Other Other Other Other Other Other Other
[121] Other Other Other Other Other Other Other Other Other Other Other Other
[133] Other Other Other Other Other Other Other Other Other Other Other Other
[145] Other Other Other Other Other Other Other Other Other Other Other Other
[157] Other Other Other Other Other Other Other Other Other Other Other Other
[169] Other Other Other Other Other Other
Levels: Other
levels(f0)[lvls %in% names(tt[tt < 2])] <- "Other"
f0
Output
  [1] Other B     B     B     C     C     C     C     C     D     D     D    
 [13] D     D     D     D     D     D     D     D     D     D     D     D    
 [25] D     E     E     E     E     E     E     E     E     E     E     E    
 [37] E     E     E     E     E     E     E     F     F     F     F     F    
 [49] F     F     F     F     F     F     F     F     F     F     F     F    
 [61] F     F     F     G     G     G     G     G     G     G     G     G    
 [73] G     G     G     G     G     G     G     G     G     G     G     G    
 [85] G     G     H     H     H     H     H     H     H     H     H     H    
 [97] H     H     H     H     H     H     H     H     H     H     H     H    
[109] H     H     H     H     H     I     I     I     I     I     I     I    
[121] I     I     I     I     I     I     I     I     I     I     I     I    
[133] I     I     I     I     I     I     I     I     I     I     I     J    
[145] J     J     J     J     J     J     J     J     J     J     J     J    
[157] J     J     J     J     J     J     J     J     J     J     J     J    
[169] J     J     J     J     J     J    
Levels: Other B C D E F G H I J
fct_lump_min(f0, min = 2)
Output
  [1] Other B     B     B     C     C     C     C     C     D     D     D    
 [13] D     D     D     D     D     D     D     D     D     D     D     D    
 [25] D     E     E     E     E     E     E     E     E     E     E     E    
 [37] E     E     E     E     E     E     E     F     F     F     F     F    
 [49] F     F     F     F     F     F     F     F     F     F     F     F    
 [61] F     F     F     G     G     G     G     G     G     G     G     G    
 [73] G     G     G     G     G     G     G     G     G     G     G     G    
 [85] G     G     H     H     H     H     H     H     H     H     H     H    
 [97] H     H     H     H     H     H     H     H     H     H     H     H    
[109] H     H     H     H     H     I     I     I     I     I     I     I    
[121] I     I     I     I     I     I     I     I     I     I     I     I    
[133] I     I     I     I     I     I     I     I     I     I     I     J    
[145] J     J     J     J     J     J     J     J     J     J     J     J    
[157] J     J     J     J     J     J     J     J     J     J     J     J    
[169] J     J     J     J     J     J    
Levels: B C D E F G H I J Other
bottom_lvls <- seq(which.min(cumsum(tt)[-length(tt)] - tt[-1]))
levels(f0)[lvls %in% names(tt[bottom_lvls])] <- "Other"
f0
Output
  [1] Other Other Other Other Other Other Other Other Other D     D     D    
 [13] D     D     D     D     D     D     D     D     D     D     D     D    
 [25] D     E     E     E     E     E     E     E     E     E     E     E    
 [37] E     E     E     E     E     E     E     F     F     F     F     F    
 [49] F     F     F     F     F     F     F     F     F     F     F     F    
 [61] F     F     F     G     G     G     G     G     G     G     G     G    
 [73] G     G     G     G     G     G     G     G     G     G     G     G    
 [85] G     G     H     H     H     H     H     H     H     H     H     H    
 [97] H     H     H     H     H     H     H     H     H     H     H     H    
[109] H     H     H     H     H     I     I     I     I     I     I     I    
[121] I     I     I     I     I     I     I     I     I     I     I     I    
[133] I     I     I     I     I     I     I     I     I     I     I     J    
[145] J     J     J     J     J     J     J     J     J     J     J     J    
[157] J     J     J     J     J     J     J     J     J     J     J     J    
[169] J     J     J     J     J     J    
Levels: Other D E F G H I J
fct_lump_lowfreq(f0)
Output
  [1] Other Other Other Other Other Other Other Other Other D     D     D    
 [13] D     D     D     D     D     D     D     D     D     D     D     D    
 [25] D     E     E     E     E     E     E     E     E     E     E     E    
 [37] E     E     E     E     E     E     E     F     F     F     F     F    
 [49] F     F     F     F     F     F     F     F     F     F     F     F    
 [61] F     F     F     G     G     G     G     G     G     G     G     G    
 [73] G     G     G     G     G     G     G     G     G     G     G     G    
 [85] G     G     H     H     H     H     H     H     H     H     H     H    
 [97] H     H     H     H     H     H     H     H     H     H     H     H    
[109] H     H     H     H     H     I     I     I     I     I     I     I    
[121] I     I     I     I     I     I     I     I     I     I     I     I    
[133] I     I     I     I     I     I     I     I     I     I     I     J    
[145] J     J     J     J     J     J     J     J     J     J     J     J    
[157] J     J     J     J     J     J     J     J     J     J     J     J    
[169] J     J     J     J     J     J    
Levels: D E F G H I J Other

4.12 Match levels

if("A" %in% levels(f)) {
  f %in% "A"
}
Output
[1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
fct_match(f, "A")
Output
[1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE

4.13 Recode levels

factor(sapply(f, \(x) {
  x <- as.character(x)
  switch(x, 
         A = "apple", 
         B = "banana",
         x)
  }))
Output
[1] apple  C      apple  banana <NA>   apple  C     
Levels: apple banana C
fct_recode(f, 
           apple = "A", 
           banana = "B")
Output
[1] apple  C      apple  banana <NA>   apple  C     
Levels: apple banana C

4.14 Relabel levels

levels(f) <- paste0(levels(f), "1")
f
Output
[1] A1   C1   A1   B1   <NA> A1   C1  
Levels: A1 B1 C1
fct_relabel(f, ~str_c(., "1"))
Output
[1] A1   C1   A1   B1   <NA> A1   C1  
Levels: A1 B1 C1

4.15 Relevel

relevel(f, ref = "B")
Output
[1] A    C    A    B    <NA> A    C   
Levels: B A C
fct_relevel(f, "B")
Output
[1] A    C    A    B    <NA> A    C   
Levels: B A C

4.16 Reorder

reorder(f, x1, mean)
Output
[1] A    C    A    B    <NA> A    C   
attr(,"scores")
        A         B         C 
0.5789172 0.9082078 0.6583996 
Levels: A C B
fct_reorder(f, x1, mean)
Output
[1] A    C    A    B    <NA> A    C   
Levels: A C B
reorder(f, seq_along(f), 
        \(i) sum(x1[i] * x2[i]))
Output
[1] A    C    A    B    <NA> A    C   
attr(,"scores")
        A         B         C 
0.8280563 0.1870677 0.5969617 
Levels: B C A
fct_reorder2(f, x1, x2, 
             \(x, y) sum(x * y))
Output
[1] A    C    A    B    <NA> A    C   
Levels: A C B

4.17 Unique

unique(f)
Output
[1] A    C    B    <NA>
Levels: A B C
fct_unique(f)
Output
[1] A    B    C    <NA>
Levels: A B C

4.18 Reverse order of levels

factor(f, levels = rev(levels(f)))
Output
[1] A    C    A    B    <NA> A    C   
Levels: C B A
fct_rev(f)
Output
[1] A    C    A    B    <NA> A    C   
Levels: C B A

4.19 Shift order of labels

factor(f, c(levels(f)[-1], levels(f)[1]))
Output
[1] A    C    A    B    <NA> A    C   
Levels: B C A
fct_shift(f)
Output
[1] A    C    A    B    <NA> A    C   
Levels: B C A

4.20 Shuffle order of labels

factor(f, sample(levels(f)))
Output
[1] A    C    A    B    <NA> A    C   
Levels: A C B
fct_shuffle(f)
Output
[1] A    C    A    B    <NA> A    C   
Levels: A C B

4.21 Unify labels across list

lapply(list(f1, f2), 
       \(x) factor(x, c(levels(f1), levels(f2))))
Output
[[1]]
[1] C D D C E E
Levels: C D E 009 1 101 12

[[2]]
[1] 1   101 009 009 12  12 
Levels: C D E 009 1 101 12
fct_unify(list(f1, f2))
Output
[[1]]
[1] C D D C E E
Levels: C D E 009 1 101 12

[[2]]
[1] 1   101 009 009 12  12 
Levels: C D E 009 1 101 12

5 Date and time

5.1 Get today’s date

Sys.Date()
Output
[1] "2025-01-12"
today()
Output
[1] "2025-01-12"

5.2 Get current date and time

Sys.time()
Output
[1] "2025-01-12 21:27:16 AEDT"
now()
Output
[1] "2025-01-12 21:27:16 AEDT"

5.3 Change input to date

as.Date("2025-01-01")
Output
[1] "2025-01-01"
as_date("2025-01-01")
Output
[1] "2025-01-01"
as.Date("2025 Jan 1", format = "%Y %b %d")
Output
[1] "2025-01-01"
as.Date("1/3/2025", format = "%m/%d/%Y")
Output
[1] "2025-01-03"
as.Date("1-3-2025", format = "%d-%m-%Y")
Output
[1] "2025-03-01"
ymd("2025 Jan 1")
Output
[1] "2025-01-01"
mdy("1/3/2025")
Output
[1] "2025-01-03"
dmy("1-3-2025")
Output
[1] "2025-03-01"

5.4 Change input to date and time

as.POSIXct("2025-01-01 05:00:00", tz = "UTC")
Output
[1] "2025-01-01 05:00:00 UTC"
as.POSIXlt("2025-01-01 05:00:00", tz = "UTC")
Output
[1] "2025-01-01 05:00:00 UTC"
as_datetime("2025-01-01 05:00:00")
Output
[1] "2025-01-01 05:00:00 UTC"

5.5 Extract year, month, day, hour, minutes or seconds

d <- as.POSIXct("2025/01/03 01:20:40")
as.numeric(format(d, "%Y"))
Output
[1] 2025
year(d)
Output
[1] 2025
as.numeric(format(d, "%m"))
Output
[1] 1
month(d)
Output
[1] 1
as.numeric(format(d, "%d"))
Output
[1] 3
day(d)
Output
[1] 3
as.numeric(format(d, "%H"))
Output
[1] 1
hour(d)
Output
[1] 1
as.numeric(format(d, "%M"))
Output
[1] 20
minutes(d)
Output
[1] "1735827640M 0S"
as.numeric(format(d, "%S"))
Output
[1] 40
seconds(d)
Output
[1] "1735827640S"

6 Lists

x <- list(1:3, 2:5, 3:-1)
y <- list("a", "b", "c")
z <- list(1, 2, 3)

6.1 A single list

6.1.1 Return as list

lapply(x, sum)
Output
[[1]]
[1] 6

[[2]]
[1] 14

[[3]]
[1] 5
map(x, sum)
Output
[[1]]
[1] 6

[[2]]
[1] 14

[[3]]
[1] 5

6.1.2 Return as vector

vapply(x, length, integer(1))
Output
[1] 3 4 5
map_int(x, length)
Output
[1] 3 4 5
vapply(x, sum, double(1))
Output
[1]  6 14  5
map_dbl(x, sum)
Output
[1]  6 14  5
vapply(x, is.numeric, logical(1))
Output
[1] TRUE TRUE TRUE
map_lgl(x, is.numeric)
Output
[1] TRUE TRUE TRUE
vapply(x, \(x) letters[length(x)], character(1))
Output
[1] "c" "d" "e"
map_chr(x, ~letters[length(.x)])
Output
[1] "c" "d" "e"

6.2 Two lists

lapply(seq_along(x), 
       \(i) paste0(y[[i]], sum(x[[i]])))
Output
[[1]]
[1] "a6"

[[2]]
[1] "b14"

[[3]]
[1] "c5"
map2(x, y, ~paste0(.y, sum(.x)))
Output
[[1]]
[1] "a6"

[[2]]
[1] "b14"

[[3]]
[1] "c5"

6.3 Multiple lists

lapply(seq_along(x), 
       \(i) paste0(sum(x[[i]]), y[[i]], z[[i]]))
Output
[[1]]
[1] "6a1"

[[2]]
[1] "14b2"

[[3]]
[1] "5c3"
pmap(list(x, y, z), 
     ~paste0(sum(..1), ..2, ..3))
Output
[[1]]
[1] "6a1"

[[2]]
[1] "14b2"

[[3]]
[1] "5c3"

7 Computational details

sessioninfo::session_info(include_base = TRUE)

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.3 (2024-02-29)
 os       macOS 15.1.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Australia/Sydney
 date     2025-01-12
 pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package        * version date (UTC) lib source
   base           * 4.3.3   2024-03-01 [?] local
   cachem           1.1.0   2024-05-16 [1] CRAN (R 4.3.3)
   cli              3.6.3   2024-06-21 [1] CRAN (R 4.3.3)
   colorspace       2.1-1   2024-07-26 [1] CRAN (R 4.3.3)
 P compiler         4.3.3   2024-03-01 [1] local
   conflicted       1.2.0   2023-02-01 [1] CRAN (R 4.3.0)
   data.table     * 1.16.4  2024-12-06 [1] CRAN (R 4.3.3)
 P datasets       * 4.3.3   2024-03-01 [1] local
   digest           0.6.37  2024-08-19 [1] CRAN (R 4.3.3)
   dplyr          * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
   evaluate         1.0.1   2024-10-10 [1] CRAN (R 4.3.3)
   fansi            1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
   fastmap          1.2.0   2024-05-15 [1] CRAN (R 4.3.3)
   forcats        * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
   generics         0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
   ggplot2        * 3.5.1   2024-04-23 [1] CRAN (R 4.3.1)
   glue             1.8.0   2024-09-30 [1] CRAN (R 4.3.3)
 P graphics       * 4.3.3   2024-03-01 [1] local
 P grDevices      * 4.3.3   2024-03-01 [1] local
 P grid             4.3.3   2024-03-01 [1] local
   gtable           0.3.6   2024-10-25 [1] CRAN (R 4.3.3)
   here             1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
   hms              1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
   htmltools        0.5.8.1 2024-04-04 [1] CRAN (R 4.3.1)
   htmlwidgets      1.6.4   2023-12-06 [1] CRAN (R 4.3.1)
   jsonlite         1.8.9   2024-09-20 [1] CRAN (R 4.3.3)
   knitr            1.49    2024-11-08 [1] CRAN (R 4.3.3)
   lifecycle        1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
   lubridate      * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
   magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
   memoise          2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
 P methods        * 4.3.3   2024-03-01 [1] local
   munsell          0.5.1   2024-04-01 [1] CRAN (R 4.3.1)
   palmerpenguins * 0.1.1   2022-08-15 [1] CRAN (R 4.3.0)
   pillar           1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
   pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
   purrr          * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
   R6               2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
   readr          * 2.1.5   2024-01-10 [1] CRAN (R 4.3.1)
   rlang            1.1.4   2024-06-04 [1] CRAN (R 4.3.3)
   rmarkdown        2.29    2024-11-04 [1] CRAN (R 4.3.3)
   rprojroot        2.0.4   2023-11-05 [1] CRAN (R 4.3.1)
   rstudioapi       0.17.1  2024-10-22 [1] CRAN (R 4.3.3)
   scales           1.3.0   2023-11-28 [1] CRAN (R 4.3.1)
   sessioninfo      1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 P stats          * 4.3.3   2024-03-01 [1] local
   stringi          1.8.4   2024-05-06 [1] CRAN (R 4.3.1)
   stringr        * 1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
   tibble         * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
   tidyr          * 1.3.1   2024-01-24 [1] CRAN (R 4.3.1)
   tidyselect       1.2.1   2024-03-11 [1] CRAN (R 4.3.1)
   tidyverse      * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
   timechange       0.3.0   2024-01-18 [1] CRAN (R 4.3.1)
 P tools            4.3.3   2024-03-01 [1] local
   tzdb             0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
   utf8             1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
 P utils          * 4.3.3   2024-03-01 [1] local
   vctrs            0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
   withr            3.0.2   2024-10-28 [1] CRAN (R 4.3.3)
   xfun             0.49    2024-10-31 [1] CRAN (R 4.3.3)
   yaml             2.3.10  2024-07-26 [1] CRAN (R 4.3.3)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────


References

Barrett, Tyson, Matt Dowle, Arun Srinivasan, Jan Gorecki, Michael Chirico, Toby Hocking, and Benjamin Schwendinger. 2024. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Stoudt, Sara. 2024. “Vignette in Stringr R Package: From Base R.” 2024. https://stringr.tidyverse.org/articles/from-base.html.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Footnotes

  1. In practice, the data is just reset if it was modified.↩︎