Data Wrangling with R: Day 2

<div class="bg-black white" style="width:45%;right:0;bottom:0;padding-left:5px;border: solid 4px white;margin: auto;">
 These slides are viewed best by Chrome and occasionally need to be refreshed if elements did not load properly. See here for <a href=day2-session2.pdf>PDF </a>. 
</div>

---

count: false
background-image: url(images/bg1.jpg)
background-size: cover
class: hide-slide-number title-slide

.item.center[

# Data Wrangling with R: Day 2

]

.center.shade_black.animated.bounceInUp.slower[
 
## Formatting factors with `forcats`

Presented by Emi Tanaka

Department of Econometrics and Business Statistics

<img src="images/monash-one-line-reversed.png" style="width:500px"> 
 emi.tanaka@monash.edu
 @statsgen

.bottom_abs.width100.bg-black[
2nd December 2020 @ Statistical Society of Australia | Zoom
]

]

</div>

---

## There are two types of categorical variables

#### .monash-blue[**Nominal**] where there is no intrinsic ordering to the categories
E.g. blue, grey, black, white.

#### .monash-blue[**Ordinal**] where there is a clear order to the categories
E.g. Strongly disagree, disagree, neutral, agree, strongly agree.

---

# Categorical variables in R .font_small[.font_small[Part] 1]

* In R, categorical variables may be encoded in various ways.

```r
cat_chr <- c("red", "white", "blue")
cat_fct <- factor(c("red", "white", "blue"))

class(cat_chr)
```

```
## [1] "character"
```

```r
class(cat_fct)
```

```
## [1] "factor"
```
--

* Then you have categorical variables that look like a numerical variable (e.g. coded variables like say 1=male, 2=female)
* And also those that have fixed levels of numerical values (e.g. `ToothGrowth$dose`: 0.5, 1.0 and 2.0)

---

## So why encode as .monash-blue[`factor`] instead of .monash-blue[`character`]?

### In some cases, characters are converted to factors (or vice-versa) in functions so they can be similar.

### The main idea of a factor is that the variable has a *fixed number of levels*

---

# Categorical variables in R .font_small[.font_small[Part] 2]

* When a variable is encoded as a **factor** then there is an attribute with the levels

```r
data <- c(2, 2, 1, 1, 3, 3, 3, 1)
factor(data)
```

```
## [1] 2 2 1 1 3 3 3 1
## Levels: 1 2 3
```
* You can easily change the labels of the variables:

```r
factor(data, 
       labels = c("I", "II", "III"))
```

```
## [1] II  II  I   I   III III III I  
## Levels: I II III
```

---

# Categorical variables in R .font_small[.font_small[Part] 3]

* Order of the factors are determined by the input:

```r
*# numerical input are ordered in increasing order
factor(c(1, 3, 10))
```

```
## [1] 1  3  10
## Levels: 1 3 10
```

```r
*# character input are ordered alphabetically
factor(c("1", "3", "10"))
```

```
## [1] 1  3  10
## Levels: 1 10 3
```

```r
*# you can specify order of levels explicitly
factor(c("1", "3", "10"),  levels = c("1", "3", "10"))
```

```
## [1] 1  3  10
## Levels: 1 3 10
```

---

# Why would the order of the levels matter?

* Some downstream analysis may use it

```r
data("population", package = "tidyr")
population %>% 
  filter(year == 2013) %>% 
  # just choose 5 countries
  slice(c(1, 11, 21, 31, 41)) %>% 
  ggplot(aes(population, country)) +
  geom_col()
```

![](images/day2-session2/unnamed-chunk-3-1.png)

]
.item50[
{{content}}
]
]

```r
population %>% 
  filter(year == 2013) %>% 
  slice(c(1, 11, 21, 31, 41)) %>% 
  mutate(country = 
*     reorder(country, population)) %>%
  ggplot(aes(population, country)) +
  geom_col()
```

![](images/day2-session2/unnamed-chunk-4-1.png)

---

#  Cautionary tales of working with factors

---

# Numerical factors in R

```r
x <- factor(c(10, 20, 30, 10, 20))
mean(x)
```

```
## Warning in mean.default(x): argument is not numeric or logical: returning NA
```

```
## [1] NA
```

`as.numeric` function returns the internal integer values of the factor

```r
mean(as.numeric(x))
```

```
## [1] 1.8
```

You probably want to use:

```r
mean(as.numeric(levels(x)[x]))
```

```
## [1] 18
```

]
.item[

```r
mean(as.numeric(as.character(x)))
```

```
## [1] 18
```

</div>

---

# Defining levels explicitly .font_small[.font_small[Part] 1]

* If the variable contain values that are not in the levels of the factors, then those values will become a missing value

```r
factor(c("Yes", "No", "Maybe"), levels = c("Yes", "No"))
```

```
## [1] Yes No <NA>
## Levels: Yes No
```

* This can be useful at times, but it's a good idea to check the values before it is transformed as `NA`

```r
factor(c("Yes", "No", "No", "Yess"), levels = c("Yes", "No"))
```

```
## [1] Yes No No <NA>
## Levels: Yes No
```

---

# Defining levels explicitly .font_small[.font_small[Part] 2]

* You can have levels that are not observed

```r
f <- factor(c("Yes", "Yes", "Yes", "No"), levels = c("Yes", "Maybe", "No"))
f
```

```
## [1] Yes Yes Yes No 
## Levels: Yes Maybe No
```
--

* This can be useful at times downstream, e.g.

```r
table(f)
```

```
## f
##   Yes Maybe    No 
##     3     0     1
```

---

# Combining factors .font_small[as vectors]

```r
f1 <- factor(c("F", "M", "F"))
f2 <- factor(c("F", "F"))
```

* What do you think the output will be for below?

```r
c(f1, f2)
```

```
## [1] 1 2 1 1 1
```

* Was that expected?
--

* The `c` function strips the class when you combine factors

```r
unclass(f1)
```

```
## [1] 1 2 1
## attr(,"levels")
## [1] "F" "M"
```

---

# Combining factors .font_small[in a data frame]

```r
df1 <- data.frame(f = factor(c("a", "b")))
df2 <- data.frame(f = factor(c("c", "b")))
```
* What do you think the output below will be?

```r
rbind(df1, df2)
```
--

```
##   f
## 1 a
## 2 b
## 3 c
## 4 b
```
--

```r
rbind(df1, df2)$f
```

```
## [1] a b c b
## Levels: a b c
```

---

# Working with factors with `forcats`

---

# Formatting factors

.footnote[
Hadley Wickham (2020). forcats: Tools for Working with
  Categorical Variables (Factors). R package version 0.5.0.
]

* The `forcats` package is part of `tidyverse` 
--

* Like the `stringr` package the main functions in `forcats` **prefix with `fct_` or `lvls_`** and the **first argument is a factor (or a character) vector** .font_small[some functions do not allow character as input, e.g. `fct_c`]
--

* The list of available commands are:

.grid.font_small[
.item[
* `fct_anon`
* `fct_c`
* `fct_collapse`
* `fct_count`
* `fct_cross`
* `fct_drop`
* `fct_expand`
* `fct_explicit_na`

]
.item[
* `fct_infreq`
* `fct_inorder`
* `fct_inseq`
* `fct_lump`
* `fct_lump_lowfreq`
* `fct_lump_min`
* `fct_lump_n`
* `fct_lump_prop`
]
.item[
* `fct_match`
* `fct_other`
* `fct_recode`
* `fct_relabel`
* `fct_relevel`
* `fct_reorder`
* `fct_reorder2`
* `fct_rev`

]
.item[
* `fct_shift`
* `fct_shuffle`
* `fct_unify`
* `fct_unique`
* `lvls_expand`
* `lvls_reorder`
* `lvls_revalue`
* `lvls_union`

]
]
]

---

# Formatting factors

.footnote[
Hadley Wickham (2020). forcats: Tools for Working with
  Categorical Variables (Factors). R package version 0.5.0.
]

* The `forcats` package is part of `tidyverse`

* The list of available commands are:

.grid.font_small[
.item[
* `fct_anon`
* .monash-blue[`fct_c`]
* .monash-blue[`fct_collapse`]
* .monash-blue[`fct_count`]
* `fct_cross`
* `fct_drop`
* `fct_expand`
* `fct_explicit_na`

]
.item[
* `fct_infreq`
* `fct_inorder`
* `fct_inseq`
* .monash-blue[`fct_lump`]
* .monash-blue[`fct_lump_lowfreq`]
* .monash-blue[`fct_lump_min`]
* .monash-blue[`fct_lump_n`]
* .monash-blue[`fct_lump_prop`]
]
.item[
* `fct_match`
* `fct_other`
* `fct_recode`
* `fct_relabel`
* .monash-blue[`fct_relevel`]
* `fct_reorder`
* `fct_reorder2`
* `fct_rev`

]
.item[
* `fct_shift`
* `fct_shuffle`
* `fct_unify`
* `fct_unique`
* `lvls_expand`
* `lvls_reorder`
* `lvls_revalue`
* `lvls_union`

]
]
]

---

# Combining factors .font_small[as vectors with `forcats`]

```r
f1 <- factor(c("F", "M", "F"))
f2 <- factor(c("F", "F"))

c(f1, f2)
```

```
## [1] 1 2 1 1 1
```
--

```r
fct_c(f1, f2)
```
--

```
## [1] F M F F F
## Levels: F M
```
--

```r
c1 <- c("F", "M", "F")

fct_c(c1, f2)
```
--

```
## Error: All elements of `...` must be factors
```

---

# Count levels in a factor

```r
data("gss_cat", package = "forcats")
table(gss_cat$race)
```

```
## 
##          Other          Black          White Not applicable 
##           1959           3129          16395              0
```
* `table` in Base R is useful but you may want the output as a data frame
--

```r
fct_count(gss_cat$race, sort = TRUE, prop = TRUE)
```

```
## # A tibble: 4 x 3
## f n p
## <fct> <int> <dbl>
## 1 White 16395 0.763 
## 2 Black 3129 0.146 
## 3 Other 1959 0.0912
## 4 Not applicable 0 0
```

---

# Collapse levels in a factor

```r
levels(gss_cat$marital)
```

```
## [1] "No answer"     "Never married" "Separated"     "Divorced"     
## [5] "Widowed"       "Married"
```
--

```r
gss_cat$marital %>% 
* fct_collapse(Single = c("Never married", "Separated", "Divorced")) %>%
  fct_count()
```

```
## # A tibble: 4 x 2
## f n
## <fct> <int>
## 1 No answer 17
## 2 Single 9542
## 3 Widowed 1807
## 4 Married 10117
```

---

# Collapse levels in a factor

```r
levels(gss_cat$marital)
```

```
## [1] "No answer"     "Never married" "Separated"     "Divorced"     
## [5] "Widowed"       "Married"
```

```r
gss_cat$marital %>% 
  fct_collapse(Single = c("Never married", "Separated", "Divorced")) %>% 
* fct_relevel("No answer", after = Inf) %>% # move to last place
  fct_count()
```

```
## # A tibble: 4 x 2
## f n
## <fct> <int>
## 1 Single 9542
## 2 Widowed 1807
## 3 Married 10117
## 4 No answer 17
```

---

# Lumping factor levels .font_small[.font_small[Part] 1]

* Sometimes you have a lot of levels and you'd prefer to lump some of them together to the "Other" category
--

* What criterion do you use to lump levels together?

* There are four main criterion to lump levels using `fct_lump*` functions:
 * .monash-blue[`fct_lump_n`]: lump all levels except the `n` most frequent
 * .monash-blue[`fct_lump_min`]: lump together those less than `min` counts
 * .monash-blue[`fct_lump_prop`]: lump together those less than proportion of `prop`
 * .monash-blue[`fct_lump_lowfreq`]: lump up least frequent levels such that the Other level is still the smallest level
 * `fct_lump` <img src="https://raw.githubusercontent.com/r-lib/lifecycle/master/man/figures/lifecycle-superseded.svg">, it is better to use one of the above functions instead

---

# Lumping factor levels .font_small[.font_small[Part] 2]

```r
levels(gss_cat$relig)
```

```
##  [1] "No answer"               "Don't know"             
##  [3] "Inter-nondenominational" "Native american"        
##  [5] "Christian"               "Orthodox-christian"     
##  [7] "Moslem/islam"            "Other eastern"          
##  [9] "Hinduism"                "Buddhism"               
## [11] "Other"                   "None"                   
## [13] "Jewish"                  "Catholic"               
## [15] "Protestant"              "Not applicable"
```

]
.item[

]

```r
fct_lump_n(gss_cat$relig, n = 2) %>% 
  fct_count(sort = TRUE, prop = TRUE)
```

```
## # A tibble: 3 x 3
## f n p
## <fct> <int> <dbl>
## 1 Protestant 10846 0.505
## 2 Other 5513 0.257
## 3 Catholic 5124 0.239
```

```r
fct_lump_lowfreq(gss_cat$relig) %>% 
  fct_count(sort = TRUE, prop = TRUE)
```

```
## # A tibble: 2 x 3
## f n p
## <fct> <int> <dbl>
## 1 Protestant 10846 0.505
## 2 Other 10637 0.495
```

---

# If you installed the `dwexercise` package, run below in your R console

```r
learnr::run_tutorial("day2-exercise-02", package = "dwexercise")
```

# If the above doesn't work for you, go [here](https://ebsmonash.shinyapps.io/dw-day2-exercise-02).

# Questions or issues, let us know!
<center>
<div class="countdown clock" id="timer_5fc5f392" style="right:0;bottom:0;" data-warnwhen="0">
<code class="countdown-time">15:00</code>
</div>
</center>

---

# Session Information

```r
devtools::session_info()
```

```
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.1 (2020-06-06)
##  os       macOS Catalina 10.15.7      
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Melbourne         
##  date     2020-12-01                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version    date       lib source                              
##  anicon        0.1.0      2020-06-21 [1] Github (emitanaka/anicon@0b756df)   
##  assertthat    0.2.1      2019-03-21 [2] CRAN (R 4.0.0)                      
##  backports     1.2.0      2020-11-02 [1] CRAN (R 4.0.2)                      
##  broom         0.7.2      2020-10-20 [1] CRAN (R 4.0.2)                      
##  callr         3.5.1      2020-10-13 [1] CRAN (R 4.0.2)                      
##  cellranger    1.1.0      2016-07-27 [2] CRAN (R 4.0.0)                      
##  cli           2.2.0      2020-11-20 [1] CRAN (R 4.0.1)                      
##  colorspace    2.0-0      2020-11-11 [1] CRAN (R 4.0.2)                      
##  countdown     0.3.5      2020-07-20 [1] Github (gadenbuie/countdown@a544fa4)
##  crayon        1.3.4      2017-09-16 [2] CRAN (R 4.0.0)                      
##  DBI           1.1.0      2019-12-15 [1] CRAN (R 4.0.2)                      
##  dbplyr        2.0.0      2020-11-03 [1] CRAN (R 4.0.2)                      
##  desc          1.2.0      2018-05-01 [2] CRAN (R 4.0.0)                      
##  devtools      2.3.2      2020-09-18 [1] CRAN (R 4.0.2)                      
##  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.2)                      
##  dplyr       * 1.0.2      2020-08-18 [1] CRAN (R 4.0.2)                      
##  ellipsis      0.3.1      2020-05-15 [2] CRAN (R 4.0.0)                      
##  evaluate      0.14       2019-05-28 [2] CRAN (R 4.0.0)                      
##  fansi         0.4.1      2020-01-08 [2] CRAN (R 4.0.0)                      
##  farver        2.0.3.9000 2020-07-24 [1] Github (thomasp85/farver@f1bcb56)   
##  forcats     * 0.5.0      2020-03-01 [2] CRAN (R 4.0.0)                      
##  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                      
##  generics      0.1.0      2020-10-31 [2] CRAN (R 4.0.2)                      
##  ggplot2     * 3.3.2      2020-06-19 [1] CRAN (R 4.0.2)                      
##  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                      
##  gtable        0.3.0      2019-03-25 [2] CRAN (R 4.0.0)                      
##  haven         2.3.1      2020-06-01 [2] CRAN (R 4.0.0)                      
##  hms           0.5.3      2020-01-08 [2] CRAN (R 4.0.0)                      
##  htmltools     0.5.0      2020-06-16 [1] CRAN (R 4.0.2)                      
##  httr          1.4.2      2020-07-20 [1] CRAN (R 4.0.2)                      
##  icon          0.1.0      2020-06-21 [1] Github (emitanaka/icon@8458546)     
##  jsonlite      1.7.1      2020-09-07 [1] CRAN (R 4.0.2)                      
##  knitr         1.30       2020-09-22 [1] CRAN (R 4.0.2)                      
##  labeling      0.4.2      2020-10-20 [1] CRAN (R 4.0.2)                      
##  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.0)                      
##  lubridate     1.7.9      2020-06-08 [2] CRAN (R 4.0.1)                      
##  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.2)                      
##  memoise       1.1.0      2017-04-21 [2] CRAN (R 4.0.0)                      
##  modelr        0.1.8      2020-05-19 [2] CRAN (R 4.0.0)                      
##  munsell       0.5.0      2018-06-12 [2] CRAN (R 4.0.0)                      
##  pillar        1.4.7      2020-11-20 [1] CRAN (R 4.0.1)                      
##  pkgbuild      1.1.0      2020-07-13 [2] CRAN (R 4.0.1)                      
##  pkgconfig     2.0.3      2019-09-22 [2] CRAN (R 4.0.0)                      
##  pkgload       1.1.0      2020-05-29 [2] CRAN (R 4.0.0)                      
##  prettyunits   1.1.1      2020-01-24 [2] CRAN (R 4.0.0)                      
##  processx      3.4.4      2020-09-03 [1] CRAN (R 4.0.2)                      
##  ps            1.4.0      2020-10-07 [1] CRAN (R 4.0.2)                      
##  purrr       * 0.3.4      2020-04-17 [2] CRAN (R 4.0.0)                      
##  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.2)                      
##  Rcpp          1.0.5      2020-07-06 [1] CRAN (R 4.0.0)                      
##  readr       * 1.4.0      2020-10-05 [2] CRAN (R 4.0.2)                      
##  readxl        1.3.1      2019-03-13 [2] CRAN (R 4.0.0)                      
##  remotes       2.2.0      2020-07-21 [1] CRAN (R 4.0.2)                      
##  reprex        0.3.0.9001 2020-08-08 [1] Github (tidyverse/reprex@9594ee9)   
##  rlang         0.4.8      2020-10-08 [1] CRAN (R 4.0.2)                      
##  rmarkdown     2.5        2020-10-21 [1] CRAN (R 4.0.1)                      
##  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.2)                      
##  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.1)                      
##  rvest         0.3.6      2020-07-25 [1] CRAN (R 4.0.2)                      
##  scales        1.1.1      2020-05-11 [2] CRAN (R 4.0.0)                      
##  sessioninfo   1.1.1      2018-11-05 [2] CRAN (R 4.0.0)                      
##  stringi       1.5.3      2020-09-09 [2] CRAN (R 4.0.2)                      
##  stringr     * 1.4.0      2019-02-10 [2] CRAN (R 4.0.0)                      
##  testthat      3.0.0      2020-10-31 [1] CRAN (R 4.0.2)                      
##  tibble      * 3.0.4.9000 2020-11-26 [1] Github (tidyverse/tibble@9eeef4d)   
##  tidyr       * 1.1.2      2020-08-27 [1] CRAN (R 4.0.2)                      
##  tidyselect    1.1.0      2020-05-11 [2] CRAN (R 4.0.0)                      
##  tidyverse   * 1.3.0      2019-11-21 [1] CRAN (R 4.0.2)                      
##  usethis       1.6.3      2020-09-17 [1] CRAN (R 4.0.2)                      
##  utf8          1.1.4      2018-05-24 [2] CRAN (R 4.0.0)                      
##  vctrs         0.3.5.9000 2020-11-26 [1] Github (r-lib/vctrs@957baf7)        
##  whisker       0.4        2019-08-28 [2] CRAN (R 4.0.0)                      
##  withr         2.3.0      2020-09-22 [1] CRAN (R 4.0.2)                      
##  xaringan      0.18       2020-10-21 [1] CRAN (R 4.0.2)                      
##  xfun          0.19       2020-10-30 [1] CRAN (R 4.0.2)                      
##  xml2          1.3.2      2020-04-23 [2] CRAN (R 4.0.0)                      
##  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)                      
## 
## [1] /Users/etan0038/Library/R/4.0/library
## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```
]

These slides are licensed under <center><a href="https://creativecommons.org/licenses/by-sa/3.0/au/"><img src="images/cc.svg" style="height:2em;"/><img src="images/by.svg" style="height:2em;"/><img src="images/sa.svg" style="height:2em;"/></a></center>