class: monash-bg-blue center middle hide-slide-number <div class="bg-black white" style="width:45%;right:0;bottom:0;padding-left:5px;border: solid 4px white;margin: auto;"> <i class="fas fa-exclamation-circle"></i> These slides are viewed best by Chrome and occasionally need to be refreshed if elements did not load properly. See here for <a href=day2-session2.pdf>PDF <i class="fas fa-file-pdf"></i></a>. </div> <br> .white[Press the **right arrow** to progress to the next slide!] --- count: false background-image: url(images/bg1.jpg) background-size: cover class: hide-slide-number title-slide <div class="grid-row" style="grid: 1fr / 2fr;"> .item.center[ # <span style="text-shadow: 2px 2px 30px white;">Data Wrangling with R: Day 2</span> <!-- ## <span style="color:;text-shadow: 2px 2px 30px black;">Formatting factors with `forcats`</span> --> ] .center.shade_black.animated.bounceInUp.slower[ <br><br> ## <span style="color: #ccf2ff; text-shadow: 10px 10px 100px white;">Formatting factors with `forcats`</span> <br> Presented by Emi Tanaka Department of Econometrics and Business Statistics <img src="images/monash-one-line-reversed.png" style="width:500px"><br>
<i class="fas fa-envelope faa-float animated "></i>
emi.tanaka@monash.edu
<i class="fab fa-twitter faa-float animated faa-fast "></i>
@statsgen .bottom_abs.width100.bg-black[ 2nd December 2020 @ Statistical Society of Australia | Zoom ] ] </div> --- class: middle ## There are two types of categorical variables -- <br><br> #### .monash-blue[**Nominal**] where there is no intrinsic ordering to the categories E.g. blue, grey, black, white. -- <br> #### .monash-blue[**Ordinal**] where there is a clear order to the categories E.g. Strongly disagree, disagree, neutral, agree, strongly agree. --- # Categorical variables in R .font_small[.font_small[Part] 1] * In R, categorical variables may be encoded in various ways. ```r cat_chr <- c("red", "white", "blue") cat_fct <- factor(c("red", "white", "blue")) class(cat_chr) ``` ``` ## [1] "character" ``` ```r class(cat_fct) ``` ``` ## [1] "factor" ``` -- * Then you have categorical variables that look like a numerical variable<br> (e.g. coded variables like say 1=male, 2=female) * And also those that have fixed levels of numerical values <br>(e.g. `ToothGrowth$dose`: 0.5, 1.0 and 2.0) --- class: middle center ## So why encode as .monash-blue[`factor`] instead of .monash-blue[`character`]? -- <br> ### In some cases, characters are converted to factors (or vice-versa) in functions so they can be similar. -- <br> ### The main idea of a factor is that the variable has a<br> *fixed number of levels* --- # Categorical variables in R .font_small[.font_small[Part] 2] * When a variable is encoded as a **factor** then there is an attribute with the levels ```r data <- c(2, 2, 1, 1, 3, 3, 3, 1) factor(data) ``` ``` ## [1] 2 2 1 1 3 3 3 1 ## Levels: 1 2 3 ``` * You can easily change the labels of the variables: ```r factor(data, labels = c("I", "II", "III")) ``` ``` ## [1] II II I I III III III I ## Levels: I II III ``` --- # Categorical variables in R .font_small[.font_small[Part] 3] * Order of the factors are determined by the input: ```r *# numerical input are ordered in increasing order factor(c(1, 3, 10)) ``` ``` ## [1] 1 3 10 ## Levels: 1 3 10 ``` ```r *# character input are ordered alphabetically factor(c("1", "3", "10")) ``` ``` ## [1] 1 3 10 ## Levels: 1 10 3 ``` ```r *# you can specify order of levels explicitly factor(c("1", "3", "10"), levels = c("1", "3", "10")) ``` ``` ## [1] 1 3 10 ## Levels: 1 3 10 ``` --- # Why would the order of the levels matter? -- * Some downstream analysis may use it -- .grid[ .item[ ```r data("population", package = "tidyr") population %>% filter(year == 2013) %>% # just choose 5 countries slice(c(1, 11, 21, 31, 41)) %>% ggplot(aes(population, country)) + geom_col() ``` ![](images/day2-session2/unnamed-chunk-3-1.png)<!-- --> ] .item50[ {{content}} ] ] -- ```r population %>% filter(year == 2013) %>% slice(c(1, 11, 21, 31, 41)) %>% mutate(country = * reorder(country, population)) %>% ggplot(aes(population, country)) + geom_col() ``` ![](images/day2-session2/unnamed-chunk-4-1.png)<!-- --> --- class: middle transition <img src="images/no-toilet-roll.png" width = "200px"> # Cautionary tales of working with factors --- # Numerical factors in R ```r x <- factor(c(10, 20, 30, 10, 20)) mean(x) ``` ``` ## Warning in mean.default(x): argument is not numeric or logical: returning NA ``` ``` ## [1] NA ``` -- <i class="fas fa-exclamation-triangle"></i> `as.numeric` function returns the internal integer values of the factor ```r mean(as.numeric(x)) ``` ``` ## [1] 1.8 ``` -- You probably want to use: <div class="grid" style="padding-left:5%;margin-right:20%"> .item[ ```r mean(as.numeric(levels(x)[x])) ``` ``` ## [1] 18 ``` ] .item[ ```r mean(as.numeric(as.character(x))) ``` ``` ## [1] 18 ``` ]. </div> --- # Defining levels explicitly .font_small[.font_small[Part] 1] * If the variable contain values that are not in the levels of the factors, then those values will become a missing value ```r factor(c("Yes", "No", "Maybe"), levels = c("Yes", "No")) ``` ``` ## [1] Yes No <NA> ## Levels: Yes No ``` -- * This can be useful at times, but it's a good idea to check the values before it is transformed as `NA` ```r factor(c("Yes", "No", "No", "Yess"), levels = c("Yes", "No")) ``` ``` ## [1] Yes No No <NA> ## Levels: Yes No ``` --- # Defining levels explicitly .font_small[.font_small[Part] 2] * You can have levels that are not observed ```r f <- factor(c("Yes", "Yes", "Yes", "No"), levels = c("Yes", "Maybe", "No")) f ``` ``` ## [1] Yes Yes Yes No ## Levels: Yes Maybe No ``` -- * This can be useful at times downstream, e.g. ```r table(f) ``` ``` ## f ## Yes Maybe No ## 3 0 1 ``` --- # Combining factors .font_small[as vectors] ```r f1 <- factor(c("F", "M", "F")) f2 <- factor(c("F", "F")) ``` * What do you think the output will be for below? ```r c(f1, f2) ``` -- ``` ## [1] 1 2 1 1 1 ``` -- * Was that expected? -- * The `c` function strips the class when you combine factors ```r unclass(f1) ``` ``` ## [1] 1 2 1 ## attr(,"levels") ## [1] "F" "M" ``` --- # Combining factors .font_small[in a data frame] ```r df1 <- data.frame(f = factor(c("a", "b"))) df2 <- data.frame(f = factor(c("c", "b"))) ``` * What do you think the output below will be? ```r rbind(df1, df2) ``` -- ``` ## f ## 1 a ## 2 b ## 3 c ## 4 b ``` -- ```r rbind(df1, df2)$f ``` ``` ## [1] a b c b ## Levels: a b c ``` --- class: transition middle # Working with factors with `forcats` --- # Formatting factors .footnote[ Hadley Wickham (2020). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.0. ] * The `forcats` package is part of `tidyverse` -- * Like the `stringr` package the main functions in `forcats` **prefix with `fct_` or `lvls_`** and the **first argument is a factor (or a character) vector**<br> .font_small[some functions do not allow character as input, e.g. `fct_c`] -- * The list of available commands are: .grid.font_small[ .item[ * `fct_anon` * `fct_c` * `fct_collapse` * `fct_count` * `fct_cross` * `fct_drop` * `fct_expand` * `fct_explicit_na` ] .item[ * `fct_infreq` * `fct_inorder` * `fct_inseq` * `fct_lump` * `fct_lump_lowfreq` * `fct_lump_min` * `fct_lump_n` * `fct_lump_prop` ] .item[ * `fct_match` * `fct_other` * `fct_recode` * `fct_relabel` * `fct_relevel` * `fct_reorder` * `fct_reorder2` * `fct_rev` ] .item[ * `fct_shift` * `fct_shuffle` * `fct_unify` * `fct_unique` * `lvls_expand` * `lvls_reorder` * `lvls_revalue` * `lvls_union` ] ] ] --- count: false # Formatting factors .footnote[ Hadley Wickham (2020). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.0. ] * The `forcats` package is part of `tidyverse` * Like the `stringr` package the main functions in `forcats` **prefix with `fct_` or `lvls_`** and the **first argument is a factor (or a character) vector**<br> .font_small[some functions do not allow character as input, e.g. `fct_c`] * The list of available commands are: .grid.font_small[ .item[ * `fct_anon` * .monash-blue[`fct_c`] * .monash-blue[`fct_collapse`] * .monash-blue[`fct_count`] * `fct_cross` * `fct_drop` * `fct_expand` * `fct_explicit_na` ] .item[ * `fct_infreq` * `fct_inorder` * `fct_inseq` * .monash-blue[`fct_lump`] * .monash-blue[`fct_lump_lowfreq`] * .monash-blue[`fct_lump_min`] * .monash-blue[`fct_lump_n`] * .monash-blue[`fct_lump_prop`] ] .item[ * `fct_match` * `fct_other` * `fct_recode` * `fct_relabel` * .monash-blue[`fct_relevel`] * `fct_reorder` * `fct_reorder2` * `fct_rev` ] .item[ * `fct_shift` * `fct_shuffle` * `fct_unify` * `fct_unique` * `lvls_expand` * `lvls_reorder` * `lvls_revalue` * `lvls_union` ] ] ] --- # Combining factors .font_small[as vectors with `forcats`] ```r f1 <- factor(c("F", "M", "F")) f2 <- factor(c("F", "F")) c(f1, f2) ``` ``` ## [1] 1 2 1 1 1 ``` -- ```r fct_c(f1, f2) ``` -- ``` ## [1] F M F F F ## Levels: F M ``` -- ```r c1 <- c("F", "M", "F") fct_c(c1, f2) ``` -- ``` ## Error: All elements of `...` must be factors ``` --- # Count levels in a factor ```r data("gss_cat", package = "forcats") table(gss_cat$race) ``` ``` ## ## Other Black White Not applicable ## 1959 3129 16395 0 ``` * `table` in Base R is useful but you may want the output as a data frame -- ```r fct_count(gss_cat$race, sort = TRUE, prop = TRUE) ``` ``` ## # A tibble: 4 x 3 ## f n p ## <fct> <int> <dbl> ## 1 White 16395 0.763 ## 2 Black 3129 0.146 ## 3 Other 1959 0.0912 ## 4 Not applicable 0 0 ``` --- # Collapse levels in a factor ```r levels(gss_cat$marital) ``` ``` ## [1] "No answer" "Never married" "Separated" "Divorced" ## [5] "Widowed" "Married" ``` -- ```r gss_cat$marital %>% * fct_collapse(Single = c("Never married", "Separated", "Divorced")) %>% fct_count() ``` ``` ## # A tibble: 4 x 2 ## f n ## <fct> <int> ## 1 No answer 17 ## 2 Single 9542 ## 3 Widowed 1807 ## 4 Married 10117 ``` --- count: false # Collapse levels in a factor ```r levels(gss_cat$marital) ``` ``` ## [1] "No answer" "Never married" "Separated" "Divorced" ## [5] "Widowed" "Married" ``` ```r gss_cat$marital %>% fct_collapse(Single = c("Never married", "Separated", "Divorced")) %>% * fct_relevel("No answer", after = Inf) %>% # move to last place fct_count() ``` ``` ## # A tibble: 4 x 2 ## f n ## <fct> <int> ## 1 Single 9542 ## 2 Widowed 1807 ## 3 Married 10117 ## 4 No answer 17 ``` --- # Lumping factor levels .font_small[.font_small[Part] 1] * Sometimes you have a lot of levels and you'd prefer to lump some of them together to the "Other" category -- * What criterion do you use to lump levels together? -- * There are four main criterion to lump levels using `fct_lump*` functions: * .monash-blue[`fct_lump_n`]: lump all levels except the `n` most frequent * .monash-blue[`fct_lump_min`]: lump together those less than `min` counts * .monash-blue[`fct_lump_prop`]: lump together those less than proportion of `prop` * .monash-blue[`fct_lump_lowfreq`]: lump up least frequent levels such that the Other level is still the smallest level * `fct_lump` <img src="https://raw.githubusercontent.com/r-lib/lifecycle/master/man/figures/lifecycle-superseded.svg">, it is better to use one of the above functions instead --- # Lumping factor levels .font_small[.font_small[Part] 2] .grid[ .item[ ```r levels(gss_cat$relig) ``` ``` ## [1] "No answer" "Don't know" ## [3] "Inter-nondenominational" "Native american" ## [5] "Christian" "Orthodox-christian" ## [7] "Moslem/islam" "Other eastern" ## [9] "Hinduism" "Buddhism" ## [11] "Other" "None" ## [13] "Jewish" "Catholic" ## [15] "Protestant" "Not applicable" ``` ] .item[ {{content}} ] ] -- ```r fct_lump_n(gss_cat$relig, n = 2) %>% fct_count(sort = TRUE, prop = TRUE) ``` ``` ## # A tibble: 3 x 3 ## f n p ## <fct> <int> <dbl> ## 1 Protestant 10846 0.505 ## 2 Other 5513 0.257 ## 3 Catholic 5124 0.239 ``` ```r fct_lump_lowfreq(gss_cat$relig) %>% fct_count(sort = TRUE, prop = TRUE) ``` ``` ## # A tibble: 2 x 3 ## f n p ## <fct> <int> <dbl> ## 1 Protestant 10846 0.505 ## 2 Other 10637 0.495 ``` --- class: exercise middle hide-slide-number # <i class="fas fa-code"></i> If you installed the `dwexercise` package, <br> run below in your R console ```r learnr::run_tutorial("day2-exercise-02", package = "dwexercise") ``` <br> # <i class="fas fa-link"></i> If the above doesn't work for you, go [here](https://ebsmonash.shinyapps.io/dw-day2-exercise-02). # <i class="fas fa-question"></i> Questions or issues, let us know! <center>
15
:
00
</center> --- class: font_smaller background-color: #e5e5e5 # Session Information .scroll-350[ ```r devtools::session_info() ``` ``` ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R version 4.0.1 (2020-06-06) ## os macOS Catalina 10.15.7 ## system x86_64, darwin17.0 ## ui X11 ## language (EN) ## collate en_AU.UTF-8 ## ctype en_AU.UTF-8 ## tz Australia/Melbourne ## date 2020-12-01 ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date lib source ## anicon 0.1.0 2020-06-21 [1] Github (emitanaka/anicon@0b756df) ## assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.0.0) ## backports 1.2.0 2020-11-02 [1] CRAN (R 4.0.2) ## broom 0.7.2 2020-10-20 [1] CRAN (R 4.0.2) ## callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.2) ## cellranger 1.1.0 2016-07-27 [2] CRAN (R 4.0.0) ## cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.1) ## colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.2) ## countdown 0.3.5 2020-07-20 [1] Github (gadenbuie/countdown@a544fa4) ## crayon 1.3.4 2017-09-16 [2] CRAN (R 4.0.0) ## DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.2) ## dbplyr 2.0.0 2020-11-03 [1] CRAN (R 4.0.2) ## desc 1.2.0 2018-05-01 [2] CRAN (R 4.0.0) ## devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.2) ## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2) ## dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2) ## ellipsis 0.3.1 2020-05-15 [2] CRAN (R 4.0.0) ## evaluate 0.14 2019-05-28 [2] CRAN (R 4.0.0) ## fansi 0.4.1 2020-01-08 [2] CRAN (R 4.0.0) ## farver 2.0.3.9000 2020-07-24 [1] Github (thomasp85/farver@f1bcb56) ## forcats * 0.5.0 2020-03-01 [2] CRAN (R 4.0.0) ## fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2) ## generics 0.1.0 2020-10-31 [2] CRAN (R 4.0.2) ## ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.2) ## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2) ## gtable 0.3.0 2019-03-25 [2] CRAN (R 4.0.0) ## haven 2.3.1 2020-06-01 [2] CRAN (R 4.0.0) ## hms 0.5.3 2020-01-08 [2] CRAN (R 4.0.0) ## htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2) ## httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2) ## icon 0.1.0 2020-06-21 [1] Github (emitanaka/icon@8458546) ## jsonlite 1.7.1 2020-09-07 [1] CRAN (R 4.0.2) ## knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2) ## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.2) ## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0) ## lubridate 1.7.9 2020-06-08 [2] CRAN (R 4.0.1) ## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2) ## memoise 1.1.0 2017-04-21 [2] CRAN (R 4.0.0) ## modelr 0.1.8 2020-05-19 [2] CRAN (R 4.0.0) ## munsell 0.5.0 2018-06-12 [2] CRAN (R 4.0.0) ## pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.1) ## pkgbuild 1.1.0 2020-07-13 [2] CRAN (R 4.0.1) ## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.0.0) ## pkgload 1.1.0 2020-05-29 [2] CRAN (R 4.0.0) ## prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.0.0) ## processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2) ## ps 1.4.0 2020-10-07 [1] CRAN (R 4.0.2) ## purrr * 0.3.4 2020-04-17 [2] CRAN (R 4.0.0) ## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2) ## Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.0) ## readr * 1.4.0 2020-10-05 [2] CRAN (R 4.0.2) ## readxl 1.3.1 2019-03-13 [2] CRAN (R 4.0.0) ## remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2) ## reprex 0.3.0.9001 2020-08-08 [1] Github (tidyverse/reprex@9594ee9) ## rlang 0.4.8 2020-10-08 [1] CRAN (R 4.0.2) ## rmarkdown 2.5 2020-10-21 [1] CRAN (R 4.0.1) ## rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.2) ## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.1) ## rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2) ## scales 1.1.1 2020-05-11 [2] CRAN (R 4.0.0) ## sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 4.0.0) ## stringi 1.5.3 2020-09-09 [2] CRAN (R 4.0.2) ## stringr * 1.4.0 2019-02-10 [2] CRAN (R 4.0.0) ## testthat 3.0.0 2020-10-31 [1] CRAN (R 4.0.2) ## tibble * 3.0.4.9000 2020-11-26 [1] Github (tidyverse/tibble@9eeef4d) ## tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2) ## tidyselect 1.1.0 2020-05-11 [2] CRAN (R 4.0.0) ## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.2) ## usethis 1.6.3 2020-09-17 [1] CRAN (R 4.0.2) ## utf8 1.1.4 2018-05-24 [2] CRAN (R 4.0.0) ## vctrs 0.3.5.9000 2020-11-26 [1] Github (r-lib/vctrs@957baf7) ## whisker 0.4 2019-08-28 [2] CRAN (R 4.0.0) ## withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2) ## xaringan 0.18 2020-10-21 [1] CRAN (R 4.0.2) ## xfun 0.19 2020-10-30 [1] CRAN (R 4.0.2) ## xml2 1.3.2 2020-04-23 [2] CRAN (R 4.0.0) ## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2) ## ## [1] /Users/etan0038/Library/R/4.0/library ## [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library ``` ] These slides are licensed under <br><center><a href="https://creativecommons.org/licenses/by-sa/3.0/au/"><img src="images/cc.svg" style="height:2em;"/><img src="images/by.svg" style="height:2em;"/><img src="images/sa.svg" style="height:2em;"/></a></center>