STAT1003 – Statistical Techniques
Dr. Emi Tanaka
Australian National University
A string is written inside single ' or double " quotes
The string may be manipulated using Base R functions, e.g. paste0(), strsplit()
But instead we use the stringr package from the Tidyverse.
The stringr package is powered by the stringi package, which in turn uses the ICU C library to provide fast string manipulation.
The main functions in stringr are prefixed with str_ (those in stringi with stri_) and the first argument is a string (or a vector of strings)
What do you think str_trim and str_squish do?
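A quick way to check your guess is to try both on a string with extra spaces. This is a minimal sketch; the input string `x` is made up for illustration.

```r
library(stringr)

# Hypothetical input with leading, trailing and repeated inner spaces
x <- "  hello   world  "

str_trim(x)    # removes whitespace from the start and end only
str_squish(x)  # also collapses repeated inner whitespace to a single space
```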
Why stringr?
stringr ensures consistency in syntax and user expectation
Consider an address that is comprised of street number, street name, suburb, state (or territory), and postcode.
<street number> <street name>, <suburb> <state> <postcode>
[digits] [alphabets], [alphabets] [NSW|VIC|WA|ACT|QLD|SA|NT|TAS] [4 digits]
[:digit:] or [0-9] matches any digit (0-9)
. matches any single character
[:alpha:] or [A-Za-z] matches any alphabetic character (a-z, A-Z)
+ matches 1 or more of the preceding character
( and ) are used to create capture groups
| acts as a logical OR
{n} matches exactly n occurrences of the preceding character
But in the context of data, it may be better to use the separate_wider_regex() from tidyr package.
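The pieces above can be combined into one pattern with a capture group per address component. This is a sketch only: the address is invented, and allowing spaces inside the street and suburb names is an assumption that will not suit every real address.

```r
library(stringr)

# Hypothetical address in the form
# <street number> <street name>, <suburb> <state> <postcode>
address <- "108 North Road, Acton ACT 2601"

# One capture group per component; [A-Za-z ]+ allows multi-word names
pattern <- "([0-9]+) ([A-Za-z ]+), ([A-Za-z ]+) (NSW|VIC|WA|ACT|QLD|SA|NT|TAS) ([0-9]{4})"

# str_match() returns a character matrix:
# column 1 is the full match, columns 2 onwards are the capture groups
parts <- str_match(address, pattern)
parts
```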
🎯 Extract the LGA status from the data
Recall: paste0(), paste() or stringr::str_c() can combine strings:
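For example (the strings here are placeholders chosen for illustration):

```r
library(stringr)

paste0("STAT", 1003)          # no separator
paste("STAT", 1003)           # separated by a space by default
str_c("STAT", 1003)           # like paste0()
str_c("STAT", 1003, sep = "-")

# All three are vectorised over their inputs
paste0("x", 1:3)
```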
With stringr::str_glue(), R expressions can be placed inside {}:
stringr cheatsheet

ToothGrowth$dose takes three values: 0.5, 1.0 and 2.0
Why [factor] instead of [character]?
The as.numeric function returns the internal integer values of the factor
You probably want to use:
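One common approach, sketched here with the built-in ToothGrowth data, is to convert to character first. The variable name `dose_f` is just for illustration.

```r
# Treat dose as a factor (ToothGrowth ships with base R)
dose_f <- factor(ToothGrowth$dose)

# as.numeric() on a factor gives the internal integer codes
unique(as.numeric(dose_f))

# Going through character recovers the original values
unique(as.numeric(as.character(dose_f)))

# An equivalent alternative that converts the levels once
unique(as.numeric(levels(dose_f))[dose_f])
```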
NA
forcats package is part of tidyverse
Like the stringr package, the main functions in forcats are prefixed with fct_ or lvls_ and the first argument is a factor (or a character) vector
fct_anon, fct_c, fct_collapse, fct_count, fct_cross, fct_drop, fct_expand, fct_explicit_na, fct_infreq, fct_inorder, fct_inseq, fct_lump, fct_lump_lowfreq, fct_lump_min, fct_lump_n, fct_lump_prop, fct_match, fct_na_level_to_value, fct_na_value_to_level, fct_other, fct_recode, fct_relabel, fct_relevel, fct_reorder, fct_reorder2, fct_rev, fct_shift, fct_shuffle, fct_unify, fct_unique
lvls_expand, lvls_reorder, lvls_revalue, lvls_union
gss_cat is a dataset in the forcats package from the General Social Survey (GSS) that contains a number of categorical variables.
fct_lump* functions:
fct_lump_n: lump all levels except the n most frequent
fct_lump_min: lump together those less than min counts
fct_lump_prop: lump together those less than proportion of prop
fct_lump_lowfreq: lump up least frequent levels such that the Other level is still the smallest level
fct_lump gss_cat dataset
forcats cheatsheet
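To see the lumping in action, here is a small sketch on a made-up factor `x` (not from the slides); the same calls can be applied to a column of gss_cat such as relig.

```r
library(forcats)

# Hypothetical factor: "a" appears 3 times, "b" twice, "c" and "d" once each
x <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep the 2 most frequent levels, lump the rest into "Other"
top2 <- fct_lump_n(x, n = 2)
table(top2)

# Lump levels that appear fewer than 2 times
min2 <- fct_lump_min(x, min = 2)
table(min2)
```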
The class is Date 📅 even though it looks like character 🔢
1st January 1970 is a special reference point
Let’s have a look at the numerical value under the hood of Date objects
Yup, the number under the hood is the number of days after (if positive) or before (if negative) 1st January 1970
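A minimal check of this, using a few dates chosen for illustration:

```r
# Days since 1st January 1970
as.numeric(as.Date("1970-01-01"))  # the reference point itself
as.numeric(as.Date("1970-01-10"))  # 9 days after
as.numeric(as.Date("1969-12-31"))  # 1 day before, so negative

# Going the other way: a number of days to a Date
as.Date(9, origin = "1970-01-01")
```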
And yes, you can use as.Date to convert objects to Date
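For instance, a character string in a non-standard layout can be converted by describing its layout in the format argument. The example strings are made up; numeric formats are used because month names depend on your locale.

```r
# ISO 8601 style strings need no format
d1 <- as.Date("2024-03-15")

# Otherwise describe the layout with conversion specifications
d2 <- as.Date("15/03/2024", format = "%d/%m/%Y")
d3 <- as.Date("03-15-24", format = "%m-%d-%y")

class(d1)
d1 == d2
d1 == d3
```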
You can find some widely used conversion specifications in the documentation at ?strptime, but some depend on your operating system
Below are some common ones:
%b abbreviated month
%B full month
%d day of the month, zero-padded (01, 02, …, 31)
%e day of the month, with a leading space for single digits (1, 2, …, 31)
%y year without century (00-99)
%Y year with century, e.g. 1999
POSIXct
R has two main date-time classes: POSIXct and POSIXlt (avoid using POSIXlt if possible)
POSIX stands for Portable Operating System Interface
ct stands for calendar time
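Like Date, a POSIXct object is a number under the hood, but it counts seconds rather than days. A small sketch, with the time zone set to UTC so the result does not depend on your computer's settings:

```r
# One day after the reference point, in UTC
x <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")

class(x)
as.numeric(x)  # seconds since 1970-01-01 00:00:00 UTC, i.e. 24 * 60 * 60
```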
POSIXlt
POSIXlt seems like it’s the same as POSIXct
Valid time zone names are listed by OlsonNames()
lubridate
To create a Date, you can use ymd and friends. E.g.
You might have guessed it but:
y = year, m = month, and d = day.
The order determines the expected order of its appearance in the string
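A short sketch with made-up strings that all describe the same day:

```r
library(lubridate)

# The function name gives the order of the components in the string
a <- ymd("2024-03-15")
b <- dmy("15/03/2024")
c <- mdy("03.15.2024")

class(a)
a == b
a == c
```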
lubridate
To create a POSIXct, you can use ymd_hms and friends
y = year, m = month, and d = day
h = hour, m = minute, and s = second.
It’s remarkably clever!
The time has to be after date though.
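For example, with made-up date-time strings (lubridate uses UTC unless you supply tz):

```r
library(lubridate)

# Date part first, then the time part
x <- ymd_hms("2024-03-15 13:45:30")
y <- dmy_hm("15/03/2024 13:45")

class(x)
hour(x)
minute(y)
```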
lubridate
Making Date from individual date components:
Making POSIXct from individual components:
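One way to do both is with lubridate's make_date() and make_datetime(). The component values below are made up; in practice they often come from separate columns of a data frame.

```r
library(lubridate)

# Date from year, month and day
d <- make_date(year = 2024, month = 3, day = 15)

# POSIXct from date and time components (UTC unless tz is given)
dt <- make_datetime(year = 2024, month = 3, day = 15,
                    hour = 13, min = 45, sec = 30)

class(d)
class(dt)
```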
lubridate
Duration is a special class in lubridate
Functions to create Duration are:
Period is a special class in lubridate
Functions to create Period are like for Duration but without the prefix “d”:
lubridate cheatsheet
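The difference between the two classes shows up when adding them to a date-time. This is a sketch with a made-up starting time, chosen so that the following day in Sydney includes the end of daylight saving; a Duration is an exact number of seconds, while a Period follows the clock and calendar.

```r
library(lubridate)

one_day_duration <- ddays(1)  # Duration: exactly 24 * 60 * 60 seconds
one_day_period   <- days(1)   # Period: "one calendar day"

as.numeric(one_day_duration)

# Hypothetical starting point, the day before daylight saving ends in Sydney
start <- ymd_hms("2024-04-06 12:00:00", tz = "Australia/Sydney")

start + one_day_duration  # 24 hours of elapsed time later
start + one_day_period    # same clock time on the next calendar day
```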

