Data manipulation

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University


Data Manipulation

Data doesn’t always come in a tidy format.
A cell in the data may contain multiple values or may need to be reformatted.
E.g.
- Strings may contain extra whitespace, inconsistent casing, or other irregularities that need to be addressed.
- Factors may need to be relevelled or relabelled.
- Dates and times may need to be parsed or reformatted.

Manipulating Strings

What are strings?

Strings are sequences of characters enclosed in quotes
In R, strings can use single ' or double " quotes

What if you want to add quotes in the string then?

Special strings

How to specify both single and double quote then?

What if you want to write backslash then?

Or just use the raw string (from R version 4.0.0)
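The quoting rules above can be sketched in base R as follows. The example strings are illustrative assumptions, not taken from the original slides.

```r
# Put double quotes inside single quotes, or vice versa
s1 <- 'She said "hello"'
s2 <- "It's fine"

# Escape a quote with a backslash to use both kinds in one string
s3 <- "She said \"it's fine\""

# A literal backslash must itself be escaped
s4 <- "C:\\Users\\me"

# Raw strings (R >= 4.0.0) need no escaping between r"( and )"
s5 <- r"(C:\Users\me)"

# writeLines() prints the string contents rather than the escaped form
writeLines(c(s1, s2, s3, s4, s5))
```

Here `s4` and `s5` hold the same 11 characters; only the way they are typed differs.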

Manipulating strings

The string may be manipulated using Base R functions, e.g. paste0(), strsplit()
But instead we use the stringr package from the tidyverse.
The stringr package is powered by the stringi package, which in turn uses the ICU C library to provide fast performance for string manipulation.
The main functions in stringr are prefixed with str_ (those in stringi with stri_) and the first argument is a string (or a vector of strings).
What do you think str_trim and str_squish do?
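One way to check your guess is to try both functions on a small vector. The values below are made up for illustration.

```r
library(stringr)

x <- c("  Canberra ", "New   South  Wales", " ACT")

# str_trim() removes whitespace from the start and end only
trimmed <- str_trim(x)

# str_squish() also collapses repeated internal whitespace to a single space
squished <- str_squish(x)

trimmed
squished
```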

Why use `stringr`?

Let’s consider combining multiple strings into one:

stringr ensures consistency in syntax and user expectation

E.g. missing values in the data can have different expected results
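A minimal sketch of the missing value behaviour, comparing paste0() with str_c() on an illustrative vector:

```r
library(stringr)

x <- c("a", NA)

# Base R coerces the missing value to the string "NA"
base_result <- paste0(x, "-suffix")

# stringr propagates the missing value instead
stringr_result <- str_c(x, "-suffix")

base_result
stringr_result
```

Which behaviour is preferable depends on the task, but a missing input staying missing is usually less surprising in data analysis.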

Basic string operations

Pattern matching

Suppose we have a variable address that is comprised of street number, street name, suburb, state (or territory), and postcode.

You can see the pattern is like:

<street number> <street name>, <suburb> <state> <postcode>

[digits] [alphabets], [alphabets] [NSW|VIC|WA|ACT|QLD|SA|NT|TAS] [4 digits]

Regular expressions

A regular expression, or regex, is a string of characters that defines a search pattern.

[:digit:] or [0-9] matches any digit (0-9)
. matches any single character
[:alpha:] or [A-Za-z] matches any alphabetic character (a-z, A-Z)
+ matches 1 or more of the preceding character
( and ) are used to create capture groups
| acts as a logical OR
{n} matches exactly n occurrences of the preceding character
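The building blocks above can be combined into one pattern for the address example. This is a sketch under assumptions: the addresses are hypothetical, and the pattern assumes street and suburb names contain only letters and spaces.

```r
library(stringr)

# Hypothetical addresses following the pattern described earlier
address <- c("12 Smith Street, Acton ACT 2601",
             "7 Kings Road, Newtown NSW 2042",
             "not an address")

# digits, street name, comma, suburb, state, 4-digit postcode
pattern <- "^[0-9]+ [A-Za-z ]+, [A-Za-z ]+ (NSW|VIC|WA|ACT|QLD|SA|NT|TAS) [0-9]{4}$"

# Does each string follow the pattern?
is_address <- str_detect(address, pattern)

# Extract the 4 digits at the end of the string (NA when there is no match)
postcode <- str_extract(address, "[0-9]{4}$")

is_address
postcode
```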

Named capture groups

But in the context of data, it may be better to use separate_wider_regex() from the tidyr package.
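A sketch of both approaches on hypothetical data. The column name `address` and the group names are assumptions chosen for illustration; separate_wider_regex() requires tidyr 1.3.0 or later.

```r
library(stringr)
library(tidyr)

# Hypothetical data frame with one address column
df <- data.frame(address = c("12 Smith Street, Acton ACT 2601",
                             "7 Kings Road, Newtown NSW 2042"))

# Named capture groups use the syntax (?<name>...); str_match() returns a
# matrix with the full match first and one column per group
m <- str_match(df$address,
               "(?<state>NSW|VIC|WA|ACT|QLD|SA|NT|TAS) (?<postcode>[0-9]{4})$")

# separate_wider_regex() takes a vector of patterns: named pieces become
# new columns, unnamed pieces are matched but dropped
out <- separate_wider_regex(
  df, address,
  patterns = c(street = "[0-9]+ [A-Za-z ]+", ", ",
               suburb = "[A-Za-z ]+", " ",
               state = "[A-Z]+", " ",
               postcode = "[0-9]{4}")
)

m
out
```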

Case study Australian Local Government Area

The LGA names include the LGA status in brackets:
- C = Cities
- B = Boroughs
- M = Municipalities
- RegC = Regional Councils
- A = Areas
- S = Shires
- T = Towns
- RC = Rural Cities
- DC = District Councils
- AC = Aboriginal Councils

🎯 Extract the LGA status from the data
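One possible approach, sketched on a few illustrative names rather than the actual dataset. It assumes the status is a run of letters inside brackets at the end of the name.

```r
library(stringr)

# Illustrative LGA names in the "Name (Status)" form described above
lga <- c("Albury (C)", "Ballina (A)", "Unincorporated ACT")

# Capture the letters inside the brackets at the end of the string;
# column 2 of the str_match() result is the first capture group
status <- str_match(lga, "\\(([A-Za-z]+)\\)$")[, 2]

# Remove the bracketed status (and the space before it) to keep the name
name <- str_remove(lga, "\\s*\\([A-Za-z]+\\)$")

# Names where no status was found are worth inspecting by hand
unmatched <- lga[is.na(status)]

status
name
unmatched
```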

Checking the pattern

If regex is difficult for you, AI can help you with refining your pattern.
But be sure to check if the pattern works for your data!

String interpolation

Recall: paste0(), paste() or stringr::str_c() can combine strings:

Above works, but it can be more convenient to interpolate strings with {}:
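A small sketch comparing the two styles, assuming str_glue() from stringr is used for interpolation. The variable names and values are made up.

```r
library(stringr)

name <- "Alex"   # illustrative values
n <- 3

# Combining the pieces by hand
pasted <- paste0(name, " has ", n, " datasets.")

# Interpolation: expressions inside {} are evaluated and inserted
glued <- str_glue("{name} has {n} datasets.")

pasted
glued
```

Both give the same text; the interpolated version keeps the sentence readable as a single string.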

`stringr` cheatsheet

Formatting Factors

Categorical variables in R

In R, categorical variables may be represented as factors.

Some categorical variables look like numerical variables
(e.g. coded variables such as 1 = male, 2 = female).
Others take a fixed set of numerical values
(e.g. ToothGrowth$dose: 0.5, 1.0 and 2.0).

So why encode as [`factor`] instead of [`character`]?

In some cases, characters are converted to factors (or vice-versa) in functions so there may be no difference.
The main idea of a factor is that the variable has a fixed number of known levels.
This can be useful for:
- Data integrity: It can help prevent errors by ensuring that only valid categories are used.
- Memory efficiency: Factors can be more memory efficient than character vectors, especially when there are many repeated values.
- Downstream analysis: A number of downstream analyses in R treat factors differently from characters.

Factors in R

When a variable is encoded as a factor then there is an attribute with the levels

You can easily change the labels of the variables:

Or make it an ordered factor:
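A base R sketch of these three steps on an illustrative vector:

```r
x <- c("low", "high", "medium", "low")

# The levels are stored as an attribute of the factor
f <- factor(x, levels = c("low", "medium", "high"))
attributes(f)
levels(f)

# Relabel the levels; the new labels follow the order of the levels
levels(f) <- c("L", "M", "H")
f

# An ordered factor allows comparisons between levels
o <- factor(x, levels = c("low", "medium", "high"), ordered = TRUE)
below_high <- o < "high"
below_high
```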

Order of the levels in a factor

By default, the levels of a factor are the sorted unique values of the input, unless you specify the levels yourself.

Why would the order of the levels matter?

Some downstream analysis may use it.
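As an illustration, tables follow the level order. The sketch below uses default factor() behaviour, where levels are the sorted unique values, and a made-up vector.

```r
x <- c("medium", "low", "high", "low")

# Default: levels are sorted, which is rarely the meaningful order
f_default <- factor(x)
levels(f_default)

# Explicit levels give the intended order
f_explicit <- factor(x, levels = c("low", "medium", "high"))

# Summaries such as table() list categories in level order
names(table(f_default))
names(table(f_explicit))
```

The same ordering carries over to plot axes, and in regression models with the default treatment contrasts the first level acts as the reference category.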

Numerical factors in R

as.numeric function returns the internal integer values of the factor

You probably want to use:
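A sketch of the difference, using dose values like those in ToothGrowth as an illustrative input:

```r
dose <- factor(c(0.5, 1, 2, 0.5))

# Internal integer codes, not the original values
codes <- as.numeric(dose)

# Convert through character to recover the original numbers
values <- as.numeric(as.character(dose))

# Equivalent alternative that converts each level only once
values2 <- as.numeric(levels(dose))[dose]

codes
values
```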

Defining levels explicitly

If the variable contains values that are not in the levels of the factor, then those values will become missing values.

This can be useful at times, but it’s a good idea to check the values before they are converted to NA.
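A minimal sketch with a made-up vector containing an inconsistently cased value:

```r
x <- c("yes", "no", "Yes", "no")
valid <- c("yes", "no")

# "Yes" is not one of the declared levels, so it silently becomes NA
f <- factor(x, levels = valid)
f

# Check for unexpected values before converting
unexpected <- setdiff(unique(x), valid)
unexpected
```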

Defining levels explicitly

You can have levels that are not observed

This can be useful at times downstream, e.g.
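For instance, a frequency table keeps a zero count for the unobserved level. The survey-style values here are illustrative.

```r
x <- c("agree", "agree", "disagree")
f <- factor(x, levels = c("disagree", "neutral", "agree"))

# table() reports a zero count for the unobserved level "neutral"
counts <- table(f)
counts

# droplevels() removes unobserved levels when they are not wanted
kept <- levels(droplevels(f))
kept
```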

Formatting factors

The forcats package is part of the tidyverse.
Like the stringr package, the main functions in forcats are prefixed with fct_ or lvls_ and the first argument is a factor (or a character) vector.
The available functions are:

fct_anon
fct_c
fct_collapse
fct_count
fct_cross
fct_drop
fct_expand
fct_explicit_na
fct_infreq

fct_inorder
fct_inseq
fct_lump
fct_lump_lowfreq
fct_lump_min
fct_lump_n
fct_lump_prop
fct_match

fct_na_level_to_value
fct_na_value_to_level
fct_other
fct_recode
fct_relabel
fct_relevel
fct_reorder
fct_reorder2
fct_rev

fct_shift
fct_shuffle
fct_unify
fct_unique
lvls_expand
lvls_reorder
lvls_revalue
lvls_union

Collapse levels in a factor

gss_cat is a dataset in forcats package from the General Social Survey (GSS) that contains a number of categorical variables.
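A sketch of fct_collapse() on a toy factor. The level names below are invented for illustration and are not the actual levels in gss_cat.

```r
library(forcats)

# Toy factor with four levels
party <- factor(c("Strong A", "Weak A", "Weak B", "Strong B", "Weak A"))

# Each argument has the form new_level = c(old levels to merge)
collapsed <- fct_collapse(party,
  A = c("Strong A", "Weak A"),
  B = c("Strong B", "Weak B")
)

# Count the observations in each collapsed level
fct_count(collapsed)
```

The same call works on a column of gss_cat once the old level names are replaced with the ones in that data.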

Lumping factor levels

Sometimes you have a lot of levels and you’d prefer to lump some of them together to the “Other” category
What criterion do you use to lump levels together?
There are four main criteria for lumping levels using the fct_lump_* functions:
- fct_lump_n: lump all levels except the n most frequent
- fct_lump_min: lump together levels that appear fewer than min times
- fct_lump_prop: lump together levels that make up no more than a proportion prop of the observations
- fct_lump_lowfreq: lump together the least frequent levels such that the Other level is still the smallest level
The general fct_lump function also exists, but it is better to use one of the above functions instead.
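A sketch of three of these criteria on a toy factor with counts a = 5, b = 3, c = 1 and d = 1. The data are made up so that all three calls lump the same levels.

```r
library(forcats)

f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most frequent levels, lump the rest into "Other"
by_n <- fct_lump_n(f, n = 2)

# Lump levels that appear fewer than 2 times
by_min <- fct_lump_min(f, min = 2)

# Lump levels that make up no more than 20% of the observations
by_prop <- fct_lump_prop(f, prop = 0.2)

fct_count(by_n)
```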

Lumping factor levels in `gss_cat` dataset

`forcats` cheatsheet

Dealing with Dates and Times

Date in R

Dates in R have class Date 📅 even though they look like characters 🔢

It’s actually a numerical value under the hood

Reference point for Date objects

1st January 1970 is a special reference point
Let’s have a look at the numerical value under the hood of Date objects
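A minimal sketch using an arbitrary example date; unclass() strips the Date class to reveal the stored number.

```r
d <- as.Date("2020-12-25")

class(d)                                  # "Date"
n_days <- unclass(d)                      # the stored number
n_days

unclass(as.Date("1970-01-01"))            # the reference point itself
unclass(as.Date("1969-12-31"))            # one day before it

# Convert a number back to a Date, stating the origin explicitly
d_back <- as.Date(18621, origin = "1970-01-01")
d_back
```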

Yup, the number under the hood is the number of days after (if positive) or before (if negative) 1st January 1970
And yes, you can use as.Date to convert objects to Date

Converting string to Date

Dates do not have to be in the format “YYYY/MM/DD” (in fact, there are many formats in the wild).
If a date has a different format, you can describe it with conversion specifications: a “%” symbol followed by a single letter (not quite regex, but similar).

You can find the conversion specifications in the documentation at ?strptime, but some depend on your operating system.
Below are some common ones:

%b abbreviated month
%B full month

%d day of the month (01, 02, …, 31)
%e day of the month (1, 2, …, 31), with a leading space for single digits

%y year without century (00-99)
%Y year with century, e.g. 1999
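A short sketch of as.Date() with a format argument. The date strings are illustrative, and %m (month as a number, also documented in ?strptime) is used in addition to the specifications listed above.

```r
# Numeric formats behave the same in any locale
d1 <- as.Date("25/12/2020", format = "%d/%m/%Y")
d2 <- as.Date("25.12.20", format = "%d.%m.%y")

# Month names (%b, %B) are matched in the language of the current locale,
# so these two calls assume an English locale
d3 <- as.Date("25 Dec 2020", format = "%d %b %Y")
d4 <- as.Date("December 25, 2020", format = "%B %d, %Y")

d1
d2
```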

System locale

“aralık” is December in Turkish

as.Date("Xmas is 25 aralık 2020", format = "Xmas is %d %B %Y")

[1] NA

Let’s temporarily set our system locale to Turkish

Sys.setlocale("LC_TIME", "tr_TR.UTF-8") # temporarily set to the Turkish locale

[1] "tr_TR.UTF-8"

as.Date("Xmas is 25 aralık 2020", format = "Xmas is %d %B %Y")

[1] "2020-12-25"

(And set it back to English again.) Locale names ending in “UTF-8” might only work on Unix and Linux systems.

Sys.setlocale("LC_TIME", "en_AU.UTF-8")

[1] "en_AU.UTF-8"

Date and Time in R: `POSIXct`

R has two main date-time classes: POSIXct and POSIXlt (avoid using POSIXlt if possible)
POSIX stands for Portable Operating System Interface
ct stands for calendar time
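A minimal sketch of a POSIXct value and the number stored underneath it, using an arbitrary example time in UTC:

```r
t <- as.POSIXct("2020-12-25 10:30:00", tz = "UTC")

class(t)                 # "POSIXct" "POSIXt"
n_secs <- as.numeric(t)  # the stored number
n_secs

# One day after the reference point is 24 * 60 * 60 seconds
one_day <- as.numeric(as.POSIXct("1970-01-02 00:00:00", tz = "UTC"))
one_day
```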

1970/01/01 00:00:00 UTC is a special reference point called the Unix epoch and the above number is the number of seconds after the Unix epoch

Date and Time in R: `POSIXlt`

POSIXlt seems like it’s the same as POSIXct

But under the hood, it’s a list of time attributes
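A sketch that looks inside a POSIXlt object, using an arbitrary example time:

```r
lt <- as.POSIXlt("2020-12-25 10:30:00", tz = "UTC")

# The components are stored as named list elements
names(unclass(lt))

lt$year + 1900    # years are counted from 1900
lt$mon + 1        # months are counted from 0
lt$mday           # day of the month
lt$hour
```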

Time zone

You can find the names of the time zones using OlsonNames()
If you want to know which time zone your system is using:
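A base R sketch; the printed result of Sys.timezone() depends on the machine running the code.

```r
# Names of the available time zones
tz_names <- OlsonNames()
head(tz_names)

# Time zone of the current system (may be NA if it cannot be determined)
Sys.timezone()

# The same instant can be displayed in different time zones
t <- as.POSIXct("2020-12-25 10:30:00", tz = "UTC")
format(t, tz = "UTC", usetz = TRUE)
format(t, tz = "Australia/Melbourne", usetz = TRUE)
```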

Date in R with `lubridate`

To convert string to a Date, you can use ymd and friends. E.g.

You might have guessed it but:

y = year, m = month, and d = day.

The order determines the expected order of its appearance in the string
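A few illustrative calls; each one parses a different spelling of the same date.

```r
library(lubridate)

d1 <- ymd("2020-12-25")
d2 <- ymd("2020/12/25")   # the separator is detected automatically
d3 <- dmy("25/12/2020")   # day, month, year
d4 <- mdy("12-25-2020")   # month, day, year
d5 <- ymd(20201225)       # numbers work too

class(d1)
d1
```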

Date and time in R with `lubridate`

To convert string to POSIXct, you can use ymd_hms and friends

y = year, m = month, and d = day
h = hour, m = minute, and s = second.

It’s remarkably clever!

The time has to be after date though.
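A short sketch with made-up date-time strings. The result is in UTC unless a tz argument is supplied.

```r
library(lubridate)

t1 <- ymd_hms("2020-12-25 10:30:00")    # UTC by default
t2 <- dmy_hm("25/12/2020 10:30")        # day first, no seconds
t3 <- ymd_hms("2020-12-25 10:30:00", tz = "Australia/Melbourne")

class(t1)
t1
t3
```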

Conversion to date and time with `lubridate`

Making Date from individual date components:

Making POSIXct from individual components:
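A sketch assuming make_date() and make_datetime() from lubridate are the intended helpers; the component values are illustrative.

```r
library(lubridate)

# Date from year, month and day
d <- make_date(year = 2020, month = 12, day = 25)

# POSIXct from date and time components (UTC unless tz is given)
t <- make_datetime(year = 2020, month = 12, day = 25,
                   hour = 10, min = 30, sec = 0, tz = "UTC")

d
t
```

These are handy when the components are stored in separate columns of a data frame.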

Extracting date or time components with `lubridate`
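A sketch of some lubridate accessor functions applied to an arbitrary example time:

```r
library(lubridate)

t <- ymd_hms("2020-12-25 10:30:45")

year(t)
month(t)
day(t)
hour(t)
minute(t)
second(t)

wday(t)      # day of the week as a number, 1 = Sunday by default
yday(t)      # day of the year
as_date(t)   # drop the time component
```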

Date and time modifiers
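A sketch of some ways to modify a date-time with lubridate. The choice of functions here is an assumption about what this section covers.

```r
library(lubridate)

t <- ymd_hms("2020-12-25 10:30:45")

# Round down or up to a unit of time
month_start <- floor_date(t, unit = "month")
next_hour <- ceiling_date(t, unit = "hour")

# Change selected components and keep the rest
changed <- update(t, year = 2021, hour = 0)

# Accessor functions also work as setters
t2 <- t
month(t2) <- 1

month_start
next_hour
changed
t2
```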

Durations

Duration is a special class in lubridate
Some convenient constructors for Duration are:
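A sketch of the constructors; a Duration is an exact length of time stored in seconds.

```r
library(lubridate)

dseconds(30)
dminutes(5)
dhours(2)
ddays(1)
dweeks(1)

# The number of seconds in one day
day_secs <- as.numeric(ddays(1))
day_secs
```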

Maths with Durations

Daylight saving started on Sun 4th Oct 2020 at 2AM in Melbourne
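A sketch of what this means for Duration arithmetic. The starting time of noon on the day before is chosen for illustration.

```r
library(lubridate)

before <- ymd_hms("2020-10-03 12:00:00", tz = "Australia/Melbourne")

# A duration of one day is exactly 86400 seconds, so the clock time
# shifts by an hour when daylight saving starts in between
after <- before + ddays(1)

before
after
```

Exactly 24 hours pass, but the clock reads 13:00 rather than 12:00 because an hour was skipped overnight.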

Period

Period is a special class in lubridate
Constructors for Period are like for Duration but without the prefix “d”:
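A sketch of the Period constructors and of how adding a Period differs from adding a Duration across the start of daylight saving in Melbourne. The starting time is illustrative.

```r
library(lubridate)

seconds(30)
minutes(5)
hours(2)
days(1)
weeks(1)

before <- ymd_hms("2020-10-03 12:00:00", tz = "Australia/Melbourne")

# A period of one day moves to the same clock time on the next day,
# even though only 23 hours pass when daylight saving starts
after <- before + days(1)

after
```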
