Manipulating strings

STAT1003 – Statistical Techniques

Dr. Emi Tanaka

Australian National University


What are strings?

Strings are sequences of characters enclosed in quotes
In R, strings can use single ' or double " quotes

What if you want to include quotes inside the string?
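One option, sketched below with made-up strings, is to wrap the string in the other kind of quote, so the inner quote is not read as the end of the string.

```r
# Wrap the string in the *other* kind of quote
s1 <- "It's a string"       # single quote inside double quotes
s2 <- 'She said "hello"'    # double quotes inside single quotes

# writeLines() shows the characters the string actually contains
writeLines(s1)
writeLines(s2)
```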

Special strings

How do you include both single and double quotes in the same string?
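A minimal sketch, using an invented example string: a backslash in front of a quote "escapes" it, so R treats the quote as an ordinary character rather than the end of the string.

```r
# Escape the quote that matches the enclosing quotes with a backslash
s_double <- "It's called \"escaping\""
s_single <- 'It\'s called "escaping"'

# Both spellings produce exactly the same string
identical(s_double, s_single)

# print() shows the escaped form; writeLines() shows the actual characters
print(s_double)
writeLines(s_double)
```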

What if you want to write a backslash?
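Because the backslash is itself the escape character, a literal backslash is written as two backslashes. The path below is a hypothetical example; the last lines show two other common escape sequences, tab and newline.

```r
# A literal backslash is written as \\
path <- "C:\\Users\\me"
print(path)        # escaped form, as you would type it
writeLines(path)   # actual characters: single backslashes
nchar(path)        # each \\ counts as one character

# Other special sequences: \t is a tab, \n is a newline
special <- "a\tb\nc"
writeLines(special)
```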

Or use a raw string (available from R version 4.0.0)
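A short sketch of the raw string syntax, r"(...)", with an invented string: everything between the parentheses is taken literally, so quotes and backslashes need no escaping. This assumes R 4.0.0 or later.

```r
# Raw string: no escaping needed inside r"( ... )"
raw <- r"(C:\Users\me says "hi" and 'bye')"

# The equivalent string written with escapes
escaped <- "C:\\Users\\me says \"hi\" and 'bye'"

identical(raw, escaped)
writeLines(raw)
```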

Manipulating strings

Strings can be manipulated using base R functions, e.g. paste0(), strsplit()
But instead we use the stringr package from the Tidyverse.
The stringr package is built on the stringi package, which in turn uses the ICU library to provide fast string manipulation.
The main functions in stringr are prefixed with str_ (those in stringi with stri_), and their first argument is a string (or a vector of strings)
What do you think str_trim and str_squish do?
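To check your guess, here is a small sketch on a made-up string: str_trim() removes whitespace from the ends only, while str_squish() also collapses repeated whitespace inside the string.

```r
library(stringr)

x <- "  too   much   space  "

trimmed  <- str_trim(x)                   # strips leading and trailing whitespace
left     <- str_trim(x, side = "left")    # strips the left side only
squished <- str_squish(x)                 # strips the ends and collapses inner runs

writeLines(c(trimmed, left, squished))
```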

Why use stringr?

Let’s consider combining multiple strings into one:
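As an illustration with invented names, the sketch below combines strings with base R and with stringr. For simple cases the results are the same; the base functions differ from each other in their default separator.

```r
library(stringr)

first <- c("Ada", "Alan")
last  <- c("Lovelace", "Turing")

# Base R: paste0() has no separator, paste() defaults to sep = " "
base_names <- paste0(first, " ", last)
base_paste <- paste(first, last)

# stringr: one function, separator given explicitly
str_names <- str_c(first, last, sep = " ")

# Collapse a vector into a single string
one_string <- str_c(first, collapse = ", ")
```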

stringr provides a consistent syntax and predictable behaviour

E.g. missing values in the data are handled differently: paste0() turns NA into the text "NA", whereas str_c() keeps the value missing
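A minimal sketch of the difference, using a made-up vector: paste0() silently converts NA into the text "NA", whereas str_c() propagates the missing value. If the text "NA" is what you want, str_replace_na() makes that choice explicit.

```r
library(stringr)

x <- c("a", NA)

base_result    <- paste0(x, "-suffix")   # NA becomes the string "NA"
stringr_result <- str_c(x, "-suffix")    # NA stays missing

# Opt in to treating NA as text
explicit <- str_c(str_replace_na(x), "-suffix")
```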

Basic string operations
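The following sketch shows a few common stringr operations on an invented vector. The choice of functions here is an assumption about what "basic" covers, not a complete list.

```r
library(stringr)

fruit <- c("apple", "Banana", "cherry pie")

lens     <- str_length(fruit)            # number of characters
upper    <- str_to_upper(fruit)          # change case
start    <- str_sub(fruit, 1, 3)         # substring by position
has_an   <- str_detect(fruit, "an")      # does the pattern occur?
replaced <- str_replace(fruit, "e", "E") # replace the first match only
```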

Pattern matching

Suppose we have a variable address that consists of a street number, street name, suburb, state (or territory), and postcode.

You can see the pattern is like:

<street number> <street name>, <suburb> <state> <postcode>

[digits] [letters], [letters] [NSW|VIC|WA|ACT|QLD|SA|NT|TAS] [4 digits]

Regular expressions

A regular expression, or regex, is a string of characters that defines a search pattern.

[:digit:] or [0-9] matches any digit (0-9)
. matches any single character
[:alpha:] or [A-Za-z] matches any alphabetic character (a-z, A-Z)
+ matches 1 or more of the preceding character or group
( and ) are used to create capture groups
| acts as a logical OR
{n} matches exactly n occurrences of the preceding character or group
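The sketch below tries each of these building blocks on a made-up address. str_extract() returns the first match, and str_match() additionally returns the contents of each capture group.

```r
library(stringr)

x <- "12 Smith Street, Acton ACT 2601"

digits   <- str_extract(x, "[0-9]+")        # one or more digits
four     <- str_extract(x, "[0-9]{4}")      # exactly four digits in a row
state    <- str_extract(x, "NSW|VIC|ACT")   # any one of the alternatives
any_char <- str_extract(x, "S.")            # "S" followed by any character
word     <- str_extract(x, "[A-Za-z]+")     # one or more letters

# Capture groups: column 1 is the full match, then one column per group
m <- str_match(x, "([A-Za-z]+) ([0-9]{4})$")
```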

Named capture groups
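A capture group can be given a name with the syntax (?<name>...). The sketch below applies this to two invented addresses. It assumes stringr 1.5.0 or later, where str_match() uses the group names as column names; the group names themselves are chosen for illustration.

```r
library(stringr)

address <- c("12 Smith Street, Acton ACT 2601",
             "5 King Road, Perth WA 6000")

# One named group per component of the address
pattern <- paste0(
  "(?<number>[0-9]+) ",
  "(?<street>[A-Za-z ]+), ",
  "(?<suburb>[A-Za-z ]+) ",
  "(?<state>NSW|VIC|WA|ACT|QLD|SA|NT|TAS) ",
  "(?<postcode>[0-9]{4})"
)

m <- str_match(address, pattern)

# Columns can be selected by group name
suburbs   <- m[, "suburb"]
postcodes <- m[, "postcode"]
```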

But when working with a data frame, it may be better to use separate_wider_regex() from the tidyr package.
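A minimal sketch with a made-up data frame, assuming tidyr 1.3.0 or later. In separate_wider_regex(), each named pattern becomes a new column, and unnamed patterns (here the separating spaces and comma) must match but are dropped from the output.

```r
library(tidyr)

df <- data.frame(address = c("12 Smith Street, Acton ACT 2601",
                             "5 King Road, Perth WA 6000"))

out <- separate_wider_regex(
  df,
  address,
  patterns = c(
    street_number = "[0-9]+", " ",
    street_name   = "[A-Za-z ]+", ", ",
    suburb        = "[A-Za-z ]+", " ",
    state         = "NSW|VIC|WA|ACT|QLD|SA|NT|TAS", " ",
    postcode      = "[0-9]{4}"
  )
)

out
```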

Case study: Australian Local Government Area

The LGA names include the LGA status in brackets:
- C = Cities
- B = Boroughs
- M = Municipalities
- RegC = Regional Councils
- A = Areas
- S = Shires
- T = Towns
- RC = Rural Cities
- DC = District Councils
- AC = Aboriginal Councils

🎯 Extract the LGA status from the data
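One possible approach, sketched on a small illustrative vector rather than the actual data: capture the letters between a pair of brackets at the end of the name. The vector `lga` and the pattern are assumptions; names without a bracketed status give NA.

```r
library(stringr)

# Illustrative names in the "Name (Status)" format
lga <- c("Albury (C)", "Ballina (A)", "Alice Springs (T)", "Unincorporated ACT")

# \\( and \\) match literal brackets; ( ) captures the letters in between
status <- str_match(lga, "\\(([A-Za-z]+)\\)$")[, 2]
status
```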

Checking the pattern

If you find regex difficult, AI can help you refine your pattern.
But be sure to check that the pattern works on your data!
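A sketch of two simple checks, again on illustrative values: list the entries the pattern fails to match, and compare the extracted codes against the known status codes. The trailing space in the last entry is deliberate, to show how a stray character makes the pattern miss.

```r
library(stringr)

lga <- c("Albury (C)", "Ballina (A)", "Unincorporated ACT", "Brisbane (C) ")
pattern <- "\\(([A-Za-z]+)\\)$"

# Check 1: which entries does the pattern NOT match?
unmatched <- lga[!str_detect(lga, pattern)]
unmatched

# Check 2: are all extracted codes among the known status codes?
valid  <- c("C", "B", "M", "RegC", "A", "S", "T", "RC", "DC", "AC")
status <- str_match(lga, pattern)[, 2]
unexpected <- setdiff(status[!is.na(status)], valid)
unexpected
```

Here cleaning the input first, for example with str_squish(), would let the last entry match.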

String interpolation

Recall: paste0(), paste() or stringr::str_c() can combine strings:

The above works, but it can be more convenient to interpolate strings with {}:
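As a sketch with invented variables, the same message is built by concatenation and by interpolation. str_glue() from stringr is used here as one way to interpolate; anything inside {} is evaluated as R code.

```r
library(stringr)

name <- "Sam"
n <- 3

# Concatenation: pieces of text and variables alternate
pasted <- paste0("Hello ", name, ", you have ", n, " new messages.")

# Interpolation: variables are written inside the string
glued <- str_glue("Hello {name}, you have {n} new messages.")

# Expressions work too
next_count <- str_glue("Next time: {n + 1}")
```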
