ETC5512: Wild Caught Data


Australian census

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

ETC5512.Clayton-x@monash.edu

Week 4



Population data

Recall from lecture 2:

Collecting data on the entire population is normally too expensive or infeasible! (If we can, it is called a census.)

We therefore collect data only on a subset of the population.

  • There are exceptions to this and one such example, as mentioned, is the census.
  1. When was the last time that the Australian census was run?
  2. How often is the census conducted in Australia?

  3. Why do we run the census?

  4. What data does the Australian census collect?

Sample survey

Advantages
  • Reduces cost
  • Timely collection of data
Disadvantages
  • Lack of data on sub-population (particularly minorities) or small geographical areas
  • Requires careful construction of sampling design
  • Estimates are subject to sampling error
  • The estimates may not be accurate or reliable
  • Estimating and communicating precision of estimates is difficult

Census

Advantages
  • Data available, even for small geographical areas or subpopulations
  • Statistics are not subject to sampling error
  • Better accuracy and details
Disadvantages
  • Expensive or infeasible
  • Time consuming to collect all data

Australian Bureau of Statistics (ABS)

  • ABS is the independent statistical agency of the Government of Australia.

  • ABS provides key statistics on a wide range of economic, population, environmental and social issues, to assist and encourage informed decision making, research and discussion within governments and the community.


ABS Census Data

  • The first Australian census was held in 1911.

  • Since 1961, the census has been held every 5 years in Australia.

  • The 2016 census was run at a cost of $440 million.

  • The next census will be held in 2026!

  • The ABS is legislated to collect and disseminate census data under the ABS Act 1975 and Census and Statistics Act 1905.

  • Similar legislation is in place in many countries.

Getting the ABS Census Data

https://www.abs.gov.au/census/find-census-data

There are two main types of data that you can download:


Navigating ABS Census data

  • DataPacks are available only for the 2011 and 2016 censuses.
  • There are slight differences in the available profiles between years, e.g. the General Community Profile in 2016 is a replacement for the Basic and Expanded Community Profiles in 2011.
  • Information related to the census is detailed on the ABS website.
  • Note: data corrections are sometimes released at a later date.
Navigating data and working out what it contains often requires you to do some "detective work" 🕵️‍♀️
  • Much like real detective work, just locating the data and understanding the data variables can take a long time; the work is often not glamorous; and far more attention goes to "catching criminals" (the discoveries from statistical analysis).

Today,

  • We'll navigate through the personal income data from the 2016 census together for you to get some "detective" experience
  • You'll learn to manipulate strings and a bit about regular expressions to deal with string data.
  • You'll learn about tidy data.

DataPack directory structure

  • 2016_GCP_ALL_for_Vic_short-header
    • 2016 Census GCP All Geographies for VIC
      • CED
        • VIC
          • 2016Census_G01_VIC_CED.csv
          • 2016Census_G02_VIC_CED.csv
          • ...
      • GCCSA
        • VIC
          • 2016Census_G01_VIC_GCCSA.csv
          • 2016Census_G02_VIC_GCCSA.csv
          • ...
      • LGA
        • VIC
          • 2016Census_G01_VIC_LGA.csv
          • 2016Census_G02_VIC_LGA.csv
          • ...
      • POA
        • VIC
          • 2016Census_G01_VIC_POA.csv
          • 2016Census_G02_VIC_POA.csv
          • ...
      • RA
        • VIC
          • 2016Census_G01_VIC_RA.csv
          • 2016Census_G02_VIC_RA.csv
          • ...
      • SA1
        • VIC
          • 2016Census_G01_VIC_SA1.csv
          • 2016Census_G02_VIC_SA1.csv
          • ...
      • SA2
        • VIC
          • 2016Census_G01_VIC_SA2.csv
          • 2016Census_G02_VIC_SA2.csv
          • ...
      • SA3
        • VIC
          • 2016Census_G01_VIC_SA3.csv
          • 2016Census_G02_VIC_SA3.csv
          • ...
      • SA4
        • VIC
          • 2016Census_G01_VIC_SA4.csv
          • 2016Census_G02_VIC_SA4.csv
          • ...
      • SED
        • VIC
          • 2016Census_G01_VIC_SED.csv
          • 2016Census_G02_VIC_SED.csv
          • ...
      • SOS
        • VIC
          • 2016Census_G01_VIC_SOS.csv
          • 2016Census_G02_VIC_SOS.csv
          • ...
      • SOSR
        • VIC
          • 2016Census_G01_VIC_SOSR.csv
          • 2016Census_G02_VIC_SOSR.csv
          • ...
      • SSC
        • VIC
          • 2016Census_G01_VIC_SSC.csv
          • 2016Census_G02_VIC_SSC.csv
          • ...
      • STE
        • VIC
          • 2016Census_G01_VIC_STE.csv
          • 2016Census_G02_VIC_STE.csv
          • ...
      • SUA
        • VIC
          • 2016Census_G01_VIC_SUA.csv
          • 2016Census_G02_VIC_SUA.csv
          • ...
      • UCL
        • VIC
          • 2016Census_G01_VIC_UCL.csv
          • 2016Census_G02_VIC_UCL.csv
          • ...
    • Metadata
      • 2016_GCP_Sequential_Template.xlsx
      • 2016Census_geog_desc_1st_2nd_3rd_release.xlsx
      • Metadata_2016_GCP_DataPack.xlsx
    • Readme
      • 2016POA_readme.txt
      • AboutDatapacks_readme.txt
      • CreativeCommons_Licensing_readme.txt
      • esri_arcmap_readme.txt
      • Formats_readme.txt
      • mapinfo_readme.txt
      • Summary_of_Changes.txt
  • The data is nested within folders.
  • Preserve the data in the original structure as much as you can! That is, don't modify the data!
  • Where do we get started?
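
One way to get an overview of the nested folders without touching the files is to list them from R. This is a minimal sketch in base R; the function name `list_datapack_csv` and the example path are assumptions for illustration, not part of the DataPack.

```r
# List every CSV file under a DataPack folder without modifying anything.
# `datapack_dir` is a hypothetical path to wherever the DataPack was unzipped.
list_datapack_csv <- function(datapack_dir) {
  list.files(
    datapack_dir,
    pattern = "\\.csv$",  # keep only files ending in .csv
    recursive = TRUE,     # descend into the nested geography folders
    full.names = TRUE     # keep the full path so files can be read later
  )
}

# Example call (the path is an assumption):
# csv_files <- list_datapack_csv("data/2016_GCP_ALL_for_Vic_short-header")
# head(basename(csv_files))
```

Reading the file names this way leaves the original directory structure untouched, in line with the advice above.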


Getting started

  • First, pray hard that there is some description!

  • Without some description or understanding of the variables, it will be near impossible to extract meaningful information from the data.

    • 2016_GCP_ALL_for_Vic_short-header
      • Metadata
        • 2016_GCP_Sequential_Template.xlsx
        • 2016Census_geog_desc_1st_2nd_3rd_release.xlsx
        • Metadata_2016_GCP_DataPack.xlsx
      • Readme
        • 2016POA_readme.txt
        • AboutDatapacks_readme.txt
        • CreativeCommons_Licensing_readme.txt
        • esri_arcmap_readme.txt
        • Formats_readme.txt
        • mapinfo_readme.txt
        • Summary_of_Changes.txt
    • Readme is a good place to start here (phew!)
    "About DataPacks_readme.md - "Read Me" documentation containing helpful information for users about the data and how it is structured (.md)"
    • But there is no `DataPacks_readme.md`??
    • So we go through the other files in the Readme folder.


Meta-data

  • 2016_GCP_ALL_for_Vic_short-header
    • Metadata
      • 2016_GCP_Sequential_Template.xlsx
      • 2016Census_geog_desc_1st_2nd_3rd_release.xlsx
      • Metadata_2016_GCP_DataPack.xlsx
    • Readme

We could also try going through the meta-data.

Metadata_2016_GCP_DataPack.xlsx

Finding Table G17

  • 2016_GCP_ALL_for_Vic_short-header
    • 2016 Census GCP All Geographies for VIC
      • CED
        • VIC
          • ...
          • 2016Census_G17A_VIC_CED.csv
          • 2016Census_G17B_VIC_CED.csv
          • 2016Census_G17C_VIC_CED.csv
          • ...
      • GCCSA
        • VIC
          • ...
          • 2016Census_G17A_VIC_GCCSA.csv
          • 2016Census_G17B_VIC_GCCSA.csv
          • 2016Census_G17C_VIC_GCCSA.csv
          • ...
      • LGA
        • VIC
          • ...
          • 2016Census_G17A_VIC_LGA.csv
          • 2016Census_G17B_VIC_LGA.csv
          • 2016Census_G17C_VIC_LGA.csv
          • ...
      • POA
        • VIC
          • ...
          • 2016Census_G17A_VIC_POA.csv
          • 2016Census_G17B_VIC_POA.csv
          • 2016Census_G17C_VIC_POA.csv
          • ...
      • RA
        • VIC
          • ...
          • 2016Census_G17A_VIC_RA.csv
          • 2016Census_G17B_VIC_RA.csv
          • 2016Census_G17C_VIC_RA.csv
          • ...
      • SA1
        • VIC
          • ...
          • 2016Census_G17A_VIC_SA1.csv
          • 2016Census_G17B_VIC_SA1.csv
          • 2016Census_G17C_VIC_SA1.csv
          • ...
      • SA2
        • VIC
          • ...
          • 2016Census_G17A_VIC_SA2.csv
          • 2016Census_G17B_VIC_SA2.csv
          • 2016Census_G17C_VIC_SA2.csv
          • ...
      • SA3
        • VIC
          • ...
          • 2016Census_G17A_VIC_SA3.csv
          • 2016Census_G17B_VIC_SA3.csv
          • 2016Census_G17C_VIC_SA3.csv
          • ...
      • SA4
        • VIC
          • ...
          • 2016Census_G17A_VIC_SA4.csv
          • 2016Census_G17B_VIC_SA4.csv
          • 2016Census_G17C_VIC_SA4.csv
          • ...
      • SED
        • VIC
          • ...
          • 2016Census_G17A_VIC_SED.csv
          • 2016Census_G17B_VIC_SED.csv
          • 2016Census_G17C_VIC_SED.csv
          • ...
      • SOS
        • VIC
          • ...
          • 2016Census_G17A_VIC_SOS.csv
          • 2016Census_G17B_VIC_SOS.csv
          • 2016Census_G17C_VIC_SOS.csv
          • ...
      • SOSR
        • VIC
          • ...
          • 2016Census_G17A_VIC_SOSR.csv
          • 2016Census_G17B_VIC_SOSR.csv
          • 2016Census_G17C_VIC_SOSR.csv
          • ...
      • SSC
        • VIC
          • ...
          • 2016Census_G17A_VIC_SSC.csv
          • 2016Census_G17B_VIC_SSC.csv
          • 2016Census_G17C_VIC_SSC.csv
          • ...
      • STE
        • VIC
          • ...
          • 2016Census_G17A_VIC_STE.csv
          • 2016Census_G17B_VIC_STE.csv
          • 2016Census_G17C_VIC_STE.csv
          • ...
      • SUA
        • VIC
          • ...
          • 2016Census_G17A_VIC_SUA.csv
          • 2016Census_G17B_VIC_SUA.csv
          • 2016Census_G17C_VIC_SUA.csv
          • ...
      • UCL
        • VIC
          • ...
          • 2016Census_G17A_VIC_UCL.csv
          • 2016Census_G17B_VIC_UCL.csv
          • 2016Census_G17C_VIC_UCL.csv
          • ...
    • Metadata
    • Readme
  • Where is Table G17?
  • Which Table G17?


    Back to metadata

    • Metadata
      • 2016_GCP_Sequential_Template.xlsx
      • 2016Census_geog_desc_1st_2nd_3rd_release.xlsx
      • Metadata_2016_GCP_DataPack.xlsx

    Let's open 2016Census_geog_desc_1st_2nd_3rd_release.xlsx

    ... and there are the region names of each geographical code.


    Let's go with the easy one: STE Victoria.


    Found Table G17?

  • 2016_GCP_ALL_for_Vic_short-header
    • 2016 Census GCP All Geographies for VIC
      • ...
      • STE
        • VIC
          • ...
          • 2016Census_G17A_VIC_STE.csv
          • 2016Census_G17B_VIC_STE.csv
          • 2016Census_G17C_VIC_STE.csv
          • ...
      • ...
    • G17A, G17B, G17C?


    Why is the table organised like this?


    Tables G17A-G17C

    2016Census_G17A_VIC_STE.csv


    2016Census_G17B_VIC_STE.csv


    2016Census_G17C_VIC_STE.csv


    Table G17

    There are a few things to note:

    • There are 201 columns in G17A and G17B and 81 columns in G17C.
    • Perhaps there is an export limitation for data with more than 200 columns, so the table is broken up into separate csv files.
    • This means that you have to join the tables G17A, G17B and G17C into one (you'll do this in the tutorial).
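
The join described above could be sketched as follows in base R. The three data frames are toy stand-ins, not the real tables: the key column name `region_code`, the other column names and all values are made up for illustration, so check the actual csv files for the column they share.

```r
# Toy stand-ins for G17A, G17B and G17C: one row per region, sharing a key column.
# The key name `region_code` and all values are made up for illustration.
g17a <- data.frame(region_code = "2", col_a1 = 10, col_a2 = 20)
g17b <- data.frame(region_code = "2", col_b1 = 30)
g17c <- data.frame(region_code = "2", col_c1 = 40)

# Merge the pieces one after another on the shared key column.
g17 <- Reduce(
  function(x, y) merge(x, y, by = "region_code"),
  list(g17a, g17b, g17c)
)

names(g17)
```

Joining on a key column, rather than binding columns side by side, guards against the rows being in a different order in each file.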


    But what does the data show?


    What is Tidy Data?


    Tidy Data Principles

    1. Each variable must have its own column
    2. Each observation must have its own row
    3. Each value must have its own cell

    So what about the ABS 2016 Census Data?

    • The table header in fact contains information!
    • E.g. F_400_499_15_19_yrs is the number of females aged 15-19 years who earn $400-499 per week (in Victoria).
    • The numbers in the cells are the counts.
    • Is the data tidy?
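
To illustrate how much information sits in a header name, the sketch below splits two header names into sex, income bounds and age group with a regular expression. It assumes the stringr package is installed, and the pattern is an assumption that fits only the two header shapes quoted in these slides; other headers in the real files may need a different pattern.

```r
library(stringr)

# Two header names of the kind found in Table G17.
header <- c("F_400_499_15_19_yrs", "M_3000_more_85ov")

# Capture groups: sex, lower income bound, upper bound (or "more"), age group.
# The pattern is an assumption that fits these two examples only.
parts <- str_match(header, "^([A-Z])_(\\d+)_(\\d+|more)_(.+)$")

parsed <- data.frame(
  header     = header,
  sex        = parts[, 2],
  income_min = as.numeric(parts[, 3]),
  income_max = parts[, 4],  # kept as text because of "more"
  age_group  = parts[, 5]
)

parsed
```

Each captured piece is a separate variable, which suggests that a table with one column per header is not in tidy form.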

    Wickham (2014) Tidy Data. Journal of Statistical Software 59


    Tidying the ABS 2016 Census Data

    • Ideally we want the data to look like:


    • You can include other information, e.g. geography code (useful if combining with other geographical areas) or average age/income.

    • Note that some income brackets don't have upper bounds, e.g. M_3000_more_85ov. In R, -Inf and Inf are used to represent $-\infty$ and $\infty$, respectively.

    • You'll wrangle the data into the tidy form in the tutorial.
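
As a rough illustration of the target shape, the base R sketch below turns a one-row wide table into one row per header and stores a missing upper bound as `Inf`. The counts are made-up placeholders, and the pattern used to read the upper bound is an assumption that fits only the two header names shown; the tutorial may take a different route.

```r
# A one-row table in the wide layout; the counts are made-up placeholders.
wide <- data.frame(F_400_499_15_19_yrs = 120, M_3000_more_85ov = 5)

# One row per header: the column name becomes a value, the cell becomes `count`.
long <- data.frame(
  header = names(wide),
  count  = unlist(wide[1, ], use.names = FALSE)
)

# Read the upper income bound out of each header ("499" or "more").
upper <- sub("^[A-Z]_\\d+_(\\d+|more)_.*$", "\\1", long$header)

# Open-ended brackets have no upper bound, so represent it with Inf.
long$income_max <- as.numeric(replace(upper, upper == "more", "Inf"))

long
```

Storing the open-ended bound as `Inf` keeps `income_max` numeric, so comparisons such as `income_max > 1000` still work for every row.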


    Manipulating strings

    • The stringr package is powered by the stringi package, which in turn uses the ICU C library to provide fast performance for string manipulation.
    library(tidyverse) # includes `stringr`

    Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0.

    Gagolewski M. and others (2020). R package stringi: Character string processing facilities.

    • The main functions in stringr are prefixed with str_ (those in stringi with stri_), and the first argument is a string (or a vector of strings).
    • What do you think str_trim and str_squish do?
    str_trim(c(" Apple ", " Goji    Berry "))
    ## [1] "Apple" "Goji    Berry"
    str_squish(c(" Apple ", " Goji    Berry "))
    ## [1] "Apple" "Goji Berry"

    Base R and stringr


    Why use stringr?

    • stringr is built with a number of considerations in mind to keep syntax and user expectations consistent (for both input and output).

    • For example, let's consider combining multiple strings into one.

      Base R

      paste0("Area", "1", c("A", "B"))
      ## [1] "Area1A" "Area1B"
      paste0("Area", "1", c("A", NA, "C"))
      ## [1] "Area1A" "Area1NA" "Area1C"

      stringr

      str_c("Area", "1", c("A", "B"))
      ## [1] "Area1A" "Area1B"
      str_c("Area", "1", c("A", NA, "C"))
      ## [1] "Area1A" NA "Area1C"
    • If the Base R result is preferable, then NA can be replaced with the string "NA" by applying str_replace_na() to the vector first, e.g. str_replace_na(c("A", NA, "C")).
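
The two behaviours can be compared side by side. This is a minimal sketch assuming the stringr package is installed; the variable name `suffix` is only for illustration.

```r
library(stringr)

suffix <- c("A", NA, "C")

# str_c() propagates the missing value: the second element stays NA.
with_na <- str_c("Area", "1", suffix)

# Converting NA to the string "NA" first mimics what paste0() does.
like_base <- str_c("Area", "1", str_replace_na(suffix))

with_na
like_base
```

Propagating `NA` is usually the safer default, because a missing input then stays visibly missing instead of turning into text that looks like a valid label.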


    Case study Aussie Local Government Area

    LGA <- ozmaps::abs_lga %>% pull(NAME)
    LGA[1:7]
    ## [1] "Broken Hill (C)" "Waroona (S)" "Toowoomba (R)" "West Arthur (S)"
    ## [5] "Moreton Bay (R)" "Etheridge (S)" "Cleve (DC)"
    C    = Cities             A = Areas     RC = Rural Cities
    B    = Boroughs           S = Shires    DC = District Councils
    M    = Municipalities     T = Towns     AC = Aboriginal Councils
    RegC = Regional Councils


    🎯 Extract the LGA status from the LGA names

    How?

    Michael Sumner (2020). ozmaps: Australia Maps. R package version 0.3.6.


    Extracting the string

    str_extract(LGA, "\\(.+\\)")
    ## [1] "(C)" "(S)" "(R)" "(S)" "(R)"
    ## [6] "(S)" "(DC)" "(R)" "(DC)" "(C)"
    ## [11] "(DC)" "(S)" "(S)" "(S)" "(DC)"
    ## [16] "(A)" "(C)" "(A)" "(T)" "(RC)"
    ## [21] "(A)" "(S)" "(S)" "(S)" "(C)"
    ## [26] "(DC)" "(R)" "(A)" "(C)" "(DC)"
    ## [31] "(S)" "(S)" "(A)" "(S)" "(S)"
    ## [36] "(R)" "(M)" "(A)" "(C)" "(S)"
    ## [41] "(S)" "(C)" "(A)" "(S)" "(C)"
    ## [46] "(AC)" "(A)" "(S)" "(A)" "(C)"
    ## [51] "(A)" "(R)" "(S)" "(T)" "(C)"
    ## [56] "(S)" "(S)" "(R)" "(C)" "(T)"
    ## [61] "(C)" "(S)" "(C)" "(C)" "(C)"
    ## [66] "(C)" "(S)" "(DC)" "(DC)" "(S)"
    ## [71] "(R)" "(R)" "(S)" "(B)" "(DC)"
    ## [76] "(M)" "(A)" "(C)" "(S)" "(S)"
    ## [81] "(S)" "(S)" "(S)" "(S)" "(S)"
    ## [86] "(C)" "(A)" "(C)" "(A)" "(S)"
    ## [91] "(C)" "(A)" "(S)" "(S)" "(S)"
    ## [96] "(S)" "(DC)" "(S)" "(S)" "(S)"
    ## [101] "(C)" "(C)" "(DC)" "(S)" "(S)"
    ## [106] "(C)" "(S)" "(DC)" "(C)" "(C)"
    ## [111] "(S)" "(S)" "(S)" "(S)" "(S)"
    ## [116] "(S)" "(A)" "(DC)" "(S)" "(A)"
    ## [121] "(C)" "(A)" "(S)" "(A)" "(DC)"
    ## [126] "(S)" "(C)" "(S)" "(A)" "(S)"
    ## [131] "(M)" "(S)" "(DC)" "(R)" "(C)"
    ## [136] "(C)" "(S)" "(C)" "(S)" "(T)"
    ## [141] "(S)" "(S)" "(DC)" "(S)" "(T)"
    ## [146] "(C)" "(S)" "(M)" "(S)" "(DC)"
    ## [151] "(C)" "(S)" "(M)" "(C)" "(S)"
    ## [156] "(C)" "(C)" "(R)" "(S)" "(C)"
    ## [161] "(C)" "(R)" "(S)" "(C)" "(A)"
    ## [166] "(T)" "(S)" "(RC)" "(C)" "(A)"
    ## [171] "(A)" "(A)" "(S)" "(A)" "(S)"
    ## [176] "(S)" "(T)" "(S)" "(S)" "(S)"
    ## [181] "(A)" "(DC)" "(M)" "(C)" "(S)"
    ## [186] "(A)" "(T)" "(A)" "(C)" "(S)"
    ## [191] "(C)" "(R)" "(C)" "(S)" "(S)"
    ## [196] "(S)" "(S)" "(R)" "(C)" "(DC)"
    ## [201] "(A)" "(DC)" "(R)" "(C)" "(S)"
    ## [206] "(S)" "(C)" "(C)" "(R)" "(S)"
    ## [211] "(S)" "(C)" "(A)" "(S)" "(S)"
    ## [216] "(C)" "(DC)" "(S)" "(M) (Tas.)" "(M) (Tas.)"
    ## [221] "(C) (Vic.)" "(C) (Vic.)" "(S)" "(DC)" "(S)"
    ## [226] "(RC)" "(S)" "(DC)" "(S)" "(S)"
    ## [231] "(R)" "(S)" "(A)" "(C)" "(C)"
    ## [236] "(A)" "(A)" "(RC)" "(S)" "(C)"
    ## [241] "(S)" "(S)" "(S)" "(C)" "(C)"
    ## [246] "(S)" "(C)" "(C)" "(C)" "(A)"
    ## [251] "(C)" "(S)" "(S)" "(S)" "(S)"
    ## [256] "(S)" "(A)" "(A)" "(A)" "(S)"
    ## [261] "(A)" "(A)" "(S)" "(S)" "(C)"
    ## [266] "(A)" "(M)" "(S)" "(S)" "(C)"
    ## [271] "(R)" "(S)" "(R)" "(DC)" "(R)"
    ## [276] "(C)" "(S)" "(S)" "(C)" "(S)"
    ## [281] "(A)" "(R)" "(DC)" "(A)" "(C)"
    ## [286] "(A)" "(S)" "(S)" "(A)" "(C)"
    ## [291] "(C)" "(A)" "(T)" "(S)" "(C)"
    ## [296] "(A)" "(A)" "(S)" "(S)" "(T)"
    ## [301] "(C)" "(A)" "(A)" "(DC)" "(A)"
    ## [306] "(C)" "(M)" "(M)" "(S)" "(A)"
    ## [311] "(A)" "(C)" "(C)" "(S)" "(DC)"
    ## [316] "(S)" "(C)" "(S)" "(S)" "(DC)"
    ## [321] "(RegC)" "(C)" "(S)" "(S)" NA
    ## [326] "(A)" "(S)" "(A)" "(S)" "(A)"
    ## [331] "(S)" "(C)" "(R)" "(C)" "(S)"
    ## [336] "(A)" "(DC)" "(S)" "(A)" "(R)"
    ## [341] "(S)" "(S)" "(RC)" "(T)" "(A)"
    ## [346] "(M)" "(A)" "(S)" "(S)" "(S)"
    ## [351] "(S)" "(A)" "(RC)" "(S)" "(A)"
    ## [356] "(R)" "(S)" "(S)" "(C)" "(S)"
    ## [361] "(DC)" "(M)" "(M)" "(AC)" "(DC)"
    ## [366] "(A)" "(A)" "(S)" "(S)" "(A)"
    ## [371] "(C)" "(S)" "(S)" "(C)" "(R)"
    ## [376] "(S)" "(S)" NA "(A)" "(T)"
    ## [381] "(S)" "(A)" "(C)" "(C)" "(A)"
    ## [386] "(C)" "(DC)" "(C)" "(A)" "(A)"
    ## [391] "(A)" "(S)" "(DC)" "(DC)" "(S)"
    ## [396] "(M)" "(R)" "(DC)" "(C)" "(S)"
    ## [401] "(S)" "(C)" "(C)" "(C)" "(C)"
    ## [406] "(C)" "(S)" "(A)" NA "(S)"
    ## [411] "(C)" "(S)" "(M)" "(C)" "(S)"
    ## [416] "(S)" NA "(C)" "(S)" "(C)"
    ## [421] "(DC)" "(S)" "(C)" "(S)" "(C)"
    ## [426] "(M)" "(A)" "(A)" "(A)" "(S)"
    ## [431] "(C)" "(S)" "(S)" "(S)" "(A)"
    ## [436] "(A)" "(A)" "(S)" "(S)" "(S)"
    ## [441] "(C)" "(S)" "(C)" "(C)" "(C)"
    ## [446] "(C) (NSW)" "(S) (Qld)" "(R) (Qld)" "(DC) (SA)" "(C) (SA)"
    ## [451] "(M) (Tas.)" "(M) (Tas.)" "(C)" "(R)" "(M)"
    ## [456] "(C)" "(R)" "(S)" "(RC)" "(S)"
    ## [461] "(M)" "(C)" "(R)" "(C)" "(DC)"
    ## [466] "(C)" "(C)" "(M)" "(C)" "(S)"
    ## [471] "(C)" "(DC)" "(M)" "(S)" "(C)"
    ## [476] "(C)" "(A)" "(DC)" "(R)" "(C)"
    ## [481] "(C)" "(A)" "(M)" "(C)" "(C)"
    ## [486] "(S)" "(S)" "(S)" "(A)" "(R)"
    ## [491] "(M)" "(A)" "(R)" "(A)" "(A)"
    ## [496] "(R)" "(R)" "(R)" "(S)" "(C)"
    ## [501] "(C)" "(S)" "(A)" "(S)" "(M)"
    ## [506] "(M)" "(S)" "(A)" "(A)" "(S)"
    ## [511] "(A)" "(C)" "(DC)" "(S)" "(S)"
    ## [516] NA "(A)" NA "(R)" "(C)"
    ## [521] "(S)" "(C)" "(S)" "(A)" "(A)"
    ## [526] "(A)" "(A)" "(C)" "(A)" "(A)"
    ## [531] "(A)" "(A)" "(C) (NSW)" "(A)" "(C)"
    ## [536] "(R)" "(S)" "(A)" "(R)" "(C)"
    ## [541] "(A)" "(S)" "(A)" "(A)"
    24/39

    Extracting the string

    str_extract(LGA, "\\(.+\\)")
    ## [1] "(C)" "(S)" "(R)" "(S)" "(R)"
    ## [6] "(S)" "(DC)" "(R)" "(DC)" "(C)"
    ## [11] "(DC)" "(S)" "(S)" "(S)" "(DC)"
    ## [16] "(A)" "(C)" "(A)" "(T)" "(RC)"
    ## [21] "(A)" "(S)" "(S)" "(S)" "(C)"
    ## [26] "(DC)" "(R)" "(A)" "(C)" "(DC)"
    ## [31] "(S)" "(S)" "(A)" "(S)" "(S)"
    ## [36] "(R)" "(M)" "(A)" "(C)" "(S)"
    ## [41] "(S)" "(C)" "(A)" "(S)" "(C)"
    ## [46] "(AC)" "(A)" "(S)" "(A)" "(C)"
    ## [51] "(A)" "(R)" "(S)" "(T)" "(C)"
    ## [56] "(S)" "(S)" "(R)" "(C)" "(T)"
    ## [61] "(C)" "(S)" "(C)" "(C)" "(C)"
    ## [66] "(C)" "(S)" "(DC)" "(DC)" "(S)"
    ## [71] "(R)" "(R)" "(S)" "(B)" "(DC)"
    ## [76] "(M)" "(A)" "(C)" "(S)" "(S)"
    ## [81] "(S)" "(S)" "(S)" "(S)" "(S)"
    ## [86] "(C)" "(A)" "(C)" "(A)" "(S)"
    ## [91] "(C)" "(A)" "(S)" "(S)" "(S)"
    ## [96] "(S)" "(DC)" "(S)" "(S)" "(S)"
    ## [101] "(C)" "(C)" "(DC)" "(S)" "(S)"
    ## [106] "(C)" "(S)" "(DC)" "(C)" "(C)"
    ## [111] "(S)" "(S)" "(S)" "(S)" "(S)"
    ## [116] "(S)" "(A)" "(DC)" "(S)" "(A)"
    ## [121] "(C)" "(A)" "(S)" "(A)" "(DC)"
    ## [126] "(S)" "(C)" "(S)" "(A)" "(S)"
    ## [131] "(M)" "(S)" "(DC)" "(R)" "(C)"
    ## [136] "(C)" "(S)" "(C)" "(S)" "(T)"
    ## [141] "(S)" "(S)" "(DC)" "(S)" "(T)"
    ## [146] "(C)" "(S)" "(M)" "(S)" "(DC)"
    ## [151] "(C)" "(S)" "(M)" "(C)" "(S)"
    ## [156] "(C)" "(C)" "(R)" "(S)" "(C)"
    ## [161] "(C)" "(R)" "(S)" "(C)" "(A)"
    ## [166] "(T)" "(S)" "(RC)" "(C)" "(A)"
    ## [171] "(A)" "(A)" "(S)" "(A)" "(S)"
    ## [176] "(S)" "(T)" "(S)" "(S)" "(S)"
    ## [181] "(A)" "(DC)" "(M)" "(C)" "(S)"
    ## [186] "(A)" "(T)" "(A)" "(C)" "(S)"
    ## [191] "(C)" "(R)" "(C)" "(S)" "(S)"
    ## [196] "(S)" "(S)" "(R)" "(C)" "(DC)"
    ## [201] "(A)" "(DC)" "(R)" "(C)" "(S)"
    ## [206] "(S)" "(C)" "(C)" "(R)" "(S)"
    ## [211] "(S)" "(C)" "(A)" "(S)" "(S)"
    ## [216] "(C)" "(DC)" "(S)" "(M) (Tas.)" "(M) (Tas.)"
    ## [221] "(C) (Vic.)" "(C) (Vic.)" "(S)" "(DC)" "(S)"
    ## [226] "(RC)" "(S)" "(DC)" "(S)" "(S)"
    ## [231] "(R)" "(S)" "(A)" "(C)" "(C)"
    ## [236] "(A)" "(A)" "(RC)" "(S)" "(C)"
    ## [241] "(S)" "(S)" "(S)" "(C)" "(C)"
    ## [246] "(S)" "(C)" "(C)" "(C)" "(A)"
    ## [251] "(C)" "(S)" "(S)" "(S)" "(S)"
    ## [256] "(S)" "(A)" "(A)" "(A)" "(S)"
    ## [261] "(A)" "(A)" "(S)" "(S)" "(C)"
    ## [266] "(A)" "(M)" "(S)" "(S)" "(C)"
    ## [271] "(R)" "(S)" "(R)" "(DC)" "(R)"
    ## [276] "(C)" "(S)" "(S)" "(C)" "(S)"
    ## [281] "(A)" "(R)" "(DC)" "(A)" "(C)"
    ## [286] "(A)" "(S)" "(S)" "(A)" "(C)"
    ## [291] "(C)" "(A)" "(T)" "(S)" "(C)"
    ## [296] "(A)" "(A)" "(S)" "(S)" "(T)"
    ## [301] "(C)" "(A)" "(A)" "(DC)" "(A)"
    ## [306] "(C)" "(M)" "(M)" "(S)" "(A)"
    ## [311] "(A)" "(C)" "(C)" "(S)" "(DC)"
    ## [316] "(S)" "(C)" "(S)" "(S)" "(DC)"
    ## [321] "(RegC)" "(C)" "(S)" "(S)" NA
    ## [326] "(A)" "(S)" "(A)" "(S)" "(A)"
    ## [331] "(S)" "(C)" "(R)" "(C)" "(S)"
    ## [336] "(A)" "(DC)" "(S)" "(A)" "(R)"
    ## [341] "(S)" "(S)" "(RC)" "(T)" "(A)"
    ## [346] "(M)" "(A)" "(S)" "(S)" "(S)"
    ## [351] "(S)" "(A)" "(RC)" "(S)" "(A)"
    ## [356] "(R)" "(S)" "(S)" "(C)" "(S)"
    ## [361] "(DC)" "(M)" "(M)" "(AC)" "(DC)"
    ## [366] "(A)" "(A)" "(S)" "(S)" "(A)"
    ## [371] "(C)" "(S)" "(S)" "(C)" "(R)"
    ## [376] "(S)" "(S)" NA "(A)" "(T)"
    ## [381] "(S)" "(A)" "(C)" "(C)" "(A)"
    ## [386] "(C)" "(DC)" "(C)" "(A)" "(A)"
    ## [391] "(A)" "(S)" "(DC)" "(DC)" "(S)"
    ## [396] "(M)" "(R)" "(DC)" "(C)" "(S)"
    ## [401] "(S)" "(C)" "(C)" "(C)" "(C)"
    ## [406] "(C)" "(S)" "(A)" NA "(S)"
    ## [411] "(C)" "(S)" "(M)" "(C)" "(S)"
    ## [416] "(S)" NA "(C)" "(S)" "(C)"
    ## [421] "(DC)" "(S)" "(C)" "(S)" "(C)"
    ## [426] "(M)" "(A)" "(A)" "(A)" "(S)"
    ## [431] "(C)" "(S)" "(S)" "(S)" "(A)"
    ## [436] "(A)" "(A)" "(S)" "(S)" "(S)"
    ## [441] "(C)" "(S)" "(C)" "(C)" "(C)"
    ## [446] "(C) (NSW)" "(S) (Qld)" "(R) (Qld)" "(DC) (SA)" "(C) (SA)"
    ## [451] "(M) (Tas.)" "(M) (Tas.)" "(C)" "(R)" "(M)"
    ## [456] "(C)" "(R)" "(S)" "(RC)" "(S)"
    ## [461] "(M)" "(C)" "(R)" "(C)" "(DC)"
    ## [466] "(C)" "(C)" "(M)" "(C)" "(S)"
    ## [471] "(C)" "(DC)" "(M)" "(S)" "(C)"
    ## [476] "(C)" "(A)" "(DC)" "(R)" "(C)"
    ## [481] "(C)" "(A)" "(M)" "(C)" "(C)"
    ## [486] "(S)" "(S)" "(S)" "(A)" "(R)"
    ## [491] "(M)" "(A)" "(R)" "(A)" "(A)"
    ## [496] "(R)" "(R)" "(R)" "(S)" "(C)"
    ## [501] "(C)" "(S)" "(A)" "(S)" "(M)"
    ## [506] "(M)" "(S)" "(A)" "(A)" "(S)"
    ## [511] "(A)" "(C)" "(DC)" "(S)" "(S)"
    ## [516] NA "(A)" NA "(R)" "(C)"
    ## [521] "(S)" "(C)" "(S)" "(A)" "(A)"
    ## [526] "(A)" "(A)" "(C)" "(A)" "(A)"
    ## [531] "(A)" "(A)" "(C) (NSW)" "(A)" "(C)"
    ## [536] "(R)" "(S)" "(A)" "(R)" "(C)"
    ## [541] "(A)" "(S)" "(A)" "(A)"
    • What is "\\(.+\\)"???
    • This is a pattern written as a regular expression, or regex for short
    • Note that in R you have to add an extra \ whenever \ appears in the pattern, because \ is also the escape character inside R strings (yes, this means that you can end up with a lot of backslashes... just keep adding \ until it works! Enjoy this xkcd comic.)
    • From R v4.0.0 onwards, you can use a raw string to eliminate all the extra \, e.g. r"(\(.+\))" is the same as "\\(.+\\)"
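The raw-string claim is easy to check directly. The following is a minimal sketch, assuming R 4.0.0 or later and the stringr package; `lga_sample` is a made-up stand-in for the `LGA` vector, not the census data.

```r
library(stringr)

# Two spellings of the same regex: escaped string vs. raw string (R >= 4.0.0)
escaped <- "\\(.+\\)"
raw     <- r"(\(.+\))"

# Both hold exactly the same characters, namely \(.+\)
same_pattern <- identical(escaped, raw)
writeLines(escaped)

# Hypothetical sample names for illustration only
lga_sample <- c("Melbourne (C)", "Unincorporated Vic")
from_escaped <- str_extract(lga_sample, escaped)
from_raw     <- str_extract(lga_sample, raw)
```

`writeLines()` prints the string as the regex engine sees it, which is a handy way to debug a pattern with many backslashes.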
    24/39

    Regular expressions Part 1

    • A regular expression, or regex, is a string of characters that defines a search pattern for text
    • Regular expressions are... hard, but they come up often enough that they're worth learning
    ozanimals <- c("koala", "kangaroo", "kookaburra", "numbat")

    = Basic match

    str_detect(ozanimals, "oo")
    ## [1] FALSE TRUE TRUE FALSE
    str_extract(ozanimals, "oo")
    ## [1] NA "oo" "oo" NA
    str_match(ozanimals, "oo")
    ## [,1]
    ## [1,] NA
    ## [2,] "oo"
    ## [3,] "oo"
    ## [4,] NA
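The three functions above differ mainly in the shape of what they return. The sketch below assumes the stringr package; the pattern `"(.)(oo)"` with two parenthesised groups is an added illustration, not from the slides.

```r
library(stringr)

ozanimals <- c("koala", "kangaroo", "kookaburra", "numbat")

# str_detect(): a logical vector, handy for filtering
has_oo <- ozanimals[str_detect(ozanimals, "oo")]

# str_extract(): the matched text itself (NA when there is no match)
first_oo <- str_extract(ozanimals, "oo")

# str_match(): a matrix; column 1 is the full match and each further
# column holds the text captured by one parenthesised group
groups <- str_match(ozanimals, "(.)(oo)")
```

With a plain pattern such as `"oo"` the matrix from `str_match()` has a single column, so it only becomes more informative than `str_extract()` once the pattern contains groups.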
    25/39

    Regular expressions Part 2

    = Meta-characters

    • "." a wildcard to match any character except a newline
    str_starts(c("color", "colouur", "colour", "red-column"), "col...")
    ## [1] FALSE TRUE TRUE FALSE
    • "(.|.)" a marked subexpression with alternative possibilities separated by |
    str_replace(c("lovelove", "move", "stove", "drove"), "(l|dr|st)o", "ha")
    ## [1] "havelove" "move" "have" "have"
    • "[...]" matches a single character contained in the bracket
    str_replace_all(c("cake", "cookie", "lamington"), "[aeiou]", "_")
    ## [1] "c_k_" "c__k__" "l_m_ngt_n"
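Because meta-characters such as `.`, `(`, `)` and `|` have special meanings, matching them literally requires escaping. This is a minimal sketch, assuming the stringr package; the strings in `x` are made up for illustration.

```r
library(stringr)

# Hypothetical strings for illustration
x <- c("Vic.", "Vict", "(C)")

# An unescaped "." is a wildcard, so "Vic." also matches "Vict"
wildcard <- str_detect(x, "Vic.")

# "\\." matches a literal full stop only
literal_dot <- str_detect(x, "Vic\\.")

# fixed() treats the whole pattern as plain text, so no escaping is needed
literal_fixed <- str_detect(x, fixed("(C)"))
```

This is the reason the bracket pattern on the earlier slide is written `"\\(.+\\)"` rather than `"(.+)"`: unescaped parentheses would form a group instead of matching the bracket characters.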
    26/39

    Regular expressions Part 3

    = Meta-character quantifiers

    • "?" zero or one occurrence of the preceding element
    str_extract(c("color", "colouur", "colour", "red"), "colou?r")
    ## [1] "color" NA "colour" NA
    • "*" zero or more occurrences of the preceding element
    str_extract(c("color", "colouur", "colour", "red"), "colou*r")
    ## [1] "color" "colouur" "colour" NA
    • "+" one or more occurrences of the preceding element
    str_extract(c("color", "colouur", "colour", "red"), "colou+r")
    ## [1] NA "colouur" "colour" NA
    27/39

    Regular expressions Part 4

    • "{n}" preceding element is matched exactly n times
    str_replace(c("banana", "bananana", "bana", "banananana"), "ba(na){2}", "-")
    ## [1] "-" "-na" "bana" "-nana"
    • "{min,}" preceding element is matched min times or more
    str_replace(c("banana", "bananana", "bana", "banananana"), "ba(na){2,}", "-")
    ## [1] "-" "-" "bana" "-"
    • "{min,max}" preceding element is matched at least min times but no more than max times
    str_replace(c("banana", "bananana", "bana", "banananana"), "ba(na){1,2}", "-")
    ## [1] "-" "-na" "-" "-nana"
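The "preceding element" is a single character unless parentheses group several characters together. The sketch below, assuming the stringr package and a made-up string "banaa", contrasts the two cases and also checks that `?` behaves like `{0,1}`.

```r
library(stringr)

x <- c("banana", "banaa")

# Without a group, {2} applies only to the "a" right before it: "ban" + "aa"
single_char <- str_detect(x, "bana{2}")

# With a group, {2} applies to the whole "na": "ba" + "nana"
grouped <- str_detect(x, "ba(na){2}")

# ?, * and + are shorthand for {0,1}, {0,} and {1,}
colours <- c("color", "colouur", "colour", "red")
same_optional <- identical(str_extract(colours, "colou?r"),
                           str_extract(colours, "colou{0,1}r"))
```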
    28/39

    Regular expressions Part 5

    = Character classes

    • [:alpha:] or [A-Za-z] to match alphabetic characters
    • [:alnum:] or [A-Za-z0-9] to match alphanumeric characters
    • [:digit:] or [0-9] or \\d to match a digit
    • [^0-9] to match non-digits
    • [a-c] to match a, b or c
    • [A-Z] to match uppercase letters
    • [a-z] to match lowercase letters
    • [:space:] or [ \t\r\n\v\f] to match whitespace characters
    • and more...
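A few of these classes in action, as a minimal sketch assuming the stringr package; the strings in `x` are made up for illustration. The POSIX classes are written in the double-bracket form (`[[:digit:]]`), which is also the form base R functions expect.

```r
library(stringr)

# Hypothetical strings for illustration
x <- c("ETC5512", "Week 4", "census")

digits_range <- str_extract(x, "[0-9]+")        # first run of digits
digits_short <- str_extract(x, "\\d+")          # same, using \\d
digits_posix <- str_extract(x, "[[:digit:]]+")  # same, using the POSIX class

upper      <- str_extract(x, "[A-Z]+")          # first run of uppercase letters
no_digits  <- str_replace_all(x, "[0-9]", "")   # drop every digit
only_digit <- str_replace_all(x, "[^0-9]", "")  # drop every non-digit
has_space  <- str_detect(x, "[[:space:]]")      # any whitespace character?
```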
    29/39

    View matches with regular expressions

    Matched text is marked with < >.

    str_view(c("banana", "bananana", "bana", "banabanana"), "ba(na){1,2}")
    • <banana>
    • <banana>na
    • <bana>
    • <bana>banana
    str_view_all(c("banana", "bananana", "bana", "banabanana"), "ba(na){1,2}")
    • <banana>
    • <banana>na
    • <bana>
    • <bana><banana>
    • When a function in stringr ends with _all, all matches of the pattern are considered
    • The one without _all only considers the first match
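The same first-match versus all-matches distinction applies to the extracting and replacing functions. This is a minimal sketch, assuming the stringr package, that reuses the last string and the pattern from this slide.

```r
library(stringr)

x <- "banabanana"
pattern <- "ba(na){1,2}"

# First match only: a character vector
first_only <- str_extract(x, pattern)

# Every match: a list with one character vector per input string
every_one <- str_extract_all(x, pattern)

# Replace the first match vs. every match
replaced_first <- str_replace(x, pattern, "-")
replaced_all   <- str_replace_all(x, pattern, "-")

# Number of matches in each string
n_matches <- str_count(x, pattern)
```

Note that the `_all` variant of `str_extract()` returns a list rather than a vector, since each input string can contribute a different number of matches.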
    30/39

    Back to Extracting the string

    str_extract(LGA, "\\(.+\\)")
    ## [1] "(C)" "(S)" "(R)" "(S)" "(R)"
    ## [6] "(S)" "(DC)" "(R)" "(DC)" "(C)"
    ## [11] "(DC)" "(S)" "(S)" "(S)" "(DC)"
    ## [16] "(A)" "(C)" "(A)" "(T)" "(RC)"
    ## [21] "(A)" "(S)" "(S)" "(S)" "(C)"
    ## [26] "(DC)" "(R)" "(A)" "(C)" "(DC)"
    ## [31] "(S)" "(S)" "(A)" "(S)" "(S)"
    ## [36] "(R)" "(M)" "(A)" "(C)" "(S)"
    ## [41] "(S)" "(C)" "(A)" "(S)" "(C)"
    ## [46] "(AC)" "(A)" "(S)" "(A)" "(C)"
    ## [51] "(A)" "(R)" "(S)" "(T)" "(C)"
    ## [56] "(S)" "(S)" "(R)" "(C)" "(T)"
    ## [61] "(C)" "(S)" "(C)" "(C)" "(C)"
    ## [66] "(C)" "(S)" "(DC)" "(DC)" "(S)"
    ## [71] "(R)" "(R)" "(S)" "(B)" "(DC)"
    ## [76] "(M)" "(A)" "(C)" "(S)" "(S)"
    ## [81] "(S)" "(S)" "(S)" "(S)" "(S)"
    ## [86] "(C)" "(A)" "(C)" "(A)" "(S)"
    ## [91] "(C)" "(A)" "(S)" "(S)" "(S)"
    ## [96] "(S)" "(DC)" "(S)" "(S)" "(S)"
    ## [101] "(C)" "(C)" "(DC)" "(S)" "(S)"
    ## [106] "(C)" "(S)" "(DC)" "(C)" "(C)"
    ## [111] "(S)" "(S)" "(S)" "(S)" "(S)"
    ## [116] "(S)" "(A)" "(DC)" "(S)" "(A)"
    ## [121] "(C)" "(A)" "(S)" "(A)" "(DC)"
    ## [126] "(S)" "(C)" "(S)" "(A)" "(S)"
    ## [131] "(M)" "(S)" "(DC)" "(R)" "(C)"
    ## [136] "(C)" "(S)" "(C)" "(S)" "(T)"
    ## [141] "(S)" "(S)" "(DC)" "(S)" "(T)"
    ## [146] "(C)" "(S)" "(M)" "(S)" "(DC)"
    ## [151] "(C)" "(S)" "(M)" "(C)" "(S)"
    ## [156] "(C)" "(C)" "(R)" "(S)" "(C)"
    ## [161] "(C)" "(R)" "(S)" "(C)" "(A)"
    ## [166] "(T)" "(S)" "(RC)" "(C)" "(A)"
    ## [171] "(A)" "(A)" "(S)" "(A)" "(S)"
    ## [176] "(S)" "(T)" "(S)" "(S)" "(S)"
    ## [181] "(A)" "(DC)" "(M)" "(C)" "(S)"
    ## [186] "(A)" "(T)" "(A)" "(C)" "(S)"
    ## [191] "(C)" "(R)" "(C)" "(S)" "(S)"
    ## [196] "(S)" "(S)" "(R)" "(C)" "(DC)"
    ## [201] "(A)" "(DC)" "(R)" "(C)" "(S)"
    ## [206] "(S)" "(C)" "(C)" "(R)" "(S)"
    ## [211] "(S)" "(C)" "(A)" "(S)" "(S)"
    ## [216] "(C)" "(DC)" "(S)" "(M) (Tas.)" "(M) (Tas.)"
    ## [221] "(C) (Vic.)" "(C) (Vic.)" "(S)" "(DC)" "(S)"
    ## [226] "(RC)" "(S)" "(DC)" "(S)" "(S)"
    ## [231] "(R)" "(S)" "(A)" "(C)" "(C)"
    ## [236] "(A)" "(A)" "(RC)" "(S)" "(C)"
    ## [241] "(S)" "(S)" "(S)" "(C)" "(C)"
    ## [246] "(S)" "(C)" "(C)" "(C)" "(A)"
    ## [251] "(C)" "(S)" "(S)" "(S)" "(S)"
    ## [256] "(S)" "(A)" "(A)" "(A)" "(S)"
    ## [261] "(A)" "(A)" "(S)" "(S)" "(C)"
    ## [266] "(A)" "(M)" "(S)" "(S)" "(C)"
    ## [271] "(R)" "(S)" "(R)" "(DC)" "(R)"
    ## [276] "(C)" "(S)" "(S)" "(C)" "(S)"
    ## [281] "(A)" "(R)" "(DC)" "(A)" "(C)"
    ## [286] "(A)" "(S)" "(S)" "(A)" "(C)"
    ## [291] "(C)" "(A)" "(T)" "(S)" "(C)"
    ## [296] "(A)" "(A)" "(S)" "(S)" "(T)"
    ## [301] "(C)" "(A)" "(A)" "(DC)" "(A)"
    ## [306] "(C)" "(M)" "(M)" "(S)" "(A)"
    ## [311] "(A)" "(C)" "(C)" "(S)" "(DC)"
    ## [316] "(S)" "(C)" "(S)" "(S)" "(DC)"
    ## [321] "(RegC)" "(C)" "(S)" "(S)" NA
    ## [326] "(A)" "(S)" "(A)" "(S)" "(A)"
    ## [331] "(S)" "(C)" "(R)" "(C)" "(S)"
    ## [336] "(A)" "(DC)" "(S)" "(A)" "(R)"
    ## [341] "(S)" "(S)" "(RC)" "(T)" "(A)"
    ## [346] "(M)" "(A)" "(S)" "(S)" "(S)"
    ## [351] "(S)" "(A)" "(RC)" "(S)" "(A)"
    ## [356] "(R)" "(S)" "(S)" "(C)" "(S)"
    ## [361] "(DC)" "(M)" "(M)" "(AC)" "(DC)"
    ## [366] "(A)" "(A)" "(S)" "(S)" "(A)"
    ## [371] "(C)" "(S)" "(S)" "(C)" "(R)"
    ## [376] "(S)" "(S)" NA "(A)" "(T)"
    ## [381] "(S)" "(A)" "(C)" "(C)" "(A)"
    ## [386] "(C)" "(DC)" "(C)" "(A)" "(A)"
    ## [391] "(A)" "(S)" "(DC)" "(DC)" "(S)"
    ## [396] "(M)" "(R)" "(DC)" "(C)" "(S)"
    ## [401] "(S)" "(C)" "(C)" "(C)" "(C)"
    ## [406] "(C)" "(S)" "(A)" NA "(S)"
    ## [411] "(C)" "(S)" "(M)" "(C)" "(S)"
    ## [416] "(S)" NA "(C)" "(S)" "(C)"
    ## [421] "(DC)" "(S)" "(C)" "(S)" "(C)"
    ## [426] "(M)" "(A)" "(A)" "(A)" "(S)"
    ## [431] "(C)" "(S)" "(S)" "(S)" "(A)"
    ## [436] "(A)" "(A)" "(S)" "(S)" "(S)"
    ## [441] "(C)" "(S)" "(C)" "(C)" "(C)"
    ## [446] "(C) (NSW)" "(S) (Qld)" "(R) (Qld)" "(DC) (SA)" "(C) (SA)"
    ## [451] "(M) (Tas.)" "(M) (Tas.)" "(C)" "(R)" "(M)"
    ## [456] "(C)" "(R)" "(S)" "(RC)" "(S)"
    ## [461] "(M)" "(C)" "(R)" "(C)" "(DC)"
    ## [466] "(C)" "(C)" "(M)" "(C)" "(S)"
    ## [471] "(C)" "(DC)" "(M)" "(S)" "(C)"
    ## [476] "(C)" "(A)" "(DC)" "(R)" "(C)"
    ## [481] "(C)" "(A)" "(M)" "(C)" "(C)"
    ## [486] "(S)" "(S)" "(S)" "(A)" "(R)"
    ## [491] "(M)" "(A)" "(R)" "(A)" "(A)"
    ## [496] "(R)" "(R)" "(R)" "(S)" "(C)"
    ## [501] "(C)" "(S)" "(A)" "(S)" "(M)"
    ## [506] "(M)" "(S)" "(A)" "(A)" "(S)"
    ## [511] "(A)" "(C)" "(DC)" "(S)" "(S)"
    ## [516] NA "(A)" NA "(R)" "(C)"
    ## [521] "(S)" "(C)" "(S)" "(A)" "(A)"
    ## [526] "(A)" "(A)" "(C)" "(A)" "(A)"
    ## [531] "(A)" "(A)" "(C) (NSW)" "(A)" "(C)"
    ## [536] "(R)" "(S)" "(A)" "(R)" "(C)"
    ## [541] "(A)" "(S)" "(A)" "(A)"
    31/39

    Back to Extracting the string

    str_extract(LGA, "\\(.+\\)") %>%
    table()
    ## .
    ##        (A)       (AC)        (B)        (C)  (C) (NSW)   (C) (SA) (C) (Vic.)
    ##        100          2          1        120          2          1          2
    ##       (DC)  (DC) (SA)        (M) (M) (Tas.)        (R)  (R) (Qld)       (RC)
    ##         40          1         23          4         38          1          7
    ##     (RegC)        (S)  (S) (Qld)        (T)
    ##          1        182          1         12
    Where the same Local Government Area name appears in different States or Territories, the State or Territory abbreviation appears in parenthesis after the name. Local Government Area names are therefore unique.
    -Australian Bureau of Statistics
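
    The longer matches come from greediness: ".+" takes as many characters as it can, so the match runs from the first "(" to the last ")" in each name. A minimal sketch on a few illustrative names (the vector `lga_demo` is made up for this example and is not read from the census file; it assumes the stringr package is installed):

    ```r
    library(stringr)

    # Illustrative LGA-style names; the last two carry a state suffix
    lga_demo <- c("Melbourne (C)", "Campbelltown (C) (NSW)", "Campbelltown (C) (SA)")

    # Greedy: ".+" extends to the LAST closing bracket in each string
    greedy <- str_extract(lga_demo, "\\(.+\\)")
    greedy
    ## [1] "(C)"       "(C) (NSW)" "(C) (SA)"

    # Counting opening brackets flags the names that have a state suffix
    str_count(lga_demo, "\\(")
    ## [1] 1 2 2
    ```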
    31/39

    Retry Extracting the string

    str_extract(LGA, "\\([^)]+\\)") %>%
    table()
    ## .
    ## (A) (AC) (B) (C) (DC) (M) (R) (RC) (RegC) (S) (T)
    ## 100 2 1 125 41 27 39 7 1 183 12
    32/39

    Retry Extracting the string

    str_extract(LGA, "\\([^)]+\\)") %>%
    # remove the brackets
    str_replace_all("[\\(\\)]", "") %>%
    table()
    ## .
    ## A AC B C DC M R RC RegC S T
    ## 100 2 1 125 41 27 39 7 1 183 12
    • "[]" defines a character class: it matches any single character listed inside the brackets
    • We want to match ( and ), but these are meta-characters
    • So we need to escape them to treat them as literals: \( and \)
    • But in an R string the backslash itself must be escaped, so we actually write \\( and \\)
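
    The two layers of escaping can be checked directly, and a capture group is one possible alternative to extracting and then stripping the brackets. This is only a sketch: `lga_demo` is a made-up vector, and the `str_match()` route is a suggestion rather than the approach used in these slides.

    ```r
    library(stringr)

    # The R string "\\(" holds two characters: a backslash and a bracket
    nchar("\\(")
    ## [1] 2
    writeLines("\\(")   # shows what the regex engine receives
    ## \(

    lga_demo <- c("Melbourne (C)", "Campbelltown (C) (NSW)")

    # "[^)]+" matches one or more characters that are NOT ")",
    # so the match stops at the first closing bracket
    with_brackets <- str_extract(lga_demo, "\\([^)]+\\)")
    with_brackets
    ## [1] "(C)" "(C)"

    # str_match() returns a character matrix; column 2 is the first capture group,
    # i.e. the text inside the first pair of brackets
    lga_type <- str_match(lga_demo, "\\(([^)]+)\\)")[, 2]
    lga_type
    ## [1] "C" "C"
    ```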
    32/39

    R v4.0.0 Extracting the string

    str_extract(LGA, r"(\([^)]+\))") %>%
    # remove the brackets
    str_replace_all(r"([\(\)])", "") %>%
    table()
    ## .
    ## A AC B C DC M R RC RegC S T
    ## 100 2 1 125 41 27 39 7 1 183 12
    • If using R v4.0.0 onwards, you can use the raw string version instead
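
    A quick way to convince yourself that the raw string and the escaped string are the same pattern (a sketch that needs R v4.0.0 or later; the variable names are only for illustration):

    ```r
    # r"(...)" takes everything between the delimiters literally,
    # so backslashes do not need to be doubled
    raw_pattern     <- r"(\([^)]+\))"
    escaped_pattern <- "\\([^)]+\\)"

    identical(raw_pattern, escaped_pattern)
    ## [1] TRUE

    # Square or curly brackets also work as raw string delimiters
    identical(r"[\(]", "\\(")
    ## [1] TRUE
    ```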
    33/39

    Back to Census

    34/39

    Raw Data vs. Aggregated Data

    • Although the census collects data from individual households, surveying each person in the household (see sample form here), the downloaded data are aggregated.
    • Aggregated data present summary statistics computed from the raw data. When the only summary statistics are counts, the result is generally called frequency data.
    • The raw data collected would be similar to the form
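
    As a rough illustration of raw versus frequency data, the sketch below uses made-up person-level records and aggregates them to counts. The columns `household`, `sex` and `income` are assumptions for this example, not the fields of the census form.

    ```r
    # Made-up person-level ("raw") records: one row per person
    raw <- data.frame(
      household = c(1, 1, 2, 2, 2, 3),
      sex       = c("F", "M", "F", "F", "M", "M"),
      income    = c("high", "low", "mid", "mid", "low", "high")
    )

    # Aggregating to counts gives frequency data: one row per category
    freq_sex <- as.data.frame(table(sex = raw$sex), responseName = "count")
    freq_sex
    ##   sex count
    ## 1   F     3
    ## 2   M     3
    ```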
    35/39

    What you lose in aggregate data

    • With aggregate data, there is less scope to draw insights conditioned on other variables.
    • E.g. based on frequency data alone, you cannot answer questions like: how many middle-income families have 2 children?
    • Raw data are desirable if you can get hold of it!
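
    To see why counts alone are not enough, here is a sketch with two made-up raw data sets (`raw_a` and `raw_b` are hypothetical) that give the same one-way frequency tables but different answers to the joint question:

    ```r
    # Two made-up raw data sets of four families each
    raw_a <- data.frame(income   = c("mid", "mid", "high", "high"),
                        children = c(2, 2, 0, 0))
    raw_b <- data.frame(income   = c("mid", "mid", "high", "high"),
                        children = c(2, 0, 2, 0))

    # The one-way frequency tables agree ...
    same_income   <- all(table(raw_a$income)   == table(raw_b$income))
    same_children <- all(table(raw_a$children) == table(raw_b$children))
    c(same_income, same_children)
    ## [1] TRUE TRUE

    # ... but "middle income AND 2 children" has different answers
    n_a <- sum(raw_a$income == "mid" & raw_a$children == 2)
    n_b <- sum(raw_b$income == "mid" & raw_b$children == 2)
    c(n_a, n_b)
    ## [1] 2 1
    ```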

    Trust and skepticism

    • By the way, did you notice anything odd about the dummy data presented in the last slide?
    • John Smith was recorded as female and Jane Smith as male. Data may have been incorrectly recorded.
    • How much do you trust the aggregate data?
    • Keep a healthy dose of skepticism about your data.
    36/39

    Data Confidentiality

    • The data are not just aggregated; they are also anonymised.
    • E.g. in 2016_GCP_Sequential_Template.xlsx, Sheet "G 17a", footnote says "Please note that there are small random adjustments made to all cell values to protect the confidentiality of data. These adjustments may cause the sum of rows or columns to differ by small amounts from table totals."

    Do you think you'll get the same numbers if you use data from different geographical codes, e.g. SA1 and STE?

    • You can check this in the tutorial 🔧
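
    The effect described in the footnote can be mimicked with a toy perturbation. Everything here is an assumption for illustration: the counts are made up, and adding -1, 0 or +1 to each cell is a stand-in, not the ABS's actual procedure.

    ```r
    set.seed(5512)  # arbitrary seed so the sketch is reproducible

    # Made-up true counts for four small areas that make up one larger area
    true_small <- c(12, 7, 30, 5)
    true_total <- sum(true_small)

    # Hypothetical perturbation: add -1, 0 or +1 to every published cell,
    # independently of the other cells
    perturb <- function(x) x + sample(c(-1, 0, 1), length(x), replace = TRUE)

    published_small <- perturb(true_small)
    published_total <- perturb(true_total)

    # The sum of the published small-area cells need not equal the published total
    sum(published_small) - published_total
    ```

    Because each cell is adjusted separately, totals built up from a fine geography can drift away from the figure published for the coarse geography.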
    37/39

    Summary

    • We went through how to locate and understand the variables in the personal income data from the 2016 Australian census.
    • We know some limitations of these data.
    • We learnt how to manipulate strings and a little about regular expressions.
    • We learnt what tidy data is.
    38/39

    Creative Commons License
    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

