ETC5512: Wild Caught Data


Introduction to data collection methods

Lecturer: Emi Tanaka

Department of Econometrics and Business Statistics

ETC5512.Clayton-x@monash.edu

Week 1


Starts with a question

What questions do you have ... ?

  • ... about a virus?
  • ... about forecasting the weather?
  • ... about stock prices?

  • These are examples where the data have already been collected.
  • In this unit, we are interested in data collection methods.

Planet Monash

How many yellow, green and red alien creatures?
What is the distribution of the height of the alien creatures?
Are yellow creatures more likely to have hair?
Does the hair growth formula work on these creatures?

Now that we have a question ...

Sampling the population

Collecting data on the entire population is normally too expensive or infeasible!

  • We therefore collect data only on a subset of the population.

  • How should we sample the population? There are many sampling schemes.

Simple random sampling
Every unit in the population has the same probability of being drawn.

Stratified random sampling
The population is divided into non-overlapping subpopulations (strata), and units are drawn from each subpopulation.

  • Stratified random sampling requires identifying the different subpopulations, and may involve simple random sampling (SRS) within each subpopulation.
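
The two schemes can be sketched in a few lines of Python. This is only an illustration: the population of 1000 creatures, the colour proportions and the per-stratum sample sizes are made up here, and the helper names `simple_random_sample` and `stratified_random_sample` are hypothetical.

```python
import random

random.seed(5512)  # fixed only so the draw can be reproduced

# Hypothetical population: 1000 creatures with made-up colour proportions.
population = (
    [{"id": i, "colour": "yellow"} for i in range(700)]
    + [{"id": i, "colour": "green"} for i in range(700, 950)]
    + [{"id": i, "colour": "red"} for i in range(950, 1000)]
)

def simple_random_sample(units, n):
    """Draw n units without replacement; every unit is equally likely to be drawn."""
    return random.sample(units, n)

def stratified_random_sample(units, key, n_per_stratum):
    """Split units into non-overlapping strata by `key`, then take an SRS in each."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for label in sorted(strata):
        sample.extend(random.sample(strata[label], n_per_stratum[label]))
    return sample

def colour_counts(sample):
    """Count how many sampled units fall in each colour group."""
    counts = {}
    for unit in sample:
        counts[unit["colour"]] = counts.get(unit["colour"], 0) + 1
    return counts

srs = simple_random_sample(population, 20)
stratified = stratified_random_sample(
    population, "colour", {"yellow": 14, "green": 5, "red": 1}
)

print("SRS:", colour_counts(srs))
print("Stratified:", colour_counts(stratified))
```

The stratified draw fixes how many units come from each colour group in advance, whereas the colour counts in the simple random sample vary from draw to draw.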

Goal of sampling schemes

The goal of a sampling scheme is to get accurate information from the sample in order to answer your question.

  • This involves identifying:
    • the population of interest (e.g. if studying male pattern baldness, your population of interest is the biologically male population),
    • what responses (dependent variables) or covariates (explanatory or independent variables) to capture and how to measure them (e.g. do you collect their age? Which age range they are in? Their hair count? The thickness of the hair?),
    • the sample size (how many units do we need to sample?),
    • any structure that will be in the data (e.g. population structures, repeated cross-sectional data, panel or longitudinal data), and
    • any restrictions (e.g. ethical concerns, limitation on collecting data).

Sampling strategies

  • Sampling strategies combine knowledge about the population with statistical methods.

  • For example,

    • designing the sample so that your estimates are (theoretically) unbiased for the population parameters,
    • sampling so that the data are representative of the subpopulations (e.g. stratified random sampling), or
    • oversampling or undersampling to compensate for class imbalance.

What might go wrong with a simple random sample of 10 creatures from this population?
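
One thing that can go wrong is that a small group is missed entirely. As a rough sketch, suppose (purely as an assumption, since the slides do not give the numbers) that the population has 100 creatures of which 5 are red. The chance that a simple random sample of 10 contains no red creature at all is a hypergeometric probability:

```python
from math import comb

# Hypothetical population sizes (assumed for illustration, not from the slides).
population_size = 100
red_count = 5
sample_size = 10

# P(no red creature in a simple random sample drawn without replacement)
# = C(N - R, n) / C(N, n), the hypergeometric probability of zero reds.
p_no_red = comb(population_size - red_count, sample_size) / comb(
    population_size, sample_size
)
print(f"P(sample contains no red creature) = {p_no_red:.3f}")
```

Under these assumed numbers the probability is about 0.58, so more often than not the sample would say nothing about red creatures.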

Random and non-random selections

  • Units ideally are sampled randomly, but more often than not selections are non-random.

    If I survey every 10th household in a street, is that a random selection?

    What do you think can go wrong if we don't sample randomly?


  • What's wrong with these examples?

    • You want to know the attitude of the creatures towards working from home. You call phone numbers in the order they are listed in the White Pages and stop when you have 20 observations.
    • You want to get the hair count distribution of the Planet Monash population. You sample creatures from the Society of Bald Extraterrestrials.
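
The "every 10th household" question can be explored with a small sketch. The street below is hypothetical: 100 households where, by assumption, every 10th one sits on a corner block. The helper `systematic_sample` is an illustrative name, not a library function.

```python
import random

# Hypothetical street: 100 households, every 10th one sits on a corner block.
households = [{"number": i, "corner": i % 10 == 0} for i in range(1, 101)]

def systematic_sample(units, step, start):
    """Take every `step`-th unit, beginning at index `start` (0-based)."""
    return units[start::step]

# Fixed start at the 10th household: the sample contains only corner blocks.
fixed = systematic_sample(households, step=10, start=9)
print(sum(h["corner"] for h in fixed), "of", len(fixed), "are corner blocks")

# A random start gives every household a 1-in-10 chance of selection,
# but households are still selected together in fixed groups.
random.seed(1)
start = random.randrange(10)
randomised = systematic_sample(households, step=10, start=start)
print(sum(h["corner"] for h in randomised), "of", len(randomised), "are corner blocks")

# For comparison: a simple random sample of the same size.
srs = random.sample(households, 10)
print(sum(h["corner"] for h in srs), "of", len(srs), "are corner blocks")
```

When the sampling interval lines up with a pattern in the population, a systematic sample ends up with either all or none of the corner blocks, which a simple random sample is unlikely to do.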

Reality of data collection ...

  • Making an appropriate sampling design is hard.

    • There may be unknown or hidden structures in the population.
    • You may introduce intentional data structures, e.g.

      • Cross-sectional data,
      • Repeated cross-sectional data (e.g. case-control),
      • Panel or longitudinal data (e.g. cohort studies), and so on.
    • You may have unintended or unknown structures in the data, e.g. confounded variables.

    • It's further complicated by:

      • Non-response,
      • Missing data,
      • Mis-measured data,
      • Sample attrition, and so on. 😱

Observational studies

  • The studies mentioned so far have been observational studies.
  • An observational study aims to draw inferences about a population from a sample where independent variables are not intentionally allocated to units within the sample for the purpose of a study.
  • Data considered in observational studies are observational data.

Examples:

  • Who will win the 2022 Australian federal election?
    • Survey of households
  • Where are the best schools?
    • Government administrative data
  • Who is buying my products?
    • Customer database

Experimental studies

  • A scientific claim generally needs to be validated by an experimental study.
  • In an experimental study, a causal variable of interest (referred to as treatment) is administered to recipients while holding other covariates at controlled settings to observe responses.
  • Data from an experiment are referred to as experimental data.

Examples:

  • Is the vaccine effective against flu?
    • Data on whether each person who was administered the vaccine or a placebo caught the flu afterwards.
  • Which fertilizer brand is most effective for wheat yield?
    • Yield data from a crop field trial with each plot treated with one of three fertilizer brands.

Experimental units

Experimental units are the recipients of an allocated treatment, defined so that no subdivision of a unit can receive a different treatment independently.

  • Prof Android delivers their lecture by reciting word for word from the text in a monotone.
  • Prof Alien delivers their lecture by transmitting the information directly to the students' minds.
  • You want to see if one of the methods is more effective.
  • Students in classes 1, 3, 4, 7 and 10 have Prof Android.
  • Students in classes 2, 5, 6, 8 and 9 have Prof Alien.

What are the experimental units? It's the classes.

Observational units

Observational units are the units on which you measure the response.

Carrying on from the previous example...

  • Students all sit for the same exam.
  • You record the exam mark for each student.

What are the observational units? It's the students.

  • Note: the observational unit is not the observation (the response)!
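
The distinction can be made concrete by writing the teaching example out as records. The class allocation follows the slides; the figure of 30 students per class is the one used later in these slides for the same example, and the record layout is only an illustrative assumption.

```python
# Class allocation as given on the slides; 30 students per class is the
# figure used later in the slides for the same example.
android_classes = [1, 3, 4, 7, 10]
alien_classes = [2, 5, 6, 8, 9]
students_per_class = 30

records = [
    {"class": c, "professor": "Android" if c in android_classes else "Alien", "student": s}
    for c in sorted(android_classes + alien_classes)
    for s in range(1, students_per_class + 1)
]

# Experimental units: the level at which the treatment (professor) varies.
experimental_units = {r["class"] for r in records}
# Observational units: the level at which the response (exam mark) is recorded.
observational_units = {(r["class"], r["student"]) for r in records}

print(len(experimental_units), "experimental units (classes)")
print(len(observational_units), "observational units (students)")
```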

Wheat Yield Trial

  • A selective breeding experiment with 107 wheat varieties (or genotypes) was conducted in South Australia in a field with plots laid out in a rectangular array of 22 rows and 15 columns.
  • The breeders want to find a variety with high yield.
  • The treatments are the 107 wheat varieties.
  • The experimental units are the 330 plots.
  • The observational units are also the 330 plots.

Source: Gilmour et al. (1997) Accounting for natural and extraneous variation in the analysis of field experiments. Journal of Agricultural, Biological, and Environmental Statistics, 2, 269-293.

Replications

  • The varieties VF655, TINCURRIN and WW1477 each have a replication of 6; the remaining 104 varieties each have a replication of 3.
  • Treatment replications are essential in an experiment: without any replication, treatment variation can be neither measured nor distinguished from unit variation.
  • More replications are desirable for accuracy; however, there is always a tension between accuracy and the cost of the experiment.
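
As a quick check of the numbers, the replication counts can be added up and compared with the size of the field. The second part of the sketch rests on an assumption that is not from the slides: independent plot errors with a common standard deviation `sigma`, so that the standard error of a variety mean is sigma divided by the square root of the replication.

```python
from math import sqrt

# Replication counts as stated on the slides.
n_varieties = 107
replication = {"VF655": 6, "TINCURRIN": 6, "WW1477": 6}
n_other = n_varieties - len(replication)  # 104 varieties with 3 replicates each

total_plots = sum(replication.values()) + 3 * n_other
print(total_plots, "plots;", 22 * 15, "positions in the 22 x 15 array")

# Assumption: independent plot errors with a common standard deviation sigma,
# so the standard error of a variety mean is sigma / sqrt(r).
sigma = 1.0  # hypothetical value, in arbitrary yield units
for r in (3, 6):
    print(f"r = {r}: standard error of the mean = {sigma / sqrt(r):.3f}")
```

Under that assumption, doubling the replication from 3 to 6 shrinks the standard error by a factor of the square root of 2, at the cost of twice as many plots for that variety.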

Pseudo-replication

Carrying on from the teaching example...

  • Suppose there were 30 students in each class.
  • The treatments were the two teaching methods, each confounded with a professor.
  • There were two professors and 10 classes.
  • Each professor was randomly assigned to 5 classes, so each professor teaches 150 students.

How many replications does each treatment have? It's 5.

Treating repetition as replication in the analysis is referred to as pseudo-replication. Here, that would mean analysing the 150 students per professor as if they were 150 replications, when there are only 5.
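
A small simulation can show why this matters. Everything in the sketch below is an assumption made for illustration: the baseline mark of 65, the class-to-class and student-to-student standard deviations, and the function name `simulate_professor`. It compares the standard error obtained by treating all 150 students as replicates with the one based on the 5 class means.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(2022)  # fixed only so the simulation can be reproduced

# Hypothetical simulation settings (not from the slides).
n_classes, n_students = 5, 30
class_sd, student_sd = 5.0, 10.0

def simulate_professor():
    """Return a list of classes, each a list of simulated student exam marks."""
    classes = []
    for _ in range(n_classes):
        class_effect = random.gauss(0, class_sd)  # shared by every student in the class
        classes.append(
            [65 + class_effect + random.gauss(0, student_sd) for _ in range(n_students)]
        )
    return classes

marks = simulate_professor()
all_marks = [m for c in marks for m in c]
class_means = [mean(c) for c in marks]

# Pseudo-replication: pretends there are 150 independent replicates.
naive_se = stdev(all_marks) / sqrt(len(all_marks))
# Replication at the experimental-unit level: 5 class means.
class_level_se = stdev(class_means) / sqrt(len(class_means))

print(f"naive SE (n = {len(all_marks)}): {naive_se:.2f}")
print(f"class-level SE (n = {len(class_means)}): {class_level_se:.2f}")
```

When classes genuinely differ from one another, the naive standard error is typically too small, which makes a difference between professors look more certain than the design supports.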

Systematic Design of Experiments

  • The treatments appeared to be randomly ordered in the previous layout.
  • Why don't we arrange the treatments in a systematic order, like on the left?
  • Wouldn't this make the experiment easier to manage?

    Systematic designs are prone to bias and confounding.

Randomisation

  • Treatments should be allocated randomly to experimental units.
  • This avoids:
    • systematic bias - e.g. all flu vaccine A tested in January (summer) and all flu vaccine B tested in July (winter).
    • selection bias - e.g. giving the treatment that you are testing to the sick patients and placebo to those that are healthy.
    • other bias - e.g. the lab technician giving the treatment to the first rat that is taken out of the cage.
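
A random allocation is easy to generate. The sketch below assumes a hypothetical experiment with 12 units and two equally replicated treatments; the unit names and group sizes are made up.

```python
import random

random.seed(20)  # fixed only so the allocation can be reproduced

# Hypothetical experiment: 12 units, two treatments, 6 units each.
units = [f"unit_{i:02d}" for i in range(1, 13)]
treatments = ["vaccine"] * 6 + ["placebo"] * 6

# Shuffle the treatment labels, then pair them with the units in order.
random.shuffle(treatments)
allocation = dict(zip(units, treatments))

for unit, treatment in allocation.items():
    print(unit, "->", treatment)
```

Because the labels are shuffled by the random number generator rather than chosen by a person, the allocation cannot depend on who looks sick, which unit comes out of the cage first, or the time of year.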

Blocking

Blocks are used to group the experimental units into sets of similar units.

  • If done well, blocking can lower the variance of treatment contrasts, which increases power.
  • A non-homogeneous block (i.e. units within the block are not alike) can decrease the power of the experiment.

You can form blocks from:

  • Natural discrete divisions between experimental units.
    E.g. in experiments with people, gender makes an obvious block.
  • Grouping experimental units with similar continuous gradients.
    E.g. if the experiment is spread out in time or space and there are no obvious natural boundaries, then an arbitrary boundary may be chosen to group experimental units that are contiguous in time or space.
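
With blocks in place, randomisation is carried out separately within each block. The following is a minimal sketch of one such design, in which every treatment appears once per block; the four blocks, three plots per block and treatment labels are hypothetical.

```python
import random

random.seed(21)  # fixed only so the allocation can be reproduced

# Hypothetical field: 4 blocks (e.g. contiguous strips), 3 plots per block.
treatments = ["A", "B", "C"]
blocks = {f"block_{b}": [f"plot_{b}{p}" for p in range(1, 4)] for b in range(1, 5)}

allocation = {}
for block, plots in blocks.items():
    # Randomise separately inside each block so every treatment
    # appears exactly once per block.
    order = random.sample(treatments, k=len(treatments))
    for plot, treatment in zip(plots, order):
        allocation[plot] = (block, treatment)

for plot, (block, treatment) in allocation.items():
    print(block, plot, treatment)
```

Treatment comparisons can then be made within blocks, where the units are alike, so differences between blocks do not inflate the variance of those comparisons.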

The Salk Vaccine Field Trial

Source: Freedman, Pisani & Purves (2010) Statistics. 4th edition

  • The first polio epidemic hit the United States in 1916, and over the following decades polio claimed hundreds of thousands of victims, especially children.
  • The National Foundation for Infantile Paralysis (NFIP) was ready to test the vaccine developed by Jonas Salk in the real world.
  • A controlled experiment was proposed to test the effectiveness of the vaccine on grade 1, 2 and 3 children in selected school districts throughout the country where the risk of polio was high.
  • In total, two million children were involved, although not all parents consented to their children being vaccinated.

Design for the NFIP Study

Vaccinate all grade 2 children whose parents would consent, leaving children in grades 1 and 3 as controls.

  • Can grade 2 children whose parents did not consent be included as controls?
  • What are the potential issues with such a design?
  • Polio is a contact disease. Would the incidence of disease be higher in grade 2?

Randomised controlled trial

An alternative vaccine trial randomly assigned the vaccine or a placebo to children whose parents consented.

Vaccine Results

The NFIP Study

Group                                         Participants   Rate
Vaccinated (Grade 2)                               221,998     25
Control (Grades 1 & 3)                             725,173     54
Not vaccinated (Grade 2, no consent)               123,605     44
Incomplete vaccination (Grade 2, incomplete)         9,904     40

Randomised controlled trial

Group                                         Participants   Rate
Vaccinated                                         200,745     28
Placebo                                            201,229     71
Not vaccinated (no consent)                        338,778     46
Incomplete vaccination                               8,484     24

  • The rate is the number of polio cases per 100,000 in each group.
  • The RCT and the NFIP trial sampled from school districts with similar exposure to the polio virus.
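
The reported rates can be turned into a relative reduction for each design. This is a simple arithmetic sketch using only the vaccinated and control (or placebo) rates from the tables above; the function name `relative_reduction` is illustrative, and no statistical test is implied.

```python
# Polio rates per 100,000 as reported in the tables above.
rates = {
    "NFIP study": {"vaccinated": 25, "control": 54},
    "Randomised controlled trial": {"vaccinated": 28, "control": 71},
}

def relative_reduction(vaccinated_rate, control_rate):
    """Proportional reduction in the polio rate relative to the control group."""
    return 1 - vaccinated_rate / control_rate

for study, r in rates.items():
    reduction = relative_reduction(r["vaccinated"], r["control"])
    print(f"{study}: {reduction:.1%} lower rate in the vaccinated group")
```

On these figures the vaccinated rate is roughly 54% lower than the control rate in the NFIP design and roughly 61% lower in the randomised controlled trial, so the two designs give noticeably different pictures of the same vaccine.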

Neither the not vaccinated (no consent) group nor the placebo/control group received the treatment, so why is the rate of polio cases lower in the not vaccinated (no consent) group?

Possible explanations

  • Higher-income parents were more likely to consent to treatment than lower-income parents.
  • Children of higher-income parents are more vulnerable to polio.
  • Many forms of polio are hard to diagnose, and in borderline cases the diagnosis could be influenced by knowing whether the child was vaccinated.

Limitations in (social) experiments

  • Cooperation needed from participants
  • Ethical objections
  • Substitution bias
  • Sample attrition
  • Hawthorne effect

Basically, designing and running experiments is hard.

Pop Quizzes

Observational or experimental data?

The Academic Performance Index is computed for all California schools based on standardised testing of students. The data sets contain information and characteristics for 100 schools.

Observational

Observational or experimental data?

The response is the length of odontoblasts in 60 guinea pigs. Each animal received one of three dose levels of vitamin C by one of two delivery methods by the technician.

Experimental

Observational or experimental data?

Can people really tell the difference between the flavours associated with the different colours of Skittles? You blindfold your friends so they can't see the colour, give them one Skittle at a time, and record their guesses.

Experimental

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
