ETC5521: Exploratory Data Analysis


<br>

---

# .monash-blue[ETC5521: Exploratory Data Analysis]

<br>

<h2 style="font-weight:900!important;">Extending beyond the data, what can and cannot be inferred more generally, given the data collection</h2>

.bottom_abs.width100[

Lecturer: *Emi Tanaka*

<i class="fas fa-envelope"></i>  ETC5521.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 12 - Session 2

<br>

]

---

# Sample size calculation

---

# How many people should you survey?

.flex[
.w-60[
.panelset[
.panel[.panel-name[📊]
<img src="images/week12B/plot-1.png" width="648" style="display: block; margin: auto;" />
]
.panel[.panel-name[data]

```r
set.seed(1)
df <- tibble(id = 1:200) %>% 
  mutate(y = rgamma(n(), shape = 3, rate = 4))
```

]
.panel[.panel-name[R]

```r
set.seed(1)
g <- lineup(null_dist("y", dist = "exp", params = list(rate = 1/mean(df$y))), true = df, n = 20, pos = 15) %>% 
  ggplot(aes(y)) +
  geom_histogram(color = "white") + 
  facet_wrap(~.sample) +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks.length = unit(0, "pt")) 
g
```

]
]
]
.w-40[

* Here we are testing `$H_0: Y \sim \text{Exp}(\lambda)$`.
{{content}}

]]

--
* Suppose we only have one person to assess the lineup.
{{content}}

--
* If there is only a single response, then there are only two scenarios possible:
   * **Scenario 1**: the person detects the data plot
   * **Scenario 2**: the person does *not* detect the data plot
{{content}}

--
* The visual inference p-value under:
   * **Scenario 1** is 0.05
   * **Scenario 2** is 1
{{content}}

--
* Neither scenario yields a `$p$`-value < 0.05!
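
The two scenarios can be checked numerically. The following is a minimal sketch in base R, assuming the visual inference `$p$`-value is the binomial tail probability `$P(X \geq x)$` for `$n$` observers and `$m$` plots. The helper name `visual_pvalue` is hypothetical and not part of any package.

```r
# Visual inference p-value: P(X >= x) where X ~ Binomial(n, 1/m).
# `visual_pvalue` is a hypothetical helper used only for illustration.
visual_pvalue <- function(x, n, m = 20) {
  # lower.tail = FALSE gives P(X > x - 1), which equals P(X >= x)
  pbinom(x - 1, size = n, prob = 1 / m, lower.tail = FALSE)
}

p_detected <- visual_pvalue(x = 1, n = 1)  # Scenario 1: data plot detected
p_missed   <- visual_pvalue(x = 0, n = 1)  # Scenario 2: data plot not detected
c(detected = p_detected, missed = p_missed)
```

With a single observer the smallest attainable `$p$`-value is `$1/m = 0.05$`, which is not strictly below `$\alpha = 0.05$`.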

---

# Power of a binary hypothesis test

.info-box.w-70[
The statistical **power** of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis `$(H_0)$` when a _specific_ alternative hypothesis `$(H_1)$` is true.

]
.flex[
.w-50[
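* Under `$H_0$` the detection probability is `$p = 1/m$`, and since `$m \geq 2$`, `$0 < p \leq 0.5$`. 
* Recall that the visual inference `$p$`-value is `$P(X \geq x) = \sum_{k = x}^n {n\choose k} (1/m)^k(1 - 1/m)^{n-k}$`, where `$x$` out of `$n$` observers detect the data plot.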
* So for `$m = 20$` and `$n = 10$`,

<img src="images/week12B/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />
]
.w-50[
{{content}}
]

]

* So if `$X > 2$`, i.e. at least 3 of the 10 observers detect the data plot, then the `$p$`-value < 0.05. 
{{content}}

--
* Suppose the true detection probability is 0.9, so that `$H_1$` is true.
{{content}}

--
* Under `$p = 0.9$`, 
`$$P(X > 2) = \sum_{k = 3}^{10} {10 \choose k} 0.9^k 0.1^{10 - k} = 0.9999996$$`
{{content}}

--
* Therefore the power of the test is 0.9999996 if `$p = 0.9$`.
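
The rejection threshold and the power can be reproduced with `pbinom()`. This is a sketch under the slide's settings (`$m = 20$`, `$n = 10$`, `$\alpha = 0.05$`, true `$p = 0.9$`); the variable names are chosen here for illustration.

```r
m <- 20        # plots per lineup
n <- 10        # number of observers
alpha <- 0.05  # significance level

# p-value P(X >= x) under H0 for each possible number of detections
x <- 0:n
pval <- pbinom(x - 1, size = n, prob = 1 / m, lower.tail = FALSE)

# smallest number of detections for which H0 is rejected
x_min <- min(x[pval < alpha])

# power: probability of at least x_min detections when p = 0.9
power <- pbinom(x_min - 1, size = n, prob = 0.9, lower.tail = FALSE)
c(x_min = x_min, power = power)
```

Here `x_min` is 3, matching the rule `$X > 2$`, and `power` agrees with 0.9999996 to seven decimal places.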

---

# Power analysis

* Let's suppose `$H_1$` is true and that specifically `$p = 0.9$`.
* Let's fix `$m = 20$` and reject `$H_0$` if `$p$`-value `$< \alpha = 0.05$`. 
<img src="images/week12B/power-analysis-1.png" width="864" style="display: block; margin: auto;" />
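
A curve like the one plotted can be computed by repeating the power calculation over a range of `$n$`. The following is a rough sketch, assuming `$p = 0.9$`, `$m = 20$` and `$\alpha = 0.05$` as above; `lineup_power` is a hypothetical helper, not a function from nullabor, and the range `2:20` is an arbitrary choice.

```r
# Power of the lineup test with n observers, assuming true detection
# probability p, m plots per lineup and significance level alpha.
# `lineup_power` is a hypothetical helper used only for illustration.
lineup_power <- function(n, p = 0.9, m = 20, alpha = 0.05) {
  x <- 0:n
  pval <- pbinom(x - 1, size = n, prob = 1 / m, lower.tail = FALSE)
  reject <- x[pval < alpha]
  if (length(reject) == 0) return(0)  # H0 can never be rejected
  pbinom(min(reject) - 1, size = n, prob = p, lower.tail = FALSE)
}

power_curve <- data.frame(n = 2:20)
power_curve$power <- sapply(power_curve$n, lineup_power)
power_curve
```

Because `$X$` is discrete, the rejection threshold jumps as `$n$` grows, so the power need not increase smoothly with `$n$`.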

---

# Estimating the detection probability `$p$`

.w-60[
* For a fixed power `$(1-\beta)$`, the minimum sample size `$n$` we need depends on the detection probability `$p$`
{{content}}
]
--

* Generally, if `$p$` is larger, a smaller `$n$` is sufficient to achieve the same or greater power.
{{content}}
--

* But we don't know what the true `$p$` is! (If we did, we would not need to test for it.)
{{content}}
--

* Either you will need to make a guess from past experience or run a pilot test.
{{content}}
--

* If, in the pilot test, `$x_p$` out of `$n_p$` participants detect the data plot, then an estimate of the detection probability is `$\hat{p} = x_p / n_p$`.
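
As a small illustration, the pilot estimate and its uncertainty can be computed in base R. The pilot counts below are made-up numbers, not results from the lecture, and the exact binomial interval from `binom.test()` is one of several reasonable choices.

```r
# Hypothetical pilot results (illustrative numbers only)
n_p <- 20  # pilot participants
x_p <- 6   # participants who detected the data plot

p_hat <- x_p / n_p                   # point estimate of the detection probability
ci <- binom.test(x_p, n_p)$conf.int  # exact (Clopper-Pearson) 95% interval
p_hat
ci
```

A small pilot gives a wide interval, so one cautious option is to plan the sample size with a value of `$p$` towards the lower end of the interval, since a smaller `$p$` calls for a larger `$n$`.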

---

# Sample size calculation

.flex[
.w-45[
* The sample size calculation depends on:
   * the selected false positive rate `$(\alpha)$`
   * the detection probability `$p$`
   * the number of plots in the lineup `$m$`
   * the minimum power `$(1 - \beta)$` desired
   * the expected proportion of usable samples `$d$` (i.e. one minus the dropout rate, where dropouts are samples that cannot be used due to incomplete results or other quality issues)
   
{{content}}
]
.w-55[
.f4[

```r
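library(tidyverse)

alpha <- 0.05  # false positive rate
p <- 0.1       # assumed detection probability under H1
m <- 20        # number of plots in the lineup
d <- 0.95      # expected proportion of usable samples
power_df <- tibble(n = 2:200) %>% 
  mutate(power = map_dbl(n, function(n) {
      x <- 1:n
      # p-value P(X >= x) under H0 for each possible count x
      pval <- map_dbl(x, ~1 - pbinom(.x - 1, n, 1/m))
      # smallest count that rejects H0
      xmin <- x[which.max(pval < alpha)]
      # power: P(X >= xmin) when the detection probability is p
      1 - pbinom(xmin - 1, n, p)
    }))

# smallest n with power above 80%, inflated for unusable samples
power_df %>% 
  filter(power > 0.8) %>% 
  pull(n) %>% 
  min() %>% 
  magrittr::divide_by(d) %>% 
  ceiling()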

```
## [1] 178
```

]
]]

* For example, if `$\alpha = 0.05$`, `$p = 0.1$`, `$m = 20$`, `$d=0.95$` and at least `$80\%$` power is desired, then at least `$178$` samples are required. 
   
---

# Simulating from the null distribution

---

# Recap: Simulating data from parametric models

* Recall that in lecture 8 we studied how to simulate data from parametric models.

```r
set.seed(1)
df1 <- tibble(id = 1:200) %>% 
  mutate(x = runif(n(), 0, 5),
         y = 2 * x + 1 + rnorm(n()))

ggplot(df1, aes(x, y)) + geom_point()
```

<img src="images/week12B/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" />
]

* We also briefly discussed how to simulate data from the null distribution in lecture 11.

---

# .orange[Case study] .circle.bg-orange.white[1] Testing for normality

.flex[
.w-60[
.panelset[
.panel[.panel-name[📊]
<img src="images/week12B/plot2-1.png" width="648" style="display: block; margin: auto;" />
]
.panel[.panel-name[data]

```r
set.seed(1)
df <- tibble(id = 1:200) %>% 
  mutate(y = runif(n(), -4, 4))
```

]
.panel[.panel-name[R]

```r
set.seed(1)
ldf <- lineup(null_dist("y", dist = "norm", params = list(mean = mean(df$y), sd = sd(df$y))), 
                      true = df, n = 20, pos = 4)
ggplot(ldf, aes(y)) +
  geom_histogram(color = "white") + 
  facet_wrap(~.sample) +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks.length = unit(0, "pt")) 
```

]
]
]
.w-40[

* We are testing `$H_0: Y \sim N(\mu, \sigma^2)$`.
{{content}}

]]

* `$\mu$` is estimated by the sample mean, `$\hat{\mu} = \bar{Y}$`.
* `$\sigma$` is estimated by the sample standard deviation, `$\hat{\sigma} = sd(Y)$`.
{{content}}
--

* The null data here are simply simulated from `$N(\hat{\mu}, \hat{\sigma}^2)$`.
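
The null generation can also be written out by hand. This is a base R sketch of the idea, assuming 19 null data sets are wanted for a lineup of 20 plots; it illustrates the simulation step and is not a description of how nullabor implements `null_dist()`.

```r
set.seed(1)
y <- runif(200, -4, 4)  # observed data, generated as in the "data" panel

mu_hat <- mean(y)       # estimate of mu
sigma_hat <- sd(y)      # estimate of sigma

# 19 null data sets, each the same size as the observed data
null_sets <- replicate(19, rnorm(length(y), mean = mu_hat, sd = sigma_hat),
                       simplify = FALSE)
length(null_sets)
```

Note that `rnorm()` takes the standard deviation, not the variance, as its `sd` argument.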

---

# .orange[Case study] .circle.bg-orange.white[2] Testing for a distribution

.flex[
.w-60[
<img src="images/week12B/qqplot-1.png" width="648" style="display: block; margin: auto;" />
]
.w-40[
* It is easier to compare distributions using a Q-Q plot.
{{content}}
]]

* Plot 4 is indeed the data plot.
* In fact the data is generated from a uniform distribution.
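
A single normal Q-Q plot for such data can be sketched in base R as follows. The data are regenerated here under the same uniform model as case study 1, and base graphics are used only to keep the example self-contained.

```r
set.seed(1)
y <- runif(200, -4, 4)  # uniform data, as in case study 1

# theoretical normal quantiles (x) against ordered sample values (y)
qq <- qqnorm(y, plot.it = FALSE)

plot(qq$x, qq$y, xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(y)  # reference line through the first and third quartiles
```

Uniform data have lighter tails than the normal distribution, so the points bend away from the reference line at both ends.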

---

# .orange[Case study] .circle.bg-orange.white[3] Checking if there is a pattern in residual plot

.flex[
.w-50[
<img src="images/week12B/resplot-1.png" width="576" style="display: block; margin: auto;" />
]
.w-50[

* In the left lineup, we are testing whether the data plot shows any pattern.
* When the null distribution is not precisely specified, for example when searching for a pattern in a residual plot, you need to choose a null generation method that mimics an appropriate distribution under the null.

]]
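
One possible null generation method for residual plots is a parametric bootstrap: simulate a new response from the fitted model, refit, and keep the residuals. The sketch below assumes a simple linear model with normal errors and reuses the simulated data from the recap slide; `null_residuals` is a hypothetical helper, and other generation methods may be more appropriate for other models.

```r
set.seed(1)
x <- runif(200, 0, 5)
y <- 2 * x + 1 + rnorm(200)  # simulated data, as in the recap slide
fit <- lm(y ~ x)

# One set of null residuals: simulate a response from the fitted model,
# refit the same model, and return its residuals.
# `null_residuals` is a hypothetical helper used only for illustration.
null_residuals <- function() {
  y_null <- fitted(fit) + rnorm(length(y), sd = sigma(fit))
  resid(lm(y_null ~ x))
}

null_res <- replicate(19, null_residuals())  # one column per null plot
dim(null_res)
```

Each column can be plotted against `x` alongside `resid(fit)` to form a lineup in which the null plots show no pattern by construction.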

---

# Selecting an appropriate null generation method

.flex[
.w-50[
<img src="images/week12B/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />

]
.w-50[
<img src="images/week12B/resplot2-1.png" width="576" style="display: block; margin: auto;" />
]]

---

# Mis-specifying the null distribution

.flex[
.w-50[
<img src="images/week12B/unnamed-chunk-7-1.png" width="576" style="display: block; margin: auto;" />
]
.w-50[

* If the null distribution is mis-specified, the data plot can stand out for the wrong reason, which inflates the detection probability.
* This, however, can result in an incorrect conclusion.

]]
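
The effect can be illustrated with a toy simulation. This is a sketch with made-up settings: the data really are exponential, so `$H_0$` holds, but one set of nulls uses a rate estimated from the data while the other fixes the rate at 1. The sample mean is used as a crude stand-in for what an observer would notice.

```r
set.seed(1)
y <- rexp(200, rate = 2)  # hypothetical data that really are exponential

# Correctly specified null: rate estimated from the data
null_ok  <- replicate(19, mean(rexp(200, rate = 1 / mean(y))))
# Mis-specified null: rate fixed at 1, ignoring the data
null_bad <- replicate(19, mean(rexp(200, rate = 1)))

mean(y)          # data mean
range(null_ok)   # null means centred near the data mean
range(null_bad)  # null means far from the data mean
```

Against the mis-specified nulls the data plot is easy to pick out because of its scale, even though the data are exponential, so a rejection of `$H_0$` would be misleading.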

---

.f2[While today's focus was on data collection from visual inference surveys, concepts such as data quality checks and a sufficient sample size to draw inference are applicable to other forms of data collection.]

.f2[There's always more to learn but .f1[**stay curious**] and make sure you .f1[**plot your data**] before rushing off to fit some models!]

---

background-size: cover
class: title-slide
background-image: url("images/bg-01.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
