class: center, middle, inverse, title-slide

# Central Limit Theorem
### Dr. Dogucu

---
layout: true

<div class="my-header"></div>

<div class="my-footer"> Copyright © <a href="https://mdogucu.ics.uci.edu">Dr. Mine Dogucu</a>. <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA 4.0</a></div>

---
class: middle

## Data

We will be using payroll data from the Los Angeles Police Department (LAPD) from 2018.

```r
glimpse(lapd)
```

```
## Rows: 14,824
## Columns: 1
## $ base_pay <dbl> 119321.60, 113270.70, 148116.00, 78676.87, 109373.63, 9500...
```

---

## Population Distribution

<img src="slide-4-clt_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: middle

## Population Mean

We have data on everyone who worked for the LAPD in 2018, so the distribution we just looked at is a population distribution. We can go ahead and calculate the population mean (`\(\mu\)`).

```r
summarize(lapd, mean(base_pay))
```

```
## # A tibble: 1 x 1
##   `mean(base_pay)`
##              <dbl>
## 1           85149.
```

---
class: middle

## Population Standard Deviation

We can also calculate the population standard deviation (`\(\sigma\)`).

```
## # A tibble: 1 x 1
##   `sd(base_pay)`
##            <dbl>
## 1         38423.
```

---
class: middle

What if we did not have access to all this data? What would we do?

--

Rely on a sample!

---
class: middle

Suppose we took a random sample of LAPD staff, asked them about their salary (and they answered truthfully), and calculated the mean. Would we find a mean of 85149.05? Why or why not?

---
class: middle

Let's pretend we have never seen the data and we do not know the population parameter `\(\mu\)`. In fact, this is usually what happens in real life: we do not have the population information, but we do want to know a population __parameter__ (it does not necessarily have to be the mean).

--

If we took a sample and calculated the sample mean, we would call it a __point estimate__ of the parameter.
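---
class: middle

The idea of a point estimate can be sketched in R. This is only an illustration under stated assumptions, not the code behind these slides: `pay_population` is a hypothetical, simulated stand-in for the payroll data, so its numbers will not match the LAPD values.

```r
# Hypothetical stand-in for the payroll data: 14,824 simulated base pays.
# These are NOT the real LAPD figures.
set.seed(2018)
pay_population <- rgamma(14824, shape = 4, scale = 21000)

# Population parameter (in real life this is usually unknown)
mu <- mean(pay_population)

# Draw one random sample of 20 people and compute the point estimate
pay_sample <- sample(pay_population, size = 20)
x_bar <- mean(pay_sample)

# Compare the parameter with its point estimate
c(mu = mu, x_bar = x_bar, difference = x_bar - mu)
```

Drawing a new sample gives a different `x_bar`, while `mu` stays fixed.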
---
class: middle center

|                               | Parameter of Interest | Point Estimate / Sample Statistic |
|-------------------------------|-----------------------|-----------------------------------|
| Mean                          | `\(\mu\)`             | `\(\bar x\)`                      |
| Difference of Two Means       | `\(\mu_1 - \mu_2\)`   | `\(\bar x_1 - \bar x_2\)`         |
| Proportion                    | `\(\pi\)`             | `\(p\)`                           |
| Difference of Two Proportions | `\(\pi_1 - \pi_2\)`   | `\(p_1 - p_2\)`                   |

---
class: middle

## First Sample

We would like to know about `\(\mu\)` but we cannot access the whole population. A researcher takes a random sample of 20 LAPD staff and asks them about their base pay.

--

```
##  [1]      0.00 109368.20  95924.46  29417.88  32236.80  98306.29      0.00
##  [8]  95877.27      0.00  61521.20 109054.97  53726.44  89835.29      0.00
## [15] 109378.40  69640.00  43810.12 109409.10 103408.00   3600.00
```

--

**Mean of first sample**, `\(\bar x_1\)` =

```
## [1] 60725.72
```

---

## Mean of second sample

`\(\bar x_2\)` =

```
## [1] 81837.23
```

--

## Mean of third sample

`\(\bar x_3\)` =

```
## [1] 85614.37
```

---
class: middle

We could do this over and over again. Don't you worry! I did it. I have taken 10,000 samples of size 200 (a sample size of 20 is just too small) and calculated the mean of each one. The following slide shows the distribution of the **sample means**.
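
Repeated sampling of this kind could be done along the following lines. This is a sketch under assumptions, not the code that produced the figures: `pay_population` is a hypothetical, simulated stand-in for the payroll data, and base R's `replicate()` is just one of several ways to repeat the sampling.

```r
# Hypothetical stand-in for the payroll data (NOT the real LAPD figures)
set.seed(2018)
pay_population <- rgamma(14824, shape = 4, scale = 21000)

n_samples <- 10000   # how many samples to draw
sample_size <- 200   # how many people in each sample

# Each replicate draws one random sample and keeps only its mean
sample_means <- replicate(
  n_samples,
  mean(sample(pay_population, size = sample_size))
)

# Histogram of the sample means: a simulated sampling distribution
hist(sample_means, breaks = 50,
     main = "Simulated sampling distribution of the mean",
     xlab = "Sample mean of base pay")
```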
---

### Sampling Distribution of the Mean

<img src="slide-4-clt_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

### Sampling Distribution of the Mean

<img src="slide-4-clt_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---
class: middle

## Conclusion

When certain conditions are met:

`$$\bar x \sim \text{approximately } N\left(\mu, \frac{\sigma^2}{n}\right)$$`

--

`$$(\bar x_1 - \bar x_2) \sim \text{approximately } N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$$`

--

`$$p \sim \text{approximately } N\left(\pi, \frac{\pi(1-\pi)}{n}\right)$$`

--

`$$(p_1 - p_2) \sim \text{approximately } N\left(\pi_1 - \pi_2, \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}\right)$$`

---
class: middle

## Central Limit Theorem (CLT)

If certain conditions are met, the sampling distribution of a sample statistic will be approximately normal, with a mean equal to the population parameter. Its standard deviation will be inversely proportional to the square root of the sample size.

--

We will learn the conditions in the upcoming lectures.

--

Moving forward, we will use the CLT to make __inference__ about population parameters using sample statistics.
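---
class: middle

The claim about the mean and the standard deviation of the sampling distribution can be checked by simulation. The sketch below rests on assumptions: `pay_population` is a hypothetical, simulated stand-in for the payroll data, `simulate_means()` is a helper made up for this illustration, and samples are drawn with replacement so that `\(\sigma / \sqrt{n}\)` applies without a finite population correction.

```r
# Hypothetical stand-in for the payroll data (NOT the real LAPD figures)
set.seed(2018)
pay_population <- rgamma(14824, shape = 4, scale = 21000)

mu <- mean(pay_population)
# Population standard deviation (divides by N rather than N - 1)
sigma <- sqrt(mean((pay_population - mu)^2))

# Hypothetical helper: means of `reps` random samples of size n,
# drawn with replacement
simulate_means <- function(n, reps = 10000) {
  replicate(reps, mean(sample(pay_population, size = n, replace = TRUE)))
}

# Compare the simulated results with mu and sigma / sqrt(n)
for (n in c(20, 200)) {
  means <- simulate_means(n)
  cat("n =", n,
      "| mean of sample means:", round(mean(means)),
      "| SD of sample means:", round(sd(means)),
      "| sigma / sqrt(n):", round(sigma / sqrt(n)), "\n")
}
```

If the approximation holds, the mean of the sample means should be close to `mu` for both sample sizes, and the standard deviation of the sample means should be close to `sigma / sqrt(n)`, which shrinks as `n` grows.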