Haleshot committed on
Commit 0256607 · unverified · 1 Parent(s): 3b3d72c

enhance explanations, improve LaTeX formatting

probability/18_central_limit_theorem.py CHANGED
@@ -6,12 +6,13 @@
  # "scipy==1.15.2",
  # "numpy==2.2.4",
  # "plotly==5.18.0",
  # ]
  # ///

  import marimo

- __generated_with = "0.11.30"
  app = marimo.App(width="medium", app_title="Central Limit Theorem")

@@ -23,7 +24,20 @@ def _(mo):

  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part4/clt/), by Stanford professor Chris Piech._

- The Central Limit Theorem (CLT) is one of the most important concepts in probability theory and statistics. It explains why many real-world distributions tend to be normal, even when the underlying processes are not.
  """
  )
  return
@@ -41,7 +55,7 @@ def _(mo):

  Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The sum of these random variables approaches a normal distribution as $n \rightarrow \infty$:

- $n∑i=1Xi∼N(n⋅μ,n⋅σ2)\sum_{i=1}^{n}X_i \sim \mathcal{N}(n \cdot \mu, n \cdot \sigma^2)$

  Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$. Since each $X_i$ is identically distributed, they share the same expectation and variance.

@@ -49,7 +63,7 @@ def _(mo):

  Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The average of these random variables approaches a normal distribution as $n \rightarrow \infty$:

- $\frac{1}{n} ∑i=1Xi∼N(μ,σ2n)\frac{1}{n}\sum_{i=1}^{n}X_i \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$

  Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$.
@@ -67,7 +81,7 @@ def _(mo):

  Let's explore what happens when you add random variables together. For example, what if we add 100 different uniform random variables?

- ```python
  from random import random

  def add_100_uniforms():
@@ -77,7 +91,7 @@ def _(mo):
      x_i = random()
      total += x_i
      return total
- ```

  The value returned by this function will be a random variable. Click the button below to run the function and observe the resulting value of total:
  """
@@ -311,41 +325,37 @@ def _(mo):
  r"""
  ### Example 1: Dice Game

- You will roll a 6-sided dice 10 times. Let $X$ be the total value of all 10 dice: $X = X_1 + X_2 + \dots + X_{10}$. You win the game if $X \leq 25$ or $X \geq 45$. Use the central limit theorem to calculate the probability that you win.

- Recall that for a single die roll $X_i$:

  - $E[X_i] = 3.5$
  - $\text{Var}(X_i) = \frac{35}{12}$

- **Solution:**
-
- Let $Y$ be the approximating normal distribution. By the Central Limit Theorem:
-
- $Y∼N(10⋅E[Xi],10⋅Var(Xi))Y \sim \mathcal{N}(10 \cdot E[X_i], 10 \cdot \text{Var}(X_i))$

- Substituting in the known values:

- $Y∼N(10⋅3.5,10⋅3512)=N(35,29.2)Y \sim \mathcal{N}(10 \cdot 3.5, 10 \cdot \frac{35}{12}) = \mathcal{N}(35, 29.2)$

- Now we calculate the probability:

- $P(X25 or X45)P(X \leq 25 \text{ or } X \geq 45)$

- $=P(X≤25)+P(X≥45)= P(X \leq 25) + P(X \geq 45)$

- $≈P(Y<25.5)+P(Y>44.5) (Continuity Correction)\approx P(Y < 25.5) + P(Y > 44.5) \text{ (Continuity Correction)}$

- $≈P(Y<25.5)+[1−P(Y<44.5)]\approx P(Y < 25.5) + [1 - P(Y < 44.5)]$

- $≈Φ(25.5−35√29.2)+[1−Φ(44.5−35√29.2)]\approx \Phi\left(\frac{25.5 - 35}{\sqrt{29.2}}\right) + \left[1 - \Phi\left(\frac{44.5 - 35}{\sqrt{29.2}}\right)\right]$

- $≈Φ(−1.76)+[1−Φ(1.76)]\approx \Phi(-1.76) + [1 - \Phi(1.76)]$

- $≈0.039+(1−0.961)\approx 0.039 + (1 - 0.961)$

- $≈0.078\approx 0.078$
- So, the probability of winning the game is approximately 7.8%.
  """
  )
  return
@@ -359,17 +369,17 @@ def _(create_dice_game_visualization, fig_to_image, mo):

  dice_explanation = mo.md(
      r"""
- **Visualization Explanation:**

- The graph shows the distribution of the sum of 10 dice rolls. The blue bars represent the actual probability mass function (PMF), while the red curve shows the normal approximation using the Central Limit Theorem.

- The winning regions are shaded in orange:
  - The left region where $X \leq 25$
  - The right region where $X \geq 45$

- The total probability of these regions is approximately 0.078 or 7.8%.

- Notice how the normal approximation provides a good fit to the discrete distribution, demonstrating the power of the Central Limit Theorem.
  """
  )
 
@@ -383,50 +393,45 @@ def _(mo):
  r"""
  ### Example 2: Algorithm Runtime Estimation

- Say you have a new algorithm and you want to test its running time. You know the variance of the algorithm's run time is $\sigma^2 = 4 \text{ sec}^2$, but you want to estimate the mean run time $t$ in seconds.

- You can run the algorithm repeatedly (IID trials). How many trials do you have to run so that your estimated runtime is within ±0.5 seconds of $t$ with 95% certainty?

- Let $X_i$ be the run time of the $i$-th run (for $1 \leq i \leq n$).

  **Solution:**

  We need to find $n$ such that:

- $0.95=P(−0.5≤∑ni=1Xin−t≤0.5)0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right)$
-
- By the central limit theorem, the sample mean follows a normal distribution.
- We can standardize this to work with the standard normal:
-
- $Z=(∑ni=1Xi)−nμσ√nZ = \frac{\left(\sum_{i=1}^n X_i\right) - n\mu}{\sigma \sqrt{n}}$

- $=(∑ni=1Xi)−nt2√n= \frac{\left(\sum_{i=1}^n X_i\right) - nt}{2 \sqrt{n}}$

- Rewriting our probability inequality so that the central term is $Z$:

- $0.95=P(−0.5≤∑ni=1Xin−t≤0.5)0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right)$

- $=P(0.5√n2≤Z≤0.5√n2)= P\left(\frac{-0.5 \sqrt{n}}{2} \leq Z \leq \frac{0.5 \sqrt{n}}{2}\right)$

- And now we find the value of $n$ that makes this equation hold:

- $0.95=Φ(√n4)−Φ(−√n4)0.95 = \Phi\left(\frac{\sqrt{n}}{4}\right) - \Phi\left(-\frac{\sqrt{n}}{4}\right)$
-
- $4=Φ(√n4)−(1−Φ(√n4))= \Phi\left(\frac{\sqrt{n}}{4}\right) - \left(1 - \Phi\left(\frac{\sqrt{n}}{4}\right)\right)$
-
- $=2Φ(√n4)−1= 2\Phi\left(\frac{\sqrt{n}}{4}\right) - 1$

  Solving for $\Phi\left(\frac{\sqrt{n}}{4}\right)$:

- $0.975=Φ(√n4)0.975 = \Phi\left(\frac{\sqrt{n}}{4}\right)$

- $Φ−1(0.975)=√n4\Phi^{-1}(0.975) = \frac{\sqrt{n}}{4}$

- $1.96=√n41.96 = \frac{\sqrt{n}}{4}$

- $n=61.4n = 61.4$

- Therefore, we need to run the algorithm 62 times to estimate the mean runtime within ±0.5 seconds with 95% confidence.
  """
  )
  return
@@ -929,7 +934,6 @@ def _(mo):
  mo.vstack([distribution_type, sample_size, sim_count_slider]),
  run_explorer_button
  ], justify='space-around')
-
  return (
      controls,
      distribution_type,
 
  # "scipy==1.15.2",
  # "numpy==2.2.4",
  # "plotly==5.18.0",
+ # "wigglystuff==0.1.13",
  # ]
  # ///

  import marimo

+ __generated_with = "0.12.6"
  app = marimo.App(width="medium", app_title="Central Limit Theorem")

 

  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part4/clt/), by Stanford professor Chris Piech._

+ The central limit theorem is honestly mind-blowing: no matter what distribution you start with, the sampling distribution of the mean approaches a normal distribution as the sample size increases.
+
+ Mathematically, if we have:
+
+ $X_1, X_2, \ldots, X_n$ as independent, identically distributed random variables with:
+
+ - Mean: $\mu$
+ - Variance: $\sigma^2 < \infty$
+
+ Then as $n \to \infty$:
+
+ $$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}X_i - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$
+
+ > _Note:_ The above LaTeX derivation is included as a reference. Credit for this formulation goes to the original source linked at the top of the notebook.
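This convergence is easy to check empirically. Below is a quick Monte Carlo sketch (an editor's illustration, not part of the committed notebook) using a deliberately skewed exponential distribution, for which $\mu = \sigma = 1$; the standardized quantity $\sqrt{n}(\bar{X} - \mu)$ should then look like $\mathcal{N}(0, 1)$:

```python
import numpy as np

# Draw from a skewed distribution: Exponential(1) has mu = 1 and sigma = 1.
rng = np.random.default_rng(0)
n, trials = 1_000, 20_000
samples = rng.exponential(scale=1.0, size=(trials, n))

# sqrt(n) * (sample mean - mu) should be approximately N(0, sigma^2) = N(0, 1)
standardized = np.sqrt(n) * (samples.mean(axis=1) - 1.0)
print(standardized.mean(), standardized.std())
```

Despite the heavy right skew of the underlying draws, the mean and standard deviation of the standardized quantity come out close to 0 and 1 at $n = 1000$.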
  """
  )
  return
 

  Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The sum of these random variables approaches a normal distribution as $n \rightarrow \infty$:

+ $\sum_{i=1}^{n}X_i \sim \mathcal{N}(n \cdot \mu, n \cdot \sigma^2)$

  Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$. Since each $X_i$ is identically distributed, they share the same expectation and variance.

  Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The average of these random variables approaches a normal distribution as $n \rightarrow \infty$:

+ $\frac{1}{n}\sum_{i=1}^{n}X_i \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$

  Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$.
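The averaged form can be verified numerically. A small sketch (an editor's addition, not notebook code): for $\text{Uniform}(0, 1)$ we have $\mu = 1/2$ and $\sigma^2 = 1/12$, so the mean of $n$ draws should have mean $1/2$ and variance $\frac{1}{12n}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 50_000

# Each row is one experiment: the average of n Uniform(0, 1) draws
means = rng.random((trials, n)).mean(axis=1)

print(means.mean())          # should be close to mu = 0.5
print(means.var() * 12 * n)  # var should be sigma^2 / n, so this ratio is ~1
```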
69
 
 
81
 
82
  Let's explore what happens when you add random variables together. For example, what if we add 100 different uniform random variables?
83
 
84
+ `python
85
  from random import random
86
 
87
  def add_100_uniforms():
 
91
  x_i = random()
92
  total += x_i
93
  return total
94
+ `
95
 
96
  The value returned by this function will be a random variable. Click the button below to run the function and observe the resulting value of total:
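Repeating the function many times makes the CLT visible in the numbers. A hedged sketch (an editor's addition): the sum of 100 uniforms should have mean $100 \cdot 0.5 = 50$ and standard deviation $\sqrt{100/12} \approx 2.89$:

```python
from math import sqrt
from random import random, seed

def add_100_uniforms():
    # Sum of 100 independent Uniform(0, 1) draws
    total = 0
    for _ in range(100):
        total += random()
    return total

seed(42)
totals = [add_100_uniforms() for _ in range(20_000)]

mean = sum(totals) / len(totals)
var = sum((t - mean) ** 2 for t in totals) / len(totals)
print(mean, sqrt(var))  # CLT predicts roughly 50 and 2.89
```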
  """
 
  r"""
  ### Example 1: Dice Game

+ > _Note:_ The following application demonstrates the practical use of the Central Limit Theorem. The mathematical derivation is based on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/clt/) by Chris Piech.

+ Let's solve a fun probability problem: You roll a 6-sided die 10 times and let $X$ represent the total value of all 10 dice: $X = X_1 + X_2 + \dots + X_{10}$. You win if $X \leq 25$ or $X \geq 45$. What's your probability of winning?

+ For a single die roll $X_i$, we know:
  - $E[X_i] = 3.5$
  - $\text{Var}(X_i) = \frac{35}{12}$

+ **Solution Approach:**

+ This is where the Central Limit Theorem shines! Since we're summing 10 independent, identically distributed random variables, we can approximate this sum with a normal distribution $Y$:

+ $Y \sim \mathcal{N}(10 \cdot E[X_i], 10 \cdot \text{Var}(X_i)) = \mathcal{N}(35, 29.2)$

+ Now calculating our winning probability:

+ $P(X \leq 25 \text{ or } X \geq 45) = P(X \leq 25) + P(X \geq 45)$

+ Since we're approximating a discrete distribution with a continuous one, we apply a continuity correction:

+ $\approx P(Y < 25.5) + P(Y > 44.5) = P(Y < 25.5) + [1 - P(Y < 44.5)]$

+ Converting to standard normal form:

+ $\approx \Phi\left(\frac{25.5 - 35}{\sqrt{29.2}}\right) + \left[1 - \Phi\left(\frac{44.5 - 35}{\sqrt{29.2}}\right)\right]$

+ $\approx \Phi(-1.76) + [1 - \Phi(1.76)]$

+ $\approx 0.039 + (1 - 0.961) \approx 0.078$

+ So your chance of winning is about 7.8% — not great odds, but that's probability for you!
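The hand computation above can be reproduced with `scipy.stats.norm`, which is already among the notebook's dependencies. This is an editor's sketch, not code from the commit:

```python
from math import sqrt
from scipy.stats import norm

mu = 10 * 3.5            # E[X] for the sum of 10 dice
sd = sqrt(10 * 35 / 12)  # sqrt of Var(X) = 10 * 35/12 ~ 29.2

# Continuity correction: P(X <= 25) ~ P(Y < 25.5), P(X >= 45) ~ P(Y > 44.5)
p_win = norm.cdf((25.5 - mu) / sd) + (1 - norm.cdf((44.5 - mu) / sd))
print(p_win)
```

The result is about 0.078, matching the derivation.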
 
  """
  )
  return
 

  dice_explanation = mo.md(
      r"""
+ **Understanding the Visualization:**

+ This graph shows our dice game in action. The blue bars represent the exact probability distribution for summing 10 dice, while the red curve shows our normal approximation from the Central Limit Theorem.

+ I've highlighted the winning regions in orange:
  - The left region where $X \leq 25$
  - The right region where $X \geq 45$

+ Together these regions cover about 7.8% of the total probability.

+ What's fascinating here is how closely the normal curve approximates the actual discrete distribution; this is the Central Limit Theorem working its magic, even with just 10 random variables.
  """
  )
 
 
  r"""
  ### Example 2: Algorithm Runtime Estimation

+ > _Note:_ The following derivation demonstrates the practical application of the Central Limit Theorem for experimental design. The mathematical approach is based on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/clt/) by Chris Piech.
+
+ Here's a practical problem I encounter in performance testing: You've developed a new algorithm and want to measure its average runtime. You know the variance is $\sigma^2 = 4 \text{ sec}^2$, but need to estimate the true mean runtime $t$.

+ The question: How many test runs do you need to be 95% confident your estimated mean is within ±0.5 seconds of the true value?

+ Let $X_i$ represent the runtime of the $i$-th test (for $1 \leq i \leq n$).

  **Solution:**

  We need to find $n$ such that:

+ $0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right)$

+ The Central Limit Theorem tells us that as $n$ increases, the sample mean approaches a normal distribution. Let's standardize this to work with the standard normal distribution:

+ $Z = \frac{\left(\sum_{i=1}^n X_i\right) - n\mu}{\sigma \sqrt{n}} = \frac{\left(\sum_{i=1}^n X_i\right) - nt}{2 \sqrt{n}}$

+ Rewriting our probability constraint in terms of $Z$:

+ $0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right) = P\left(\frac{-0.5 \sqrt{n}}{2} \leq Z \leq \frac{0.5 \sqrt{n}}{2}\right)$

+ Using the properties of the standard normal CDF:

+ $0.95 = \Phi\left(\frac{\sqrt{n}}{4}\right) - \Phi\left(-\frac{\sqrt{n}}{4}\right) = 2\Phi\left(\frac{\sqrt{n}}{4}\right) - 1$

  Solving for $\Phi\left(\frac{\sqrt{n}}{4}\right)$:

+ $0.975 = \Phi\left(\frac{\sqrt{n}}{4}\right)$

+ Using the inverse CDF:

+ $\Phi^{-1}(0.975) = \frac{\sqrt{n}}{4}$

+ $1.96 = \frac{\sqrt{n}}{4}$

+ $n = 61.4$
+
+ Rounding up, we need 62 test runs to achieve our desired confidence interval — a practical result we can immediately apply to our testing protocol.
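The final steps collapse into a one-line sample-size formula, $n = \lceil (\Phi^{-1}(0.975) \cdot \sigma / 0.5)^2 \rceil$. A quick check with `scipy` (an editor's sketch, not part of the commit):

```python
from math import ceil
from scipy.stats import norm

sigma = 2.0        # sqrt of the known variance, 4 sec^2
half_width = 0.5   # desired +/- 0.5 second precision
confidence = 0.95

z = norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value, ~1.96
n = ceil((z * sigma / half_width) ** 2)  # smallest integer n that suffices
print(z, n)
```

With these numbers, `n` comes out to 62, agreeing with the derivation.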
  """
  )
  return
 
  mo.vstack([distribution_type, sample_size, sim_count_slider]),
  run_explorer_button
  ], justify='space-around')

  return (
      controls,
      distribution_type,