enhance explanations, improve LaTeX formatting
probability/18_central_limit_theorem.py
CHANGED
@@ -6,12 +6,13 @@
 # "scipy==1.15.2",
 # "numpy==2.2.4",
 # "plotly==5.18.0",
 # ]
 # ///

 import marimo

-__generated_with = "0.
 app = marimo.App(width="medium", app_title="Central Limit Theorem")

@@ -23,7 +24,20 @@ def _(mo):
 _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part4/clt/), by Stanford professor Chris Piech._

-The
 """
 )
 return
@@ -41,7 +55,7 @@
 Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The sum of these random variables approaches a normal distribution as $n \rightarrow \infty$:

 Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$. Since each $X_i$ is identically distributed, they share the same expectation and variance.
@@ -49,7 +63,7 @@
 Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The average of these random variables approaches a normal distribution as $n \rightarrow \infty$:

-$\frac{1}{n}
 Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$.
@@ -67,7 +81,7 @@
 Let's explore what happens when you add random variables together. For example, what if we add 100 different uniform random variables?

 from random import random

 def add_100_uniforms():
@@ -77,7 +91,7 @@
 x_i = random()
 total += x_i
 return total

 The value returned by this function will be a random variable. Click the button below to run the function and observe the resulting value of total:
 """
@@ -311,41 +325,37 @@ def _(mo):
 r"""
 ### Example 1: Dice Game

 - $E[X_i] = 3.5$
 - $\text{Var}(X_i) = \frac{35}{12}$

-**Solution:**
-Let $Y$ be the approximating normal distribution. By the Central Limit Theorem:
-$Y \sim \mathcal{N}(10 \cdot E[X_i], 10 \cdot \text{Var}(X_i))$
-$Y
-Now
-$P(X
-So, the probability of winning the game is approximately 7.8%.
 """
 )
 return
@@ -359,17 +369,17 @@ def _(create_dice_game_visualization, fig_to_image, mo):
 dice_explanation = mo.md(
 r"""
-**Visualization
 - The left region where $X \leq 25$
 - The right region where $X \geq 45$
 """
 )
@@ -383,50 +393,45 @@ def _(mo):
 r"""
 ### Example 2: Algorithm Runtime Estimation

-Let $X_i$
 **Solution:**

 We need to find $n$ such that:

-$0.95
-By the central limit theorem, the sample mean follows a normal distribution.
-We can standardize this to work with the standard normal:
-$Z = \frac{\left(\sum_{i=1}^n X_i\right) - n\mu}{\sigma \sqrt{n}}$
-$0.95
-$= \Phi\left(\frac{\sqrt{n}}{4}\right) - \left(1 - \Phi\left(\frac{\sqrt{n}}{4}\right)\right)$
-$= 2\Phi\left(\frac{\sqrt{n}}{4}\right) - 1$
 Solving for $\Phi\left(\frac{\sqrt{n}}{4}\right)$:
-$0.975
-$
 """
 )
 return
@@ -929,7 +934,6 @@ def _(mo):
 mo.vstack([distribution_type, sample_size, sim_count_slider]),
 run_explorer_button
 ], justify='space-around')
-
 return (
 controls,
 distribution_type,
 # "scipy==1.15.2",
 # "numpy==2.2.4",
 # "plotly==5.18.0",
+# "wigglystuff==0.1.13",
 # ]
 # ///

 import marimo

+__generated_with = "0.12.6"
 app = marimo.App(width="medium", app_title="Central Limit Theorem")
 _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part4/clt/), by Stanford professor Chris Piech._

+The central limit theorem is honestly mind-blowing: no matter what distribution you start with, the sampling distribution of the mean approaches a normal distribution as the sample size increases.
+
+Mathematically, if we have:
+
+$X_1, X_2, \ldots, X_n$ as independent, identically distributed random variables with:
+
+- Mean: $\mu$
+- Variance: $\sigma^2 < \infty$
+
+Then as $n \to \infty$:
+
+$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}X_i - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$
+
+> _Note:_ The above formulation is included as a reference. Credit goes to the original source linked at the top of the notebook.
 """
 )
 return
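To see this convergence concretely, here is a minimal standalone sketch (NumPy only, separate from the notebook's own plotting cells) that standardizes sample means of a skewed exponential distribution and checks that they approach the normal limit:

```python
import numpy as np

rng = np.random.default_rng(42)

# Exponential(scale=1) is skewed, with mu = 1 and sigma^2 = 1, so
# sqrt(n) * (sample_mean - mu) should approach N(0, 1) as n grows.
for n in (10, 100, 1000, 10000):
    samples = rng.exponential(scale=1.0, size=(20_000, n))
    standardized = np.sqrt(n) * (samples.mean(axis=1) - 1.0)
    centered = standardized - standardized.mean()
    skew = (centered**3).mean() / standardized.std() ** 3
    # Variance stays near sigma^2 = 1; skewness shrinks toward the
    # normal distribution's 0 as n increases.
    print(f"n={n:>5}: var={standardized.var():.3f}, skew={skew:.3f}")
```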
 Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The sum of these random variables approaches a normal distribution as $n \rightarrow \infty$:

+$\sum_{i=1}^{n}X_i \sim \mathcal{N}(n \cdot \mu, n \cdot \sigma^2)$

 Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$. Since each $X_i$ is identically distributed, they share the same expectation and variance.
 Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The average of these random variables approaches a normal distribution as $n \rightarrow \infty$:

+$\frac{1}{n}\sum_{i=1}^{n}X_i \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$

 Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$.
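A quick numerical check of both statements, as a sketch using NumPy (the Uniform(0, 1) choice is arbitrary; it has $\mu = 0.5$ and $\sigma^2 = 1/12$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
draws = rng.uniform(0.0, 1.0, size=(100_000, n))

sums = draws.sum(axis=1)
means = draws.mean(axis=1)

print(sums.mean(), n * 0.5)        # sum mean         ~ n * mu      = 200
print(sums.var(), n / 12)          # sum variance     ~ n * sigma^2 ≈ 33.33
print(means.mean(), 0.5)           # mean of averages ~ mu          = 0.5
print(means.var(), 1 / (12 * n))   # var of averages  ~ sigma^2 / n ≈ 0.00021
```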
 Let's explore what happens when you add random variables together. For example, what if we add 100 different uniform random variables?

+```python
 from random import random

 def add_100_uniforms():

 x_i = random()
 total += x_i
 return total
+```

 The value returned by this function will be a random variable. Click the button below to run the function and observe the resulting value of total:
 """
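The middle of `add_100_uniforms` falls between the two hunks above, so its loop is not shown in the diff. A hypothetical completion, with a check that the totals land near the CLT prediction $\mathcal{N}(50, 100/12)$:

```python
from random import random
from statistics import mean, variance

def add_100_uniforms():
    # Hypothetical reconstruction of the loop elided from the diff.
    total = 0.0
    for _ in range(100):
        x_i = random()
        total += x_i
    return total

totals = [add_100_uniforms() for _ in range(10_000)]
print(mean(totals))      # ~ 50.0  (100 * 0.5)
print(variance(totals))  # ~ 8.33  (100 * 1/12)
```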
 r"""
 ### Example 1: Dice Game

+> _Note:_ The following application demonstrates the practical use of the Central Limit Theorem. The mathematical derivation is based on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/clt/) by Chris Piech.

+Let's solve a fun probability problem: you roll a 6-sided die 10 times and let $X$ represent the total value of all 10 dice: $X = X_1 + X_2 + \dots + X_{10}$. You win if $X \leq 25$ or $X \geq 45$. What's your probability of winning?

+For a single die roll $X_i$, we know:
 - $E[X_i] = 3.5$
 - $\text{Var}(X_i) = \frac{35}{12}$

+**Solution Approach:**

+This is where the Central Limit Theorem shines: since we're summing 10 independent, identically distributed random variables, we can approximate this sum with a normal distribution $Y$:

+$Y \sim \mathcal{N}(10 \cdot E[X_i], 10 \cdot \text{Var}(X_i)) = \mathcal{N}(35, 29.2)$

+Now calculating our winning probability:

+$P(X \leq 25 \text{ or } X \geq 45) = P(X \leq 25) + P(X \geq 45)$

+Since we're approximating a discrete distribution with a continuous one, we apply a continuity correction:

+$\approx P(Y < 25.5) + P(Y > 44.5) = P(Y < 25.5) + [1 - P(Y < 44.5)]$

+Converting to standard normal form:

+$\approx \Phi\left(\frac{25.5 - 35}{\sqrt{29.2}}\right) + \left[1 - \Phi\left(\frac{44.5 - 35}{\sqrt{29.2}}\right)\right]$

+$\approx \Phi(-1.76) + [1 - \Phi(1.76)]$

+$\approx 0.039 + (1 - 0.961) \approx 0.078$

+So your chance of winning is about 7.8%: not great odds, but that's probability for you!
 """
 )
 return
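The arithmetic above can be double-checked in a few lines, a sketch using SciPy's standard normal CDF (`scipy.stats.norm.cdf`):

```python
from math import sqrt
from scipy.stats import norm

mu = 10 * 3.5            # 35
sd = sqrt(10 * 35 / 12)  # sqrt(29.17) ≈ 5.40

# Normal approximation with continuity correction.
p_win = norm.cdf(25.5, loc=mu, scale=sd) + (1 - norm.cdf(44.5, loc=mu, scale=sd))
print(f"{p_win:.3f}")    # ≈ 0.078
```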
 dice_explanation = mo.md(
 r"""
+**Understanding the Visualization:**

+This graph shows our dice game in action. The blue bars represent the exact probability distribution for summing 10 dice, while the red curve shows our normal approximation from the Central Limit Theorem.

+I've highlighted the winning regions in orange:
 - The left region where $X \leq 25$
 - The right region where $X \geq 45$

+Together these regions cover about 7.8% of the total probability.

+What's fascinating here is how closely the normal curve approximates the actual discrete distribution: this is the Central Limit Theorem at work, even with just 10 random variables.
 """
 )
 r"""
 ### Example 2: Algorithm Runtime Estimation

+> _Note:_ The following derivation demonstrates the practical application of the Central Limit Theorem for experimental design. The mathematical approach is based on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/clt/) by Chris Piech.
+
+Here's a practical problem I encounter in performance testing: you've developed a new algorithm and want to measure its average runtime. You know the variance is $\sigma^2 = 4 \text{ sec}^2$, but you need to estimate the true mean runtime $t$.

+The question: how many test runs do you need to be 95% confident that your estimated mean is within ±0.5 seconds of the true value?

+Let $X_i$ represent the runtime of the $i$-th test (for $1 \leq i \leq n$).

 **Solution:**

 We need to find $n$ such that:

+$0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right)$

+The Central Limit Theorem tells us that as $n$ increases, the sample mean approaches a normal distribution. Let's standardize it to work with the standard normal distribution:

+$Z = \frac{\left(\sum_{i=1}^n X_i\right) - n\mu}{\sigma \sqrt{n}} = \frac{\left(\sum_{i=1}^n X_i\right) - nt}{2 \sqrt{n}}$

+Rewriting our probability constraint in terms of $Z$:

+$0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right) = P\left(\frac{-0.5 \sqrt{n}}{2} \leq Z \leq \frac{0.5 \sqrt{n}}{2}\right)$

+Using the symmetry of the standard normal CDF:

+$0.95 = \Phi\left(\frac{\sqrt{n}}{4}\right) - \Phi\left(-\frac{\sqrt{n}}{4}\right) = 2\Phi\left(\frac{\sqrt{n}}{4}\right) - 1$

 Solving for $\Phi\left(\frac{\sqrt{n}}{4}\right)$:

+$0.975 = \Phi\left(\frac{\sqrt{n}}{4}\right)$

+Using the inverse CDF:

+$\Phi^{-1}(0.975) = \frac{\sqrt{n}}{4}$

+$1.96 = \frac{\sqrt{n}}{4}$

+$n \approx 61.5$
+
+Rounding up, we need 62 test runs to achieve the desired confidence interval: a practical result we can apply immediately to our testing protocol.
 """
 )
 return
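The same solution in code, a sketch using SciPy's inverse CDF (`scipy.stats.norm.ppf`):

```python
from math import ceil
from scipy.stats import norm

sigma = 2.0        # sqrt of the known variance, 4 sec^2
half_width = 0.5   # target accuracy: within ±0.5 seconds
confidence = 0.95

z = norm.ppf(1 - (1 - confidence) / 2)  # Phi^{-1}(0.975) ≈ 1.96
n = (z * sigma / half_width) ** 2       # ≈ 61.5
print(ceil(n))                          # 62
```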
 mo.vstack([distribution_type, sample_size, sim_count_slider]),
 run_explorer_button
 ], justify='space-around')
 return (
 controls,
 distribution_type,