Akshay Agrawal committed on
Commit
a797de9
·
unverified ·
2 Parent(s): 1c791c5 4996422

Merge pull request #93 from marimo-team/haleshot/refine

probability/10_probability_mass_function.py CHANGED
@@ -10,7 +10,7 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.11.17"
14
  app = marimo.App(width="medium", app_title="Probability Mass Functions")
15
 
16
 
@@ -22,9 +22,9 @@ def _(mo):
22
 
23
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/pmf/), by Stanford professor Chris Piech._
24
 
25
- For a random variable, the most important thing to know is: how likely is each outcome? For a discrete random variable, this information is called the "**Probability Mass Function**". The probability mass function (PMF) provides the "mass" (i.e. amount) of "probability" for each possible assignment of the random variable.
26
 
27
- Formally, the Probability Mass Function is a mapping between the values that the random variable could take on and the probability of the random variable taking on said value. In mathematics, we call these associations functions. There are many different ways of representing functions: you can write an equation, you can make a graph, you can even store many samples in a list.
28
  """
29
  )
30
  return
@@ -36,18 +36,12 @@ def _(mo):
36
  r"""
37
  ## Properties of a PMF
38
 
39
- For a function $p_X(x)$ to be a valid PMF, it must satisfy:
40
 
41
- 1. **Non-negativity**: $p_X(x) \geq 0$ for all $x$
42
- 2. **Unit total probability**: $\sum_x p_X(x) = 1$
43
 
44
- ### Probabilities Must Sum to 1
45
-
46
- For a variable (call it $X$) to be a proper random variable, it must be the case that if you summed up the values of $P(X=x)$ for all possible values $x$ that $X$ can take on, the result must be 1:
47
-
48
- $$\sum_x P(X=x) = 1$$
49
-
50
- This is because a random variable taking on a value is an event (for example $X=3$). Each of those events is mutually exclusive because a random variable will take on exactly one value. Those mutually exclusive cases define an entire sample space. Why? Because $X$ must take on some value.
51
  """
52
  )
53
  return
@@ -125,11 +119,11 @@ def _(np, plt):
125
  def _(mo):
126
  mo.md(
127
  r"""
128
- The information provided in these graphs shows the likelihood of a random variable taking on different values.
129
 
130
- In the graph on the right, the value "6" on the $x$-axis is associated with the probability $\frac{5}{36}$ on the $y$-axis. This $x$-axis refers to the event "the sum of two dice is 6" or $Y = 6$. The $y$-axis tells us that the probability of that event is $\frac{5}{36}$. In full: $P(Y = 6) = \frac{5}{36}$.
131
 
132
- The value "2" is associated with "$\frac{1}{36}$" which tells us that, $P(Y = 2) = \frac{1}{36}$, the probability that two dice sum to 2 is $\frac{1}{36}$. There is no value associated with "1" because the sum of two dice cannot be 1.
133
  """
134
  )
135
  return
@@ -220,7 +214,7 @@ def _(mo):
220
  r"""
221
  ## Data to Histograms to Probability Mass Functions
222
 
223
- One surprising way to store a likelihood function (recall that a PMF is the name of the likelihood function for discrete random variables) is simply a list of data. Let's simulate summing two dice many times to create an empirical PMF:
224
  """
225
  )
226
  return
@@ -323,9 +317,9 @@ def _(collections, np, plt, sim_dice_sums):
323
  def _(mo):
324
  mo.md(
325
  r"""
326
- A normalized histogram (where each value is divided by the length of your data list) is an approximation of the PMF. For a dataset of discrete numbers, a histogram shows the count of each value. By the definition of probability, if you divide this count by the number of experiments run, you arrive at an approximation of the probability of the event $P(Y=y)$.
327
 
328
- Let's look at a specific example. If we want to approximate $P(Y=3)$ (the probability that the sum of two dice is 3), we can count the number of times that "3" occurs in our data and divide by the total number of trials:
329
  """
330
  )
331
  return
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.12.6"
14
  app = marimo.App(width="medium", app_title="Probability Mass Functions")
15
 
16
 
 
22
 
23
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/pmf/), by Stanford professor Chris Piech._
24
 
25
+ PMFs are really important in discrete probability. They tell us how likely each possible outcome is for a discrete random variable.
26
 
27
+ What's interesting about PMFs is that they can be represented in multiple ways - equations, graphs, or even empirical data. The core idea is simple: they map each possible value to its probability.
28
  """
29
  )
30
  return
 
36
  r"""
37
  ## Properties of a PMF
38
 
39
+ For a function $p_X(x)$ to be a valid PMF:
40
 
41
+ 1. **Non-negativity**: probability can't be negative, so $p_X(x) \geq 0$ for all $x$
42
+ 2. **Unit total probability**: all probabilities sum to 1, i.e., $\sum_x p_X(x) = 1$
43
 
44
+ The second property makes intuitive sense: a random variable must take some value, so the probabilities of all its possible values must add up to 1 (that is, 100%).
 
 
 
 
 
 
45
  """
46
  )
47
  return
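To make these two properties concrete, here is a minimal sketch in plain Python (not part of the notebook cells shown in this diff) that builds the PMF of the sum of two fair dice and checks both conditions:

```python
# Minimal sketch: build the PMF of Y = sum of two fair dice and verify
# the two PMF properties (non-negativity, probabilities summing to 1).
from fractions import Fraction

pmf = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        y = d1 + d2
        pmf[y] = pmf.get(y, Fraction(0)) + Fraction(1, 36)

# Property 1: non-negativity.
assert all(p >= 0 for p in pmf.values())
# Property 2: unit total probability.
assert sum(pmf.values()) == 1

print(pmf[6])  # Fraction(5, 36), i.e. P(Y = 6)
```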
 
119
  def _(mo):
120
  mo.md(
121
  r"""
122
+ These graphs show how likely each possible value of the dice sum is.
123
 
124
+ Looking at the right graph, when we see "6" on the $x$-axis with probability $\frac{5}{36}$ on the $y$-axis, that's telling us there's a $\frac{5}{36}$ chance of rolling a sum of 6 with two dice. Or, more formally: $P(Y = 6) = \frac{5}{36}$.
125
 
126
+ Similarly, the value "2" has probability "$\frac{1}{36}$" because there's only one way to get a sum of 2 (rolling 1 on both dice). And you'll notice there's no value for "1", since you can't get a sum of 1 with two dice: the minimum possible is 2.
127
  """
128
  )
129
  return
 
214
  r"""
215
  ## Data to Histograms to Probability Mass Functions
216
 
217
+ Here's something I find interesting: one way to represent a likelihood function is just through raw data. Instead of mathematical formulas, we can approximate a PMF by collecting data points. Let's see this in action by simulating lots of dice rolls and building an empirical PMF:
218
  """
219
  )
220
  return
 
317
  def _(mo):
318
  mo.md(
319
  r"""
320
+ When we normalize a histogram (divide each count by the total sample size), we get a good approximation of the true PMF. It's a simple yet powerful idea: count how many times each value appears, then divide by the total number of trials.
321
 
322
+ Let's make this concrete. Say we want to estimate $P(Y=3)$, the probability of rolling a sum of 3 with two dice. We just count how many 3's show up in our simulated rolls and divide by the total number of rolls:
323
  """
324
  )
325
  return
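As a rough sketch of the approach described above (assuming numpy, which the notebook already uses), the empirical PMF can be built like this:

```python
# Minimal sketch: simulate many two-dice rolls and turn the counts into
# an empirical PMF by dividing by the number of trials.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000
sums = rng.integers(1, 7, size=n_trials) + rng.integers(1, 7, size=n_trials)

values, counts = np.unique(sums, return_counts=True)
empirical_pmf = dict(zip(values.tolist(), (counts / n_trials).tolist()))

# The estimate of P(Y = 3) should land near the true value 2/36 (about 0.056).
print(empirical_pmf[3])
```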
probability/11_expectation.py CHANGED
@@ -10,7 +10,7 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.11.19"
14
  app = marimo.App(width="medium", app_title="Expectation")
15
 
16
 
@@ -22,9 +22,9 @@ def _(mo):
22
 
23
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/expectation/), by Stanford professor Chris Piech._
24
 
25
- A random variable is fully represented by its Probability Mass Function (PMF), which describes each value the random variable can take on and the corresponding probabilities. However, a PMF can contain a lot of information. Sometimes it's useful to summarize a random variable with a single value!
26
 
27
- The most common, and arguably the most useful, summary of a random variable is its **Expectation** (also called the expected value or mean).
28
  """
29
  )
30
  return
@@ -36,11 +36,11 @@ def _(mo):
36
  r"""
37
  ## Definition of Expectation
38
 
39
- The expectation of a random variable $X$, written $E[X]$, is the average of all the values the random variable can take on, each weighted by the probability that the random variable will take on that value.
40
 
41
  $$E[X] = \sum_x x \cdot P(X=x)$$
42
 
43
- Expectation goes by many other names: Mean, Weighted Average, Center of Mass, 1st Moment. All of these are calculated using the same formula.
44
  """
45
  )
46
  return
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.12.6"
14
  app = marimo.App(width="medium", app_title="Expectation")
15
 
16
 
 
22
 
23
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/expectation/), by Stanford professor Chris Piech._
24
 
25
+ Expectations are fascinating: they represent the "center of mass" of a probability distribution. While they're often called "expected values" or "averages," they don't always match our intuition about what's "expected" to happen.
26
 
27
+ For me, the most interesting part about expectations is how they quantify what happens "on average" in the long run, even if that average isn't a possible outcome (like expecting 3.5 on a standard die roll).
28
  """
29
  )
30
  return
 
36
  r"""
37
  ## Definition of Expectation
38
 
39
+ Expectation (written as $E[X]$) is basically the "average outcome" of a random variable, but with a twist - we weight each possible value by how likely it is to occur. I like to think of it as the "center of gravity" for probability.
40
 
41
  $$E[X] = \sum_x x \cdot P(X=x)$$
42
 
43
+ People call this concept by different names - mean, weighted average, center of mass, or 1st moment if you're being fancy. They're all calculated the same way, though: multiply each value by its probability, then add everything up.
44
  """
45
  )
46
  return
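To see the weighted-average definition in action, here is a minimal sketch computing the expectation of a fair six-sided die directly from its PMF:

```python
# Minimal sketch: E[X] = sum over x of x * P(X = x) for a fair die.
values = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in values}

expectation = sum(x * p for x, p in pmf.items())
print(expectation)  # 3.5, a value the die itself can never show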
probability/13_bernoulli_distribution.py CHANGED
@@ -10,7 +10,7 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.11.22"
14
  app = marimo.App(width="medium", app_title="Bernoulli Distribution")
15
 
16
 
@@ -20,15 +20,15 @@ def _(mo):
20
  r"""
21
  # Bernoulli Distribution
22
 
23
- _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/bernoulli/), by Stanford professor Chris Piech._
24
 
25
  ## Parametric Random Variables
26
 
27
- There are many classic and commonly-seen random variable abstractions that show up in the world of probability. At this point, we'll learn about several of the most significant parametric discrete distributions.
28
 
29
- When solving problems, if you can recognize that a random variable fits one of these formats, then you can use its pre-derived Probability Mass Function (PMF), expectation, variance, and other properties. Random variables of this sort are called **parametric random variables**. If you can argue that a random variable falls under one of the studied parametric types, you simply need to provide parameters.
30
 
31
- > A good analogy is a `class` in programming. Creating a parametric random variable is very similar to calling a constructor with input parameters.
32
  """
33
  )
34
  return
@@ -40,18 +40,16 @@ def _(mo):
40
  r"""
41
  ## Bernoulli Random Variables
42
 
43
- A **Bernoulli random variable** (also called a boolean or indicator random variable) is the simplest kind of parametric random variable. It can take on two values: 1 and 0.
44
 
45
- It takes on a 1 if an experiment with probability $p$ resulted in success and a 0 otherwise.
46
 
47
- Some example uses include:
 
 
 
48
 
49
- - A coin flip (heads = 1, tails = 0)
50
- - A random binary digit
51
- - Whether a disk drive crashed
52
- - Whether someone likes a Netflix movie
53
-
54
- Here $p$ is the parameter, but different instances of Bernoulli random variables might have different values of $p$.
55
  """
56
  )
57
  return
@@ -167,9 +165,11 @@ def _(expected_value, p_slider, plt, probabilities, values, variance):
167
  def _(mo):
168
  mo.md(
169
  r"""
170
- ## Proof: Expectation of a Bernoulli
 
 
171
 
172
- If $X$ is a Bernoulli with parameter $p$, $X \sim \text{Bern}(p)$:
173
 
174
  \begin{align}
175
  E[X] &= \sum_x x \cdot P(X=x) && \text{Definition of expectation} \\
@@ -178,11 +178,7 @@ def _(mo):
178
  &= p && \text{Remove the 0 term}
179
  \end{align}
180
 
181
- ## Proof: Variance of a Bernoulli
182
-
183
- If $X$ is a Bernoulli with parameter $p$, $X \sim \text{Bern}(p)$:
184
-
185
- To compute variance, first compute $E[X^2]$:
186
 
187
  \begin{align}
188
  E[X^2]
@@ -206,18 +202,16 @@ def _(mo):
206
  def _(mo):
207
  mo.md(
208
  r"""
209
- ## Indicator Random Variable
210
-
211
- > **Definition**: An indicator variable is a Bernoulli random variable which takes on the value 1 if an **underlying event occurs**, and 0 _otherwise_.
212
 
213
- Indicator random variables are a convenient way to convert the "true/false" outcome of an event into a number. That number may be easier to incorporate into an equation.
214
 
215
- A random variable $I$ is an indicator variable for an event $A$ if $I = 1$ when $A$ occurs and $I = 0$ if $A$ does not occur. Indicator random variables are Bernoulli random variables, with $p = P(A)$. $I_A$ is a common choice of name for an indicator random variable.
216
 
217
- Here are some properties of indicator random variables:
218
 
219
- - $P(I=1)=P(A)$
220
- - $E[I]=P(A)$
221
  """
222
  )
223
  return
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.12.6"
14
  app = marimo.App(width="medium", app_title="Bernoulli Distribution")
15
 
16
 
 
20
  r"""
21
  # Bernoulli Distribution
22
 
23
+ > _Note:_ This notebook builds on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/bernoulli/) by Chris Piech.
24
 
25
  ## Parametric Random Variables
26
 
27
+ Probability has a bunch of classic random variable patterns that show up over and over. Let's explore some of the most important parametric discrete distributions.
28
 
29
+ Bernoulli is honestly the simplest distribution you'll ever see, but it's ridiculously powerful in practice. What makes it fascinating to me is how it captures any yes/no scenario: success/failure, heads/tails, 1/0.
30
 
31
+ I think of these distributions as the atoms of probability: they're the fundamental building blocks that everything else is made from.
32
  """
33
  )
34
  return
 
40
  r"""
41
  ## Bernoulli Random Variables
42
 
43
+ A Bernoulli random variable boils down to just two possible values: 1 (success) or 0 (failure). Dead simple, but incredibly useful.
44
 
45
+ Some everyday examples where I see these:
46
 
47
+ - Coin flip (heads=1, tails=0)
48
+ - Whether that sketchy email is spam
49
+ - If someone actually clicks my ad
50
+ - Whether my code compiles first try (almost always 0 for me)
51
 
52
+ All you need is a single parameter, $p$: the probability of success.
 
 
 
 
 
53
  """
54
  )
55
  return
 
165
  def _(mo):
166
  mo.md(
167
  r"""
168
+ ## Expectation and Variance of a Bernoulli
169
+
170
+ > _Note:_ The following derivations are included as reference material. The credit for these mathematical formulations belongs to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/bernoulli/) by Chris Piech.
171
 
172
+ Let's work through why $E[X] = p$ for a Bernoulli:
173
 
174
  \begin{align}
175
  E[X] &= \sum_x x \cdot P(X=x) && \text{Definition of expectation} \\
 
178
  &= p && \text{Remove the 0 term}
179
  \end{align}
180
 
181
+ And for variance, we first need $E[X^2]$:
 
 
 
 
182
 
183
  \begin{align}
184
  E[X^2]
 
202
  def _(mo):
203
  mo.md(
204
  r"""
205
+ ## Indicator Random Variables
 
 
206
 
207
+ Indicator variables are a clever trick I like to use: they turn events into numbers. Instead of dealing with "did the event happen?" (yes/no), we get "1" if it happened and "0" if it didn't.
208
 
209
+ Formally: an indicator variable $I$ for event $A$ equals 1 when $A$ occurs and 0 otherwise. These are just Bernoulli variables where $p = P(A)$. People often use notation like $I_A$ to name them.
210
 
211
+ Two key properties that make them super useful:
212
 
213
+ - $P(I=1)=P(A)$ - probability of getting a 1 is just the probability of the event
214
+ - $E[I]=P(A)$ - the expected value equals the probability (this one's a game-changer!)
215
  """
216
  )
217
  return
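A small simulation (using numpy, which the notebooks already depend on) makes both properties easy to see; this is only an illustrative sketch with an arbitrarily chosen $p$:

```python
# Minimal sketch: an indicator variable I for an event A with P(A) = 0.3.
# Its sample mean approaches E[I] = P(A), and its variance approaches p(1 - p).
import numpy as np

rng = np.random.default_rng(1)
p = 0.3                               # P(A), chosen arbitrarily for illustration
indicator = rng.random(200_000) < p   # I = 1 when A occurs, 0 otherwise

print(indicator.mean())               # approximately 0.3  = P(A) = E[I]
print(indicator.var())                # approximately 0.21 = p * (1 - p)
```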
probability/14_binomial_distribution.py CHANGED
@@ -13,7 +13,7 @@
13
 
14
  import marimo
15
 
16
- __generated_with = "0.11.24"
17
  app = marimo.App(width="medium", app_title="Binomial Distribution")
18
 
19
 
@@ -25,11 +25,9 @@ def _(mo):
25
 
26
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/binomial/), by Stanford professor Chris Piech._
27
 
28
- In this section, we will discuss the binomial distribution. To start, imagine the following example:
29
 
30
- Consider $n$ independent trials of an experiment where each trial is a "success" with probability $p$. Let $X$ be the number of successes in $n$ trials.
31
-
32
- This situation is truly common in the natural world, and as such, there has been a lot of research into such phenomena. Random variables like $X$ are called **binomial random variables**. If you can identify that a process fits this description, you can inherit many already proved properties such as the PMF formula, expectation, and variance!
33
  """
34
  )
35
  return
@@ -197,11 +195,11 @@ def _(mo):
197
  r"""
198
  ## Relationship to Bernoulli Random Variables
199
 
200
- One way to think of the binomial is as the sum of $n$ Bernoulli variables. Say that $Y_i$ is an indicator Bernoulli random variable which is 1 if experiment $i$ is a success. Then if $X$ is the total number of successes in $n$ experiments, $X \sim \text{Bin}(n, p)$:
201
 
202
  $$X = \sum_{i=1}^n Y_i$$
203
 
204
- Recall that the outcome of $Y_i$ will be 1 or 0, so one way to think of $X$ is as the sum of those 1s and 0s.
205
  """
206
  )
207
  return
 
13
 
14
  import marimo
15
 
16
+ __generated_with = "0.12.6"
17
  app = marimo.App(width="medium", app_title="Binomial Distribution")
18
 
19
 
 
25
 
26
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/binomial/), by Stanford professor Chris Piech._
27
 
28
+ The binomial distribution is essentially what happens when you run multiple Bernoulli trials and count the successes. I love this distribution because it appears everywhere in practical scenarios.
29
 
30
+ Think about it: whenever you're counting how many times something happens across a fixed number of independent attempts, each with the same probability of success, you're dealing with a binomial. Website conversions, A/B testing results, even counting heads in multiple coin flips: all binomial!
 
 
31
  """
32
  )
33
  return
 
195
  r"""
196
  ## Relationship to Bernoulli Random Variables
197
 
198
+ One way I like to think about the binomial: it's just adding up a bunch of Bernoullis. If each $Y_i$ is a Bernoulli that tells us if the $i$-th trial succeeded, then:
199
 
200
  $$X = \sum_{i=1}^n Y_i$$
201
 
202
+ This makes the distribution really intuitive to me - we're just counting 1s across our $n$ experiments.
203
  """
204
  )
205
  return
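Here is a quick sketch (using numpy and scipy, both already used in these notebooks) showing that summing Bernoulli draws really does behave like a binomial:

```python
# Minimal sketch: X = sum of n Bernoulli(p) trials behaves like Binomial(n, p).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, samples = 20, 0.25, 100_000

trials = rng.random((samples, n)) < p     # each row: n Bernoulli(p) outcomes
x_from_bernoullis = trials.sum(axis=1)    # X = Y_1 + ... + Y_n for each sample

print(x_from_bernoullis.mean())           # approximately n * p = 5.0
print(stats.binom(n, p).mean())           # exact binomial mean, also 5.0
```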
probability/15_poisson_distribution.py CHANGED
@@ -13,7 +13,7 @@
13
 
14
  import marimo
15
 
16
- __generated_with = "0.11.25"
17
  app = marimo.App(width="medium", app_title="Poisson Distribution")
18
 
19
 
@@ -25,7 +25,9 @@ def _(mo):
25
 
26
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/poisson/), by Stanford professor Chris Piech._
27
 
28
- A Poisson random variable gives the probability of a given number of events in a fixed interval of time (or space). It makes the Poisson assumption that events occur with a known constant mean rate and independently of the time since the last event.
 
 
29
  """
30
  )
31
  return
@@ -180,11 +182,11 @@ def _(mo):
180
  r"""
181
  ## Poisson Intuition: Relation to Binomial Distribution
182
 
183
- The Poisson distribution can be derived as a limiting case of the [binomial distribution](http://marimo.app/https://github.com/marimo-team/learn/blob/main/probability/14_binomial_distribution.py).
184
 
185
- Let's work on a practical example: predicting the number of ride-sharing requests in a specific area over a one-minute interval. From historical data, we know that the average number of requests per minute is $\lambda = 5$.
186
 
187
- We could approximate this using a binomial distribution by dividing our minute into smaller intervals. For example, we can divide a minute into 60 seconds and treat each second as a [Bernoulli trial](http://marimo.app/https://github.com/marimo-team/learn/blob/main/probability/13_bernoulli_distribution.py) - either there's a request (success) or there isn't (failure).
188
 
189
  Let's visualize this concept:
190
  """
@@ -231,7 +233,7 @@ def _(fig_to_image, mo, plt):
231
  _explanation = mo.md(
232
  r"""
233
  In this visualization:
234
-
235
  - Each rectangle represents a 1-second interval
236
  - Blue rectangles indicate intervals where an event occurred
237
  - Red dots show the actual event times (2.75s and 7.12s)
@@ -247,9 +249,9 @@ def _(fig_to_image, mo, plt):
247
  def _(mo):
248
  mo.md(
249
  r"""
250
- The total number of requests received over the minute can be approximated as the sum of the sixty indicator variables, which conveniently matches the description of a binomial — a sum of Bernoullis.
251
 
252
- Specifically, if we define $X$ to be the number of requests in a minute, $X$ is a binomial with $n=60$ trials. What is the probability, $p$, of a success on a single trial? To make the expectation of $X$ equal the observed historical average $\lambda$, we should choose $p$ so that:
253
 
254
  \begin{align}
255
  \lambda &= E[X] && \text{Expectation matches historical average} \\
@@ -257,7 +259,7 @@ def _(mo):
257
  p &= \frac{\lambda}{n} && \text{Solving for $p$}
258
  \end{align}
259
 
260
- In this case, since $\lambda=5$ and $n=60$, we should choose $p=\frac{5}{60}=\frac{1}{12}$ and state that $X \sim \text{Bin}(n=60, p=\frac{5}{60})$. Now we can calculate the probability of different numbers of requests using the binomial PMF:
261
 
262
  $P(X = x) = {n \choose x} p^x (1-p)^{n-x}$
263
 
@@ -269,7 +271,7 @@ def _(mo):
269
  P(X=3) &= {60 \choose 3} (5/60)^3 (55/60)^{60-3} \approx 0.1389
270
  \end{align}
271
 
272
- This is a good approximation, but it doesn't account for the possibility of multiple events in a single second. One solution is to divide our minute into even more fine-grained intervals. Let's try 600 deciseconds (tenths of a second):
273
  """
274
  )
275
  return
@@ -283,7 +285,7 @@ def _(fig_to_image, mo, plt):
283
 
284
  # Example events at 2.75s and 7.12s (convert to deciseconds)
285
  events = [27.5, 71.2]
286
-
287
  for i in range(100):
288
  color = 'royalblue' if any(i <= event_val < i + 1 for event_val in events) else 'lightgray'
289
  ax.add_patch(plt.Rectangle((i, 0), 0.9, 1, color=color))
@@ -434,21 +436,23 @@ def _(df, fig, fig_to_image, mo, n, p):
434
  def _(mo):
435
  mo.md(
436
  r"""
437
- As you can see from the interactive comparison above, as the number of intervals increases, the binomial distribution approaches the Poisson distribution! This is not a coincidence - the Poisson distribution is actually the limiting case of the binomial distribution when:
438
 
439
  - The number of trials $n$ approaches infinity
440
  - The probability of success $p$ approaches zero
441
  - The product $np = \lambda$ remains constant
442
 
443
- This relationship is why the Poisson distribution is so useful - it's easier to work with than a binomial with a very large number of trials and a very small probability of success.
444
 
445
  ## Derivation of the Poisson PMF
446
 
447
- Let's derive the Poisson PMF by taking the limit of the binomial PMF as $n \to \infty$. We start with:
 
 
448
 
449
  $P(X=x) = \lim_{n \rightarrow \infty} {n \choose x} (\lambda / n)^x(1-\lambda/n)^{n-x}$
450
 
451
- While this expression looks intimidating, it simplifies nicely:
452
 
453
  \begin{align}
454
  P(X=x)
@@ -495,7 +499,7 @@ def _(mo):
495
  && \text{Simplifying}\\
496
  \end{align}
497
 
498
- This gives us our elegant Poisson PMF formula: $P(X=x) = \frac{\lambda^x \cdot e^{-\lambda}}{x!}$
499
  """
500
  )
501
  return
 
13
 
14
  import marimo
15
 
16
+ __generated_with = "0.12.6"
17
  app = marimo.App(width="medium", app_title="Poisson Distribution")
18
 
19
 
 
25
 
26
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/poisson/), by Stanford professor Chris Piech._
27
 
28
+ The Poisson distribution is my go-to for modeling random events occurring over time or space. What makes it cool is that it only needs a single parameter λ (lambda), which represents both the mean and variance.
29
+
30
+ I find it particularly useful when events happen rarely but the opportunities for them to occur are numerous, like modeling website visits, particle emissions, or even typos in a document.
31
  """
32
  )
33
  return
 
182
  r"""
183
  ## Poisson Intuition: Relation to Binomial Distribution
184
 
185
+ The Poisson distribution can be derived as a limiting case of the binomial distribution. I find this connection fascinating because it shows how seemingly different distributions are actually related.
186
 
187
+ Let's work through a practical example: predicting ride-sharing requests in a specific area over a one-minute interval. From historical data, we know that the average number of requests per minute is $\lambda = 5$.
188
 
189
+ We could model this using a binomial distribution by dividing our minute into smaller intervals. For example, splitting a minute into 60 seconds, where each second is a Bernoulli trial: either a request arrives (success) or it doesn't (failure).
190
 
191
  Let's visualize this concept:
192
  """
 
233
  _explanation = mo.md(
234
  r"""
235
  In this visualization:
236
+
237
  - Each rectangle represents a 1-second interval
238
  - Blue rectangles indicate intervals where an event occurred
239
  - Red dots show the actual event times (2.75s and 7.12s)
 
249
  def _(mo):
250
  mo.md(
251
  r"""
252
+ The total number of requests received over the minute can be approximated as the sum of sixty indicator variables, which aligns perfectly with the binomial distribution — a sum of Bernoullis.
253
 
254
+ If we define $X$ as the number of requests in a minute, $X$ follows a binomial with $n=60$ trials. To determine the success probability $p$, we need to match the expected value with our historical average $\lambda$:
255
 
256
  \begin{align}
257
  \lambda &= E[X] && \text{Expectation matches historical average} \\
 
259
  p &= \frac{\lambda}{n} && \text{Solving for $p$}
260
  \end{align}
261
 
262
+ With $\lambda=5$ and $n=60$, we get $p=\frac{5}{60}=\frac{1}{12}$, so $X \sim \text{Bin}(n=60, p=\frac{5}{60})$. Using the binomial PMF:
263
 
264
  $P(X = x) = {n \choose x} p^x (1-p)^{n-x}$
265
 
 
271
  P(X=3) &= {60 \choose 3} (5/60)^3 (55/60)^{60-3} \approx 0.1389
272
  \end{align}
273
 
274
+ This approximation works well, but it doesn't account for multiple events occurring in a single second. To address this limitation, we can use even finer intervals, perhaps 600 deciseconds (tenths of a second):
275
  """
276
  )
277
  return
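For readers who want to check these numbers, here is a minimal sketch (using scipy.stats) that evaluates the binomial approximation with $n=60$, $p=5/60$ and compares it against the Poisson($\lambda=5$) PMF it converges to:

```python
# Minimal sketch: binomial approximation (n = 60, p = 5/60) vs. Poisson(5).
from scipy import stats

lam, n = 5, 60
binom = stats.binom(n, lam / n)
poisson = stats.poisson(lam)

for x in (1, 2, 3):
    print(x, round(binom.pmf(x), 4), round(poisson.pmf(x), 4))
# For x = 3 the binomial gives about 0.1389, close to the Poisson's 0.1404.
```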
 
285
 
286
  # Example events at 2.75s and 7.12s (convert to deciseconds)
287
  events = [27.5, 71.2]
288
+
289
  for i in range(100):
290
  color = 'royalblue' if any(i <= event_val < i + 1 for event_val in events) else 'lightgray'
291
  ax.add_patch(plt.Rectangle((i, 0), 0.9, 1, color=color))
 
436
  def _(mo):
437
  mo.md(
438
  r"""
439
+ As our interactive comparison demonstrates, the binomial distribution converges to the Poisson distribution as we increase the number of intervals! This remarkable relationship exists because the Poisson distribution is actually the limiting case of the binomial when:
440
 
441
  - The number of trials $n$ approaches infinity
442
  - The probability of success $p$ approaches zero
443
  - The product $np = \lambda$ remains constant
444
 
445
+ This elegance is why I find the Poisson distribution so powerful: it simplifies what would otherwise be a cumbersome binomial with numerous trials and tiny success probabilities.
446
 
447
  ## Derivation of the Poisson PMF
448
 
449
+ > _Note:_ The following mathematical derivation is included as reference material. The credit for this formulation belongs to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/poisson/) by Chris Piech.
450
+
451
+ The Poisson PMF can be derived by taking the limit of the binomial PMF as $n \to \infty$:
452
 
453
  $P(X=x) = \lim_{n \rightarrow \infty} {n \choose x} (\lambda / n)^x(1-\lambda/n)^{n-x}$
454
 
455
+ Through a series of algebraic manipulations:
456
 
457
  \begin{align}
458
  P(X=x)
 
499
  && \text{Simplifying}\\
500
  \end{align}
501
 
502
+ This gives us the elegant Poisson PMF formula: $P(X=x) = \frac{\lambda^x \cdot e^{-\lambda}}{x!}$
503
  """
504
  )
505
  return
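As a quick sanity check, the formula can be implemented directly and compared against scipy; this is just an illustrative sketch:

```python
# Minimal sketch: the Poisson PMF, P(X = x) = lambda^x * e^(-lambda) / x!,
# implemented by hand and cross-checked against scipy.stats.
import math
from scipy import stats

def poisson_pmf(x: int, lam: float) -> float:
    return lam**x * math.exp(-lam) / math.factorial(x)

lam = 5
print(poisson_pmf(3, lam))         # about 0.1404
print(stats.poisson(lam).pmf(3))   # same value from scipy
```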
probability/16_continuous_distribution.py CHANGED
@@ -14,7 +14,7 @@
14
 
15
  import marimo
16
 
17
- __generated_with = "0.11.26"
18
  app = marimo.App(width="medium")
19
 
20
 
@@ -26,7 +26,9 @@ def _(mo):
26
 
27
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/continuous/), by Stanford professor Chris Piech._
28
 
29
- So far, all the random variables we've explored have been discrete, taking on only specific values (usually integers). Now we'll move into the world of **continuous random variables**, which can take on any real number value. Continuous random variables are used to model measurements with arbitrary precision like height, weight, time, and many natural phenomena.
 
 
30
  """
31
  )
32
  return
@@ -38,20 +40,17 @@ def _(mo):
38
  r"""
39
  ## From Discrete to Continuous
40
 
41
- To make the transition from discrete to continuous random variables, let's start with a thought experiment:
42
-
43
- > Imagine you're running to catch a bus. You know you'll arrive at 2:15pm, but you don't know exactly when the bus will arrive. You want to model the bus arrival time (in minutes past 2pm) as a random variable $T$ so you can calculate the probability that you'll wait more than five minutes: $P(15 < T < 20)$.
44
 
45
- This immediately highlights a key difference from discrete distributions. For discrete distributions, we described the probability that a random variable takes on exact values. But this doesn't make sense for continuous values like time.
46
 
47
- For example:
48
 
 
49
  - What's the probability the bus arrives at exactly 2:17pm and 12.12333911102389234 seconds?
50
- - What's the probability of a child being born weighing exactly 3.523112342234 kilograms?
51
 
52
- These questions don't have meaningful answers because real-world measurements can have infinite precision. The probability of a continuous random variable taking on any specific exact value is actually zero!
53
-
54
- ### Visualizing the Transition
55
 
56
  Let's visualize this transition from discrete to continuous:
57
  """
@@ -150,44 +149,43 @@ def _(mo):
150
  r"""
151
  ## Probability Density Functions
152
 
153
- In the world of discrete random variables, we used **Probability Mass Functions (PMFs)** to describe the probability of a random variable taking on specific values. In the continuous world, we need a different approach.
154
 
155
- For continuous random variables, we use a **Probability Density Function (PDF)** which defines the relative likelihood that a random variable takes on a particular value. We traditionally denote the PDF with the symbol $f$ and write it as:
156
 
157
  $$f(X=x) \quad \text{or simply} \quad f(x)$$
158
 
159
- Where the lowercase $x$ implies that we're talking about the relative likelihood of a continuous random variable which is the uppercase $X$.
160
 
161
  ### Key Properties of PDFs
162
 
163
- A **Probability Density Function (PDF)** $f(x)$ for a continuous random variable $X$ has these key properties:
164
 
165
- 1. The probability that $X$ takes a value in the interval $[a, b]$ is:
166
 
167
  $$P(a \leq X \leq b) = \int_a^b f(x) \, dx$$
168
 
169
- 2. The PDF must be non-negative everywhere:
170
 
171
  $$f(x) \geq 0 \text{ for all } x$$
172
 
173
- 3. The total probability must sum to 1:
174
 
175
  $$\int_{-\infty}^{\infty} f(x) \, dx = 1$$
176
 
177
- 4. The probability that $X$ takes any specific exact value is 0:
178
 
179
  $$P(X = a) = \int_a^a f(x) \, dx = 0$$
180
 
181
- This last property highlights a key difference from discrete distributions: the probability of a continuous random variable taking on an exact value is always 0. Probabilities only make sense when talking about ranges of values.
182
-
183
- ### Caution: Density ≠ Probability
184
 
185
- A common misconception is to think of $f(x)$ as a probability. It is instead a **probability density**, representing probability per unit of $x$. The values of $f(x)$ can actually exceed 1, as long as the total area under the curve equals 1.
186
 
187
- The interpretation of $f(x)$ is only meaningful when:
188
 
189
- 1. We integrate over a range to get a probability, or
190
- 2. We compare densities at different points to determine relative likelihoods.
 
191
  """
192
  )
193
  return
@@ -665,16 +663,18 @@ def _(fig_to_image, mo, np, plt, sympy):
665
  # Detailed calculations for our example
666
  _calculations = mo.md(
667
  f"""
668
- ### Calculating Expectation and Variance for Our Example
669
 
670
- Let's calculate the expectation and variance for the PDF:
 
 
671
 
672
  $$f(x) = \\begin{{cases}}
673
  \\frac{{3}}{{8}}(4x - 2x^2) & \\text{{when }} 0 < x < 2 \\\\
674
  0 & \\text{{otherwise}}
675
  \\end{{cases}}$$
676
 
677
- #### Expectation Calculation
678
 
679
  $$E[X] = \\int_{{-\\infty}}^{{\\infty}} x \\cdot f(x) \\, dx = \\int_0^2 x \\cdot \\frac{{3}}{{8}}(4x - 2x^2) \\, dx$$
680
 
@@ -684,9 +684,9 @@ def _(fig_to_image, mo, np, plt, sympy):
684
 
685
  $$E[X] = \\frac{{3}}{{8}} \\cdot \\left(\\frac{{32}}{{3}} - 8\\right) = \\frac{{3}}{{8}} \\cdot \\frac{{8}}{{3}} = {E_X}$$
686
 
687
- #### Variance Calculation
688
 
689
- First, we need $E[X^2]$:
690
 
691
  $$E[X^2] = \\int_{{-\\infty}}^{{\\infty}} x^2 \\cdot f(x) \\, dx = \\int_0^2 x^2 \\cdot \\frac{{3}}{{8}}(4x - 2x^2) \\, dx$$
692
 
@@ -696,11 +696,11 @@ def _(fig_to_image, mo, np, plt, sympy):
696
 
697
  $$E[X^2] = \\frac{{3}}{{8}} \\cdot \\left(16 - \\frac{{64}}{{5}}\\right) = \\frac{{3}}{{8}} \\cdot \\frac{{16}}{{5}} = {E_X2}$$
698
 
699
- Now we can calculate the variance:
700
 
701
  $$\\text{{Var}}(X) = E[X^2] - (E[X])^2 = {E_X2} - ({E_X})^2 = {Var_X}$$
702
 
703
- Therefore, the standard deviation is $\\sqrt{{\\text{{Var}}(X)}} = {Std_X}$.
704
  """
705
  )
706
  mo.vstack([_img, _calculations])
@@ -765,11 +765,11 @@ def _(mo):
765
 
766
  Some key points to remember:
767
 
768
- PDFs give us relative likelihood, not actual probabilities - that's why they can exceed 1
769
- The probability between two points is the area under the PDF curve
770
- CDFs offer a convenient shortcut to find probabilities without integrating
771
- Expectation and variance work similarly to discrete variables, just with integrals instead of sums
772
- Constants in PDFs are determined by ensuring the total probability equals 1
773
 
774
  This foundation will serve you well as we explore specific continuous distributions like normal, exponential, and beta in future notebooks. These distributions are the workhorses of probability theory and statistics, appearing everywhere from quality control to financial modeling.
775
 
@@ -779,7 +779,7 @@ def _(mo):
779
  return
780
 
781
 
782
- @app.cell
783
  def _(mo):
784
  mo.md(r"""Appendix code (helper functions, variables, etc.):""")
785
  return
@@ -971,7 +971,6 @@ def _(np, plt, sympy):
971
  1. Total probability: ∫₀² {C}(4x - 2x²) dx = {total_prob}
972
  2. P(X > 1): ∫₁² {C}(4x - 2x²) dx = {prob_gt_1}
973
  """
974
-
975
  return create_example_pdf_visualization, symbolic_calculation
976
 
977
 
 
14
 
15
  import marimo
16
 
17
+ __generated_with = "0.12.6"
18
  app = marimo.App(width="medium")
19
 
20
 
 
26
 
27
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/continuous/), by Stanford professor Chris Piech._
28
 
29
+ Continuous distributions are what we need when dealing with random variables that can take any value in a range, rather than just discrete values.
30
+
31
+ The key difference here is that we work with probability density functions (PDFs) instead of probability mass functions (PMFs). It took me a while to really get this - the PDF at a point isn't actually a probability, but rather a density.
32
  """
33
  )
34
  return
 
40
  r"""
41
  ## From Discrete to Continuous
42
 
43
+ Making the jump from discrete to continuous random variables requires a fundamental shift in thinking. Let me walk you through a thought experiment:
 
 
44
 
45
+ > You're rushing to catch a bus. You know you'll arrive at 2:15pm, but the bus arrival time is uncertain. If you model the bus arrival time (in minutes past 2pm) as a random variable $T$, how would you calculate the probability of waiting less than five minutes: $P(15 < T < 20)$?
46
 
47
+ This highlights a crucial difference from discrete distributions. With discrete distributions, we calculated probabilities for exact values, but this approach breaks down with continuous values like time.
48
 
49
+ Consider these questions:
50
  - What's the probability the bus arrives at exactly 2:17pm and 12.12333911102389234 seconds?
51
+ - What's the probability a newborn weighs exactly 3.523112342234 kilograms?
52
 
53
+ These questions have no meaningful answers because continuous measurements can have infinite precision. In the continuous world, the probability of a random variable taking any specific exact value is actually zero!
 
 
54
 
55
  Let's visualize this transition from discrete to continuous:
56
  """
 
149
  r"""
150
  ## Probability Density Functions
151
 
152
+ While discrete random variables use Probability Mass Functions (PMFs), continuous random variables require a different approach: Probability Density Functions (PDFs).
153
 
154
+ A PDF defines the relative likelihood of a continuous random variable taking particular values. We typically denote this with $f$ and write it as:
155
 
156
  $$f(X=x) \quad \text{or simply} \quad f(x)$$
157
 
158
+ Where the lowercase $x$ represents a specific value our random variable $X$ might take.
159
 
160
  ### Key Properties of PDFs
161
 
162
+ For a PDF $f(x)$ to be valid, it must satisfy these properties:
163
 
164
+ 1. The probability that $X$ falls within interval $[a, b]$ is:
165
 
166
  $$P(a \leq X \leq b) = \int_a^b f(x) \, dx$$
167
 
168
+ 2. Non-negativity — the PDF can't be negative:
169
 
170
  $$f(x) \geq 0 \text{ for all } x$$
171
 
172
+ 3. Total probability equals 1:
173
 
174
  $$\int_{-\infty}^{\infty} f(x) \, dx = 1$$
175
 
176
+ 4. The probability of any exact value is zero:
177
 
178
  $$P(X = a) = \int_a^a f(x) \, dx = 0$$
179
 
180
+ This last property reveals a fundamental difference from discrete distributions: with continuous random variables, probabilities only make sense for ranges, not specific points.
 
 
181
 
182
+ ### Important Distinction: Density ≠ Probability
183
 
184
+ One common mistake is interpreting $f(x)$ as a probability. It's actually a **density** — representing probability per unit of $x$. This is why $f(x)$ values can exceed 1, provided the total area under the curve equals 1.
185
 
186
+ The true meaning of $f(x)$ emerges only when:
187
+ 1. We integrate over a range to obtain an actual probability, or
188
+ 2. We compare densities at different points to understand relative likelihoods.
189
  """
190
  )
191
  return
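These properties can also be checked numerically. The sketch below (using scipy's quadrature) does so for the example density $f(x) = \frac{3}{8}(4x - 2x^2)$ on $(0, 2)$ that appears later in this notebook:

```python
# Minimal sketch: numerically verify the PDF properties for
# f(x) = (3/8)(4x - 2x^2) on (0, 2), and 0 elsewhere.
from scipy import integrate

def f(x):
    return 3 / 8 * (4 * x - 2 * x**2) if 0 < x < 2 else 0.0

total, _ = integrate.quad(f, 0, 2)        # total probability (f is 0 outside (0, 2))
prob_1_to_2, _ = integrate.quad(f, 1, 2)  # P(1 <= X <= 2)
point_mass, _ = integrate.quad(f, 1, 1)   # P(X = 1), a single exact value

print(round(total, 4))        # 1.0 -> property 3
print(round(prob_1_to_2, 4))  # 0.5 -> property 1 gives an actual probability
print(point_mass)             # 0.0 -> property 4: exact values have probability 0
```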
 
663
  # Detailed calculations for our example
664
  _calculations = mo.md(
665
  f"""
666
+ ### Computing Expectation and Variance
667
 
668
+ > _Note:_ The following mathematical derivation is included as reference material. The credit for this approach belongs to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/continuous/) by Chris Piech.
669
+
670
+ Let's work through the calculations for our PDF:
671
 
672
  $$f(x) = \\begin{{cases}}
673
  \\frac{{3}}{{8}}(4x - 2x^2) & \\text{{when }} 0 < x < 2 \\\\
674
  0 & \\text{{otherwise}}
675
  \\end{{cases}}$$
676
 
677
+ #### Finding the Expectation
678
 
679
  $$E[X] = \\int_{{-\\infty}}^{{\\infty}} x \\cdot f(x) \\, dx = \\int_0^2 x \\cdot \\frac{{3}}{{8}}(4x - 2x^2) \\, dx$$
680
 
 
684
 
685
  $$E[X] = \\frac{{3}}{{8}} \\cdot \\left(\\frac{{32}}{{3}} - 8\\right) = \\frac{{3}}{{8}} \\cdot \\frac{{8}}{{3}} = {E_X}$$
686
 
687
+ #### Computing the Variance
688
 
689
+ We first need $E[X^2]$:
690
 
691
  $$E[X^2] = \\int_{{-\\infty}}^{{\\infty}} x^2 \\cdot f(x) \\, dx = \\int_0^2 x^2 \\cdot \\frac{{3}}{{8}}(4x - 2x^2) \\, dx$$
692
 
 
696
 
697
  $$E[X^2] = \\frac{{3}}{{8}} \\cdot \\left(16 - \\frac{{64}}{{5}}\\right) = \\frac{{3}}{{8}} \\cdot \\frac{{16}}{{5}} = {E_X2}$$
698
 
699
+ Now we calculate variance using the formula $Var(X) = E[X^2] - (E[X])^2$:
700
 
701
  $$\\text{{Var}}(X) = E[X^2] - (E[X])^2 = {E_X2} - ({E_X})^2 = {Var_X}$$
702
 
703
+ This gives us a standard deviation of $\\sqrt{{\\text{{Var}}(X)}} = {Std_X}$.
704
  """
705
  )
706
  mo.vstack([_img, _calculations])
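As a quick symbolic cross-check of the numbers above (assuming sympy, which this notebook already imports), a minimal sketch:

```python
# Minimal sketch: compute E[X], E[X^2], and Var(X) symbolically for
# f(x) = (3/8)(4x - 2x^2) on (0, 2).
import sympy

x = sympy.symbols("x")
f = sympy.Rational(3, 8) * (4 * x - 2 * x**2)

E_X = sympy.integrate(x * f, (x, 0, 2))       # 1
E_X2 = sympy.integrate(x**2 * f, (x, 0, 2))   # 6/5
Var_X = E_X2 - E_X**2                         # 1/5

print(E_X, E_X2, Var_X)
```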
 
765
 
766
  Some key points to remember:
767
 
768
+ - PDFs give us relative likelihood, not actual probabilities - that's why they can exceed 1
769
+ - The probability between two points is the area under the PDF curve
770
+ - CDFs offer a convenient shortcut to find probabilities without integrating
771
+ - Expectation and variance work similarly to discrete variables, just with integrals instead of sums
772
+ - Constants in PDFs are determined by ensuring the total probability equals 1
773
 
774
  This foundation will serve you well as we explore specific continuous distributions like normal, exponential, and beta in future notebooks. These distributions are the workhorses of probability theory and statistics, appearing everywhere from quality control to financial modeling.
775
 
 
779
  return
780
 
781
 
782
+ @app.cell(hide_code=True)
783
  def _(mo):
784
  mo.md(r"""Appendix code (helper functions, variables, etc.):""")
785
  return
 
971
  1. Total probability: ∫₀² {C}(4x - 2x²) dx = {total_prob}
972
  2. P(X > 1): ∫₁² {C}(4x - 2x²) dx = {prob_gt_1}
973
  """
 
974
  return create_example_pdf_visualization, symbolic_calculation
975
 
976
 
probability/18_central_limit_theorem.py CHANGED
@@ -6,12 +6,13 @@
6
  # "scipy==1.15.2",
7
  # "numpy==2.2.4",
8
  # "plotly==5.18.0",
 
9
  # ]
10
  # ///
11
 
12
  import marimo
13
 
14
- __generated_with = "0.11.30"
15
  app = marimo.App(width="medium", app_title="Central Limit Theorem")
16
 
17
 
@@ -23,7 +24,20 @@ def _(mo):
23
 
24
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part4/clt/), by Stanford professor Chris Piech._
25
 
26
- The Central Limit Theorem (CLT) is one of the most important concepts in probability theory and statistics. It explains why many real-world distributions tend to be normal, even when the underlying processes are not.
 
 
 
 
 
 
 
 
 
 
 
 
 
27
  """
28
  )
29
  return
@@ -41,7 +55,7 @@ def _(mo):
41
 
42
  Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The sum of these random variables approaches a normal distribution as $n \rightarrow \infty$:
43
 
44
- $n∑i=1Xi∼N(n⋅μ,n⋅σ2)\sum_{i=1}^{n}X_i \sim \mathcal{N}(n \cdot \mu, n \cdot \sigma^2)$
45
 
46
  Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$. Since each $X_i$ is identically distributed, they share the same expectation and variance.
47
 
@@ -49,7 +63,7 @@ def _(mo):
49
 
50
  Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The average of these random variables approaches a normal distribution as $n \rightarrow \infty$:
51
 
52
- $\frac{1}{n} ∑i=1Xi∼N(μ,σ2n)\frac{1}{n}\sum_{i=1}^{n}X_i \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$
53
 
54
  Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$.
55
 
@@ -311,41 +325,37 @@ def _(mo):
311
  r"""
312
  ### Example 1: Dice Game
313
 
314
- You will roll a 6-sided dice 10 times. Let $X$ be the total value of all 10 dice: $X = X_1 + X_2 + \dots + X_{10}$. You win the game if $X \leq 25$ or $X \geq 45$. Use the central limit theorem to calculate the probability that you win.
315
 
316
- Recall that for a single die roll $X_i$:
317
 
 
318
  - $E[X_i] = 3.5$
319
  - $\text{Var}(X_i) = \frac{35}{12}$
320
 
321
- **Solution:**
322
-
323
- Let $Y$ be the approximating normal distribution. By the Central Limit Theorem:
324
-
325
- $Y∼N(10⋅E[Xi],10⋅Var(Xi))Y \sim \mathcal{N}(10 \cdot E[X_i], 10 \cdot \text{Var}(X_i))$
326
 
327
- Substituting in the known values:
328
 
329
- $Y∼N(10⋅3.5,10⋅3512)=N(35,29.2)Y \sim \mathcal{N}(10 \cdot 3.5, 10 \cdot \frac{35}{12}) = \mathcal{N}(35, 29.2)$
330
 
331
- Now we calculate the probability:
332
 
333
- $P(X25 or X45)P(X \leq 25 \text{ or } X \geq 45)$
334
 
335
- $=P(X≤25)+P(X≥45)= P(X \leq 25) + P(X \geq 45)$
336
 
337
- $≈P(Y<25.5)+P(Y>44.5) (Continuity Correction)\approx P(Y < 25.5) + P(Y > 44.5) \text{ (Continuity Correction)}$
338
 
339
- $≈P(Y<25.5)+[1−P(Y<44.5)]\approx P(Y < 25.5) + [1 - P(Y < 44.5)]$
340
 
341
- $≈Φ(25.5−35√29.2)+[1−Φ(44.5−35√29.2)]\approx \Phi\left(\frac{25.5 - 35}{\sqrt{29.2}}\right) + \left[1 - \Phi\left(\frac{44.5 - 35}{\sqrt{29.2}}\right)\right]$
342
 
343
- $≈Φ(−1.76)+[1−Φ(1.76)]\approx \Phi(-1.76) + [1 - \Phi(1.76)]$
344
 
345
- $≈0.039+(1−0.961)\approx 0.039 + (1 - 0.961)$
346
 
347
- $≈0.078\approx 0.078$
348
- So, the probability of winning the game is approximately 7.8%.
349
  """
350
  )
351
  return
@@ -359,17 +369,17 @@ def _(create_dice_game_visualization, fig_to_image, mo):
359
 
360
  dice_explanation = mo.md(
361
  r"""
362
- **Visualization Explanation:**
363
 
364
- The graph shows the distribution of the sum of 10 dice rolls. The blue bars represent the actual probability mass function (PMF), while the red curve shows the normal approximation using the Central Limit Theorem.
365
 
366
- The winning regions are shaded in orange:
367
  - The left region where $X \leq 25$
368
  - The right region where $X \geq 45$
369
 
370
- The total probability of these regions is approximately 0.078 or 7.8%.
371
 
372
- Notice how the normal approximation provides a good fit to the discrete distribution, demonstrating the power of the Central Limit Theorem.
373
  """
374
  )
375
 
@@ -383,50 +393,45 @@ def _(mo):
383
  r"""
384
  ### Example 2: Algorithm Runtime Estimation
385
 
386
- Say you have a new algorithm and you want to test its running time. You know the variance of the algorithm's run time is $\sigma^2 = 4 \text{ sec}^2$, but you want to estimate the mean run time $t$ in seconds.
 
 
387
 
388
- You can run the algorithm repeatedly (IID trials). How many trials do you have to run so that your estimated runtime is within ±0.5 seconds of $t$ with 95% certainty?
389
 
390
- Let $X_i$ be the run time of the $i$-th run (for $1 \leq i \leq n$).
391
 
392
  **Solution:**
393
 
394
  We need to find $n$ such that:
395
 
396
- $0.95=P(−0.5≤∑ni=1Xin−t≤0.5)0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right)$
397
-
398
- By the central limit theorem, the sample mean follows a normal distribution.
399
- We can standardize this to work with the standard normal:
400
-
401
- $Z=(∑ni=1Xi)−nμσ√nZ = \frac{\left(\sum_{i=1}^n X_i\right) - n\mu}{\sigma \sqrt{n}}$
402
 
403
- $=(∑ni=1Xi)−nt2√n= \frac{\left(\sum_{i=1}^n X_i\right) - nt}{2 \sqrt{n}}$
404
 
405
- Rewriting our probability inequality so that the central term is $Z$:
406
 
407
- $0.95=P(−0.5≤∑ni=1Xin−t≤0.5)0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right)$
408
 
409
- $=P(0.5√n2≤Z≤0.5√n2)= P\left(\frac{-0.5 \sqrt{n}}{2} \leq Z \leq \frac{0.5 \sqrt{n}}{2}\right)$
410
 
411
- And now we find the value of $n$ that makes this equation hold:
412
 
413
- $0.95=Φ(√n4)−Φ(−√n4)0.95 = \Phi\left(\frac{\sqrt{n}}{4}\right) - \Phi\left(-\frac{\sqrt{n}}{4}\right)$
414
-
415
- $4=Φ(√n4)−(1−Φ(√n4))= \Phi\left(\frac{\sqrt{n}}{4}\right) - \left(1 - \Phi\left(\frac{\sqrt{n}}{4}\right)\right)$
416
-
417
- $=2Φ(√n4)−1= 2\Phi\left(\frac{\sqrt{n}}{4}\right) - 1$
418
 
419
  Solving for $\Phi\left(\frac{\sqrt{n}}{4}\right)$:
420
 
421
- $0.975=Φ(√n4)0.975 = \Phi\left(\frac{\sqrt{n}}{4}\right)$
422
 
423
- $Φ−1(0.975)=√n4\Phi^{-1}(0.975) = \frac{\sqrt{n}}{4}$
424
 
425
- $1.96=√n41.96 = \frac{\sqrt{n}}{4}$
426
 
427
- $n=61.4n = 61.4$
428
 
429
- Therefore, we need to run the algorithm 62 times to estimate the mean runtime within ±0.5 seconds with 95% confidence.
 
 
430
  """
431
  )
432
  return
@@ -929,7 +934,6 @@ def _(mo):
929
  mo.vstack([distribution_type, sample_size, sim_count_slider]),
930
  run_explorer_button
931
  ], justify='space-around')
932
-
933
  return (
934
  controls,
935
  distribution_type,
 
6
  # "scipy==1.15.2",
7
  # "numpy==2.2.4",
8
  # "plotly==5.18.0",
9
+ # "wigglystuff==0.1.13",
10
  # ]
11
  # ///
12
 
13
  import marimo
14
 
15
+ __generated_with = "0.12.6"
16
  app = marimo.App(width="medium", app_title="Central Limit Theorem")
17
 
18
 
 
24
 
25
  _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part4/clt/), by Stanford professor Chris Piech._
26
 
27
+ The central limit theorem is honestly mind-blowing: no matter what distribution you start with, the sampling distribution of the mean approaches a normal distribution as the sample size increases.
28
+
29
+ Mathematically, if we have:
30
+
31
+ $X_1, X_2, \ldots, X_n$ as independent, identically distributed random variables with:
32
+
33
+ - Mean: $\mu$
34
+ - Variance: $\sigma^2 < \infty$
35
+
36
+ Then as $n \to \infty$:
37
+
38
+ $$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}X_i - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$
39
+
40
+ > _Note:_ The above LaTeX derivation is included as a reference. Credit for this formulation goes to the original source linked at the top of the notebook.
41
  """
42
  )
43
  return
 
55
 
56
  Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The sum of these random variables approaches a normal distribution as $n \rightarrow \infty$:
57
 
58
+ $\sum_{i=1}^{n}X_i \sim \mathcal{N}(n \cdot \mu, n \cdot \sigma^2)$
59
 
60
  Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$. Since each $X_i$ is identically distributed, they share the same expectation and variance.
61
 
 
63
 
64
  Let $X_1, X_2, \dots, X_n$ be independent and identically distributed random variables. The average of these random variables approaches a normal distribution as $n \rightarrow \infty$:
65
 
66
+ $\frac{1}{n}\sum_{i=1}^{n}X_i \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$
67
 
68
  Where $\mu = E[X_i]$ and $\sigma^2 = \text{Var}(X_i)$.
69
 
 
325
  r"""
326
  ### Example 1: Dice Game
327
 
328
+ > _Note:_ The following application demonstrates the practical use of the Central Limit Theorem. The mathematical derivation is based on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part4/clt/) by Chris Piech.
329
 
330
+ Let's solve a fun probability problem: You roll a 6-sided die 10 times and let $X$ represent the total value of all 10 dice: $X = X_1 + X_2 + \dots + X_{10}$. You win if $X \leq 25$ or $X \geq 45$. What's your probability of winning?
331
 
332
+ For a single die roll $X_i$, we know:
333
  - $E[X_i] = 3.5$
334
  - $\text{Var}(X_i) = \frac{35}{12}$
335
 
336
+ **Solution Approach:**
 
 
 
 
337
 
338
+ This is where the Central Limit Theorem shines! Since we're summing 10 independent, identically distributed random variables, we can approximate this sum with a normal distribution $Y$:
339
 
340
+ $Y \sim \mathcal{N}(10 \cdot E[X_i], 10 \cdot \text{Var}(X_i)) = \mathcal{N}(35, 29.2)$
341
 
342
+ Now calculating our winning probability:
343
 
344
+ $P(X \leq 25 \text{ or } X \geq 45) = P(X \leq 25) + P(X \geq 45)$
345
 
346
+ Since we're approximating a discrete distribution with a continuous one, we apply a continuity correction:
347
 
348
+ $\approx P(Y < 25.5) + P(Y > 44.5) = P(Y < 25.5) + [1 - P(Y < 44.5)]$
349
 
350
+ Converting to standard normal form:
351
 
352
+ $\approx \Phi\left(\frac{25.5 - 35}{\sqrt{29.2}}\right) + \left[1 - \Phi\left(\frac{44.5 - 35}{\sqrt{29.2}}\right)\right]$
353
 
354
+ $\approx \Phi(-1.76) + [1 - \Phi(1.76)]$
355
 
356
+ $\approx 0.039 + (1 - 0.961) \approx 0.078$
357
 
358
+ So your chance of winning is about 7.8% — not great odds, but that's probability for you!
 
359
  """
360
  )
361
  return
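The same answer can be reproduced numerically; here is a minimal sketch using scipy's normal distribution:

```python
# Minimal sketch: normal approximation (with continuity correction) for the
# probability of winning the dice game above.
import math
from scipy import stats

mu = 10 * 3.5              # E[X] for the sum of 10 dice
var = 10 * 35 / 12         # Var(X) for the sum of 10 dice
y = stats.norm(mu, math.sqrt(var))

# P(X <= 25) ~ P(Y < 25.5) and P(X >= 45) ~ P(Y > 44.5)
p_win = y.cdf(25.5) + (1 - y.cdf(44.5))
print(round(p_win, 3))     # approximately 0.078
```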
 
369
 
370
  dice_explanation = mo.md(
371
  r"""
372
+ **Understanding the Visualization:**
373
 
374
+ This graph shows our dice game in action. The blue bars represent the exact probability distribution for summing 10 dice, while the red curve shows our normal approximation from the Central Limit Theorem.
375
 
376
+ I've highlighted the winning regions in orange:
377
  - The left region where $X \leq 25$
378
  - The right region where $X \geq 45$
379
 
380
+ Together these regions cover about 7.8% of the total probability.
381
 
382
+ What's fascinating here is how closely the normal curve approximates the actual discrete distribution: this is the Central Limit Theorem working its magic, even with just 10 random variables.
383
  """
384
  )
385
 
 
393
  r"""
394
  ### Example 2: Algorithm Runtime Estimation
395
 
396
+ > _Note:_ The following derivation demonstrates the practical application of the Central Limit Theorem for experimental design. The mathematical approach is based on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part4/clt/) by Chris Piech.
397
+
398
+ Here's a practical problem I encounter in performance testing: You've developed a new algorithm and want to measure its average runtime. You know the variance is $\sigma^2 = 4 \text{ sec}^2$, but need to estimate the true mean runtime $t$.
399
 
400
+ The question: How many test runs do you need to be 95% confident your estimated mean is within ±0.5 seconds of the true value?
401
 
402
+ Let $X_i$ represent the runtime of the $i$-th test (for $1 \leq i \leq n$).
403
 
404
  **Solution:**
405
 
406
  We need to find $n$ such that:
407
 
408
+ $0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right)$
 
 
 
 
 
409
 
410
+ The Central Limit Theorem tells us that as $n$ increases, the sample mean approaches a normal distribution. Let's standardize this to work with the standard normal distribution:
411
 
412
+ $Z = \frac{\left(\sum_{i=1}^n X_i\right) - n\mu}{\sigma \sqrt{n}} = \frac{\left(\sum_{i=1}^n X_i\right) - nt}{2 \sqrt{n}}$
413
 
414
+ Rewriting our probability constraint in terms of $Z$:
415
 
416
+ $0.95 = P\left(-0.5 \leq \frac{\sum_{i=1}^n X_i}{n} - t \leq 0.5\right) = P\left(\frac{-0.5 \sqrt{n}}{2} \leq Z \leq \frac{0.5 \sqrt{n}}{2}\right)$
417
 
418
+ Using the properties of the standard normal CDF:
419
 
420
+ $0.95 = \Phi\left(\frac{\sqrt{n}}{4}\right) - \Phi\left(-\frac{\sqrt{n}}{4}\right) = 2\Phi\left(\frac{\sqrt{n}}{4}\right) - 1$
 
 
 
 
421
 
422
  Solving for $\Phi\left(\frac{\sqrt{n}}{4}\right)$:
423
 
424
+ $0.975 = \Phi\left(\frac{\sqrt{n}}{4}\right)$
425
 
426
+ Using the inverse CDF:
427
 
428
+ $\Phi^{-1}(0.975) = \frac{\sqrt{n}}{4}$
429
 
430
+ $1.96 = \frac{\sqrt{n}}{4}$
431
 
432
+ $n = 61.4$
433
+
434
+ Rounding up, we need 62 test runs to achieve our desired confidence interval — a practical result we can immediately apply to our testing protocol.
435
  """
436
  )
437
  return
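The last few steps can also be done numerically; a minimal sketch using the inverse normal CDF from scipy:

```python
# Minimal sketch: solve for the number of runs n needed for a 95% confidence
# interval of half-width 0.5 seconds when the runtime variance is 4 sec^2.
import math
from scipy import stats

sigma = 2.0          # standard deviation, sqrt(4 sec^2)
half_width = 0.5     # desired +/- precision in seconds
confidence = 0.95

z = stats.norm.ppf(1 - (1 - confidence) / 2)   # about 1.96
n = (z * sigma / half_width) ** 2
print(round(n, 1), math.ceil(n))               # roughly 61.5, so 62 runs
```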
 
934
  mo.vstack([distribution_type, sample_size, sim_count_slider]),
935
  run_explorer_button
936
  ], justify='space-around')
 
937
  return (
938
  controls,
939
  distribution_type,
probability/19_maximum_likelihood_estimation.py CHANGED
@@ -133,6 +133,8 @@ def _(mo):
133
  r"""
134
  ## MLE for Bernoulli Distribution
135
 
 
 
136
  Let's start with a simple example: estimating the parameter $p$ of a Bernoulli distribution.
137
 
138
  ### The Model
 
133
  r"""
134
  ## MLE for Bernoulli Distribution
135
 
136
+ > _Note:_ The following derivation is included as reference material. The credit for this mathematical formulation belongs to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part5/mle/) by Chris Piech.
137
+
138
  Let's start with a simple example: estimating the parameter $p$ of a Bernoulli distribution.
139
 
140
  ### The Model