We start with one typed object:
\[z : \mathcal{Z}\]An observation. A data point. One image, one text sequence, one patient record, one row in a table.
The entire enterprise begins here:
\[\boxed{\text{use old } z\text{'s to handle future } z\text{'s}}\]“Handle” may mean predict, compress, classify, cluster, generate, rank, decide, triage. The specific task determines what handling means. But the primitive is always $z$.
To handle observations we need a model. A model takes an observation and produces an output:
\[f_\theta : \mathcal{Z} \to \mathcal{A}\]where $\mathcal{A}$ is the output type and $a : \mathcal{A}$ is a single output. For hard prediction $\mathcal{A} = \mathcal{Y}$, the space of targets. For probabilistic prediction $\mathcal{A} = \Delta(\mathcal{Y})$, the space of distributions over $\mathcal{Y}$.
The model is parameterized by:
\[\theta : \Theta\]the learnable settings. $f_\theta$ is one model from a family indexed by $\Theta$.
Now we need to say what makes an output good or bad. That is the loss:
\[L : \mathcal{Z} \times \mathcal{A} \to \mathbb{R}\]So $L(z, f_\theta(z))$ means: how bad was the model’s output on this observation. Nothing more. The loss is the definition of what we care about. It is not derived from anything deeper.
\[\boxed{\text{loss is the game}}\]We have old observations $z_1, \dots, z_n$. Future observations are uncertain. To speak precisely about them we need probability.
But first: what is probability?
There are two honest answers.
The frequentist answer: probability is long-run frequency. If we drew observations repeatedly, the fraction falling in any region would converge to $P$ of that region. Under this view $Z \sim P$ means nature draws future observations according to a fixed world distribution $P$.
The Bayesian answer: probability is degree of belief. It applies to any uncertain quantity — not just repeatable events. Under this view probability expresses what we know, not what would happen under repetition. Different beliefs, different probabilities.
Both interpretations obey identical mathematical rules. The object in both cases is a probability measure: a function assigning a non-negative number to each (measurable) subset of $\mathcal{Z}$, with total mass one and additivity over disjoint sets:
\[P(A) \geq 0, \quad P(\mathcal{Z}) = 1, \quad P\Big(\bigcup_k A_k\Big) = \sum_k P(A_k) \ \text{for disjoint } A_k\]For discrete $\mathcal{Z}$, $P$ assigns mass to individual points. For continuous $\mathcal{Z} = \mathbb{R}^d$, $P$ is described by a density $p(z)$ when one exists, and individual points then have zero mass. The Radon-Nikodym theorem says this density exists, and is called the Radon-Nikodym derivative $dP/d\lambda$, precisely when $P$ is absolutely continuous with respect to Lebesgue measure $\lambda$: $P$ puts no mass on sets of Lebesgue measure zero.
The Lebesgue integral unifies discrete and continuous into one notation:
\[\mathbb{E}_{Z \sim P}[g(Z)] = \int g(z) \, dP(z)\]This works whether $P$ has a density or not. When a density exists it becomes $\int g(z) p(z) dz$. When $P$ is discrete it becomes $\sum_z g(z) P(z)$. The concept — average of $g$ under $P$ — is the same in both cases.
A random variable $Z \sim P$ is a quantity whose value is uncertain, governed by $P$. Old observations $z_i$ are realized values — fixed draws from the past. Future observations are still random.
\[\boxed{z_i = \text{realized, fixed}} \qquad \boxed{Z = \text{random, } \sim P}\]The distinction between frequentist and Bayesian becomes operationally important when probability is placed on parameters. $P(Z)$ is naturally frequentist — observations are drawn from the world. $P(\theta)$ is necessarily Bayesian — parameters are fixed unknowns, not outcomes of a random process. Assigning probability to $\theta$ is a statement of belief, not frequency.
With probability in hand, the real objective is clear.
Training loss on old data:
\[\hat R_n(\theta) = \frac{1}{n} \sum_i L(z_i, f_\theta(z_i))\]But we do not care about old observations. We care about future ones. The true objective is:
\[R(\theta) = \mathbb{E}_{Z \sim P}[L(Z, f_\theta(Z))] = \int L(z, f_\theta(z)) \, dP(z)\]This is future risk — average badness under the world distribution.
The empirical distribution $\hat P_n$ puts mass $1/n$ on each observed $z_i$:
\[\hat P_n = \frac{1}{n} \sum_i \delta_{z_i}\]Then training loss is exactly expected loss under $\hat P_n$:
\[\hat R_n(\theta) = \mathbb{E}_{Z \sim \hat P_n}[L(Z, f_\theta(Z))]\]So both objectives are the same kind of object — expected loss under a distribution — just different distributions:
\[R(\theta) = \mathbb{E}_P[L] \qquad \hat R_n(\theta) = \mathbb{E}_{\hat P_n}[L]\]Learning means choosing $\theta$ to minimize the computable proxy:
\[\hat\theta = \arg\min_{\theta \in \Theta} \hat R_n(\theta)\] \[\boxed{\text{learning = minimizing expected loss under } \hat P_n \text{ as proxy for } P}\]The whole difficulty of learning lives in one inequality:
\[\hat P_n \neq P\]Finite data differs from the world. Therefore:
\[\hat R_n(\theta) \neq R(\theta)\]And when we minimize $\hat R_n$, we are not minimizing $R$. The gap:
\[R(\hat\theta) - \hat R_n(\hat\theta)\]is the generalization gap.
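The gap can be computed by hand in a toy case. The sketch below is illustrative only: it assumes a constant model $f_\theta(z) = \theta$, squared loss $L(z, a) = (z - a)^2$, and a world distribution $P = \mathcal{N}(0, 1)$ chosen so that $R(\theta) = 1 + \theta^2$ is known in closed form. The function names are made up for this example.

```python
import random

def empirical_risk(theta, data):
    """Training loss: mean of L(z, a) = (z - a)^2 with constant output a = theta."""
    return sum((z - theta) ** 2 for z in data) / len(data)

def true_risk(theta, mu=0.0, sigma=1.0):
    """Future risk under P = Normal(mu, sigma^2): E[(Z - theta)^2] = sigma^2 + (theta - mu)^2."""
    return sigma ** 2 + (theta - mu) ** 2

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20)]   # old observations z_1..z_n

theta_hat = sum(data) / len(data)    # for squared loss the empirical risk minimizer is the sample mean
train = empirical_risk(theta_hat, data)
future = true_risk(theta_hat)
gap = future - train                 # generalization gap R(theta_hat) - R_n(theta_hat)
print(f"theta_hat={theta_hat:.3f} train={train:.3f} future={future:.3f} gap={gap:.3f}")
```

On a single dataset the gap can have either sign. Averaged over datasets it is positive, because $\hat\theta$ was tuned to the sample it is scored on.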
Why does it exist? Because $\hat P_n$ is a random object — a different dataset gives a different $\hat P_n$, a different $\hat R_n$, possibly a different $\hat\theta$. For fixed $\theta$, the training loss is an average of random quantities:
\[\hat R_n(\theta) = \frac{1}{n} \sum_i L(z_i, f_\theta(z_i))\]Averages fluctuate, but fluctuate less with more data. For independent draws with finite loss variance, the typical error shrinks like $1/\sqrt{n}$.
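A quick simulation makes the $1/\sqrt{n}$ rate visible. This is a sketch under assumed conditions: observations drawn independently from $\mathcal{N}(0, 1)$, a fixed $\theta = 0$, and squared loss. Quadrupling $n$ should roughly halve the spread of $\hat R_n(\theta)$ across datasets.

```python
import random
import statistics

def training_loss(theta, data):
    """R_n(theta) for squared loss with constant output theta."""
    return sum((z - theta) ** 2 for z in data) / len(data)

def spread_of_training_loss(n, theta=0.0, trials=1000, seed=0):
    """Standard deviation of R_n(theta) across many independent datasets of size n."""
    rng = random.Random(seed)
    losses = []
    for _ in range(trials):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        losses.append(training_loss(theta, data))
    return statistics.stdev(losses)

spread_small = spread_of_training_loss(n=50, seed=0)
spread_large = spread_of_training_loss(n=200, seed=1)   # 4x the data
ratio = spread_small / spread_large                     # 1/sqrt(n) scaling predicts about 2
print(f"spread(n=50)={spread_small:.3f} spread(n=200)={spread_large:.3f} ratio={ratio:.2f}")
```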
\[\boxed{\text{sampling variation = finite-data wiggle in } \hat P_n}\]If the model family $\Theta$ is very flexible, $\hat\theta$ can exploit this wiggle — making $\hat R_n(\hat\theta)$ very small while $R(\hat\theta)$ stays large. The model learns accidents of $\hat P_n$ rather than the true pattern of $P$.
\[\boxed{\text{overfitting = exploiting sampling variation in } \hat P_n}\]The response to overfitting is regularization: add a complexity penalty to the objective:
\[\hat\theta = \arg\min_{\theta \in \Theta} \left[ \hat R_n(\theta) + \lambda C(\theta) \right]\]where $C(\theta)$ penalizes complexity and $\lambda$ controls the strength. The goal is not lower training loss but lower future risk.
Parameter penalties act on $\theta$ directly: $L_2$ with $C(\theta) = \sum_j \theta_j^2$ shrinks weights smoothly, $L_1$ with $C(\theta) = \sum_j |\theta_j|$ pushes some weights exactly to zero. Procedural constraints such as early stopping and data augmentation limit complexity through the training process itself.
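The difference between the two penalties shows up already with a single weight. The sketch below assumes a one-parameter model $f_\theta(x) = \theta x$ with squared loss, where both penalized minimizers have closed forms. The data and function names are made up for illustration.

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of (1/n) * sum (y - theta*x)^2 + lam * theta^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

def lasso_1d(xs, ys, lam):
    """Minimizer of (1/n) * sum (y - theta*x)^2 + lam * |theta| (soft thresholding)."""
    n = len(xs)
    a = sum(x * x for x in xs) / n
    b = sum(x * y for x, y in zip(xs, ys)) / n
    shrunk = max(abs(b) - lam / 2.0, 0.0)    # pull |b| toward zero, stop at zero
    return (shrunk if b >= 0 else -shrunk) / a

xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.9, 2.1, 2.9, 4.2]
for lam in (0.0, 1.0, 10.0, 20.0):
    print(lam, round(ridge_1d(xs, ys, lam), 4), round(lasso_1d(xs, ys, lam), 4))
```

As $\lambda$ grows, the $L_2$ solution shrinks but stays nonzero, while the $L_1$ solution reaches exactly zero once $\lambda$ is large enough.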
Regularization is not free. It introduces bias: the model is prevented from fitting even some real patterns. The tradeoff is:
Bias — how far the best model in $\Theta$ is from the true optimum. Restricted families have high bias.
Variance — how much $\hat\theta$ wiggles across datasets. Flexible families have high variance.
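\[\mathbb{E}[R(\hat\theta)] - R^* = \underbrace{\inf_{\theta \in \Theta} R(\theta) - R^*}_{\text{bias}} + \underbrace{\mathbb{E}[R(\hat\theta)] - \inf_{\theta \in \Theta} R(\theta)}_{\text{variance}}\]where $R^*$ is the lowest risk achievable by any predictor, inside or outside the family, and the variance term is the risk cost of the wiggle in $\hat\theta$. \[\boxed{\text{flexible} \implies \text{low bias, high variance}} \qquad \boxed{\text{restricted} \implies \text{high bias, low variance}}\]Regularization shifts the tradeoff. Finding the right $\lambda$ is the art.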
So far $P$ appears only as what we average under. Now we make it a direct input.
A functional takes a distribution and returns a number:
\[T : \Delta(\mathcal{Z}) \to \mathbb{R}\]Examples:
\[T(P) = \mathbb{E}_P[Z] \quad \text{mean}\] \[T(P) = \mathbb{E}_P[(Z - \mathbb{E}_P[Z])^2] \quad \text{variance}\] \[T(P) = \inf\{z : P(Z \leq z) \geq q\} \quad \text{quantile}\] \[T(P) = -\mathbb{E}_P[\log P(Z)] \quad \text{entropy}\]These are descriptions of $P$ itself, not actions on individual observations. They live one level up from $L$.
The plug-in principle connects functionals to data:
\[T(P) \approx T(\hat P_n)\]Replace the unknown $P$ with the known $\hat P_n$ everywhere. This gives:
\[T(\hat P_n) = \frac{1}{n}\sum_i z_i \quad \text{sample mean}\] \[T(\hat P_n) = \text{order statistic} \quad \text{sample quantile}\] \[T(\hat P_n) = -\frac{1}{n}\sum_i \log \hat P_n(z_i) \quad \text{empirical entropy (discrete } \mathcal{Z}\text{)}\]One principle. One plug-in. Much of descriptive and nonparametric statistics falls out.
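Written as code, each plug-in estimate is the functional evaluated on the distribution that puts mass $1/n$ on each data point. A minimal sketch, with illustrative function names and made-up data; the entropy version assumes discrete observations:

```python
import math
from collections import Counter

def plug_in_mean(data):
    return sum(data) / len(data)

def plug_in_variance(data):
    """Variance under P_n: divides by n, not n - 1."""
    m = plug_in_mean(data)
    return sum((z - m) ** 2 for z in data) / len(data)

def plug_in_quantile(data, q):
    """inf{z : P_n(Z <= z) >= q}, which is always an order statistic."""
    ordered = sorted(data)
    k = max(math.ceil(q * len(ordered)), 1)   # smallest k with k/n >= q
    return ordered[k - 1]

def plug_in_entropy(data):
    """-(1/n) * sum_i log P_n(z_i) for discrete data, in nats."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log(c / n) for c in counts.values())

data = [2, 3, 3, 5, 7, 7, 7, 9]
print(plug_in_mean(data), plug_in_variance(data),
      plug_in_quantile(data, 0.5), plug_in_entropy(data))
```

Note that the plug-in variance divides by $n$, so it differs from the usual unbiased sample variance by a factor of $(n-1)/n$.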
$T(\hat P_n)$ is itself random — it wiggles across datasets. How much it wiggles is the central question of statistical inference.
\[\boxed{T(P) = \text{truth}} \qquad \boxed{T(\hat P_n) = \text{estimate}} \qquad \boxed{\text{wiggle of } T(\hat P_n) = \text{the statistical problem}}\]So far the model $f_\theta$ produces outputs. Now we flip perspective: the model generates observations.
\[P_\theta(Z)\]is a distribution over $\mathcal{Z}$ indexed by $\theta$. For independent observations, the likelihood of $\theta$ given the data is:
\[\mathcal{L}(\theta) = \prod_i P_\theta(z_i)\]This is not a distribution over $\theta$. It is a function of $\theta$ for fixed data — asking: how probable were these observations if $\theta$ were true?
Maximum likelihood estimation maximizes this:
\[\hat\theta_{\text{MLE}} = \arg\max_\theta \prod_i P_\theta(z_i) = \arg\min_\theta \sum_i -\log P_\theta(z_i)\]So MLE is exactly minimizing log loss $L(z, P_\theta) = -\log P_\theta(z)$. It is not a separate idea. It is a specific choice of loss — the one that asks the model to assign high probability to what actually happened.
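A small check of this equivalence, assuming a Bernoulli model $P_\theta(1) = \theta$ and a made-up sample of coin flips. A grid search stands in for the optimizer; the names are illustrative.

```python
import math

def log_loss(theta, data):
    """Sum over i of -log P_theta(z_i) for a Bernoulli model with P_theta(1) = theta."""
    return -sum(math.log(theta if z == 1 else 1.0 - theta) for z in data)

def likelihood(theta, data):
    """Product over i of P_theta(z_i)."""
    return math.prod(theta if z == 1 else 1.0 - theta for z in data)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]        # 7 ones out of 10
grid = [k / 100 for k in range(1, 100)]       # candidate thetas strictly inside (0, 1)

theta_min_loss = min(grid, key=lambda t: log_loss(t, data))
theta_max_lik = max(grid, key=lambda t: likelihood(t, data))
print(theta_min_loss, theta_max_lik, sum(data) / len(data))
```

Both searches land on the same $\theta$, which for the Bernoulli model is the fraction of ones in the sample.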
\[\boxed{\text{MLE} = \arg\min \text{ log loss} = \arg\max \text{ likelihood}}\]Why is $-\log P(z)$ the natural measure of surprise?
We want a function of how unexpected $z$ is. It should give zero surprise to certain events, high surprise to rare ones, and add across independent events. Up to the choice of logarithm base, the only function of $P(z)$ that does this continuously is:
\[\text{surprise}(z) = -\log P(z)\]Entropy is the expected surprise under $P$:
\[H(P) = \mathbb{E}_{Z \sim P}[-\log P(Z)]\]It measures average uncertainty. It is zero for a point mass and, on a finite $\mathcal{Z}$, largest for the uniform distribution.
When the true distribution is $P$ but we use $Q$ to describe it, the KL divergence measures the extra surprise incurred:
\[D_{\text{KL}}(P \| Q) = \mathbb{E}_{Z \sim P}\left[\log \frac{P(Z)}{Q(Z)}\right] = H(P, Q) - H(P)\]where $H(P,Q) = \mathbb{E}_{Z \sim P}[-\log Q(Z)]$ is the cross-entropy.
$D_{\text{KL}}(P \| Q) \geq 0$ always, with equality iff $P = Q$. Minimizing cross-entropy over $Q$ is the same as minimizing KL divergence from $P$ to $Q$.
So minimizing expected log loss, which is the cross-entropy between the true $P$ and the model $Q_\theta$, is the same as making $Q_\theta$ as close to $P$ as possible in KL divergence. KL is asymmetric, so it is a divergence rather than a true distance.
\[\boxed{\text{log loss} = \text{cross-entropy} = \text{KL divergence up to a constant}}\]This is why log loss is the right loss for probabilistic prediction. Not arbitrary. Inevitable.
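The identity $D_{\text{KL}}(P \| Q) = H(P, Q) - H(P)$ can be checked numerically on a small discrete example. The two distributions below are made up for illustration, and the function names are not from any library.

```python
import math

def entropy(p):
    """H(P) in nats; terms with zero mass contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = E_P[-log Q(Z)]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = E_P[log P(Z) / Q(Z)]."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # "true" distribution (illustrative)
q = [0.4, 0.4, 0.2]   # model distribution (illustrative)

print(kl(p, q), cross_entropy(p, q) - entropy(p))   # the two agree
print(kl(p, q), kl(q, p))                           # KL is not symmetric
```

Because $H(P)$ does not involve $Q$, lowering the cross-entropy by changing $Q$ lowers the KL divergence by exactly the same amount.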
MLE treats $\theta$ as a fixed unknown and finds the value most consistent with the data. It has no opinion about $\theta$ before seeing data.
The Bayesian approach introduces a prior:
\[P(\theta)\]a distribution over $\theta$ expressing belief before seeing data. This is probability as degree of belief — $\theta$ is not random in a frequentist sense, but we are uncertain about it.
Bayes’ theorem updates the prior after seeing data:
\[P(\theta \mid z_1, \dots, z_n) \propto P(z_1, \dots, z_n \mid \theta) \cdot P(\theta)\] \[\boxed{\text{posterior} \propto \text{likelihood} \times \text{prior}}\]The posterior $P(\theta \mid \text{data})$ is the updated belief. If we want a single estimate, we take the posterior mode:
\[\hat\theta_{\text{MAP}} = \arg\max_\theta \left[\log P(\text{data} \mid \theta) + \log P(\theta)\right]\] \[= \arg\min_\theta \left[\sum_i -\log P_\theta(z_i) - \log P(\theta)\right]\]Compare to MLE which minimizes only $\sum_i -\log P_\theta(z_i)$. MAP adds $-\log P(\theta)$ — exactly a regularization term. A Gaussian prior gives $L_2$ regularization. A Laplace prior gives $L_1$.
\[\boxed{\text{MAP} = \text{MLE} + \text{prior} = \text{regularized MLE}}\] \[\boxed{L_2 \text{ regularization} = \text{Gaussian prior on } \theta}\]Regularization, introduced earlier as a computational response to overfitting, is revealed here as a statement of prior belief about $\theta$.
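A one-parameter check of this correspondence. The sketch assumes observations $z_i \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma$ and a zero-mean Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$; the data, $\sigma$, and $\tau$ are made up. Under these assumptions the MAP estimate coincides with the $L_2$-penalized minimizer of $\frac{1}{n}\sum_i (z_i - \theta)^2 + \lambda\theta^2$ at $\lambda = \sigma^2 / (n\tau^2)$.

```python
def neg_log_posterior(theta, data, sigma, tau):
    """-log P(data | theta) - log P(theta), dropping constants that do not depend on theta."""
    nll = sum((z - theta) ** 2 for z in data) / (2 * sigma ** 2)
    neg_log_prior = theta ** 2 / (2 * tau ** 2)
    return nll + neg_log_prior

def map_estimate(data, sigma, tau):
    """Closed-form minimizer of neg_log_posterior."""
    return sum(data) / (len(data) + sigma ** 2 / tau ** 2)

def ridge_estimate(data, lam):
    """Minimizer of (1/n) * sum (z - theta)^2 + lam * theta^2."""
    return (sum(data) / len(data)) / (1.0 + lam)

data = [1.2, 0.8, 1.5, 1.1]
sigma, tau = 1.0, 0.5
theta_mle = sum(data) / len(data)
theta_map = map_estimate(data, sigma, tau)
theta_ridge = ridge_estimate(data, lam=sigma ** 2 / (len(data) * tau ** 2))
print(theta_mle, theta_map, theta_ridge)
```

A tighter prior (smaller $\tau$) corresponds to a larger $\lambda$ and pulls the estimate further from the MLE toward zero.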
We have observations $z_i \in \mathbb{R}^d$ and targets $y_i \in \mathbb{R}$. The model is linear:
\[f_\theta(z) = \theta^\top z\]with squared loss. Stack into matrix $\mathbf{Z} \in \mathbb{R}^{n \times d}$ and vector $\mathbf{y} \in \mathbb{R}^n$:
\[\hat R_n(\theta) = \frac{1}{n}\|\mathbf{y} - \mathbf{Z}\theta\|^2\]The set of achievable predictions $\{\mathbf{Z}\theta : \theta \in \mathbb{R}^d\}$ is a subspace of $\mathbb{R}^n$: the column space of $\mathbf{Z}$. We want the point in that subspace closest to $\mathbf{y}$. That is the orthogonal projection:
\[\hat{\mathbf{y}} = \text{proj}_{\text{col}(\mathbf{Z})} \mathbf{y} = \mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top \mathbf{y}\]The closed form assumes $\mathbf{Z}$ has full column rank, so that $\mathbf{Z}^\top\mathbf{Z}$ is invertible. The residual $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to the column space by construction.
\[\boxed{\text{least squares} = \text{projection of } \mathbf{y} \text{ onto the model subspace}}\] \[\boxed{\text{residual} \perp \text{model subspace}}\]This is not a formula. It is a geometric fact. Regression, PCA, and many other methods are instances of projection in different spaces.
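The orthogonality can be verified directly. The sketch below assumes a design matrix with just two columns (an intercept and one feature), so the normal equations $(\mathbf{Z}^\top\mathbf{Z})\theta = \mathbf{Z}^\top\mathbf{y}$ reduce to a $2 \times 2$ system solved by hand. The data and helper names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def least_squares_2col(c1, c2, y):
    """Solve the 2x2 normal equations (Z^T Z) theta = Z^T y for Z = [c1 c2]."""
    a11, a12, a22 = dot(c1, c1), dot(c1, c2), dot(c2, c2)
    b1, b2 = dot(c1, y), dot(c2, y)
    det = a11 * a22 - a12 * a12          # nonzero when the columns are linearly independent
    theta1 = (a22 * b1 - a12 * b2) / det
    theta2 = (a11 * b2 - a12 * b1) / det
    return theta1, theta2

c1 = [1.0, 1.0, 1.0, 1.0]               # intercept column
c2 = [0.0, 1.0, 2.0, 3.0]               # one feature
y = [1.0, 2.9, 5.2, 6.9]

t1, t2 = least_squares_2col(c1, c2, y)
y_hat = [t1 * a + t2 * b for a, b in zip(c1, c2)]    # projection of y onto col(Z)
residual = [yi - yh for yi, yh in zip(y, y_hat)]
print(t1, t2, dot(residual, c1), dot(residual, c2))  # both dot products are ~0
```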
We want to know something about $T(P)$ but only observe $T(\hat P_n)$.
A hypothesis is a claim about $P$:
\[H_0 : T(P) = t_0 \qquad H_1 : T(P) \neq t_0\]A test is a decision rule $\phi : \mathcal{Z}^n \to \{0,1\}$. This is a loss problem with two error types:
\[\text{Type I}: \phi = 1 \text{ when } H_0 \text{ is true} \quad \text{(false alarm)}\] \[\text{Type II}: \phi = 0 \text{ when } H_1 \text{ is true} \quad \text{(missed detection)}\]We control Type I error at level $\alpha$ and maximize power — the probability of correctly rejecting $H_0$.
The test statistic is a functional $T(\hat P_n)$. We reject when it is far from $t_0$. The p-value:
\[p = P_{H_0}\big(|T(\hat P_n) - t_0| \geq |T(\hat P_n)_{\text{observed}} - t_0|\big)\]is the probability, if $H_0$ were true, of seeing a statistic at least as far from $t_0$ as the one observed. Small $p$ means the data are surprising under $H_0$.
Note: $p$ is not $P(H_0 \text{ is true})$. That requires a prior — the Bayesian framing.
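For a concrete case, take coin flips with $H_0$: the coin is fair. The statistic is the number of heads (equivalently $n$ times the sample mean), whose distribution under $H_0$ is binomial, so the p-value can be computed exactly. The counts and the level $\alpha = 0.05$ below are made up; the sketch shows both a one-sided and a two-sided version.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def p_value_upper(heads, n, p0=0.5):
    """P_{H0}(T >= observed), where T counts heads in n flips."""
    return sum(binomial_pmf(k, n, p0) for k in range(heads, n + 1))

def p_value_two_sided(heads, n, p0=0.5):
    """P_{H0}(|T - n*p0| >= |observed - n*p0|)."""
    dist = abs(heads - n * p0)
    return sum(binomial_pmf(k, n, p0) for k in range(n + 1)
               if abs(k - n * p0) >= dist - 1e-12)   # small slack for float comparison

n, heads, alpha = 20, 16, 0.05
p_one = p_value_upper(heads, n)
p_two = p_value_two_sided(heads, n)
reject = p_two <= alpha                # decision rule phi: True means reject H0
print(p_one, p_two, reject)
```

Rejecting whenever $p \leq \alpha$ keeps the Type I error at or below $\alpha$. It says nothing on its own about the probability that $H_0$ is true.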
\[\boxed{\text{hypothesis test} = \text{loss problem on } T(\hat P_n) \text{ with controlled Type I error}}\]We want $T(P)$. We have $T(\hat P_n)$. The plug-in gives us an estimate. But the estimate is random — it wiggles. How much?
The standard error is the typical size of $T(\hat P_n) - T(P)$:
\[\text{SE} = \sqrt{\text{Var}_{Z_1, \dots, Z_n \sim P}(T(\hat P_n))}\]where the variance is taken over datasets of size $n$ drawn from $P$. A confidence interval at level $1-\alpha$ is a random interval $[L(\hat P_n), U(\hat P_n)]$ such that:
\[P(T(P) \in [L(\hat P_n), U(\hat P_n)]) \geq 1-\alpha\]The interval is random. The truth $T(P)$ is fixed. Over repeated datasets from $P$, the interval covers the truth at least $1-\alpha$ of the time.
When $\text{Var}(T(\hat P_n))$ is unknown, which it usually is since it depends on $P$, the bootstrap estimates it. Draw $B$ datasets of size $n$ by resampling with replacement from $\hat P_n$, compute $T(\hat P_n^{(b)})$ for each, and use their spread to approximate the spread of $T(\hat P_n)$ around $T(P)$.
This is the plug-in principle one level up: $\hat P_n$ substitutes for $P$ not just in the estimate but in the estimation of the estimate’s uncertainty.
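A minimal bootstrap sketch, assuming i.i.d. data and using the standard deviation of the resampled statistics as the standard error estimate. The sample is simulated from $\mathcal{N}(0, 1)$ purely for illustration, and `bootstrap_se` is a made-up helper name. For the mean there is a textbook formula to compare against; for the median the bootstrap is used on its own.

```python
import random
import statistics

def bootstrap_se(data, statistic, B=2000, seed=0):
    """Standard deviation of the statistic over B resamples drawn from P_n."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(B):
        resample = [data[rng.randrange(n)] for _ in range(n)]   # n draws with replacement
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]

se_boot = bootstrap_se(data, statistics.fmean)
se_formula = statistics.stdev(data) / len(data) ** 0.5    # textbook SE of the mean, for comparison
se_median = bootstrap_se(data, statistics.median)         # same recipe, different functional
print(se_boot, se_formula, se_median)
```

The same function works for any functional passed as `statistic`, which is the point: only $\hat P_n$ is needed, not a formula specific to $T$.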
\[\boxed{T(P) = \text{truth}} \qquad \boxed{T(\hat P_n) = \text{estimate}} \qquad \boxed{\text{bootstrap} = \text{plug-in applied to the wiggle itself}}\]Text produced by GPT-5 and Sonnet 4.6 on April 27–28, 2026. Slight manual polish by Pavel Zhelnov. Original chat context is lost.