Classical ML from Scratch
Two learning paradigms built from NumPy up — tree-based spam classification (decision tree, Random Forest, AdaBoost) and SVD/ALS matrix factorization for movie recommendations. No frameworks; matched scikit-learn on both.

The full spam decision tree — depth 9, 32 word-frequency features, hand-grown via information gain. Every split is a threshold on a real word (“exclamation ≤ 0.5”, “dollar ≤ 2.5”). The tree memorizes training data perfectly by depth 15; the interesting problem is why that’s wrong, and what to do about it.
Part 1 — Spam Classifier
Structure in the Data
Which Words Give It Away
Exclamation mark frequency is the single strongest predictor — 3× more important than dollar signs. Spam writers use exclamation to create urgency; ham almost never does. “Parenthesis” and “semicolon” ranking high is less obvious — it reflects HTML formatting artifacts in the email corpus. “Drug”, “prescription”, and “pain” rank low because 2004 spam was more financially-themed than medically-themed.
Overfitting & Ensembles
The splitting criterion is information gain:
$$IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v), \quad H(S) = -\sum_k p_k \log_2 p_k$$
Left: training accuracy climbs monotonically while validation peaks at depth 10. A depth-20 tree has memorized noise. Random Forest (dashed) beats the best single tree by staying shallow across 200 bootstrapped trees — bagging reduces variance without touching bias.
Right: AdaBoost drives ensemble error down over 15 rounds. Learner weight α decays because later stumps face harder examples — the residual errors left by all previous rounds — and so each stump is less confident. The ensemble improves anyway by accumulating many weak signals.
Part 2 — Movie Recommender
The Problem is the Sparsity
The rating matrix: 24,983 users × 100 movies, 64% empty. Each row is a user’s taste; each column is a movie’s reception. Sorted by mean rating, a faint diagonal gradient is visible — positive users cluster top-right, negative users cluster bottom-left — but the structure is mostly hidden in the noise of missing entries. The goal is to fill in the blanks from the pattern of what is observed.
Factorization
Find two matrices $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{m \times d}$ that minimize reconstruction error on observed ratings only:
$$\min_{U,,V} \sum_{(i,j),\in,\Omega} \left(R_{ij} - \mathbf{u}_i^\top \mathbf{v}_j\right)^2 + \lambda \left(|U|_F^2 + |V|_F^2\right)$$
Alternating Least Squares (ALS) solves this by fixing $V$ and solving for $U$ in closed form, then swapping. Each subproblem is a ridge regression — $\mathbf{u}i = (V{J_i}^\top V_{J_i} + \lambda I)^{-1} V_{J_i}^\top \mathbf{r}_i$ where $J_i$ is the set of movies user $i$ has rated.
Left: cosine similarity between learned movie vectors after 20 ALS iterations. The block structure reveals groups of movies the model has learned are interchangeable — users who liked one tend to like the others. Right: ALS convergence — train MSE drops sharply in the first 5 iterations and plateaus; validation accuracy (predicting rating sign) reaches 72% and holds.
SVD with $d=10$ singular vectors provides the warm start. ALS then refines on observed entries only — which SVD can’t do since it requires a dense matrix.
Related projects
Deep Learning from Scratch
The full arc: backpropagation in raw NumPy, CNNs with BatchNorm and Dropout trained to 78% on CIFAR-10, multi-head self-attention for text summarization, and a Masked Autoencoder that reconstructs images from 25% of their patches — then transfers those features to downstream tasks.
Entropy Wordle Solver
Information-theoretic greedy solver that picks each guess to maximize expected entropy over the remaining word set — averaging 3.92 guesses across 300+ games.
MCMC Cipher Decoder
Metropolis-Hastings breaking substitution ciphers against bigram frequencies — watch garbled text slowly resolve into English over MCMC iterations.