Math does not come easily for me. Maybe this is why one of my greatest joys in life is finally understanding why and how a mathematical process works, on my own terms.
I’ll illustrate this with the first example in graduate school where I felt enlightened, so to speak: the dynamics of sparse coding. This brief explanation will not be clear if you’re not familiar with linear algebra, but if you do feel compelled to read through, I’m curious how much is understandable from my explanation in words (diagrams would be much clearer, but maybe I’ll do that in another post). Here is a tutorial I wrote with visualizations, which may be a bit clearer, if you are interested.
Sparse coding is a principle hypothesized to govern certain types of brain activity, where only a small subset of relevant neurons fire. Sparse activity has been found in many parts of the brain, including the visual pathway. The formulation is
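$$x = \Phi a + n$$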
Where x is an image flattened into a vector (N pixels), Φ is an N×M matrix whose columns (basis vectors) can be linearly combined to approximate x, a is a vector of M coefficients that are ideally sparse (mostly zeros), and n is a random Gaussian vector (we can ignore this for our purposes). The idea is to find an a and Φ that well-approximate a given x. Φ is the full set of images you have at your disposal (the “dictionary”), and a specifies which of them you pick, and how much of each, to add up to the final one.
We define an energy function that we aim to minimize, which puts into math the desiderata for our model:
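(I’ll write λ for the weight on the sparsity penalty, and include the conventional factor of ½ so the gradients below come out clean.)

$$E = \frac{1}{2}\,\lVert x - \Phi a \rVert_2^2 + \lambda \sum_i \lvert a_i \rvert$$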
Minimizing the first term reduces the difference between the original image x and the approximation Φa; minimizing the second term makes a sparse (this form is commonly known as LASSO). To make E small, we must obtain a good dictionary Φ and sparse coefficients a that together give a good approximation of the image x. One simple way to do this is through gradient descent, i.e. iteratively changing our Φ and a in the direction in vector space that decreases E.
If we take the gradient of -E with respect to a selected a_i (a scalar), that is the direction we need to step to decrease E. I won’t go through the full gradient descent procedure here; let’s just look at the gradient itself:
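$$-\frac{\partial E}{\partial a_i} = \Phi_i^\top (x - \Phi a) - \lambda\,\operatorname{sign}(a_i)$$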
where Φ_i is the ith column of Φ (the ith basis vector). Intuitively, the direction you need to step to get a better a_i amounts to taking the difference between the target image and your approximation (the direction in which you’re “off”, called the residual), and taking its dot product with Φ_i. The dot product (the similarity between Φ_i and the residual) tells us whether Φ_i is being used too much or too little in creating the approximation, and a_i changes to adjust for this. Note that if the approximation is good, the residual will be small, and a_i changes only by a little.
The second term has the opposite sign of a_i (negative when a_i is positive, and vice versa), which nudges the activations toward 0 (this is desired because we want a sparse a!).
Similarly, we take the gradient of -E with respect to Φ_j :
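$$-\frac{\partial E}{\partial \Phi_j} = a_j\,(x - \Phi a)$$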
where Φ_j is the jth column of Φ. The update multiplies the residual (x − Φa) by a_j (which can be negative or positive), giving a vector that moves Φ_j in the direction that corrects the residual, scaled by how much that basis vector is used in the approximation, and therefore improves the approximation.
In sparse coding, we alternate between taking steps in a to infer the sparse coefficients and taking a step in Φ to learn the basis vectors, both in directions that minimize E. Although determining the gradient is calculus, which is just rules, you can explain every part of it intuitively, because of the way we formulated the energy function! The solution that emerges is elegant (see the tutorial), and provides hypotheses for neural activity in primary visual cortex. For those who think it’s not a big deal, and that’s just how gradients work, that’s fair. But it’s hard for me to take even fundamental concepts like this for granted, because it takes me a lot of work to get here. It’s rewarding to understand!
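If it helps to see the alternation spelled out, here is a rough numpy sketch of the two updates on a single toy image (not the code from my tutorial: the random data, the sparsity weight lam, and the step sizes eta_a and eta_phi are placeholder choices, and a real implementation would average the Φ update over many images):

```python
import numpy as np

# Toy setup: one random "image" x, a random dictionary Phi with unit-norm columns.
rng = np.random.default_rng(0)
N, M = 64, 128                              # N pixels, M basis vectors
x = rng.normal(size=N)
Phi = rng.normal(size=(N, M))
Phi /= np.linalg.norm(Phi, axis=0)
a = np.zeros(M)
lam, eta_a, eta_phi = 0.1, 0.01, 0.01       # sparsity weight and step sizes (arbitrary)

for step in range(200):
    # Inference: several gradient steps on a with Phi held fixed.
    for _ in range(10):
        residual = x - Phi @ a              # how far off the approximation is
        a += eta_a * (Phi.T @ residual - lam * np.sign(a))
    # Learning: one gradient step on Phi with a held fixed, then renormalize columns.
    residual = x - Phi @ a
    Phi += eta_phi * np.outer(residual, a)  # column j moves by a_j * residual
    Phi /= np.linalg.norm(Phi, axis=0)

E = 0.5 * np.sum((x - Phi @ a) ** 2) + lam * np.sum(np.abs(a))
print(f"energy: {E:.3f}, coefficients above 1e-3: {np.count_nonzero(np.abs(a) > 1e-3)}")
```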
(You know what else uses gradients? Training every machine learning system in production, from ChatGPT to the smart charging feature on your phone. But we are at a loss to explain those gradients the way we can for simpler models like sparse coding. These machine learning models are amazing feats of engineering, and (some) work well for a specific purpose, but they are clunky and opaque. This is why I don’t do deep learning 😬.)
Sometimes people ask me why these problems are interesting to me. I feel that I can’t reduce the explanation past the primitive: it’s just pure joy! And I am extremely privileged to get to use this sense to guide what research I do. Following the beautiful ideas in pursuit of understanding has made up most of my scientific journey so far. After all, developing my mathematical maturity was my primary motivation for attending graduate school.
But let’s take a step back. We can certainly play in our little theoretical sandbox (I am in fact very happy to do so). However, we cannot decontextualize our work from the rest of the world. In technical fields especially, where people seem to believe that they are smarter than everyone else and that rationality rules (obviously both are wrong), there are so many problems that cannot be solved by math: for example, building models for data and metrics we haven’t seen before (external validity), or reducing human experience to probabilities. We must understand the limitations of computation in practice, and resist the allure of tech arrogance. More on this in following posts!