Math does not come easily for me. Maybe this is why one of my greatest joys in life is finally understanding why and how a mathematical process works, on my own terms.

I’ll illustrate this with the first example in graduate school where I felt enlightened, so to speak: the dynamics of sparse coding. This brief explanation will not be clear if you’re not familiar with linear algebra, but if you do feel compelled to read through, I’m curious how much is understandable from my explanation in words (diagrams would be much clearer but maybe I’ll do that in another post). Here is a tutorial I wrote with visualizations that may be slightly better if you are interested.

Sparse coding is a principle hypothesized to govern certain types of brain activity, where only a small subset of relevant neurons fire. Sparse activity has been found in many parts of the brain, including the visual pathway. The formulation is

$$x = \Phi a + n$$

where *x* is an image flattened into a vector (*N* pixels), *Φ* is an *N×M* matrix whose columns (basis vectors) can linearly combine to approximate *x*, *a* is a vector of *M* coefficients that are ideally sparse (mostly zeros), and *n* is a random Gaussian noise vector (we can ignore it for our purposes). The idea is to find an *a* and *Φ* that well-approximate a given *x*. *Φ* is the collection of images you have at your disposal (the “dictionary”), and *a* specifies which of them you pick, and how strongly, to add up to the final one.
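To make the shapes concrete, here is a tiny sketch in NumPy (the sizes, names, and the specific nonzero coefficients are my own illustrative choices, not from any particular implementation):

```python
import numpy as np

N, M = 64, 128                              # pixels per image, number of basis vectors
rng = np.random.default_rng(0)

Phi = rng.standard_normal((N, M))           # dictionary: each column is a basis vector
a = np.zeros(M)                             # coefficients: ideally mostly zeros
a[[3, 17, 90]] = rng.standard_normal(3)     # only a few basis vectors are "on"

x_hat = Phi @ a                             # sparse approximation of an image x
```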

We define an energy function that we aim to minimize, which puts into math the desiderata for our model:

$$E = \frac{1}{2}\lVert x - \Phi a\rVert^2 + \lambda \sum_i \lvert a_i\rvert$$

Minimizing the first term reduces the difference between the original image *x* and the approximation *Φa*; minimizing the second term makes *a* sparse (this form is commonly known as LASSO). To make *E* small, we must obtain a good dictionary *Φ* and sparse coefficients *a* that get us a good approximation of the image *x*. One simple way to do this is through gradient descent, i.e. changing our *Φ* and *a* in a direction in vector space that minimizes *E*.
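In code, the energy is only a couple of lines (the sparsity weight λ, here `lam`, is a knob I'm setting to an arbitrary placeholder value):

```python
import numpy as np

def energy(x, Phi, a, lam=0.1):
    """E = 1/2 * ||x - Phi a||^2 + lam * sum(|a_i|)."""
    residual = x - Phi @ a
    return 0.5 * residual @ residual + lam * np.sum(np.abs(a))
```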

If we take the gradient of −*E* with respect to a selected *a_i* (scalar), that is the direction we need to step to minimize *E*. I won’t go through the gradient descent procedure here, but we’ll just look at the gradient itself:

$$-\frac{\partial E}{\partial a_i} = \Phi_i^\top (x - \Phi a) - \lambda\,\operatorname{sgn}(a_i)$$

where *Φ_i* is the *i*th column of *Φ* (the *i*th basis vector). Intuitively, the direction you need to step to get a better *a_i* amounts to taking the difference between the target image and your approximation (the direction in which you’re “off”, called the *residual*), and taking its dot product with *Φ_i*. The dot product (the similarity between *Φ_i* and the residual) tells us whether *Φ_i* is used too much or too little in creating the approximation, and *a_i* changes to adjust for this. Note that if the approximation is good then the residual will be small, and *a_i* changes only a little.

The second term has the opposite sign of *a_i*, so it nudges the coefficient toward 0 (this is desired because we want a sparse *a*!).
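Both terms of the gradient translate directly into a coefficient update. A minimal sketch, where the step size `eta` and sparsity weight `lam` are placeholder values of my choosing:

```python
import numpy as np

def grad_step_a(x, Phi, a, lam=0.1, eta=0.01):
    """One gradient step on the coefficients a (ascent on -E)."""
    residual = x - Phi @ a                       # how "off" the approximation is
    grad = Phi.T @ residual - lam * np.sign(a)   # dot with each Phi_i, plus the nudge toward 0
    return a + eta * grad
```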

Similarly, we take the gradient of −*E* with respect to *Φ_j*:

$$-\frac{\partial E}{\partial \Phi_j} = (x - \Phi a)\,a_j$$

where *Φ_j* is the *j*th column of *Φ*. The gradient scales the residual (*x − Φa*) by *a_j* (which can be negative or positive), which gives a correction to *Φ_j* in the appropriate direction based on how it’s being used in the approximation, and therefore improves the approximation.
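The corresponding dictionary update can be sketched the same way, applied to all columns at once via an outer product (again, `eta` is a placeholder step size):

```python
import numpy as np

def grad_step_Phi(x, Phi, a, eta=0.01):
    """One gradient step on the dictionary Phi (ascent on -E)."""
    residual = x - Phi @ a
    # np.outer(residual, a): column j of the update is residual * a[j]
    return Phi + eta * np.outer(residual, a)
```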

In sparse coding, we alternate between taking steps on *Φ* to learn the basis vectors and steps on *a* to infer the sparse coefficients, both minimizing *E*. Although determining the gradient is calculus, which is just rules, you can explain every part of it *intuitively*, because of the way we formulated the energy function! The solution that emerges is elegant (see tutorial), and provides hypotheses for neural activity in primary visual cortex. For those who think it’s not a big deal, and that’s just how gradients work, that’s fair. But it’s hard for me to take even fundamental concepts like this for granted, because it takes me a lot of work to get here. It’s rewarding to understand!
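Putting the two updates together, the whole alternating loop fits in a few lines. This is only a toy sketch: the random “image”, the step sizes, the iteration counts, and the unit-norm constraint on the basis vectors (a common trick to keep them from growing without bound) are all my own choices, not canonical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam, eta = 64, 128, 0.1, 0.01
x = rng.standard_normal(N)                   # stand-in for one flattened image
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)           # unit-norm basis vectors
a = np.zeros(M)

for _ in range(200):
    # inference: several gradient steps on the coefficients a
    for _ in range(10):
        residual = x - Phi @ a
        a += eta * (Phi.T @ residual - lam * np.sign(a))
    # learning: one gradient step on the dictionary Phi
    residual = x - Phi @ a
    Phi += eta * np.outer(residual, a)
    Phi /= np.linalg.norm(Phi, axis=0)       # re-impose the norm constraint
```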

(You know what else uses gradients? Training every machine learning system in production, from ChatGPT to the smart charging feature on your phone. But we are at a loss to explain these gradients the way we can simpler models like sparse coding. These machine learning models are amazing feats of engineering, and (some) work well for a specific purpose, but they are clunky and opaque. This is why I don’t do deep learning 😬.)

Sometimes people ask me *why* these problems are interesting to me. I feel that I can’t reduce the explanation past the primitive: it’s just pure joy! And I am extremely privileged to get to use this sense to guide what research I do. Following the beautiful ideas in pursuit of understanding has made up most of my scientific journey so far. After all, developing my mathematical maturity was my primary motivation for attending graduate school.

But let’s take a step back. We can certainly play in our little theoretic sandbox (I am in fact very happy to do so). However, we cannot decontextualize our work from the rest of the world. For technical fields especially, where people seem to believe that they are smarter than everyone else and that rationality rules (obviously both are wrong), there are so many problems that cannot be solved by math. For example, creating models for data and metrics we haven’t seen before (external validity), or reducing human experience to probabilities. We must understand the limitations of computation in practice, and resist the allure of tech arrogance. More on this in following posts!