This post assumes you’ve already read the previous post which introduced AI systems and why the subset of them that get news coverage today are pattern recognizers.
In statistics, the word “regression” refers to the following idea:
We guess that data has a pattern which we can represent via some simple function, but that there are also aspects of the data our function does not describe. For example, we might assume that body weight is related to height, plus other factors we’re not modeling, and guess that the structure of that relationship is weight = c₀ + c₁·height + c₂·height² + c₃·height³ + noise. (I picked the function family of cubic polynomials here because of the general geometric formula that volume, and hence weight, scales with the cube of length.)
The numbers c₀ through c₃ I assume to be constants (like 7 or -0.3), but because I haven’t picked what those constants are yet we call them parameters of the function family. Given many measurements of actual heights and weights I can try to regress the data to that function family by picking values for those parameters that minimize the total noise over all the measurements.
Picking those constants is not magic. A simple approach will just try many sets of random numbers and keep track of the best one so far. Often this can be made more informed by analysis of the function family itself, using properties of extrema of functions as described by calculus to more rapidly home in on likely-good constants.
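To make that concrete, here is a minimal sketch of the deliberately naive try-random-parameters-and-keep-the-best approach. The height/weight numbers are made up for illustration (heights in meters); real regression tools use the calculus-based shortcuts instead of guessing.

```python
import random

# Made-up (height_m, weight_kg) measurements, purely for illustration.
data = [(1.50, 50.0), (1.60, 58.0), (1.70, 68.0), (1.80, 80.0), (1.90, 94.0)]

def weight_guess(params, height):
    # The cubic function family: c0 + c1*h + c2*h^2 + c3*h^3
    c0, c1, c2, c3 = params
    return c0 + c1 * height + c2 * height**2 + c3 * height**3

def total_noise(params):
    # Sum of squared leftovers ("noise") over all the measurements.
    return sum((weight_guess(params, h) - w) ** 2 for h, w in data)

# The naive approach: try many random parameter sets, keep the best so far.
best_params, best_noise = None, float("inf")
for _ in range(200_000):
    candidate = [random.uniform(-30, 30) for _ in range(4)]
    noise = total_noise(candidate)
    if noise < best_noise:
        best_params, best_noise = candidate, noise

print(best_params, best_noise)
```

A calculus-informed method (ordinary least squares) would find the noise-minimizing parameters directly instead of stumbling toward them by luck.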
Picking the function family is magic, in the sense that it has to come from outside the regression itself. Sometimes it’s magic informed by mathematics, science, or other principled approaches. Sometimes it’s magic informed by the desire for something simple and easy to understand. Sometimes it’s magic informed by what we know how to efficiently regress to. There are fancy approaches that try many different function families taken from a set and keep the best of them for the given data, but even those get the set from a magic decision.
The smaller the noise term gets during regression, the better the “fit” of the function to the data. But sometimes we achieve “overfit”, meaning we found a function that happens to match the data we have but is unlikely to match any additional data we might find later.
The traditional source of overfit is having more parameters than data. Each parameter adds more flexibility or bendiness to the function family: with a large number of parameters I can find a function that wiggles and jumps all over the place to pass through each and every data point, but along the way I lose the ability to notice bigger patterns.
Consider again the height-weight example and suppose I have three people with heights 172.2cm, 172.5cm, and 172.6cm and weights 78kg, 114kg, and 78kg, respectively. You or I might recognize this as noise: they’re all about the same height. But if we have at least three parameters we can fit these three points exactly, for example as weight = -1200·height² + 413760·height - 35665986. Such over-fit functions have found a false pattern within the noise and make nonsensical predictions like someone with height 172.0cm weighing negative 66kg.
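A quick way to check that claim, assuming numpy is available: fit a three-parameter quadratic through the three points and look at its prediction just outside them.

```python
import numpy as np

heights = [172.2, 172.5, 172.6]   # cm
weights = [78, 114, 78]           # kg

# Three parameters (a quadratic) are enough to pass through all three points exactly.
coefficients = np.polyfit(heights, weights, deg=2)

# The "pattern" it found extrapolates to nonsense just outside the data:
print(np.polyval(coefficients, 172.0))   # roughly -66 kg
```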
The traditional protection against overfit is picking a function family that has fewer parameters. Fewer parameters = less flexibility = more likelihood of finding the desired pattern and not just noise. But current data-driven algorithms have other sources of overfit, and thus other approaches to avoiding it.
I introduced regression with the example of predicting weight from height. I could also try to predict weight from age. I could also try to predict weight from the combination of height and age by picking a function family for weight = f(height, age) + noise. This multi-variable model might reveal interesting interactions between height and age in addition to providing a better fit for weight.
What else might be correlated with weight? Perhaps income, or where you live, or how much money you spend on automobile fuel, or the local weather, or if politicians you like have been elected, or how old your therapist is, or who knows what all else. Picking the variables of the function family I’m regressing to can itself seem like magic.
One approach to bypass this magic is to use every variable you can find. Do I have data on how often you have a song stuck in your head? Great, I’ll include that in the regression as well. Can’t hurt, right?
Actually, it can hurt. In particular, it can create a new kind of overfitting.
We refer to the number of variables in our function family, and equivalently the number of individual numbers in each point of data, as the “dimensionality” of the regression. High dimensionality forces large numbers of parameters: we need at least one per dimension even for a simple weighted average and generally many more than that. Even modeling just pair-wise interactions of d variables results in roughly d²/2 parameters, which can make overfitting hard to avoid.
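As a rough sense of scale, here is a sketch of the parameter count if we keep just one parameter per variable plus one per pair of variables; the exact counts depend on the model, but the quadratic growth does not.

```python
d = 1000                      # number of input variables (dimensions)
pairwise = d * (d - 1) // 2   # one parameter for every pair of variables
print(d + pairwise)           # 500500 -- grows roughly as d**2 / 2
```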
Except…
Except that some high-dimension function families are “smooth” or “fair”. They don’t overfit very easily. It’s hard to show simple examples of this because the idea doesn’t map well to lower dimensions, but basically we pick function families where more parameters only add a little bit of bendiness.
Artificial Neural Networks were initially inspired by an incomplete understanding of biological neurons, modified to be more easily handled by a computer. That simplification said that a neuron fires when a weighted combination of the signals from the neurons feeding into it is large enough. This completely ignores known-important factors like the role of neurologically-active hormones. To make it fast to compute, additional differences from neurons were added, notably including a discrete-time model whereby we imagine time proceeds in steps with each neuron either firing or not firing at each time step.
After those simplifications and changes we end up with a two-step process: multiply a vector by a matrix, then clamp the elements of the resulting vector to the nearest integer. Together these two steps form a function that takes a vector in and produces a vector out.
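A minimal sketch of that two-step function, assuming numpy; the weight matrix here is random, standing in for parameters we would normally find by regression.

```python
import numpy as np

def neural_step(weights, inputs):
    # Step 1: multiply the input vector by the weight matrix.
    activation = weights @ inputs
    # Step 2: clamp each element of the result to the nearest integer
    # (in the discrete-time picture, each neuron either fires or doesn't).
    return np.round(activation)

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, size=(3, 4))   # 4 input neurons feeding 3 output neurons
x = np.array([1.0, 0.0, 1.0, 1.0])    # which input neurons fired

print(neural_step(W, x))              # a vector in, a vector out
```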
The results don’t come anywhere close to operating like a brain. They mimic neither neurobiology nor high-level patterns of brain activity. For example, brains run for hours on end while artificial neural networks get stuck in dead ends or pointless loops if run for more than a few steps. But we’re glad for these differences: brains get bored and tired and take time and are bad at the repetitive tasks we want computers to do.
However, the function family created by this vaguely-neuron-inspired design does have two useful properties. First, it is relatively easy to perform regression analysis on it because it is mostly just a matrix and matrix algebra is very well understood. Second, regression to this function family empirically seems to avoid overfitting when given very high-dimension vectors, possibly as an outcome of the alternating linear and simple nonlinear steps, each of which would overfit in very different ways and seem to partially cancel each other’s overfitting tendencies out.
Early artificial neural nets were somewhat useful, but not as effective as people had hoped. Trying to more closely imitate biological neurons didn’t seem to help, either. But there were things that moved further from neurons and did help. Hundreds of papers have been published that broadly follow the template “we started with modified-neural-net X, modified it more in way Y, and now it works better when regressed to dataset Z.” (These modifications are generally informed by some understanding or intuition, and if they work they may have additional explanations of why they work added after the fact; most papers include such explanations, though whether they were pre hoc motivation or post hoc explanation is rarely made clear.)
Common modifications include replacing the “clamp to nearest integer” step with some other nonlinear function.
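For instance, as a hypothetical sketch rather than any particular published variant, the rounding step might be swapped for a smoother nonlinearity such as a ReLU or a sigmoid:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)         # keep positive values, zero out the rest

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))   # squash smoothly into the range (0, 1)

def neural_step(weights, inputs, nonlinearity=relu):
    # Same matrix multiply as before; only the second step has changed.
    return nonlinearity(weights @ inputs)
```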
There does not yet appear to be any obvious upper limit on the number of modifications that will help. For example transformers, which are what many of the currently-popular AI systems use (it’s the T in GPT and BERT, for example), have more than a dozen sequential parts, several of which consist of a neural net paired with some additional post-processing of the result vector and others of which are completely unrelated function types.
OK, so what does all that mean?
We keep designing bigger and more complicated function families that are arcane interleavings of mathematically-simple functions picked because they are (a) easy to regress to and (b) unlike each other in ways that impede overfitting. Notably absent from those selection criteria are (c) being more like how brains work and (d) being related to any actual structure or relationships in the fields they’ll be used in.
If I regress height/weight data to a cubic polynomial I learn, by the size of the unused information that the regression classifies as noise, whether my assumption that weight is roughly a cubic function of height was consistent with the data or not. But if it is not consistent with the data I don’t learn much else: I either get both predictability and insight or neither.
On the other hand, if I regress weight vs everything to a high-dimensional function family like a neural net or transformer I’m likely to get a very accurate function that predicts your weight, but it’s not going to tell me much of anything else. I almost always get predictability and rarely anything approaching insight.
But, you might be thinking, surely predictability is insight? Oddly enough, not very often. High-dimensional nonlinear models tend to find high-dimensional nonlinear patterns that are difficult to even articulate, let alone make any sense out of. They tend to fixate on parts of the input we didn’t even notice were there and can be fooled with optical illusions that we can’t even see (for an example, see this brief by OpenAI).
In general, high-dimensional regression finds patterns with no understanding, no notion of meaning, and few if any outputs that humans can use to find meaning either.
So, we have tools that can find functions that fit very high-dimensional data quite well, and fit better the more data we provide them with. What do we do with them?
Many of the advances in AI applications come from more effective uses of this data fitting. To illustrate this, let’s consider one example use case: trying to get computers to create text.
Let’s assume the output of our program is a vector giving the probability that each word will appear next: one number for each word. That’s going to be many, many numbers: for example, on this blog I have used around 400,000 words but only 20,000 distinct words, so I’d need a 20,000-entry vector to represent the likelihood of each one appearing next.
The easiest way to make this vector would be to make a constant vector I always use. For example, of those 400,000 words about 18,000 were the word “the”, so in the “the” position of the vector I’d put that ratio, or 4.5%. The word “then” appears around 1,000 times, so it’d have 0.0025. The word “alluvium” appears only once (prior to this post, that is; in post 323 in case you’re interested), so it would have probability 0.0000025. And so on.
If we used that constant vector to generate text we’d get something like
more have simple the but merriness sin than an tiny about teaching
i.e. nonsense that has roughly the right word frequencies.
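A sketch of that constant-vector generator, assuming the blog’s text is sitting in a hypothetical file blog_posts.txt:

```python
import random
from collections import Counter

# Hypothetical file standing in for the ~400,000 words of blog text.
words = open("blog_posts.txt").read().lower().split()

counts = Counter(words)
vocabulary = list(counts)
frequencies = [counts[w] / len(words) for w in vocabulary]   # the constant vector

# Draw every word independently from the same fixed distribution.
print(" ".join(random.choices(vocabulary, weights=frequencies, k=12)))
```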
Let’s try adding a simple input to our function: the previous word. For example, “the” is followed by “same” 400 times, by “day” 40 times, and by “mathematical” 4 times.
If I use this as the source of the output vector I get text that is easier to read, but still nonsensical:
work on the velocity or old and interdependent new term gain the current
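A sketch of the previous-word version, using the same hypothetical blog_posts.txt file:

```python
import random
from collections import defaultdict, Counter

words = open("blog_posts.txt").read().lower().split()   # hypothetical corpus file

# Count how often each word follows each other word.
following = defaultdict(Counter)
for previous, current in zip(words, words[1:]):
    following[previous][current] += 1

# Generate text one word at a time, looking only at the previous word.
word = "the"                       # assume a common starting word
output = [word]
for _ in range(12):
    nexts = following[word]
    if not nexts:                  # dead end: fall back to a random word
        word = random.choice(words)
    else:
        word = random.choices(list(nexts), weights=list(nexts.values()), k=1)[0]
    output.append(word)
print(" ".join(output))
```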
We don’t have to stop with just one word of context; we could also add more words, getting somewhat better results, as I did in post 289.
We can keep adding more and more words of context, but if we do we quickly run into overfitting problems, a common result when using a simple function with high dimensionality. So instead of a simple count-based function let’s switch to an overfit-resistant function family and regress to it. But how? What’s the high-dimensional function family we want to regress to?
A major insight in using huge inputs without overfitting is the addition of state vectors. We’ll make some kind of big abstract vector of numbers with no assigned meaning. This vector will represent the abstract notion of the context of the next word.
We’ll start that vector with some kind of “new conversation” state and change it after each new word using a function (found via regression) that takes a (last state, new word) pair and returns a new state. We’ll also find (via regression) a function that, given a state vector, produces a word likely to follow it.
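A sketch of that shape of model, assuming numpy. The tanh and softmax-style choices here are illustrative stand-ins rather than anything from the post, and every matrix is a random placeholder for parameters that regression would actually choose.

```python
import numpy as np

STATE_SIZE = 64         # size of the abstract state vector (arbitrary small choice)
VOCAB_SIZE = 20_000     # one entry per distinct word

rng = np.random.default_rng(0)

# Placeholders for parameters that would be found via regression.
W_state = rng.normal(size=(STATE_SIZE, STATE_SIZE)) * 0.01
W_word = rng.normal(size=(STATE_SIZE, VOCAB_SIZE)) * 0.01
W_out = rng.normal(size=(VOCAB_SIZE, STATE_SIZE)) * 0.01

def next_state(state, word_id):
    """Take a (last state, new word) pair and return a new state."""
    word_vec = np.zeros(VOCAB_SIZE)
    word_vec[word_id] = 1.0
    return np.tanh(W_state @ state + W_word @ word_vec)

def word_probabilities(state):
    """Given a state vector, produce a probability for each word to appear next."""
    scores = W_out @ state
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

state = np.zeros(STATE_SIZE)                 # the "new conversation" starting state
state = next_state(state, word_id=17)        # after seeing (hypothetical) word number 17
print(word_probabilities(state).argmax())    # index of the most likely next word
```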
This raises the question of how we regress to a “good” state vector. Partly this is by transitivity: a good state vector is one that, when combined with a word selection function, produces words like those in the training data. But that’s hard to regress to, so we also add some direct measures; for example, a good state vector changes more when it encounters a word that has long-term impact on the conversation, like “rainfall”, and less when it encounters a word that has little impact, like “the”. That duration of impact can be discovered via another regression, and can be based on the context it helps define as well as the word itself.
These ideas, coupled with many billions of words of example text and millions of dollars of electricity to do the regression, give us the current family of news-making large language models.