I spent more than a week working on this post. My goal is to expose the inner workings of LLMs while assuming the reader has little to no mathematical or computational background. I don’t think I got the balance perfectly correct, but I do hope this is useful to some.
When people say machine learning, they mean picking a general family of functions and then using a pile of example data to guess-and-check which member of that family behaves best. This is learning only in the loosest sense: there’s no meaning-making, no understanding, no abstraction, no growth, just trying a whole bunch of things and keeping the one that seems best.
Machine learning rarely works as well as its name might make you think it would. That said, a few specific cases have recently worked quite well. The purpose of this post is to explain the most visible of those cases: large language models (LLMs).
Before going deeper into LLMs, it is worth understanding that all of the news-making AI breakthroughs of the past several years have depended on the mindbogglingly immense training set that is the Internet. This is why they’re getting good at Internet-common things (like chatting and answering questions and writing articles and making pictures and videos) much more rapidly than they’re getting good at offline things (like doing dishes and opening doors and walking dogs).
When people say neural networks, they mean a particular family of functions: chains of linear operators, each followed by a simple detail-discarding step. They are neural only in the sense that once upon a time someone ignored almost everything we know about neurons and re-posed what was left as a linear operator followed by a detail discard. That vague analogy is unrelated to the effectiveness of neural networks, and trying to make systems that work more like neurons has thus far proven to be counter-productive.
Linear operators are a very simple class of functions that has been studied by mathematicians for centuries. Some of the reasons we like linear operators as a function family for machine learning include:

- They are easy to analyze mathematically.
- They are very cheap to compute, even with huge numbers of inputs and outputs.
- Small changes to the numbers inside them change their behavior in predictable ways, which makes guess-and-check training tractable.
Linear operators on their own are not a good function family for machine learning, for two reasons. First, being easily analyzed means they can only represent functions that are easily analyzed, and we’re interested in having computers do things that are complicated and hard to analyze. Second, linear operators with many inputs are overly sensitive to small differences in those inputs and can fail to generalize from specific inputs to the general case.
To get around the limitations of linearity, neural nets follow each linear operator with a detail-discarding operator. The most common of these rounds any number above some threshold to 1 and any number below it to 0. That’s a nonlinear thing to do, but it’s simple enough that we keep much of the computational efficiency of linearity. And because the detail discard is nonlinear, chaining multiple operators together gives us something more than just another single linear operator.
Putting this together, common neural networks are a sequence of linear operators, each followed by a detail-discarding step. The number of linear operators in this sequence is called the depth of the neural network. The deeper the network, the less linearity limits its behavior, but also the less its training benefits from the efficiency of linearity. (The idea that several linear things punctuated by nonlinearity can approximate nonlinear things may be familiar from connect-the-dots illustrations, piecewise-linear functions, polygonal computer models, and the trapezoid rule, all of which approximate curves by a sequence of line segments. Those examples all segment the domain into pieces and handle each segment linearly; neural nets instead segment the computation process into linear pieces, not the input domain.)
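To make that structure concrete, here is a minimal Python sketch of a chain of linear operators and detail discards. The sizes and numbers are placeholders rather than a trained model, and the threshold-style detail discard follows the simplified description above.

```python
import numpy as np

def detail_discard(values, threshold=0.0):
    """Round anything above the threshold to 1 and anything below it to 0."""
    return (values > threshold).astype(float)

def tiny_network(inputs, layers):
    """A chain of linear operators (matrices), each followed by a detail discard."""
    values = inputs
    for matrix in layers:
        values = detail_discard(matrix @ values)   # linear step, then nonlinear step
    return values

# A depth-2 network; in a real system, training would choose these numbers.
layers = [np.random.randn(4, 3), np.random.randn(2, 4)]
print(tiny_network(np.array([0.5, -1.0, 2.0]), layers))
```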
LLMs complete text. Their input is a sequence of a few thousand words and their output is one more word to stick on the end of that sequence. If we want more than one word, we add the first output word to the original input and try again, getting a second word; and so on until the LLM outputs a special word that means end of text.
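As a sketch of that one-more-word loop, here are a few lines of Python in which `predict_next_token` is a hypothetical stand-in for the trained model:

```python
def complete(tokens, predict_next_token, max_new_tokens=500):
    """Keep asking the model for one more token until it says the text is over."""
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)   # hypothetical model call
        if next_token == "«end»":
            break
        tokens = tokens + [next_token]            # the output becomes part of the next input
    return tokens

# Example with a dummy model that always continues the same way:
canned = ["bake", "bread", "?", "«end»"]
print(complete(["How", "do", "I"], lambda toks: canned[min(len(toks) - 3, 3)]))
```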
I say "word", but that’s not as nicely defined as we might hope. Are "such" and "Such" the same word? Is "," a word? If not, are "of," and "of" the same word? These are questions that the human designing the LLM has to answer, resulting in an algorithm for turning text into a token stream, where a token is the concrete realization of the idea of a "word". The more effectively this algorithm can provide single-meaning tokens, the more efficient the training will be. This is one reason why text-based AIs like LLMs are more effective than voice-based AIs: we’re much better at making meaningful tokens from text than from sound.
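As a toy illustration of one possible set of answers to those questions (real LLM tokenizers are more sophisticated, often splitting words into sub-word pieces):

```python
import re

def toy_tokenizer(text):
    """Lowercase everything (so "such" and "Such" match) and give punctuation
    its own tokens (so "of," becomes "of" and ",")."""
    return re.findall(r"[a-z0-9]+|[^a-z0-9\s]", text.lower())

print(toy_tokenizer("Error 403: Unauthorized. You must be logged in."))
```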
The training data for LLMs is text. Any text, from any source, and as much of it as possible, but processed in a particular way to make it easy to define an objective function.
To see how this training works, consider a web page that is broken into the following 15 tokens:
Error 403 Unauthorized «newline» You must be logged in to view this page . «end»
This results in 15 training inputs and an objective for each:
| Input | Objective |
|---|---|
| (empty) | Error |
| Error | 403 |
| Error 403 | Unauthorized |
| Error 403 Unauthorized | «newline» |
| Error 403 Unauthorized «newline» | You |
| Error 403 Unauthorized «newline» You | must |
| Error 403 Unauthorized «newline» You must | be |
| Error 403 Unauthorized «newline» You must be | logged |
| Error 403 Unauthorized «newline» You must be logged | in |
| Error 403 Unauthorized «newline» You must be logged in | to |
| Error 403 Unauthorized «newline» You must be logged in to | view |
| Error 403 Unauthorized «newline» You must be logged in to view | this |
| Error 403 Unauthorized «newline» You must be logged in to view this | page |
| Error 403 Unauthorized «newline» You must be logged in to view this page | . |
| Error 403 Unauthorized «newline» You must be logged in to view this page . | «end» |
For each of these training inputs, and all of those we can get from every other webpage we find as well, the training program checks what its current function does with the input and then tries to tweak the numbers in its linear operators to get that behavior closer to the objective.
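A short sketch of how the table above can be generated from a token stream:

```python
def training_pairs(tokens):
    """Build the (input prefix, objective token) pairs shown in the table above."""
    return [(tokens[:i], tokens[i]) for i in range(len(tokens))]

page = ["Error", "403", "Unauthorized", "«newline»", "You", "must", "be",
        "logged", "in", "to", "view", "this", "page", ".", "«end»"]
for prefix, objective in training_pairs(page):
    print(" ".join(prefix), "->", objective)
```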
This is the only training the LLM gets. The goal is to get it to replicate the most common kinds of things that appear in its training text. There’s no cognitive model, no understanding, no cleverness: merely an effort to get a function that gets as close to replicating the training text as the function family will allow.
It is common for LLMs to be trained in two phases. First, we feed the model as much text as possible to get it good at replicating text; then we focus the training on just text we like, to get it to prefer that kind of text. For example, if we expect the input to be questions and want the output to be answers then, after we finish training it on all the text we can, we train it on just the Q&A forums again, pushing the function closer to those outputs. That’s part of why, when you enter a prompt like "How do I bake bread?" into an LLM, it is much more likely to start producing a recipe for bread than it is to start producing other text that might follow that question online, such as expanding the question with "Is it hard?" or responding with judgment like "Why would you even want to do that?" (The other part of the reason you get answers is that the tool will pad what you type with some additional text, such as prepending «start of question» and appending «end of question» «start of answer», however those starts and ends are encoded on the Q&A forums in the training set.)
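A minimal sketch of that padding step; the exact marker strings are made up here for illustration:

```python
def wrap_prompt(user_text):
    """Surround what the user typed with markers matching how Q&A text was
    marked in the training data (illustrative marker strings, not real ones)."""
    return f"«start of question» {user_text} «end of question» «start of answer»"

print(wrap_prompt("How do I bake bread?"))
```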
Linear operators, and by extension neural nets, have lists of numbers as both inputs and outputs. LLMs turn tokens into lists of numbers using a one-hot encoding, which works as follows:

- Make one long list of every token in the vocabulary (often around 100,000 of them).
- Encode a token as a list with that many numbers in it: a 1 in the token’s place in the vocabulary list and a 0 everywhere else.
For example, if the list started (the, a, an, of, in, …) then the word "of" would be one-hot encoded as (0, 0, 0, 1, 0, …). To turn a list of 100,000 numbers back into a word, we pick the largest number in the list (or, if we want more variation in output, randomly pick one of the largest numbers in the list) and use the word from that place.
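In Python, with a five-word vocabulary standing in for the real 100,000-word one:

```python
vocabulary = ["the", "a", "an", "of", "in"]   # a real list has on the order of 100,000 entries

def one_hot(word):
    """A 1 in the word's place in the vocabulary, 0 everywhere else."""
    return [1 if entry == word else 0 for entry in vocabulary]

def to_word(numbers):
    """Pick the place holding the largest number and return that word."""
    return vocabulary[numbers.index(max(numbers))]

print(one_hot("of"))                          # [0, 0, 0, 1, 0]
print(to_word([0.1, 0.7, 0.05, 0.9, 0.2]))    # of
```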
The one-hot encoding has the nice property that it doesn’t treat any word as special: they each get their own place in the list, none any different than any others. But we know that some words are more like each other than other words, and we want to have that similarity in our model.
One way of measuring similarity is how often a pair of words appears in the same context. For example, we can find many phrases on the Internet with the word "king" in them that differ from other phrases on the Internet with the word "queen" in them only by that one word (for example "the ___ sat on the throne"). It is much harder to find such phrases that interchange "king" and "of", so by this measure "king" and "queen" are much more similar to one another than either is to "of".
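A toy sketch of that measure, counting how many blanked-out phrases a pair of words share; the phrases are made up for illustration:

```python
from collections import defaultdict
from itertools import combinations

def shared_context_counts(phrases):
    """Count how many contexts (phrases with one word blanked out) each pair of words shares."""
    slot_fillers = defaultdict(set)            # blanked phrase -> words seen in the blank
    for phrase in phrases:
        words = phrase.split()
        for i, word in enumerate(words):
            blanked = tuple(words[:i] + ["___"] + words[i + 1:])
            slot_fillers[blanked].add(word)
    counts = defaultdict(int)
    for fillers in slot_fillers.values():
        for a, b in combinations(sorted(fillers), 2):
            counts[(a, b)] += 1
    return counts

phrases = ["the king sat on the throne", "the queen sat on the throne", "the king of spain"]
print(shared_context_counts(phrases)[("king", "queen")])   # 1 shared context
```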
To encode this notion of similarity, we’ll feed the one-hot encoding through a linear operator with many many fewer outputs than inputs; and similarly feed the output through a linear operator with many many fewer inputs than outputs. We call that first layer an encoder, the last layer a decoder, and the smaller list of numbers output by the encoder an encoding of a word. We train the encoder and decoder based on two objectives, both of which depend only on existing text online and not on any kind of human intervention:
The decoder should undo the encoder. If the encoder turns a one-hot list x into a shorter list y, then the decoder should turn y into some vector the same length as x and with its maximum in the same place. (Turning y back into x directly is not the goal because dimension reduction is not invertible; we just want to keep the maximal value in place to enable recovering the original word.)
If the same context appears twice with different words in the middle, the encodings of those words should be as similar (in a mathematical distance sense) as possible.
These two goals and a large corpus of text are all that are needed to train an encoder and decoder pair.
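A sketch of the shapes involved, with toy sizes and untrained (random) numbers standing in for trained ones:

```python
import numpy as np

vocab_size, encoding_size = 1_000, 50   # toy sizes; the text describes roughly 100,000 and 500

# The encoder and decoder are just linear operators (matrices); training chooses their numbers.
encoder = np.random.randn(encoding_size, vocab_size) * 0.01
decoder = np.random.randn(vocab_size, encoding_size) * 0.01

one_hot = np.zeros(vocab_size)
one_hot[3] = 1.0                        # the word in place 3 of the vocabulary

encoding = encoder @ one_hot            # 1,000 numbers -> 50 numbers
recovered = decoder @ encoding          # 50 numbers -> 1,000 numbers

# Objective 1: recovered should peak in the same place (3) as one_hot.
# Objective 2: words appearing in the same contexts should get nearby encodings.
print(int(np.argmax(recovered)))        # probably not 3 yet, because nothing has been trained
```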
A good encoding is interesting in and of itself. It can serve as a spelling corrector ("thier" and "their" appear in almost exactly the same contexts and thus will have very similar encodings, but "their" is much more common and thus probably what you meant to type), a wording brainstorming tool (looking for a word that means something like "happy" but also "regretful"? The encoding for "nostalgic" is likely to be similar to both), and a search generalizer (your search not turning up good hits? Try replacing its words with others that have similar encodings), among other uses.
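All three of those uses boil down to "find other words with nearby encodings", which a few lines of Python can sketch; the tiny hand-made encodings below only show the mechanics, while real ones come from a trained encoder:

```python
import numpy as np

def most_similar(word, encodings, top=2):
    """Rank the other words by how close their encodings are to this word's (cosine similarity)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    target = encodings[word]
    scored = [(cosine(target, vec), other) for other, vec in encodings.items() if other != word]
    return sorted(scored, reverse=True)[:top]

encodings = {"their": np.array([0.9, 0.1]), "thier": np.array([0.88, 0.12]),
             "king": np.array([0.1, 0.9])}
print(most_similar("thier", encodings))   # "their" comes out on top
```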
There is some magic in picking the encoding size. If it is too large, words will stay far from one another and the fact that "king" and "queen" are largely interchangeable will not be discovered. If it is too small, words will be shoved too close together and the differences between how "king" and "queen" are used will be lost. The encoding size is one of many parameters of the system (others include the depth, the size of each linear operator, and which limitations on operators to apply to each), meaning something we have to pick that has significant impact on the results but does not have an obvious rule to use to identify the best choice.
LLMs don’t just take one token as input: they take thousands of tokens. There are multiple ways that could be done, but a simple and sufficient one is to concatenate all the encoded tokens into one big list.
For example, if we encode each word into a list of 500 numbers, and we want to accept 10,000 words as input, then we’d have a 5,000,000-number list as the input. The first 500 numbers would be the encoding of the first word, followed by 500 numbers encoding the second, and so on.
But what if a user types fewer words than we allow? We simply pad it out to the desired length. Did you put 10 words into an LLM that supports 10,000? It will put 9,990 special «nothing» tokens before your ten words to make its input list of numbers.
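A sketch of that concatenate-and-pad step, assuming (as an illustration) that the «nothing» token encodes to all zeros:

```python
def build_input(encoded_words, context_size=10_000, encoding_size=500):
    """Concatenate the per-word encodings into one long list, padding the front
    with «nothing» entries (all zeros here, as an illustrative assumption)."""
    nothing = [0.0] * encoding_size
    padded = [nothing] * (context_size - len(encoded_words)) + encoded_words
    return [number for encoding in padded for number in encoding]

ten_words = [[0.1] * 500 for _ in range(10)]   # stand-ins for 10 encoded words
print(len(build_input(ten_words)))             # 5000000 numbers
```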
A major advancement of LLMs came when humans added something to the system that neural networks alone could not handle, no matter how well trained. The aspect of language they added is called self-attention. Self-attention is implemented by a transformer which changes the encoding of each word in a sequence of words, moving it closer to similar words in that same sequence.
Self-attention can express many kinds of linguistic patterns, but is most easily illustrated by considering a single word that has multiple meanings. For example, "earth" has one meaning that is similar to "air", "fire", and "water"; and another that is similar to "mercury", "venus", and "mars". A well-trained encoder will result in a list of numbers for "earth" that is similar to all of these words (often by having different numbers in the list shared with each). If a given input text contains both "earth" and "mars", a transformer will move the encoding of "earth" in that input closer to the encoding of "mars"; but if the input contains both "earth" and "fire", the transformer will move the encoding of "earth" in that input closer to the encoding of "fire".
Transformers help AI systems handle ambiguous texts, resolve context, and otherwise handle the complexities of language that trained neural nets alone don’t seem able to handle.
Transformers use a custom nonlinear combination of linear operations to find which words in a sequence are similar after encoding and to change them to be even more similar. Because transformers operate on trained encodings, they are partially trained; but their power comes from the human-designed nonlinear structure informed by the problem domain.
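A stripped-down sketch of that core idea in Python; real transformers wrap this in several learned linear operators, which are omitted here, and the toy encodings are hand-made just to show "earth" moving toward whichever neighbor appears with it:

```python
import numpy as np

def self_attention(encodings):
    """Each word's encoding becomes a similarity-weighted blend of the encodings
    of the words in the same input, pulling it toward the words it is most like."""
    similarity = encodings @ encodings.T              # how alike each pair of words is
    weights = np.exp(similarity)
    weights /= weights.sum(axis=1, keepdims=True)     # turn similarities into proportions
    return weights @ encodings                        # similarity-weighted blends

earth, fire, mars = np.array([0.5, 0.5]), np.array([1.0, 0.0]), np.array([0.0, 1.0])

# In an input containing "earth" and "fire", earth's encoding drifts toward fire's;
# with "earth" and "mars", it drifts toward mars's instead.
print(self_attention(np.array([earth, fire]))[0].round(2))
print(self_attention(np.array([earth, mars]))[0].round(2))
```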
LLMs are limited to completing text, but they do not have to be stand-alone tools. Some of the toolchains I’ve seen that include an LLM:
Pairing an LLM with another machine learning system specialized for a different medium, like images or audio. These other systems often have their own specialized, domain-specific component, analogous to the LLM’s transformers, to help overcome the limitations of neural nets by themselves.
Parsing the LLM’s output and searching for any assertions it makes in a standard search engine.
Using the LLM several times, paired with other tools. For example, if I ask «my question», the toolchain might first prompt the LLM with "what sources should I check to answer the question «my question»?", look those sources up with a search tool, and then prompt it again with "I know «what that source said», but still have a question: «my question»".
Attaching an LLM trained on help forums for a specific tool to that tool, repeatedly prompting it with things like "I’m in this state and see this message; what should I do to get things to work?" and then doing whatever the LLM says.
This is not an exhaustive list, and will become less and less exhaustive as this post ages and more such LLM-included tools are designed in the future. Any process where the average wisdom of the Internet masses could be useful could benefit from the integration of an LLM.
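As a sketch of the multi-step pattern from the list above, with `llm` and `search` as hypothetical stand-ins rather than real APIs:

```python
def answer_with_sources(question, llm, search):
    """Ask the LLM what to look up, look it up, then ask the original question again
    along with what the source said. Both `llm` and `search` are hypothetical stand-ins."""
    sources = llm(f"what sources should I check to answer the question {question}")
    what_the_source_said = search(sources)
    return llm(f"I know {what_the_source_said}, but still have a question: {question}")

# Wiring it up with dummy stand-ins just to show the flow:
print(answer_with_sources(
    "How do I bake bread?",
    llm=lambda prompt: f"[completion of: {prompt}]",
    search=lambda query: "[text of the top search result]",
))
```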
When I type a prompt into an LLM, it is first turned into tokens, and padding such as «end of question» and the like are added. The model is then asked, in effect, "if you saw this sequence of tokens on some webpage, what token would be most likely to come next?" That token is added to the sequence and the question is asked again, over and over, until the answer is the «end of answer» token.
The tool that is generating tokens is a function from a general function family consisting of two of the most easily-computed functions we know of: linear operators and rounding to integers.
Large linear operators have billions of numbers inside them. Some of these numbers may be set by human engineers’ beliefs about the problem domain; the rest are chosen by mathematically-informed guess-and-check where the things to check against come from billions of web pages: enough data that even guess-and-check can eventually converge on something useful.
Nothing in the LLM contains anything like understanding. It finds similarities and patterns in uninterpreted token streams and uses those to try to extend incomplete sequences.
Longer prompts have more power; short prompts waste most of your LLM’s thousands of input tokens.
LLMs don’t think, they don’t have intent or understanding; they don’t even know that the tokens they are generating mean anything at all.
LLMs are like search engines that, instead of giving a list of millions of hits, approximately combine them all into one synthesized, median reply.
LLM designers try to steer their LLMs toward mimicking the helpful parts of the Internet, but identifying the helpful parts isn’t easy, and there aren’t enough of them for LLM training to work well if given only those parts, so the parts you wish didn’t exist are part of their training too.
Telling an LLM "that was rude/wrong" will get the kinds of responses that follow such a message on a discussion forum.
Prompts that could appear in academic webpages will get replies like academics write, while prompts that could appear in flame wars with trolls will get those kinds of replies – not because of intent or snark or anything like that: the LLM is just trying to finish the pattern you began.
The farther you go from what is common and has been done and posted online millions of times before, the less helpful LLMs are.