Large Language Models
© 2025-09-22 Luther Tychonievich
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
How LLMs work, with parallels for other machine learning systems.

I spent more than a week working on this post. My goal is to expose the inner workings of LLMs while assuming the reader has little to no mathematical or computational background. I don’t think I got the balance perfectly correct, but I do hope this is useful to some.

Machine Learning

When people say machine learning, they mean something like this: pick a large family of candidate functions, pick an objective that measures how well each function does the job you care about, and have a computer try function after function from that family, keeping whichever one scores best on the objective.

This is learning only in the loosest sense: there’s no meaning-making, no understanding, no abstraction, no growth, just trying a whole bunch of things and keeping the one that seems best.
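As a toy illustration of trying a whole bunch of things and keeping the one that seems best (the task, the function family, and all the numbers below are made up purely for illustration):

```python
import random

# Made-up training data: we want a function that roughly doubles its input.
examples = [(1, 2), (2, 4), (3, 6), (4, 8)]

def badness(f):
    """Objective: how far is this function's output from what we wanted?"""
    return sum(abs(f(x) - y) for x, y in examples)

# The "function family": every function of the form f(x) = a*x + b.
def make_candidate(a, b):
    return lambda x: a * x + b

# Guess-and-check: try many random members of the family, keep the best.
best, best_score = None, float("inf")
for _ in range(10_000):
    a, b = random.uniform(-5, 5), random.uniform(-5, 5)
    candidate = make_candidate(a, b)
    score = badness(candidate)
    if score < best_score:
        best, best_score = candidate, score

print(best(10), best_score)   # should print something close to 20
```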

Machine learning rarely works as well as its name might make you think it would. That said, recently a few specific cases have worked well. The purpose of this post is to explain the most visible of those cases: large language models (LLMs).

Before going deeper into LLMs, it is worth understanding that all of the news-making AI breakthroughs of the past several years have depended on the mindbogglingly immense training set that is the Internet. This is why they’re getting good at Internet-common things (like chatting and answering questions and writing articles and making pictures and videos) much more rapidly than they’re getting good at offline things (like doing dishes and opening doors and walking dogs).

Neural networks

When people say neural networks, they mean a particular family of functions: chains of linear operators, each followed by a simple detail-discarding step.

They are neural only in the sense that once upon a time someone ignored almost everything we know about neurons and re-posed what was left as a linear operator followed by a detail discard. That vague analogy is unrelated to the effectiveness of neural networks, and trying to make systems that work more like neurons has thus far proven to be counter-productive.

Linear operators are a very simple class of functions that has been studied by mathematicians for centuries. Some of the reasons we like linear operators as a function family for machine learning include:

  1. they are very fast for computers to apply, even when the lists of numbers involved are huge;
  2. they are easy to analyze mathematically; and
  3. small tweaks to the numbers inside them make small, predictable changes to their outputs, which is what makes guess-and-check training workable.

On their own, though, linear operators are not a good function family for machine learning, for two reasons. First, being easily analyzed means they can only represent functions that are easily analyzed, and we’re interested in having computers do things that are complicated and hard to analyze. Second, linear operators with many inputs are too sensitive to small differences and can fail to generalize from specific inputs to the general case.

To get around the limitations of linearity, neural nets follow each linear operator with a detail-discarding operator. The most common of these rounds any number above some threshold to 1 and any number below it to 0. That’s a nonlinear thing to do, but it’s simple enough that we keep much of the computational efficiencies of linearity. Because the detail discard is nonlinear it allows us to chain multiple operators together and get something more than just a single operator.

Putting this together, common neural networks

  1. take an input, which is a list of many numbers;
  2. apply a linear operator to get a different list of many numbers;
  3. threshold it to get a list of many 0s and 1s;
  4. apply a different linear operator to get a third list of many numbers;
  5. threshold it to get a list of many 0s and 1s;
  6. apply a third linear operator to get a fourth list of many numbers;
  7. threshold it to get a list of many 0s and 1s;
  8. … and so on.

The number of linear operators in this sequence is called the depth of the neural network. The deeper the network, the less linearity limits its behavior, but also the less linearity helps make its training efficient. (The idea that several linear things punctuated by nonlinearity can approximate nonlinear things may be familiar from connect-the-dots illustrations, piecewise-linear functions, polygonal computer models, and the trapezoid rule, all of which approximate curves by a sequence of line segments. Those examples all segment the domain into pieces and handle each segment linearly; neural nets instead segment the computation process into linear pieces, not the input domain.)
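Putting the numbered steps above into a short sketch (the sizes and the numbers inside the linear operators here are made up purely for illustration; real networks use lists with thousands or millions of entries):

```python
import numpy as np

def threshold(numbers, cutoff=0.0):
    """The detail-discarding step: 1 if above the cutoff, 0 otherwise."""
    return (numbers > cutoff).astype(float)

def neural_network(inputs, operators):
    """Alternate linear operators (here, matrices) with thresholding."""
    values = inputs
    for matrix in operators:
        values = matrix @ values    # apply a linear operator
        values = threshold(values)  # discard detail
    return values

# A made-up depth-3 network taking 4 numbers in and producing 2 numbers out.
rng = np.random.default_rng(0)
operators = [rng.normal(size=(6, 4)), rng.normal(size=(5, 6)), rng.normal(size=(2, 5))]
print(neural_network(np.array([0.2, -1.0, 3.5, 0.7]), operators))
```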

Token streams

LLMs complete text. Their input is a sequence of a few thousand words and their output is one more word to stick on the end of that sequence. If we want more than one word, we add the first output word to the original input and try again, getting a second word; and so on until the LLM outputs a special word that means end of text.
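In outline, that loop looks something like the sketch below; next_token here is a stand-in for the trained model, not a real library call:

```python
def complete(tokens, next_token, end_token="«end»", limit=1000):
    """Repeatedly ask the model for one more token until it says it is done."""
    tokens = list(tokens)
    for _ in range(limit):        # a safety cap on output length
        new = next_token(tokens)  # the LLM: sequence in, one token out
        tokens.append(new)        # stick it on the end
        if new == end_token:      # the special "end of text" token
            break
    return tokens
```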

I say word, but that’s not as nicely defined as we might hope. Are “such” and “Such” the same word? Is “,” a word? If not, are “of,” and “of” the same word? These are questions that the human designing the LLM has to answer, resulting in an algorithm for turning text into a token stream, where token is the concrete realization of the idea of a word. The more effectively this algorithm can provide single-meaning tokens, the more efficient the training will be. This is one reason why text-based AIs like LLMs are more effective than voice-based AIs: we’re much better at making meaningful tokens from text than from sound.
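Real tokenizer algorithms are designed with far more care, but a crude sketch gives the flavor; this one decides that “such” and “Such” are the same token and that “,” is a token of its own, and tacks on the special «end» token:

```python
import re

def tokenize(text):
    """A crude tokenizer: lowercase everything and split punctuation off of words."""
    # \w+ matches runs of letters and digits; [^\w\s] matches each punctuation mark alone.
    return re.findall(r"\w+|[^\w\s]", text.lower()) + ["«end»"]

print(tokenize("You must be logged in to view this page."))
# ['you', 'must', 'be', 'logged', 'in', 'to', 'view', 'this', 'page', '.', '«end»']
```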

Training and objective

The training data for LLMs is text. Any text, from any source, and as much of it as possible, but processed in a particular way to make it easy to define an objective function.

To see how this training works, let’s consider a web page that is broken into the following 15 tokens:

Error 403 Unauthorized «newline» You must be logged in to view this page . «end»

This results in 15 training inputs and an objective for each:

Input | Objective
(no tokens yet) | Error
Error | 403
Error 403 | Unauthorized
Error 403 Unauthorized | «newline»
Error 403 Unauthorized «newline» | You
Error 403 Unauthorized «newline» You | must
Error 403 Unauthorized «newline» You must | be
Error 403 Unauthorized «newline» You must be | logged
Error 403 Unauthorized «newline» You must be logged | in
Error 403 Unauthorized «newline» You must be logged in | to
Error 403 Unauthorized «newline» You must be logged in to | view
Error 403 Unauthorized «newline» You must be logged in to view | this
Error 403 Unauthorized «newline» You must be logged in to view this | page
Error 403 Unauthorized «newline» You must be logged in to view this page | .
Error 403 Unauthorized «newline» You must be logged in to view this page . | «end»

For each of these training inputs, and all of those we can get from every other webpage we find as well, the training program checks what its current function does with the input and then tries to tweak the numbers in its linear operators to get that behavior closer to the objective.
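In code, producing such input/objective pairs from any token stream is simple; the sketch below uses the 15 tokens above:

```python
tokens = ["Error", "403", "Unauthorized", "«newline»", "You", "must", "be",
          "logged", "in", "to", "view", "this", "page", ".", "«end»"]

# Every prefix of the page becomes a training input; the token that follows it
# is the objective the model should try to produce.
training_pairs = [(tokens[:i], tokens[i]) for i in range(len(tokens))]

for inputs, objective in training_pairs[:3]:
    print(inputs, "->", objective)
# []                 -> Error
# ['Error']          -> 403
# ['Error', '403']   -> Unauthorized
```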

This is the only training the LLM gets. The goal is to get it to replicate the most common kinds of things that appear in its training text. There’s no cognitive model, no understanding, no cleverness: merely an effort to get a function that gets as close to replicating the training text as the function family will allow.

It is common for LLMs to be trained in two phases. First, we feed the model as much text as possible to get it good at replicating text; then we focus the training on just text we like to get it to prefer that kind of text. For example, if we expect the input to be questions and want the output to be answers then, after we finish training it on all the text we can, we train it on just the Q&A forums again, pushing the function closer to those outputs. That’s part of why, when you enter a prompt like How do I bake bread? into an LLM, it is much more likely to start producing a recipe for bread than it is to start producing other text that might follow that online, such as expanding the question with Is it hard? or responding with judgment like Why would you even want to do that? (The other part of the reason you get answers is that the tool will pad what you type with some additional text, such as prepending «start of question» and postpending «end of question» «start of answer», however those starts and ends are encoded on the Q&A forums in the training set.)

Encoding token streams

One-hot

Linear operators, and by extension neural nets, have lists of numbers as both inputs and outputs. LLMs turn tokens into lists of numbers using a one-hot encoding, which works as follows:

  1. Place all possible tokens that could ever appear in a big list. Let’s assume that’s a list of 100,000 tokens.
  2. Encode a word as a list of 99,999 zeros and exactly 1 one, with the one at the index of the token in the list of possible tokens.

For example, if the list started (the, a, an, of, in, …) then the word of would be one-hot encoded as (0, 0, 0, 1, 0, …).

To turn a list of 100,000 numbers back into a word, we pick the largest number in the list (or, if we want more variation in output, randomly pick one of the largest numbers in the list) and use the word from that place.
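Here is a sketch of both directions, using a five-word vocabulary instead of 100,000 and the same ordering as the example above:

```python
vocabulary = ["the", "a", "an", "of", "in"]   # a real list would be ~100,000 tokens long

def one_hot(token):
    """All zeros except a single 1 at the token's place in the vocabulary."""
    encoding = [0] * len(vocabulary)
    encoding[vocabulary.index(token)] = 1
    return encoding

def to_token(numbers):
    """Find the place of the largest number and return the word from that place."""
    return vocabulary[numbers.index(max(numbers))]

print(one_hot("of"))                          # [0, 0, 0, 1, 0]
print(to_token([0.1, 0.7, 0.05, 0.9, 0.2]))   # of
```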

Reduced dimension

The one-hot encoding has the nice property that it doesn’t treat any word as special: they each get their own place in the list, none any different than any others. But we know that some words are more like each other than other words, and we want to have that similarity in our model.

One way of measuring similarity is how often a pair of words appear in the same context. For example, we can find many phrases on the Internet with the word king in them that differ from other phrases on the Internet with the word queen in them only by that one word (for example, “the ____ sat on the throne”). It is much harder to find such phrases that interchange king and of, so by this measure king and queen are much more similar to one another than either is to of.

To encode this notion of similarity, we’ll feed the one-hot encoding through a linear operator with many many fewer outputs than inputs; and similarly feed the output through a linear operator with many many fewer inputs than outputs. We call that first layer an encoder, the last layer a decoder, and the smaller list of numbers output by the encoder an encoding of a word. We train the encoder and decoder based on two objectives, both of which depend only on existing text online and not on any kind of human intervention:

  1. decoding the encoding of a word should give back the original word; and
  2. the encoding of a word should help predict the words that appear near it in real text, so that words used in similar contexts end up with similar encodings.

These two goals and a large corpus of text are all that are needed to train an encoder and decoder pair.
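To give a sense of the shapes involved, here is a sketch with numpy; the matrices are random purely for illustration (in a real system their contents are learned from text), and the sizes are shrunk so the example runs quickly:

```python
import numpy as np

vocabulary_size = 1_000   # stands in for the ~100,000 of the running example
encoding_size = 50        # stands in for the few hundred a real system might use

rng = np.random.default_rng(0)
encoder = rng.normal(size=(encoding_size, vocabulary_size))  # long list in, short list out
decoder = rng.normal(size=(vocabulary_size, encoding_size))  # short list in, long list out

one_hot = np.zeros(vocabulary_size)
one_hot[3] = 1.0                  # the one-hot encoding of, say, the 4th word in the list

encoding = encoder @ one_hot      # the word's short encoding (50 numbers)
restored = decoder @ encoding     # back to 1,000 numbers; training pushes the largest
                                  # of these back toward position 3
print(encoding.shape, restored.shape)   # (50,) (1000,)
```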

A good encoding is interesting in and of itself. It can serve as a spelling corrector (thier and their appear in almost exactly the same contexts and thus will have very similar encodings, but their is much more common and thus probably what you meant to type), a wording brainstorming tool (looking for a word that means something like happy but also regretful? the encoding for nostalgic is likely to be similar to both), and a search generalizer (is your search not turning up good hits? try replacing its words with others that have similar encodings), among other uses.

There is some magic in picking the encoding size. If it is too large, words will stay far from one another and the fact that king and queen are largely interchangeable will not be discovered. If it is too small, words will be shoved too close together and the differences between how king and queen are used will be lost. The encoding size is one of many parameters of the system (others include the depth, the size of each linear operator, and which limitations to apply to each operator), meaning something we have to pick that has significant impact on the results but does not have an obvious rule to use to identify the best choice.

Sequences

LLMs don’t just take one token as input: they take thousands of tokens. There are multiple ways that could be done, but a simple and sufficient one is to concatenate all the encoded tokens into one big list.

For example, if we encode each word into a list of 500 numbers, and we want to accept 10,000 words as input, then we’d have a 5,000,000-number list as the input. The first 500 numbers would be the encoding of the first word, followed by 500 numbers encoding the second, and so on.

But what if a user types fewer words than we allow? We simply pad it out to the desired length. Did you put 10 words into an LLM that supports 10,000? It will put 9,990 special «nothing» tokens before your ten words to make its input list of numbers.
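As a sketch (with the 500-number encodings and 10,000-word limit from the example above, and a stand-in encoder passed in as a parameter):

```python
def model_input(tokens, encode, max_words=10_000):
    """Pad the front with «nothing» tokens, then concatenate every word's encoding."""
    padded = ["«nothing»"] * (max_words - len(tokens)) + list(tokens)
    numbers = []
    for token in padded:
        numbers.extend(encode(token))   # each encoding is a list of numbers
    return numbers

# With a stand-in encoder that gives every token 500 numbers,
# a 10-word prompt still becomes a 5,000,000-number input list:
fake_encode = lambda token: [0.0] * 500
prompt = ["How", "do", "I", "bake", "bread", "?", "It", "seems", "hard", "."]
print(len(model_input(prompt, fake_encode)))   # 5000000
```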

Self-attention

A major advancement of LLMs came when humans added something to the system that neural networks alone could not handle, no matter how well trained. The aspect of language they added is called self-attention. Self-attention is implemented by a transformer, which changes the encoding of each word in a sequence of words, moving it closer to similar words in that same sequence.

Self-attention can express many kinds of linguistic patterns, but is most easily illustrated by considering a single word that has multiple meanings. For example, earth has one meaning that is similar to air, fire, and water; and another that is similar to mercury, venus, and mars. A well-trained encoder will result in a list of numbers for earth that is similar to all of these words (often by having different numbers in the list shared with each). If a given input text contains both earth and mars, a transformer will move the encoding of earth in that input closer to the encoding of mars; but if the input contains both earth and fire, the transformer will move the encoding of earth in that input closer to the encoding of fire.

Transformers help AI systems handle ambiguous texts, resolve context, and otherwise handle the complexities of language that trained neural nets alone don’t seem able to handle.

Transformers use a custom nonlinear combination of linear operations to find which words in a sequence are similar after encoding and to change them to be even more similar. Because transformers make use of trained encoders to operate, they are partially trained; but their power comes from the human-designed nonlinear structure informed by the problem domain.
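The following is not the arithmetic any production transformer uses, and it leaves out the learned linear operators that surround every step, but it sketches the core move-similar-words-closer idea:

```python
import numpy as np

def simplified_attention(encodings):
    """Move each word's encoding toward the encodings of similar words.

    encodings: one row per word in the input sequence.
    Real transformers wrap learned linear operators around every step shown here.
    """
    similarity = encodings @ encodings.T            # how alike is each pair of words?
    weights = np.exp(similarity)                    # emphasize the most similar pairs...
    weights /= weights.sum(axis=1, keepdims=True)   # ...and make each row sum to 1
    return weights @ encodings                      # each word becomes a blend of similar words

# Three made-up 4-number encodings: the first two are similar, the third is not.
words = np.array([[1.0, 0.0, 1.0, 0.0],
                  [0.9, 0.1, 0.8, 0.0],
                  [0.0, 1.0, 0.0, 1.0]])
print(simplified_attention(words))
```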

LLM within a toolchain

LLMs are limited to completing text, but they do not have to be stand-alone tools. Some of the LLM-included toolchains I’ve seen include:

This is not an exhaustive list, and will become less and less exhaustive as this post ages and more such LLM-included tools are designed in the future. Any process where the average wisdom of the Internet masses could be useful could benefit from the integration of an LLM.

Conclusion

Summary of LLM operation

When I type a prompt into an LLM,

  1. What I typed is split into tokens by human-written code.
  2. Extra tokens for «end of question» and the like are added.
  3. That sequence of tokens is sent into a tool that tries to answer if you saw this sequence of tokens on some webpage, what token would be most likely to come next?
  4. The answer token is added to the sequence, and the question is asked again with that larger sequence.
  5. The answer token is added to the sequence, and the question is asked again with that larger sequence.
  6. The answer token is added to the sequence, and the question is asked again with that larger sequence.
  7. … and so on, until the answer is the «end of answer» token.

The tool that is generating tokens is a function from a general function family consisting of two of the most easily-computed functions we know of: linear operators and rounding to integers.

Large linear operators have billions of numbers inside them. Some of these numbers may be set by human engineers’ beliefs about the problem domain; the rest are chosen by mathematically-informed guess-and-check where the things to check against come from billions of web pages: enough data that even guess-and-check can eventually converge on something useful.

Nothing in the LLM contains anything like understanding. It finds similarities and patterns in uninterpreted token streams and uses those to try to extend incomplete sequences.

Consequences for LLM users