AI/ML @ AEM

A Semi-Technical LLM Primer

Written by Stephan J. Lemmer, Ph.D. | Apr 10, 2025 11:34:05 PM

 

This combination of capabilities and dramatic failures is surprising if you expect the (perfect but limited) behavior of a computer or the (imperfect but self-aware) behavior of a human, but it is a predictable consequence of how these models are trained.

Despite the numerous articles written on LLMs, few explain these concepts in a way that can be used for decision making. Many articles fail to provide any technical detail at all, treating LLMs as some great mystery, focusing on specific use-cases or challenges, and often suggesting a way to improve empirical performance (for a small fee paid to the writer, of course). Others get too in-depth too quickly, assuming a knowledge of machine learning that most individuals do not have. Further complicating these explanations are the wide variety of use-cases, architectures, and sometimes conflicting definitions that make up the world of AI[1].

What is an LLM?

So, despite the risks, I wanted to provide an article with a level of technical detail that is understandable to someone without a deep machine learning[2] background, yet still provides a basis for both current decisions and future understanding. I think the easiest way to understand LLMs is as follows:

LLMs are Excel’s trendline function, blown up to a mind-numbingly large scale.

This is probably surprising, since conversing with an LLM feels like conversing with a human and Excel fails at counting if you don’t highlight enough cells. Even so, the same premises---and therefore advantages and challenges---are applicable: neural architectures (including LLMs) use a given dataset to build a function---a mathematical formula that maps an input to its most likely output---and use that function to produce an output for new inputs[3].

For example, let’s consider the plot in Figure 1. It has:

• A set of known inputs on the X axis.
• A set of known outputs corresponding to those inputs on the Y axis.
• A model (trendline) that estimates the relationship between the inputs and outputs.

Figure 1: Data and trendline for the function y = 0.05x+4
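If you want to see the trendline version in code, here is a minimal Python sketch. The data is made up to roughly follow the line in Figure 1, and numpy is assumed to be available:

    import numpy as np

    # Hypothetical data that roughly follows y = 0.05x + 4, like Figure 1
    x = np.arange(0, 100)
    y = 0.05 * x + 4 + np.random.normal(scale=0.5, size=x.shape)  # noisy observations

    # Fit a degree-1 polynomial (a straight line): this is all Excel's trendline does
    slope, intercept = np.polyfit(x, y, deg=1)

    # The fitted "model" can now produce outputs for inputs it has never seen
    new_x = 250
    predicted_y = slope * new_x + intercept
    print(f"y = {slope:.3f}x + {intercept:.3f}; prediction at x={new_x}: {predicted_y:.2f}")

The two numbers that come back---a slope and an intercept---are the entire “model,” and they can be applied to inputs that never appeared in the original data.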

How does this extend to text generation?

• Our input is a sequence of words[4] that have been converted to sets of numbers.
• Our output is a probability distribution across the model’s vocabulary. (Don’t worry, I’ll explain that below)
• Our model, therefore, predicts how likely every word is to follow the input sequence.

Because the model---like our Excel trendline---is only able to convert numbers to other numbers, we need to find a way for numbers to represent words. We do this by training the model to produce[5] a number for every word that the model knows, which can be thought of as the model saying I think that xx% of the time, the next word will be…

As an intuitive example: for the input the banana is, the output of the model may be something along the lines of:

• 50% yellow
• 28% ripe
• 10% brown
• 8% green
• 3% smelly
• [many more words]
• 0.000001% cat

If the model is accurate, the words with the highest probability make sense, even if they don’t mean the same thing: yellow, ripe, and green are all reasonable descriptors for a banana. Because the high-probability words all make sense, we can choose from them randomly based on these probabilities: 50% of the time, the output would be yellow, while 28% of the time the output would be ripe[6]. The chosen word is then added to the input, and the process repeats until the model outputs a special word that means stop.
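To make this loop concrete, here is a toy Python sketch of the generate-a-word, append-it, repeat process. The vocabulary, the probabilities, and the next_word_distribution function are all invented for illustration; a real LLM computes this distribution with billions of parameters rather than a hand-written table:

    import random

    # A hand-written stand-in for the model: maps a context to a probability
    # distribution over the next word. A real LLM computes this with a neural network.
    def next_word_distribution(context):
        if context.endswith("the banana is"):
            return {"yellow": 0.50, "ripe": 0.28, "brown": 0.10,
                    "green": 0.08, "smelly": 0.03, "<stop>": 0.01}
        return {"<stop>": 1.0}  # this toy model gives up on anything else

    def generate(context):
        while True:
            dist = next_word_distribution(context)
            words, probs = zip(*dist.items())
            choice = random.choices(words, weights=probs)[0]  # sample by probability
            if choice == "<stop>":                # the special "stop" word ends generation
                return context
            context = context + " " + choice      # the chosen word becomes part of the input

    print(generate("the banana is"))  # usually "the banana is yellow", sometimes "... brown"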

This process explains why you can get completely different answers when you click regenerate response. Once the system chooses yellow, it is a fact in the model’s world that the banana is yellow. If the system chooses brown (which will happen 10% of the time in our example), the model now lives in a world where the banana is brown. This can have significant effects on the full text, particularly as it gets longer.

For example, complete the sentences:

• The banana is yellow because…
• The banana is brown because…

The result of a dice roll has significantly changed the content of the sentence. In this case both sentences can be correct, but in many cases that is not true.

Training the Model (or Fitting the Line)

While some speak of training a model as a dark magic, the truth is that the goal of training is fundamentally the same as drawing a trendline on a scatterplot: find the parameters---slope and intercept for a line---that most accurately map the given input points to the desired outputs. If we accept that words and sequences of words can be represented as numbers, as described in the previous section, there are only two big differences: 1) your LLM has a LOT more parameters (typically in the tens to hundreds of billions for an LLM vs. 2 for a line), and 2) where your trendline is fit to a single objective, the LLM is optimized for multiple objectives in two basic stages.
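As a rough sketch of what “finding the parameters” looks like, the following Python snippet fits the slope and intercept by repeatedly nudging them to reduce the average error. Unlike np.polyfit above, which solves for the line directly, this version finds it the way neural networks (including LLMs) are trained: by iterative adjustment. The data, learning rate, and number of steps are arbitrary choices for illustration:

    import numpy as np

    x = np.arange(0, 10, dtype=float)
    y = 0.05 * x + 4                      # the known outputs we want the model to reproduce

    slope, intercept = 0.0, 0.0           # start from a deliberately bad guess
    learning_rate = 0.01

    for step in range(5_000):
        prediction = slope * x + intercept
        error = prediction - y
        # Gradients of the mean squared error with respect to each parameter
        grad_slope = 2 * np.mean(error * x)
        grad_intercept = 2 * np.mean(error)
        # Nudge each parameter slightly in the direction that reduces the error
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept

    print(f"learned: y = {slope:.3f}x + {intercept:.3f}")  # approaches y = 0.05x + 4

An LLM’s training loop has the same shape; the differences are the length of the parameter list and what counts as “error.”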

Stage 1: Pre-Training

The first stage is called pre-training, where the model reads a scrape of the internet and attempts to guess the next word. This serves as a practical way to teach the model both grammar and facts without any human supervision: if the model has seen the sequence the sixteenth president of the United States was enough times, it will reliably produce the word Abraham. Notably, like the trendline produced by Excel, it doesn’t need an exact match in the training set to make an accurate prediction. Similar to how 1.3 is close to 1, the numbers representing the input the sixteenth president of the United States was are close to the numbers representing the input question: who was the sixteenth president of the United States? Answer:
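As a simplified illustration of where the pre-training examples come from, the sketch below slices one sentence into (input sequence, next word) pairs. Real pre-training does this across trillions of words, and splits text into tokens rather than whole words (see footnote 4):

    # Turn a raw sentence into (input sequence -> next word) training examples.
    # Real pre-training does this over an enormous scrape of the internet.
    text = "the sixteenth president of the united states was abraham lincoln"
    words = text.split()

    training_examples = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])   # everything the model gets to see
        target = words[i]               # the word it is trained to predict
        training_examples.append((context, target))

    for context, target in training_examples[:3]:
        print(f"{context!r} -> {target!r}")
    # 'the' -> 'sixteenth'
    # 'the sixteenth' -> 'president'
    # 'the sixteenth president' -> 'of'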

This ability to generalize not only gives us the ability to answer new or rephrased questions, it also gives the LLM the ability to lie convincingly: the statements the sixteenth president of the United States was… and the first president of the United States was… are highly similar (as they probably are in your mind). If the model is imperfect or uncertain (or just too “creative”), it may also assign some probability to the output George, which---if you recall the previous section---means that the model now lives in a world where it can still express itself in the style of a history professor, but George Washington was the sixteenth president.

Stage 2: Fine-Tuning

Although the model is technically just an autocomplete function at this point, the original GPT-3 paper showed that it can already perform admirably on challenging tasks through this pre-training process alone. However, such a model is flawed in important ways: from an interface perspective, it is not able to have conversations or follow instructions. From a safety and bias perspective… it is a text completion model trained on the internet[7].

For this reason, there is a fine-tuning stage where we turn our model into a polite, safe, and helpful conversational agent. The details of how this is done vary, but at a high level two fine-tuning techniques are used for LLMs:

• Dataset-based fine-tuning makes the model behave conversationally and follow instructions. Like pre-training, dataset-based fine-tuning is performed by predicting the next word across a large number of text documents. Unlike pre-training, however, these are curated documents, such as chat logs, that more accurately represent desired behavior but are more expensive to obtain and less thematically diverse.
• Reinforcement Learning from Human Feedback (RLHF) is used to align the model, or make it more accurately represent human values, desires, and expectations. In RLHF, a human reads outputs from a large language model and judges whether or not the output is acceptable (however we define acceptable) or whether one output is better than another. This feedback is used to train a different (smaller) model to judge which outputs are high- and low-quality, and that smaller model is in turn used to train the LLM (a sketch of the comparison step appears below). In the trendline metaphor, a human is asking the computer to ignore certain areas on the Y axis when fitting the trendline.
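For readers who want slightly more detail, here is a minimal, hypothetical sketch of the comparison step used to train that smaller “judge” model: it is rewarded for scoring the human-preferred output above the rejected one. The judge_score function is a placeholder; in practice it is itself a neural network whose parameters are adjusted to shrink this loss:

    import math

    def judge_score(output_text):
        # Placeholder: in real RLHF this is a neural network that returns a
        # scalar "how good is this output" score, learned from many comparisons.
        return 0.0

    def preference_loss(preferred, rejected):
        # Pairwise loss: small when the judge already ranks the human-preferred
        # output above the rejected one, large otherwise.
        margin = judge_score(preferred) - judge_score(rejected)
        return -math.log(1.0 / (1.0 + math.exp(-margin)))

    # One human comparison from the feedback data
    loss = preference_loss(
        preferred="I'm not sure, but here is what I found...",
        rejected="Here is a confident-sounding answer I made up...",
    )
    print(loss)  # training would adjust the judge's parameters to shrink this number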

Conclusion

Large Language Models are incredibly powerful and useful tools, yet this power and the pull of anthropomorphism have led to a lot of speculation about their capabilities and their future. Some of this is science fiction, some of it is justified, but all of it will benefit from a basic intuition of how LLMs are created.

Although the LLMs-as-a-trendline metaphor is imperfect, it provides a solid intuition for not just LLMs, but neural models in general: they all seek parameters that map a given set of inputs to outputs, in the hope that the resulting function generalizes accurately to new inputs. In future articles I will extend this metaphor, discussing how it relates to commonly referenced problems such as hallucinations and jailbreaking, as well as what it means for how to appropriately utilize LLMs and other AI techniques.

[1] This also introduces significant challenges in the creation of policy that are unfortunately overlooked in public discourse. But that’s a different post.

[2] You’ll see me use the term machine learning throughout this article, since I consider generative AI to be a subtopic of machine learning, and most of the ideas I’m introducing are more broadly applicable than just LLMs or generative AI.

[3] Neural networks are often referred to as universal function approximators, meaning that while your Excel trendline can give you a y = mx+b estimate for your data, a sufficiently large neural network can give you an estimate for any possible function. By the way, I consider GenAI tools to be neural networks, too.

[4] Words isn’t technically correct, but the distinctions between words, tokens, and embeddings are unimportant at this level of detail.

[5] That is, placing a value on the y-axis of our scatterplot, if we’re speaking in terms of the trendline.

[6] The temperature parameter changes this distribution so that the probability of less likely outputs increases, but the ordering of which outputs are more likely than others remains the same. It makes the model “more creative” by allowing it to output words it believes are less likely to be correct.
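As a rough sketch of the effect, using the banana probabilities from earlier (numpy assumed): applying a temperature T is equivalent to raising each probability to the power 1/T and renormalizing, so higher temperatures flatten the distribution without changing the ordering:

    import numpy as np

    probs = np.array([0.50, 0.28, 0.10, 0.08, 0.03, 0.01])  # yellow, ripe, brown, green, smelly, ...
    temperature = 2.0                                        # T > 1 flattens the distribution

    rescaled = probs ** (1.0 / temperature)
    rescaled = rescaled / rescaled.sum()
    print(rescaled.round(3))  # same ordering, but rare words like "smelly" gain probability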

[7] You are likely thinking about things like social media and conspiracy blogs, but bias and safety issues also emerge from high-quality data---such as reputable news websites---or, some early findings suggest, bias mitigation techniques themselves.

AEM's AI team stands out for our expertise in realizing the benefits of human-in-the-loop approaches in deep-learned systems, and we offer capabilities across a range of traditional ML areas. Contact us at ai@aemcorp.com to explore challenges your team is facing.