How LLMs Work
Large Language Models are transformer-based neural networks trained to predict the next token in a sequence. Understanding the core mechanics helps you prompt more effectively, interpret outputs more accurately, and know when to trust -- or question -- a model's response.
Tokens, Not Words
LLMs process text as tokens, not words. A token is a subword unit produced by a byte-pair encoding (BPE) tokenizer; in English prose, one token is roughly 0.75 words (about four characters). Thinking in tokens explains why models sometimes split words oddly, why context windows are measured in tokens, and why code often costs more tokens than prose.
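To make the idea concrete, here is a minimal sketch of how BPE builds subword units by repeatedly merging the most frequent adjacent pair of symbols. This is an illustration only, not a real tokenizer -- production tokenizers (GPT-style BPE and similar) learn tens of thousands of merges from large corpora, and the function names here are hypothetical.

```python
# Toy BPE merge loop: start from characters, repeatedly fuse the most
# frequent adjacent pair into a new symbol. Illustrative only.
from collections import Counter

def most_frequent_pair(symbols):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from individual characters and apply a few merge steps.
symbols = list("low lower lowest")
for _ in range(2):
    symbols = merge_pair(symbols, most_frequent_pair(symbols))
print(symbols)  # after two merges, "low" has become a single symbol
```

After enough merges, frequent strings become single tokens while rare words stay split into pieces -- which is exactly why a model can split an unusual word in a surprising place.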
Attention Mechanism
The transformer's self-attention mechanism lets every token attend to every other token in the context window. This all-pairs comparison is why LLMs can reason over long documents -- but also why the attention computation scales quadratically with context length.
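The all-pairs structure can be sketched in a few lines of NumPy. This is a simplified single-head version with no learned projection matrices, no masking, and no multi-head split -- just enough to show where the quadratic cost comes from.

```python
# Minimal single-head self-attention sketch (no learned weights, no masking).
# Real transformers project queries, keys, and values with learned matrices.
import numpy as np

def self_attention(x):
    """x: (seq_len, d) token embeddings. Every row attends to every row."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)   # (seq_len, seq_len): one score per token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ x              # each output is a weighted mix of all tokens

x = np.random.default_rng(0).normal(size=(6, 4))   # 6 tokens, 4-dim embeddings
out = self_attention(x)
print(out.shape)   # (6, 4)
```

The `scores` matrix has one entry per pair of tokens, so its size -- and the work to compute it -- grows as the square of the sequence length.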
Temperature and Sampling
After computing a probability distribution over possible next tokens, the model samples from it. Temperature controls the sharpness of that distribution: the logits are divided by the temperature before the softmax, so 0 means always pick the most likely token (greedy, deterministic), 1 means sample in proportion to the model's probabilities, and values above 1 flatten the distribution toward more random choices.
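The sampling step can be sketched with only the standard library. The logit values below are hypothetical; real models produce logits over a vocabulary of tens of thousands of tokens.

```python
# Temperature-scaled sampling from raw logits, standard library only.
import math
import random

def sample(logits, temperature, rng=random):
    """Pick a token index; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]    # sharpen (<1) or flatten (>1)
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]      # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):                 # inverse-CDF draw
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]      # hypothetical next-token logits
print(sample(logits, 0))      # greedy: always index 0
```

Note how low temperatures exaggerate the gap between logits (the division makes large values dominate the softmax), while high temperatures shrink it.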
Knowledge Cutoffs
LLMs know nothing about events after their training cutoff. They also have no real-time access unless given tools. Always verify time-sensitive facts with retrieval or web search rather than trusting training-data recall.
Hallucination
LLMs generate plausible-sounding text by predicting likely next tokens -- they are not retrieval systems and have no concept of "truth." This means they can confidently generate false facts, fake citations, or incorrect code. Always verify outputs for high-stakes decisions.