How LLMs work

Large language models (LLMs) are the engines behind today’s chatbots: ChatGPT, Claude, Gemini and the rest. Here is what they actually are, why they sound so sure of themselves, and where they can go wrong.

Most explanations of how these tools work are either too simplistic or written for engineers. This page tries to explain just enough to use them well (no maths, no jargon), then stops. If you understand this much, you will make better decisions than most people using these tools, including many who build with them.

What large language models actually are

A large language model is a program that does one core thing: it predicts the next word. You give it some text, and it guesses what is most likely to come next, then the next, then the next.
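For readers who want to see the idea in miniature, here is a toy sketch in Python. It is not how a real LLM is built: it simply counts which word follows which in one invented sentence, then keeps appending the most frequent follower. The function names and the sample text are made up for illustration.

```python
from collections import Counter, defaultdict

def train(text):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def continue_text(follows, start, length=5):
    """Repeatedly append the most frequently seen next word."""
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break  # this word was never followed by anything in training
        out.append(options.most_common(1)[0][0])
    return " ".join(out)

# Invented sample text, standing in for the model's training data.
corpus = "the court held that the appeal failed and the court held that costs follow"
model = train(corpus)
print(continue_text(model, "the", length=3))  # the court held that
```

The toy has no idea what a court is. It continues the text with whatever most often came next, which is the same basic move a real model makes at vastly greater scale.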

Strictly speaking, it predicts tokens, not words. A token is usually a word or part of a word. That sounds like a technicality, but it is one reason a model can stumble on spelling, exact word counts, citations, section numbers and number formats: it does not read text quite the way you do.
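The sketch below gives a rough feel for what splitting text into tokens looks like. It is a simplified stand-in, not any provider's actual tokenizer: the vocabulary is invented, and the rule (take the longest known piece, otherwise a single character) is only one of several possible approaches.

```python
# Hypothetical vocabulary of known pieces. Real vocabularies hold tens of thousands.
VOCAB = {"un", "reason", "able", "ness", "12"}

def tokenize(word, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become one-character tokens."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # no known piece starts here
            i += 1
    return tokens

print(tokenize("unreasonableness"))  # ['un', 'reason', 'able', 'ness']
print(tokenize("1234"))              # ['12', '3', '4']
```

Notice that the number 1234 arrives as three separate pieces. A model working on pieces like these never sees the number as a single unit, which helps explain slips with figures and section numbers.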

That prediction engine is the core trick. But the chatbot you actually use is a product built around the model, and the product adds layers the raw model does not have: instructions, safety rules, and tools such as web search, file upload, memory and connectors. Keep that distinction in mind throughout this page. The model predicts text; the product around it adds the rest.

The model learned to predict by reading an enormous amount of text (much of the public internet, a great many books, and more), adjusting billions of internal settings until it became very good at the prediction game. Nobody programmed in the rules of grammar or the facts of the world. It absorbed patterns from the text until it could continue almost any passage plausibly.

Why they sound so confident, and are sometimes wrong

Unless a search, database, file or connector tool is switched on, the model is not looking anything up. It is not consulting a register of facts. It is producing the most plausible continuation of your text, from the patterns it absorbed in training. Most of the time the most plausible-sounding answer is also correct, which is why these tools are useful. But when the plausible answer is wrong, the model gives it anyway, with exactly the same confidence. That failure has a name: a "hallucination".

Modern chatbots are also trained to sound helpful, clear and direct. That makes them easier to use, but it can make uncertainty disappear from the style of an answer. A wrong answer can arrive with the same polished confidence as a right one.

When a model invents a court case that sounds real, a correctly formatted citation to a source that does not exist, or a statute section that says something the real section does not, it is not lying. It has no independent way to check what is true; the confident tone is a style of output, not proof that the answer is right. Human review of the output is therefore not optional for anything that matters.

When a tool is switched on, the product may fetch outside material and feed it to the model. That helps, but the model can still misunderstand, omit or misstate what it retrieved, so the same checking applies.

What they are genuinely good at, and what they are not

They are strongest when you give them the material and ask them to work on it: summarise a document, compare two versions, redraft a clumsy paragraph, explain a concept in plainer words, turn notes into prose, suggest a structure. They are weaker when you ask for facts you did not give them: a case on a point of law, the details of a person, the contents of a document they cannot see. That is exactly where the confident-but-wrong failure shows up.

The safest pattern is to bring the material with you. Give the model the statute, judgment, contract, email chain or policy you want it to work on, then ask it to summarise, compare, redraft or explain that. That is very different from asking a generic chatbot to recall a case or a legal rule from its training.

Many professional legal tools work this way: they search a verified database first, pull back the relevant passages, and feed those to the model. That is safer than relying on the model's memory alone, but it is not magic. You still need to check that the retrieved source is the right one, that it is current, and that the model has described it accurately. Even when you supply the material yourself, check the result against the source when the answer matters.
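The search-first pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular legal product works: the passages are invented placeholder text (not real law), the word-overlap ranking is deliberately crude, and the function names are hypothetical.

```python
import re

def words(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, passages, top_k=1):
    """Rank passages by how many words they share with the question."""
    q = words(question)
    ranked = sorted(passages.items(),
                    key=lambda item: len(q & words(item[1])),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question, retrieved):
    """Put the retrieved passages in front of the model, ahead of the question."""
    sources = "\n".join(f"[{name}] {text}" for name, text in retrieved)
    return (
        "Answer using only the sources below. "
        "If they do not answer the question, say so.\n"
        f"Sources:\n{sources}\n"
        f"Question: {question}"
    )

# Invented placeholder passages, standing in for a verified database.
PASSAGES = {
    "Placeholder policy, clause 1": "Staff must lock the office door when leaving at night.",
    "Placeholder policy, clause 2": "Annual leave requests need written approval from a manager.",
}
question = "Who approves annual leave requests?"
prompt = build_prompt(question, retrieve(question, PASSAGES))
print(prompt)
```

The model only ever sees what the search step hands it. If the search returns the wrong passage, or an out-of-date one, the answer inherits that mistake, which is why the retrieved source still needs checking.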

The foreign bias (jurisdictional drift)

There is a particular trap for New Zealand users: these models have read far more United States and United Kingdom material than New Zealand material. Ask a generic legal question and they may drift towards the legal system they have seen most often.

That can show up in small phrases, such as "attorney-client privilege" instead of legal professional privilege, or in larger mistakes, such as importing a United States rule into a New Zealand problem. A model can sound local while reasoning from foreign assumptions.

For New Zealand work, name the jurisdiction clearly and give the tool the local material to read: the Act, the regulation, the case, the guidance or the contract clause. Do not rely on a generic chatbot to know which legal system applies.

What they can and cannot see

The model itself does not remember you just because you once chatted with it. But the product around it may add memory, chat-history search, project context or saved instructions. Check the product's settings before assuming a new chat is private, blank, or disconnected from your earlier work.

Within a single conversation the model can only "see" a limited amount of text at once: its "context window", which is finite and measured in tokens. When a conversation or document is too large to fit, the product may drop older text, summarise it, or keep only the pieces it judges relevant. That helps, but it is lossy: the model can still forget, miss or distort earlier material, which is why a long chat can start to contradict itself.
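One simple way a product might cope with a full context window is to keep only the most recent messages that fit. The sketch below shows that single strategy. It assumes one token per word, which real tokenizers do not follow, and the function names and sample chat are invented for illustration.

```python
def count_tokens(text):
    # Assumption: one token per whitespace-separated word. Real tokenizers differ.
    return len(text.split())

def fit_to_window(messages, max_tokens):
    """Keep the most recent messages that fit the budget; drop everything older."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk back from the newest message
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

chat = [
    "The client is the tenant not the landlord",
    "Summarise clause four of the lease",
    "Now draft a letter for the client",
]
print(fit_to_window(chat, max_tokens=14))
# ['Summarise clause four of the lease', 'Now draft a letter for the client']
```

The dropped message was the one saying who the client is. The model is then asked to draft a letter without that fact in view, which is how a long chat can drift or contradict itself.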

The most common mistake is to assume the model is aware of you, your matter, or the world as it changes. Unless a tool fetches it, the model sees the text in front of it, in this conversation, and nothing else.

How they learn (and why their knowledge has a cut-off)

A deployed model version has a training cut-off: its internal settings reflect training data up to some point, which is why a model can be brilliant on settled matters and blank on last week's news. Providers can release newer versions, route you to a different model, or add live tools. But during an ordinary chat the model is not quietly updating its settings with new facts about the world.

Some tools now add a live-search step that fetches current information into the context window. That can make an answer more current, but it does not make it automatically correct: the same need to check the source still applies.

Where this comes from

This page is a plain-language summary. The technical detail sits in the providers' own documentation: for example Google's introduction to large language models, OpenAI on why models hallucinate, and Anthropic on the context window. The risk to lawyers is not hypothetical: courts have now sanctioned counsel in numerous cases for filing AI-fabricated case citations.