Tokens, explained: the hidden meter behind every AI bill

Everyone in AI is talking about tokens right now: running out of them, optimizing them, staying under the limit. Almost nobody stops to explain what a token actually is. That is backwards, because you cannot optimize a number you do not understand. So here is the plain version, with the real figures, and then the few changes that actually move your bill.

Think of a token as a click on a meter. Every word you send moves the meter. Every word the model writes back moves it further. Cost, speed, and how fast you hit your usage limits all trace back to that one running number.

What a token actually is

A token is a chunk of text, usually a little smaller than a word. Common short words ("the", "is") are one token each. Longer or unusual words get split into pieces, so "unforgettable" might be three. A rule of thumb OpenAI has published for years: about 100 tokens to every 75 words of English.
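
As a rough illustration of that rule of thumb, here is a minimal sketch that turns a word count into a token estimate. It only applies the 100-per-75 ratio quoted above; it is not a tokenizer, and the function names are made up for this example.

```python
# Rule of thumb quoted above: about 100 tokens for every 75 English words.
TOKENS_PER_75_WORDS = 100

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough estimate only; rounds up using integer arithmetic."""
    return (word_count * TOKENS_PER_75_WORDS + 74) // 75

def estimate_tokens(text: str) -> int:
    """Count whitespace-separated words, then apply the ratio."""
    return estimate_tokens_from_words(len(text.split()))

if __name__ == "__main__":
    print(estimate_tokens_from_words(75))    # 100
    print(estimate_tokens_from_words(1500))  # 2000
```

Treat the result as a ballpark for English prose. Code, non-English text, and different models can land well away from it.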

The model never sees your letters. Your text is chopped into tokens, each token becomes a number, the model does its math on the numbers, and the numbers it produces get turned back into the words you read. You are billed for both directions.
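
The round trip described above can be sketched with a toy tokenizer. The six-piece vocabulary and the greedy longest-match rule are assumptions made for illustration; production tokenizers learn vocabularies of tens of thousands of pieces and use more involved algorithms.

```python
# Hypothetical toy vocabulary, invented for this example.
VOCAB = ["the", "is", "un", "forget", "table", " "]
TOKEN_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}

def encode(text: str) -> list[int]:
    """Split text into vocabulary pieces (greedy longest match) and return their IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary piece that matches at this position.
        match = max(
            (p for p in VOCAB if text.startswith(p, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids

def decode(ids: list[int]) -> str:
    """Turn token IDs back into text."""
    return "".join(VOCAB[i] for i in ids)

if __name__ == "__main__":
    print(encode("the"))            # [0]
    print(encode("unforgettable"))  # [2, 3, 4]
    print(decode([2, 3, 4]))        # unforgettable
```

In this toy setup "the" is one token and "unforgettable" is three, matching the example above. The model only ever works with the ID lists.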

Why the same sentence costs different amounts on different models

That chopping step, called tokenization, is slightly different on every model. The same paragraph can be a different number of tokens on Claude versus GPT versus Gemini, and even across versions of the same family. Anthropic notes, for example, that its newest tokenizer counts the same text as roughly 30% more tokens than its previous generation.

The practical lesson: a token count is model-specific. Do not assume a price you measured on one model carries over to another.

Input versus output: the answer is the expensive half

This is the split that matters most, and the one most people have backwards.

Input tokens are everything you send: your message, the whole conversation so far, and any files or instructions riding along.
Output tokens are what the model writes back.

Generating text costs more compute than reading it, so output is priced several times higher than input across the major APIs, typically three to five times. Claude Opus 4.8, as of mid-2026, runs about $5 per million input tokens and $25 per million output tokens (Anthropic pricing), a clean 5x. OpenAI's and Google's published rates (OpenAI, Google) follow the same shape.

So if you care about cost, the length of the model's answer matters more than the length of your question. Asking for "three sentences" instead of an essay is one of the cheapest savings available, and almost nobody uses it.
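
To make the asymmetry concrete, here is a small sketch using the $5 and $25 per-million rates quoted above. The token counts (a 200-token question, a 1,000-token essay, an 80-token short answer) are assumptions picked for illustration.

```python
# Rates quoted above, in dollars per million tokens.
INPUT_RATE = 5.00
OUTPUT_RATE = 25.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the rates above."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000

if __name__ == "__main__":
    question = 200  # assumed size of the prompt, in tokens
    essay = request_cost(question, 1_000)  # assumed long answer
    short = request_cost(question, 80)     # assumed three-sentence answer
    print(f"essay answer: ${essay:.4f}")
    print(f"short answer: ${short:.4f}")
```

Under these assumptions the question is identical in both cases, and nearly all of the difference in cost comes from the length of the answer.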

The trap nobody sees: your whole conversation is re-sent every turn

Chat models have no memory between turns. To stay coherent, the interface quietly re-sends the entire conversation so far with every new message. You are not paying only for your latest question; you are paying for the model to re-read the whole thread.

That is why a long chat gets more expensive the longer it runs, even when your messages stay the same size:

Message 1: just your prompt, maybe 50 tokens.
Message 5: the full history plus your new line, a few hundred.
Message 20: a long history dragged along every single turn, often thousands.

The bill climbs because the history behind each message keeps growing. It also explains three things you have probably felt: answers that get vaguer deep into a long chat (the context is crowded), hitting a usage limit sooner than expected, and a fresh chat suddenly giving a sharper answer (zero history, full room to work).
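
The growth is easy to simulate. This sketch assumes every message you send is 50 tokens and every reply is 100 tokens (both invented figures), and that the full history is re-sent on each turn with no caching or trimming.

```python
def input_tokens_per_turn(message_tokens: int, reply_tokens: int, turns: int) -> list[int]:
    """Input tokens billed on each turn when the whole history is re-sent."""
    billed = []
    history = 0
    for _ in range(turns):
        # This turn's input is all prior history plus the new message.
        billed.append(history + message_tokens)
        # The new message and the model's reply both join the history.
        history += message_tokens + reply_tokens
    return billed

if __name__ == "__main__":
    per_turn = input_tokens_per_turn(message_tokens=50, reply_tokens=100, turns=20)
    print(per_turn[0])    # 50
    print(per_turn[4])    # 650
    print(per_turn[19])   # 2900
    print(sum(per_turn))  # 29500
```

Twenty 50-token messages are only 1,000 tokens of typing, but under these assumptions the re-sent history brings the input total to 29,500.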

What counts as a token besides your words

Plenty of things you do not think of as "text" still become tokens, and some are far heavier than they look.

PDFs. A document is processed as both its text and an image of each page. A dense page can run several hundred to over a thousand tokens, so a 30-page report can be tens of thousands of tokens before you have asked a single question.
Images and screenshots. Pictures are converted to tokens by area. Anthropic estimates image tokens at roughly (width x height) / 750 (vision docs). In practice a single screenshot can cost more than a page of writing, and a handful can quietly eat a big chunk of your daily budget.
Hidden reasoning. Many current models "think" before they answer, generating chain-of-thought tokens you never see. You pay for them anyway, typically at the output rate, so a short visible reply can carry a large invisible cost.
The system prompt and tools. Background instructions that set the model's behavior, plus the definitions of any tools it can call, are tokens on every request even though you never type them.
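
For the two heaviest items in that list, a back-of-the-envelope estimate is enough to avoid surprises. The sketch below applies the (width x height) / 750 figure quoted above and a simple pages-times-tokens estimate for PDFs. The function names and the tokens_per_page parameter are assumptions for illustration, and the image estimate ignores any resizing a provider may apply before counting.

```python
def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Approximate image tokens as (width x height) / 750, rounded down."""
    return (width_px * height_px) // 750

def estimate_pdf_tokens(pages: int, tokens_per_page: int) -> int:
    """Approximate PDF tokens; tokens_per_page is your own guess for the document."""
    return pages * tokens_per_page

if __name__ == "__main__":
    print(estimate_image_tokens(1000, 1000))  # 1333
    print(estimate_pdf_tokens(30, 1000))      # 30000
```

A 1000 x 1000 screenshot comes out at roughly 1,300 tokens, and a 30-page report at an assumed 1,000 tokens per page comes out at 30,000.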

A big context window is not a free one

A model with a 1-million-token context window does not make those tokens free; it just lets you load more of them. Every token you stuff into that window is billed again on every turn. A bigger window is a bigger gas tank, not cheaper gas. Filling it with an entire codebase or a stack of PDFs "just in case" is one of the fastest ways to a surprising bill.

The single biggest lever: prompt caching

If you send the same large block of context repeatedly (a long system prompt, a reference document, a fixed set of instructions), prompt caching stores it and reuses it instead of reprocessing it every time.

The economics are dramatic. On Anthropic's API, a cache read costs about a tenth of the normal input price, against a small one-time write premium of roughly 25% (prompt caching docs). Because each read saves about 90% of the normal price, the premium is recovered on the first cache hit. OpenAI and Google offer comparable discounts on cached input. If you are building anything that sends the same context with each call, this one setting can remove most of the input cost for that repeated context.
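
The arithmetic can be checked in a few lines. This sketch uses the two figures quoted above (a write at about 1.25x the normal input price, a read at about 0.1x) and measures cost in units of "one uncached send of the repeated block". It assumes every request after the first is a cache hit, which real traffic will not always achieve.

```python
WRITE_MULTIPLIER = 1.25  # first request: normal price plus the ~25% write premium
READ_MULTIPLIER = 0.10   # later requests: about a tenth of the normal price

def cost_without_cache(requests: int) -> float:
    """Every request pays the full input price for the repeated block."""
    return float(requests)

def cost_with_cache(requests: int) -> float:
    """One cache write, then cache reads for every later request."""
    if requests <= 0:
        return 0.0
    return WRITE_MULTIPLIER + (requests - 1) * READ_MULTIPLIER

if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, cost_without_cache(n), round(cost_with_cache(n), 2))
```

For ten requests the repeated block costs about 2.15 units with caching against 10 without, and the gap widens with every further call.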

One catch worth knowing: caching is a prefix match. The stable content has to come first and stay byte-for-byte identical. Slip a timestamp or a changing ID into the front of your prompt and the cache silently never hits.
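
The effect of that ordering can be seen by measuring how much of two consecutive prompts is identical from the start. This is a simplified sketch: it compares characters, whereas real caches work on tokens and may have minimum sizes, and the prompt text and timestamps are invented.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that two strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Invented stable instructions that are sent with every request.
STABLE = "You are a support assistant. Follow the policy below.\n" * 20

if __name__ == "__main__":
    # Timestamp first: the prompts diverge almost immediately.
    bad_1 = "[2026-01-01T10:00:00] " + STABLE
    bad_2 = "[2026-01-01T10:00:05] " + STABLE
    # Timestamp last: the whole stable block is a shared prefix.
    good_1 = STABLE + "[2026-01-01T10:00:00]"
    good_2 = STABLE + "[2026-01-01T10:00:05]"
    print(shared_prefix_len(bad_1, bad_2), "of", len(bad_1))
    print(shared_prefix_len(good_1, good_2), "of", len(good_1))
```

With the timestamp in front, only the first few characters match, so nothing useful can be reused. With the timestamp at the end, the entire stable block matches.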

Do not guess with the wrong ruler

If you want to estimate cost before you spend it, count tokens with the right tool. OpenAI's popular tiktoken library is for OpenAI models; it undercounts Claude by 15 to 20%, and more on code or non-English text. Each provider exposes a token-counting tool that returns the real number for its own models (Anthropic, OpenAI). Use the one that matches the model you are actually calling.

What to actually do

Sorted simplest to most advanced. Using even two or three of these cuts consumption sharply.

Start a fresh chat when you switch topics. Old history is dead weight you re-pay on every turn. This alone can cut a busy day's usage by half.
Ask for shorter answers. "Answer in three sentences." Output is the expensive side, so capping it usually saves more than trimming your own message.
Paste the section, not the whole document. A three-page excerpt costs a fraction of a 30-page PDF. Summarize the rest yourself in a line or two if the model needs the context.
Skip image uploads you do not need. Describe what is in the picture when a description will do; save real uploads for when the model genuinely has to see it.
Turn on prompt caching for any repeated context. It is the highest-leverage setting most teams never enable.
Right-size the model. Route quick formatting, simple summaries, and one-line answers to a cheaper, faster model; reserve the flagship for hard reasoning and long generation. A taxi for short trips, the town car only when it earns it.
Ask for structured output. A table or compact JSON is often tighter than free-form prose because it drops the filler, and it is easier to use downstream.
Cut the throat-clearing. "I was wondering if you could possibly help me with..." costs tokens and adds nothing. Say what you need.
Compact long sessions. Tools like Claude Code's /compact summarize a long history and free up room instead of forcing a fresh start every time you hit the wall.
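
The compaction idea in the last item can be sketched in general terms. This is not how Claude Code's /compact is implemented; it is a minimal illustration, with a hypothetical compact_history function and a stand-in summarizer, of replacing older turns with one short summary while keeping the most recent turns verbatim.

```python
from typing import Callable

def compact_history(
    messages: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace all but the last keep_recent messages with a single summary."""
    split = len(messages) - keep_recent
    if split <= 0:
        # Nothing old enough to summarize.
        return list(messages)
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent

if __name__ == "__main__":
    history = [f"message {i}" for i in range(10)]
    # Stand-in summarizer; in practice this would be a model call,
    # which itself costs tokens.
    compacted = compact_history(
        history, keep_recent=3, summarize=lambda older: f"Summary of {len(older)} earlier messages"
    )
    print(compacted)
```

Ten messages become four: one summary plus the last three turns. The trade-off is that detail in the summarized turns is lost, so it suits long sessions where only the gist of the early discussion still matters.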

The one idea to keep

Tokens are the invisible currency behind every AI interaction. Your cost, your speed, your message limits, and the reason a model seems to get worse the longer you talk to it all trace back to them. You do not need to count tokens by hand. You just need the instinct that every word in and every word out is being metered, and that the meter runs faster the longer the conversation gets. Once you think in tokens, the limits and the costs stop feeling random, and you know exactly which lever to pull.