LLM Token Cost Optimization: A Practical Engineering Guide
TL;DR — Cutting LLM bills is not about shorter prompts. It is the joint management of input tokens, output tokens, reasoning tokens, cache hits, batch jobs, and model routing. This vendor-neutral, engineering-grade guide — cross-verified by three AI model families — shows you the levers that actually move cost.
Cutting an LLM bill is an engineering problem, not a prompt-trimming exercise.
Most teams discover their token costs the same way: a billing alert. The reflex is to "make the prompt shorter," which barely helps and sometimes makes outputs worse. Real LLM token cost optimization is the joint management of input tokens, output tokens, reasoning/thinking tokens, tool calls, cache hits, batch processing, and model routing — and each lever moves cost in a different way. This guide walks through the levers that matter, with concrete, vendor-neutral examples.
To keep the recommendations honest, the technical claims here were cross-verified by three independent AI model families — Anthropic's Claude (Opus 4.8), an OpenAI-family model (GPT-5.5), and a Google-family model (Gemini 3.5 Flash). Where the three agreed, it is stated plainly; where vendor behavior differs, that is called out.
The cost model you are actually paying for
A single request is rarely "one price." You are billed across several distinct buckets:
| Token bucket | What it is | Relative cost |
|---|---|---|
| Uncached input | Prompt prefill processed fresh | Full price |
| Cache write | Prompt prefix stored for reuse | ~1.25× input (5-min) / ~2× (1-hour) |
| Cache read | Prefix served from cache | ~0.1× input |
| Output | Tokens the model generates | Highest per-token rate |
| Reasoning / thinking | Hidden deliberation tokens | Billed at output rates |
The two non-obvious entries are cache reads (a 90% discount on repeated prefixes) and reasoning tokens (billed even though you never see them). Optimizing cost means moving as much traffic as possible into the cheap buckets and keeping the expensive ones small.
A correction worth internalizing up front: prompt caching reuses the input prefill, not the answer. It does not memoize responses. If you ask the same question twice, the model still generates a fresh (billed) output both times — only the shared prompt prefix is discounted.
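The bucket table above can be turned into a small per-request estimator. This is a minimal sketch, not any vendor's billing formula: the two per-million-token prices are hypothetical placeholders, the multipliers are the approximate ones from the table, and the `Usage` field names are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your vendor's real rates.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# Approximate multipliers relative to the uncached input price (from the table above).
CACHE_WRITE_MULT = 1.25  # 5-minute TTL
CACHE_READ_MULT = 0.10


@dataclass
class Usage:
    uncached_input: int = 0
    cache_write: int = 0
    cache_read: int = 0
    output: int = 0     # visible output tokens
    reasoning: int = 0  # hidden thinking tokens, billed at the output rate


def request_cost(u: Usage) -> float:
    """Return the estimated cost of one request in USD."""
    input_cost = (
        u.uncached_input
        + u.cache_write * CACHE_WRITE_MULT
        + u.cache_read * CACHE_READ_MULT
    ) * INPUT_PRICE
    # Some APIs already fold reasoning tokens into the reported output count;
    # in that case leave `reasoning` at 0 to avoid double counting.
    output_cost = (u.output + u.reasoning) * OUTPUT_PRICE
    return (input_cost + output_cost) / 1_000_000


# Same 10,000-token prefix: first request writes the cache, later ones read it.
cold = Usage(uncached_input=200, cache_write=10_000, output=300)
warm = Usage(uncached_input=200, cache_read=10_000, output=300)
print(f"cold: ${request_cost(cold):.4f}  warm: ${request_cost(warm):.4f}")
```

Feeding this function the usage fields from real responses, rather than estimates, shows which bucket dominates a given endpoint.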
Prompt caching: reuse the prefix, not the answer
Prompt caching lets the provider reuse the computation of a repeated long prefix — your system prompt, tool definitions, few-shot examples, or shared document context. Every major vendor offers it: Anthropic via explicit cache_control breakpoints, OpenAI/Azure via automatic exact-prefix matching reported as cached_tokens, and Gemini via implicit and explicit caching.
The mechanism is a prefix match: the cache key is the exact bytes of the rendered prompt up to a breakpoint. Any change anywhere in that prefix invalidates everything after it. This single fact drives every caching best practice.
Cache-friendly ordering
Put stable content first, volatile content last:
- Stable (cache this): instructions, tool schemas, few-shot examples, shared reference documents.
- Volatile (after the breakpoint): timestamps, per-user data, request IDs, the actual question.
A classic anti-pattern is injecting Current date: 2026-06-17 or a request UUID at the top of the system prompt. Because it sits at the front of the prefix, it changes every request and silently busts the entire cache. Move it to the end.
# WRONG — date at the front invalidates the whole cache
SYSTEM: "Today is 2026-06-17 14:05:33. You are a support agent. <10KB of rules>"
# RIGHT — stable rules first, volatile data last
SYSTEM: "You are a support agent. <10KB of rules>" <-- cache breakpoint here
USER: "[ctx: 2026-06-17] How do I reset my password?"
Verify it is working. Do not trust your intuition — read the usage fields the response returns:
# Anthropic-style: confirm reads are non-zero across repeated calls
curl -s https://api.anthropic.com/v1/messages ... | \
jq '{written: .usage.cache_creation_input_tokens, read: .usage.cache_read_input_tokens}'
If cache_read_input_tokens stays at zero across identical-prefix requests, a silent invalidator is at work — a non-deterministic JSON dump (sort_keys off), a varying tool set, or a clock value in the prefix.
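One way to catch a silent invalidator before it reaches the provider is to fingerprint the stable prefix on your side and log it with each request. The sketch below is an assumption-laden illustration: `prefix_fingerprint` is a hypothetical helper, it assumes every tool definition has a `"name"` key, and it only reflects your own rendering of the prompt, not the provider's cache key.

```python
import hashlib
import json


def prefix_fingerprint(system_prompt: str, tools: list) -> str:
    """Hash the stable prefix in a canonical, byte-stable form."""
    # sort_keys plus fixed separators make the JSON identical across runs;
    # sorting tools by name removes ordering noise (assumes a "name" key).
    canonical_tools = json.dumps(
        sorted(tools, key=lambda t: t["name"]),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    payload = system_prompt + "\n" + canonical_tools
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


tools_a = [{"name": "search", "description": "x"}, {"name": "lookup", "description": "y"}]
tools_b = [{"description": "y", "name": "lookup"}, {"description": "x", "name": "search"}]

# Same content in a different order yields the same fingerprint.
print(prefix_fingerprint("You are a support agent.", tools_a)
      == prefix_fingerprint("You are a support agent.", tools_b))

# A clock value in the prefix changes the fingerprint on every request.
print(prefix_fingerprint("Today is 2026-06-17. You are a support agent.", tools_a)
      == prefix_fingerprint("You are a support agent.", tools_a))
```

The fingerprint is only meaningful if the canonical form is also what you send. If the log shows the fingerprint changing between requests that should share a prefix, the invalidator is in your own prompt assembly.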
Watch the economics: a cache write costs more than plain uncached input (1.25×–2×), and only the reads that follow earn the discount. Counting the write as the first request, a 5-minute TTL comes out ahead from the second request inside the window; a 1-hour TTL needs roughly three. If a prefix is hit only once, caching it is a net loss.
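The break-even arithmetic can be checked directly. The sketch below assumes the approximate multipliers quoted in this guide (1.25× or 2× for a write, 0.1× for a read), that every request lands inside the TTL, and that the first request pays for the write; the function names are made up for illustration.

```python
def cached_cost(n: int, write_mult: float, read_mult: float = 0.1) -> float:
    """Relative prefix cost over n requests with caching: one write, n - 1 reads."""
    return write_mult + (n - 1) * read_mult


def first_profitable_request(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest n at which caching beats sending the prefix uncached n times."""
    if read_mult >= 1:
        raise ValueError("caching never pays off unless reads are discounted")
    n = 1
    # Uncached cost of n requests is simply n (each pays 1x the prefix).
    while cached_cost(n, write_mult, read_mult) >= n:
        n += 1
    return n


print(first_profitable_request(1.25))  # 5-minute TTL -> 2
print(first_profitable_request(2.0))   # 1-hour TTL   -> 3
```

With real traffic, compare these thresholds against the observed number of hits per prefix per TTL window before enabling the longer TTL.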
Context editing and compaction for long sessions
Long agent sessions accumulate transcript that you re-send (and re-pay for) every turn. Two techniques control this:
- Context editing prunes stale tool results and old reasoning blocks.
- Compaction summarizes earlier turns into a compact block when you approach the context limit.
Both shrink the billed prefix, but they are lossy. The rule that keeps them safe: preserve decisions, IDs, tool-result summaries, open blockers, and the next goal. Compact the narration; never compact the state. A session that forgets a database ID it created three turns ago will redo the work — burning more tokens than compaction saved.
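The "compact the narration, never the state" rule can be sketched as a function that summarizes old turns but carries pinned messages through verbatim. Everything here is an assumption for illustration: the `pinned` flag, the message shape, and the `summarize` callable (which in practice might be a cheap model call) are not part of any vendor API.

```python
from typing import Callable


def compact(
    messages: list,
    summarize: Callable[[list], str],
    keep_recent: int = 4,
) -> list:
    """Summarize old narration; keep pinned state and recent turns verbatim."""
    split = max(len(messages) - keep_recent, 0)
    old, recent = messages[:split], messages[split:]

    # Pinned messages carry state (IDs, decisions, blockers) and are never summarized.
    pinned = [m for m in old if m.get("pinned")]
    narration = [m for m in old if not m.get("pinned")]
    if not narration:
        return list(messages)

    summary = {
        "role": "user",
        "content": "[summary of earlier turns] " + summarize(narration),
    }
    return [summary, *pinned, *recent]


history = [
    {"role": "user", "content": "Create a staging database."},
    {"role": "assistant", "content": "Created database id=db_123.", "pinned": True},
    {"role": "user", "content": "Thanks, what does staging mean again?"},
    {"role": "assistant", "content": "Staging is a pre-production environment."},
    {"role": "user", "content": "Now load the fixtures."},
    {"role": "assistant", "content": "Loading fixtures into db_123."},
]

compacted = compact(history, lambda ms: f"{len(ms)} earlier messages", keep_recent=2)
for m in compacted:
    print(m["role"], "|", m["content"])
```

A real implementation would also need to respect the provider's role-alternation rules and decide who sets the pin, which this sketch leaves open.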
Route reasoning effort — and turn it off for trivial tasks
Modern reasoning models deliberate before answering, and those hidden tokens are billed. The key accuracy point: the deprecated fixed budget_tokens knob has been replaced. Current Claude models use adaptive thinking plus effort levels (low, medium, high, xhigh, max) — you express intent, not a token count. Other vendors expose comparable effort/reasoning controls.
Match effort to the task:
| Task | Effort | Why |
|---|---|---|
| Classification, tagging, short summaries | low (or off) | No multi-step reasoning needed |
| Standard extraction, Q&A | medium | Balanced |
| Complex code, math, multi-step planning | high / xhigh | Correctness dominates cost |
Leaving high effort enabled on a sentiment-classification endpoint is one of the most common silent cost leaks. The model "thinks" for hundreds of billed tokens to answer "positive." All three verifying models flagged this as the highest-ROI, lowest-effort fix.
And note the misconception in reverse: shortening the output does not cut reasoning cost. Reasoning tokens are a separate bucket controlled by the effort/thinking setting, not by max_tokens. You must dial both independently.
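A simple way to keep the two dials independent is to set them from a per-task lookup rather than a global default. The sketch below mirrors the effort table above; the task-type labels and the `effort` / `max_output_tokens` keys are placeholder names that would need to be mapped onto whatever parameters your provider actually exposes.

```python
# Task types and effort labels follow the table above; the keys are illustrative.
EFFORT_BY_TASK = {
    "classification": "low",
    "tagging": "low",
    "short_summary": "low",
    "extraction": "medium",
    "qa": "medium",
    "complex_code": "high",
    "math": "high",
    "planning": "high",
}


def effort_for(task_type: str, default: str = "medium") -> str:
    """Look up the reasoning effort for a task, falling back to a middle setting."""
    return EFFORT_BY_TASK.get(task_type, default)


def request_settings(task_type: str, max_output_tokens: int) -> dict:
    """Build both cost dials separately: reasoning effort and the visible-output cap."""
    return {
        "effort": effort_for(task_type),         # controls hidden reasoning tokens
        "max_output_tokens": max_output_tokens,  # caps the generated answer only
    }


print(request_settings("classification", 16))
print(request_settings("complex_code", 4000))
```

Keeping the mapping in one place also makes it easy to audit which endpoints are still running at a high effort level.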
Small-model routing
The cheapest token is the one a smaller model handles. Build a router:
- Cheap/fast model for easy, high-volume, repeat requests (classification, routing, simple rewrites).
- Escalate to the flagship model only for hard, uncertain, or high-stakes requests (complex code, ambiguous reasoning, anything customer-facing where errors are costly).
A practical pattern is a confidence gate: run the cheap model first, and escalate only when it signals low confidence or the request matches a "hard" heuristic. One caching caveat applies here: caches are model-scoped, so switching models mid-conversation invalidates the cache. Route at the start of a task, not mid-stream, or spin the cheap model up as a separate subagent.
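The confidence gate can be sketched with two injected callables, so no particular SDK is assumed. The keyword list, the 0.8 threshold, and the idea that the cheap model returns a usable confidence score are all illustrative assumptions; how you obtain a confidence signal depends on your model and task.

```python
from typing import Callable, NamedTuple


class Routed(NamedTuple):
    text: str
    model: str  # "cheap" or "flagship"


# Illustrative "hard request" heuristic; tune to your own traffic.
HARD_HINTS = ("refactor", "prove", "contract", "migration")


def looks_hard(request: str) -> bool:
    lowered = request.lower()
    return any(hint in lowered for hint in HARD_HINTS)


def route(
    request: str,
    cheap: Callable[[str], tuple],    # returns (answer_text, confidence in 0..1)
    flagship: Callable[[str], str],   # returns answer_text
    threshold: float = 0.8,
) -> Routed:
    """Decide at the start of a task which model answers it."""
    if looks_hard(request):
        return Routed(flagship(request), "flagship")
    text, confidence = cheap(request)
    if confidence < threshold:
        # Escalation pays for both calls, so keep the gate selective.
        return Routed(flagship(request), "flagship")
    return Routed(text, "cheap")


# Stub models stand in for real API calls.
def cheap_stub(request: str) -> tuple:
    return ("positive", 0.95)


def flagship_stub(request: str) -> str:
    return "detailed answer"


print(route("Tag the sentiment of this review", cheap_stub, flagship_stub))
print(route("Refactor the billing module", cheap_stub, flagship_stub))
```

Logging the `model` field per request gives the escalation rate, which is the number to watch when tuning the threshold.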
Batch API for non-real-time work
If a workload does not need an instant response — nightly enrichment, bulk classification, embeddings backfills, eval runs — use the Batch API. Across vendors it is typically ~50% cheaper than synchronous calls, with results returned within a window (often under an hour, up to 24 hours). Prompt caching, tools, and vision all still apply inside batches.
The misuse to avoid: do not put interactive UX behind a batch. A user staring at a spinner for an hour is not an optimization. Batch is for throughput, not latency.
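The throughput-versus-latency rule can be written as a small gate in front of the job queue. This is a sketch under stated assumptions: the 24-hour worst-case window is taken from the figure quoted above, and `choose_lane` plus the `"sync"` / `"batch"` labels are hypothetical names, not part of any Batch API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Worst-case batch turnaround assumed from the text above; check your vendor's window.
BATCH_WINDOW = timedelta(hours=24)


def choose_lane(
    interactive: bool,
    deadline: datetime,
    now: Optional[datetime] = None,
) -> str:
    """Return "batch" only when nobody is waiting and the deadline tolerates the window."""
    now = now or datetime.now(timezone.utc)
    if interactive:
        return "sync"  # never put live UX behind a batch
    return "batch" if deadline - now >= BATCH_WINDOW else "sync"


now = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
print(choose_lane(False, now + timedelta(days=2), now))   # nightly enrichment
print(choose_lane(False, now + timedelta(hours=2), now))  # report due soon
print(choose_lane(True, now + timedelta(days=2), now))    # user is waiting
```

Gating on the worst-case window is deliberately conservative: most batches finish sooner, but a deadline should not depend on that.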
Constrain output and measure real cost
Output tokens are the most expensive bucket, so cap them deliberately:
- Set a sensible max_tokens.
- Use a JSON schema / structured output so the model emits only the fields you need.
- Ask for "N bullets only" or "one sentence" when that is genuinely enough.
Finally, measure with a real token-counting API — not chars / 4. Two accuracy points the three models converged on:
- Local heuristics are wrong. Images, file attachments, tool schemas, and structured-output scaffolding all consume tokens that a character count never sees.
- Tokenizers are model-specific. tiktoken is OpenAI's tokenizer and undercounts Claude tokens by roughly 15–20% on prose, and far more on code. For Claude, use the dedicated count_tokens endpoint:
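# Accurate, model-specific count; never tiktoken for Claude
from pathlib import Path
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.count_tokens(
    model="claude-opus-4-8",
    messages=[{"role": "user", "content": Path("prompt.txt").read_text()}],
)
print(resp.input_tokens)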
Common misconceptions, corrected
- "Caching reuses the answer." No. It reuses the input prefix; output is always regenerated and billed.
- "Put the date/request-ID/username at the top." That breaks the cache for everything after it. Stable first, volatile last.
- "Shorter output also cuts reasoning cost." Reasoning is a separate bucket; control it with effort levels, not max_tokens.
- "chars / 4 is close enough." It ignores images, tool schemas, and tokenizer differences. Count with the real API.
- "Use Batch for everything." Batch is for non-real-time throughput. Never put live UX behind it.
Wrap-up
LLM token cost optimization is a portfolio, not a single trick. Order your prompts so a long stable prefix stays cacheable; compact long sessions without dropping state; route reasoning effort down for easy tasks and small models for easy requests; push non-interactive work to the Batch API; cap and structure your output; and measure cost with a real token-counting API instead of a character estimate. Layered together, these routinely cut bills by half or more without degrading quality — which is the real test of an optimization.
This guide is part of a three-part series on building cost-effective, reliable AI systems. For the layers above token economics, read the companion guides on AI agent skills and agent harness engineering. Pick one lever from this guide — most teams start with effort routing — instrument it, and measure the delta before moving to the next.