If your application sends the same system prompt, tool definitions, or document context with every API request, you're paying for the same tokens over and over again. Prompt caching solves this by storing the processed representation of repeated text so the model doesn't have to recompute it. The result: input costs drop by 90% or more on cached tokens.
The best part is that on most providers, caching happens automatically. You don't need to change your code — you just need to understand how it works and structure your prompts to take advantage of it.
How Prompt Caching Works
When you send a prompt to an AI model, the model processes every token from scratch — reading your system instructions, understanding the context, and building internal representations. This is expensive computation.
Prompt caching stores the result of that initial processing. When you send the same prefix again (the beginning of your prompt), the provider recognizes it and reuses the cached computation. The model only needs to process the new, unique part of your request.
Think of it like compiling code. The first time, the compiler processes everything from scratch. After that, it only recompiles what changed. Prompt caching does the same thing for your API calls.
There are two costs associated with caching:
- Cache write — The first time a prefix is cached, you pay a small premium (varies by provider). This happens once.
- Cache read — Every subsequent call that hits the cache pays a dramatically reduced rate. This is where the 90% savings come from.
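The two costs above can be captured in a small cost model. This is an illustrative sketch, not any provider's billing logic: `write_rate` carries any write premium, `read_rate` is the discounted cache-hit price, and all rates are in dollars per million tokens.

```python
M = 1_000_000

def prompt_cost(requests, prefix_tokens, unique_tokens,
                input_rate, write_rate, read_rate):
    """Total input cost (USD) with prefix caching; rates in $/M tokens."""
    write = prefix_tokens / M * write_rate                   # paid once
    reads = (requests - 1) * prefix_tokens / M * read_rate   # every later hit
    unique = requests * unique_tokens / M * input_rate       # never cached
    return write + reads + unique

def uncached_cost(requests, prefix_tokens, unique_tokens, input_rate):
    """Same workload with no caching: every token at the normal rate."""
    return requests * (prefix_tokens + unique_tokens) / M * input_rate
```

For example, 100 requests with a 4,000-token prefix and 1,000 unique tokens at $5/M (no write premium, 90% read discount) cost about $0.72 cached versus $2.50 uncached.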
Caching Pricing Across Providers
Here's what each major provider charges for cached tokens (per million tokens):
| Provider | Normal Input | Cache Write | Cache Read | Savings |
|---|---|---|---|---|
| Anthropic (Opus 4.7) | $5.00 | $6.25 | $0.50 | 90% |
| Anthropic (Sonnet 4.6) | $3.00 | $3.75 | $0.30 | 90% |
| OpenAI (GPT-5.5) | $5.00 | $5.00 | $0.50 | 90% |
| OpenAI (GPT-5.4) | $2.50 | $2.50 | $0.25 | 90% |
| DeepSeek (V4 Flash) | $0.14 | $0.14 | $0.0028 | 98% |
| DeepSeek (V4 Pro*) | $0.435 | $0.435 | $0.0036 | 99% |
* DeepSeek V4 Pro: 75% off until May 31, 2026. DeepSeek cache read pricing is 1/10 of the launch price as of April 26, 2026.
Anthropic charges a 25% premium on cache writes — you pay $6.25 instead of $5.00 the first time. But subsequent reads are 90% cheaper.
OpenAI charges the same price for cache writes as normal input — no premium, no penalty.
DeepSeek offers the most aggressive discount: cache reads are 98-99% cheaper than normal input. At $0.0028/M, cached tokens are essentially free.
How Much Can You Save? A Real Example
Let's say you're building an AI coding assistant with a 4,000-token system prompt that defines the assistant's behavior, coding standards, and tool instructions. This prompt is sent with every request. You handle 50,000 requests per day, each with 1,000 tokens of unique user input.
Without caching
Every request processes the full 5,000 tokens (4K system + 1K user input).
| Model | Daily Input Tokens | Daily Cost | Monthly Cost |
|---|---|---|---|
| GPT-5.5 | 250M | $1,250 | $37,500 |
| Claude Sonnet 4.6 | 250M | $750 | $22,500 |
| DeepSeek V4 Flash | 250M | $35 | $1,050 |
With caching (system prompt cached)
The 4K system prompt is cached after the first request. Subsequent requests pay cache read pricing for those tokens.
| Model | Cache Write | Cache Read | Unique Input | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|
| GPT-5.5 | $0.02 | $100.00 | $250.00 | $350 | $10,500 |
| Claude Sonnet 4.6 | $0.015 | $60.00 | $150.00 | $210 | $6,300 |
| DeepSeek V4 Flash | $0.0006 | $0.56 | $7.00 | $7.56 | $227 |
Savings with caching:
- GPT-5.5: 72% reduction ($37,500 → $10,500/month)
- Claude Sonnet 4.6: 72% reduction ($22,500 → $6,300/month)
- DeepSeek V4 Flash: 78% reduction ($1,050 → $227/month)
A 4,000-token system prompt sent 50,000 times per day costs $1,250/day on GPT-5.5 without caching. With caching, it costs about $350. That's $27,000 saved every month.
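The worked example above reduces to a few lines of arithmetic. This reproduces the GPT-5.5 row using the rates from the pricing table ($5.00/M for input and cache writes, $0.50/M for cache reads):

```python
REQUESTS = 50_000   # requests per day
PREFIX = 4_000      # cached system prompt tokens
UNIQUE = 1_000      # unique user input tokens per request
M = 1_000_000

# Without caching: every token at the normal input rate.
uncached_daily = REQUESTS * (PREFIX + UNIQUE) / M * 5.00        # $1,250.00

# With caching: one write, then cache reads for the prefix.
write = PREFIX / M * 5.00                                       # first request
reads = (REQUESTS - 1) * PREFIX / M * 0.50                      # cache hits
unique = REQUESTS * UNIQUE / M * 5.00                           # never cached
cached_daily = write + reads + unique                           # ≈ $350.02

monthly_savings = (uncached_daily - cached_daily) * 30          # ≈ $27,000
```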
What Gets Cached?
Caching works on the prefix of your prompt — the beginning portion that stays the same across requests. Here's what typically qualifies:
System prompts
The instructions that define your assistant's behavior. These are usually identical across all requests and are the single biggest opportunity for caching.
Tool definitions
If you use function calling, the tool schemas are sent with every request. These can be thousands of tokens and rarely change — perfect for caching.
Document context
If you're building a RAG system where users ask questions about the same document, the document content stays the same across requests. Cache it.
Conversation history (partially)
In multi-turn conversations, earlier messages stay the same as new messages are added. The prefix of the conversation history can be cached.
Most providers require a minimum prompt length for caching to activate:
- Anthropic: 1,024 tokens (Claude), 2,048 tokens (Opus)
- OpenAI: 1,024 tokens
- DeepSeek: automatic, no published minimum
How to Structure Prompts for Maximum Caching
The golden rule: put stable content at the beginning of your prompt, and variable content at the end.
Good structure (cacheable)
System prompt → Tool definitions → Document context → User message
The first three components are identical across requests. Only the user message changes. The provider caches everything up to the point where the content diverges.
Bad structure (not cacheable)
User message → System prompt → Tool definitions → Document context
If the first token of your prompt changes every time (because it starts with the user's unique message), nothing gets cached. The provider can only cache a common prefix.
If you're using the OpenAI or Anthropic SDK, caching works automatically — you don't need to enable it. Just make sure your system prompt comes before any variable content in the messages array.
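The effect of ordering is easy to see if you treat the serialized prompt as a string: only the longest common prefix across requests can be reused. A small demonstration, with strings standing in for the prompt the provider actually sees:

```python
import os.path

# Stable content, identical across requests.
SYSTEM = "You are a careful coding assistant. " * 50
TOOLS = "Tool schema: run_tests(path), lint(path). " * 20

def good(user_msg):
    """Stable content first, variable content last — long shared prefix."""
    return SYSTEM + TOOLS + user_msg

def bad(user_msg):
    """Variable content first — the prefix diverges at the first token."""
    return user_msg + SYSTEM + TOOLS

# How much of two different requests is a shared (cacheable) prefix?
shared_good = len(os.path.commonprefix([good("fix the bug"), good("add a test")]))
shared_bad = len(os.path.commonprefix([bad("fix the bug"), bad("add a test")]))
```

With the good ordering, the shared prefix covers the entire system prompt and tool definitions; with the bad ordering, it is zero.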
Provider-Specific Details
Anthropic (Claude)
Anthropic's caching is automatic for prompts over 1,024 tokens. The cache lives for 5 minutes by default — if you send the same prefix again within 5 minutes, it's a cache hit. For Opus models, the minimum is 2,048 tokens.
Anthropic is the only provider that charges a premium for cache writes (25% above the normal input price). This means caching only pays off if the same prefix is sent more than once: a single request costs 1.25× the base rate, but a second request within the cache window pays just 0.1× for the cached prefix — 1.35× total versus 2× uncached. From the second request onward you're already ahead, and the savings grow with every additional hit.
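The break-even arithmetic is worth making explicit. Expressing cost as a multiple of the normal per-request prefix price (so n uncached requests cost exactly n), with a 25% write premium and a 90% read discount:

```python
def cached_prefix_cost(n):
    """Relative prefix cost for n requests: one premium write, n-1 reads."""
    return 1.25 + 0.10 * (n - 1)

# One request: caching costs more (1.25 vs 1.00).
# Two requests: caching already wins (1.35 vs 2.00).
# Ten requests: 2.15 vs 10.00 — a 78% saving on the prefix.
```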
OpenAI (GPT)
OpenAI's caching is also automatic for prompts over 1,024 tokens. Cache TTL is typically 5-10 minutes. There's no premium for cache writes — you pay the same as normal input on the first request, and 90% less on subsequent requests.
OpenAI also offers a Batch API with 50% off all tokens for non-real-time workloads (24-hour turnaround). If your use case allows async processing, combine batching with caching for maximum savings.
DeepSeek
DeepSeek offers the most aggressive cache pricing in the industry. Cache reads on V4 Flash cost $0.0028/M — that's 98% cheaper than the normal $0.14/M input price. For high-volume agent applications, this makes cached tokens essentially free.
DeepSeek's caching applies automatically. There's no minimum length requirement published, and the cache duration is generous for repeated prefixes.
When Caching Doesn't Help
Caching isn't always the answer. Here's when it provides little or no benefit:
- Unique prompts every time. If every request has completely different content, there's no common prefix to cache.
- Very short prompts. If your total input is under 1,024 tokens, most providers won't cache it.
- Infrequent requests. If you send fewer than 2-3 requests per cache TTL window (5-10 minutes), you won't get cache hits.
- Constantly changing system prompts. If you A/B test different system prompts frequently, each variant needs its own cache.
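The "infrequent requests" failure mode can be simulated directly. This sketch assumes any request (hit or miss) refreshes the cached prefix, which matches typical TTL-based behavior, though providers differ on the exact refresh rules:

```python
TTL = 300  # seconds — a 5-minute cache window

def count_cache_hits(arrival_times):
    """Count cache hits for identical-prefix requests under a TTL cache."""
    hits, last_seen = 0, None
    for t in sorted(arrival_times):
        if last_seen is not None and t - last_seen <= TTL:
            hits += 1
        last_seen = t  # hit or miss, the prefix is (re)cached now
    return hits

busy = count_cache_hits(range(0, 3600, 60))    # one request/minute
quiet = count_cache_hits(range(0, 3600, 600))  # one request/10 minutes
```

The busy workload gets 59 hits out of 60 requests; the quiet one, arriving slower than the TTL, gets zero.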
Beyond Provider Caching: Application-Level Strategies
Provider-level caching handles repeated prefixes, but you can go further:
Response caching
If two users ask the same question, the response will be identical. Cache the response at the application level (in Redis, Memcached, or even a database) and serve it without making an API call.
Semantic caching
Use embeddings to find semantically similar queries. If someone asks "How do I sort a list in Python?" and another asks "How to sort array in Python?", the response is the same. Semantic caching matches these even when the exact wording differs.
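The idea can be sketched without an embedding model: here a simple word-overlap (Jaccard) score stands in for cosine similarity between embeddings, so the example stays self-contained. The threshold and scoring are illustrative assumptions — a real system would use embeddings and a vector index.

```python
def similarity(a: str, b: str) -> float:
    """Word-overlap score in [0, 1] — a crude stand-in for embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

class SemanticCache:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.entries: list[tuple[str, str]] = []  # (query, response)

    def get(self, query: str):
        """Return a cached response for any sufficiently similar query."""
        for cached_query, response in self.entries:
            if similarity(query, cached_query) >= self.threshold:
                return response
        return None

    def put(self, query: str, response: str):
        self.entries.append((query, response))
```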
Prompt compression
Reduce the size of your system prompt. Every token you remove from the system prompt saves money on every single request. Use concise instructions, remove redundant examples, and compress verbose tool descriptions.
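The payoff of trimming is linear in request volume. Quick arithmetic for the article's scenario, using the uncached GPT-5.5 rate (with caching in place, trimmed tokens save at the lower cache-read rate instead, so the per-request saving is smaller but still multiplies across every request):

```python
def daily_savings(tokens_removed, requests_per_day, rate_per_m):
    """Daily dollars saved by removing tokens sent with every request."""
    return tokens_removed * requests_per_day * rate_per_m / 1_000_000

# Trimming 500 tokens from a prompt sent 50,000 times/day at $5.00/M:
saved = daily_savings(500, 50_000, 5.00)  # $125.00/day
```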
The Bottom Line
Prompt caching is the single most impactful cost optimization available to AI API users. It requires zero code changes on most providers, and the savings compound with every request. If your application sends the same system prompt more than three times, you're leaving money on the table.
Start by measuring your current costs. Then structure your prompts with stable content first, variable content last. The savings will follow.
Pricing sourced from official provider documentation (May 2026). Cache TTL, minimum lengths, and pricing may change. Always verify current rates on the provider's pricing page. DeepSeek V4 Pro pricing reflects the 75% promotional discount valid until May 31, 2026.