If your application sends the same system prompt, tool definitions, or document context with every API request, you're paying for the same tokens over and over again. Prompt caching solves this by storing the processed representation of repeated text so the model doesn't have to recompute it. The result: input costs drop by 90% or more on cached tokens.

The best part is that on most providers, caching happens automatically. You don't need to change your code — you just need to understand how it works and structure your prompts to take advantage of it.

How Prompt Caching Works

When you send a prompt to an AI model, the model processes every token from scratch — reading your system instructions, understanding the context, and building internal representations. This is expensive computation.

Prompt caching stores the result of that initial processing. When you send the same prefix again (the beginning of your prompt), the provider recognizes it and reuses the cached computation. The model only needs to process the new, unique part of your request.

Think of it like compiling code. The first time, the compiler processes everything from scratch. After that, it only recompiles what changed. Prompt caching does the same thing for your API calls.
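
To make the prefix idea concrete, here's a minimal sketch in Python. It's purely illustrative, not how any provider actually implements caching (they cache the model's internal state for a token prefix, not raw strings), but the lookup logic is the same: an identical prefix gets looked up instead of recomputed.

```python
import hashlib

_cache: dict[str, str] = {}  # hash of the stable prefix -> "already processed" result

def process_prompt(stable_prefix: str, unique_suffix: str) -> str:
    key = hashlib.sha256(stable_prefix.encode()).hexdigest()
    if key not in _cache:
        # Cache write: the prefix is processed from scratch (full price, paid once).
        _cache[key] = f"processed {len(stable_prefix)} prefix chars"
    # Cache read: the stored prefix work is reused; only the suffix is new computation.
    return _cache[key] + f" + processed {len(unique_suffix)} new chars"
```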

There are two costs associated with caching:

  • Cache write — The first time a prefix is cached, you pay the normal input rate, plus a small premium on some providers. This happens once per cache lifetime.
  • Cache read — Every subsequent call that hits the cache pays a dramatically reduced rate. This is where the 90% savings come from.
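
You can put the trade-off in a rough formula. The function below is a sketch under simple assumptions: one cache write followed by cache reads on every later request, no cache expiry in between, and prices quoted per million tokens. The function name and parameters are mine, not any provider's API.

```python
def cached_input_cost(prefix_tokens: int, unique_tokens: int, n_requests: int,
                      input_price: float, write_price: float, read_price: float) -> float:
    """Total input cost in dollars; all prices are per million tokens."""
    m = 1_000_000
    write = prefix_tokens / m * write_price                    # first request caches the prefix
    reads = prefix_tokens / m * read_price * (n_requests - 1)  # later requests hit the cache
    unique = unique_tokens / m * input_price * n_requests      # the variable part is never cached
    return write + reads + unique
```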

Caching Pricing Across Providers

Here's what each major provider charges for cached tokens (per million tokens):

| Provider | Normal Input | Cache Write | Cache Read | Savings |
|---|---|---|---|---|
| Anthropic (Opus 4.7) | $5.00 | $6.25 | $0.50 | 90% |
| Anthropic (Sonnet 4.6) | $3.00 | $3.75 | $0.30 | 90% |
| OpenAI (GPT-5.5) | $5.00 | $5.00 | $0.50 | 90% |
| OpenAI (GPT-5.4) | $2.50 | $2.50 | $0.25 | 90% |
| DeepSeek (V4 Flash) | $0.14 | $0.14 | $0.0028 | 98% |
| DeepSeek (V4 Pro*) | $0.435 | $0.435 | $0.0036 | 99% |

* DeepSeek V4 Pro: 75% off until May 31, 2026. DeepSeek cache read pricing is 1/10 of the launch price as of April 26, 2026.

Key Differences

Anthropic charges a 25% premium on cache writes — you pay $6.25 instead of $5.00 the first time. But subsequent reads are 90% cheaper.

OpenAI charges the same price for cache writes as normal input — no premium, no penalty.

DeepSeek offers the most aggressive discount: cache reads are 98-99% cheaper than normal input. At $0.0028/M, cached tokens are essentially free.

How Much Can You Save? A Real Example

Let's say you're building an AI coding assistant with a 4,000-token system prompt that defines the assistant's behavior, coding standards, and tool instructions. This prompt is sent with every request. You handle 50,000 requests per day, each with 1,000 tokens of unique user input.

Without caching

Every request processes the full 5,000 tokens (4K system + 1K user input).

| Model | Daily Input Tokens | Daily Cost | Monthly Cost |
|---|---|---|---|
| GPT-5.5 | 250M | $1,250 | $37,500 |
| Claude Sonnet 4.6 | 250M | $750 | $22,500 |
| DeepSeek V4 Flash | 250M | $35 | $1,050 |

With caching (system prompt cached)

The 4K system prompt is cached on the first request. At 50,000 requests per day the prefix never goes cold, so every subsequent request pays cache read pricing for those tokens.

| Model | Cache Write | Cache Read | Unique Input | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|
| GPT-5.5 | $0.02 | $100.00 | $250.00 | $350 | $10,500 |
| Claude Sonnet 4.6 | $0.015 | $60.00 | $150.00 | $210 | $6,300 |
| DeepSeek V4 Flash | $0.0006 | $0.56 | $7.00 | $7.56 | $227 |

Savings with caching:

  • GPT-5.5: 72% reduction ($37,500 → $10,500/month)
  • Claude Sonnet 4.6: 72% reduction ($22,500 → $6,300/month)
  • DeepSeek V4 Flash: 78% reduction ($1,050 → $227/month)

A 4,000-token system prompt sent 50,000 times per day costs $1,250/day on GPT-5.5 without caching. With caching, it costs about $350. That's roughly $27,000 saved every month. The ceiling is set by how much of each request is cacheable: 4,000 of 5,000 tokens (80%) get the 90% discount, so the maximum saving on GPT-5.5 and Sonnet is 72%; DeepSeek's deeper 98% read discount pushes it to 78%.
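
If you want to sanity-check these figures or rerun them with your own volumes, the short script below reproduces the tables. It assumes one cache write per day (reasonable at 50,000 requests per day, since the prefix never goes cold) and 30-day months; the prices are the per-million-token rates listed earlier.

```python
PRICES = {  # $ per million tokens: (normal input, cache write, cache read)
    "GPT-5.5":           (5.00, 5.00, 0.50),
    "Claude Sonnet 4.6": (3.00, 3.75, 0.30),
    "DeepSeek V4 Flash": (0.14, 0.14, 0.0028),
}

PREFIX, UNIQUE, REQUESTS, M = 4_000, 1_000, 50_000, 1_000_000

for model, (inp, write, read) in PRICES.items():
    no_cache = (PREFIX + UNIQUE) * REQUESTS / M * inp
    cached = (PREFIX / M * write                     # one cache write per day
              + PREFIX * (REQUESTS - 1) / M * read   # every later request reads the cached prefix
              + UNIQUE * REQUESTS / M * inp)         # unique input always pays the normal rate
    print(f"{model}: ${no_cache:,.0f}/day -> ${cached:,.2f}/day "
          f"({1 - cached / no_cache:.0%} saved, ${30 * cached:,.0f}/month)")
```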

What Gets Cached?

Caching works on the prefix of your prompt — the beginning portion that stays the same across requests. Here's what typically qualifies:

System prompts

The instructions that define your assistant's behavior. These are usually identical across all requests and are the single biggest opportunity for caching.

Tool definitions

If you use function calling, the tool schemas are sent with every request. These can be thousands of tokens and rarely change — perfect for caching.

Document context

If you're building a RAG system where users ask questions about the same document, the document content stays the same across requests. Cache it.

Conversation history (partially)

In multi-turn conversations, earlier messages stay the same as new messages are added. The prefix of the conversation history can be cached.
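
In code, keeping that prefix stable usually just means appending new turns to the end of the messages list instead of rebuilding or reordering it. A provider-agnostic sketch (plain message dicts, no API call):

```python
SYSTEM_PROMPT = "You are a helpful coding assistant."  # stable across every request

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
]

# Each new turn is appended, so everything above it is an unchanged prefix
# the provider can keep serving from cache.
messages.append({"role": "user", "content": "Follow-up question"})
```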

Minimum Length Requirements

Most providers require a minimum prompt length for caching to activate:

  • Anthropic: 1,024 tokens for most Claude models; 2,048 tokens for Opus
  • OpenAI: 1,024 tokens
  • DeepSeek: automatic, with no published minimum

How to Structure Prompts for Maximum Caching

The golden rule: put stable content at the beginning of your prompt, and variable content at the end.

Good structure (cacheable)

System prompt → Tool definitions → Document context → User message

The first three components are identical across requests. Only the user message changes. The provider caches everything up to the point where the content diverges.

Bad structure (not cacheable)

User message → System prompt → Tool definitions → Document context

If the first token of your prompt changes every time (because it starts with the user's unique message), nothing gets cached. The provider can only cache a common prefix.

Practical Tip

If you're using the OpenAI or Anthropic SDK, caching works automatically — you don't need to enable it. Just make sure your system prompt comes before any variable content in the messages array.
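
For example, with the OpenAI Python SDK, the only "caching work" is keeping the stable pieces first and the user's message last. A sketch, where the model name is this article's placeholder and the prompt, tool, and document contents are stand-ins for your own:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are a careful coding assistant. Follow the team's style guide."  # stable
DOCUMENT = "Project conventions: ..."  # stable document context, if you have any
TOOLS = [{  # stable tool schemas, defined once at startup (cached as part of the prefix)
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite",
        "parameters": {"type": "object", "properties": {}},
    },
}]

def answer(user_message: str):
    return client.chat.completions.create(
        model="gpt-5.5",  # placeholder name used in this article
        tools=TOOLS,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + DOCUMENT},  # stable prefix
            {"role": "user", "content": user_message},                         # variable suffix
        ],
    )
```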

Provider-Specific Details

Anthropic (Claude)

Anthropic's caching is automatic for prompts over 1,024 tokens. The cache lives for 5 minutes by default — if you send the same prefix again within 5 minutes, it's a cache hit. For Opus models, the minimum is 2,048 tokens.

Anthropic is the only provider that charges a premium for cache writes (25% above the normal input price). This means caching only pays off if you send the same prefix more than once. A prefix sent exactly once costs 25% more than it would uncached, but the moment it's reused, the 90% read discount more than covers that premium: on Sonnet 4.6, two sends cost $3.75 + $0.30 = $4.05 per million cached tokens versus $6.00 uncached, and the gap widens with every additional hit.
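
If you want explicit control rather than relying on automatic behavior, the Anthropic SDK also lets you mark where the cacheable prefix ends with a cache_control breakpoint, and the response usage reports cache writes and reads so you can confirm you're getting hits. A sketch with a placeholder model name; check Anthropic's current docs for whether an explicit breakpoint is required for your model:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are a careful coding assistant. ..."  # in practice, 1,024+ tokens

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder name used in this article
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,              # the stable prefix
        "cache_control": {"type": "ephemeral"},  # marks the end of the cacheable prefix
    }],
    messages=[{"role": "user", "content": "Refactor this function to remove the global state."}],
)

print(response.usage.cache_creation_input_tokens,  # > 0 on the first request (cache write)
      response.usage.cache_read_input_tokens)      # > 0 on later requests (cache hits)
```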

OpenAI (GPT)

OpenAI's caching is also automatic for prompts over 1,024 tokens. Cache TTL is typically 5-10 minutes. There's no premium for cache writes — you pay the same as normal input on the first request, and 90% less on subsequent requests.

OpenAI also offers a Batch API with 50% off all tokens for non-real-time workloads (24-hour turnaround). If your use case allows async processing, combine batching with caching for maximum savings.
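
You can't turn OpenAI's caching on or off, but you can verify it's working: the usage object on each response reports how many prompt tokens were served from cache. A minimal check, assuming the Chat Completions API and a stable system prompt over the 1,024-token minimum (model name is this article's placeholder):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a careful coding assistant. ..."  # stable prefix, 1,024+ tokens

response = client.chat.completions.create(
    model="gpt-5.5",  # placeholder name used in this article
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How should I paginate this query?"},
    ],
)

details = response.usage.prompt_tokens_details
print(f"{details.cached_tokens} of {response.usage.prompt_tokens} prompt tokens came from cache")
```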

DeepSeek

DeepSeek offers the most aggressive cache pricing in the industry. Cache reads on V4 Flash cost $0.0028/M — that's 98% cheaper than the normal $0.14/M input price. For high-volume agent applications, this makes cached tokens essentially free.

DeepSeek's caching applies automatically. There's no minimum length requirement published, and the cache duration is generous for repeated prefixes.
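
DeepSeek's API is OpenAI-compatible, so the same SDK works with a different base URL, and cache activity shows up in the usage object under DeepSeek's own field names. A sketch; the model id is a placeholder, and the usage field names should be verified against DeepSeek's current docs:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

SYSTEM_PROMPT = "You are a careful coding assistant. ..."  # stable prefix

response = client.chat.completions.create(
    model="deepseek-chat",  # placeholder: use the id for V4 Flash from DeepSeek's model list
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Summarize the failing test output."},
    ],
)

usage = response.usage.model_dump()  # DeepSeek adds its own cache fields to usage
print(usage.get("prompt_cache_hit_tokens"), usage.get("prompt_cache_miss_tokens"))
```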

When Caching Doesn't Help

Caching isn't always the answer. Here's when it provides little or no benefit:

  • Unique prompts every time. If every request has completely different content, there's no common prefix to cache.
  • Very short prompts. If your total input is under 1,024 tokens, most providers won't cache it.
  • Infrequent requests. If the same prefix isn't sent again within the cache TTL window (typically 5-10 minutes), the cache expires and the next request pays the full write price again.
  • Constantly changing system prompts. If you A/B test different system prompts frequently, each variant needs its own cache.

Beyond Provider Caching: Application-Level Strategies

Provider-level caching handles repeated prefixes, but you can go further:

Response caching

If two users ask the same question, the response will be identical. Cache the response at the application level (in Redis, Memcached, or even a database) and serve it without making an API call.
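
A minimal in-process version looks like this; the helper names are mine, and in production you'd key into Redis or Memcached with a TTL instead of a module-level dict:

```python
import hashlib

_response_cache: dict[str, str] = {}  # swap for Redis/Memcached with a TTL in production

def cached_answer(user_message: str, call_model) -> str:
    # Exact-match key: the same question (after trivial normalization) reuses the stored answer.
    key = hashlib.sha256(user_message.strip().lower().encode()).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = call_model(user_message)  # only pay for the API call on a miss
    return _response_cache[key]
```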

Semantic caching

Use embeddings to find semantically similar queries. If someone asks "How do I sort a list in Python?" and another asks "How to sort array in Python?", the response is the same. Semantic caching matches these even when the exact wording differs.
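
A sketch of the lookup side, assuming you already keep (embedding, response) pairs around; embed() here is a stand-in for whatever embedding model you use, and the 0.9 similarity threshold is an arbitrary starting point to tune:

```python
import numpy as np

def semantic_lookup(query: str, cache: list[tuple[np.ndarray, str]], embed,
                    threshold: float = 0.9) -> str | None:
    """Return a cached response whose query embedding is close enough to this query's."""
    q = embed(query)                      # embed() is your embedding function (stand-in here)
    q = q / np.linalg.norm(q)
    for vec, cached_response in cache:
        similarity = float(np.dot(q, vec / np.linalg.norm(vec)))  # cosine similarity
        if similarity >= threshold:
            return cached_response        # close enough in meaning: reuse the stored answer
    return None                           # miss: call the model, then store (embedding, response)
```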

Prompt compression

Reduce the size of your system prompt. Every token you remove from the system prompt saves money on every single request. Use concise instructions, remove redundant examples, and compress verbose tool descriptions.

Calculate how much you'd save with caching

Open Cost Calculator

The Bottom Line

Prompt caching is the single most impactful cost optimization available to AI API users. It requires zero code changes on most providers, and the savings compound with every request. If your application sends the same system prompt more than once, you're leaving money on the table.

Start by measuring your current costs. Then structure your prompts with stable content first, variable content last. The savings will follow.

Pricing sourced from official provider documentation (May 2026). Cache TTL, minimum lengths, and pricing may change. Always verify current rates on the provider's pricing page. DeepSeek V4 Pro pricing reflects the 75% promotional discount valid until May 31, 2026.