Claude Prompt Caching: Cut Your API Costs by 90%
If you are calling the Claude API repeatedly with a large system prompt, a big document, or a long codebase context — and you are not using prompt caching — you are paying full price every time for content that has not changed. Prompt caching stores a prefix of your prompt server-side and charges 90% less to read it back on every subsequent request.
For applications that repeatedly process the same context, this is the single highest-impact API optimisation available.
How It Works
Every Claude API request has two cost components: input tokens and output tokens. Input tokens include your system prompt, any documents you include, and the conversation history. If you include the same 50,000-token codebase context on every request, you pay for all 50,000 tokens every time.
Prompt caching changes this. You mark a specific point in your prompt as a cache breakpoint, and Anthropic stores everything up to that point server-side. On any subsequent request whose prompt begins with the same prefix, you pay the cache read rate for those tokens instead of the full input rate.
sequenceDiagram
participant App as Your App
participant API as Claude API
Note over App,API: First request — cache write
App->>API: System prompt (50k tokens) + question [cache_control: ephemeral]
API->>API: Store prefix in cache
API-->>App: Response. Charged: 1.25x input rate for cached tokens
Note over App,API: Second request — cache hit
App->>API: Same prefix + new question
API->>API: Found in cache
API-->>App: Response. Charged: 0.1x input rate for cached tokens
Cache pricing:
| Cache type | Write | Read |
|---|---|---|
| Ephemeral (5-min TTL) | 1.25x input rate | 0.1x input rate |
| Persistent (1-hour TTL) | 2x input rate | 0.1x input rate |
The write costs slightly more than a normal request, but every subsequent read within the TTL costs 90% less. With the 5-minute cache, a single reuse already pays for the write premium; with the 1-hour cache, you break even after two reads.
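As a quick sanity check, here is the break-even arithmetic for a 50,000-token prefix, expressed in uncached-input-token equivalents using the multipliers from the table above:

prefix = 50_000  # tokens in the cached prefix

# Two requests with the 5-minute cache: one write (1.25x) + one read (0.1x)
cached_5min = prefix * (1.25 + 0.1)   # 67,500 token-equivalents
uncached_2 = prefix * 2               # 100,000 token-equivalents

# Three requests with the 1-hour cache: one write (2x) + two reads (0.1x each)
cached_1h = prefix * (2.0 + 0.2)      # 110,000 token-equivalents
uncached_3 = prefix * 3               # 150,000 token-equivalents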
Basic Implementation
Add a cache_control field to the last content block you want included in the cache. The cache breakpoint sits at the end of that block, so everything up to and including it is cached.
Python
import anthropic
client = anthropic.Anthropic()
# Load your large context once
with open("codebase-summary.txt") as f:
codebase_context = f.read()
def ask_about_codebase(question: str) -> str:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are an expert software engineer helping with this codebase.",
},
{
"type": "text",
"text": codebase_context,
"cache_control": {"type": "ephemeral"} # cache everything up to here
}
],
messages=[{"role": "user", "content": question}]
)
return response.content[0].text
# First call: cache write (1.25x rate for the large context)
answer1 = ask_about_codebase("Why is the auth service slow?")
# Second call: cache hit (0.1x rate for the large context)
answer2 = ask_about_codebase("How does session management work?")
# The cache hit bills the large context at 0.1x — roughly 90% cheaper
TypeScript
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";
const client = new Anthropic();
const codebaseContext = readFileSync("codebase-summary.txt", "utf-8");
async function askAboutCodebase(question: string): Promise<string> {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
system: [
{
type: "text",
text: "You are an expert software engineer helping with this codebase.",
},
{
type: "text",
text: codebaseContext,
cache_control: { type: "ephemeral" },
},
],
messages: [{ role: "user", content: question }],
});
return (response.content[0] as { text: string }).text;
}
Cache TTL: Ephemeral vs. Persistent
Ephemeral (5-minute TTL) is the default. It is the right choice when:
- You are running an interactive session where questions come rapidly
- You are processing a document and asking many questions in quick succession
- Cost per write is a concern (1.25x vs 2x)
Persistent (1-hour TTL) costs more to write but keeps the cache alive longer. It is the right choice when:
- Requests come less frequently (every few minutes rather than every few seconds)
- You have a shared context used by multiple users (e.g., a documentation assistant)
- The write cost is small compared to the number of reads you expect
# Ephemeral (5-min TTL) — default
"cache_control": {"type": "ephemeral"}
# Persistent (1-hour TTL) — higher write cost, longer window.
# The type stays "ephemeral"; the longer TTL is requested via the ttl field.
"cache_control": {"type": "ephemeral", "ttl": "1h"}
Multi-Turn Conversations with Caching
For multi-turn conversations, you want to cache both the stable system context and the growing conversation history. The key is re-sending the full conversation each turn with cache_control on the most recent user message, so everything up to that point is cached and read back on the following turn.
import anthropic
client = anthropic.Anthropic()
system_prompt = "You are a DevOps expert. Help the user debug their infrastructure."
large_runbook = open("runbook.md").read()
conversation_history = []
def chat(user_message: str) -> str:
conversation_history.append({"role": "user", "content": user_message})
# Mark the last user message as a cache point so history
# is cached up to this point for the next turn
messages_with_cache = []
for i, msg in enumerate(conversation_history):
if i == len(conversation_history) - 1 and msg["role"] == "user":
messages_with_cache.append({
"role": msg["role"],
"content": [
{
"type": "text",
"text": msg["content"],
"cache_control": {"type": "ephemeral"}
}
]
})
else:
messages_with_cache.append(msg)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
system=[
{"type": "text", "text": system_prompt},
{
"type": "text",
"text": large_runbook,
"cache_control": {"type": "ephemeral"} # always cached
}
],
messages=messages_with_cache
)
assistant_message = response.content[0].text
conversation_history.append({"role": "assistant", "content": assistant_message})
return assistant_message
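Used like this, the runbook and system prompt are written to the cache on the first turn and read back on every later turn, assuming each turn lands within the 5-minute TTL (the questions below are just illustrative):

print(chat("Our API gateway is returning intermittent 502s after the last deploy."))
# Second turn within 5 minutes: the runbook, system prompt, and earlier history
# are served from cache — only the new message is billed at the full input rate.
print(chat("Which runbook section covers load balancer health checks?"))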
Automatic Caching
As of February 2026, Anthropic offers automatic caching — add one cache_control field to your system prompt and the infrastructure automatically advances the cache breakpoint as your context grows. You do not need to manually track which turns to cache.
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": large_context,
"cache_control": {"type": "ephemeral"} # that's it — auto-managed from here
}
],
messages=conversation_history
)
With automatic caching, the server detects the longest common prefix between your current request and recent cached requests, and serves from cache accordingly. This is the recommended approach for new integrations.
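A minimal sketch of what that looks like in practice, reusing the large_context and conversation_history names from above — note that no per-message cache_control is needed:

def ask(question: str) -> str:
    conversation_history.append({"role": "user", "content": question})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": large_context,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=conversation_history,
    )
    answer = response.content[0].text
    conversation_history.append({"role": "assistant", "content": answer})
    return answer

ask("Summarise the deployment pipeline.")  # writes large_context to the cache
ask("Where are rollbacks handled?")        # reads it back; the breakpoint advances with the history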
Checking Cache Performance
The API response includes cache usage metrics so you can verify your caching is working:
response = client.messages.create(...)
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
# Cache hit rate
total = usage.input_tokens + usage.cache_creation_input_tokens + usage.cache_read_input_tokens
if total > 0:
hit_rate = usage.cache_read_input_tokens / total
print(f"Cache hit rate: {hit_rate:.1%}")
A well-tuned implementation targeting a large stable prefix should show 80–95% of tokens as cache reads after the first request.
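To translate those counts into an approximate dollar figure, a small helper like the sketch below works. The base price per million input tokens is a placeholder — substitute your model's actual rate — and it assumes 5-minute cache writes (1.25x):

def estimate_input_cost(usage, price_per_mtok: float = 3.00) -> dict:
    """Rough input-side cost estimate from a response's usage block."""
    per_tok = price_per_mtok / 1_000_000
    actual = (usage.input_tokens * per_tok
              + usage.cache_creation_input_tokens * per_tok * 1.25
              + usage.cache_read_input_tokens * per_tok * 0.10)
    without_cache = (usage.input_tokens
                     + usage.cache_creation_input_tokens
                     + usage.cache_read_input_tokens) * per_tok
    return {"actual": actual, "without_caching": without_cache, "saved": without_cache - actual}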
DevOps Use Cases
Caching a Codebase for a Review Session
import anthropic
import subprocess
client = anthropic.Anthropic()
# Get all TypeScript files
ts_files = subprocess.check_output(
["find", "src", "-name", "*.ts", "-not", "-path", "*/node_modules/*"],
text=True
).strip().split("\n")
# Build context from all files
codebase = []
for path in ts_files:
try:
content = open(path).read()
codebase.append(f"=== {path} ===\n{content}")
except Exception:
pass
full_context = "\n\n".join(codebase)
def review_pr_changes(diff: str) -> str:
"""Review a git diff against cached codebase context."""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
system=[
{
"type": "text",
"text": "You are a senior engineer reviewing code changes. "
"You have full context of the codebase below.",
},
{
"type": "text",
"text": full_context,
"cache_control": {"type": "ephemeral"}
}
],
messages=[{
"role": "user",
"content": f"Review this diff for bugs, security issues, and improvements:\n\n{diff}"
}]
)
return response.content[0].text
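You can then feed it diffs straight from git (the branch names here are illustrative):

# Review several branches back-to-back — each call after the first hits the
# cached codebase context, so only the diff itself is billed at the full rate.
for branch in ["feature/rate-limiter", "fix/token-refresh"]:
    diff = subprocess.check_output(["git", "diff", f"main...{branch}"], text=True)
    print(review_pr_changes(diff))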
Caching Documentation for a Support Bot
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

# Concatenate every Markdown file under docs/ into a single cacheable context
docs_content = ""
for doc_file in Path("docs").rglob("*.md"):
docs_content += f"\n\n## {doc_file}\n\n{doc_file.read_text()}"
def answer_support_question(question: str) -> str:
response = client.messages.create(
model="claude-haiku-4-5-20251001", # Haiku for fast, cheap support responses
max_tokens=512,
system=[
{"type": "text", "text": "Answer questions based on our documentation."},
{
"type": "text",
"text": docs_content,
"cache_control": {"type": "ephemeral", "ttl": "1h"}  # 1-hour TTL — questions come slowly
}
],
],
messages=[{"role": "user", "content": question}]
)
return response.content[0].text
Combining Caching with the Batch API
For maximum cost reduction on non-urgent work, combine prompt caching with the Batch API:
- Batch API: 50% discount on all requests
- Prompt caching: 90% discount on cached input tokens
- Combined: up to 95% reduction versus standard per-request pricing
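The discounts stack multiplicatively on the cached portion of the input: a cache read is billed at 0.1x the input rate, and the batch discount halves that again, so 0.1 × 0.5 = 0.05x — about 95% off those tokens.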
import anthropic
client = anthropic.Anthropic()
large_context = open("large-context.txt").read()
# Create a batch of 100 requests — all reusing the same cached context
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"task-{i}",
"params": {
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"system": [
{"type": "text", "text": "You are a code reviewer."},
{
"type": "text",
"text": large_context,
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{"role": "user", "content": f"Review change #{i} for security issues."}
]
}
}
for i in range(100)
]
)
print(f"Batch submitted: {batch.id}")
print("Will complete within 24 hours.")
What Can and Cannot Be Cached
Can be cached:
- System prompt content
- User message content
- Tool definitions
- Documents and file content included in messages
Cannot be cached:
- Output tokens — you always pay full output rate
- Anything after the final cache breakpoint — content that follows the last cache_control marker is billed at the normal input rate
Minimum cacheable size: The minimum size for a cache breakpoint is approximately 1,024 tokens. Caching very short prefixes is not supported — the content needs to be substantial enough that the savings justify the infrastructure overhead.
Summary: When to Use Each Approach
| Scenario | Approach |
|---|---|
| Interactive session with a large codebase | Ephemeral caching, automatic mode |
| Documentation assistant with shared users | Persistent caching (1-hour TTL) |
| Batch code review across 100+ PRs | Batch API + ephemeral caching |
| Support bot answering questions slowly | Persistent caching + Haiku model |
| Real-time chat with conversation history | Ephemeral caching on last user turn |
| One-off request, no repetition | Skip caching — no benefit |
The minimum effective use of caching is a single large document that you query more than once in a 5-minute window. The maximum benefit is a large stable context used across hundreds of requests per hour — there, the savings are dramatic.
Prompt caching is available on all current Claude models: Opus 4.7, Sonnet 4.6, and Haiku 4.5.
