Claude Prompt Caching: Cut Your API Costs by 90%

May 03, 2026 Abhay 8 min read

Claude Prompt Caching: Cut Your API Costs by 90%

If you are calling the Claude API repeatedly with a large system prompt, a big document, or a long codebase context — and you are not using prompt caching — you are paying full price every time for content that has not changed. Prompt caching stores a prefix of your prompt server-side and charges 90% less to read it back on every subsequent request.

For applications that repeatedly process the same context, this is the single highest-impact API optimisation available.

How It Works

Every Claude API request has two cost components: input tokens and output tokens. Input tokens include your system prompt, any documents you include, and the conversation history. If you include the same 50,000-token codebase context on every request, you pay for all 50,000 tokens every time.

Prompt caching changes this. You mark a specific point in your prompt as a cache breakpoint. Anthropic stores everything up to that point server-side. On the next request that hits the same prefix, you pay the cache read rate instead of the full input rate.

sequenceDiagram
    participant App as Your App
    participant API as Claude API

    Note over App,API: First request — cache write
    App->>API: System prompt (50k tokens) + question [cache_control: ephemeral]
    API->>API: Store prefix in cache
    API-->>App: Response. Charged: 1.25x input rate for cached tokens

    Note over App,API: Second request — cache hit
    App->>API: Same prefix + new question
    API->>API: Found in cache
    API-->>App: Response. Charged: 0.1x input rate for cached tokens

Cache pricing:

	Write	Read
Ephemeral (5-min TTL)	1.25x input rate	0.1x input rate
Persistent (1-hour TTL)	2x input rate	0.1x input rate

The write cost is slightly higher than a normal request, but every subsequent read within the TTL costs 90% less. On any prefix you use more than twice within the TTL window, caching pays for itself.

Basic Implementation

Add a cache_control field to any content block you want to cache. The cache breakpoint sits at the end of that block.

Python

import anthropic

client = anthropic.Anthropic()

# Load your large context once
with open("codebase-summary.txt") as f:
    codebase_context = f.read()

def ask_about_codebase(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are an expert software engineer helping with this codebase.",
            },
            {
                "type": "text",
                "text": codebase_context,
                "cache_control": {"type": "ephemeral"}  # cache everything up to here
            }
        ],
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text

# First call: cache write (1.25x rate for the large context)
answer1 = ask_about_codebase("Why is the auth service slow?")

# Second call: cache hit (0.1x rate for the large context)
answer2 = ask_about_codebase("How does session management work?")

# The cache hit saves ~89% on the large context tokens

TypeScript

import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";

const client = new Anthropic();
const codebaseContext = readFileSync("codebase-summary.txt", "utf-8");

async function askAboutCodebase(question: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: "You are an expert software engineer helping with this codebase.",
      },
      {
        type: "text",
        text: codebaseContext,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: question }],
  });

  return (response.content[0] as { text: string }).text;
}

Cache TTL: Ephemeral vs. Persistent

Ephemeral (5-minute TTL) is the default. It is the right choice when:

You are running an interactive session where questions come rapidly
You are processing a document and asking many questions in quick succession
Cost per write is a concern (1.25x vs 2x)

Persistent (1-hour TTL) costs more to write but keeps the cache alive longer. It is the right choice when:

Requests come less frequently (every few minutes rather than every few seconds)
You have a shared context used by multiple users (e.g., a documentation assistant)
The write cost is small compared to the number of reads you expect

# Ephemeral (5-min TTL) — default
"cache_control": {"type": "ephemeral"}

# Persistent (1-hour TTL) — higher write cost, longer window
"cache_control": {"type": "persistent"}

Multi-Turn Conversations with Caching

For multi-turn conversations, you want to cache both the stable system context and the growing conversation history. The key is re-sending the full conversation with cache_control on the most recent complete exchange.

import anthropic

client = anthropic.Anthropic()

system_prompt = "You are a DevOps expert. Help the user debug their infrastructure."
large_runbook = open("runbook.md").read()

conversation_history = []

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})

    # Mark the last user message as a cache point so history
    # is cached up to this point for the next turn
    messages_with_cache = []
    for i, msg in enumerate(conversation_history):
        if i == len(conversation_history) - 1 and msg["role"] == "user":
            messages_with_cache.append({
                "role": msg["role"],
                "content": [
                    {
                        "type": "text",
                        "text": msg["content"],
                        "cache_control": {"type": "ephemeral"}
                    }
                ]
            })
        else:
            messages_with_cache.append(msg)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": large_runbook,
                "cache_control": {"type": "ephemeral"}  # always cached
            }
        ],
        messages=messages_with_cache
    )

    assistant_message = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

Automatic Caching

As of February 2026, Anthropic offers automatic caching — add one cache_control field to your system prompt and the infrastructure automatically advances the cache breakpoint as your context grows. You do not need to manually track which turns to cache.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_context,
            "cache_control": {"type": "ephemeral"}  # that's it — auto-managed from here
        }
    ],
    messages=conversation_history
)

With automatic caching, the server detects the longest common prefix between your current request and recent cached requests, and serves from cache accordingly. This is the recommended approach for new integrations.

Checking Cache Performance

The API response includes cache usage metrics so you can verify your caching is working:

response = client.messages.create(...)

usage = response.usage
print(f"Input tokens:          {usage.input_tokens}")
print(f"Cache write tokens:    {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:     {usage.cache_read_input_tokens}")

# Cache hit rate
total = usage.input_tokens + usage.cache_creation_input_tokens + usage.cache_read_input_tokens
if total > 0:
    hit_rate = usage.cache_read_input_tokens / total
    print(f"Cache hit rate:        {hit_rate:.1%}")

A well-tuned implementation targeting a large stable prefix should show 80–95% of tokens as cache reads after the first request.

DevOps Use Cases

Caching a Codebase for a Review Session

import anthropic
import subprocess

client = anthropic.Anthropic()

# Get all TypeScript files
ts_files = subprocess.check_output(
    ["find", "src", "-name", "*.ts", "-not", "-path", "*/node_modules/*"],
    text=True
).strip().split("\n")

# Build context from all files
codebase = []
for path in ts_files:
    try:
        content = open(path).read()
        codebase.append(f"=== {path} ===\n{content}")
    except Exception:
        pass

full_context = "\n\n".join(codebase)

def review_pr_changes(diff: str) -> str:
    """Review a git diff against cached codebase context."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": "You are a senior engineer reviewing code changes. "
                        "You have full context of the codebase below.",
            },
            {
                "type": "text",
                "text": full_context,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{
            "role": "user",
            "content": f"Review this diff for bugs, security issues, and improvements:\n\n{diff}"
        }]
    )
    return response.content[0].text

Caching Documentation for a Support Bot

docs_content = ""
for doc_file in Path("docs").rglob("*.md"):
    docs_content += f"\n\n## {doc_file}\n\n{doc_file.read_text()}"

def answer_support_question(question: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Haiku for fast, cheap support responses
        max_tokens=512,
        system=[
            {"type": "text", "text": "Answer questions based on our documentation."},
            {
                "type": "text",
                "text": docs_content,
                "cache_control": {"type": "persistent"}  # 1-hour TTL — questions come slowly
            }
        ],
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text

Combining Caching with the Batch API

For maximum cost reduction on non-urgent work, combine prompt caching with the Batch API:

Batch API: 50% discount on all requests
Prompt caching: 90% discount on cached input tokens
Combined: up to 95% reduction versus standard per-request pricing

import anthropic

client = anthropic.Anthropic()
large_context = open("large-context.txt").read()

# Create a batch of 100 requests — all reusing the same cached context
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"task-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "system": [
                    {"type": "text", "text": "You are a code reviewer."},
                    {
                        "type": "text",
                        "text": large_context,
                        "cache_control": {"type": "ephemeral"}
                    }
                ],
                "messages": [
                    {"role": "user", "content": f"Review change #{i} for security issues."}
                ]
            }
        }
        for i in range(100)
    ]
)

print(f"Batch submitted: {batch.id}")
print("Will complete within 24 hours.")

What Can and Cannot Be Cached

Can be cached:

System prompt content
User message content
Tool definitions
Documents and file content included in messages

Cannot be cached:

Output tokens — you always pay full output rate
The last message in a conversation (it needs to be complete to cache)
Images (currently; text-based content only)

Minimum cacheable size: The minimum size for a cache breakpoint is approximately 1,024 tokens. Caching very short prefixes is not supported — the content needs to be substantial enough that the savings justify the infrastructure overhead.

Summary: When to Use Each Approach

Scenario	Approach
Interactive session with a large codebase	Ephemeral caching, automatic mode
Documentation assistant with shared users	Persistent caching (1-hour TTL)
Batch code review across 100+ PRs	Batch API + ephemeral caching
Support bot answering questions slowly	Persistent caching + Haiku model
Real-time chat with conversation history	Ephemeral caching on last user turn
One-off request, no repetition	Skip caching — no benefit

The minimum effective use of caching is a single large document that you query more than once in a 5-minute window. The maximum benefit is a large stable context used across hundreds of requests per hour — there, the savings are dramatic.

Prompt caching is available on all current Claude models: Opus 4.7, Sonnet 4.6, and Haiku 4.5.

Abhay Pratap Singh

DevOps Engineer passionate about automation, cloud infrastructure, and self-hosted tools. I write about Kubernetes, Terraform, DNS, and everything in between.

GitHub LinkedIn RSS