Stop Burning Tokens: A Practical Guide to Claude Code Cost Optimization

April 26, 2026 Abhay 7 min read

Token usage with Claude Code follows a frustrating pattern: costs are not spread evenly but cluster around a handful of bad habits. Developers who use Claude Code daily can burn 40–60% more tokens than they need to, simply because of how they phrase prompts, what they put in CLAUDE.md, and which model they reach for by default.

This guide covers six concrete habits, plus one bonus pattern, that make an immediate difference.

Why Tokens Are Worth Caring About

Every message you send in a Claude Code session includes:

The full conversation history up to that point
Every file Claude has read during the session
The content of your CLAUDE.md files
Tool outputs from previous commands
MCP tool definitions for configured servers

That means a long session is not just expensive on the last message — it gets more expensive on every single message, because history accumulates. A 3-hour debugging session where Claude reads ten files can cost 10x what the same work would cost if you cleared context between tasks.

flowchart LR
    subgraph SessionCost[What Goes Into Every Message]
        H[Conversation history]
        F[Files Claude read]
        C[CLAUDE.md content]
        T[Tool outputs]
        M[MCP definitions]
    end
    SessionCost -->|multiplied by| MSG[Every new message]
    MSG --> COST[Token cost compounds over time]
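To make the compounding concrete, here is a minimal arithmetic sketch. The token counts are hypothetical, and the model ignores prompt caching and output tokens, so treat the result as an illustration of the growth pattern rather than a billing estimate.

```python
def cumulative_input_tokens(base_context: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens across a session in which every message resends
    the fixed context plus everything said in earlier turns."""
    total = 0
    for turn in range(1, turns + 1):
        history = tokens_per_turn * (turn - 1)  # all earlier turns
        total += base_context + history + tokens_per_turn
    return total


# Hypothetical numbers: 5,000 tokens of fixed context, 2,000 tokens added per turn.
one_long_session = cumulative_input_tokens(5_000, 2_000, turns=30)
three_short_sessions = 3 * cumulative_input_tokens(5_000, 2_000, turns=10)

print(one_long_session)      # 1080000
print(three_short_sessions)  # 480000
```

Under these assumptions the same 30 turns cost less than half as much when split into three sessions with cleared context, because the history term grows quadratically with session length.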

Habit 1: Trim Your CLAUDE.md

CLAUDE.md loads into every context window, every session, on every message. If your file is 400 lines, you are paying for 400 lines of context on every single message — even when 300 of those lines are things Claude already knows.

Common bloat to cut:

Framework documentation (“React is a library for building UIs”) — Claude knows this
Obvious best practices (“write readable code”, “use meaningful variable names”)
Package version lists Claude can read from package.json
Architectural diagrams that are already in your README

What to keep:

Commands Claude cannot guess: npm run deploy:staging, ./scripts/seed-db.sh
Team conventions that differ from defaults: “we use 4-space indentation, not 2”
Project-specific gotchas: “the auth service requires NODE_ENV=production to initialise”
Model routing rules (see Habit 3 below)

Target: under 150 lines. Instruction adherence tends to drop as the file grows past that length, so a shorter, sharper file is both cheaper and more effective.
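A quick way to check a file against that target is a small script like the one below. It is only a sketch: the 4-characters-per-token ratio is a rough heuristic rather than a real tokenizer, and the `claude_md_report` helper is a hypothetical name, not part of any Claude tooling.

```python
from pathlib import Path

LINE_BUDGET = 150      # target suggested in this guide
CHARS_PER_TOKEN = 4    # rough heuristic, not a real tokenizer


def claude_md_report(text: str) -> dict:
    """Return line counts, a rough token estimate, and an over-budget flag."""
    lines = text.splitlines()
    non_blank = [line for line in lines if line.strip()]
    return {
        "lines": len(lines),
        "non_blank_lines": len(non_blank),
        "approx_tokens": len(text) // CHARS_PER_TOKEN,
        "over_budget": len(lines) > LINE_BUDGET,
    }


if __name__ == "__main__":
    path = Path("CLAUDE.md")  # assumes the script runs from the project root
    if path.exists():
        print(claude_md_report(path.read_text(encoding="utf-8")))
```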

Habit 2: Ask for Specific Lines, Not Whole Files

When Claude reads a file, the entire file enters the context window. If you ask “what does this function in auth.ts do?” and auth.ts is 800 lines, all 800 lines are loaded — even if the function is at line 45.

Instead, be specific:

Check lines 40 to 65 in src/auth/token-service.ts — why does the refresh logic fail when the user is logged in on two devices?

This is especially important for:

Configuration files (often large, usually only partially relevant)
Migration files (you usually care about one migration)
Log files (you need the last 50 lines, not 10,000)

# Instead of: cat app.log | claude -p "debug this"
tail -100 app.log | claude -p "what errors appear here and why?"
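The same idea works for source files: extract only the range you care about and paste or pipe that into the prompt. The helper below is a minimal sketch; `read_line_range` is a hypothetical name, and prefixing each line with its number is just one reasonable choice so that Claude can refer back to specific lines.

```python
from itertools import islice


def read_line_range(path: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive), each prefixed with its
    line number, without reading the rest of the file into memory."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    with open(path, encoding="utf-8") as handle:
        selected = islice(handle, start - 1, end)
        return "".join(
            f"{number}: {line}" for number, line in enumerate(selected, start=start)
        )


# Example (path taken from the prompt above):
# print(read_line_range("src/auth/token-service.ts", 40, 65))
```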

Habit 3: Route Tasks to the Right Model

The default behaviour in Claude Code is to use a capable model for everything. That is right for complex work but overkill for simple tasks.

flowchart TD
    Task[New task] --> Q1{Requires deep reasoning or architecture?}
    Q1 -->|Yes| Opus[Claude Opus\nArchitecture decisions\nComplex refactors\nDeep debugging]
    Q1 -->|No| Q2{Standard implementation or debugging?}
    Q2 -->|Yes| Sonnet[Claude Sonnet\nCode implementation\nBug fixes\nCode review\nTest writing]
    Q2 -->|No| Haiku[Claude Haiku\nRenamings\nFormatting\nSimple lookups\nSummaries]

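Add this routing table to your CLAUDE.md so the rule is written down for both you and Claude. It guides how Claude delegates work to subagents; the main session model still changes only when you switch it: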

## Model Routing

Use the cheapest model that can handle the task:

- **Haiku:** renaming, formatting, simple lookups, one-line changes, summarising output
- **Sonnet:** standard implementation, debugging, code review, writing tests
- **Opus:** architecture decisions, complex multi-file refactors, security analysis

You can also switch mid-session:

/model claude-haiku-4-5    # switch to Haiku for a simple task
/model claude-sonnet-4-6   # back to Sonnet for implementation
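If you drive Claude from scripts, the same routing table can be expressed as data. This is a sketch only: the task labels and the `pick_tier` helper are hypothetical, and the function returns a tier name rather than a concrete model ID so that it does not go stale when models are renamed.

```python
# Routing table from this guide, expressed as tier -> task kinds.
ROUTING = {
    "haiku": {"rename", "format", "lookup", "summarise"},
    "sonnet": {"implement", "debug", "review", "test"},
    "opus": {"architecture", "multi-file-refactor", "security"},
}


def pick_tier(task_kind: str, default: str = "sonnet") -> str:
    """Return the cheapest tier listed for task_kind, or the default tier."""
    for tier, kinds in ROUTING.items():  # dicts keep insertion order: cheapest first
        if task_kind in kinds:
            return tier
    return default


print(pick_tier("rename"))    # haiku
print(pick_tier("security"))  # opus
print(pick_tier("unknown"))   # sonnet
```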

Habit 4: Front-Load Your Prompts

The most expensive pattern in any Claude Code session is back-and-forth clarification. Each exchange reloads the full context. A prompt that generates three rounds of clarification costs roughly 4x as much as a prompt that generates one complete answer.

Expensive pattern:

You: "Refactor the payment flow"
Claude: "Which part? The card form or the API layer?"
You: "The API layer"
Claude: "What framework? What's the error?"
You: "Express, and it times out on large orders"

That is three context loads before any work begins.

Cheaper pattern:

Before writing any code, confirm your understanding:
- I want to refactor the payment API layer in src/api/payments/
- It uses Express and times out on orders with more than 50 line items
- I need the fix to not change the public API contract
- Tell me if anything is ambiguous before you start

One message establishes full context. Claude confirms its understanding and, if nothing is ambiguous, starts work: one context load instead of three.

sequenceDiagram
    participant Dev as Developer
    participant Claude as Claude Code

    Note over Dev,Claude: Expensive pattern (3 context loads)
    Dev->>Claude: Refactor the payment flow
    Claude-->>Dev: Which part?
    Dev->>Claude: The API layer
    Claude-->>Dev: What framework?
    Dev->>Claude: Express, it times out on large orders
    Claude-->>Dev: OK, making changes...

    Note over Dev,Claude: Cheap pattern (1 context load)
    Dev->>Claude: Refactor payment API in src/api/payments/ - Express - times out on 50+ line items - keep public API unchanged - confirm understanding first
    Claude-->>Dev: Confirmed. Making changes...
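The difference between the two patterns can be put into rough numbers. The sketch below assumes a hypothetical 20,000 tokens of base context (files, CLAUDE.md, tool definitions) and made-up message sizes; it counts input tokens only and ignores prompt caching.

```python
def input_tokens_per_load(base_context: int, exchanges: list[tuple[int, int]]) -> list[int]:
    """exchanges holds (user_tokens, assistant_tokens) for each round.
    Returns the input size of each request, which includes all earlier rounds."""
    loads = []
    history = 0
    for user_tokens, assistant_tokens in exchanges:
        loads.append(base_context + history + user_tokens)
        history += user_tokens + assistant_tokens
    return loads


# Vague prompt: three short user messages, two short clarifying replies.
vague = input_tokens_per_load(20_000, [(10, 20), (10, 20), (20, 0)])
# Front-loaded prompt: one longer message that needs no clarification.
front_loaded = input_tokens_per_load(20_000, [(80, 0)])

print(vague, sum(vague))                # [20010, 20040, 20080] 60130
print(front_loaded, sum(front_loaded))  # [20080] 20080
```

The extra words in the front-loaded prompt are negligible next to the base context, which is why the number of round trips dominates the cost.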

Habit 5: Scope Your Subagent Briefs

When Claude spawns subagents to investigate your codebase, vague instructions cause expensive exploration. A subagent told to “look into the auth system” may read dozens of files searching for context it does not need.

Vague (expensive):

Use a subagent to investigate why login is slow

Scoped (cheap):

Use a subagent to investigate login latency.
Read only: src/auth/login-handler.ts, src/middleware/session.ts, and src/db/user-queries.ts
Report: which function takes the longest and why
Do not read any other files

Tight scoping reduces subagent context to just the relevant files, cutting costs dramatically on large codebases.
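If you write scoped briefs often, a tiny template helper keeps them consistent. This is an illustrative sketch: `scoped_brief` is a hypothetical helper that only formats text in the shape shown above, and it does not call Claude.

```python
def scoped_brief(goal: str, files: list[str], report: str) -> str:
    """Format a subagent brief with an explicit file allowlist."""
    if not files:
        raise ValueError("list at least one file so the subagent has a boundary")
    lines = [
        f"Use a subagent to {goal}.",
        "Read only: " + ", ".join(files),
        f"Report: {report}",
        "Do not read any other files",
    ]
    return "\n".join(lines)


print(
    scoped_brief(
        goal="investigate login latency",
        files=["src/auth/login-handler.ts", "src/middleware/session.ts"],
        report="which function takes the longest and why",
    )
)
```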

Habit 6: Disable Unused MCP Servers

Every MCP server you have enabled loads its full tool definition into context on every message — whether you use that server or not. Five connected MCP servers can add thousands of tokens per message as a constant overhead.

# List configured MCP servers and their status (use /context to see their token cost)
/mcp

# Disable servers you are not using in this session
/mcp disable google-drive
/mcp disable slack

# Re-enable when needed
/mcp enable slack

The practical rule: if you have not used an MCP server in the last two hours, disable it. The savings are proportional to how many servers you have connected and how verbose their tool schemas are.
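A back-of-the-envelope estimate shows how that constant overhead adds up. The per-server token figures below are hypothetical placeholders; read the real numbers for your setup from /context.

```python
def mcp_overhead_tokens(servers: dict[str, int], messages: int) -> int:
    """servers maps a server name to the tokens its tool definitions occupy.
    Returns the total overhead across a session of the given length."""
    return sum(servers.values()) * messages


# Hypothetical schema sizes per server.
enabled = {"google-drive": 1_500, "slack": 1_200, "github": 2_500}
trimmed = {"github": 2_500}  # after disabling the two unused servers

print(mcp_overhead_tokens(enabled, messages=40))  # 208000
print(mcp_overhead_tokens(trimmed, messages=40))  # 100000
```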

Bonus: The Advisor Tool Pattern

For long agentic runs where you need high-quality decision-making but most of the work is routine execution, the Advisor Tool is a powerful cost lever. A fast, cheap executor model (Haiku) does the file reading, editing, and command running. When it hits a decision point requiring judgment, it calls an intelligent advisor model (Opus 4.7) for a short plan.

The advisor generates a plan of roughly 400–700 tokens. The executor then carries out the full implementation at its much lower rate. You get near-Opus quality for the hard decisions at a fraction of the cost of running Opus throughout.

This is particularly effective for large refactors and long dependency audits — tasks with many mechanical steps punctuated by a few genuinely hard decisions.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-haiku-4-5-20251001",   # cheap executor
    max_tokens=4096,
    extra_headers={"anthropic-beta": "advisor-tool-2026-03-01"},
    tools=[
        {
            "type": "advisor",
            "advisor_model": "claude-opus-4-7",  # smart advisor
        }
    ],
    messages=[
        {"role": "user", "content": "Refactor our test suite to use async patterns throughout"}
    ]
)

Tracking Your Usage

Make it a habit to check costs before ending a session:

/cost      # estimated session cost so far
/usage     # plan usage limits and rate-limit status
/context   # see what is consuming context space

The /context command is particularly useful — it shows you exactly what is taking up space, so you can target the right optimisation.

Quick Reference

Habit	Impact	Effort
Trim CLAUDE.md to under 150 lines	High	One-time
Use line ranges instead of full files	High	Low per-prompt
Route tasks to Haiku/Sonnet/Opus	Medium	One-time setup
Front-load context in prompts	High	Medium habit change
Scope subagent file access	Medium	Low per-task
Run `/clear` between unrelated tasks	High	Low habit
Disable unused MCP servers	Medium	Low habit

The biggest gains come from CLAUDE.md trimming and front-loading. Do those two first, then check your usage dashboard after a week; the difference should be clearly visible.

Abhay Pratap Singh

DevOps Engineer passionate about automation, cloud infrastructure, and self-hosted tools. I write about Kubernetes, Terraform, DNS, and everything in between.