Claude Extended and Adaptive Thinking: Making Claude Reason Before It Answers

By default, Claude generates its response token by token without any deliberate planning step. For most tasks — answering a question, writing a function, explaining a concept — this is fine: the response comes quickly and the quality holds up.

For some tasks, it is not enough. Complex multi-step reasoning problems, ambiguous architecture decisions, intricate security analyses — these benefit from Claude thinking through the problem before committing to an answer. That is what extended thinking and adaptive thinking provide.


Extended Thinking vs. Adaptive Thinking

These are two distinct mechanisms, and which one you use depends on the model:

Extended Thinking (manual control, Sonnet 4.6 and Haiku 4.5):

  • You explicitly enable thinking and set a token budget
  • Claude uses up to that many tokens thinking through the problem
  • You see the thinking blocks in the response (or choose to hide them)
  • Available as thinking: {type: "enabled", budget_tokens: N}

Adaptive Thinking (model-controlled, Sonnet 4.6 and Opus 4.7):

  • The model decides when and how much to think
  • You control the effort level (low, medium, high, max)
  • Available as thinking: {type: "adaptive"}, effort: "high"

Important: Opus 4.7 only supports adaptive thinking. If you pass type: "enabled" with a budget on Opus 4.7, the API returns an error. This is a breaking change from Opus 4.6.
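If one code path serves several models, a small guard avoids hard failures when an extended-thinking config reaches Opus 4.7. A minimal sketch; the ADAPTIVE_ONLY set and thinking_kwargs helper are illustrative scaffolding, not SDK features:

import anthropic

client = anthropic.Anthropic()

# Models that reject manual budgets. Assumption: you maintain this set
# yourself — the API does not expose a capability flag for it.
ADAPTIVE_ONLY = {"claude-opus-4-7"}

def thinking_kwargs(model: str, budget_tokens: int = 10_000) -> dict:
    """Return thinking parameters the target model will accept."""
    if model in ADAPTIVE_ONLY:
        return {"thinking": {"type": "adaptive"}, "effort": "high"}
    return {"thinking": {"type": "enabled", "budget_tokens": budget_tokens}}

model = "claude-opus-4-7"
response = client.messages.create(
    model=model,
    max_tokens=16000,
    **thinking_kwargs(model),
    messages=[{"role": "user", "content": "Which database fits this workload?"}],
)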


How Thinking Works Under the Hood

sequenceDiagram
    participant You as Developer
    participant Claude as Claude API

    Note over You,Claude: Standard response (no thinking)
    You->>Claude: "What's the best database for this use case?"
    Claude-->>You: [answer] — instant, no deliberation

    Note over You,Claude: With extended/adaptive thinking
    You->>Claude: "What's the best database for this use case?" + thinking enabled
    Claude->>Claude: [thinking block — internal reasoning]
    Note right of Claude: Considers options, trade-offs,\nweighs constraints...
    Claude-->>You: [thinking block] + [answer] — deliberated response

When thinking is enabled, Claude generates a thinking block before its response. This is real deliberation — Claude working through the problem, considering alternatives, and catching errors in its own reasoning. The final answer is grounded in that reasoning process.

You are billed for thinking tokens. On models that return summarised thinking (Claude 4 and newer), you pay for the full internal token count but receive a summary — Anthropic returns a condensed version to prevent misuse of the raw reasoning trace.


Extended Thinking — Python

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # up to 10k tokens for thinking
    },
    messages=[
        {
            "role": "user",
            "content": """
            We are designing a data pipeline that needs to:
            - Ingest 50,000 events per second from IoT sensors
            - Store raw events for 90 days
            - Support real-time aggregation queries (last 5 minutes, 1 hour, 24 hours)
            - Support historical queries going back 90 days
            - Budget: ~$500/month infrastructure

            Recommend an architecture with specific technologies and explain your reasoning.
            """
        }
    ]
)

# Response contains thinking blocks and text blocks
for block in response.content:
    if block.type == "thinking":
        print("=== Claude's reasoning ===")
        print(block.thinking)
        print()
    elif block.type == "text":
        print("=== Final answer ===")
        print(block.text)

Controlling the Budget

The budget_tokens parameter sets an upper limit — Claude will use as many tokens as the problem requires, up to the budget. Setting it too low forces Claude to cut reasoning short on hard problems.

Task difficulty                  Suggested budget
Simple reasoning, few steps      2,000–5,000
Standard complex problem         8,000–12,000
Deep architectural analysis      15,000–20,000
Maximum depth                    Up to 100,000

# For a simple trade-off question
thinking={"type": "enabled", "budget_tokens": 3000}

# For a multi-step security audit
thinking={"type": "enabled", "budget_tokens": 15000}

Hiding Thinking from the Response

If you want the quality benefit without showing users the raw reasoning:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8000,
        "display": "omitted"  # thinking still happens, just not returned
    },
    messages=[...]
)

# Only text blocks in response.content — thinking is omitted but still happened

This is useful in production where you want improved answer quality without streaming or storing thinking content.


Adaptive Thinking — Python

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",       # or claude-sonnet-4-6
    max_tokens=16000,
    thinking={"type": "adaptive"}, # model decides when to think
    effort="high",                 # low | medium | high (default) | max
    messages=[
        {
            "role": "user",
            "content": """
            Review this Terraform module for security issues.
            Focus on IAM policies, S3 bucket configurations, and network ACLs.
            Identify specific CVE patterns if applicable.
            """
        }
    ]
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Reasoning — {len(block.thinking)} chars]")
    elif block.type == "text":
        print(block.text)

Effort Levels

Effort    Behaviour                                                            Cost
low       Minimal reasoning — fast, cheap. Thinks only when unavoidable.       Lowest
medium    Moderate reasoning. Thinks on moderately complex problems.           Moderate
high      Default. Reasons on any problem that benefits from it.               Higher
max       Maximum reasoning. Applies deep thinking to almost every response.   Highest

Start with high (the default). Drop to medium or low for tasks where speed matters more than depth. Use max only when you genuinely need the absolute best answer on a hard problem — the cost scales accordingly.
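In application code this guidance often reduces to a small routing table. An illustrative sketch; the task categories are placeholders for whatever taxonomy your system already has:

# Illustrative mapping from task category to effort level; tune for your workload.
EFFORT_BY_TASK = {
    "formatting": "low",
    "code_review": "medium",
    "architecture": "high",
    "security_audit": "max",
}

def effort_for(task_category: str) -> str:
    # Fall back to "high", which matches the API default.
    return EFFORT_BY_TASK.get(task_category, "high")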


Adaptive Thinking — TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function reviewArchitecture(description: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-opus-4-7",
    max_tokens: 16000,
    thinking: { type: "adaptive" },
    effort: "high",
    messages: [
      {
        role: "user",
        content: `Review this architecture for scalability and reliability issues:\n\n${description}`,
      },
    ],
  });

  // Collect only text blocks for the final response
  return response.content
    .filter((block) => block.type === "text")
    .map((block) => (block as { text: string }).text)
    .join("\n");
}

Multi-Turn Conversations with Thinking

When building multi-turn conversations, you must pass thinking blocks back in subsequent turns to preserve the model’s reasoning chain. Dropping them causes errors.

import anthropic

client = anthropic.Anthropic()

conversation = []

def chat_with_thinking(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8000,
        thinking={"type": "enabled", "budget_tokens": 5000},
        messages=conversation
    )

    # Add the full response (including thinking blocks) back to history
    # This is required — you cannot drop thinking blocks from history
    conversation.append({
        "role": "assistant",
        "content": response.content  # includes both thinking and text blocks
    })

    # Return only the text to the user
    text_blocks = [b.text for b in response.content if b.type == "text"]
    return "\n".join(text_blocks)

# Multi-turn example
answer1 = chat_with_thinking("What database should I use for a high-write IoT workload?")
print(answer1)

answer2 = chat_with_thinking("What if we also need ACID transactions on some of those writes?")
print(answer2)  # Claude has full context including previous reasoning

When to Enable Thinking

Thinking adds latency and cost. It is not always the right choice.

Enable thinking for:

  • Architecture decisions with multiple valid approaches and real trade-offs
  • Security analysis requiring understanding of multi-step attack chains
  • Complex debugging where the cause is not obvious from the symptoms
  • Mathematical or algorithmic problems with non-trivial solutions
  • Any task where you have tried without thinking and found the output lacking

Skip thinking for:

  • Simple factual questions (“what does this flag do?”)
  • Formatting, renaming, or structural refactoring tasks
  • Summarisation of content that is already clear
  • Tasks where speed is more important than depth
  • High-volume production endpoints where cost is a constraint

A useful heuristic: if a senior engineer would spend more than 5 minutes thinking before answering, enable thinking. If they would answer immediately, skip it.
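The last item in the "enable" list above can also be automated as an escalation pattern: answer fast without thinking, and retry with a budget only when the first answer looks thin. A sketch; looks_insufficient is a placeholder you would replace with a domain-specific check:

import anthropic

client = anthropic.Anthropic()

def looks_insufficient(text: str) -> bool:
    # Placeholder heuristic; swap in a real validator (length, required
    # sections, a schema check, etc.).
    return len(text) < 200

def answer(task: str) -> str:
    # First pass: no thinking, fast and cheap.
    fast = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4000,
        messages=[{"role": "user", "content": task}],
    )
    text = "".join(b.text for b in fast.content if b.type == "text")
    if not looks_insufficient(text):
        return text

    # Second pass: same task, this time with a thinking budget.
    deliberate = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8000,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{"role": "user", "content": task}],
    )
    return "".join(b.text for b in deliberate.content if b.type == "text")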


Interleaved Thinking in Agentic Workflows

Newer Claude 4 models support interleaved thinking — Claude reasons not just before the final response, but between tool calls during a multi-step agent run. This is particularly valuable for agentic coding tasks where the model needs to plan, check the result, re-plan, and continue.

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "run_bash",
        "description": "Execute a bash command and return output",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The bash command to run"}
            },
            "required": ["command"]
        }
    },
    {
        "name": "read_file",
        "description": "Read a file from the filesystem",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"}
            },
            "required": ["path"]
        }
    }
]

response = client.messages.create(
    model="claude-opus-4-7",  # interleaved thinking automatic on Opus 4.7
    max_tokens=32000,
    thinking={"type": "adaptive"},
    effort="high",
    tools=tools,
    messages=[
        {
            "role": "user",
            "content": "Debug why our test suite is taking 12 minutes instead of the expected 3. Find the bottleneck."
        }
    ]
)

# A single create() call stops at the first batch of tool requests
# (stop_reason == "tool_use"). Claude reasons before those calls and,
# with interleaved thinking, again after each batch of results you feed
# back; executing the tools and continuing the loop is your job
# (see the sketch below).
for block in response.content:
    if block.type == "thinking":
        print(f"[Claude reasoning: {len(block.thinking)} chars]")
    elif block.type == "tool_use":
        print(f"[Tool call: {block.name}({block.input})]")
    elif block.type == "text":
        print(block.text)

With interleaved thinking, Claude does not just think once at the start — it reasons between every step of the tool-calling loop. This is what makes it effective at open-ended investigation tasks.
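A single messages.create call only takes you as far as the first batch of tool requests; driving the investigation to completion means executing those tools and feeding the results back in a loop. A minimal sketch of that loop, with a bare-bones dispatcher for the run_bash and read_file tools defined above (harden both before using them anywhere real):

import subprocess

def run_tool(name: str, tool_input: dict) -> str:
    # Bare-bones dispatcher for the two tools declared above.
    if name == "run_bash":
        result = subprocess.run(tool_input["command"], shell=True,
                                capture_output=True, text=True)
        return result.stdout + result.stderr
    if name == "read_file":
        with open(tool_input["path"]) as f:
            return f.read()
    raise ValueError(f"unknown tool: {name}")

messages = [{"role": "user", "content": "Find the test-suite bottleneck."}]

while True:
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=32000,
        thinking={"type": "adaptive"},
        effort="high",
        tools=tools,
        messages=messages,
    )
    # Pass the full assistant turn back, thinking blocks included.
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # Claude produced its final answer

    # Execute every requested tool and return the results as a user turn.
    messages.append({
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ],
    })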


Migrating from Opus 4.6 to Opus 4.7

If you are using extended thinking with Opus 4.6, this migration is required when moving to Opus 4.7:

# Opus 4.6 — extended thinking (manual budget)
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[...]
)

# Opus 4.7 — adaptive thinking (model-controlled)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive"},  # NOT "enabled"
    effort="high",                  # replaces budget_tokens
    messages=[...]
)

The effort parameter is the practical replacement for budget_tokens. If you were using a large budget (15,000+) for complex tasks, use effort: "max". For standard budgets (5,000–10,000), use effort: "high" (the default).
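If budget values are scattered across call sites, a small translation shim keeps the migration mechanical. The thresholds below simply mirror the guidance above; they are not an official mapping:

def effort_for_budget(budget_tokens: int) -> str:
    # Rough translation of old budget_tokens values to effort levels.
    if budget_tokens >= 15_000:
        return "max"
    if budget_tokens >= 5_000:
        return "high"
    return "medium"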


Cost Estimation

Thinking tokens are billed as input tokens. Estimate costs before enabling at scale:

import anthropic

client = anthropic.Anthropic()

# Check what thinking actually costs on your specific task
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": your_task}]  # your_task: the real prompt you want to measure
)

print(f"Input tokens (non-thinking): {response.usage.input_tokens}")
print(f"Thinking tokens:             {response.usage.thinking_tokens if hasattr(response.usage, 'thinking_tokens') else 'n/a'}")
print(f"Output tokens:               {response.usage.output_tokens}")

As a rough guide, a 5,000-token thinking budget on a complex task with Sonnet 4.6 costs about $0.015 in thinking tokens alone (5,000 × $3/MTok). At that rate the cost is negligible for a task you run once and well worth it for decisions that matter, but it compounds quickly at volume. For high-volume production workloads, skip thinking unless the task genuinely requires it.
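To put numbers on your own traffic, the arithmetic is one line. The $3/MTok rate is the figure used in the estimate above; check current pricing for the model you actually run:

def thinking_cost_usd(thinking_tokens: int, usd_per_mtok: float = 3.0) -> float:
    # tokens / 1,000,000 * price per million tokens
    return thinking_tokens / 1_000_000 * usd_per_mtok

print(thinking_cost_usd(5_000))             # 0.015, matching the estimate above
print(thinking_cost_usd(5_000) * 100_000)   # ~1500: it adds up across 100k requests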


Quick Reference

Scenario                    Model                     Thinking config
Architecture decision       Opus 4.7 or Sonnet 4.6    Adaptive, effort: high
Security audit              Opus 4.7                  Adaptive, effort: max
Complex debugging           Sonnet 4.6                Extended, budget: 8,000
Simple code review          Sonnet 4.6                None
Feature implementation      Sonnet 4.6                None (or low)
Agentic multi-step task     Opus 4.7                  Adaptive, effort: high
High-volume production      Haiku 4.5                 Extended, budget: 2,000

The goal is not to enable thinking everywhere — it is to apply it where it makes a meaningful difference. On tasks that are hard, ambiguous, or high-stakes, it substantially improves output quality. On tasks that are straightforward, it is wasted cost and latency.

Abhay Pratap Singh

DevOps Engineer passionate about automation, cloud infrastructure, and self-hosted tools. I write about Kubernetes, Terraform, DNS, and everything in between.