Lessons from Building Claude Code: How Anthropic Designs Agent Tools
How Anthropic's Claude Code team designs agent tools by 'seeing like an agent' — lessons on elicitation, task management, and self-directed search.
The hardest part of building an agent harness isn't picking the right model — it's constructing the action space. Anthropic's Claude Code team just shared their internal playbook for designing agent tools, and the core insight is deceptively simple: you have to learn to see like an agent. That means understanding the model's actual capabilities, watching how it uses tools in practice, and iterating based on what you observe — not what you assume. Three specific lessons from Claude Code's evolution reveal patterns every developer building agentic systems should internalize.
What Happened
The Anthropic team published a detailed breakdown of how they design tools for Claude Code, their AI coding assistant that acts through tool calling. The post covers three case studies from Claude Code's development: building the AskUserQuestion elicitation tool, evolving from TodoWrite to the Task system, and replacing RAG with self-directed search via Grep.
The framing principle is an analogy: if you were given a hard math problem, the best tool depends on your skill level. Paper for manual work, a calculator for assisted computation, a computer for full programmatic solutions. Agent tool design follows the same logic — you shape tools to match the model's actual abilities, not some theoretical ideal.
Claude acts through the Claude API's tool calling primitives, which include bash, skills, and code execution. The question isn't "what tools exist?" but "what tools does this specific model use well?" The answer, Anthropic found, requires close observation of model behavior across iterations.
Why It Matters
This is one of the first detailed accounts of agent tool design principles from a team building a production-scale coding agent. Most agent frameworks focus on orchestration — chains, graphs, routing. Anthropic is arguing that the tool interface itself is the critical design surface.
The implications cut across the entire agentic AI ecosystem. If you're building with the Claude API or any LLM-based agent, tool design choices compound. A poorly designed tool doesn't just reduce performance on one task — it degrades the model's ability to plan and execute across entire workflows.
The "see like an agent" framework also challenges the common approach of giving agents maximally flexible tools (like a single bash shell) and hoping the model figures it out. Anthropic's experience shows that structured, purpose-built tools with clear semantics outperform generic ones — but only when they match the model's current capability level. What works for one model generation may actively constrain the next.
Technical Deep-Dive
AskUserQuestion: Three Iterations to Get Elicitation Right
The team's first attempt embedded questions into the ExitPlanTool as an array parameter alongside the plan output. This confused the model — it couldn't cleanly separate "here's my plan" from "here's what I need to know." User answers could contradict the plan, creating an awkward re-planning loop.
Attempt two used a modified markdown output format: bullet-point questions with bracketed alternatives that could be parsed into a UI. Claude could approximate this format, but not reliably. It would append extra sentences, drop options, or drift into different formatting.
The winning approach: a dedicated AskUserQuestion tool that Claude calls at any point, showing a modal that blocks the agent loop until the user responds. Structured output via tool parameters proved far more reliable than freeform text parsing. The tool naturally composed with other features — the Agent SDK could invoke it, and skills could reference it.
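The structured approach can be sketched as a JSON Schema tool definition plus a handler. The schema and handler below are illustrative, not Claude Code's actual AskUserQuestion definition:

```python
# A minimal sketch of a structured elicitation tool. The schema shape follows
# the Claude API's tool-use format; the specific fields are assumptions.

ASK_USER_QUESTION = {
    "name": "AskUserQuestion",
    "description": "Ask the user a clarifying question with fixed options.",
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "options": {
                "type": "array",
                "items": {"type": "string"},
                "minItems": 2,
            },
        },
        "required": ["question", "options"],
    },
}

def handle_tool_call(tool_input: dict) -> str:
    """Block the agent loop until the user responds (simulated here)."""
    question = tool_input["question"]
    options = tool_input["options"]
    # In Claude Code this renders a modal; here we just pick the first option.
    chosen = options[0]
    return f"User answered {question!r} with {chosen!r}"
```

Because the question and options arrive as typed tool parameters rather than freeform text, the harness never has to parse markdown to build the UI.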
Key takeaway: even the best-designed tool fails if the model doesn't understand how to call it. Tool design is UX design for language models.
TodoWrite to Task Tools: Adapting to Model Growth
Early Claude Code needed a todo list to stay on track. The TodoWrite tool let Claude write items and check them off, with system reminders injected every five turns to keep it focused.
As models improved, this scaffolding became a constraint. Reminders made Claude treat the todo list as immutable instead of adaptive. And when Opus 4.5 got dramatically better at using subagents, the single-threaded todo list couldn't support coordination across parallel workers.
The replacement Task Tool reframes the concept entirely. Tasks support dependencies, cross-agent updates, and dynamic modification. The shift from "keeping the model on track" to "helping agents communicate" reflects a broader principle: as model capabilities increase, tools that once helped can become bottlenecks.
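A dependency-aware task store can be sketched as follows; the field names and API are hypothetical, since the actual Task tool schema is not public:

```python
# A minimal sketch of a task board supporting dependencies and cross-agent
# updates. Names are hypothetical, not Claude Code's actual Task tool.

from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    status: str = "pending"            # pending | done
    depends_on: list[str] = field(default_factory=list)

class TaskBoard:
    def __init__(self):
        self.tasks: dict[str, Task] = {}

    def add(self, task: Task) -> None:
        self.tasks[task.id] = task

    def complete(self, task_id: str) -> None:
        # Any agent, including a subagent, can mark a task done.
        self.tasks[task_id].status = "done"

    def ready(self) -> list[str]:
        """Tasks whose dependencies are all done: candidates for parallel workers."""
        return [
            t.id for t in self.tasks.values()
            if t.status == "pending"
            and all(self.tasks[d].status == "done" for d in t.depends_on)
        ]
```

Unlike a single-threaded todo list, `ready()` can hand independent tasks to parallel subagents while blocked tasks wait on their dependencies.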
This is why Anthropic recommends supporting a small set of models with similar capability profiles — tool design assumptions break when capability levels diverge too much.
From RAG to Self-Directed Search
Claude Code originally used a vector database for retrieval-augmented generation (RAG) to provide codebase context. It worked, but it required indexing, was fragile across environments, and — critically — the model received context passively rather than seeking it actively.
Replacing RAG with a Grep tool let Claude search the codebase directly. The model builds its own context by deciding what to look for, evaluating results, and iterating. This pattern scales better because the model's search strategy improves as its reasoning improves.
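The pattern can be sketched as a plain search primitive the model calls iteratively. This toy version searches an in-memory codebase; the real tool searches the actual working directory:

```python
# A minimal sketch of a grep-style search tool for an agent. The agent
# decides what to search for, inspects the hits, and refines its query.

import re

def grep(files: dict[str, str], pattern: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line) for every line matching pattern."""
    regex = re.compile(pattern)
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path, lineno, line))
    return hits

# Hypothetical two-file codebase; the agent traces a call to its definition.
codebase = {
    "auth.py": "def login(user):\n    return check_token(user)\n",
    "tokens.py": "def check_token(user):\n    ...\n",
}
first_pass = grep(codebase, r"check_token")  # call site plus definition site
```

The key design choice is returning raw matches rather than pre-ranked snippets: the model, not the tool, decides which results matter and what to search for next.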
The broader lesson: as models get smarter, tools that hand-feed context become less valuable than tools that let the model find context itself. The same principle applies beyond code search — document retrieval, web browsing, and database queries all benefit from giving the agent search primitives rather than pre-computed results.
What You Should Do
- Observe your model's actual tool usage before optimizing. Log tool calls, study failure modes, and identify where the model struggles with your current interface.
- Prefer structured tool parameters over freeform output parsing. If you need the model to ask questions, return options, or produce structured data — make it a tool, not a format instruction.
- Revisit tool design with each model upgrade. Tools built for earlier capabilities may now be constraints. Watch for signs: the model working around your tool, ignoring it, or producing worse output when forced to use it.
- Give search tools, not pre-built context. Let the model build its own understanding through iterative search rather than dumping documents into the context window.
- Keep your supported model set small. Tool design assumptions are capability-dependent — supporting too many models with different strengths forces lowest-common-denominator tool design.
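The first recommendation, logging tool calls and studying failure modes, can be sketched as a thin wrapper around each tool handler. The dict-based call shape here is an assumption; adapt it to whatever your harness emits:

```python
# A minimal sketch of tool-usage observation: wrap each handler to record
# inputs, outcomes, and latency as structured log lines.

import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool-usage")

def observed(tool_fn):
    """Wrap a tool handler to log every call for later analysis."""
    def wrapper(tool_input: dict):
        start = time.monotonic()
        try:
            result = tool_fn(tool_input)
            log.info(json.dumps({
                "tool": tool_fn.__name__,
                "input": tool_input,
                "ok": True,
                "latency_s": round(time.monotonic() - start, 3),
            }))
            return result
        except Exception as exc:
            # Failed calls are the interesting ones: they show where the
            # model misunderstands the tool's interface.
            log.info(json.dumps({
                "tool": tool_fn.__name__,
                "input": tool_input,
                "ok": False,
                "error": repr(exc),
            }))
            raise
    return wrapper

@observed
def read_file(tool_input: dict) -> str:  # hypothetical example tool
    with open(tool_input["path"]) as f:
        return f.read()
```

Aggregating these logs over real sessions shows which tools the model avoids, misuses, or works around, which is exactly the signal the iteration loop described above depends on.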