Notes on Building Agentic Tools Using Local LLMs

Over the Christmas break, I decided to explore code execution for AI agents, inspired by Anthropic’s blog on the topic. The idea is appealing: reduce the amount of unnecessary context that gets fed into an agent’s working memory.

As MCP usage scales, two patterns commonly drive up agent cost and latency: tool definitions overload the context window, and intermediate tool results consume additional tokens.

If you’ve spent any time with Claude Code or similar tools, you’ll know the problem. You really want to avoid the orchestrator seeing unnecessary logs, digging through dense files with low information density, or accumulating cruft that poisons the context window.

Anthropic’s approach uses an orchestrator that composes agent tasks without ever seeing the results of those tasks directly. It only sees structured outputs if the tool deems it necessary. I wanted to understand how this actually works, particularly for small local models. Can you get a 7B model to operate coherently well beyond its context limits, and can the orchestrator agent be useful without needing to see all the tokens?

The short answer is yes, sort of. But the interesting part is what that requires: tool design that makes composition obvious through what I’m calling tool ergonomics (Python types) alone.

Starting Simple: Bash Scripts and Local Models

I started with the basics: a local Llama 7B model. My first attempt was a simple feedback loop:

Writer (draft) → Editor (feedback) → Writer (revision) → ...

The editor provides critique on structure, pacing, and prose quality. The writer interprets the feedback and revises. In theory, this preserves the writer’s voice across iterations.
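
The actual plumbing was a bash script calling a local model server, but the shape was roughly this (a Python sketch, with the model call passed in as a callable rather than my real setup):

# Sketch of the feedback loop; `llm` is whatever callable hits the local model.
def feedback_loop(llm, prompt, rounds=3):
    draft = llm(f"Write a draft for: {prompt}")                                   # Writer
    for _ in range(rounds):
        feedback = llm(f"Critique the structure, pacing, and prose:\n\n{draft}")  # Editor
        draft = llm(f"Revise the draft using the feedback.\n\n"
                    f"Draft:\n{draft}\n\nFeedback:\n{feedback}")                  # Writer
    return draft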

Neither this nor a direct-rewrite variant (where the editor rewrites the draft itself instead of handing feedback back to the writer) worked particularly well. The feedback loop used about 3x the tokens per round; direct rewrite was around 1.5x. Neither was much better than just loading one model and hitting the 8k context limit. I wanted to stay around 2k tokens per model run to keep things focused (the models seem happier there?).

The Iteration Loop (and Why It Failed)

Next, I built an iteration loop with a planner, writer, and critic. The problem became obvious quickly. The orchestrator was still seeing all the context:

Iteration 1: prompt → draft (2000 tokens)
Iteration 2: prompt + draft + critique → revised (4500 tokens)
Iteration 3: prompt + draft + critique + revised + critique2 → ... (growing)

Four or five iterations and it fell apart. No meaningful reduction in context bloat.
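
In sketch form, the failure looked like this: every intermediate artefact fed straight back into the orchestrator’s own prompt (illustrative code, not my actual script).

# Naive loop: the orchestrator's context accumulates every draft and critique.
def naive_iterate(llm, prompt, rounds=4):
    history = [prompt]                      # this IS the orchestrator's context
    draft = llm("\n\n".join(history) + "\n\nWrite a draft.")
    history.append(draft)
    for _ in range(rounds):
        critique = llm("\n\n".join(history) + "\n\nCritique the latest draft.")
        history.append(critique)
        draft = llm("\n\n".join(history) + "\n\nRevise the latest draft.")
        history.append(draft)               # grows every round
    return draft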

Trying smolagents (and Moving On)

I gave Hugging Face’s smolagents a go. The hope was that the LLM generates code, data flows through variables, and you get less context bloat.

The issue: smolagents uses ReAct (step-by-step reasoning), so the orchestrator maintains a memory of previous actions and observations at each step. My impression was that this meant tool outputs were still accumulating in context, making it worse than my bash scripts due to orchestrator overhead.

(Disclaimer: I didn’t rigorously measure this. I moved on fairly quickly because I wanted to build something from scratch that I understood fully. smolagents may well have optimisations or configuration options I missed. Take this with a grain of salt.)

Building My Own Orchestrator

This is where things got interesting. I (Claude) built an orchestrator that looks at a manifest of tools and writes code to invoke them. The key difference: the orchestrator generates the code once, then it runs in Python without the orchestrator seeing intermediate results.

def solve_task(user_prompt):
    draft = writer(prompt=user_prompt)
    for i in range(4):
        critique = critic(content=draft, requirements=user_prompt)
        if "DONE" in critique:
            break
        draft = writer(prompt=user_prompt, feedback=critique)
    return draft

result = solve_task(user_prompt)

The orchestrator only ever sees the first call and the final return. Everything in between is blind to it. This is what makes extended generation possible: the orchestrator’s context stays constant regardless of how many iterations run.
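
The execution side is tiny. A minimal sketch, assuming the generated code arrives as a string in generated_code and that TOOL_NAMESPACE maps the public tool names to their implementations (as in the appendix):

# Run the generated solve_task() against the tool namespace.
# The orchestrator is only ever shown what ends up in `result`.
namespace = dict(TOOL_NAMESPACE)            # the public tools from the manifest
namespace["user_prompt"] = user_prompt
exec(generated_code, namespace)             # defines solve_task() and runs the final call
result = namespace["result"]                # the only thing reported back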

This also opens up interesting possibilities around data privacy. Imagine processing bank statements where you don’t want the full statement in context. One tool returns just the insight: “the largest customer is X”. That gets passed to the next tool, which looks the customer up in the accounting system but returns only what the next step needs (their payment terms), not their details. Those terms go to the underwriting tool, and so on. The orchestrator never sees the raw data, just structured results flowing between steps.
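
A hypothetical sketch of that flow, with made-up tool names (largest_customer, payment_terms, and underwrite don’t exist in my project; the point is only that raw data stays inside the tools):

# Hypothetical tools: each returns only the insight the next step needs.
def underwriting_pipeline(statement_path: str) -> str:
    customer = largest_customer(statement_path)   # e.g. "Acme Ltd", never the full statement
    terms = payment_terms(customer)               # e.g. "net 60", never the ledger entries
    return underwrite(customer, terms)            # a structured decision, nothing raw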

This isolation pattern has side effects: it’s essentially the mechanism attackers used to turn Claude Code into a hacking tool. As Anthropic reported, attackers “broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose.” The orchestrator’s blindness enables privacy and extended context, but it also means the model can’t reason about the broader implications of what it’s being asked to do. Something to keep in mind when designing these systems.

The Complexity Valley

But then I hit what I started calling the “complexity valley”. As I added more tools (writer, critic, planner, outliner, evaluator…), the generated code became a mess of manual state tracking:

def solve(prompt, emphasis=""):
    chapters = planner(prompt)
    combined = ""
    outline = ""

    for i, ch in enumerate(chapters):
        prev = combined[-1000:] if combined else ""
        remaining = chapters[i+1:]
        remaining_str = "\n".join([f"- {c['title']}: {c['summary']}" for c in remaining])
        # ... and on and on

Small models couldn’t generate this reliably. They’d forget to update the outline, mishandle the slicing, or botch the string formatting.

The Template Trap

My first fix was to make the code generation prompt extremely explicit:

PROMPT: Explore - write {explore_chapters + 1} chapters organically.
Pattern:
  content = write(prompt, emphasis)
  contents = [content]
  for _ in range({explore_chapters}):
      next_prompt = explore(content, prompt)
      content = write(next_prompt, emphasis)
      contents.append(content)
  return combine(contents)

This worked, but is it cheating? If we have to show the exact code pattern, the model isn’t composing tools; it’s copying templates. Maybe?

The Discovery: Types Guide Composition

After multiple iterations, things started working best when the tools had type-aligned signatures. The composition becomes obvious without explicit templates.

write(prompt: str) -> str           # string in, string out
plan(prompt: str) -> list[str]      # string in, list out
explore(content: str, original_prompt: str) -> str
combine(contents: list[str]) -> str # list in, string out
evaluate(content: str, original_prompt: str) -> str

When plan() returns list[str], the model knows it needs to iterate. When combine() takes list[str], the model knows to collect results. The generated code became correct without explicit patterns:

def solve(prompt, emphasis=""):
    chapters = plan(prompt)
    contents = [write(ch) for ch in chapters]
    return combine(contents)

A 1.5B parameter model could figure this out based on the types.

Naming mattered too. I changed explore(content, prompt) to explore(content: str, original_prompt: str). The name “original_prompt” signals “grounding context”. Models stopped inventing new prompts and started passing the original variable.

Early versions used rich dataclasses, but small models struggled with attribute access. The fix was just using strings. This allowed a 7B model to use it correctly 90-95% of the time, and a 1.5B model almost as well.
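
Roughly the difference, as a sketch (the dataclass version is illustrative, not my exact original):

# Rich version: small models kept fumbling the attribute access.
from dataclasses import dataclass

@dataclass
class Chapter:
    title: str
    body: str

def write_rich(prompt: str) -> Chapter: ...

# Flat version: everything is a string, nothing to dereference.
def write(prompt: str) -> str: ...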

When It Goes Wrong

The most common issue: the model would overwrite content rather than appending.

# What the model should generate:
contents = []
for ch in chapters:
    content = write(ch)
    contents.append(content)

# What it actually generated:
for ch in chapters:
    contents = write(ch)  # Overwrites each time!

This happened less once the types were clear (list[str] signals “collect these”), but it never went away completely. Small models make small mistakes, and without a verification loop, garbage propagates.

[PLACEHOLDER: Include before/after example of generated text quality, the challenge is these are long passages]

Results

By the end, an 8B writing model paired with a 7B instructor model could work pretty seamlessly. Sometimes down to 1.5B for simple tasks.

I’d say it went from a 2/10 initially to maybe a 3-5/10 by the end. It would write a chapter, review it, write the next with tolerable handover, and string together coherent sequences. Not great literature, but the mechanics worked: small models running well beyond their context limits because the orchestrator carried zero burden.

I experimented with two modes: “explore” (write freeform, let each chapter lead to the next) and “plan” (outline upfront, fill in the gaps). Explore worked better. The modal split feels like a hack, but maybe that’s fine. Different mental models want different tool behaviour.
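
For reference, the two modes boil down to two shapes of generated code (a sketch; the tool calls are real, the function names here are just for illustration):

# "plan" mode: outline upfront, fill in the gaps.
def solve_plan(prompt, emphasis=""):
    chapters = plan(prompt)
    return combine([write(ch, emphasis) for ch in chapters])

# "explore" mode: write freeform, let each chapter suggest the next.
def solve_explore(prompt, emphasis="", n_chapters=4):
    content = write(prompt, emphasis)
    contents = [content]
    for _ in range(n_chapters - 1):
        content = write(explore(content, prompt), emphasis)
        contents.append(content)
    return combine(contents)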

What I Learned

The thing that surprised me most was how much time went into tool design. Figuring out the most ergonomic way for an agent to use the tools, making them implicit, obvious, composable, and intuitive. That was the work.

The best tools I created:

  • Use strings, not custom objects
  • Signal iteration with types (list[str] means “iterate over this”)
  • Keep content out of orchestrator context
  • Name parameters semantically (original_prompt not prompt)
  • Validate ruthlessly, because small models make small mistakes

The insight: tool design for LLMs is about ergonomics. If the types make composition self-evident, if the tool is easy to use right and hard to use wrong, small models can orchestrate complex workflows.

Next Steps

Better continuity management. I’m figuring out how to manage continuity more explicitly, which led to tools that work differently in explore vs plan mode. This feels like a hack, but maybe matching tool behaviour to user mental model is actually correct.

Better verification. Right now, if the generated code is wrong, the output is garbage. I need lightweight checks before execution.
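
Something like this is what I mean by lightweight: a sketch using Python’s ast module that only checks the code parses and calls nothing outside the tool namespace (the forbidden-pattern and parameter-name checks mentioned in the appendix would sit alongside it).

import ast

# Tools from the manifest, plus safe builtins and the entry point the model defines.
ALLOWED_CALLS = {"write", "plan", "explore", "combine", "evaluate",
                 "range", "len", "enumerate", "solve", "solve_task"}

def check_generated_code(code: str) -> list[str]:
    """Cheap pre-execution checks: does it parse, and does it only call known names?"""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: {e}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in ALLOWED_CALLS:
                problems.append(f"unknown call: {node.func.id}()")
    return problems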

Testing the Bitter Lesson. As models improve, does the scaffolding help or hurt? Worth checking whether this structure earns its keep on larger models or just adds overhead.


One caveat to close on: this whole approach might age badly. As Peak noted, agent harnesses can limit performance as models advance. The structure helps today’s small models, but the same structure may become a ceiling as models and compute improve.

For small local models, making the orchestrator blind to intermediate results while letting types guide composition seems to work.


Appendix

The Final Architecture

┌─────────────────────────────────────────────────────────────┐
│  manifest.json (Public API)                                 │
│  - 5 tools with type signatures                             │
│  - Descriptions guide model selection                       │
└─────────────────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│  orchestrator.py (Code Generation)                          │
│  - Reads manifest                                           │
│  - Generates Python solve() function                        │
│  - Validates: forbidden patterns, param names, mock exec    │
└─────────────────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│  tools.py (Implementation)                                  │
│  - High-level: write, plan, explore, combine, evaluate      │
│  - Low-level: writer, critic, planner (internal only)       │
│  - TOOL_NAMESPACE exports only public tools                 │
└─────────────────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│  Generated Code Execution                                   │
│  - Content flows between tools                              │
│  - Orchestrator never sees intermediate results             │
│  - Constant context regardless of iterations                │
└─────────────────────────────────────────────────────────────┘

The manifest:

{
  "tools": [
    {
      "name": "write",
      "signature": "write(prompt: str, emphasis: str = '') -> str",
      "description": "Write content with automatic revision."
    },
    {
      "name": "plan",
      "signature": "plan(prompt: str) -> list[str]",
      "description": "Break a prompt into chapter summaries."
    },
    {
      "name": "explore",
      "signature": "explore(content: str, original_prompt: str) -> str",
      "description": "Generate next chapter prompt from current content."
    },
    {
      "name": "combine",
      "signature": "combine(contents: list[str]) -> str",
      "description": "Join chapters into final output."
    },
    {
      "name": "evaluate",
      "signature": "evaluate(content: str, original_prompt: str) -> str",
      "description": "Evaluate content against requirements."
    }
  ]
}
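
Turning the manifest into the tool section of the code-generation prompt is the boring part; something along these lines (a sketch, not the exact orchestrator code):

import json

def render_tool_section(manifest_path: str) -> str:
    """Render manifest.json into the tool list shown to the code-generating model."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    lines = ["You may call ONLY these tools:"]
    for tool in manifest["tools"]:
        lines.append(f"  {tool['signature']}")
        lines.append(f"      {tool['description']}")
    return "\n".join(lines)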