
Chapter 11: Appendix — Architecture of Agentic Coding Systems

Written: 2026-04-08 Last updated: 2026-04-08

Summary

Chapter 10 analyzed the seven-dimension gap between Agentic Coding and Agentic Robotics. This appendix dissects the internal architecture of one side of that comparison — Agentic Coding systems — in detail. By analyzing the Claude Code source code leaked in late 2025 and OpenAI Codex's publicly documented architecture, we identify why these structures succeed and chart how the same structural principles can be transplanted into Agentic Robotics.

11.1 Introduction: Why Examine the Internals of Coding Agents

This book's central claim is that the agentic loop is universal (see Chapter 1): observe, plan, execute, verify, reflect, remember, retry. Claude Code and OpenAI Codex have implemented this loop at production level for software engineering. It is therefore essential to distinguish whether their success stems purely from model capability or from the harness engineering that wraps the model.

In December 2025, Anthropic accidentally included a 59.8 MB source map file in version 2.1.88 of the @anthropic-ai/claude-code npm package, exposing approximately 512,000 lines of TypeScript code [The Register, 2026]. By 4:23 AM ET, the discovery was broadcast on X, and within hours the codebase was mirrored across GitHub and analyzed by thousands of developers. While unintentional, this leak provided the research community with the most detailed reference implementation of an agentic system to date. Analysis suggested that roughly 60% of the system experience derives from the model's raw capability, while the remaining 40% comes from a meticulously engineered harness [MindStudio, 2026].

This 40% of harness engineering is precisely the blueprint transplantable to Agentic Robotics.

11.2 Claude Code's Architecture

11.2.1 Three-Layer Memory System

Claude Code's most sophisticated design is its memory architecture. Three layers handle different time horizons [MindStudio, 2026; Rajiv Pant, 2025]:

Layer 1 — Persistent Memory (CLAUDE.md). The CLAUDE.md file at the project root is automatically read at every session start. It holds the project's "long-term memory": coding conventions, architecture decisions, frequently occurring mistake patterns. This file is re-injected every turn even when context compression occurs, ensuring critical information survives long sessions. The agent itself writes to this file, permanently storing learned patterns.

Layer 2 — Session Context. The full context window of the current conversation (200K+ tokens). File contents being worked on, execution results, and error messages accumulate in real time. A memory.md file serves as a pointer index, linking out to a larger network of structured memory files.

Layer 3 — Tool-Based Retrieval. The Grep, Glob, and Read tools search the entire codebase on demand, pulling information not loaded into the context window when needed.

The core principle of this three-layer structure is appropriate recall — not remembering everything, but surfacing the right memories at the right time [MindStudio, 2026].
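
The interplay of the three layers can be sketched in a few lines. This is an illustrative Python sketch of the pattern only, not the leaked implementation; all class and method names are invented.

```python
class ThreeLayerMemory:
    """Sketch: persistent notes + session context + on-demand retrieval."""

    def __init__(self, persistent_notes: str, codebase: dict):
        self.persistent = persistent_notes  # Layer 1: CLAUDE.md-style notes
        self.session = []                   # Layer 2: conversation turns
        self.codebase = codebase            # Layer 3: searched on demand

    def remember(self, event: str) -> None:
        """Append an observation to the session context."""
        self.session.append(event)

    def retrieve(self, pattern: str) -> list:
        """Grep-like on-demand search over files not held in context."""
        return [path for path, text in self.codebase.items() if pattern in text]

    def build_prompt(self, task: str) -> str:
        """Persistent notes are re-injected every turn, then session, then task."""
        return "\n".join([self.persistent, *self.session[-50:], task])


mem = ThreeLayerMemory(
    persistent_notes="Convention: use snake_case; tests live in tests/.",
    codebase={"src/auth.py": "def login(): ...", "src/db.py": "def query(): ..."},
)
mem.remember("Edited src/auth.py")
hits = mem.retrieve("login")  # Layer 3 pulls in what the context lacks
```

The key design choice mirrored here is that Layer 1 is prepended on every prompt build, so it survives any truncation of the session layer.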

11.2.2 Tool Orchestration

The leaked source code reveals Claude Code's core tool set: Read (file reading), Edit (exact string replacement), Write (file creation), Bash (shell command execution), Grep (ripgrep-based content search), Glob (file pattern matching), and Agent (subagent spawning) [Penligent, 2025].

The critical detail is the routing logic for tool selection. The system prompt explicitly instructs: "Do NOT use Bash to run grep; use the built-in Grep tool instead." Each tool is optimized for a specific task, and the harness corrects the model's tendency to solve everything with the general-purpose tool (Bash). This is not merely an efficiency concern but a safety issue — since Bash can execute arbitrary commands, routing to dedicated tools reduces the probability of dangerous command execution.
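
The routing correction described above can be sketched as a small dispatch check that intercepts general-purpose Bash calls. The rule table and function below are assumptions for illustration, not the actual harness logic.

```python
import re

# Illustrative mapping from shell idioms to the dedicated tool the
# harness prefers. Patterns and tool names are assumptions.
PREFERRED = {
    r"\bgrep\b|\brg\b": "Grep",
    r"\bfind\b.*-name": "Glob",
    r"\bcat\b|\bhead\b|\btail\b": "Read",
}

def route(tool: str, command: str = "") -> str:
    """Return the tool the harness would actually dispatch."""
    if tool == "Bash":
        for pattern, dedicated in PREFERRED.items():
            if re.search(pattern, command):
                return dedicated  # correct the model's tool choice
    return tool
```

A genuinely general command (say, `ls -la`) still falls through to Bash; only recognizable search/read idioms are redirected to safer, structured-output tools.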

11.2.3 Subagents and Parallel Execution

Claude Code's Agent tool spawns specialized subagents [code.claude.com, 2026]. Each subagent runs in its own context window with a custom system prompt, specific tool access, and independent permissions. Git worktrees provide filesystem isolation, enabling parallel edits without collision.

A concrete example: sixteen agents across 2,000 sessions built a 100,000-line Rust-based C compiler [Morphllm, 2026]. Claude Code Review (launched March 2026) dispatches parallel agents to review pull requests, flagging problems in 84% of changes exceeding 1,000 lines.

This structure is the orchestrator + specialist team pattern. The main agent decomposes tasks, and each subagent handles a specialized domain.
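
A minimal sketch of the orchestrator-plus-specialist pattern, under invented names: each subagent carries its own system prompt, tool whitelist, and a context list standing in for its isolated context window.

```python
from dataclasses import dataclass, field

@dataclass
class Subagent:
    """Sketch of a specialist with isolated context and restricted tools."""
    role: str
    system_prompt: str
    allowed_tools: set
    context: list = field(default_factory=list)  # never shared across agents

    def run(self, task: str) -> str:
        self.context.append(task)  # only this agent sees this task
        return f"[{self.role}] done: {task}"

class Orchestrator:
    """Decomposes work and delegates each piece to a fresh specialist."""

    def delegate(self, tasks: dict) -> list:
        results = []
        for role, task in tasks.items():
            agent = Subagent(role, f"You are a {role} specialist.",
                             {"Read", "Grep"})
            results.append(agent.run(task))
        return results

results = Orchestrator().delegate(
    {"review": "check the PR diff", "tests": "run unit suite"}
)
```

In the real system the isolation extends to the filesystem (git worktrees); here the per-agent `context` list illustrates the same no-shared-state principle.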

11.2.4 The Feedback Loop: From Error to Fix

Claude Code's core execution pattern follows this sequence:

  1. Modify code (Edit/Write)
  2. Run tests (Bash: npm test)
  3. Receive error output (structured text: filename, line number, error type)
  4. Analyze error (LLM interprets stack trace)
  5. Search related code (Grep/Read)
  6. Apply fix (Edit)
  7. Re-test (Bash)

This loop repeats until tests pass [Anthropic, 2025]. When CI failures occur on a PR, cloud-running agents automatically detect failures, apply fixes, and push [Paddo.dev, 2026].

The loop's power derives from the structuredness of errors. "File X, Line Y, TypeError: cannot read property Z of undefined" — this single line provides the exact location, exact cause, and exact fix direction. This is the essence of the Error Feedback dimension (5/5) analyzed in Chapter 10.
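
The seven-step loop above reduces to a retry-until-green driver. In this sketch the test runner and fix proposer are toy stand-ins (assumptions); in the real system Bash runs the suite and the model proposes edits.

```python
def fix_until_green(code, run_tests, propose_fix, max_attempts=5):
    """Repeat the edit/test cycle until the suite passes or attempts run out."""
    for _ in range(max_attempts):
        error = run_tests(code)           # steps 2-3: run tests, capture error
        if error is None:
            return code                   # tests pass; loop terminates
        code = propose_fix(code, error)   # steps 4-6: analyze and apply a fix
    raise RuntimeError("tests still failing after retries")

# Toy stand-ins: the "bug" is a wrong constant and the "fix" patches it.
def run_tests(code):
    return None if "return 4" in code else "AssertionError: expected 4"

def propose_fix(code, error):
    return code.replace("return 5", "return 4")

result = fix_until_green("def f(): return 5", run_tests, propose_fix)
```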

11.2.5 Permission Model and Safety Guardrails

Claude Code uses a granular permission system with deny-first evaluation [MindStudio, 2026]. Permissions are configurable per tool, per pattern, and per directory. Dangerous operations (file deletion, git push --force, etc.) require user approval by default.

Built-in classifiers automatically distinguish safe from risky actions, defaulting to cautious behavior [Anthropic, 2025]. This is structurally identical to AutoRT's Robot Constitution (see Chapter 8) — predefined rules constraining the agent's action space.
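
Deny-first evaluation can be sketched as two ordered rule lists with a cautious default: deny rules are checked before allow rules, and anything unmatched escalates to the user. The rule syntax here is invented for illustration.

```python
from fnmatch import fnmatch

# Illustrative rule tables; patterns and tool names are assumptions.
DENY  = [("Bash", "git push --force*"), ("Bash", "rm -rf *")]
ALLOW = [("Read", "*"), ("Grep", "*"), ("Bash", "npm test*")]

def evaluate(tool: str, arg: str) -> str:
    """Return 'deny', 'allow', or 'ask' (escalate to human approval)."""
    for t, pat in DENY:                    # deny rules always win
        if t == tool and fnmatch(arg, pat):
            return "deny"
    for t, pat in ALLOW:
        if t == tool and fnmatch(arg, pat):
            return "allow"
    return "ask"                           # default: conservative
```

The ordering is the point: a command matching both a deny and an allow pattern is denied, and the fallthrough is human review rather than autonomy.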

11.3 OpenAI Codex's Architecture

11.3.1 Container-Based Sandbox

Codex takes a fundamentally different approach. Each task runs in an isolated container in the cloud, with internet access disabled during execution [OpenAI, 2025]. The agent can only use code explicitly provided via GitHub repositories and pre-installed dependencies configured through a setup script.

This design philosophy is safety through isolation. Where Claude Code runs in the user's local environment and ensures safety through a permission model, Codex runs in an entirely isolated environment, eliminating risk at the source. It uses Landlock and seccomp, and is the only major agent with sandboxing enabled by default [Wikipedia, 2026].

11.3.2 AGENTS.md — Codex's Counterpart to CLAUDE.md

Codex uses AGENTS.md files for project-specific guidance [OpenAI, 2025]. These describe how to navigate the codebase, which commands to run for testing, and how to adhere to project practices. Functionally identical to CLAUDE.md but differently named. That both systems independently converged on the same concept — persistent project memory — is noteworthy.

11.3.3 From codex-1 to GPT-5.3-Codex

Codex's initial model, codex-1, was an o3 variant optimized for software engineering [OpenAI, 2025]. It was trained via reinforcement learning on real coding tasks and iteratively runs tests until achieving passing results. In February 2026, it was upgraded to GPT-5.3-Codex, followed shortly by GPT-5.3-Codex-Spark, a lower-latency variant for real-time interactive coding [OpenAI, 2026].

11.3.4 Unified Server Architecture

By February 2026, Codex unified its CLI, VS Code extension, web app, macOS desktop app, and JetBrains/Xcode integrations under a single "App Server" architecture [OpenAI, 2026]. Long-running sessions and approval requests remain consistent across client interfaces, solving the session continuity problem across multiple interfaces.

11.4 Common Success Patterns

Synthesizing the patterns that recur across Claude Code, Codex, and the broader 2026 agentic coding ecosystem [Springer, 2025; Anthropic, 2026]:

Pattern 1: Three-Layer Memory — Persistent / Session / Retrieval

Layer                  | Claude Code    | Codex              | Function
-----------------------|----------------|--------------------|-------------------------------
Persistent (project)   | CLAUDE.md      | AGENTS.md          | Project knowledge, conventions
Session (conversation) | Context window | Container state    | Current task context
Retrieval (on-demand)  | Grep/Read/Glob | File system access | Codebase exploration

That both systems independently converged on the same three-layer structure suggests this is an essential architecture for agentic systems.

Pattern 2: Structured Feedback Loop

Error → structured text (stack trace) → LLM analysis → targeted fix → re-execution. The loop's efficiency is directly proportional to the structuredness and precision of feedback. Compiler errors point to exact files and lines, enabling LLMs to immediately identify fix targets.
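
The value of structured errors is that they parse mechanically. A sketch, assuming the canonical error shape quoted in 11.2.4 (the regex and field names are illustrative):

```python
import re

# Parses "File <f>, Line <n>, <Kind>: <detail>" into a structured record.
ERROR_RE = re.compile(
    r"File (?P<file>\S+), Line (?P<line>\d+), (?P<kind>\w+): (?P<detail>.+)"
)

def parse_error(line: str):
    """Return a dict with exact location and cause, or None if unstructured."""
    m = ERROR_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["line"] = int(rec["line"])  # numeric location, ready for a targeted fix
    return rec

report = parse_error(
    "File X, Line 42, TypeError: cannot read property Z of undefined"
)
```

An unstructured failure description ("something went wrong") yields `None`, which is exactly the gap Section 11.5 identifies for physical-world feedback.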

Pattern 3: Tool Orchestration

Routing to specialized tools (Read, Edit, Grep) instead of general-purpose ones (Bash/terminal). Specialized tools are (1) safer, (2) produce structured output, and (3) provide consistent user experience. The model selects tools; the harness validates tool use.

Pattern 4: Orchestrator + Specialist Team

Instead of a single general-purpose agent handling everything, an orchestrator decomposes tasks and delegates to specialized subagents [Springer, 2025]. Multi-agent system inquiries surged 1,445% between Q1 2024 and Q2 2025. The "puppeteer" orchestrator coordinating specialist agents is becoming the standard pattern.

Pattern 5: Test-Time Computation

The codex-1 model iteratively runs until tests pass. Claude Code attempts automatic fixes on CI failures. Repeated attempts at inference time dramatically improve quality over single-pass generation. This is precisely the principle CaP-X (see Chapter 3) aims to apply to robots: improvement through agentic scaffolding and trial-and-error.

Pattern 6: Human-in-the-Loop

Both systems provide a spectrum between full autonomy and full manual control. Claude Code's permission model and Codex's approval requests allow users to tune autonomy levels, defaulting to conservative. Trust is built gradually, with autonomy expanding incrementally.

11.5 Transplanting to Agentic Robotics

Mapping how these patterns correspond in the physical world:

Three-Layer Memory to Robot Memory Systems

Agentic Coding           | Agentic Robotics                      | Corresponding System
-------------------------|---------------------------------------|--------------------------
CLAUDE.md (persistent)   | Environment map + object property DB  | KARMA's LTM (see Ch. 7)
Context window (session) | Current task scene graph              | KARMA's STM (see Ch. 7)
Grep/Read (retrieval)    | Spatial-semantic search               | Embodied-RAG (see Ch. 7)

KARMA's 62.7x efficiency improvement validates this three-layer structure's power in the physical world. However, the fundamental difference is that CLAUDE.md is a text file with millisecond read/write, while real-time 3D scene graph updates must contend with sensor noise, occlusion, and dynamic change.

Structured Feedback to VLM-Based Failure Diagnosis

The robotics counterpart to coding's stack traces is REFLECT's VLM-based failure summarization (see Chapter 8). However, at 69-79% accuracy, it falls far short of "File X, Line Y, TypeError" at effectively 100% accuracy. VeriGraph (see Chapter 7) attempts to structure feedback through scene graphs, but fine-grained manipulation failures remain elusive.

Key research direction: Converting physical errors to "code-error-level" structuredness. Semantic translation of tactile sensor data, natural language descriptions of force-torque profiles, and multimodal failure RAG are promising approaches.
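
One possible shape for such a structured physical failure record, with invented field names and thresholds, mirroring the file/line/type form of a code error:

```python
from dataclasses import dataclass

@dataclass
class GraspFailure:
    """Sketch of a 'stack-trace-level' record for a physical failure."""
    step: int
    measured_force_n: float
    required_force_n: float
    surface: str

    def summary(self) -> str:
        """Render the record in the file/line/type style of a code error."""
        return (f"Step {self.step}, GraspFailure: insufficient grip force "
                f"(measured {self.measured_force_n}N, required "
                f"{self.required_force_n}N) on {self.surface} surface")

    def retry_hint(self) -> dict:
        """Machine-readable fix direction, analogous to a compiler hint."""
        return {"force_n": self.required_force_n * 1.2, "approach": "slow"}

fail = GraspFailure(step=3, measured_force_n=2.1,
                    required_force_n=4.5, surface="wet")
```

The point of the sketch is the contract, not the numbers: a failure that arrives with exact step, measured-versus-required quantities, and a retry hint can drive the same fix loop that compiler errors drive in coding.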

Tool Orchestration to Skill Library + VLM Routing

Claude Code's tool routing is structurally identical to BUMBLE's skill library + VLM routing (see Chapter 8). The VLM observes the situation and selects the appropriate skill (navigation, grasping, drawer opening, etc.). SayCan's affordance function (see Chapter 2) follows the same principle.

The difference: coding tools are deterministic with predictable outcomes; robot skills are stochastic and can fail. Robot tool orchestration therefore requires additional failure detection and fallback selection mechanisms.
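
The extra machinery can be sketched as preference-ordered execution with fallback; the skill names and success probabilities below are invented stand-ins for real, stochastic skill outcomes.

```python
import random

# Illustrative skill library: each skill succeeds only probabilistically,
# unlike a deterministic coding tool. Probabilities are assumptions.
SKILLS = {"grasp_top": 0.7, "grasp_side": 0.5, "push_to_edge": 0.9}

def execute(skill: str, rng: random.Random) -> bool:
    """Stand-in for a real skill call; returns whether it succeeded."""
    return rng.random() < SKILLS[skill]

def run_with_fallback(preferences: list, rng: random.Random):
    """Try skills in preference order until one succeeds (failure detection
    + fallback selection, which deterministic tool routing never needs)."""
    for skill in preferences:
        if execute(skill, rng):
            return skill
    return None  # all fallbacks exhausted; escalate to replanning

chosen = run_with_fallback(["grasp_top", "grasp_side", "push_to_edge"],
                           random.Random(0))
```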

Orchestrator + Specialists to Fleet Orchestration

Claude Code's subagent pattern maps to AutoRT's fleet orchestration (see Chapter 8). A central orchestrator coordinates 20+ robots simultaneously, with each robot performing specialized tasks. The Robot Constitution provides safety guardrails.

The difference: coding subagents are isolated via git worktrees with no collision, while physical robots share the same space, requiring additional management of collision avoidance, resource contention, and physical interference.

The Core Lesson of Transplantation

The commonality is structure; the difference is medium. Three-layer memory, feedback loops, tool orchestration, specialist teams — these structural principles hold regardless of medium. But the three properties of the digital medium (determinism, immediacy, reversibility) collapse in the physical medium, fundamentally changing the implementation difficulty of each structural principle (see Chapter 10).

11.6 Future Vision: Claude Code for the Physical World

What would a system look like with Agentic Coding's success structure fully transplanted to robots?

Persistent Environment Memory: Robots maintain an "instruction manual" for their environment, analogous to CLAUDE.md. "This drawer opens by pulling left," "the living room light switch is to the right of the door" — environment knowledge learned from experience is permanently stored and loaded at every task start.

Structured Physical Feedback: Fusion of tactile, force-torque, and visual data generates "stack-trace-level" failure reports. "Grasp failure at step 3: insufficient grip force (measured 2.1N, required 4.5N) due to wet surface — retry with increased force and slower approach."

Specialist Robot Teams: Navigation-specialist robots, precision-manipulation-specialist robots, and inspection-specialist drones collaborate under orchestrator coordination. When a specialist fails, the orchestrator selects alternatives.

Test-Time Trial and Error: Multiple strategies are tried in parallel in simulation, and the highest-probability strategy is executed in the physical world. On failure, the system returns to simulation for replanning.
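
The simulate-then-execute step above can be sketched as scoring candidate strategies with a stub simulator and executing only the best; strategy names, scores, and the threshold are invented for illustration.

```python
def plan_via_simulation(strategies, simulate, threshold=0.5):
    """Score candidates in simulation; return the best if it clears the
    threshold, else None to trigger replanning instead of real-world risk."""
    scored = sorted(strategies, key=simulate, reverse=True)
    best = scored[0]
    return best if simulate(best) >= threshold else None

# Stub simulator: a lookup of assumed success probabilities per strategy.
sim_scores = {"pick_direct": 0.4, "push_then_pick": 0.8, "ask_human": 0.6}
best = plan_via_simulation(list(sim_scores), sim_scores.get)
```

The threshold encodes the asymmetry of the physical medium: a low-scoring plan is cheaper to reject in simulation than to undo in the real world.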

Each element of this vision has already been demonstrated in individual research efforts. Integration remains the open challenge.

11.7 Conclusion

The success of Agentic Coding systems cannot be explained by model capability alone. The 40% of harness engineering — three-layer memory, structured feedback loops, specialized tool orchestration, subagent teams, and permission models — is an essential condition for success.

These structural principles are medium-independent. KARMA's memory, REFLECT's feedback, BUMBLE's skill routing, and AutoRT's fleet orchestration are all physical-world implementations of the same principles. The gap originates not in structure but in the properties of the medium — nondeterminism, latency, and irreversibility.

Building "Claude Code for the physical world" is not about inventing new structures. It is about adapting proven structures within the constraints of the physical world. Charting this adaptation path is this book's core contribution, and this appendix provides the starting point — a detailed blueprint of the original structure.

References

  1. The Register, "Claude Code's innards revealed as source code leaked online," theregister.com, April 2026.
  2. MindStudio, "Claude Code Source Leak: The Three-Layer Memory Architecture and What It Means for Builders," mindstudio.ai/blog, 2026.
  3. Rajiv Pant, "How Claude's Memory Actually Works (And Why CLAUDE.md Matters)," rajiv.com/blog, December 2025.
  4. Penligent, "Inside Claude Code: The Architecture Behind Tools, Memory, Hooks, and MCP," penligent.ai, 2025.
  5. VentureBeat, "Claude Code's source code appears to have leaked: here's what we know," venturebeat.com, 2026.
  6. Anthropic, "Claude Code Best Practices," anthropic.com/engineering, 2025.
  7. OpenAI, "Introducing Codex," openai.com/index/introducing-codex, May 2025.
  8. OpenAI, "Introducing the Codex App," openai.com/index/introducing-the-codex-app, February 2026.
  9. OpenAI, "Introducing upgrades to Codex," openai.com/index/introducing-upgrades-to-codex, 2026.
  10. Wikipedia, "OpenAI Codex (AI agent)," en.wikipedia.org, 2026.
  11. Morphllm, "Claude Code as Orchestrator: Inter-Agent Communication Protocols," morphllm.com, 2026.
  12. Morphllm, "Claude Code Subagents: How They Work, What They See & When to Use Them," morphllm.com, 2026.
  13. Paddo.dev, "Claude Code Auto-Fix: The PR That Fixes Itself," paddo.dev/blog, 2026.
  14. Springer, "Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions," Artificial Intelligence Review, 2025.
  15. Anthropic, "2026 Agentic Coding Trends Report," resources.anthropic.com, 2026.
  16. Claude Code Docs, "Create custom subagents," code.claude.com/docs/en/sub-agents, 2026.
  17. Claude Code Docs, "How Claude remembers your project," code.claude.com/docs/en/memory, 2026.
  18. Dbreunig, "How Claude Code Builds a System Prompt," dbreunig.com, April 2026.
  19. Liu, Z. et al., "REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction," arXiv:2306.15724, 2023.
  20. Wang, Z. et al., "KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems," arXiv:2409.14908, 2024.
  21. Xie, Q. et al., "Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation," arXiv:2409.18313, 2024.
  22. Brohan, A. et al., "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents," arXiv:2401.12963, 2024.
  23. Shah, M. et al., "BUMBLE: Unifying Reasoning and Acting with VLMs for Building-wide Mobile Manipulation," arXiv:2410.06237, 2024.
  24. Fu, M. et al., "CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation," arXiv:2603.22435, 2026.