Chapter 10: Agentic Coding vs Agentic Robotics — The Gap and the Future
Summary
This book has traced the research trajectory from LLM Planners (2022) through Agentic Robotics (2026), maintaining Agentic Coding as a consistent comparison axis. This final chapter synthesizes the seven-dimension gap analysis, distinguishes what can be closed from what is irreducible in principle, and charts the path for physical agents to approach the maturity of digital agents.
10.1 Introduction: Same Loop, Fundamentally Different Worlds
The core observation running through this entire book is simple: the agentic loop's structure is identical, but the physical world fundamentally changes the difficulty of each step.
Observe, plan, execute, verify, reflect, remember, retry. Claude Code iterates this loop in seconds and has reached production readiness. PragmaBot runs the same loop structure in minutes, achieving 84% success (see Chapter 8). BUMBLE plateaus at 47.1% at building scale. This chapter's goal is to map that gap precisely.
10.2 Seven-Dimension Synthesis
Dimension 1: Error Feedback — Quality and Speed of Feedback [5/5]
Code errors are structured text: "File X, Line Y, TypeError" — exact file, exact line, exact error type. For practical purposes, the feedback channel is lossless: the error message localizes the fault exactly.
Physical errors are noisy multisensory data. When a gripper drops a cup, whether the cause is insufficient force, slippage, or position error is not immediately determinable. REFLECT [Liu et al., 2023] infers failure causes via VLM at 69-79% accuracy depending on task type (see Chapter 8). VeriGraph [Ekpo et al., 2024] structures feedback through scene graphs but still fails to capture fine-grained manipulation failures (see Chapter 7).
This gap is the most fundamental bottleneck. Feedback quality determines the efficiency of the entire agentic loop. Converting physical-world errors to "stack-trace-level" clarity is the central challenge of Agentic Robotics.
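As a concrete illustration, here is a minimal sketch of what "stack-trace-level" physical feedback might look like: a structured failure report that a diagnosis module (e.g., a REFLECT-style VLM) could fill in and an LLM planner could consume. The schema and field names are hypothetical, not taken from any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class FailureReport:
    """A stack-trace-like record for a physical failure (hypothetical schema)."""
    skill: str                # which skill was executing, e.g. "pick"
    phase: str                # sub-phase where failure was detected
    suspected_cause: str      # best-guess cause from the diagnosis module
    confidence: float         # diagnosis confidence in [0, 1]
    evidence: dict = field(default_factory=dict)  # supporting sensor readings

    def to_prompt(self) -> str:
        """Render the report as structured text an LLM planner can consume."""
        lines = [
            f"FAILURE in skill={self.skill}, phase={self.phase}",
            f"suspected_cause={self.suspected_cause} (confidence={self.confidence:.2f})",
        ]
        lines += [f"  evidence: {k}={v}" for k, v in self.evidence.items()]
        return "\n".join(lines)

report = FailureReport(
    skill="pick", phase="lift",
    suspected_cause="grasp_slippage", confidence=0.72,
    evidence={"grip_force_N": 3.1, "object_in_view": False},
)
print(report.to_prompt())
```

The point of the exercise is the contrast: every field above is something a compiler gives the coding agent for free, and something the robotics stack must infer from noisy sensors.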
Dimension 2: Execution Determinism — Same Command, Different Results [4/5]
Same code and input guarantees same output. Docker isolates even the environment.
Same pick-and-place command yields different results each time. Diffusion Policy [Chi et al., 2023] explicitly models stochastic policies, acknowledging this intrinsic uncertainty (see Chapter 6). DROID [Khazatsky et al., 2024] attempts to internalize environmental diversity through large-scale data from 13 institutions and 564 scenes, but deterministic reproducibility is unachievable in principle.
Direction: Design systems that are "stochastic but robust" rather than deterministic — systems that recover from failure rather than prevent it.
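The "stochastic but robust" pattern can be sketched as a retry-until-verified loop. Both functions below are hypothetical stand-ins: `execute_pick` simulates a skill that succeeds about 70% of the time per attempt, and `robust_pick` recovers from failure by retrying rather than assuming determinism.

```python
import random

def execute_pick(seed=None):
    """Stand-in for a stochastic pick skill: succeeds ~70% of the time."""
    rng = random.Random(seed)
    return rng.random() < 0.7

def robust_pick(max_attempts=5, seed=0):
    """Retry until verification passes: recover from failure, don't prevent it."""
    for attempt in range(1, max_attempts + 1):
        if execute_pick(seed=seed + attempt):  # each attempt draws a new outcome
            return attempt  # succeeded on this attempt
    return None  # retry budget exhausted; escalate to replanning
```

With 0.7 per-attempt success, five attempts raise episode-level success to 1 - 0.3^5, roughly 0.998: robustness comes from the loop, not from the individual attempt.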
Dimension 3: State Representation — How the World Is Seen [4/5]
Codebases are fully observable. ASTs precisely represent semantic structure; the file system exposes all state.
Physical environments are partially observable. Objects outside camera view, occluded objects, and internal states (contents of drawers) are unknown. SayPlan's 3D scene graphs [Rana et al., 2023], RoboEXP's action-conditioned graphs [Jiang et al., 2024], and KARMA's memory-extended graphs [Wang et al., 2024] have advanced structural representation (see Chapter 7), but "reading the world like reading code" remains distant.
Dimension 4: Memory Architecture — What to Remember and How to Retrieve [3/5]
Code agents can reference entire codebases in 200K+ token context windows; files persist indefinitely. No time constraints.
Robots must query memory within real-time control loops (10-100Hz). KARMA's LTM/STM separation is the most effective response, achieving 62.7x efficiency improvement on Complex Tasks (see Chapter 7). Embodied-RAG [Xie et al., 2024] structures retrieval through spatial-semantic hierarchies.
This dimension is advancing fastest. Memory architectures have visible practical paths, with high likelihood of gap closure.
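A minimal sketch of LTM/STM separation, inspired by but far simpler than KARMA's design: a small short-term store with O(1) lookup for the real-time control loop, with evicted entries demoted to a long-term store that is consulted only on a miss, i.e., at planning time rather than in the hot path. All names are illustrative.

```python
from collections import OrderedDict

class TwoTierMemory:
    """Toy LTM/STM split: recent observations stay fast, old ones persist."""
    def __init__(self, stm_capacity=8):
        self.stm = OrderedDict()   # short-term: small, O(1), control-loop safe
        self.ltm = {}              # long-term: persistent, queried off the hot path
        self.stm_capacity = stm_capacity

    def observe(self, obj, pose):
        self.stm[obj] = pose
        self.stm.move_to_end(obj)              # mark as most recent
        if len(self.stm) > self.stm_capacity:
            old_obj, old_pose = self.stm.popitem(last=False)
            self.ltm[old_obj] = old_pose       # demote, don't forget

    def query(self, obj):
        if obj in self.stm:    # hot path: suitable inside a 10-100Hz loop
            return self.stm[obj], "stm"
        if obj in self.ltm:    # cold path: acceptable at planning time
            return self.ltm[obj], "ltm"
        return None, "miss"
```

The design choice this sketch encodes is the one the chapter credits to KARMA: the control loop never pays the cost of searching everything the robot has ever seen.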
Dimension 5: Action Space — What Can Be Done [4/5]
Code agent actions are discrete, compositional, and extensible. "Open file" has a clear start and end.
Robot actions are continuous, unfold over time, and are dominated by contact dynamics. RT-H's language motion [Belkhale et al., 2024], HAMSTER's 2D paths [Li et al., 2025], and Hi Robot's atomic commands [Shi et al., 2025] hierarchically bridge the discrete-continuous gap (see Chapter 5). Code-as-Symbolic-Planner [Chen et al., 2025] attempts to transplant code's discrete advantages into robotics (see Chapter 3).
Direction: Hierarchical abstraction is key. GR00T N1's [NVIDIA, 2025] Dual-System offers an architecture-level answer.
Dimension 6: Verification & Testing — How to Verify [5/5]
Code can run thousands of unit tests in seconds. The robot equivalent of a "unit test" does not exist.
SIMPLER [Li et al., 2024] established simulation-based evaluation standards, but the sim2real gap persists (see Chapter 9). CaP-X [Fu et al., 2026] first applied agentic coding metrics to robots but remains early-stage. AutoRT [Brohan et al., 2024] increased throughput through fleet-scale testing but requires Google-scale infrastructure.
This gap is irreducible in principle. Contact dynamics cannot be perfectly simulated due to physics approximation limits. The direction is hierarchical verification: high-level via scene graphs, mid-level via code, low-level via simulation, final confirmation via physical experiment.
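The hierarchical-verification direction can be sketched as a pipeline that runs verifiers from cheapest to most expensive and stops at the first failure, so the physical experiment is reached only by plans that survive every cheaper tier. The tiers and predicates below are illustrative placeholders, not any cited system's checks.

```python
def hierarchical_verify(plan, checks):
    """Run verifiers cheapest-first; stop at the first failure.

    `checks` is an ordered list of (name, fn) pairs where fn(plan) -> bool.
    Cheap symbolic checks filter out most bad plans before anything is
    simulated, let alone executed on hardware.
    """
    for name, check in checks:
        if not check(plan):
            return False, name   # report which tier rejected the plan
    return True, None

# Hypothetical tiers, ordered by cost: scene-graph precondition check,
# code-level static check, physics simulation.
checks = [
    ("scene_graph", lambda p: p.get("target_visible", False)),
    ("static",      lambda p: p.get("within_workspace", False)),
    ("simulation",  lambda p: p.get("sim_success", False)),
]

ok, rejected_by = hierarchical_verify(
    {"target_visible": True, "within_workspace": True, "sim_success": False},
    checks,
)
# This plan passes the two cheap tiers and is rejected by "simulation".
```

Physical execution would sit after the loop as a final, most expensive "check" that only a fully vetted plan ever reaches.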
Dimension 7: Recoverability — How to Recover from Failure [5/5]
`git revert` undoes any change. A broken cup cannot be undone.
AutoRT's Robot Constitution preemptively blocks dangerous actions, PragmaBot [2025] reduces failure probability through conservative strategies, and BUMBLE [Shah et al., 2024] replans upon failure detection (see Chapter 8). But physical actions already taken cannot be reversed.
This gap is irreducible in principle. The second law of thermodynamics imposes this constraint. The direction is "prevention and graceful degradation" rather than recovery.
10.3 Three Tiers of Gap
Synthesizing the seven dimensions yields three tiers:
Irreducible gaps (adaptation required): Error Feedback, Verification, Recoverability. Arising from essential properties of the physical world. "Adaptation" strategies, not "elimination," are needed.
Structural gaps (solvable with new approaches): Execution Determinism, State Representation, Action Space. Limited by current methodology. Stochastic policies, scene graphs, and hierarchical abstraction chart the solution paths.
Practical gaps (solvable with engineering effort): Memory Architecture. A matter of time and resources. KARMA/Embodied-RAG have already demonstrated practical paths.
This classification carries important implications for research resource allocation. Investing in "complete elimination" of irreducible gaps is inefficient. Focus should instead be on adaptation strategies: safety-first design, graceful degradation, and hierarchical verification.
10.4 Transplanting Agentic Coding's Success Factors to Robotics
| Success Factor | In Coding | Robotics Transplant Attempt | Maturity |
|---|---|---|---|
| Fast, precise feedback | Stack traces, tests | VLM failure diagnosis (REFLECT) | Early (69-79% accuracy) |
| Low-cost experimentation | Virtually free | Simulation (SIMPLER) | Mid (sim2real gap) |
| Easy recovery | `git revert` | Safety-first (AutoRT Constitution) | Early (prevention only) |
| Structured state | File system, AST | 3D scene graphs (SayPlan, KARMA) | Mid (dynamic env challenge) |
The fourth factor (structured state) is maturing fastest; the first (feedback quality) is the largest bottleneck. Simultaneous progress on all four determines Agentic Robotics' maturity.
10.5 Eight Open Problems
[Fundamental] Semantic Translation of Physical Feedback
Converting sensor data to structured feedback LLMs can understand. The most urgent and most difficult challenge. Counterfactual reasoning, failure RAG, and tactile feedback integration are promising directions.
[Fundamental] Balancing Safety and Autonomy
The dilemma of permitting autonomous action while guaranteeing safety. AutoRT's Robot Constitution is the beginning, but vulnerable to long-tail risks. A hierarchical safety architecture — hardware-level reflexive safety + software-level reasoning-based safety — is needed.
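A sketch of such a two-layer architecture, with illustrative names and thresholds (not taken from AutoRT): a fast, sensor-triggered reflex that can veto anything, followed by a slower rule-based check over the proposed action.

```python
def safety_filter(action, force_reading_n, forbidden_targets):
    """Two-layer safety sketch; names and the 20 N threshold are illustrative.

    Layer 1 is reflexive: purely sensor-triggered, no reasoning, always wins.
    Layer 2 is rule-based: a constitution-style check over the proposed action.
    """
    # Layer 1: hardware-level reflex (e.g., excessive contact force -> halt).
    if force_reading_n > 20.0:
        return "halt"
    # Layer 2: software-level reasoning over the action's target.
    if action.get("target") in forbidden_targets:
        return "reject"
    return "allow"
```

The ordering is the architectural point: the reflexive layer must be able to override any reasoned decision, because reasoning is both slower and more vulnerable to long-tail inputs.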
[Structural] Real-Time World Models
Internal models that predict action outcomes before execution. VLAs are currently reactive with no future prediction. GR00T N1's dual-system is a starting point, but both prediction accuracy and speed are insufficient.
[Structural] Cross-Embodiment Generalization
A single model working across diverse robot hardware. Open X-Embodiment, Octo, and OpenVLA built the foundation, but true cross-embodiment generalization remains unachieved.
[Structural] Long-Horizon Cumulative Error
95% per-step success over 20 steps yields only 36% overall. BUMBLE's 47.1% is reality. Mid-task verification checkpoints and adaptive replanning frequency are the solution direction.
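The arithmetic, and the payoff of mid-task checkpoints, can be made explicit. The sketch below assumes step attempts are independent and that checkpoints detect every failure, which is an idealization of real verification.

```python
def horizon_success(p_step: float, n_steps: int, retries: int = 0) -> float:
    """P(all steps succeed) when a detected failure may be retried.

    With `retries` extra attempts per step, the effective per-step success
    rises from p_step to 1 - (1 - p_step)^(retries + 1).
    """
    p_effective = 1 - (1 - p_step) ** (retries + 1)
    return p_effective ** n_steps

open_loop  = horizon_success(0.95, 20)             # ≈ 0.358: the 36% in the text
with_retry = horizon_success(0.95, 20, retries=1)  # ≈ 0.951 with one retry per step
```

A single verified retry per step lifts the 20-step success rate from roughly 36% to roughly 95%, which is why checkpoint placement matters more than marginal gains in raw per-step accuracy.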
[Practical] Data Efficiency
Web data spans trillions of tokens; robot data reaches 1 million episodes. DROID, simulation data, and human video utilization are progressing but the gap persists.
[Practical] Real-Time Inference
VLA models (billions of parameters) vs robot control rates (100Hz+). TinyVLA, FAST, and hierarchical separation (HAMSTER, GR00T N1) are the solution path. Likely the fastest to be resolved, alongside hardware advances.
[Practical] Evaluation Standards
The absence of an Agentic Robotics equivalent to SWE-bench. CaP-X is the beginning, but standardized protocols are needed. Establishing this standard will accelerate the entire field.
10.6 Timeline Outlook: What Comes Next
Following the four paradigm shifts traced in this book, a fifth is foreshadowed:
| Shift | Period | Key Achievement | Lag behind Agentic Coding |
|---|---|---|---|
| 1. LLM External Planner | 2022 | LLM robot planning | Concurrent |
| 2. Multimodal VLA | 2023 | End-to-end VLA | Concurrent |
| 3. Open VLA Ecosystem | 2024 | VLA democratization | ~1 year |
| 4. Agentic Closed-Loop | 2025 | Closed-loop agentic systems | ~2 years |
| 5. Embodied World Models | 2026-27 | Robots that predict the future | ? |
Signs of the fifth shift — Embodied World Models — are already visible. GR00T N1's dual-system, Code-as-Symbolic-Planner's symbolic reasoning, and video prediction model advances are converging. The ability to "simulate outcomes before acting" already exists in Agentic Coding — type checking, static analysis, and tests are "pre-execution outcome prediction." Robots need the same capability.
Three conditions must be met: (1) real-time physics prediction models (currently seconds, needs milliseconds), (2) quantifying prediction uncertainty and incorporating it into decision-making, (3) graceful fallback to conservative behavior when prediction fails.
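Conditions (2) and (3) together amount to uncertainty-gated action selection, sketched below with hypothetical names and an arbitrary threshold.

```python
def select_action(predictions, uncertainty, threshold=0.2,
                  conservative_action="stop_and_reassess"):
    """Gate a world-model prediction on its own uncertainty.

    `predictions` maps candidate actions to predicted value. If the model's
    uncertainty exceeds the threshold, degrade gracefully to a conservative
    behavior instead of acting on a guess.
    """
    if uncertainty > threshold:
        return conservative_action              # condition (3): graceful fallback
    # condition (2): uncertainty was quantified and deemed acceptable
    return max(predictions, key=predictions.get)

confident = select_action({"push": 0.8, "pull": 0.2}, uncertainty=0.05)  # "push"
uncertain = select_action({"push": 0.5, "pull": 0.5}, uncertainty=0.40)  # fallback
```

Condition (1), millisecond-scale physics prediction, is what would make such a gate usable inside the control loop rather than only at planning time.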
The Rise of Hybrid Architectures
The tension running through this entire book — Large Model (VLA: fast but fails) vs System-Level Orchestration (slow but robust) — resolves not in choosing one, but in hybridization. Two promising directions emerge.
First, combining classical TAMP with VLM reasoning. VLA executes basic motions rapidly, while frame-level progress monitoring triggers VLM intervention for replanning when deviations occur. This combines AutoTAMP's STL-based verification (see Chapter 5) with π0's real-time control (see Chapter 4).
Second, Orchestration→Distillation. Experiences accumulated through system-level orchestration (success/failure trajectories, replanning patterns) are distilled into VLA, enabling VLA to gradually internalize the orchestrator's judgment. This parallels how in Agentic Coding, LLMs learn linter and type-checker patterns to progressively generate cleaner code.
Low-Level Temporal Constraints: The Unexplored Frontier
Most current Agentic Robotics research focuses on quasi-static manipulation — slowly and stably grasping and moving objects. But real-world industrial and daily settings require dynamic manipulation (throwing, catching, rapid assembly). In this domain, the agentic loop's "observe→plan→execute" cycle hits physical time constraints — catching a thrown ball requires sub-200ms reaction, shorter than current VLM inference times. Solving these low-level temporal constraints is the next frontier of Agentic Robotics.
10.7 Comparison with Agentic Coding: Final Synthesis
Agentic Coding is a world where the agentic loop already works. This success rests on three properties of the digital world: deterministic execution, instant feedback, and full reversibility. Agentic Robotics attempts to operate the same loop in a world where all three properties collapse.
Yet what the papers in this book demonstrate is that the agentic loop does work in the physical world. PragmaBot improves from 35% to 84%; CaP-X compensates for the absence of human-crafted abstraction through agentic scaffolding; KARMA achieves 62.7x efficiency improvement through memory alone. Each loop component contributes meaningfully.
Agentic Robotics reaching Agentic Coding's maturity is not a matter of time but of adaptation. The physical world's fundamental constraints will not dissolve, but building sufficiently effective systems within those constraints is achievable. Safety-first design, hierarchical verification, graceful degradation, and simulation-augmented experience — on these four pillars the future of Agentic Robotics stands.
10.8 Conclusion: A New Engineering for the Physical World
This book's map delivers three core messages.
First, the agentic loop is universal. Whether digital or physical, the observe-plan-execute-verify-reflect-remember loop is the core structure of autonomous systems. This universality is what enables transplanting Agentic Coding's lessons to robotics.
Second, the physical world imposes unique constraints. Three of seven dimensions (Error Feedback, Verification, Recoverability) are irreducible in principle. Ignoring these constraints and directly applying digital-world strategies leads to failure.
Third, adaptation is possible. From open-loop LLM Planners in 2022 to fully closed-loop PragmaBot in 2025, all components of the physical agentic loop were implemented in just four years. Maturity remains at the research prototype level, but the direction is clear.
From LLM Planners to VLAs, from VLAs to Agentic Robotics — this journey is one of humanity's most fundamental engineering challenges: extending digital intelligence into the physical world. We hope this book serves as a useful compass for researchers and engineers navigating that challenge.
References
- Chi, C. et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion," arXiv:2303.04137, 2023.
- Liu, Z. et al., "REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction," arXiv:2306.15724, 2023.
- Ekpo, D. et al., "VeriGraph: Scene Graphs for Execution Verifiable Robot Planning," arXiv:2411.10446, 2024.
- Khazatsky, A. et al., "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset," arXiv:2403.12945, 2024.
- Rana, K. et al., "SayPlan: Grounding Large Language Models using 3D Scene Graphs," arXiv:2307.06135, 2023.
- Jiang, H. et al., "RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration," arXiv:2402.15487, 2024.
- Wang, Z. et al., "KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems," arXiv:2409.14908, 2024.
- Xie, Q. et al., "Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation," arXiv:2409.18313, 2024.
- Belkhale, S. et al., "RT-H: Action Hierarchies Using Language," arXiv:2403.01823, 2024.
- Li, J. et al., "HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation," arXiv:2502.05485, 2025.
- Shi, L. X. et al., "Hi Robot: Open-Ended Instruction Following with Hierarchical VLA," arXiv:2502.19417, 2025.
- Chen, Y. et al., "Foundation Model-Based Robot Planning via Symbolic Code Generation for TAMP," arXiv:2503.01700, 2025.
- NVIDIA, "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots," arXiv:2503.14734, 2025.
- Li, X. et al., "Evaluating Real-World Robot Manipulation Policies in Simulation (SIMPLER)," arXiv:2405.05941, 2024.
- Fu, M. et al., "CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation," arXiv:2603.22435, 2026.
- Brohan, A. et al., "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents," arXiv:2401.12963, 2024.
- Shah, M. et al., "BUMBLE: Unifying Reasoning and Acting with VLMs for Building-wide Mobile Manipulation," arXiv:2410.06237, 2024.
- PragmaBot, "A Pragmatist Robot: Learning to Plan Tasks by Experiencing the Real World," arXiv:2507.16713, 2025.
- Kim, M. J. et al., "OpenVLA: An Open-Source Vision-Language-Action Model," arXiv:2406.09246, 2024.
- Black, K. et al., "π0: A Vision-Language-Action Flow Model for General Robot Control," arXiv:2410.24164, 2024.