Part IV: Fundamental Differences

Chapter 10: Agentic Coding vs Agentic Robotics — The Gap and the Future

Written: 2026-04-08 · Last updated: 2026-04-08

Summary

This book has traced the research trajectory from LLM Planners (2022) through Agentic Robotics (2026), maintaining Agentic Coding as a consistent comparison axis. This final chapter synthesizes the seven-dimension gap analysis, distinguishes what can be closed from what is irreducible in principle, and charts the path for physical agents to approach the maturity of digital agents.

10.1 Introduction: Same Loop, Fundamentally Different Worlds

The core observation running through this entire book is simple: the agentic loop's structure is identical, but the physical world fundamentally changes the difficulty of each step.

Observe, plan, execute, verify, reflect, remember, retry. Claude Code iterates this loop in seconds and has reached production readiness. PragmaBot runs the same loop structure in minutes, achieving 84% success (see Chapter 8). BUMBLE plateaus at 47.1% at building scale. Drawing the precise map of this gap is this chapter's goal.
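The loop the two fields share can be sketched in a few lines. Every name below is illustrative, not any system's actual API; the comments mark where the digital and physical versions diverge.

```python
def agentic_loop(observe, plan, execute, verify, reflect, goal, max_retries=5):
    """Observe, plan, execute, verify, reflect, remember, retry."""
    memory = []                          # the "remember" step
    for _ in range(max_retries):
        obs = observe()                  # stack trace vs. camera/force data
        action = plan(goal, obs, memory)
        result = execute(action)         # milliseconds vs. minutes
        if verify(result, goal):         # unit test vs. scene-graph check
            return result, memory
        memory.append(reflect(result))   # diagnose the failure before retrying
    return None, memory

# Toy run: an executor that succeeds on its third attempt.
attempts = {"n": 0}
def flaky_execute(action):
    attempts["n"] += 1
    return "placed" if attempts["n"] >= 3 else "dropped"

result, failures = agentic_loop(
    observe=lambda: "scene",
    plan=lambda goal, obs, mem: "pick-and-place",
    execute=flaky_execute,
    verify=lambda r, goal: r == "placed",
    reflect=lambda r: f"failure: {r}",
    goal="cup on shelf",
)
# result == "placed" after two remembered failures
```

The structure is identical in both worlds; what differs, as the seven dimensions below detail, is the cost and fidelity of each callable.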

10.2 Seven-Dimension Synthesis

Dimension 1: Error Feedback — Quality and Speed of Feedback [5/5]

Code errors are structured text: "File X, Line Y, TypeError" — exact file, exact line, exact error type. The feedback channel is, for practical purposes, lossless and instantaneous.

Physical errors are noisy multisensory data. When a gripper drops a cup, whether the cause is insufficient force, slippage, or position error is not immediately determinable. REFLECT [Liu et al., 2023] infers failure causes via VLM at 69-79% accuracy depending on task type (see Chapter 8). VeriGraph [Ekpo et al., 2024] structures feedback through scene graphs but still fails to capture fine-grained manipulation failures (see Chapter 7).

This gap is the most fundamental bottleneck. Feedback quality determines the efficiency of the entire agentic loop. Converting physical-world errors to "stack-trace-level" clarity is the central challenge of Agentic Robotics.

Dimension 2: Execution Determinism — Same Command, Different Results [4/5]

The same code and input guarantee the same output. Containers such as Docker make even the environment reproducible.

The same pick-and-place command yields different results each time. Diffusion Policy [Chi et al., 2023] explicitly models stochastic policies, acknowledging this intrinsic uncertainty (see Chapter 6). DROID [Khazatsky et al., 2024] attempts to internalize environmental diversity through large-scale data from 13 institutions and 564 scenes, but deterministic reproducibility is unachievable in principle.

Direction: Design systems that are "stochastic but robust" rather than deterministic — systems that recover from failure rather than prevent it.
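The arithmetic behind "recover rather than prevent" is simple: retries turn a stochastic primitive into a reliable system. A brief illustration (the 70% figure is invented for the example, not drawn from any benchmark):

```python
def success_with_retries(p_attempt, max_attempts):
    """P(at least one of max_attempts independent attempts succeeds)."""
    return 1 - (1 - p_attempt) ** max_attempts

# A grasp that works 70% of the time, retried up to 3 times,
# behaves like a ~97% system, without ever being deterministic.
print(round(success_with_retries(0.70, 3), 3))  # 0.973
```

This assumes failures are independent across attempts; correlated failures (a miscalibrated camera, a deformed object) are exactly why verification and reflection, not blind retry, matter.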

Dimension 3: State Representation — How the World Is Seen [4/5]

Codebases are fully observable. ASTs precisely represent semantic structure; the file system exposes all state.

Physical environments are partially observable. Objects outside camera view, occluded objects, and internal states (contents of drawers) are unknown. SayPlan's 3D scene graphs [Rana et al., 2023], RoboEXP's action-conditioned graphs [Jiang et al., 2024], and KARMA's memory-extended graphs [Wang et al., 2024] have advanced structural representation (see Chapter 7), but "reading the world like reading code" remains distant.

Dimension 4: Memory Architecture — What to Remember and How to Retrieve [3/5]

Code agents can reference entire codebases in 200K+ token context windows; files persist indefinitely. There are no real-time constraints.

Robots must query memory within real-time control loops (10-100Hz). KARMA's LTM/STM separation is the most effective response, achieving 62.7x efficiency improvement on Complex Tasks (see Chapter 7). Embodied-RAG [Xie et al., 2024] structures retrieval through spatial-semantic hierarchies.

This dimension is advancing fastest. Memory architectures have visible practical paths, with high likelihood of gap closure.
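The LTM/STM split can be shown in a minimal sketch, inspired by but not reproducing KARMA's design; all names and the consolidation rule are ours.

```python
from collections import deque

class TwoTierMemory:
    """Minimal LTM/STM sketch; names and structure are illustrative."""

    def __init__(self, stm_size=8):
        self.stm = deque(maxlen=stm_size)  # short-term: recent observations, cheap to scan in the control loop
        self.ltm = {}                      # long-term: consolidated facts, object -> last known location

    def observe(self, obj, location):
        self.stm.append((obj, location))
        self.ltm[obj] = location           # consolidate every observation

    def where(self, obj):
        for seen, loc in reversed(self.stm):  # freshest context first
            if seen == obj:
                return loc
        return self.ltm.get(obj)              # fall back to the long-term store

mem = TwoTierMemory(stm_size=2)
mem.observe("cup", "table")
mem.observe("plate", "sink")
mem.observe("knife", "drawer")   # "cup" has now left short-term memory
print(mem.where("cup"))          # "table", recovered from LTM
```

The point of the split is the query cost: the control loop scans only the small STM, while the unbounded LTM is consulted just on a miss.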

Dimension 5: Action Space — What Can Be Done [4/5]

Code agent actions are discrete, compositional, and extensible. "Open file" has a clear start and end.

Robot actions are continuous, unfold over time, and are dominated by contact dynamics. RT-H's language motion [Belkhale et al., 2024], HAMSTER's 2D paths [Li et al., 2025], and Hi Robot's atomic commands [Shi et al., 2025] hierarchically bridge the discrete-continuous gap (see Chapter 5). Code-as-Symbolic-Planner [Chen et al., 2025] attempts to transplant code's discrete advantages into robotics (see Chapter 3).

Direction: Hierarchical abstraction is key. GR00T N1's [NVIDIA, 2025] Dual-System offers an architecture-level answer.

Dimension 6: Verification & Testing — How to Verify [5/5]

Code runs thousands of unit tests in seconds, each completing in milliseconds. The robot equivalent of a "unit test" does not exist.

SIMPLER [Li et al., 2024] established simulation-based evaluation standards, but the sim2real gap persists (see Chapter 9). CaP-X [Fu et al., 2026] first applied agentic coding metrics to robots but remains early-stage. AutoRT [Brohan et al., 2024] increased throughput through fleet-scale testing but requires Google-scale infrastructure.

This gap is irreducible in principle. Contact dynamics cannot be perfectly simulated due to physics approximation limits. The direction is hierarchical verification: high-level via scene graphs, mid-level via code, low-level via simulation, final confirmation via physical experiment.
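The hierarchical direction amounts to an ordered cascade of checks, cheapest first. A sketch with stub layers (the check names and predicates are hypothetical stand-ins for the four levels named above):

```python
def hierarchical_verify(plan, checks):
    """Run verification layers from cheapest to most expensive;
    stop at the first layer that rejects the plan."""
    for layer_name, check in checks:
        if not check(plan):
            return False, layer_name
    return True, None

# Stub layers standing in for scene-graph, code, simulation, and physical checks.
layers = [
    ("scene-graph precondition", lambda p: "target" in p),
    ("code-level check",         lambda p: p.get("steps", 0) > 0),
    ("simulated rollout",        lambda p: p.get("sim_success", 0.0) > 0.5),
    ("physical trial",           lambda p: True),  # only reached if all above pass
]

ok, failed_at = hierarchical_verify(
    {"target": "cup", "steps": 4, "sim_success": 0.3}, layers)
print(ok, failed_at)  # False "simulated rollout": cheap layers passed, sim rejected
```

The design choice is economic: expensive physical trials are reserved for plans that survive every cheaper filter.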

Dimension 7: Recoverability — How to Recover from Failure [5/5]

git revert undoes any committed change. A broken cup cannot be undone.

AutoRT's Robot Constitution preemptively blocks dangerous actions, PragmaBot [2025] reduces failure probability through conservative strategies, and BUMBLE [Shah et al., 2024] replans upon failure detection (see Chapter 8). But physical actions already taken cannot be reversed.

This gap is irreducible in principle. The second law of thermodynamics imposes this constraint. The direction is "prevention and graceful degradation" rather than recovery.

10.3 Three Tiers of Gap

Synthesizing the seven dimensions yields three tiers:

Irreducible gaps (adaptation required): Error Feedback, Verification, Recoverability. Arising from essential properties of the physical world. "Adaptation" strategies, not "elimination," are needed.

Structural gaps (solvable with new approaches): Execution Determinism, State Representation, Action Space. Limited by current methodology. Stochastic policies, scene graphs, and hierarchical abstraction chart the solution paths.

Practical gaps (solvable with engineering effort): Memory Architecture. A matter of time and resources. KARMA/Embodied-RAG have already demonstrated practical paths.

This classification carries important implications for research resource allocation. Investing in "complete elimination" of irreducible gaps is inefficient. Focus should instead be on adaptation strategies: safety-first design, graceful degradation, and hierarchical verification.

10.4 Transplanting Agentic Coding's Success Factors to Robotics

| Success Factor | In Coding | Robotics Transplant Attempt | Maturity |
|---|---|---|---|
| Fast, precise feedback | Stack traces, tests | VLM failure diagnosis (REFLECT) | Early (69-79% accuracy) |
| Low-cost experimentation | Virtually free | Simulation (SIMPLER) | Mid (sim2real gap) |
| Easy recovery | git revert | Safety-first (AutoRT Constitution) | Early (prevention only) |
| Structured state | File system, AST | 3D scene graphs (SayPlan, KARMA) | Mid (dynamic env challenge) |

The fourth factor (structured state) is maturing fastest; the first (feedback quality) is the largest bottleneck. Simultaneous progress on all four determines Agentic Robotics' maturity.

10.5 Eight Open Problems

[Fundamental] Semantic Translation of Physical Feedback

Converting sensor data to structured feedback LLMs can understand. The most urgent and most difficult challenge. Counterfactual reasoning, failure RAG, and tactile feedback integration are promising directions.

[Fundamental] Balancing Safety and Autonomy

The dilemma of permitting autonomous action while guaranteeing safety. AutoRT's Robot Constitution is the beginning, but vulnerable to long-tail risks. A hierarchical safety architecture — hardware-level reflexive safety + software-level reasoning-based safety — is needed.
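A minimal sketch of that layered idea, with invented checks; a real system would enforce the reflex layer in firmware, not Python:

```python
def safe_execute(action, execute, halt,
                 reflex_ok,      # fast, hardware-style limit check (kHz rates)
                 reasoned_ok):   # slow, constitution-style reasoning check (~Hz)
    """Every action must clear both safety layers before it runs."""
    if not reflex_ok(action):
        return halt("reflex limit exceeded")
    if not reasoned_ok(action):
        return halt("rejected by reasoning layer")
    return execute(action)

outcome = safe_execute(
    {"velocity": 2.5, "near_human": True},
    execute=lambda a: "executed",
    halt=lambda why: f"halted: {why}",
    reflex_ok=lambda a: a["velocity"] < 1.0,      # hard velocity cap
    reasoned_ok=lambda a: not a["near_human"],    # conservative high-level rule
)
print(outcome)  # "halted: reflex limit exceeded", the fast layer fires first
```

The ordering encodes the architecture: the reflex layer never waits on the reasoning layer, so long-tail failures of the latter cannot disable the former.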

[Structural] Real-Time World Models

Internal models that predict action outcomes before execution. VLAs are currently reactive with no future prediction. GR00T N1's dual-system is a starting point, but both prediction accuracy and speed are insufficient.

[Structural] Cross-Embodiment Generalization

A single model working across diverse robot hardware. Open X-Embodiment, Octo, and OpenVLA built the foundation, but true cross-embodiment generalization remains unachieved.

[Structural] Long-Horizon Cumulative Error

95% per-step success over 20 steps yields only 36% overall. BUMBLE's 47.1% is reality. Mid-task verification checkpoints and adaptive replanning frequency are the solution direction.
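The compounding arithmetic, and the leverage of a verified retry at each step, in a few lines:

```python
def horizon_success(p_step, n_steps, retries_per_step=0):
    """Overall success when each step may be verified and retried in place."""
    p_effective = 1 - (1 - p_step) ** (retries_per_step + 1)
    return p_effective ** n_steps

print(round(horizon_success(0.95, 20), 3))     # 0.358: open loop, errors compound
print(round(horizon_success(0.95, 20, 1), 3))  # 0.951: one verified retry per step
```

A single verified retry per step recovers almost the entire horizon, which is why mid-task checkpoints dominate raw per-step accuracy improvements as a research direction.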

[Practical] Data Efficiency

Web data spans trillions of tokens; robot data reaches 1 million episodes. DROID, simulation data, and human video utilization are progressing but the gap persists.

[Practical] Real-Time Inference

VLA models (billions of parameters) vs robot control rates (100Hz+). TinyVLA, FAST, and hierarchical separation (HAMSTER, GR00T N1) are the solution path. Likely the fastest to be resolved, alongside hardware advances.

[Practical] Evaluation Standards

The absence of an Agentic Robotics equivalent to SWE-bench. CaP-X is the beginning, but standardized protocols are needed. Establishing this standard will accelerate the entire field.

10.6 Timeline Outlook: What Comes Next

Following the four paradigm shifts traced in this book, a fifth is foreshadowed:

| Shift | Period | Key Achievement | Paradigm Lag |
|---|---|---|---|
| 1. LLM External Planner | 2022 | LLM robot planning | Concurrent |
| 2. Multimodal VLA | 2023 | End-to-end VLA | Concurrent |
| 3. Open VLA Ecosystem | 2024 | VLA democratization | ~1 year |
| 4. Agentic Closed-Loop | 2025 | Closed-loop agentic systems | ~2 years |
| 5. Embodied World Models | 2026-27 | Robots that predict the future | ? |

Signs of the fifth shift — Embodied World Models — are already visible. GR00T N1's dual-system, Code-as-Symbolic-Planner's symbolic reasoning, and video prediction model advances are converging. The ability to "simulate outcomes before acting" already exists in Agentic Coding — type checking, static analysis, and tests are "pre-execution outcome prediction." Robots need the same capability.

Three conditions must be met: (1) real-time physics prediction models (currently seconds, needs milliseconds), (2) quantifying prediction uncertainty and incorporating it into decision-making, (3) graceful fallback to conservative behavior when prediction fails.
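The three conditions compose into a simple decision rule. In this sketch the predictor, threshold, and action names are invented placeholders, not any model's interface:

```python
def act_with_prediction(state, proposed, fallback, predict, execute,
                        max_uncertainty=0.2):
    """Condition (1): predict before acting. Condition (2): use the
    uncertainty estimate. Condition (3): degrade gracefully when unsure."""
    outcome, uncertainty = predict(state, proposed)
    if uncertainty > max_uncertainty or not outcome["safe"]:
        return execute(fallback)            # conservative behavior
    return execute(proposed)

chosen = act_with_prediction(
    state={"cup": "edge_of_table"},
    proposed="fast_grasp",
    fallback="slow_guarded_grasp",
    predict=lambda s, a: ({"safe": True}, 0.4),   # prediction is too uncertain: 0.4 > 0.2
    execute=lambda a: a,
)
print(chosen)  # "slow_guarded_grasp"
```

Note that condition (1), millisecond-scale prediction, is the binding constraint: a rule like this only helps if `predict` fits inside the control budget.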

The Rise of Hybrid Architectures

The tension running through this entire book — Large Model (VLA: fast but fails) vs System-Level Orchestration (slow but robust) — resolves not in choosing one, but in hybridization. Two promising directions emerge.

First, combining classical TAMP with VLM reasoning. VLA executes basic motions rapidly, while frame-level progress monitoring triggers VLM intervention for re-planning when deviations occur. This combines AutoTAMP's STL-based verification (see Chapter 5) with π0's real-time control (see Chapter 4).

Second, Orchestration→Distillation. Experiences accumulated through system-level orchestration (success/failure trajectories, re-planning patterns) are distilled into VLA, enabling VLA to gradually internalize the orchestrator's judgment. This parallels how in Agentic Coding, LLMs learn linter and type-checker patterns to progressively generate cleaner code.

Low-Level Temporal Constraints: The Unexplored Frontier

Most current Agentic Robotics research focuses on quasi-static manipulation — slowly and stably grasping and moving objects. But real-world industrial and daily settings require dynamic manipulation (throwing, catching, rapid assembly). In this domain, the agentic loop's "observe→plan→execute" cycle hits physical time constraints — catching a thrown ball requires sub-200ms reaction, shorter than current VLM inference times. Solving these low-level temporal constraints is the next frontier of Agentic Robotics.
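The timing mismatch can be made concrete with rough numbers; the 1 s VLM latency below is an illustrative assumption, not a measurement of any specific model:

```python
control_hz = 100
tick_ms = 1000 / control_hz        # 10 ms per low-level control step
reaction_budget_ms = 200           # catching a thrown ball (from the text)
vlm_latency_ms = 1000              # assumed large-VLM inference latency

ticks_to_react = reaction_budget_ms / tick_ms
print(int(ticks_to_react))                    # 20 control ticks available
print(vlm_latency_ms > reaction_budget_ms)    # True: the VLM cannot sit in this loop
```

Twenty control ticks must suffice for the entire observe-plan-execute cycle, which is why dynamic manipulation forces planning out of the reactive path entirely.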

10.7 Comparison with Agentic Coding: Final Synthesis

Agentic Coding is a world where the agentic loop already works. This success rests on three properties of the digital world: deterministic execution, instant feedback, and full reversibility. Agentic Robotics attempts to operate the same loop in a world where all three properties collapse.

Yet what the papers in this book demonstrate is that the agentic loop does work in the physical world. PragmaBot improves from 35% to 84%; CaP-X compensates for the absence of human-crafted abstraction through agentic scaffolding; KARMA achieves 62.7x efficiency improvement through memory alone. Each loop component contributes meaningfully.

Agentic Robotics reaching Agentic Coding's maturity is not a matter of time but of adaptation. The physical world's fundamental constraints will not dissolve, but building sufficiently effective systems within those constraints is achievable. Safety-first design, hierarchical verification, graceful degradation, and simulation-augmented experience — on these four pillars the future of Agentic Robotics stands.

10.8 Conclusion: A New Engineering for the Physical World

This book's map delivers three core messages.

First, the agentic loop is universal. Whether digital or physical, the observe-plan-execute-verify-reflect-remember loop is the core structure of autonomous systems. This universality is what enables transplanting Agentic Coding's lessons to robotics.

Second, the physical world imposes unique constraints. Three of seven dimensions (Error Feedback, Verification, Recoverability) are irreducible in principle. Ignoring these constraints and directly applying digital-world strategies leads to failure.

Third, adaptation is possible. From open-loop LLM Planners in 2022 to fully closed-loop PragmaBot in 2025, all components of the physical agentic loop were implemented in just four years. Maturity remains at the research prototype level, but the direction is clear.

From LLM Planners to VLAs, from VLAs to Agentic Robotics — this journey is one of humanity's most fundamental engineering challenges: extending digital intelligence into the physical world. We hope this book serves as a useful compass for researchers and engineers navigating that challenge.

References

  1. Chi, C. et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion," arXiv:2303.04137, 2023.
  2. Liu, Z. et al., "REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction," arXiv:2306.15724, 2023.
  3. Ekpo, D. et al., "VeriGraph: Scene Graphs for Execution Verifiable Robot Planning," arXiv:2411.10446, 2024.
  4. Khazatsky, A. et al., "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset," arXiv:2403.12945, 2024.
  5. Rana, K. et al., "SayPlan: Grounding Large Language Models using 3D Scene Graphs," arXiv:2307.06135, 2023.
  6. Jiang, H. et al., "RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration," arXiv:2402.15487, 2024.
  7. Wang, Z. et al., "KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems," arXiv:2409.14908, 2024.
  8. Xie, Q. et al., "Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation," arXiv:2409.18313, 2024.
  9. Belkhale, S. et al., "RT-H: Action Hierarchies Using Language," arXiv:2403.01823, 2024.
  10. Li, J. et al., "HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation," arXiv:2502.05485, 2025.
  11. Shi, L. X. et al., "Hi Robot: Open-Ended Instruction Following with Hierarchical VLA," arXiv:2502.19417, 2025.
  12. Chen, Y. et al., "Foundation Model-Based Robot Planning via Symbolic Code Generation for TAMP," arXiv:2503.01700, 2025.
  13. NVIDIA, "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots," arXiv:2503.14734, 2025.
  14. Li, X. et al., "Evaluating Real-World Robot Manipulation Policies in Simulation (SIMPLER)," arXiv:2405.05941, 2024.
  15. Fu, M. et al., "CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation," arXiv:2603.22435, 2026.
  16. Brohan, A. et al., "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents," arXiv:2401.12963, 2024.
  17. Shah, M. et al., "BUMBLE: Unifying Reasoning and Acting with VLMs for Building-wide Mobile Manipulation," arXiv:2410.06237, 2024.
  18. PragmaBot, "A Pragmatist Robot: Learning to Plan Tasks by Experiencing the Real World," arXiv:2507.16713, 2025.
  19. Kim, M. J. et al., "OpenVLA: An Open-Source Vision-Language-Action Model," arXiv:2406.09246, 2024.
  20. Black, K. et al., "π0: A Vision-Language-Action Flow Model for General Robot Control," arXiv:2410.24164, 2024.