Chapter 10: Agentic Coding vs Agentic Robotics — The Gap and the Future
Summary
This book has traced the research trajectory from LLM Planners (2022) through Agentic Robotics (2026), maintaining Agentic Coding as a consistent comparison axis. This final chapter synthesizes the seven-dimension gap analysis, distinguishes what can be closed from what is irreducible in principle, and charts the path for physical agents to approach the maturity of digital agents.
10.1 Introduction: Same Loop, Fundamentally Different Worlds
The core observation running through this entire book is simple: the agentic loop's structure is identical, but the physical world fundamentally changes the difficulty of each step.
Observe, plan, execute, verify, reflect, remember, retry. Claude Code iterates this loop in seconds and has reached production readiness. PragmaBot runs the same loop structure in minutes, achieving 84% success (see Chapter 8). BUMBLE plateaus at 47.1% at building scale. This chapter's goal is to map that gap precisely.
10.2 Seven-Dimension Synthesis
Dimension 1: Error Feedback — Quality and Speed of Feedback [5/5]
Code errors are structured text: "File X, Line Y, TypeError" — exact file, exact line, exact error type. For practical purposes, the feedback channel is lossless: the error message localizes the fault exactly.
Physical errors are noisy multisensory data. When a gripper drops a cup, whether the cause is insufficient force, slippage, or position error is not immediately determinable. REFLECT [Liu et al., 2023] infers failure causes via VLM at 69-79% accuracy depending on task type (see Chapter 8). VeriGraph [Ekpo et al., 2024] structures feedback through scene graphs but still fails to capture fine-grained manipulation failures (see Chapter 7).
This gap is the most fundamental bottleneck. Feedback quality determines the efficiency of the entire agentic loop. Converting physical-world errors to "stack-trace-level" clarity is the central challenge of Agentic Robotics.
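As a concrete illustration, here is a minimal sketch of what "stack-trace-level" physical feedback might look like: a structured failure report that a diagnosis module (e.g., a REFLECT-style VLM) could fill in and an LLM planner could consume. The schema and field names are hypothetical, not taken from any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class FailureReport:
    """A stack-trace-like record for a physical failure (hypothetical schema)."""
    skill: str                # which skill was executing, e.g. "pick"
    phase: str                # sub-phase where failure was detected
    suspected_cause: str      # best-guess cause from the diagnosis module
    confidence: float         # diagnosis confidence in [0, 1]
    evidence: dict = field(default_factory=dict)  # supporting sensor readings

    def to_prompt(self) -> str:
        """Render the report as structured text an LLM planner can consume."""
        lines = [
            f"FAILURE in skill={self.skill}, phase={self.phase}",
            f"suspected_cause={self.suspected_cause} (confidence={self.confidence:.2f})",
        ]
        lines += [f"  evidence: {k}={v}" for k, v in self.evidence.items()]
        return "\n".join(lines)

report = FailureReport(
    skill="pick", phase="lift",
    suspected_cause="grasp_slippage", confidence=0.72,
    evidence={"grip_force_N": 3.1, "object_in_view": False},
)
print(report.to_prompt())
```

The point of the exercise is the contrast: every field above is something a compiler gives the coding agent for free, and something the robotics stack must infer from noisy sensors.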
Dimension 2: Execution Determinism — Same Command, Different Results [4/5]
Same code and input guarantees same output. Docker isolates even the environment.
Same pick-and-place command yields different results each time. Diffusion Policy [Chi et al., 2023] explicitly models stochastic policies, acknowledging this intrinsic uncertainty (see Chapter 6). DROID [Khazatsky et al., 2024] attempts to internalize environmental diversity through large-scale data from 13 institutions and 564 scenes, but deterministic reproducibility is unachievable in principle.
Direction: Design systems that are "stochastic but robust" rather than deterministic — systems that recover from failure rather than prevent it.
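The "stochastic but robust" pattern can be sketched as a retry-until-verified loop. Both functions below are hypothetical stand-ins: `execute_pick` simulates a skill that succeeds about 70% of the time per attempt, and `robust_pick` recovers from failure by retrying rather than assuming determinism.

```python
import random

def execute_pick(seed=None):
    """Stand-in for a stochastic pick skill: succeeds ~70% of the time."""
    rng = random.Random(seed)
    return rng.random() < 0.7

def robust_pick(max_attempts=5, seed=0):
    """Retry until verification passes: recover from failure, don't prevent it."""
    for attempt in range(1, max_attempts + 1):
        if execute_pick(seed=seed + attempt):  # each attempt draws a new outcome
            return attempt  # succeeded on this attempt
    return None  # retry budget exhausted; escalate to replanning
```

With 0.7 per-attempt success, five attempts raise episode-level success to 1 - 0.3^5, roughly 0.998: robustness comes from the loop, not from the individual attempt.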
Dimension 3: State Representation — How the World Is Seen [4/5]
Codebases are fully observable. ASTs precisely represent semantic structure; the file system exposes all state.
Physical environments are partially observable. Objects outside camera view, occluded objects, and internal states (contents of drawers) are unknown. SayPlan's 3D scene graphs [Rana et al., 2023], RoboEXP's action-conditioned graphs [Jiang et al., 2024], and KARMA's memory-extended graphs [Wang et al., 2024] have advanced structural representation (see Chapter 7), but "reading the world like reading code" remains distant.
Dimension 4: Memory Architecture — What to Remember and How to Retrieve [3/5]
Code agents can reference entire codebases in 200K+ token context windows; files persist indefinitely. No time constraints.
Robots must query memory within real-time control loops (10-100Hz). KARMA's LTM/STM separation is the most effective response, achieving 62.7x efficiency improvement on Complex Tasks (see Chapter 7). Embodied-RAG [Xie et al., 2024] structures retrieval through spatial-semantic hierarchies.
This dimension is advancing fastest. Memory architectures have visible practical paths, with high likelihood of gap closure.
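A minimal sketch of LTM/STM separation, inspired by but far simpler than KARMA's design: a small short-term store with O(1) lookup for the real-time control loop, with evicted entries demoted to a long-term store that is consulted only on a miss, i.e., at planning time rather than in the hot path. All names are illustrative.

```python
from collections import OrderedDict

class TwoTierMemory:
    """Toy LTM/STM split: recent observations stay fast, old ones persist."""
    def __init__(self, stm_capacity=8):
        self.stm = OrderedDict()   # short-term: small, O(1), control-loop safe
        self.ltm = {}              # long-term: persistent, queried off the hot path
        self.stm_capacity = stm_capacity

    def observe(self, obj, pose):
        self.stm[obj] = pose
        self.stm.move_to_end(obj)              # mark as most recent
        if len(self.stm) > self.stm_capacity:
            old_obj, old_pose = self.stm.popitem(last=False)
            self.ltm[old_obj] = old_pose       # demote, don't forget

    def query(self, obj):
        if obj in self.stm:    # hot path: suitable inside a 10-100Hz loop
            return self.stm[obj], "stm"
        if obj in self.ltm:    # cold path: acceptable at planning time
            return self.ltm[obj], "ltm"
        return None, "miss"
```

The design choice this sketch encodes is the one the chapter credits to KARMA: the control loop never pays the cost of searching everything the robot has ever seen.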
Dimension 5: Action Space — What Can Be Done [4/5]
Code agent actions are discrete, compositional, and extensible. "Open file" has a clear start and end.
Robot actions are continuous, unfold over time, and are dominated by contact dynamics. RT-H's language motion [Belkhale et al., 2024], HAMSTER's 2D paths [Li et al., 2025], and Hi Robot's atomic commands [Shi et al., 2025] hierarchically bridge the discrete-continuous gap (see Chapter 5). Code-as-Symbolic-Planner [Chen et al., 2025] attempts to transplant code's discrete advantages into robotics (see Chapter 3).
Direction: Hierarchical abstraction is key. GR00T N1's [NVIDIA, 2025] Dual-System offers an architecture-level answer.
Dimension 6: Verification & Testing — How to Verify [5/5]
Code can run thousands of unit tests in seconds. The robot equivalent of a "unit test" does not exist.
SIMPLER [Li et al., 2024] established simulation-based evaluation standards, but the sim2real gap persists (see Chapter 9). CaP-X [Fu et al., 2026] first applied agentic coding metrics to robots but remains early-stage. AutoRT [Brohan et al., 2024] increased throughput through fleet-scale testing but requires Google-scale infrastructure.
This gap is irreducible in principle. Contact dynamics cannot be perfectly simulated due to physics approximation limits. The direction is hierarchical verification: high-level via scene graphs, mid-level via code, low-level via simulation, final confirmation via physical experiment.
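The hierarchical-verification direction can be sketched as a pipeline that runs verifiers from cheapest to most expensive and stops at the first failure, so the physical experiment is reached only by plans that survive every cheaper tier. The tiers and predicates below are illustrative placeholders, not any cited system's checks.

```python
def hierarchical_verify(plan, checks):
    """Run verifiers cheapest-first; stop at the first failure.

    `checks` is an ordered list of (name, fn) pairs where fn(plan) -> bool.
    Cheap symbolic checks filter out most bad plans before anything is
    simulated, let alone executed on hardware.
    """
    for name, check in checks:
        if not check(plan):
            return False, name   # report which tier rejected the plan
    return True, None

# Hypothetical tiers, ordered by cost: scene-graph precondition check,
# code-level static check, physics simulation.
checks = [
    ("scene_graph", lambda p: p.get("target_visible", False)),
    ("static",      lambda p: p.get("within_workspace", False)),
    ("simulation",  lambda p: p.get("sim_success", False)),
]

ok, rejected_by = hierarchical_verify(
    {"target_visible": True, "within_workspace": True, "sim_success": False},
    checks,
)
# This plan passes the two cheap tiers and is rejected by "simulation".
```

Physical execution would sit after the loop as a final, most expensive "check" that only a fully vetted plan ever reaches.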
Dimension 7: Recoverability — How to Recover from Failure [5/5]
`git revert` undoes any change. A broken cup cannot be undone.
AutoRT's Robot Constitution preemptively blocks dangerous actions, PragmaBot [2025] reduces failure probability through conservative strategies, and BUMBLE [Shah et al., 2024] replans upon failure detection (see Chapter 8). But physical actions already taken cannot be reversed.
This gap is irreducible in principle. The second law of thermodynamics imposes this constraint. The direction is "prevention and graceful degradation" rather than recovery.
10.3 Three Tiers of Gap
Synthesizing the seven dimensions yields three tiers:
Irreducible gaps (adaptation required): Error Feedback, Verification, Recoverability. Arising from essential properties of the physical world. "Adaptation" strategies, not "elimination," are needed.
Structural gaps (solvable with new approaches): Execution Determinism, State Representation, Action Space. Limited by current methodology. Stochastic policies, scene graphs, and hierarchical abstraction chart the solution paths.
Practical gaps (solvable with engineering effort): Memory Architecture. A matter of time and resources. KARMA/Embodied-RAG have already demonstrated practical paths.
This classification carries important implications for research resource allocation. Investing in "complete elimination" of irreducible gaps is inefficient. Focus should instead be on adaptation strategies: safety-first design, graceful degradation, and hierarchical verification.
10.4 Transplanting Agentic Coding's Success Factors to Robotics
| Success Factor | In Coding | Robotics Transplant Attempt | Maturity |
|---|---|---|---|
| Fast, precise feedback | Stack traces, tests | VLM failure diagnosis (REFLECT) | Early (69-79% accuracy) |
| Low-cost experimentation | Virtually free | Simulation (SIMPLER) | Mid (sim2real gap) |
| Easy recovery | `git revert` | Safety-first (AutoRT Constitution) | Early (prevention only) |
| Structured state | File system, AST | 3D scene graphs (SayPlan, KARMA) | Mid (dynamic env challenge) |
The fourth factor (structured state) is maturing fastest; the first (feedback quality) is the largest bottleneck. Simultaneous progress on all four determines Agentic Robotics' maturity.
10.5 Eight Open Problems
[Fundamental] Semantic Translation of Physical Feedback
Converting sensor data to structured feedback LLMs can understand. The most urgent and most difficult challenge. Counterfactual reasoning, failure RAG, and tactile feedback integration are promising directions.
[Fundamental] Balancing Safety and Autonomy
The dilemma of permitting autonomous action while guaranteeing safety. AutoRT's Robot Constitution is the beginning, but vulnerable to long-tail risks. A hierarchical safety architecture — hardware-level reflexive safety + software-level reasoning-based safety — is needed.
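A sketch of such a two-layer architecture, with illustrative names and thresholds (not taken from AutoRT): a fast, sensor-triggered reflex that can veto anything, followed by a slower rule-based check over the proposed action.

```python
def safety_filter(action, force_reading_n, forbidden_targets):
    """Two-layer safety sketch; names and the 20 N threshold are illustrative.

    Layer 1 is reflexive: purely sensor-triggered, no reasoning, always wins.
    Layer 2 is rule-based: a constitution-style check over the proposed action.
    """
    # Layer 1: hardware-level reflex (e.g., excessive contact force -> halt).
    if force_reading_n > 20.0:
        return "halt"
    # Layer 2: software-level reasoning over the action's target.
    if action.get("target") in forbidden_targets:
        return "reject"
    return "allow"
```

The ordering is the architectural point: the reflexive layer must be able to override any reasoned decision, because reasoning is both slower and more vulnerable to long-tail inputs.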
[Structural] Real-Time World Models
Internal models that predict action outcomes before execution. VLAs are currently reactive with no future prediction. GR00T N1's dual-system is a starting point, but both prediction accuracy and speed are insufficient.
[Structural] Cross-Embodiment Generalization
A single model working across diverse robot hardware. Open X-Embodiment, Octo, and OpenVLA built the foundation, but true cross-embodiment generalization remains unachieved.
[Structural] Long-Horizon Cumulative Error
95% per-step success over 20 steps yields only 36% overall. BUMBLE's 47.1% is reality. Mid-task verification checkpoints and adaptive replanning frequency are the solution direction.
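The arithmetic, and the payoff of mid-task checkpoints, can be made explicit. The sketch below assumes step attempts are independent and that checkpoints detect every failure, which is an idealization of real verification.

```python
def horizon_success(p_step: float, n_steps: int, retries: int = 0) -> float:
    """P(all steps succeed) when a detected failure may be retried.

    With `retries` extra attempts per step, the effective per-step success
    rises from p_step to 1 - (1 - p_step)^(retries + 1).
    """
    p_effective = 1 - (1 - p_step) ** (retries + 1)
    return p_effective ** n_steps

open_loop  = horizon_success(0.95, 20)             # ≈ 0.358: the 36% in the text
with_retry = horizon_success(0.95, 20, retries=1)  # ≈ 0.951 with one retry per step
```

A single verified retry per step lifts the 20-step success rate from roughly 36% to roughly 95%, which is why checkpoint placement matters more than marginal gains in raw per-step accuracy.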
[Practical] Data Efficiency
Web data spans trillions of tokens; robot data reaches 1 million episodes. DROID, simulation data, and human video utilization are progressing but the gap persists.
[Practical] Real-Time Inference
VLA models (billions of parameters) vs robot control rates (100Hz+). TinyVLA, FAST, and hierarchical separation (HAMSTER, GR00T N1) are the solution path. Likely the fastest to be resolved, alongside hardware advances.
[Practical] Evaluation Standards
The absence of an Agentic Robotics equivalent to SWE-bench. CaP-X is the beginning, but standardized protocols are needed. Establishing this standard will accelerate the entire field.
10.6 Timeline Outlook: What Comes Next
Following the four paradigm shifts traced in this book, a fifth is foreshadowed:
| Shift | Period | Key Achievement | Lag behind Agentic Coding |
|---|---|---|---|
| 1. LLM External Planner | 2022 | LLM robot planning | Concurrent |
| 2. Multimodal VLA | 2023 | End-to-end VLA | Concurrent |
| 3. Open VLA Ecosystem | 2024 | VLA democratization | ~1 year |
| 4. Agentic Closed-Loop | 2025 | Closed-loop agentic systems | ~2 years |
| 5. Embodied World Models | 2026-27 | Robots that predict the future | ? |
Signs of the fifth shift — Embodied World Models — are already visible. GR00T N1's dual-system, Code-as-Symbolic-Planner's symbolic reasoning, and video prediction model advances are converging. The ability to "simulate outcomes before acting" already exists in Agentic Coding — type checking, static analysis, and tests are "pre-execution outcome prediction." Robots need the same capability.
Three conditions must be met: (1) real-time physics prediction models (currently seconds, needs milliseconds), (2) quantifying prediction uncertainty and incorporating it into decision-making, (3) graceful fallback to conservative behavior when prediction fails.
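Conditions (2) and (3) together amount to uncertainty-gated action selection, sketched below with hypothetical names and an arbitrary threshold.

```python
def select_action(predictions, uncertainty, threshold=0.2,
                  conservative_action="stop_and_reassess"):
    """Gate a world-model prediction on its own uncertainty.

    `predictions` maps candidate actions to predicted value. If the model's
    uncertainty exceeds the threshold, degrade gracefully to a conservative
    behavior instead of acting on a guess.
    """
    if uncertainty > threshold:
        return conservative_action              # condition (3): graceful fallback
    # condition (2): uncertainty was quantified and deemed acceptable
    return max(predictions, key=predictions.get)

confident = select_action({"push": 0.8, "pull": 0.2}, uncertainty=0.05)  # "push"
uncertain = select_action({"push": 0.5, "pull": 0.5}, uncertainty=0.40)  # fallback
```

Condition (1), millisecond-scale physics prediction, is what would make such a gate usable inside the control loop rather than only at planning time.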
The Rise of Hybrid Architectures
The tension running through this entire book — Large Model (VLA: fast but fails) vs System-Level Orchestration (slow but robust) — resolves not in choosing one, but in hybridization. Two promising directions emerge.
First, combining classical TAMP with VLM reasoning. VLA executes basic motions rapidly, while frame-level progress monitoring triggers VLM intervention for replanning when deviations occur. This combines AutoTAMP's STL-based verification (see Chapter 5) with π0's real-time control (see Chapter 4).
Second, Orchestration→Distillation. Experiences accumulated through system-level orchestration (success/failure trajectories, replanning patterns) are distilled into VLA, enabling VLA to gradually internalize the orchestrator's judgment. This parallels how in Agentic Coding, LLMs learn linter and type-checker patterns to progressively generate cleaner code.
Low-Level Temporal Constraints: The Unexplored Frontier
Most current Agentic Robotics research focuses on quasi-static manipulation — slowly and stably grasping and moving objects. But real-world industrial and daily settings require dynamic manipulation (throwing, catching, rapid assembly). In this domain, the agentic loop's "observe→plan→execute" cycle hits physical time constraints — catching a thrown ball requires sub-200ms reaction, shorter than current VLM inference times. Solving these low-level temporal constraints is the next frontier of Agentic Robotics.
10.7 Comparison with Agentic Coding: Final Synthesis
Agentic Coding is a world where the agentic loop already works. This success rests on three properties of the digital world: deterministic execution, instant feedback, and full reversibility. Agentic Robotics attempts to operate the same loop in a world where all three properties collapse.
Yet what the papers in this book demonstrate is that the agentic loop does work in the physical world. PragmaBot improves from 35% to 84%; CaP-X compensates for the absence of human-crafted abstraction through agentic scaffolding; KARMA achieves 62.7x efficiency improvement through memory alone. Each loop component contributes meaningfully.
Agentic Robotics reaching Agentic Coding's maturity is not a matter of time but of adaptation. The physical world's fundamental constraints will not dissolve, but building sufficiently effective systems within those constraints is achievable. Safety-first design, hierarchical verification, graceful degradation, and simulation-augmented experience — on these four pillars the future of Agentic Robotics stands.
10.8 Conclusion: A New Engineering for the Physical World
This book's map delivers three core messages.
First, the agentic loop is universal. Whether digital or physical, the observe-plan-execute-verify-reflect-remember loop is the core structure of autonomous systems. This universality is what enables transplanting Agentic Coding's lessons to robotics.
Second, the physical world imposes unique constraints. Three of seven dimensions (Error Feedback, Verification, Recoverability) are irreducible in principle. Ignoring these constraints and directly applying digital-world strategies leads to failure.
Third, adaptation is possible. From open-loop LLM Planners in 2022 to fully closed-loop PragmaBot in 2025, all components of the physical agentic loop were implemented in just four years. Maturity remains at the research prototype level, but the direction is clear.
From LLM Planners to VLAs, from VLAs to Agentic Robotics — this journey is one of humanity's most fundamental engineering challenges: extending digital intelligence into the physical world. We hope this book serves as a useful compass for researchers and engineers navigating that challenge.
References
- Chi, C. et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion," arXiv:2303.04137, 2023.
- Liu, Z. et al., "REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction," arXiv:2306.15724, 2023.
- Ekpo, D. et al., "VeriGraph: Scene Graphs for Execution Verifiable Robot Planning," arXiv:2411.10446, 2024.
- Khazatsky, A. et al., "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset," arXiv:2403.12945, 2024.
- Rana, K. et al., "SayPlan: Grounding Large Language Models using 3D Scene Graphs," arXiv:2307.06135, 2023.
- Jiang, H. et al., "RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration," arXiv:2402.15487, 2024.
- Wang, Z. et al., "KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems," arXiv:2409.14908, 2024.
- Xie, Q. et al., "Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation," arXiv:2409.18313, 2024.
- Belkhale, S. et al., "RT-H: Action Hierarchies Using Language," arXiv:2403.01823, 2024.
- Li, J. et al., "HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation," arXiv:2502.05485, 2025.
- Shi, L. X. et al., "Hi Robot: Open-Ended Instruction Following with Hierarchical VLA," arXiv:2502.19417, 2025.
- Chen, Y. et al., "Foundation Model-Based Robot Planning via Symbolic Code Generation for TAMP," arXiv:2503.01700, 2025.
- NVIDIA, "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots," arXiv:2503.14734, 2025.
- Li, X. et al., "Evaluating Real-World Robot Manipulation Policies in Simulation (SIMPLER)," arXiv:2405.05941, 2024.
- Fu, M. et al., "CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation," arXiv:2603.22435, 2026.
- Brohan, A. et al., "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents," arXiv:2401.12963, 2024.
- Shah, M. et al., "BUMBLE: Unifying Reasoning and Acting with VLMs for Building-wide Mobile Manipulation," arXiv:2410.06237, 2024.
- PragmaBot, "A Pragmatist Robot: Learning to Plan Tasks by Experiencing the Real World," arXiv:2507.16713, 2025.
- Kim, M. J. et al., "OpenVLA: An Open-Source Vision-Language-Action Model," arXiv:2406.09246, 2024.
- Black, K. et al., "π0: A Vision-Language-Action Flow Model for General Robot Control," arXiv:2410.24164, 2024.