Part I: Foundations — LLM Meets Robotics

Chapter 1: Introduction — From Agentic Coding to Agentic Robotics

Written: 2026-04-08 Last updated: 2026-04-08

Summary

Since 2022, LLM-based agents have fundamentally transformed software development. The autonomous loop of code generation, execution, error analysis, and correction has reached practical maturity. This book traces the attempts to extend that same agentic loop into the physical world — from LLM Planners through VLAs to Agentic Robotics — mapping the fundamental gaps that emerge and the paths toward overcoming them across seven dimensions.

1.1 Introduction: The Agentic Loop as a Shared Structure

The operating principle of LLM-based agents in software development follows a remarkably simple loop: observe (read code), plan (decide on modifications), execute (write and run code), verify (tests and linters), reflect (analyze errors), and retry. Tools like Claude Code, Cursor, and Devin iterate this loop in seconds, achieving autonomy comparable to human developers.
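The loop above can be sketched as a minimal control skeleton. This is an illustrative sketch, not any particular tool's API: the callback names (`observe`, `plan`, `execute`, `verify`, `reflect`) and the toy environment are assumptions chosen to mirror the six steps just listed.

```python
def agentic_loop(task, observe, plan, execute, verify, reflect, max_iters=10):
    """Minimal agentic loop: observe -> plan -> execute -> verify -> reflect -> retry."""
    reflections = []
    for attempt in range(1, max_iters + 1):
        state = observe()                        # read code / sensors
        action = plan(task, state, reflections)  # decide on changes / actions
        outcome = execute(action)                # run code / motor commands
        if verify(outcome):                      # tests pass / goal reached
            return attempt, outcome
        reflections.append(reflect(outcome))     # analyze failure, inform retry
    raise RuntimeError("budget exhausted")

# Toy instantiation: reach a target value by proposing small increments.
world = {"x": 0}

def execute(delta):
    world["x"] += delta
    return world["x"]

attempts, result = agentic_loop(
    task=5,
    observe=lambda: world["x"],
    plan=lambda target, x, _refl: min(2, target - x),  # deliberately small steps
    execute=execute,
    verify=lambda x: x == 5,
    reflect=lambda x: f"still at {x}, target 5",
)
# Three iterations (0 -> 2 -> 4 -> 5) before verification succeeds.
```

The same skeleton describes both Claude Code fixing a failing test and a robot retrying a grasp; only the implementations behind the five callbacks differ.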

This loop works because of three properties of the digital world. First, deterministic execution — the same code and input guarantee the same output. Second, immediate and precise feedback — stack traces and test outputs pinpoint the exact cause of failure. Third, full reversibility — a single git revert undoes any change.

Now imagine transplanting this loop into the physical world. When a robot receives the instruction "clean the kitchen," it must run an analogous loop: observe (cameras, sensors), plan (action sequences), execute (motor control), verify (observe outcomes), reflect (analyze failures), and retry. The structure is identical. But in the physical world, all three properties collapse.

1.2 The Fundamental Constraints of the Physical World

Stochastic Execution and Irreversibility

Executing the same "pick up the cup" command twice yields different outcomes. Subtle variations in object position, surface friction, and lighting conditions ensure this. Diffusion Policy [Chi et al., 2023] explicitly models stochastic policies precisely because it acknowledges this intrinsic uncertainty. More critically, physical actions are irreversible. A broken cup cannot be git revert-ed. The second law of thermodynamics imposes a fundamental constraint on physical agents.

Degraded Feedback Quality

A code error reads "File X, Line Y, TypeError: cannot concatenate int and str." A robot failure does not. When a gripper drops a cup — was the force insufficient? Was the object slippery? Was there a position error? REFLECT [Liu et al., 2023] attempts to explain failure causes in natural language using VLMs, with accuracy ranging from 69% to 79% depending on task type.

The Cost and Gap of Verification

Code allows thousands of unit tests to run in milliseconds. The robot equivalent of a "unit test" is a physical trial, requiring minutes per attempt and human supervision. Simulation-based alternatives like SIMPLER [Li et al., 2024] exist, but the sim-to-real gap persists.
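A back-of-envelope calculation quantifies the gap. The figures below are assumed for illustration (they are not measurements from any cited system):

```python
# Illustrative throughput comparison (assumed figures).
code_tests_per_second = 1000 / 0.5   # e.g. 1000 unit tests in 0.5 s
robot_trials_per_hour = 60 / 3       # e.g. one supervised physical trial per 3 min

code_checks_per_hour = code_tests_per_second * 3600  # 7,200,000 checks/hour
ratio = code_checks_per_hour / robot_trials_per_hour
# Under these assumptions, a coding agent receives ~360,000x more
# verification signal per hour than a physical robot.
```

Even if the assumed numbers are off by an order of magnitude in either direction, the conclusion survives: verification bandwidth, not model capability, is the binding constraint on physical trial-and-error.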

1.3 Four Paradigm Shifts

The research trajectory from 2022 to 2026 that this book traces can be structured as four paradigm shifts.

Shift 1: LLM as External Planner (2022). Researchers demonstrated that LLMs can decompose natural language instructions into action plans. Language Models as Zero-Shot Planners [Huang et al., 2022] was the starting point, SayCan [Ahn et al., 2022] grounded plans in the physical world, and Code as Policies [Liang et al., 2022] established code as a control interface (see Chapters 2, 3).

Shift 2: Multimodal VLA (2023). PaLM-E [Driess et al., 2023] and RT-2 [Brohan et al., 2023] gave birth to the Vision-Language-Action model. A single model that sees, understands, and acts — the end-to-end paradigm (see Chapter 4).

Shift 3: Open VLA Ecosystem (2024). OpenVLA [Kim et al., 2024], Octo [Ghosh et al., 2024], and pi0 [Black et al., 2024] democratized VLAs. The transition from RT-2 (55B, closed-source) to OpenVLA (7B, open-source) directly parallels the GPT-4-to-Llama transition in the LLM world (see Chapters 4, 5, 6).

Shift 4: Agentic Closed-Loop (2025-2026). BUMBLE [Shah et al., 2024], PragmaBot [2025], and AutoRT [Brohan et al., 2024] began constructing closed-loop agentic systems — prototypes of a complete plan-execute-reflect-remember loop operating in the physical world (see Chapters 7, 8, 9).

1.4 The Seven-Dimension Comparison Framework

The analytical backbone of this book is a seven-dimension framework that separates Agentic Coding from Agentic Robotics.

| Dimension | Agentic Coding | Agentic Robotics | Gap Severity |
|---|---|---|---|
| Error Feedback | Stack traces, test output | Sensor noise, partial observability | 5/5 |
| Execution Determinism | Deterministic, reproducible | Stochastic, non-reproducible | 4/5 |
| State Representation | Code, file system, AST | Scene graphs, point clouds | 4/5 |
| Memory Architecture | Long context, persistent files | Real-time constraints, spatial memory | 3/5 |
| Action Space | API calls, code edits (discrete) | Continuous motor commands | 4/5 |
| Verification | Unit tests, CI/CD | Physical trials, sim2real gap | 5/5 |
| Recoverability | git revert, undo | Irreversible (physical consequences) | 5/5 |

Three dimensions — Error Feedback, Verification, and Recoverability — arise from fundamental properties of the physical world and cannot be fully resolved in principle. These dimensions require strategies of adaptation rather than elimination, which is why Agentic Robotics does not simply follow the development trajectory of Agentic Coding but must forge its own path.

1.5 Structure of This Book

Part I (Chapters 1-3) covers the 2022 research where LLMs were first applied to robot planning. We examine the potential and limits of LLM Planners, and why code, rather than natural language, serves as the superior control interface.

Part II (Chapters 4-6) traces the VLA revolution. How end-to-end models emerged and were democratized, how high-level planning connects to low-level control, and how diffusion policies and 3D representations form the frontier of low-level control.

Part III (Chapters 7-9) addresses the core components of Agentic Robotics: memory and world representation, closed-loop systems, and strategies for bridging the simulation-to-reality gap.

Part IV (Chapter 10) synthesizes all seven dimensions for a final analysis of the fundamental differences between digital and physical agents, and looks to the future.

1.6 Agentic Coding as the Comparative Lens: Why This Framework

This book uses Agentic Coding as a consistent comparison axis for a simple reason: it is a world where the agentic loop already works. Claude Code generating code, executing it, analyzing errors, and fixing them is already production-ready. Understanding precisely why this succeeds clarifies what additional elements the physical world demands.

Four key conditions enable Agentic Coding, and the corresponding robotics efforts to transplant them are:

  1. Fast, precise feedback — Strengthening VLM-based failure diagnosis (REFLECT)
  2. Low-cost experimentation — Advancing simulation environments (SIMPLER)
  3. Easy recovery — Safety-first design (AutoRT's Robot Constitution)
  4. Structured state — Adopting scene graphs (SayPlan, KARMA)

The success of these four transplants will determine the maturity of Agentic Robotics. Each chapter traces how the relevant papers attempt them, how far they succeed, and what remains.
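The fourth condition, structured state, can be sketched concretely. The scene-graph format below is a minimal illustration of the general idea behind systems like SayPlan and KARMA, not the actual representation either paper uses:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                              # e.g. "room", "furniture", "object"
    children: list = field(default_factory=list)

def serialize(node, depth=0):
    """Flatten the graph into indented text an LLM prompt can consume."""
    lines = ["  " * depth + f"{node.kind}: {node.name}"]
    for child in node.children:
        lines.extend(serialize(child, depth + 1))
    return lines

kitchen = Node("kitchen", "room", [
    Node("counter", "furniture", [Node("cup", "object"), Node("sponge", "object")]),
])
print("\n".join(serialize(kitchen)))
```

The point of the exercise: a scene graph gives the physical world something like the file system's role in coding, a discrete, queryable structure the planner can read and reference, instead of raw pixels.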

1.7 Open Problems and Outlook

Three problems must be solved for Agentic Robotics to move beyond lab demos.

First, semantic translation of physical feedback. Converting sensor data into structured feedback that LLMs can understand is the most urgent challenge. Counterfactual reasoning ("what if the grip were stronger?") and failure RAG (retrieving similar past failures) are promising directions.
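The failure-RAG idea can be sketched in a few lines. The sketch below uses bag-of-words cosine similarity as a stand-in for a real embedding model, and the failure strings are invented examples:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

past_failures = [
    "cup slipped from gripper during lift, grip force too low",
    "arm collided with drawer while reaching, path not replanned",
    "object misidentified as mug, wrong grasp pose selected",
]

def retrieve(query, corpus, k=1):
    """Return the k past-failure records most similar to the new failure."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(doc.lower().split())), doc) for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True)[:k]]

best = retrieve("gripper dropped the cup while lifting", past_failures)
```

A production system would swap in learned embeddings and attach the retrieved record, with its diagnosed cause, to the LLM's context, so that a new failure arrives already paired with a candidate explanation.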

Second, balancing safety and autonomy. AutoRT's Robot Constitution [Brohan et al., 2024] is a start, but offers no guarantees against long-tail risks. A hierarchical safety architecture — combining hardware-level reflexive safety with software-level reasoning-based safety — is needed.
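The hierarchical architecture can be sketched as a two-layer gate. Everything here is an assumed illustration: the limits, the forbidden-target list (a stand-in for an LLM-based safety judge), and the function names are not drawn from AutoRT:

```python
def reflex_layer(cmd):
    """Hardware-level check: fast, unconditional limits, no reasoning involved."""
    return abs(cmd["velocity"]) <= 1.0 and cmd["force"] <= 40.0  # assumed limits

FORBIDDEN_TARGETS = {"knife", "stove"}  # stand-in for reasoning-based rules

def reasoning_layer(cmd):
    """Software-level check: slower, context-aware (an LLM judge in practice)."""
    return cmd["target"] not in FORBIDDEN_TARGETS

def gate(cmd):
    # The reflex layer can veto regardless of what the reasoning layer says,
    # mirroring how an e-stop overrides any high-level plan.
    if not reflex_layer(cmd):
        return "reflex_stop"
    if not reasoning_layer(cmd):
        return "policy_stop"
    return "allow"

gate({"target": "cup", "velocity": 0.3, "force": 10.0})  # -> "allow"
```

The design choice worth noting is the ordering: the cheap, guaranteed check runs first and cannot be argued around by the reasoning layer, which is exactly the property a reasoning-only safety scheme lacks.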

Third, establishing evaluation standards. Just as SWE-bench accelerated Agentic Coding, standardized benchmarks like CaP-X [Fu et al., 2026] are needed for Agentic Robotics. The current situation where each paper evaluates on its own environment and metrics prevents fair comparison and reproducibility.

Agentic Robotics shows a roughly 1-2 year paradigm lag behind Agentic Coding. But this lag is not simply a matter of technological maturation — it reflects the fundamental constraints imposed by the physical world. This book draws the precise map of those constraints and explores paths for overcoming them.

References

  1. Huang, W. et al., "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents," arXiv:2201.07207, 2022.
  2. Ahn, M. et al., "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances," arXiv:2204.01691, 2022.
  3. Liang, J. et al., "Code as Policies: Language Model Programs for Embodied Control," arXiv:2209.07753, 2022.
  4. Driess, D. et al., "PaLM-E: An Embodied Multimodal Language Model," arXiv:2303.03378, 2023.
  5. Brohan, A. et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control," arXiv:2307.15818, 2023.
  6. Chi, C. et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion," arXiv:2303.04137, 2023.
  7. Liu, Z. et al., "REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction," arXiv:2306.15724, 2023.
  8. Kim, M. J. et al., "OpenVLA: An Open-Source Vision-Language-Action Model," arXiv:2406.09246, 2024.
  9. Ghosh, D. et al., "Octo: An Open-Source Generalist Robot Policy," arXiv:2405.12213, 2024.
  10. Black, K. et al., "pi0: A Vision-Language-Action Flow Model for General Robot Control," arXiv:2410.24164, 2024.
  11. Li, X. et al., "Evaluating Real-World Robot Manipulation Policies in Simulation (SIMPLER)," arXiv:2405.05941, 2024.
  12. Brohan, A. et al., "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents," arXiv:2401.12963, 2024.
  13. Shah, M. et al., "BUMBLE: Unifying Reasoning and Acting with Vision-Language Models for Building-wide Mobile Manipulation," arXiv:2410.06237, 2024.
  14. Wang, Z. et al., "KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems," arXiv:2409.14908, 2024.
  15. Rana, K. et al., "SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning," arXiv:2307.06135, 2023.
  16. Fu, M. et al., "CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation," arXiv:2603.22435, 2026.