Part I: Foundations — LLM Meets Robotics

Chapter 2: LLM as Planner — Zero-Shot Planning and Grounding

Written: 2026-04-08 Last updated: 2026-04-08

Summary

In 2022, pretrained LLMs were shown to serve as high-level planners for robots. This transition progressed through three stages — generation, grounding, and structuring — each precisely targeting the limitations of its predecessor. However, the fundamental inadequacy of text-only control for the physical world became clear, triggering two successor paradigms: Code as Policy and VLA.

2.1 Introduction: The Discovery of World Knowledge in LLMs

In early 2022, researchers made a striking discovery. LLMs trained on internet text could decompose instructions like "make breakfast" into action sequences such as "open fridge, take out milk, close fridge, ..." This capability was not explicitly trained but emerged from "world knowledge" implicitly encoded in internet text.

This discovery transformed robot planning. Previously, experts had to manually design formal planning systems like PDDL or behavior trees. Each new task required new planning logic, creating a fundamental bottleneck for general-purpose robotics. LLMs showed the potential to bypass this bottleneck entirely.

Yet the question "can LLMs plan?" quickly evolved into the harder question: "can those plans actually be executed in the physical world?" The trajectory of that evolution is the central thread of this chapter.

2.2 Language Models as Zero-Shot Planners: The Starting Point

^[1] is the origin point for LLM-based robot planning.

Figure 2.1: Three approaches to LLM-based planning — Zero-Shot Planning via Causal LLM, Translation to Admissible Actions, and Step-by-Step Autoregressive Generation. Source: Huang et al. (2022), Fig. 2

Three key techniques were introduced. Admissible Action Parsing semantically maps the LLM's free-form output to the environment's set of permissible actions. Autoregressive Trajectory Correction uses the outcome of previous actions as feedback to autoregressively correct subsequent actions. Dynamic Example Selection dynamically selects the most similar examples for few-shot prompting.

The results were both encouraging and revealing. Executability improved dramatically from 18% to 79%, but semantic correctness remained at just 32.87% (LCS metric). Executable but incorrect plans — this previewed the fundamental dilemma of LLM planners.

The most critical limitation was the absence of grounding. The LLM could not observe the current state of the physical environment. It might plan "take milk from the fridge" when no milk exists. Unlike code agents that can instantly read the file system, this observation-action gap was the first clear evidence of a fundamental challenge in physical planning.

2.3 SayCan: Grounding in the Physical World

Figure 2.2: A SayCan robot manipulating objects in a real kitchen. By combining 551 pretrained skills with affordance scoring, it executes physically grounded plans. Source: Ahn et al. (2022)

SayCan ^[2] directly attacked the grounding problem by combining the LLM's semantic knowledge ("Say") with a robot affordance function ("Can").

The core idea is a product of probabilities:

P(\text{action} \mid \text{instruction}) = P_{\text{LLM}}(\text{useful} \mid \text{instruction, action}) \times P_{\text{affordance}}(\text{feasible} \mid \text{state, action})

The LLM scores each skill's usefulness for the task, while the affordance function scores each skill's feasibility in the current environment. The action with the highest combined score is selected from a library of 551 pretrained robot skills.

Evaluated on 101 tasks in a real kitchen environment, SayCan with PaLM 540B achieved 84% planning success and 74% execution success, reducing the error rate by 50% compared to FLAN. It successfully executed tasks requiring up to 8 steps, including contextual requests like "I spilled coke on the table, how can you help me clean it up?"

The "Say + Can" paradigm became the standard framework for subsequent research. It is the direct ancestor of Google's robot foundation model lineage — RT-1, RT-2, PaLM-E — and was scaled to fleet management with AutoRT (2024) (see Chapter 8).

However, SayCan's limitations were also clear: a closed skill set of 551 skills requiring additional training to extend, static affordances that could not reflect real-time environmental changes, and limited replanning upon failure. The contrast with Agentic Coding is stark — adding a new tool requires only a tool description, not retraining.

2.4 SayPlan: Structuring the Environment Representation

SayPlan ^[3] directly targeted SayCan's scale limitations. Its key insight: the environment representation determines the scale of LLM planning.

SayPlan exploits the hierarchical structure of 3D Scene Graphs (3DSG) — building, floor, room, object. Rather than feeding the entire graph to the LLM, it performs Hierarchical Semantic Search, selectively expanding only the task-relevant subgraph. Integration with a classical path planner reduces the LLM's planning horizon, and an Iterative Replanning Pipeline corrects infeasible actions.

The results were impressive: successful long-horizon task planning in a 3-floor building with 36 rooms and 140+ objects. Iterative replanning corrected most hallucination errors, though 6.67% of tasks retained uncorrectable hallucinated nodes.

SayPlan's most important legacy is establishing the scene graph + LLM planning combination. This combination feeds directly into KARMA's long-term memory system ^[5], Embodied-RAG's spatial-semantic retrieval ^[6], and VeriGraph's execution verification ^[7] (see Chapter 7).

2.5 The Evolutionary Arc: From Generation to Structure

The evolutionary trajectory across these three papers is clear:

Stage	Key Achievement	Remaining Limitation
Generation (LLM as Planners)	LLMs can generate plans	Not grounded in physical world
Grounding (SayCan)	Affordances provide grounding	Limited space, static environment
Structuring (SayPlan)	Scene graphs enable scaling	No dynamic environment tracking, residual hallucination

Each stage precisely resolved its predecessor's core limitation while exposing new ones. The challenges that all three stages left unresolved triggered subsequent research directions:

"What if plans were expressed as code?" — Code as Policy (see Chapter 3)
"What if the LLM directly output actions?" — VLA (see Chapter 4)
"What if high-level planning and low-level control were separated?" — Hierarchical Planning (see Chapter 5)
"What if we could remember dynamic environments?" — Memory & Scene Graph (see Chapter 7)
"What if we could detect and recover from failures?" — Agentic Systems (see Chapter 8)

2.6 Comparison with Agentic Coding: The Determinism of Tool Calls

The core problem these three papers wrestle with is grounding — connecting LLM plans to the physical world. Examining why this problem does not arise in Agentic Coding reveals the fundamental difference.

When an LLM in Agentic Coding plans "read the file," the command cat file.txt executes instantly, deterministically, and completely. Tool availability is verifiable through API documentation or type signatures, and results are immediately returned as text. Tool calls are deterministic — that is the key.

In SayCan, whether a robot skill can be executed depends on the physical environment state and requires a separately trained affordance model. The same "pick up cup" skill has varying success probabilities depending on the cup's position, orientation, and surface properties. Tool calls are stochastic. SayPlan's introduction of 3D scene graphs is structurally similar to drilling down from tree to a specific file to a specific function in a codebase, but constructing and maintaining a physical scene graph is expensive and requires constant updates.

This difference is not merely technical. The determinism of digital tool calls enables infinite "try, fail, retry" iterations. The stochasticity of physical execution means each attempt carries cost and risk, requiring "estimate success probability before trying." SayCan's affordance scoring was the first attempt at precisely this estimation.

2.7 Open Problems and Outlook

The LLM Planner paradigm leaves three fundamental open problems.

First, the absence of closed-loop replanning. None of the three papers offer robust recovery from execution failures. SayPlan's iterative replanning is the most advanced attempt, but it does not address physical execution failures. This challenge is taken up by REFLECT ^[4] and BUMBLE ^[9] (see Chapter 8).

Second, dynamic environment tracking. All three papers assume static or quasi-static environments. There is no mechanism to detect and incorporate changes made by other people or robots. KARMA's short-term memory system ^[5] represents the first step in this direction (see Chapter 7).

Third, complete hallucination elimination. SayPlan's residual 6.67% hallucination rate reveals an inherent limitation of LLM-based planning. AutoTAMP's formal verification ^[8] and VeriGraph's scene-graph-based verification ^[7] address this but do not achieve complete solutions (see Chapters 5 and 7).

The conclusion of a recent survey ^[11] is clear: LLMs are inadequate as standalone planners but powerful when combined with traditional planning methods. The hybrid approach — combining LLM flexibility with the rigor of classical planning — is the most promising direction, explored further in the hierarchical planning of Chapter 5.

References

Huang, W. et al., "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents," arXiv:2201.07207, 2022. scholar
Ahn, M. et al., "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances," arXiv:2204.01691, 2022. scholar
Rana, K. et al., "SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning," arXiv:2307.06135, 2023. scholar
Liu, Z. et al., "REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction," arXiv:2306.15724, 2023. scholar
Wang, Z. et al., "KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems," arXiv:2409.14908, 2024. scholar
Xie, Q. et al., "Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation," arXiv:2409.18313, 2024. scholar
Ekpo, D. et al., "VeriGraph: Scene Graphs for Execution Verifiable Robot Planning," arXiv:2411.10446, 2024. scholar
Chen, Y. et al., "AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers," arXiv:2306.06531, 2023. scholar
Shah, M. et al., "BUMBLE: Unifying Reasoning and Acting with Vision-Language Models for Building-wide Mobile Manipulation," arXiv:2410.06237, 2024. scholar
Brohan, A. et al., "AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents," arXiv:2401.12963, 2024. scholar
Survey, "A Survey on Large Language Models for Automated Planning," arXiv:2502.12435, 2025. scholar