Part II: The VLA Revolution

Chapter 5: Hierarchical Planning — From High-Level to Low-Level

Written: 2026-04-08 Last updated: 2026-04-08

Summary

Between the instruction "make a sandwich" and joint torque commands lies a vast abstraction gap. LLM Planners excel at high-level planning but cannot perform precise motor control; VLAs directly output actions but struggle with complex reasoning. This chapter examines four approaches to hierarchically bridging these two worlds: formal specification (AutoTAMP), language motion (RT-H), real-time feedback integration (Hi Robot), and off-domain data utilization (HAMSTER).

5.1 Introduction: Why Hierarchy Is Necessary

In Agentic Coding, hierarchical abstraction comes naturally. Given "fix this bug," an agent first explores relevant files (high-level), identifies the specific function (mid-level), and edits code (low-level). Each level's actions are discrete and interpretable, with IDE autocomplete handling low-level details.

In robotics, this hierarchy is not self-evident. The distance from the high-level instruction "pick up the cup" to continuous torque commands across 7 degrees of freedom is far greater than in coding. And what to use as the intermediate representation — formal logic? language motion? 2D paths? atomic commands? — is a critical design decision.

Five reasons consistently cited across all hierarchical approaches: (1) abstraction level mismatch, (2) data efficiency through separate training, (3) off-domain data utilization, (4) ease of human intervention, and (5) debuggability and interpretability.

5.2 AutoTAMP: The Formal Verification Path

Figure 5.1: AutoTAMP's "LLM-As-Translator & Checker" pipeline. Natural language instructions are translated into formal specifications (STL/PDDL), executed by classical TAMP algorithms, with automatic syntactic and semantic error verification and correction. Source: Chen et al. (2023)

AutoTAMP ^[1] uses the LLM not as a direct planner but as a translator from natural language to formal specifications (Signal Temporal Logic, etc.) and a checker. The key insight: "it is better to use the LLM as a preprocessor and verifier for planners than as the planner itself."

The LLM translates natural language task descriptions into formal representations like STL, which traditional TAMP algorithms consume to produce task and motion plans. Autoregressive Re-prompting automatically detects and corrects both syntactic and semantic errors.

Results were strong: GPT-4 + AutoTAMP achieved 82.5-87.7% on single-agent tasks and 100% on multi-agent tasks with full AutoTAMP, significantly outperforming direct LLM planning on geometrically and temporally constrained tasks.

AutoTAMP's core idea — "LLM as translator, classical methods as executor" — was directly inherited by Code-as-Symbolic-Planner ^[6] (see Chapter 3).

5.3 RT-H: Language as Intermediate Representation

RT-H ^[2] proposes a completely different intermediate representation: language motion. Fine-grained language descriptions like "move arm forward" and "close gripper" bridge high-level tasks and low-level actions.

The key insight is that semantically different tasks may share similar low-level motions. "Pick up the cup" and "pour from the bottle" are entirely different tasks, but "extend arm forward" is a shared low-level motion. Language motion makes this shared structure explicit, improving data efficiency.

RT-H additionally enables in-execution human correction — users can modify robot behavior in real-time at the language motion level, naturally integrating corrections like "not that one, move left."

5.4 Hi Robot: Alignment with Human Intent

Hi Robot ^[3] adds real-time human feedback integration to hierarchical VLAs. A high-level VLM interprets current observations and user utterances to generate atomic commands ("grasp the cup"), which a low-level policy executes.

Hi Robot's core contribution is real-time response to user corrections like "not that one." Evaluated on single-arm, bimanual, and mobile platforms across scenario-based tasks (table clearing, sandwich making, grocery shopping), it outperformed both API-based VLMs and flat VLAs on human intent alignment and task success.

5.5 HAMSTER: Leveraging Off-Domain Data

HAMSTER ^[4] delivers a key message: hierarchical separation enables not just abstraction but off-domain data utilization.

A high-level VLM predicts 2D end-effector paths from RGB images and task descriptions, while a low-level 3D-aware policy performs precise manipulation along these paths. Critically, the high-level VLM can be fine-tuned with action-free videos, hand-drawn sketches, and simulation data — cheap alternatives to expensive robot data.

On real robots, HAMSTER achieved +20% average success rate over OpenVLA (7 generalization axes), a 50% relative improvement, demonstrating that off-domain data can overcome gaps in embodiment, dynamics, visual appearance, and task semantics.

5.6 Four Forms of Hierarchical Separation

Comparing the four approaches:

Form	High-Level	Intermediate	Low-Level	Core Strength
Formal spec (AutoTAMP)	LLM → STL/PDDL	Formal logic	TAMP solver	Verifiability
Language motion (RT-H)	Task instruction	Language motion	Action output	Data efficiency, human correction
Atomic command (Hi Robot)	VLM → atomic cmd	Atomic commands	Low-level policy	Real-time human feedback
2D path (HAMSTER)	VLM → 2D path	2D path	3D-aware policy	Off-domain data utilization

GR00T N1's Dual-System Architecture (see Chapter 4) implements this hierarchical separation at the architecture level: System 2 (VLM) corresponds to the high-level of Hi Robot/HAMSTER, System 1 (Diffusion Transformer) to the low-level policy.

5.7 Comparison with Agentic Coding: The Naturalness of Abstraction

Understanding why hierarchical separation is so difficult in robotics but natural in Agentic Coding reveals a fundamental difference.

In Agentic Coding, abstraction layers are intrinsic to programming languages. Functions, classes, modules, packages — each level of abstraction is built into language design, and IDE autocomplete smoothes transitions between layers. The path from high-level intent ("build an HTTP server") to low-level implementation (socket.bind()) is well-paved by frameworks and libraries.

In robotics, these abstractions must be manually designed. RT-H's language motion, HAMSTER's 2D paths, Hi Robot's atomic commands — each is a researcher-designed intermediate representation. CaP-X's discovery of "dependence on human-crafted abstraction" (see Chapter 3) points precisely to this problem.

AutoTAMP's formal verification approach is particularly interesting. The pattern "LLM generates code, linter/compiler/tests verify" from Agentic Coding is structurally identical to "LLM generates formal spec, TAMP solver verifies" from AutoTAMP. The difference: code verification has a mature toolchain (type checkers, linters, test frameworks), while robot TAMP verification is still nascent.

5.8 Open Problems and Outlook

The most fundamental open problem is determining the optimal number of layers and intermediate representation. Currently 2-3 layers predominate, but complex long-horizon tasks may require more. The optimal choice of intermediate representation (language? code? 2D path? formal logic?) may be task- and environment-dependent.

The second problem is inter-layer information loss. Information inevitably degrades as it passes through abstraction layers. VLM inference latency becoming a system bottleneck in Hi Robot illustrates how interface design between layers determines overall system performance.

A promising convergence direction is code-based hierarchical planning. Combining Code-as-Symbolic-Planner's "code as solver/planner/checker" with HAMSTER/Hi Robot's "VLM high-level, policy low-level" structure could let code serve as a verifiable intermediate representation that simultaneously handles information transfer and verification between layers.

References

Chen, Y. et al., "AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers," arXiv:2306.06531, 2023. scholar
Belkhale, S. et al., "RT-H: Action Hierarchies Using Language," arXiv:2403.01823, 2024. scholar
Shi, L. X. et al., "Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models," arXiv:2502.19417, 2025. scholar
Li, J. et al., "HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation," arXiv:2502.05485, 2025. scholar
NVIDIA, "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots," arXiv:2503.14734, 2025. scholar
Chen, Y. et al., "Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation," arXiv:2503.01700, 2025. scholar
Kim, M. J. et al., "OpenVLA: An Open-Source Vision-Language-Action Model," arXiv:2406.09246, 2024. scholar
Fu, M. et al., "CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation," arXiv:2603.22435, 2026. scholar