Chapter 7: Memory and World Representation
Summary
The difference between a one-shot manipulation ("pick up the cup") and a long-horizon task ("clean the kitchen") is memory. Tracking environmental changes across dozens of steps, remembering past actions, and maintaining information beyond the field of view are all essential. This chapter examines how robots should represent and remember the physical world — from 3D scene graphs through LTM/STM separation, spatial-semantic retrieval, action-conditioned graphs, and execution verification. This trajectory corresponds to Agentic Coding's file system and context window while carrying fundamentally different demands.
7.1 Introduction: The Memory Demands of the Physical World
In Agentic Coding, "memory" is straightforward. Long-term memory is permanently preserved in the file system; short-term memory lives in the context window. Everything is instantly searchable via grep or find, and git log tracks every change. The codebase is fully observable, and when working alone, the environment does not change on its own.
The physical world imposes four unique memory demands. Partial observability: robots see only what is in camera view; they must remember objects that have left the field of view. Environmental dynamism: other people or robots move things; changes must be detected and memory updated. Task longevity: in a 20-step task, completed and remaining steps must be tracked. Irreversibility: pre-execution verification is critical, making accurate environmental memory a precondition for safe action.
7.2 KARMA: Separating Long-Term and Short-Term Memory
KARMA [Wang et al., 2024] transplants the human cognitive distinction between long-term and short-term memory into robots.
Long-term Memory is implemented as a comprehensive 3D scene graph of the environment, capturing spatial structure and object relations. Short-term Memory records real-time changes in object positions and states. Memory-augmented Prompting enriches LLM prompts with information from both memory systems.
Results in AI2-THOR were dramatic: a 1.3x success-rate and 3.4x efficiency improvement on Composite Tasks, and a 2.3x success-rate and 62.7x efficiency improvement on Complex Tasks. Few results demonstrate the value of memory for long-horizon tasks this overwhelmingly.
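KARMA's three components can be sketched as a small memory layer feeding an LLM prompt. This is a minimal illustrative sketch, not the paper's implementation: the class names, the dict-based scene-graph schema, and the prompt format are all assumptions.

```python
# Illustrative sketch of KARMA-style memory (hypothetical schema, not the paper's code).
from dataclasses import dataclass, field
import time

@dataclass
class LongTermMemory:
    """LTM: a static scene graph mapping object -> location and relations."""
    graph: dict = field(default_factory=dict)

    def add(self, obj, location, relations=()):
        self.graph[obj] = {"location": location, "relations": list(relations)}

@dataclass
class ShortTermMemory:
    """STM: a bounded log of recent object-state changes, newest last."""
    events: list = field(default_factory=list)
    capacity: int = 10

    def observe(self, obj, state):
        self.events.append((time.time(), obj, state))
        self.events = self.events[-self.capacity:]

def memory_augmented_prompt(task, ltm, stm):
    """Memory-augmented prompting: enrich the task prompt with both memories."""
    layout = "\n".join(f"- {o} is in {v['location']}" for o, v in ltm.graph.items())
    changes = "\n".join(f"- {o}: {s}" for _, o, s in stm.events)
    return f"Task: {task}\nKnown layout:\n{layout}\nRecent changes:\n{changes}"
```

The key design point is the update asymmetry: the LTM scene graph is built once and amended rarely, while the STM log absorbs the high-frequency churn of a dynamic environment.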
7.3 Embodied-RAG: Retrieval-Augmented Generation for the Physical World
Embodied-RAG [Xie et al., 2024] extends text document RAG to the physical world. The key is a spatial-semantic hierarchy.
A Topological Map represents the environment's topological structure (room-corridor-building), while a Semantic Forest provides hierarchical semantic representations (overall atmosphere, regions, individual objects). Information is retrieved at the spatial-semantic level appropriate to each query — "where is the red cup?" at the object level, "which room has a comfortable atmosphere?" at the region level.
This is structurally similar to Agentic Coding's RAG — indexing codebases with vector embeddings and retrieving relevant code — but with a fundamental difference: code RAG is text-embedding-based with exact file-path locations, whereas Embodied-RAG combines 3D spatial coordinates with semantic hierarchies and may require the robot to physically explore to obtain information.
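The level-routing idea above can be sketched in a few lines. This is an illustrative toy, not the paper's system: the `Node` schema is an assumption, and lexical overlap stands in for real embedding similarity.

```python
# Toy sketch of spatial-semantic hierarchical retrieval (not the paper's code).
from dataclasses import dataclass

@dataclass
class Node:
    level: str          # "building" | "region" | "object"
    description: str
    position: tuple     # (x, y, z) in the topological map

def similarity(query, text):
    """Lexical overlap as a stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(forest, query, level):
    """Answer the query at the requested semantic level of the forest."""
    candidates = [n for n in forest if n.level == level]
    return max(candidates, key=lambda n: similarity(query, n.description))

forest = [
    Node("region", "cozy living room with warm lighting", (0, 0, 0)),
    Node("region", "bright tiled kitchen", (5, 0, 0)),
    Node("object", "red cup on the kitchen counter", (5.2, 1.1, 0.9)),
]
hit = retrieve(forest, "where is the red cup", "object")
```

Note that the result carries 3D coordinates rather than a file path; if no node matches, a real system would fall back to physical exploration.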
7.4 RoboEXP: Action-Conditioned Scene Graphs
RoboEXP [Jiang et al., 2024] goes beyond static scene graphs. It builds an Action-Conditioned Scene Graph (ACSG) through autonomous exploration — not just "what is visible" but "how it can be manipulated," including affordance information like "this drawer can be opened" and "this object can be grasped."
This corresponds to dynamic analysis in Agentic Coding — not just reading code (static analysis) but executing it to collect runtime information. RoboEXP's "you have to touch it to know" exploration is the physical world's counterpart to dynamic analysis.
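The essence of an ACSG can be sketched as a graph whose nodes carry affordances and whose hidden contents are revealed only by executing an affordance. This is a hypothetical minimal data structure for illustration; class and method names are assumptions, not RoboEXP's API.

```python
# Hypothetical sketch of an action-conditioned scene graph (ACSG).
class ACSG:
    def __init__(self):
        self.nodes = {}    # name -> {"affordances": set, "hidden": bool}
        self.edges = []    # (parent, relation, child)

    def add_node(self, name, affordances=(), hidden=False):
        self.nodes[name] = {"affordances": set(affordances), "hidden": hidden}

    def add_edge(self, parent, relation, child):
        self.edges.append((parent, relation, child))

    def explore(self, name, action):
        """Execute an affordance; objects contained in the node become visible."""
        if action not in self.nodes[name]["affordances"]:
            raise ValueError(f"{name} does not afford {action}")
        for parent, relation, child in self.edges:
            if parent == name and relation == "contains":
                self.nodes[child]["hidden"] = False

g = ACSG()
g.add_node("drawer", affordances={"open"})
g.add_node("scissors", affordances={"grasp"}, hidden=True)
g.add_edge("drawer", "contains", "scissors")
g.explore("drawer", "open")   # the scissors exist in the graph only after interaction
```

A static scene graph would never contain the scissors at all; the action-conditioned version encodes both that the drawer can be opened and what opening it reveals.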
7.5 VeriGraph: Pre-Execution Verification
VeriGraph [Ekpo et al., 2024] uses scene graphs as a plan verification tool. It forms an iterative verification loop: if a VLM-generated action sequence violates scene graph constraints, it is regenerated; otherwise, it is executed. Results showed +58% on language-based tasks and +30% on image-based tasks.
This maps precisely to Agentic Coding's CI/CD pipeline: generate code, run tests, fix on failure, retry. The difference: code tests take milliseconds; robot action verification requires simulation or execution, taking far longer.
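The verify-regenerate loop can be sketched against a symbolic state. This is an illustrative skeleton, not VeriGraph's implementation: the precondition rules, the containment table, and the stub planner are all assumptions standing in for scene-graph constraints and a VLM.

```python
# Sketch of a VeriGraph-style verification loop (illustrative assumptions throughout).
CONTAINS = {"drawer": ["cup"]}   # containment facts from the scene graph

def verify(plan, state):
    """Simulate the plan over a symbolic state; return the first violation or None."""
    s = dict(state)
    for action, obj in plan:
        if action == "pick":
            if s.get(obj) != "reachable":
                return f"cannot pick {obj}: {s.get(obj)}"
            s[obj] = "held"
        elif action == "open":
            s[obj] = "open"
            for inner in CONTAINS.get(obj, []):
                s[inner] = "reachable"   # opening a container exposes its contents
    return None

def plan_with_verification(planner, state, max_attempts=3):
    """Generate, verify, and regenerate until a plan passes or attempts run out."""
    feedback = None
    for _ in range(max_attempts):
        plan = planner(feedback)
        feedback = verify(plan, state)
        if feedback is None:
            return plan              # only verified plans reach the robot
    raise RuntimeError("no verifiable plan found")

def stub_planner(feedback):
    """Stands in for a VLM: first proposal is invalid, feedback fixes it."""
    if feedback is None:
        return [("pick", "cup")]
    return [("open", "drawer"), ("pick", "cup")]

state = {"cup": "inside_closed_drawer", "drawer": "closed"}
plan = plan_with_verification(stub_planner, state)
```

The loop mirrors a CI pipeline exactly: the violation message plays the role of a failing test log fed back to the generator.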
7.6 The Evolutionary Arc of Scene Graphs
| Stage | Representative | Key Addition |
|---|---|---|
| Static 3D scene graph | SayPlan (2023) | Hierarchical search for scaling |
| Action-conditioned | RoboEXP (2024) | Affordance information |
| Memory system | KARMA (2024) | LTM/STM separation |
| Verification tool | VeriGraph (2024) | Pre-execution plan verification |
| Combined with retrieval | Embodied-RAG (2024) | Spatial-semantic hierarchy search |
Scene graphs evolved from simple environmental representations to a core data structure simultaneously serving as planning interface, long-term memory, verification tool, and search index.
7.7 Comparison with Agentic Coding: The Cost of Observation
The most fundamental difference in memory and world representation is observation cost.
| Dimension | Agentic Coding | Embodied Robotics |
|---|---|---|
| Long-term memory | CLAUDE.md, project docs (free access) | 3D scene graph (high construction cost) |
| Short-term memory | Context window (auto-maintained) | Real-time state tracking required |
| Search | Text embeddings, grep (instant) | Spatial-semantic hierarchy (exploration needed) |
| Exploration | File reads (instant, free) | Physical exploration (time, energy) |
| Updates | Automatic on file modification | Re-exploration/re-observation needed |
Reading all files in a codebase takes milliseconds. Building a 3D scene graph of an entire building requires the robot to physically visit every room. This cost differential determines the fundamental constraints of memory architecture design.
7.8 Open Problems and Outlook
First, real-time scene graph updates in dynamic environments where other agents continuously modify the space. Second, the memory capacity vs speed tradeoff — rich 3D scene graphs are information-dense but expensive to store and query, especially within real-time control loops (10-100Hz). Third, integrating multiple memory systems — episodic ("last time the red cup was at the sink"), semantic ("cups usually go on shelves"), and procedural ("to grab a cup, grasp the handle") — remains an open problem that RoboMemory [2025] has begun to explore.
The most promising direction is standardization and modularization of scene graphs — if KARMA's memory, VeriGraph's verification, and Embodied-RAG's retrieval could operate as modules on a common scene graph standard, it could become the "operating system" of Agentic Robotics.
References
- Wang, Z. et al., "KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems," arXiv:2409.14908, 2024.
- Xie, Q. et al., "Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation," arXiv:2409.18313, 2024.
- Jiang, H. et al., "RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation," arXiv:2402.15487, 2024.
- Ekpo, D. et al., "VeriGraph: Scene Graphs for Execution Verifiable Robot Planning," arXiv:2411.10446, 2024.
- Rana, K. et al., "SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning," arXiv:2307.06135, 2023.
- MoMa-LLM, "Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation," arXiv:2403.08605, 2024.
- 3D-Mem, "3D Scene Memory for Embodied Exploration and Reasoning," arXiv:2411.17735, 2024.
- "RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Interactive Environmental Learning in Physical Embodied Systems," arXiv:2508.01415, 2025.
- Chen, Y. et al., "AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers," arXiv:2306.06531, 2023.