Chapter 7: Memory and World Representation
Summary
The difference between a one-shot manipulation ("pick up the cup") and a long-horizon task ("clean the kitchen") is memory. Tracking environmental changes across dozens of steps, remembering past actions, and maintaining information beyond the field of view are all essential. This chapter examines how robots should represent and remember the physical world — from 3D scene graphs through LTM/STM separation, spatial-semantic retrieval, action-conditioned graphs, and execution verification. This trajectory corresponds to Agentic Coding's file system and context window while carrying fundamentally different demands.
7.1 Introduction: The Memory Demands of the Physical World
In Agentic Coding, "memory" is straightforward. Long-term memory is permanently preserved in the file system; short-term memory lives in the context window. Everything is instantly searchable via grep or find, and git log tracks every change. The codebase is fully observable, and when working alone, the environment does not change on its own.
The physical world imposes four unique memory demands. Partial observability: robots see only what is in camera view; they must remember objects that have left the field of view. Environmental dynamism: other people or robots move things; changes must be detected and memory updated. Task longevity: in a 20-step task, completed and remaining steps must be tracked. Irreversibility: pre-execution verification is critical, making accurate environmental memory a precondition for safe action.
7.2 KARMA: Separating Long-Term and Short-Term Memory
KARMA [Wang et al., 2024] transplants the human cognitive distinction between long-term and short-term memory into robots.
Long-term Memory is implemented as a comprehensive 3D scene graph of the environment, capturing spatial structure and object relations. Short-term Memory records real-time changes in object positions and states. Memory-augmented Prompting enriches LLM prompts with information from both memory systems.
Results in AI2-THOR were dramatic: a 1.3x success-rate and 3.4x efficiency improvement on Composite Tasks, and a 2.3x success-rate and 62.7x efficiency improvement on Complex Tasks. Few results demonstrate the value of memory for long-horizon tasks this overwhelmingly.
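KARMA's three components can be sketched as a small memory layer feeding an LLM prompt. This is a minimal illustrative sketch, not the paper's implementation: the class names, the dict-based scene-graph schema, and the prompt format are all assumptions.

```python
# Illustrative sketch of KARMA-style memory (hypothetical schema, not the paper's code).
from dataclasses import dataclass, field
import time

@dataclass
class LongTermMemory:
    """LTM: a static scene graph mapping object -> location and relations."""
    graph: dict = field(default_factory=dict)

    def add(self, obj, location, relations=()):
        self.graph[obj] = {"location": location, "relations": list(relations)}

@dataclass
class ShortTermMemory:
    """STM: a bounded log of recent object-state changes, newest last."""
    events: list = field(default_factory=list)
    capacity: int = 10

    def observe(self, obj, state):
        self.events.append((time.time(), obj, state))
        self.events = self.events[-self.capacity:]

def memory_augmented_prompt(task, ltm, stm):
    """Memory-augmented prompting: enrich the task prompt with both memories."""
    layout = "\n".join(f"- {o} is in {v['location']}" for o, v in ltm.graph.items())
    changes = "\n".join(f"- {o}: {s}" for _, o, s in stm.events)
    return f"Task: {task}\nKnown layout:\n{layout}\nRecent changes:\n{changes}"
```

The key design point is the update asymmetry: the LTM scene graph is built once and amended rarely, while the STM log absorbs the high-frequency churn of a dynamic environment.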
7.3 Embodied-RAG: Retrieval-Augmented Generation for the Physical World
Embodied-RAG [Xie et al., 2024] extends text document RAG to the physical world. The key is a spatial-semantic hierarchy.
A Topological Map represents the environment's topological structure (room-corridor-building), while a Semantic Forest provides hierarchical semantic representations (overall atmosphere, regions, individual objects). Information is retrieved at the spatial-semantic level appropriate to each query — "where is the red cup?" at the object level, "which room has a comfortable atmosphere?" at the region level.
This is structurally similar to Agentic Coding's RAG — indexing codebases with vector embeddings and retrieving relevant code — but with a fundamental difference: code RAG is text-embedding-based with exact file-path locations, whereas Embodied-RAG combines 3D spatial coordinates with semantic hierarchies and may require the robot to physically explore to obtain information.
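The level-routing idea above can be sketched in a few lines. This is an illustrative toy, not the paper's system: the `Node` schema is an assumption, and lexical overlap stands in for real embedding similarity.

```python
# Toy sketch of spatial-semantic hierarchical retrieval (not the paper's code).
from dataclasses import dataclass

@dataclass
class Node:
    level: str          # "building" | "region" | "object"
    description: str
    position: tuple     # (x, y, z) in the topological map

def similarity(query, text):
    """Lexical overlap as a stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(forest, query, level):
    """Answer the query at the requested semantic level of the forest."""
    candidates = [n for n in forest if n.level == level]
    return max(candidates, key=lambda n: similarity(query, n.description))

forest = [
    Node("region", "cozy living room with warm lighting", (0, 0, 0)),
    Node("region", "bright tiled kitchen", (5, 0, 0)),
    Node("object", "red cup on the kitchen counter", (5.2, 1.1, 0.9)),
]
hit = retrieve(forest, "where is the red cup", "object")
```

Note that the result carries 3D coordinates rather than a file path; if no node matches, a real system would fall back to physical exploration.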
7.4 RoboEXP: Action-Conditioned Scene Graphs
RoboEXP [Jiang et al., 2024] goes beyond static scene graphs. It builds an Action-Conditioned Scene Graph (ACSG) through autonomous exploration — not just "what is visible" but "how it can be manipulated," including affordance information like "this drawer can be opened" and "this object can be grasped."
This corresponds to dynamic analysis in Agentic Coding — not just reading code (static analysis) but executing it to collect runtime information. RoboEXP's "you have to touch it to know" exploration is the physical world's counterpart to dynamic analysis.
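The essence of an ACSG can be sketched as a graph whose nodes carry affordances and whose hidden contents are revealed only by executing an affordance. This is a hypothetical minimal data structure for illustration; class and method names are assumptions, not RoboEXP's API.

```python
# Hypothetical sketch of an action-conditioned scene graph (ACSG).
class ACSG:
    def __init__(self):
        self.nodes = {}    # name -> {"affordances": set, "hidden": bool}
        self.edges = []    # (parent, relation, child)

    def add_node(self, name, affordances=(), hidden=False):
        self.nodes[name] = {"affordances": set(affordances), "hidden": hidden}

    def add_edge(self, parent, relation, child):
        self.edges.append((parent, relation, child))

    def explore(self, name, action):
        """Execute an affordance; objects contained in the node become visible."""
        if action not in self.nodes[name]["affordances"]:
            raise ValueError(f"{name} does not afford {action}")
        for parent, relation, child in self.edges:
            if parent == name and relation == "contains":
                self.nodes[child]["hidden"] = False

g = ACSG()
g.add_node("drawer", affordances={"open"})
g.add_node("scissors", affordances={"grasp"}, hidden=True)
g.add_edge("drawer", "contains", "scissors")
g.explore("drawer", "open")   # the scissors exist in the graph only after interaction
```

A static scene graph would never contain the scissors at all; the action-conditioned version encodes both that the drawer can be opened and what opening it reveals.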
7.5 VeriGraph: Pre-Execution Verification
VeriGraph [Ekpo et al., 2024] uses scene graphs as a plan verification tool. It forms an iterative verification loop: if a VLM-generated action sequence violates scene graph constraints, it is regenerated; otherwise, it is executed. Results showed +58% on language-based tasks and +30% on image-based tasks.
This maps precisely to Agentic Coding's CI/CD pipeline: generate code, run tests, fix on failure, retry. The difference: code tests take milliseconds; robot action verification requires simulation or execution, taking far longer.
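The verify-regenerate loop can be sketched against a symbolic state. This is an illustrative skeleton, not VeriGraph's implementation: the precondition rules, the containment table, and the stub planner are all assumptions standing in for scene-graph constraints and a VLM.

```python
# Sketch of a VeriGraph-style verification loop (illustrative assumptions throughout).
CONTAINS = {"drawer": ["cup"]}   # containment facts from the scene graph

def verify(plan, state):
    """Simulate the plan over a symbolic state; return the first violation or None."""
    s = dict(state)
    for action, obj in plan:
        if action == "pick":
            if s.get(obj) != "reachable":
                return f"cannot pick {obj}: {s.get(obj)}"
            s[obj] = "held"
        elif action == "open":
            s[obj] = "open"
            for inner in CONTAINS.get(obj, []):
                s[inner] = "reachable"   # opening a container exposes its contents
    return None

def plan_with_verification(planner, state, max_attempts=3):
    """Generate, verify, and regenerate until a plan passes or attempts run out."""
    feedback = None
    for _ in range(max_attempts):
        plan = planner(feedback)
        feedback = verify(plan, state)
        if feedback is None:
            return plan              # only verified plans reach the robot
    raise RuntimeError("no verifiable plan found")

def stub_planner(feedback):
    """Stands in for a VLM: first proposal is invalid, feedback fixes it."""
    if feedback is None:
        return [("pick", "cup")]
    return [("open", "drawer"), ("pick", "cup")]

state = {"cup": "inside_closed_drawer", "drawer": "closed"}
plan = plan_with_verification(stub_planner, state)
```

The loop mirrors a CI pipeline exactly: the violation message plays the role of a failing test log fed back to the generator.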
7.6 The Evolutionary Arc of Scene Graphs
| Stage | Representative | Key Addition |
|---|---|---|
| Static 3D scene graph | SayPlan (2023) | Hierarchical search for scaling |
| Action-conditioned | RoboEXP (2024) | Affordance information |
| Memory system | KARMA (2024) | LTM/STM separation |
| Verification tool | VeriGraph (2024) | Pre-execution plan verification |
| Combined with retrieval | Embodied-RAG (2024) | Spatial-semantic hierarchy search |
Scene graphs evolved from simple environmental representations to a core data structure simultaneously serving as planning interface, long-term memory, verification tool, and search index.
7.7 Comparison with Agentic Coding: The Cost of Observation
The most fundamental difference in memory and world representation is observation cost.
| Dimension | Agentic Coding | Embodied Robotics |
|---|---|---|
| Long-term memory | CLAUDE.md, project docs (free access) | 3D scene graph (high construction cost) |
| Short-term memory | Context window (auto-maintained) | Real-time state tracking required |
| Search | Text embeddings, grep (instant) | Spatial-semantic hierarchy (exploration needed) |
| Exploration | File reads (instant, free) | Physical exploration (time, energy) |
| Updates | Automatic on file modification | Re-exploration/re-observation needed |
Reading all files in a codebase takes milliseconds. Building a 3D scene graph of an entire building requires the robot to physically visit every room. This cost differential determines the fundamental constraints of memory architecture design.
7.8 Open Problems and Outlook
First, real-time scene graph updates in dynamic environments where other agents continuously modify the space. Second, the memory capacity vs speed tradeoff — rich 3D scene graphs are information-dense but expensive to store and query, especially within real-time control loops (10-100Hz). Third, integrating multiple memory systems — episodic ("last time the red cup was at the sink"), semantic ("cups usually go on shelves"), and procedural ("to grab a cup, grasp the handle") — remains an open problem that RoboMemory [2025] has begun to explore.
The most promising direction is standardization and modularization of scene graphs — if KARMA's memory, VeriGraph's verification, and Embodied-RAG's retrieval could operate as modules on a common scene graph standard, it could become the "operating system" of Agentic Robotics.
References
- Wang, Z. et al., "KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems," arXiv:2409.14908, 2024.
- Xie, Q. et al., "Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation," arXiv:2409.18313, 2024.
- Jiang, H. et al., "RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation," arXiv:2402.15487, 2024.
- Ekpo, D. et al., "VeriGraph: Scene Graphs for Execution Verifiable Robot Planning," arXiv:2411.10446, 2024.
- Rana, K. et al., "SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning," arXiv:2307.06135, 2023.
- MoMa-LLM, "Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation," arXiv:2403.08605, 2024.
- 3D-Mem, "3D Scene Memory for Embodied Exploration and Reasoning," arXiv:2411.17735, 2024.
- "RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Interactive Environmental Learning in Physical Embodied Systems," arXiv:2508.01415, 2025.
- Chen, Y. et al., "AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers," arXiv:2306.06531, 2023.