Humans rely on memory to perform tasks; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminately subsampling the history introduces irrelevant or redundant information. We propose a hierarchical policy framework in which the high-level policy is trained to select and track previous task-relevant keyframes from its experience. The high-level policy uses the selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. In our experiments, we fine-tune Qwen2.5-VL-3B-Instruct and \(\pi_{0.5}\) as the high-level and low-level policies, respectively, using demonstrations supplemented with minimal language annotations. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory.
The high-level policy processes the task instruction, the selected keyframes (if any), and recent images from the base and wrist-mounted cameras to generate a low-level language subtask and candidate keyframes (if any). The low-level policy uses the subtask, the current image, and the robot joint states to produce actions. The candidate keyframes are passed through the keyframe filter, which produces the selected keyframes used as input at the next inference step.
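As a concrete illustration, one step of this loop might look like the Python sketch below. The names used here (`MemERState`, `run_step`, `high_level_policy`, `low_level_policy`, `keyframe_filter`, `frame_buffer`) are hypothetical stand-ins for the fine-tuned high-level VLM, the low-level VLA, and the clustering-based filter; this is a sketch of the described interface, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

import numpy as np


@dataclass
class MemERState:
    """Running memory: selected keyframes plus all nominations so far."""
    selected_keyframes: list = field(default_factory=list)
    nominations: list = field(default_factory=list)  # nominated frame indices


def run_step(
    task: str,
    recent_frames: Sequence[np.ndarray],   # latest base + wrist images
    joint_states: np.ndarray,
    frame_buffer: Sequence[np.ndarray],    # all frames observed so far
    state: MemERState,
    high_level_policy: Callable,           # -> (subtask text, candidate indices)
    low_level_policy: Callable,            # -> action
    keyframe_filter: Callable,             # -> selected frame indices
) -> np.ndarray:
    # 1. High-level policy: text subtask plus candidate keyframe nominations,
    #    conditioned on the task, selected keyframes, and recent frames.
    subtask, candidates = high_level_policy(
        task, state.selected_keyframes, recent_frames
    )
    state.nominations.extend(candidates)

    # 2. Keyframe filter: cluster the nominations and keep one frame per
    #    cluster; these become the selected keyframes for the next step.
    state.selected_keyframes = [
        frame_buffer[i] for i in keyframe_filter(state.nominations)
    ]

    # 3. Low-level policy: action from the subtask, the current image, and
    #    the robot joint states.
    return low_level_policy(subtask, recent_frames[-1], joint_states)
```

Keeping the full nomination history in `MemERState` lets the filter re-cluster as new candidates arrive, matching the per-timestep aggregation described in the next figure.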
At each timestep, the high-level policy nominates candidate keyframe(s), highlighted in orange. Candidate keyframes are aggregated across time with 1D single-linkage clustering using a merge distance of \(d=5\) frames, yielding disjoint clusters. Within each cluster, the colored bars mark nominations of the observation at that timestamp, with bar height proportional to the number of nominations received. We select one representative frame per cluster by taking the median of its candidate keyframes and add that frame to memory.
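A minimal sketch of this filter is given below, assuming nominations arrive as integer frame indices (with repeats when the same observation is nominated at several timesteps). The helper name `select_keyframes` is hypothetical, and using `statistics.median_low` so the representative is always an actual nominated frame is our own rounding choice rather than a detail stated above.

```python
import statistics


def select_keyframes(nominations: list[int], merge_dist: int = 5) -> list[int]:
    """Cluster nominated frame indices with 1D single-linkage and return
    one representative (median) index per cluster."""
    if not nominations:
        return []
    ordered = sorted(nominations)
    clusters = [[ordered[0]]]
    for idx in ordered[1:]:
        # Single-linkage in 1D: a nomination joins the current cluster if it
        # lies within `merge_dist` frames of the cluster's nearest (last) member.
        if idx - clusters[-1][-1] <= merge_dist:
            clusters[-1].append(idx)
        else:
            clusters.append([idx])
    # Repeated nominations weight the median toward heavily nominated frames.
    return [statistics.median_low(cluster) for cluster in clusters]
```

For example, nominations `[12, 13, 13, 14, 40, 42, 90]` form three clusters under \(d=5\) and yield the representatives 13, 40, and 90.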
MemER successfully completes a range of long-horizon tasks that require memory, using a single policy.
MemER successfully completes the tasks despite errors and retries incurred during execution, which cause the historical context to appear out of distribution.
(Figure annotations: per-rollout outcomes on the three tasks, reported as objects retrieved and optimal paths taken out of 3, scoops of peanuts and scoops of jelly beans, and objects replaced and shelves dusted out of 2.)