Humans rely on memory to perform tasks; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminately subsampling the history introduces irrelevant or redundant information. We propose a hierarchical policy framework in which the high-level policy is trained to select and track previous task-relevant keyframes from its experience. The high-level policy uses the selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. In our experiments, we fine-tune Qwen2.5-VL-3B-Instruct and \(\pi_{0.5}\) as the high-level and low-level policies, respectively, using demonstrations supplemented with minimal language annotations. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory.
The high-level policy processes the task instruction, the selected keyframes (if any), and recent images from the base and wrist-mounted cameras to generate a low-level language subtask and candidate keyframes (if any). The low-level policy uses the subtask, the current image, and the robot joint states to produce actions. The candidate keyframes are passed through the keyframe filter, which produces the selected keyframes used as input at the next inference step.
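As a concrete illustration, one step of this loop might look like the Python sketch below. The names used here (`MemERState`, `run_step`, `high_level_policy`, `low_level_policy`, `keyframe_filter`, `frame_buffer`) are hypothetical stand-ins for the fine-tuned high-level VLM, the low-level VLA, and the clustering-based filter; this is a sketch of the described interface, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

import numpy as np


@dataclass
class MemERState:
    """Running memory: selected keyframes plus all nominations so far."""
    selected_keyframes: list = field(default_factory=list)
    nominations: list = field(default_factory=list)  # nominated frame indices


def run_step(
    task: str,
    recent_frames: Sequence[np.ndarray],   # latest base + wrist images
    joint_states: np.ndarray,
    frame_buffer: Sequence[np.ndarray],    # all frames observed so far
    state: MemERState,
    high_level_policy: Callable,           # -> (subtask text, candidate indices)
    low_level_policy: Callable,            # -> action
    keyframe_filter: Callable,             # -> selected frame indices
) -> np.ndarray:
    # 1. High-level policy: text subtask plus candidate keyframe nominations,
    #    conditioned on the task, selected keyframes, and recent frames.
    subtask, candidates = high_level_policy(
        task, state.selected_keyframes, recent_frames
    )
    state.nominations.extend(candidates)

    # 2. Keyframe filter: cluster the nominations and keep one frame per
    #    cluster; these become the selected keyframes for the next step.
    state.selected_keyframes = [
        frame_buffer[i] for i in keyframe_filter(state.nominations)
    ]

    # 3. Low-level policy: action from the subtask, the current image, and
    #    the robot joint states.
    return low_level_policy(subtask, recent_frames[-1], joint_states)
```

Keeping the full nomination history in `MemERState` lets the filter re-cluster as new candidates arrive, matching the per-timestep aggregation described in the next figure.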
At each timestep, the high-level policy nominates candidate keyframe(s), highlighted in orange. Candidate keyframes are aggregated across time with 1D single-linkage clustering using a merge distance of \(d=5\) frames, yielding disjoint clusters. Within each cluster, the colored bars mark nominations of the observation at that timestamp, with bar height proportional to the number of nominations received. We select one representative frame per cluster by taking the median of its candidate keyframes and add that frame to memory.
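A minimal sketch of this filter is given below, assuming nominations arrive as integer frame indices (with repeats when the same observation is nominated at several timesteps). The helper name `select_keyframes` is hypothetical, and using `statistics.median_low` so the representative is always an actual nominated frame is our own rounding choice rather than a detail stated above.

```python
import statistics


def select_keyframes(nominations: list[int], merge_dist: int = 5) -> list[int]:
    """Cluster nominated frame indices with 1D single-linkage and return
    one representative (median) index per cluster."""
    if not nominations:
        return []
    ordered = sorted(nominations)
    clusters = [[ordered[0]]]
    for idx in ordered[1:]:
        # Single-linkage in 1D: a nomination joins the current cluster if it
        # lies within `merge_dist` frames of the cluster's nearest (last) member.
        if idx - clusters[-1][-1] <= merge_dist:
            clusters[-1].append(idx)
        else:
            clusters.append([idx])
    # Repeated nominations weight the median toward heavily nominated frames.
    return [statistics.median_low(cluster) for cluster in clusters]
```

For example, nominations `[12, 13, 13, 14, 40, 42, 90]` form three clusters under \(d=5\) and yield the representatives 13, 40, and 90.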
MemER successfully completes a range of long-horizon tasks that require memory, using a single policy.
MemER successfully completes the tasks despite errors and retries incurred during execution, which cause the historical context to appear out of distribution.
(Figure annotations: per-rollout outcomes on the three tasks, reported as objects retrieved and optimal paths taken out of 3, scoops of peanuts and scoops of jelly beans, and objects replaced and shelves dusted out of 2.)