Xinze Li · Ziyue Zhu · Siyuan Liu · Yubo Ma · Yuhang Zang · Yixin Cao · Aixin Sun
EMemBench is a programmatic benchmark framework for evaluating episodic (experience-grounded) memory in interactive agents.
Instead of using a fixed, static QA set, EMemBench generates questions from each agent’s own interaction trajectory and computes verifiable ground-truth answers from underlying game signals.
This repo provides an end-to-end pipeline for:
- Jericho (text-only interactive fiction)
- Crafter (visual, partially observed survival & crafting)
EMemBench is not a single fixed dataset. It is a benchmark generator + evaluation harness: run an agent → log → generate QA with programmatic GT → answer & score.
Figure 1: EMemBench overview. An agent interacts with a game environment to produce an episode trajectory. We log both agent-observable signals and all underlying game signals. A carefully designed algorithm converts each episode into a QA set with programmatically computed ground truths, and the same agent then answers these questions using only agent-observable context plus its own memory.
- Trajectory-conditioned QA: questions are derived from the agent’s own interaction trace.
- Programmatic, verifiable ground truth: answers are computed from game signals / structured logs.
- Query Horizon Control (QHC): templates can optionally restrict evidence selection and answer computation to a prefix window (e.g., steps 1..50) to reduce confounds from variable episode lengths.
- Legacy naming note: the current code passes QHC values via flags named `--difficulties`/`--difficulty` and writes to folders like `DIF_-1` and `DIF_50`. These values correspond to QHC settings (see the prefix-filter sketch below).
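Conceptually, a QHC value acts as a prefix filter over the logged trajectory. The sketch below is hypothetical: the `step` field name and log layout are assumptions for illustration, not the repo's actual schema.

```python
import json

def load_prefix(log_path: str, qhc: int) -> list[dict]:
    """Keep only the logged steps visible under a given QHC value.

    qhc == -1 means no restriction (full episode); a positive value keeps
    steps 1..qhc so that evidence selection and ground-truth computation
    use the same prefix window.
    """
    steps = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            # "step" is an assumed field name for the 1-based step index.
            if qhc == -1 or record["step"] <= qhc:
                steps.append(record)
    return steps
```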
```
text_game/
  game_envs/                    # Jericho ROMs (.z3/.z5/...)
    advent.z5
    ...
    zork3.z5
  run_jericho_openai.py         # play + log
  generate_jericho_qa.py        # QA generation (+ indices/maps)
  answer_jericho_qa.py          # answer + eval
  run_text_game_pipeline.py     # E2E entry (play -> gen -> answer)
  logs/
    <game>/..._logs.jsonl
  generated_qa/
    <game>/<run_name>/
      DIF_-1/                   # legacy folder name = QHC=-1
      DIF_50/                   # legacy folder name = QHC=50
      ...
  eval/
    <game>/<run_name>/...

visual_game/
  instructions/
  run_crafter_openai.py         # play + log + frames + map file
  generate_crafter_qa.py        # QA generation
  answer_crafter_qa.py          # answer + eval
  run_visual_game_pipeline.py   # E2E entry (play -> gen -> answer)
  log/
    seed{SEED}/{RUN_NAME}/
      logs.jsonl
      map_seed{SEED}.txt
      frames/*.png
  generated_qa/
    seed{SEED}/{RUN_NAME}/
      qa_context.json
      DIF_-1/qa.jsonl           # legacy folder name = QHC=-1
      DIF_50/qa.jsonl           # legacy folder name = QHC=50
      ...
  eval/
    seed{SEED}/{RUN_NAME}/...
```
```bash
conda create -n emembench python=3.10
conda activate emembench
pip install -r requirements.txt
```

Jericho typically requires Linux plus basic build tools. Install Jericho and download the spaCy model:
```bash
pip install jericho
python -m spacy download en_core_web_sm
```

You must place Jericho ROM files under `text_game/game_envs/` (they are not included in this repo).
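A quick way to confirm the ROMs are in place is to load one with Jericho directly (the ROM filename below is just an example from the directory listing above):

```python
from jericho import FrotzEnv

# Any ROM you placed under text_game/game_envs/ works here.
env = FrotzEnv("text_game/game_envs/zork3.z5")
obs, info = env.reset()
print(obs[:200])   # opening text of the game
env.close()
```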
```bash
pip install crafter
```
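Similarly, a minimal check that Crafter is installed and renders observations (the seed is just an example):

```python
import crafter

env = crafter.Env(seed=42)   # deterministic world layout for a given seed
obs = env.reset()
print(obs.shape)             # (64, 64, 3) RGB observation by default
```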
The provided runners assume an OpenAI-compatible chat API.

```bash
export OPENAI_API_KEY="YOUR_KEY"
# Optional (if your code supports OpenAI-compatible endpoints):
export OPENAI_BASE_URL="https://YOUR_ENDPOINT"
# Optional:
export OPENAI_MODEL="gpt-5.1"
```
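For reference, this is roughly how an OpenAI-compatible client picks up these variables. This is a minimal sketch using the standard `openai` Python package; the repo's runners may construct their client differently.

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    # base_url=None falls back to the default OpenAI endpoint.
    base_url=os.environ.get("OPENAI_BASE_URL"),
)

response = client.chat.completions.create(
    model=os.environ.get("OPENAI_MODEL", "gpt-5.1"),
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```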
From the `text_game/` directory (or repo root, depending on your working directory):

```bash
python run_text_game_pipeline.py \
  --model gpt-5.1 \
  --max-steps 200 \
  --history-turns 30 \
  --difficulties -1 50 \
  --max-per-type 2 \
  --logs-root logs \
  --qa-root generated_qa
```

What it does (per game):
- Play & log →
logs/<game>/*_logs.jsonl - Generate QA (QHC values) →
generated_qa/<game>/<run_name>/DIF_* - Answer & evaluate →
eval/<game>/<run_name>/...
Notes
- `--history-turns` controls how many recent turns are included in the policy prompt during play.
- The list of games is defined in `run_text_game_pipeline.py` (edit `JERICHO_GAMES` to run more or fewer titles; a hypothetical example of its shape is shown below).
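The game list is a plain Python constant, so extending it looks something like the following. The exact entries and format in the script may differ; treat this as a hypothetical shape, with ROM names taken from the directory listing above.

```python
# Inside run_text_game_pipeline.py (hypothetical contents):
JERICHO_GAMES = [
    "advent.z5",
    "zork3.z5",
    # add or remove ROM filenames from text_game/game_envs/ here
]
```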
From the `visual_game/` directory (or repo root):
```bash
python run_visual_game_pipeline.py \
  --seeds 1 42 43 100 123 \
  --steps 500 \
  --history-turns 10 \
  --difficulties -1 50 \
  --qa-source paraphrase \
  --qa-temperature 0.0 \
  --qa-max-tokens 4096 \
  --batch-size 8 \
  --frames-mode mosaic
```

Override the answering model (optional):
```bash
python run_visual_game_pipeline.py \
  --seeds 42 \
  --qa-model gpt-5.1
```

Notes
- `--frames-mode` controls how frames are packaged into evaluation prompts (`mosaic` is typically the most economical; see the sketch below).
- Outputs are grouped by seed: `log/seed{SEED}/{RUN_NAME}/...`
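To illustrate what `mosaic` packaging means, the sketch below tiles sampled frames into a single grid image so one image slot in the prompt covers many timesteps. It is a hypothetical reimplementation for intuition, not the repo's exact code.

```python
from PIL import Image

def make_mosaic(frame_paths: list[str], cols: int = 5, tile: int = 128) -> Image.Image:
    """Tile frames left-to-right, top-to-bottom into one grid image."""
    rows = (len(frame_paths) + cols - 1) // cols
    canvas = Image.new("RGB", (cols * tile, rows * tile))
    for i, path in enumerate(frame_paths):
        frame = Image.open(path).convert("RGB").resize((tile, tile))
        canvas.paste(frame, ((i % cols) * tile, (i // cols) * tile))
    return canvas

# Example (illustrative path): mosaic of every 10th logged frame.
# make_mosaic(sorted(glob.glob("log/seed42/run/frames/*.png"))[::10]).save("mosaic.png")
```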
- Jericho: `logs/<game>/*_logs.jsonl`
- Crafter: `log/seed{SEED}/{RUN_NAME}/logs.jsonl` + `frames/` + `map_seed{SEED}.txt`
- `qa_context.json`: agent-observable context used to build evaluation prompts
- `qa.jsonl`: one QA per line (question, metadata, GT answer, evidence pointers, etc.; see the sketch below)
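For downstream analysis, each `qa.jsonl` file can be read line by line. The field name `type` below is an assumption for illustration; check a generated file for the actual keys.

```python
import json
from collections import Counter

def qa_type_counts(path: str) -> Counter:
    """Count generated questions per question type in one qa.jsonl file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            counts[item.get("type", "unknown")] += 1
    return counts

# Example (illustrative path):
# print(qa_type_counts("generated_qa/seed42/run/DIF_-1/qa.jsonl"))
```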
- per-question predictions: `answers.jsonl` (or equivalent)
- aggregated metrics: `index.json` (or equivalent)
- Jericho: https://github.com/microsoft/jericho
- Crafter: https://github.com/danijar/crafter
```bibtex
@misc{li2026emembenchinteractivebenchmarkingepisodic,
  title={EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents},
  author={Xinze Li and Ziyue Zhu and Siyuan Liu and Yubo Ma and Yuhang Zang and Yixin Cao and Aixin Sun},
  year={2026},
  eprint={2601.16690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.16690},
}
```
Usage and License Notices: The data and code are intended and licensed for research use only.
License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Use must also comply with OpenAI's Terms of Use: https://openai.com/policies/terms-of-use
