zexuehe/memoryarena

★ 15 | Python | Audience · researcher | Complexity · 4/5 | Setup · hard

Mindmap

mindmap
  root((repo))
    Agents
      Task input
      Action output
      Multi-provider LLMs
    Memory Systems
      Long context window
      BM25 keyword search
      Text embeddings
      GraphRAG
      Third-party services
    Environments
      Web shopping
      Travel planning
      Web search
      Formal reasoning
    Benchmark Design
      Cross-session memory
      Interdependent tasks
      Comparative eval
    Setup
      API keys needed
      Per-env instructions
      Preview release


Things people build with this

USE CASE 1

Evaluate how different memory systems help AI agents carry knowledge from one session to the next

USE CASE 2

Compare retrieval approaches like keyword search, embeddings, and graph-based memory side by side

USE CASE 3

Test AI agent performance on realistic tasks like shopping and travel planning that require memory

USE CASE 4

Use as a starting point for building or improving memory systems for your own AI agent

Tech stack

Python, OpenAI API, Anthropic API, Google AI API, OpenRouter, BM25, GraphRAG, Mem0

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires API keys for OpenAI, Anthropic, Google, and OpenRouter, plus separate keys for any third-party memory services. Each benchmark environment has its own setup instructions in separate markdown files.
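Before a long benchmark run, it can help to confirm that every provider key is present. The snippet below is a minimal sketch of such a check. The environment variable names are assumptions based on common provider conventions, not names confirmed by the repository, and `missing_keys` is a hypothetical helper; check the per-environment setup files for the names the code actually reads.

```python
import os

# ASSUMPTION: these variable names follow common provider conventions.
# The repository may expect different names; verify against its setup docs.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "OPENROUTER_API_KEY",
]


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required key names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All provider keys are set.")
```

Keys for third-party memory services would be added to the same list once their expected names are known.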

No license is stated. All rights are reserved by default: you can read the code, but you cannot legally reuse or modify it without permission from the authors.

In plain English

MemoryArena is the code release for an academic research paper that benchmarks how well AI agents remember information across multiple separate task sessions. The core research question is: if an AI agent completes a task in one session, and a later task depends on what it learned or did earlier, how reliably does the agent carry that memory forward? The paper introduces a suite of tasks designed so that sessions are interdependent, making memory a critical factor in performance.

The codebase is a Python framework with three main parts: agents (which take in a task and produce actions), environments (which execute those actions and return observations), and memory systems (which store and retrieve information between steps or sessions). The flow for each task step is: the memory system wraps the incoming task prompt with relevant stored context, the agent generates an action, the environment executes it, and the result is stored back into memory for future steps.

Several memory approaches are included so they can be compared against each other. These range from simply giving the agent a long context window, to retrieval systems based on keyword search (BM25) or text embeddings, to graph-based retrieval (GraphRAG), to third-party memory services (Letta, Mirix, Mem0). The benchmark environments cover web shopping, travel planning, web search, and formal reasoning tasks.

Running the code requires API keys for multiple AI providers (OpenAI, Anthropic, Google, OpenRouter) as well as separate keys for any third-party memory services used. Setup instructions for each environment are in separate markdown files in the repository. The README describes this as a preview version that is still being actively maintained. No license is stated in the README.
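The per-step flow described above (memory wraps the prompt, the agent acts, the environment executes, the result is stored) can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `BM25Memory`, `stub_agent`, `stub_environment`, and `run_step` are illustrative names, not the repository's classes, and the BM25 scoring is a plain textbook variant rather than the project's implementation.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def tokenize(text: str) -> list[str]:
    return text.lower().split()


@dataclass
class BM25Memory:
    """Hypothetical keyword memory: stores text, retrieves by BM25 score."""
    k1: float = 1.5
    b: float = 0.75
    entries: list[str] = field(default_factory=list)

    def store(self, text: str) -> None:
        self.entries.append(text)

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        if not self.entries:
            return []
        docs = [tokenize(e) for e in self.entries]
        n = len(docs)
        avg_len = sum(len(d) for d in docs) / n or 1.0
        # Document frequency: number of entries containing each term.
        df = Counter(term for d in docs for term in set(d))
        scored = []
        for text, doc in zip(self.entries, docs):
            tf = Counter(doc)
            score = 0.0
            for term in set(tokenize(query)):
                if term not in tf:
                    continue
                idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
                norm = tf[term] + self.k1 * (1 - self.b + self.b * len(doc) / avg_len)
                score += idf * tf[term] * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]

    def wrap(self, task: str) -> str:
        """Prefix the task prompt with retrieved context, if any."""
        context = self.retrieve(task)
        if not context:
            return task
        return "Relevant memory:\n" + "\n".join(context) + "\n\nTask: " + task


def stub_agent(prompt: str) -> str:
    """Stand-in for an LLM call; a real agent would query a model provider."""
    if prompt.startswith("Relevant memory:"):
        return "act_with_memory"
    return "act_without_memory"


def stub_environment(action: str) -> str:
    """Stand-in for a benchmark environment that returns an observation."""
    return f"observation for {action}"


def run_step(memory: BM25Memory, task: str) -> str:
    prompt = memory.wrap(task)               # 1. memory wraps the task prompt
    action = stub_agent(prompt)              # 2. agent generates an action
    observation = stub_environment(action)   # 3. environment executes it
    memory.store(f"task: {task} | action: {action} | observation: {observation}")  # 4. store
    return action
```

Calling `run_step` twice on the same `BM25Memory`, first with "buy a red tent" and then with "plan a trip using the red tent", shows the dependency between sessions: the first call has nothing to retrieve, while the second prompt is wrapped with the stored record of the first because the two tasks share keywords.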

Copy-paste prompts

Prompt 1

I'm looking at the MemoryArena benchmark. Can you explain how the memory system wraps a task prompt with stored context before sending it to the agent?

Prompt 2

In MemoryArena, what is the difference between the BM25, embedding-based, and GraphRAG memory retrieval approaches, and when would each perform best?

Prompt 3

How do I add a new benchmark environment to MemoryArena? Walk me through the steps based on how existing environments are structured.

Prompt 4

I want to run MemoryArena with Mem0 as the memory backend. What API keys do I need and how do I configure them?

Prompt 5

Explain the evaluation methodology in MemoryArena. How does it measure whether an agent successfully used memory from a previous session?


Verify against the repo before relying on details.