Analysis updated 2026-05-18
Run AutoMem's pre-evolved scaffolds on Crafter or MiniHack to reproduce the paper's results using a locally served Qwen2.5-32B model.
Apply Loop 1 scaffold optimization to automatically improve an LLM agent's memory management code on a custom task environment.
Use Loop 2 to fine-tune a memory specialist model on your agent's own traces, then evaluate the two-model configuration against the baseline.
| | autolearnmem/automem | cortex-ai-network/crypto-arbitrage-bot-automated-trading | madguyevans-creator/resale-agent-skill-hub |
|---|---|---|---|
| Stars | 32 | 32 | 32 |
| Language | Python | Python | Python |
| Setup difficulty | hard | moderate | moderate |
| Complexity | 5/5 | 2/5 | 3/5 |
| Audience | researcher | general | vibe coder |
Figures from each repo's GitHub metadata at analysis time.
Requires three separate component installs (BALROG, LLaMA-Factory, Claude Code CLI) plus a GPU server running vLLM for the base model.
AutoMem is an AI research project that asks a specific question: can a language model agent learn to manage its own memory as a skill? Instead of storing information in a fixed, pre-designed memory system, the agent maintains a directory of text files and decides for itself what to record, when to look something up, and how to organize what it knows. These file operations (logging what just happened, consulting past notes before acting) are part of the agent's action space alongside the actual task actions.

Two outer improvement loops run over time. The first loop (scaffold optimization) has a powerful meta-LLM read the agent's complete game traces, diagnose where memory use went wrong, and rewrite the agent's code, prompts, and memory schema to fix the problems. A revision is kept only if it improves average task performance on a fixed test set. The second loop (memory-proficiency training) uses the meta-LLM to select good examples of memory operations from the base model's own traces, then fine-tunes a separate, smaller model (a memory specialist) on those examples using LoRA. At inference time, the memory specialist handles logging and consulting notes, while the original, unmodified model handles the actual task actions.

The system was evaluated on three challenging long-horizon games: Crafter (a 2D crafting game), MiniHack (procedurally generated dungeons), and NetHack (a complex roguelike). Using Qwen2.5-32B-Instruct as the base model, AutoMem achieved performance competitive with frontier systems by improving memory alone, without changing how the model handles gameplay decisions.

Setting up AutoMem requires three components: the BALROG benchmark harness for running the game environments, LLaMA-Factory for LoRA fine-tuning in Loop 2, and the Claude Code CLI for the meta-LLM that drives both optimization loops. The base model is served via vLLM. This is academic research code. It is not a plug-in for general LLM applications; it is a research framework for studying how agents can learn to use memory more effectively in long-horizon sequential tasks.
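The Loop 1 acceptance rule described above (keep a revision only if it improves average performance on a fixed test set) can be sketched as follows. This is a minimal illustration, not AutoMem's actual code: `optimize_scaffold`, `propose`, `evaluate`, and the toy stand-ins are hypothetical names, and the scaffold is reduced to a plain string so the example runs without a game environment or a meta-LLM.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def optimize_scaffold(
    scaffold: str,
    propose: Callable[[str], str],
    evaluate: Callable[[str, int], float],
    test_seeds: Sequence[int],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Keep a proposed revision only if it raises the mean score on a fixed test set."""
    best_score = mean(evaluate(scaffold, seed) for seed in test_seeds)
    for _ in range(rounds):
        # In the real system a meta-LLM would read traces and rewrite the scaffold here.
        candidate = propose(scaffold)
        score = mean(evaluate(candidate, seed) for seed in test_seeds)
        if score > best_score:  # strict improvement required; otherwise discard
            scaffold, best_score = candidate, score
    return scaffold, best_score


# Toy stand-ins (assumptions for illustration only).
def toy_propose(scaffold: str) -> str:
    return scaffold + "x"  # pretend each revision appends one change


def toy_evaluate(scaffold: str, seed: int) -> float:
    return float(min(len(scaffold), 5))  # score saturates at 5; seed is ignored


final_scaffold, final_score = optimize_scaffold(
    "v0", toy_propose, toy_evaluate, test_seeds=[0, 1, 2], rounds=6
)
```

With these toy functions, revisions stop being accepted once the score saturates, so later proposals are discarded and the last improving scaffold is returned.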
AutoMem is an AI research framework that teaches LLM agents to manage memory as a trainable skill using two optimization loops: one that rewrites the agent scaffold and one that fine-tunes a dedicated memory specialist with LoRA.
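To make the file-based memory and the two-model split concrete, here is a small sketch under stated assumptions. The class `MemoryDir`, the operation names `log` and `consult`, the `route` function, and the handler labels are all hypothetical; the repository's real interfaces may look quite different. The sketch only shows the idea that memory operations are ordinary actions on a directory of text files, and that those actions go to a memory specialist while task actions go to the base model.

```python
import tempfile
from pathlib import Path

# Hypothetical names for the memory actions in the agent's action space.
MEMORY_OPS = {"log", "consult"}


class MemoryDir:
    """A directory of plain text files that the agent reads and appends to."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def log(self, name: str, note: str) -> None:
        # Append one note per line to the named file, creating it if needed.
        with (self.root / name).open("a", encoding="utf-8") as f:
            f.write(note + "\n")

    def consult(self, name: str) -> str:
        # Return the file's contents, or an empty string if nothing was logged.
        path = self.root / name
        return path.read_text(encoding="utf-8") if path.exists() else ""


def route(op: str) -> str:
    """Send memory operations to the specialist and everything else to the base model."""
    return "memory_specialist" if op in MEMORY_OPS else "base_model"


memory = MemoryDir(Path(tempfile.mkdtemp()) / "memory")
memory.log("crafting.txt", "wood pickaxe needs a nearby table")  # illustrative note
memory.log("crafting.txt", "stone is south of the start")  # illustrative note
notes = memory.consult("crafting.txt")
handlers = [route(op) for op in ("log", "consult", "move_north")]
```

Keeping the routing decision in one small function mirrors the description above: the base model stays unmodified, and only the memory operations are handed to the fine-tuned specialist.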
Mainly Python. The stack also includes PyTorch and LoRA.
No license is stated in this repository.
Setup difficulty is rated hard, with roughly one day or more to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.