Reproducing results across AI agent memory systems is hard: different LLMs, embeddings, token budgets, and scoring methods make comparisons almost meaningless.
We built MemEval, an open-source benchmark that evaluates memory systems under standardized conditions and tracks token efficiency. While benchmarking, we discovered recurring failure modes, which led to PropMem, a factual memory system designed to address them efficiently.
Both projects are open source and ready for evaluation, extension, or collaboration.
Try it out:
- Blog: https://medium.com/prosus-ai-tech-blog/memeval-benchmarking-memory-for-ai-agents-932d3fd9f3b4
- Code: https://github.com/ProsusAI/MemEval (benchmark suite for evaluating agent and LLM memory systems)
We would love to hear how the community benchmarks or improves agent memory systems!