I built an open-weights memory system that reaches 80.1% on the LoCoMo benchmark

2 points by ViktorKuz 3 hours ago

I’ve been experimenting with long-term memory architectures for agent systems and wanted to share some technical results that might be useful to others working on retrieval pipelines.

Benchmark: LoCoMo (10 runs × 10 conversation sets)
Average accuracy: 80.1%
Setup: full isolation across all 10 conversation groups (no cross-contamination, no shared memory between runs)

Architecture (all open weights except answer generation)

1. Dense retrieval

BGE-large-en-v1.5 (1024d)

FAISS IndexFlatIP

Standard BGE instruction prompt: “Represent this sentence for searching relevant passages.”
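
A rough sketch of this stage (not the exact code from the repo; sentence-transformers + faiss are assumed as the wrappers, and the corpus/query below are placeholders):

    # Dense retrieval: BGE-large-en-v1.5 embeddings in a FAISS inner-product index
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-large-en-v1.5")        # 1024-d embeddings
    corpus = ["...conversation memory chunk 1...", "...chunk 2..."]  # placeholder documents

    # Encode passages without the instruction; normalize so inner product == cosine
    doc_emb = model.encode(corpus, normalize_embeddings=True)
    index = faiss.IndexFlatIP(doc_emb.shape[1])
    index.add(doc_emb)

    # Queries get the BGE instruction prefix, which matters a lot for recall
    question = "When did Alice move to Berlin?"                  # placeholder query
    q_emb = model.encode(
        ["Represent this sentence for searching relevant passages: " + question],
        normalize_embeddings=True,
    )
    dense_scores, dense_ids = index.search(q_emb, min(30, len(corpus)))  # illustrative top-30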

2. Sparse retrieval

BM25 via classic inverted index

Helps with low-embedding-recall queries and keyword-heavy prompts
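
A sketch of the sparse leg, with rank_bm25 standing in for the classic inverted index (the actual implementation may differ; corpus and question carry over from the dense sketch):

    # Sparse retrieval: BM25 over whitespace-tokenized documents
    from rank_bm25 import BM25Okapi

    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)

    query_tokens = question.lower().split()
    bm25_scores = bm25.get_scores(query_tokens)                  # one score per document
    sparse_top = sorted(range(len(corpus)),
                        key=lambda i: bm25_scores[i], reverse=True)[:30]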

3. MCA (Multi-Component Aggregation) ranking

A simple gravitational-style score combining:

keyword coverage

token importance

local frequency signal

MCA acts as a first-pass filter to catch exact-match questions. Threshold: coverage ≥ 0.1 → keep top-30.
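
The exact MCA formula isn't spelled out here, so treat the following as a guess at its shape: coverage gates the document, then coverage, IDF-style token importance, and local term frequency combine into one score. Only the 0.1 threshold and the top-30 cutoff come from the description above:

    # Hypothetical MCA-style first-pass filter (the real scoring function may differ)
    import math
    from collections import Counter

    N = len(tokenized_corpus)
    df = Counter(t for doc in tokenized_corpus for t in set(doc))
    idf = {t: math.log(N / df[t]) for t in df}

    def mca_score(q_tokens, d_tokens):
        counts = Counter(d_tokens)
        matched = [t for t in set(q_tokens) if counts[t] > 0]
        coverage = len(matched) / max(len(set(q_tokens)), 1)     # keyword coverage
        if coverage < 0.1:                                       # stated threshold
            return 0.0
        importance = sum(idf.get(t, 0.0) for t in matched)       # token importance
        local_freq = sum(counts[t] for t in matched) / max(len(d_tokens), 1)  # local frequency signal
        return coverage * importance * (1.0 + local_freq)

    mca_top = sorted(range(N),
                     key=lambda i: mca_score(query_tokens, tokenized_corpus[i]),
                     reverse=True)[:30]                          # keep top-30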

4. Union strategy

Instead of aggressively reducing the union, the system feeds 112–135 documents directly to a re-ranker. In practice this improved stability and prevented loss of rare but crucial documents.
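
In code, the union step can be as simple as an order-preserving dedup of the three candidate lists, with no truncation before reranking (variables carried over from the sketches above):

    # Union of dense, sparse and MCA candidates; dedup, but do not cut anything yet
    candidate_ids = [int(i) for i in dense_ids[0] if i >= 0] + sparse_top + mca_top
    union_ids = list(dict.fromkeys(candidate_ids))               # order-preserving dedup
    union_docs = [corpus[i] for i in union_ids]                  # ~112-135 docs per query in practice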

5. Cross-Encoder reranking

bge-reranker-v2-m3

Processes the full union (rare for RAG pipelines, but worked best here)

Produces a final top-k used for answer generation
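
A sketch of this stage via the FlagEmbedding wrapper, which is one common way to run bge-reranker-v2-m3 (the repo may use a different loader, and top_k here is illustrative):

    # Cross-encoder reranking over the full union with bge-reranker-v2-m3
    from FlagEmbedding import FlagReranker

    reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)  # fp16 fits comfortably on a 4090
    pairs = [[question, doc] for doc in union_docs]
    rerank_scores = reranker.compute_score(pairs)

    top_k = 10                                                   # illustrative; the post only says "top-k"
    final_docs = [doc for _, doc in sorted(zip(rerank_scores, union_docs),
                                           key=lambda x: x[0], reverse=True)[:top_k]]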

6. Answer generation

GPT-4o-mini, used only for the final synthesis step

No agent chain, no tool calls, no memory-dependent LLM logic
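
The last step is a single completion call; a minimal sketch with the OpenAI Python client, with prompt wording that is illustrative rather than the project's actual prompt:

    # Answer generation: one GPT-4o-mini call over the reranked context, no agents or tools
    from openai import OpenAI

    client = OpenAI()
    context = "\n\n".join(final_docs)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,                                           # keeps the synthesis step deterministic
        messages=[
            {"role": "system", "content": "Answer strictly from the provided conversation memory."},
            {"role": "user", "content": f"Memory:\n{context}\n\nQuestion: {question}"},
        ],
    )
    answer = resp.choices[0].message.content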

Performance

<3 seconds per query on a single RTX 4090

Deterministic output between runs

Reproducible test harness (10×10 protocol)

Why this worked

Three things seemed to matter most:

MCA-first filter to stabilize early recall

Not discarding the union before re-ranking

Proper dense embedding instruction, which massively affects BGE performance

Notes

LoCoMo remains one of the hardest public memory benchmarks: 5,880 multi-hop, temporal, negation-rich QA pairs derived from human–agent conversations. I'd be interested in comparing notes with others working on long-term retrieval, especially multi-stage ranking or cross-encoder-heavy pipelines.

GitHub: https://github.com/vac-architector/VAC-Memory-System

vectify_AI 3 hours ago

Great project!

We take a slightly different approach and build a lossless memory system without any vector DBs: https://github.com/VectifyAI/ChatIndex

Hope to have more discussions around this topic!

  ViktorKuz 2 hours ago

  Thanks for the kind words about VAC Memory System!

  Love what you're doing with ChatIndex - the hierarchical tree approach is really smart! Preserving all raw data while adding semantic navigation layers is an elegant solution. We're solving similar problems from different angles (you: lossless trees, me: gravitational ranking).

  Starred your repo! Looking forward to seeing benchmarks when you release them. Keep building!