I built an open-weights memory system that reaches 80.1% on the LoCoMo benchmark
I’ve been experimenting with long-term memory architectures for agent systems and wanted to share some technical results that might be useful to others working on retrieval pipelines.
Benchmark: LoCoMo (10 runs × 10 conversation sets)
Average accuracy: 80.1%
Setup: full isolation across all 10 conversation groups (no cross-contamination, no shared memory between runs)
Architecture (all open weights except answer generation)
1. Dense retrieval
BGE-large-en-v1.5 (1024d)
FAISS IndexFlatIP
Standard BGE instruction prompt: “Represent this sentence for searching relevant passages.”
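The dense stage above boils down to instruction-prefixed query embedding plus exact inner-product search. A minimal sketch, with a NumPy dot product standing in for FAISS `IndexFlatIP` and tiny 4-d toy vectors standing in for the 1024-d BGE embeddings (the `embed` step itself is omitted):

```python
import numpy as np

# Prepended to queries (not documents) before embedding; exact separator formatting is an assumption here.
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def dense_search(query_vec, doc_matrix, top_k=30):
    """Exact inner-product search, equivalent to FAISS IndexFlatIP.

    With L2-normalized vectors, inner product equals cosine similarity.
    """
    scores = doc_matrix @ query_vec              # (n_docs,)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy stand-ins for normalized BGE-large-en-v1.5 embeddings.
docs = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0.7, 0.7, 0, 0]], dtype=np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
q = np.array([1, 0, 0, 0], dtype=np.float32)
print(dense_search(q, docs, top_k=2))  # doc 0 ranks first
```

Normalizing both sides is what makes `IndexFlatIP` behave as cosine search, which is the standard setup for BGE.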
2. Sparse retrieval
BM25 via classic inverted index
Helps with low-embedding-recall queries and keyword-heavy prompts
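For reference, the BM25 scoring that an inverted index computes can be written out in a few lines (standard Okapi BM25 with the usual k1/b defaults; the real system would of course score only documents pulled from the index's posting lists rather than looping over everything):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each document against the query."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()                      # document frequency per term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Because BM25 matches on exact tokens, it catches the keyword-heavy queries where dense embeddings have low recall.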
3. MCA (Multi-Component Aggregation) ranking
A simple gravitational-style score combining:
keyword coverage
token importance
local frequency signal
MCA acts as a first-pass filter to catch exact-match questions. Threshold: coverage ≥ 0.1 → keep top-30
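The exact MCA weighting isn't spelled out beyond its three components, but the coverage-threshold filter can be sketched as follows (the combination formula here is an illustrative stand-in, not the system's actual score):

```python
from collections import Counter

def mca_filter(query_tokens, docs_tokens, min_coverage=0.1, top_n=30):
    """First-pass filter: keep docs with keyword coverage >= threshold, top-n by score.

    Score = coverage boosted by a local-frequency signal; the real MCA
    combination (including token importance) is assumed, not documented.
    """
    qset = set(query_tokens)
    scored = []
    for i, d in enumerate(docs_tokens):
        coverage = len(qset & set(d)) / len(qset)         # keyword coverage
        if coverage < min_coverage:
            continue
        tf = Counter(d)
        local_freq = sum(tf[t] for t in qset) / max(len(d), 1)  # local frequency signal
        scored.append((i, coverage * (1 + local_freq)))
    scored.sort(key=lambda x: -x[1])
    return scored[:top_n]
```

The low 0.1 coverage threshold keeps the filter permissive: it only drops documents that share essentially no keywords with the query, which is why it works well for exact-match questions without hurting recall.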
4. Union strategy
Instead of aggressively pruning the union, the system feeds all 112–135 candidate documents directly to the re-ranker. In practice this improved stability and prevented the loss of rare but crucial documents.
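The union step is just a deduplicating merge of the candidate lists from each retrieval stage, with no truncation before reranking. A minimal sketch (first-seen order is an assumption; since the cross-encoder rescores everything anyway, the merge order doesn't matter):

```python
def build_union(*ranked_lists):
    """Merge candidate doc IDs from each stage, deduplicated, no pruning.

    The full union (typically 112-135 docs per the author's runs) goes
    straight to the cross-encoder.
    """
    seen, union = set(), []
    for lst in ranked_lists:
        for doc_id in lst:
            if doc_id not in seen:
                seen.add(doc_id)
                union.append(doc_id)
    return union

# e.g. dense top-k, BM25 top-k, MCA top-30
print(build_union([1, 2, 5], [2, 3], [5, 7]))  # [1, 2, 5, 3, 7]
```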
5. Cross-Encoder reranking
bge-reranker-v2-m3
Processes the full union (rare for RAG pipelines, but worked best here)
Produces a final top-k used for answer generation
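Reranking the full union is then a single pass of pairwise scoring. A sketch with the scorer injected as a parameter, since in the real pipeline it would wrap bge-reranker-v2-m3 (e.g. via sentence-transformers' `CrossEncoder`); the toy overlap scorer below is purely illustrative:

```python
def rerank(query, docs, score_fn, top_k=10):
    """Score every (query, doc) pair in the union and keep the top-k docs."""
    scored = sorted(((score_fn(query, d), d) for d in docs), key=lambda x: -x[0])
    return [d for _, d in scored[:top_k]]

def toy_score(query, doc):
    # Hypothetical stand-in for a cross-encoder relevance score.
    return len(set(query.lower().split()) & set(doc.lower().split()))
```

Cross-encoder cost scales linearly with union size, so scoring ~130 candidates per query is what makes the <3 s budget on a single GPU plausible while still letting rare documents survive to the final top-k.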
6. Answer generation
GPT-4o-mini, used only for the final synthesis step
No agent chain, no tool calls, no memory-dependent LLM logic
Performance
<3 seconds per query on a single RTX 4090
Deterministic output between runs
Reproducible test harness (10×10 protocol)
Why this worked
Three things seemed to matter most:
MCA-first filter to stabilize early recall
Not discarding the union before re-ranking
Proper dense embedding instruction, which massively affects BGE performance
Notes
LoCoMo remains one of the hardest public memory benchmarks: 5,880 multi-hop, temporal, negation-rich QA pairs derived from human–agent conversations. I'd be interested in comparing notes with others working on long-term retrieval, especially multi-stage ranking or cross-encoder-heavy pipelines.
GitHub: https://github.com/vac-architector/VAC-Memory-System
Great project!
We also use a slightly different strategy to build a lossless memory system without any Vector DBs: https://github.com/VectifyAI/ChatIndex
Hope to have more discussions around this topic!
Thanks for the kind words about VAC Memory System!