MultiHop-RAG benchmark explained: what the dataset reveals about iterative retrieval
MultiHop-RAG shows that existing RAG methods struggle when the evidence for a query is spread across 2 to 4 documents. Its 2,556-query setup exposes the weakness of single-pass retrieval and motivates iterative retrieval. The paper demonstrates this on a knowledge base of news articles, however, so the result is strong evidence of multi-hop failure modes, not proof that iterative retrieval is a universal fix.
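To make the contrast between single-pass and iterative retrieval concrete, here is a minimal sketch on a toy corpus. Everything in it is an assumption made for illustration: the three documents, the query, the stopword list, and the keyword-overlap scorer are invented, and none of it is the benchmark's data or evaluation code. Real systems would typically use embedding search and an LLM to reformulate the query between hops.

```python
import re

# Toy corpus invented for illustration; not taken from MultiHop-RAG.
CORPUS = {
    "doc_a": "Acme Corp announced that Jane Rivera will become its chief executive.",
    "doc_b": "Jane Rivera once led the robotics startup Nimbus Labs.",
    "doc_c": "Local weather was mild across the region this weekend.",
}

# Hypothetical two-hop question: the answer sits in doc_b, but only doc_a
# shares any words with the question.
QUERY = "Which company did the new chief executive of Acme Corp run before?"

STOPWORDS = {"which", "did", "the", "of", "that", "its", "will", "was", "this"}


def tokens(text):
    """Lowercased word set with a tiny stopword list removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def retrieve(terms, k, exclude=()):
    """Return up to k doc ids ranked by keyword overlap, skipping zero-overlap docs."""
    scored = []
    for doc_id, text in CORPUS.items():
        if doc_id in exclude:
            continue
        score = len(terms & tokens(text))
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def single_pass(query, k=2):
    """One retrieval call using only the words of the original query."""
    return retrieve(tokens(query), k)


def iterative(query, hops=2, k=1):
    """Retrieve, fold the retrieved text into the search terms, retrieve again."""
    terms = tokens(query)
    found = []
    for _ in range(hops):
        hits = retrieve(terms, k, exclude=found)
        if not hits:
            break
        found.extend(hits)
        for doc_id in hits:
            terms |= tokens(CORPUS[doc_id])  # bridge entities such as "jane rivera"
    return found


print(single_pass(QUERY))  # ['doc_a']
print(iterative(QUERY))    # ['doc_a', 'doc_b']
```

In this toy setup the single pass never reaches `doc_b`, even with `k=2`, because the bridging name only appears after the first document has been read. The second hop finds it by searching with terms drawn from `doc_a`. The sketch illustrates the failure mode only; it says nothing about how much iterative retrieval helps on the benchmark itself.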