TREC 2024 RAG - Corpus Discussion
Based on some preliminary experiments and discussions, we are considering using a deduped version of the MS MARCO V2 passage corpus for the TREC 2024 RAG track. Given how widely used the original corpus is in the community (TREC DL 21-23), along with it being open and of approachable scale (~120M), we believe it a good fit. Of course, there have been some issues with the corpus. One is the presence of duplicates, which Ian and the folk at NIST have worked hard to address and provide near-dupe equivalence classes. Additionally, LLMS may have memorized the corpus to a certain degree, given it is not “brand new information” and could be a concern for topic creation and evaluation. However, we still believe the pros outweigh the cons. Of course, I think we can speak for the cool folk at Anserini, Pyserini, and RankLLM to get the ball rolling on indexing the corpus and providing effective and easy-to-run baselines for the community to use :)
We now pass this on to the community for feedback. We are particularly interested in hearing about any concerns or issues that you may have with using this corpus. Please feel free to reach out to us via email/Twitter/Discord!
Live Long and Prosper,
TREC RAG Organizers