TREC 2025 RAG Corpus
For TREC 2025, we continue to utilize the MS MARCO V2.1 document corpus and its segmented counterpart, as introduced in TREC 2024. This version addresses deduplication concerns and provides enhanced segmentation for improved retrieval and generation tasks.
📚 MS MARCO V2.1 Document Corpus
- Total Documents: 10,960,555
- Format: 70 gzipped JSONL files packaged in a TAR archive
- Download: msmarco_v2.1_doc.tar (28.1 GB)
- MD5 Checksum:
a5950665d6448d3dbaf7135645f1e074
Document Structure
Each document contains:
docid
: Unique identifierurl
: Source URLtitle
: Document titleheadings
: Subheadings within the documentbody
: Main content of the document
Note: The docid
(e.g., msmarco_v2.1_doc_29_677149
) encodes the filename (29
) and byte (677149
) offset, facilitating efficient data retrieval.
📚 MS MARCO V2.1 Segmented Corpus
- Total Segments: 113,520,750
- Format: 70 gzipped JSONL files packaged in a TAR archive
- Download: msmarco_v2.1_doc_segmented.tar (25.1 GB)
- MD5 Checksum:
3799e7611efffd8daeb257e9ccca4d60
Segmentation Details
- Method: Sliding window of 10 sentences with a stride of 5 sentences
- Segment Size: Approximately 500–1000 characters
Segment Structure
Each segment includes:
docid
: Segment identifierurl
: Source URLtitle
: Document titleheadings
: Subheadings within the documentsegment
: Segment textstart_char
: Start character offset in the original documentend_char
: End character offset in the original document
Relevance Judgments (Qrels)
Qrels mapped to MS MARCO V2.1.
Set | Filename | Records | MD5 Checksum |
---|---|---|---|
TREC RAG 2024 (UMBRELA) | qrels.rag24.test-umbrela-all.txt | 108,479 | 7a43d4a23cf37f12dab0043a4b9f4d02 |
TREC DL 2021 | qrels.dl21-doc-msmarco-v2.1.txt | 10,973 | 6845b6c128aec71027e72078a960600e |
TREC DL 2022 | qrels.dl22-doc-msmarco-v2.1.txt | 349,541 | ac9f5c6fcb6972d8bf13b07ab150680a |
TREC DL 2023 | qrels.dl23-doc-msmarco-v2.1.txt | 15,995 | 4b30e8850b6fba56289b6c177afe959b |
MS MARCO Dev | qrels.msmarco-v2.1-doc.dev.txt | 4,702 | 089b19dce0a6f74d3f01fd988d439f2f |
MS MARCO Dev2 | qrels.msmarco-v2.1-doc.dev2.txt | 5,177 | 8ff337f2110ea44ba8c5533a7eff6222 |
Topics Files
Corresponding topics files are provided for each qrels set.
Set | Filename | Records | MD5 Checksum |
---|---|---|---|
TREC RAG 2024 | topics.rag24.test.txt | 301 | 5bd6c8fa0e1300233fe139bae8288d09 |
TREC DL 2021 | topics.dl21.txt | 477 | 46d863434dda18300f5af33ee29c4b28 |
TREC DL 2022 | topics.dl22.txt | 500 | f1bfd53d80e81e58207ce557fd2211a0 |
TREC DL 2023 | topics.dl23.txt | 700 | 7df9e17b47cc9aa5d1c9fd5b313e273c |
MS MARCO Dev | topics.msmarco-v2-doc.dev.txt | 4,552 | b05dc19f1d2b8ad729f189328a685aa1 |
MS MARCO Dev2 | topics.msmarco-v2-doc.dev2.txt | 5,000 | f000319f1893a7acdd60fdcae0703b95 |
More information on the topics for the TREC 2025 RAG Track will be released soon. Stay tuned!
Regards, The TREC 2025 RAG Organizers