1 minute read

For TREC 2025, we continue to utilize the MS MARCO V2.1 document corpus and its segmented counterpart, as introduced in TREC 2024. This version addresses deduplication concerns and provides enhanced segmentation for improved retrieval and generation tasks.


📚 MS MARCO V2.1 Document Corpus

  • Total Documents: 10,960,555
  • Format: 70 gzipped JSONL files packaged in a TAR archive
  • Download: msmarco_v2.1_doc.tar (28.1 GB)
  • MD5 Checksum: a5950665d6448d3dbaf7135645f1e074

Document Structure

Each document contains:

  • docid: Unique identifier
  • url: Source URL
  • title: Document title
  • headings: Subheadings within the document
  • body: Main content of the document

Note: The docid (e.g., msmarco_v2.1_doc_29_677149) encodes the filename (29) and byte (677149) offset, facilitating efficient data retrieval.


📚 MS MARCO V2.1 Segmented Corpus

  • Total Segments: 113,520,750
  • Format: 70 gzipped JSONL files packaged in a TAR archive
  • Download: msmarco_v2.1_doc_segmented.tar (25.1 GB)
  • MD5 Checksum: 3799e7611efffd8daeb257e9ccca4d60

Segmentation Details

  • Method: Sliding window of 10 sentences with a stride of 5 sentences
  • Segment Size: Approximately 500–1000 characters

Segment Structure

Each segment includes:

  • docid: Segment identifier
  • url: Source URL
  • title: Document title
  • headings: Subheadings within the document
  • segment: Segment text
  • start_char: Start character offset in the original document
  • end_char: End character offset in the original document

Relevance Judgments (Qrels)

Qrels mapped to MS MARCO V2.1.

Set Filename Records MD5 Checksum
TREC RAG 2024 (UMBRELA) qrels.rag24.test-umbrela-all.txt 108,479 7a43d4a23cf37f12dab0043a4b9f4d02
TREC DL 2021 qrels.dl21-doc-msmarco-v2.1.txt 10,973 6845b6c128aec71027e72078a960600e
TREC DL 2022 qrels.dl22-doc-msmarco-v2.1.txt 349,541 ac9f5c6fcb6972d8bf13b07ab150680a
TREC DL 2023 qrels.dl23-doc-msmarco-v2.1.txt 15,995 4b30e8850b6fba56289b6c177afe959b
MS MARCO Dev qrels.msmarco-v2.1-doc.dev.txt 4,702 089b19dce0a6f74d3f01fd988d439f2f
MS MARCO Dev2 qrels.msmarco-v2.1-doc.dev2.txt 5,177 8ff337f2110ea44ba8c5533a7eff6222

Topics Files

Corresponding topics files are provided for each qrels set.

Set Filename Records MD5 Checksum
TREC RAG 2024 topics.rag24.test.txt 301 5bd6c8fa0e1300233fe139bae8288d09
TREC DL 2021 topics.dl21.txt 477 46d863434dda18300f5af33ee29c4b28
TREC DL 2022 topics.dl22.txt 500 f1bfd53d80e81e58207ce557fd2211a0
TREC DL 2023 topics.dl23.txt 700 7df9e17b47cc9aa5d1c9fd5b313e273c
MS MARCO Dev topics.msmarco-v2-doc.dev.txt 4,552 b05dc19f1d2b8ad729f189328a685aa1
MS MARCO Dev2 topics.msmarco-v2-doc.dev2.txt 5,000 f000319f1893a7acdd60fdcae0703b95

More information on the topics for the TREC 2025 RAG Track will be released soon. Stay tuned!

Regards, The TREC 2025 RAG Organizers