A family of datasets built using technology from Bing.
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. Paper: https://arxiv.org/abs/1611.09268
MS MARCO (Microsoft MAchine Reading COmprehension) is a large-scale dataset focused on machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search, as well as whatever else the community thinks would be useful.
First released at NIPS 2016, the dataset currently contains 1,010,916 unique real queries generated by sampling and anonymizing Bing usage logs. The dataset started off focused on question answering but has since evolved to cover any problem related to search. For task specifics, please explore the tasks that have been built on top of the dataset. If you think there is a relevant task we have missed, please open an issue explaining your idea.
- For more information about Q&A
- For more information about ReRanking
- For more information about Keyphrase Extraction
- For more information about End2End Ranking
- For more information about Conversational Search
What is the difference between MS MARCO and other MRC datasets? We believe the advantages that are special to MS MARCO are:
- Real questions: All questions have been sampled from real, anonymized Bing queries.
- Real Documents: Most URLs from which we sourced the passages contain the full web documents. These can be used as extra contextual information to improve systems or to compete in our expert task.
- Human Generated Answers: Every question has an answer written by a human. If the passages the judge read contained no answer, the judge wrote 'No Answer Present.'
- Human Generated Well-Formed Answers: Some questions have received extra human evaluation to create well-formed answers that could be used by intelligent agents like Cortana, Siri, Google Assistant, and Alexa.
- Dataset Size: At over one million queries, the dataset is large enough to train the most complex systems while still allowing the data to be sampled for specific applications.
To download the MS MARCO dataset, please navigate to msmarco.org and agree to our Terms and Conditions. If there is data you think we are missing that would be useful, please open an issue.
In an effort to produce a dataset that can continue to be challenging and rewarding, we have broken down the MS MARCO dataset into tasks of varying difficulty.
Given a query q and a set of passages P = p1, p2, ..., p10, a successful machine reading comprehension system is expected to read and understand both the question and the passages. The system must then accurately decide whether the passages provide adequate information to answer the query, since not all queries have an answer. If there is not enough information, the system should respond 'No Answer Present.'; if there is, the system should generate a quality answer a that is as close as possible to the human generated reference answers RA = ra1, ra2, ..., ram. Evaluation will be done using ROUGE-L, BLEU-1, and an F1 score that measures how well a system decides whether a question can be answered. Questions that do not have an answer will not be used to calculate ROUGE-L or BLEU-1.
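The official evaluation scripts linked from the dataset site are the reference implementation; purely as an illustration of the metric, the sketch below computes a ROUGE-L F-measure from scratch. The function name, whitespace tokenization, and the beta value are assumptions of this sketch, not the official code.

```python
def rouge_l(reference: str, candidate: str, beta: float = 1.2) -> float:
    """Illustrative ROUGE-L F-measure via longest common subsequence (LCS).

    Whitespace tokenization and beta=1.2 are assumptions for this sketch;
    use the official evaluation scripts for reported numbers.
    """
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    # Weighted harmonic mean of precision and recall; beta > 1 favors recall.
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)
```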
Given a query q and a set of passages P = p1, p2, ..., p10, a successful machine reading comprehension system is expected to read and understand both the question and the passages. For this task every query has an answer, so systems do not need to handle no-answer queries. Using the relevant passages, a successful system should produce a candidate answer that is as close as possible to the human generated well-formed reference answers RA = wfra1, wfra2, ..., wfram. Evaluation will be done using ROUGE-L and BLEU-1.
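For completeness, here is a minimal sketch of BLEU-1 (clipped unigram precision with a brevity penalty). Again, whitespace tokenization is an assumption, and the official scripts remain the reference implementation.

```python
import math
from collections import Counter

def bleu_1(reference: str, candidate: str) -> float:
    """Illustrative BLEU-1: clipped unigram precision times a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    # Clip each candidate unigram's count by its count in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty discourages degenerate short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```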
Given a query q and the 1000 most relevant passages P = p1, p2, ..., p1000 as retrieved by BM25, a successful system is expected to rerank the passages so that the most relevant passage appears as high as possible. Note that not every query has a human-labeled relevant passage among its 1000 candidates. Evaluation will be done using MRR.
Given a query q and a corpus of passages P = p1, p2, p3, ..., pn, a successful system is expected to retrieve the 10 passages it judges most relevant. Evaluation will be done using MRR; a sketch of the metric follows below.
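Both ranking tasks are scored with mean reciprocal rank, commonly reported with a cutoff such as MRR@10. The sketch below is a minimal illustration; the data-structure choices (dicts keyed by query id) are assumptions of this sketch, not the official evaluation format.

```python
def mean_reciprocal_rank(rankings, relevant, cutoff=10):
    """Illustrative MRR@cutoff.

    rankings: dict mapping query id -> list of passage ids, best first.
    relevant: dict mapping query id -> set of relevant passage ids.
    Queries with no relevant passage in the top `cutoff` contribute 0.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:cutoff], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(rankings)
```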
Given a document D, represented as raw text and clean body text, a successful system is expected to provide 5 potential document keyphrases ranked by importance.
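As a rough illustration of the task's input and output shape (not an official baseline), the sketch below ranks unigram and bigram candidates from the clean body text by frequency. The stopword list, tokenization, and scoring are naive assumptions made only for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a proper one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def top_keyphrases(clean_body: str, k: int = 5):
    """Naive frequency baseline: return the k most common unigram/bigram candidates."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", clean_body.lower())
              if t not in STOPWORDS]
    candidates = Counter(tokens)
    # Bigrams over the stopword-filtered stream; crude, but shows the output shape.
    candidates.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return [phrase for phrase, _ in candidates.most_common(k)]
```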
Given a sequence of queries Q = q1, ..., qn-1, predict the next query qn.
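To make the task shape concrete, here is a hypothetical baseline (not an official one) that predicts qn as the query that most often followed qn-1 in training sessions. The function names and the session data format are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def build_next_query_model(sessions):
    """sessions: iterable of query lists, one list per user session (assumed format).

    Returns a dict mapping each query to the query that most often followed it.
    """
    followers = defaultdict(Counter)
    for session in sessions:
        for prev_q, next_q in zip(session, session[1:]):
            followers[prev_q][next_q] += 1
    return {q: counts.most_common(1)[0][0] for q, counts in followers.items()}

def predict_next_query(model, history, fallback=""):
    """Predict q_n from the session history q_1..q_{n-1} (history must be non-empty)."""
    return model.get(history[-1], fallback)
```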
The first iteration of the MS MARCO dataset contained 100,000 queries and ran from December 2016 to March 2018. The full data can be found below: Train, Dev, Eval, Evaluation Scripts.
MS MARCO has been designed not as a dataset to be beaten but as an effort to establish a large community of researchers working on machine comprehension. If you have thoughts on things we could do better, ideas for how to use the datasets, or general questions, please don't hesitate to [reach out and ask](mailto:[email protected]?subject=MS MARCO Feedback).
Daniel Campos, Microsoft ([email protected])
This project is licensed under the MIT License - see the LICENSE.md file for details