A family of datasets built using technology from Bing.
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. Paper: https://arxiv.org/abs/1611.09268
MS MARCO (Microsoft MAchine Reading COmprehension) is a large-scale dataset focused on machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search, as well as whatever else the community thinks would be useful.
First released at NIPS 2016, the dataset currently contains 1,010,916 unique real queries generated by sampling and anonymizing Bing usage logs. The dataset started off focused on question answering but has since evolved to cover any problem related to search. For task specifics, please explore the tasks that have been built on top of the dataset. If you think there is a relevant task we have missed, please open an issue explaining your idea.
- For more information about Q&A
- For more information about ReRanking
- For more information about Keyphrase Extraction
- For more information about End2End Ranking
- For more information about Conversational Search
What is the difference between MS MARCO and other MRC datasets? We believe the advantages that are special to MS MARCO are:
- Real questions: All questions have been sampled from real, anonymized Bing queries.
- Real Documents: Most URLs from which we sourced the passages contain the full web documents. These can be used as extra contextual information to improve systems or to compete in our expert task.
- Human Generated Answers: Every question has an answer written by a human. If the passages the judge read contained no answer, the judge wrote 'No Answer Present.'
- Human Generated Well-Formed Answers: Some questions have received extra human evaluation to create well-formed answers that could be used by intelligent agents like Cortana, Siri, Google Assistant, and Alexa.
- Dataset Size: At over one million queries, the dataset is large enough to train the most complex systems while still allowing the data to be sampled for specific applications.
To download the MS MARCO dataset, please navigate to msmarco.org and agree to our Terms and Conditions. If there is data you think we are missing that would be useful, please open an issue.
In an effort to produce a dataset that can continue to be challenging and rewarding, we have broken down the MS MARCO dataset into tasks of varying difficulty.
Given a query q and a set of passages P = p1, p2, ..., p10, a successful machine reading comprehension system is expected to read and understand both the question and the passages. The system must then accurately decide whether the passages provide adequate information to answer the query, since not all queries have an answer. If there is not enough information, the system should respond 'No Answer Present.'; if there is, the system should generate a quality answer a that is as close as possible to the human generated reference answers RA = ra1, ra2, ..., ram. Evaluation will be done using ROUGE-L, BLEU-1, and an F1 score that measures how well a system decides whether a question can be answered. Questions that do not have an answer will not be used to calculate ROUGE-L or BLEU-1.
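The official evaluation scripts linked from the dataset site are the reference implementation; purely as an illustration of the metric, the sketch below computes a ROUGE-L F-measure from scratch. The function name, whitespace tokenization, and the beta value are assumptions of this sketch, not the official code.

```python
def rouge_l(reference: str, candidate: str, beta: float = 1.2) -> float:
    """Illustrative ROUGE-L F-measure via longest common subsequence (LCS).

    Whitespace tokenization and beta=1.2 are assumptions for this sketch;
    use the official evaluation scripts for reported numbers.
    """
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    # Weighted harmonic mean of precision and recall; beta > 1 favors recall.
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)
```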
Given a query q and a set of passages P = p1, p2, ..., p10, a successful machine reading comprehension system is expected to read and understand both the question and the passages. For this task every query has an answer, so systems do not need to handle no-answer queries. Using the relevant passages, a successful system should produce a candidate answer that is as close as possible to the human generated well-formed reference answers RA = wfra1, wfra2, ..., wfram. Evaluation will be done using ROUGE-L and BLEU-1.
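For completeness, here is a minimal sketch of BLEU-1 (clipped unigram precision with a brevity penalty). Again, whitespace tokenization is an assumption, and the official scripts remain the reference implementation.

```python
import math
from collections import Counter

def bleu_1(reference: str, candidate: str) -> float:
    """Illustrative BLEU-1: clipped unigram precision times a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    # Clip each candidate unigram's count by its count in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty discourages degenerate short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```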
Given a query q and the 1000 most relevant passages P = p1, p2, ..., p1000 as retrieved by BM25, a successful system is expected to rerank the passages so that the most relevant passage appears as high as possible. Note that not every query has a human-labeled relevant passage among its 1000 candidates. Evaluation will be done using MRR.
Given a query q and a corpus of passages P = p1, p2, p3, ..., pn, a successful system is expected to retrieve the 10 passages it judges most relevant. Evaluation will be done using MRR; a sketch of the metric follows below.
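Both ranking tasks are scored with mean reciprocal rank, commonly reported with a cutoff such as MRR@10. The sketch below is a minimal illustration; the data-structure choices (dicts keyed by query id) are assumptions of this sketch, not the official evaluation format.

```python
def mean_reciprocal_rank(rankings, relevant, cutoff=10):
    """Illustrative MRR@cutoff.

    rankings: dict mapping query id -> list of passage ids, best first.
    relevant: dict mapping query id -> set of relevant passage ids.
    Queries with no relevant passage in the top `cutoff` contribute 0.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:cutoff], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(rankings)
```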
Given a document D, represented as raw text and clean body text, a successful system is expected to provide 5 potential document keyphrases ranked by importance.
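As a rough illustration of the task's input and output shape (not an official baseline), the sketch below ranks unigram and bigram candidates from the clean body text by frequency. The stopword list, tokenization, and scoring are naive assumptions made only for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a proper one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def top_keyphrases(clean_body: str, k: int = 5):
    """Naive frequency baseline: return the k most common unigram/bigram candidates."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", clean_body.lower())
              if t not in STOPWORDS]
    candidates = Counter(tokens)
    # Bigrams over the stopword-filtered stream; crude, but shows the output shape.
    candidates.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return [phrase for phrase, _ in candidates.most_common(k)]
```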
Given a sequence of queries Q = q1, ..., qn-1, predict the next query qn.
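To make the task shape concrete, here is a hypothetical baseline (not an official one) that predicts qn as the query that most often followed qn-1 in training sessions. The function names and the session data format are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def build_next_query_model(sessions):
    """sessions: iterable of query lists, one list per user session (assumed format).

    Returns a dict mapping each query to the query that most often followed it.
    """
    followers = defaultdict(Counter)
    for session in sessions:
        for prev_q, next_q in zip(session, session[1:]):
            followers[prev_q][next_q] += 1
    return {q: counts.most_common(1)[0][0] for q, counts in followers.items()}

def predict_next_query(model, history, fallback=""):
    """Predict q_n from the session history q_1..q_{n-1} (history must be non-empty)."""
    return model.get(history[-1], fallback)
```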
The first iteration of the MS MARCO dataset contained 100,000 queries and ran from December 2016 to March 2018. The full data can be found below: Train, Dev, Eval, Evaluation Scripts.
MS MARCO has been designed not as a dataset to be beaten but as an effort to establish a large community of researchers working on machine comprehension. If you have thoughts on things we could do better, ideas for how to use the datasets, or general questions, please don't hesitate to [reach out and ask](mailto:[email protected]?subject=MS MARCO Feedback).
Daniel Campos, Microsoft ([email protected])
This project is licensed under the MIT License - see the LICENSE.md file for details