<!DOCTYPE html>
<!-- modified from url=(0044)http://idl.cs.washington.edu/papers/tfgraph/ -->
<html>

<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>A Survey of Available Corpora for Building Data-Driven Dialogue Systems</title>
    <meta property="og:image" content="/static/images/logo/idl-logo.png">
    <link rel="stylesheet" type="text/css" href="./tfthing_files/fonts.css">
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css">

    <link rel="stylesheet" type="text/css" href="./tfthing_files/main.css">
</head>

<body>
    <div class="gutter">
        Associated Links
        <div class="nav">
          <br>
            <a href="http://rl.cs.mcgill.ca/index.html" class="at">RLLAB</a>
            <br>
            <a href="http://diro.umontreal.ca/accueil/" class="at">IRO</a>
            <br>
            <a href="http://www.iulianserban.com/">Iulian Vlad Serban</a>
            <br>
            <a href="http://cs.mcgill.ca/~rlowe1/">Ryan Lowe</a>
            <br>
            <a href="http://peterhenderson.co">Peter Henderson</a>
            <br>
            <a href="http://www.cs.toronto.edu/~lcharlin/">Laurent Charlin</a>
            <br>
            <a href="http://www.cs.mcgill.ca/~jpineau/">Joelle Pineau</a>
        </div>
        <div class="orgs">
          <a href="https://umontreal.ca"><img width="150px" src="tfthing_files/udem.png"> </a>
          <a href="https://mcgill.ca"><img width="150px" src="tfthing_files/mcgill_logo_big.png"> </a>
          <a href="http://www.sait.samsung.com/"><img width="150px" src="tfthing_files/sait.png"> </a>
          <a href="http://www.nserc-crsng.gc.ca/index_eng.asp"><img width="150px" src="tfthing_files/nserc.png"></a>
          <a href="http://www.cifar.ca/"><img width="150px" src="tfthing_files/cifar.jpg"></a>
        </div>
    </div>
    <div id="main">
        <div class="header">
            <div class="mark">
                <span class="mcgill">McGill</span>
                <span class="mark-i">&amp;</span>
                <span class="mark-d" style="color: blue">UdeM</span>
            </div>
            <div id="title">Papers</div>
        </div>
        <div id="content">
            <div class="article">
                <div class="pub">
                    <div class="title">
                      A Survey of Available Corpora for Building Data-Driven Dialogue Systems
                    </div>
                    <div class="authors">

                    <span class="author"><a href="http://www.iulianserban.com/">Iulian Vlad Serban,</a></span>
                    <span class="author"><a href="http://cs.mcgill.ca/~rlowe1/">Ryan Lowe,</a></span>
                    <span class="author"><a href="http://peterhenderson.co">Peter Henderson,</a></span>
                    <span class="author"><a href="http://www.cs.toronto.edu/~lcharlin/">Laurent Charlin,</a></span>
                    <span class="author"><a href="http://www.cs.mcgill.ca/~jpineau/">Joelle Pineau</a></span>

                    </div>
                    <div class="alert alert-success fade in">
                      <strong>Call for contributions!</strong> We're always looking for more datasets. Feel free to <a href="https://github.com/Breakend/DialogDatasets">send us a pull request!</a>
                    </div>
                    <div class="figure">
                      <img width="50%" src="./tfthing_files/splash_image.png">
                        <br>
                        <div class="caption">A basic outline of a dialog system.</div>
                    </div>
                    <div class="abstract">
                        <div class="heading">Abstract</div>
                        <div class="text">
                          During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.
                          </div>
                    </div>
                    <div class="materials">
                        <div class="heading">Materials</div>
                        <div class="links">
                          <a href="https://arxiv.org/abs/1512.05742" class="link at">[ PDF ]</a>

                          <a class="link at" href="https://raw.githubusercontent.com/Breakend/about-me/master/_bibliography/singles/survey.bib">[ BibTeX ]</a>

                          <a class="link at" href="https://github.com/Breakend/DialogDatasets">[ Website Code ]</a>

                          </div>
                    </div>
                    <div class="citation">
                      <div class="heading">Acknowledgements</div>
                      The authors gratefully acknowledge financial support by the <b>Samsung Advanced Institute of Technology (SAIT)</b>, the <b>Natural Sciences and Engineering Research Council of Canada (NSERC)</b>, the <b>Canada Research Chairs</b>, the <b>Canadian Institute for Advanced Research (CIFAR)</b> and <b>Compute Canada</b>. Early versions of the manuscript benefited greatly from the proofreading of Melanie Lyman-Abramovitch, and later versions were extensively revised by Genevieve Fried and Nicolas Angelard-Gontier. The authors also thank Nissan Pow, Michael Noseworthy, Chia-Wei Liu, Gabriel Forgues, Alessandro Sordoni, Yoshua Bengio and Aaron Courville for helpful discussions.
                    </div>
                    <div class="citation">
                        <div class="heading">Citation</div>
                        <div class="paper">
                            <div class="thumbnail">
                                <img width="150px" style="padding-right: 5px;" src="./tfthing_files/citation_thumb.png">
                            </div>
                            <div class="entry" style="padding-top: 10px;">
                                <div class="title">
                                  <a href="https://arxiv.org/abs/1512.05742" class="at">A Survey of Available Corpora for Building Data-Driven Dialogue Systems</a>
                                </div>
                                <div class="authors">
                                  <span class="author"><a href="http://www.iulianserban.com/">Iulian Vlad Serban,</a></span>
                                  <span class="author"><a href="http://cs.mcgill.ca/~rlowe1/">Ryan Lowe,</a></span>
                                  <span class="author"><a href="http://peterhenderson.co">Peter Henderson,</a></span>
                                  <span class="author"><a href="http://www.cs.toronto.edu/~lcharlin/">Laurent Charlin,</a></span>
                                  <span class="author"><a href="http://www.cs.mcgill.ca/~jpineau/">Joelle Pineau</a></span>
                                </div>
                                <div class="venue"><span class="journal">ArXiv</span>, <span class="year">2017</span></div>
                            </div>
                        </div>
                    </div>
                    <div>
                      <div class="heading">Dataset Statistics</div>
                      <div class="paper tables">

          <br> <div class="header"><span>Human-Machine Dialogue Datasets</span></div>

          <div style="text-align:left; font-size: 14px;">

            <span></span>
            <div class="footnotesize">
              <span></span>
              <table border="1">
                <tr>
                  <td align="left">Name</td>
                  <td align="left">Type</td>
                  <td align="left">Topics</td>
                  <td align="center">Avg. # of turns</td>
                  <td align="center">Total # of dialogues</td>
                  <td align="center" style="width: 5%;">Total # of words</td>
                  <td align="left">Description</td>
                  <td align="left">Links</td>
                </tr>
                <tr>
                  <td align="left">
                    Let's Go!<br/>[<a class="tth_citeref" href="./references.html#raux2005let">Raux et&nbsp;al., 2005</a>]
                  </td>
                  <td align="left">Spoken</td>
                  <td align="left">Bus schedules</td>
                  <td align="center">--</td>
                  <td align="center">171,128</td>
                  <td align="center">--</td>
                  <td align="left">Bus ride information system</td>
                  <td align="left"><a class="dllink" href="https://github.com/DialRC/LetsGoDataset">Info and Download</a></td>
                </tr>
                <tr>
                  <td align="left">
                    DSTC1<br/>[<a class="tth_citeref" href="./references.html#williams2013dialog">Williams et&nbsp;al., 2013</a>]
                  </td>
                  <td align="left">Spoken</td>
                  <td align="left">Bus schedules</td>
                  <td align="center">13.56</td>
                  <td align="center">15,000</td>
                  <td align="center">3.7M</td>
                  <td align="left">Bus ride information system</td>
                  <td align="left"><a class="dllink" href="https://www.microsoft.com/en-us/research/event/dialog-state-tracking-challenge/">Info and Download</a></td>
                </tr>
                <tr>
                  <td align="left">
                    DSTC2<br/>[<a class="tth_citeref" href="./references.html#henderson2014second">Henderson et&nbsp;al., 2014b</a>]
                  </td>
                  <td align="left">Spoken</td>
                  <td align="left">Restaurants</td>
                  <td align="center">7.88</td>
                  <td align="center">3,000</td>
                  <td align="center">432K</td>
                  <td align="left">Restaurant booking system</td>
                  <td align="left"><a class="dllink" href="http://camdial.org/~mh521/dstc/">Info and Download</a></td>
                </tr>
                <tr>
                  <td align="left">
                    DSTC3<br/>[<a class="tth_citeref" href="./references.html#henderson2014dialog">Henderson et&nbsp;al., 2014a</a>]
                  </td>
                  <td align="left">Spoken</td>
                  <td align="left">Tourist information</td>
                  <td align="center">8.27</td>
                  <td align="center">2,265</td>
                  <td align="center">403K</td>
                  <td align="left">Information for tourists</td>
                  <td align="left"><a class="dllink" href="http://camdial.org/~mh521/dstc/">Info and Download</a></td>
                </tr>
                <tr>
                  <td align="left">CMU Communicator Corpus<br/>[<a class="tth_citeref" href="./references.html#bennett2002carnegie">Bennett and Rudnicky, 2002</a>]</td>
                  <td align="left">Spoken</td>
                  <td align="left">Travel</td>
                  <td align="center">11.67</td>
                  <td align="center">15,481</td>
                  <td align="center">2M*</td>
                  <td align="left">Travel planning and booking system</td>
                  <td align="left"><a class="dllink" href="http://www.speech.cs.cmu.edu/Communicator/manuals.html">Info and Download</a></td>
                </tr>
                <tr>
                  <td align="left">ATIS Pilot Corpus<br/>[<a class="tth_citeref" href="./references.html#hemphill1990atis">Hemphill et&nbsp;al., 1990</a>]</td>
                  <td align="left">Spoken</td>
                  <td align="left">Travel</td>
                  <td align="center">25.4</td>
                  <td align="center">41</td>
                  <td align="center">11.4K*</td>
                  <td align="left">Travel planning and booking system</td>
                  <td align="left"><a class="dllink" href="https://catalog.ldc.upenn.edu/docs/LDC93S4B/corpus.html">Info</a><br/><a class="dllink" href="https://github.com/denizyuret/nlpcourse/tree/master/download">Download</a></td>

                </tr>

                <tr>
                  <td align="left">Ritel Corpus<br/>[<a class="tth_citeref" href="./references.html#rosset2006ritel">Rosset and Petel, 2006</a>]</td>
                  <td align="left">Spoken</td>
                  <td align="left">Unrestricted/ Diverse Topics</td>
                  <td align="center">9.3*</td>
                  <td align="center">582</td>
                  <td align="center">60k</td>
                  <td align="left">An annotated open-domain question answering spoken dialogue system</td>
                  <td align="left"><a class="dllink" href="https://publi.limsi.fr/RS2005/chm/lir/lir12/">Info</a><br/> Contact corpus authors for download</td>
                </tr>
                <tr>
                  <td align="left">DIALOG Mathematical Proofs<br/>[<a class="tth_citeref" href="./references.html#wolska2004annotated">Wolska et&nbsp;al., 2004</a>]</td>
                  <td align="left">Spoken</td>
                  <td align="left">Mathematics</td>
                  <td align="center">12</td>
                  <td align="center">66</td>
                  <td align="center">8.7K*</td>
                  <td align="left">Humans interact with computer system to do mathematical theorem proving</td>
                  <td align="left"><a class="dllink" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.114.5792&amp;rep=rep1&amp;type=pdf">Info</a><br/> Contact corpus authors for download</td>
                </tr>

                <tr>
                  <td align="left">MATCH Corpus <br/> [<a class="tth_citeref" href="./references.html#georgila2010match">Georgila et&nbsp;al., 2010</a>]</td>
                  <td align="left">Spoken</td>
                  <td align="left">Appointment Scheduling</td>
                  <td align="center">14.0</td>
                  <td align="center">447</td>
                  <td align="center">69K*</td>
                  <td align="left">A system for scheduling appointments.</td>
                  <td align="left"><a class="dllink" href="http://groups.inf.ed.ac.uk/match/">Info and Download</a></td>
                </tr>

                <tr>
                  <td align="left">Maluuba Frames<br/>[<a class="tth_citeref" href="./references.html#elframes">El&nbsp;Asri et&nbsp;al., 2017</a>]</td>
                  <td align="left">Chat, QA &amp; Recommendation</td>
                  <td align="left">Travel &amp; Vacation Booking</td>
                  <td align="center">15</td>
                  <td align="center">1,369</td>
                  <td align="center">-</td>
                  <td align="left">For goal-driven dialogue systems. Semantic frames are labeled, and actions taken on a knowledge base are annotated.</td>
                  <td align="left"><a class="dllink" href="https://datasets.maluuba.com/Frames">Info and Download</a></td>
                </tr>
                <tr>
                  <td align="left">Key-Value Retrieval dataset<br/>[<a class="tth_citeref" href="./references.html#kvret">Eric and Manning, 2017</a>]</td>
                  <td align="left">Chat, QA</td>
                  <td align="left">Calendar, Weather, POI navigation</td>
                  <td align="center">5.25</td>
                  <td align="center">3,031</td>
                  <td align="center">-</td>
                  <td align="left">For task-oriented dialogue systems. Intent, slots, and KB are annotated for each session.</td>
                  <td align="left"><a class="dllink" href="https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/">Info and Download</a></td>
                </tr>
                <tr>
                  <td align="left">MultiWOZ<br/>[<a class="tth_citeref" href="./references.html#multiwoz">Budzianowski et&nbsp;al., 2018</a>]</td>
                  <td align="left">Chat, QA, Recommendations</td>
                  <td align="left">Travel</td>
                  <td align="center">14</td>
                  <td align="center">10,438</td>
                  <td align="center">-</td>
                  <td align="left">For goal-driven dialogue systems. Fully labelled on both user and system sides.</td>
                  <td align="left"><a class="dllink" href="http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/">Info and Download</a></td>
                </tr>
              </table>
              <div style="text-align:left; font-size: 14px;">
                <span>Table 1: Human-machine dialogue datasets. Starred (*) numbers are approximated based on the average number of words per utterance.</span>
              </div><span><a id="tab:hc"></a></span>
            </div>
          </div>

          <span><br>
          <br>
                <br> <div class="header"><span>Human-Human Constrained Dialogue Datasets</span></div></span>
          <div style="text-align:left; font-size: 14px;">
    <div class="footnotesize">
      <table border="1">
        <tr>
          <td align="left">Name</td>
          <td align="left">Topics</td>
          <td align="center">Total # of dialogues</td>
          <td align="center" style="width: 5%;">Total # of words</td>
          <td align="center">Total length</td>
          <td align="left">Description</td>
          <td align="left">Links</td>
        </tr>
        <tr>
          <td align="left">HCRC Map Task Corpus [<a class="tth_citation" href="./references.html#anderson1991hcrc" id="CITEanderson1991hcrc">Anderson et&nbsp;al., 1991</a>]</td>
          <td align="left">Map-Reproducing Task</td>
          <td align="center">128</td>
          <td align="center">147k</td>
          <td align="center">18hrs</td>
          <td align="left">Dialogues from the HLAP task, in which speakers must collaborate verbally to reproduce on one participant’s map a route printed on the other’s.</td>
          <td align="left"><a class="dllink" href="http://groups.inf.ed.ac.uk/maptask/">Info and Download</a></td>
        </tr>
        <tr>
          <td align="left">The Walking Around Corpus [<a class="tth_citation" href="./references.html#brennan2013entrainment" id="CITEbrennan2013entrainment">Brennan et&nbsp;al., 2013</a>]</td>
          <td align="left">Location Finding Task</td>
          <td align="center">36</td>
          <td align="center">300k*</td>
          <td align="center">33hrs</td>
          <td align="left">People collaborating over the telephone to find certain locations.</td>
          <td align="left"><a class="dllink" href="https://catalog.ldc.upenn.edu/ldc2015s08">Info and Download</a></td>
        </tr>
        <tr>
          <td align="left">Green Persuasive Database [<a class="tth_citation" href="./references.html#douglas2007humaine" id="CITEdouglas2007humaine">Douglas-Cowie et&nbsp;al., 2007</a>]</td>
          <td align="left">Lifestyle</td>
          <td align="center">8</td>
          <td align="center">35k*</td>
          <td align="center">4hrs</td>
          <td align="left">A persuader with (genuinely) strong pro-green feelings tries to convince persuadees to consider adopting more ‘green’ lifestyles.</td>
          <td align="left"><a class="dllink" href="http://sspnet.eu/2009/12/green-persuasive-database/">Info</a><br/><a class="dllink" href="https://green-persuasive-db.sspnet.eu/">Download</a></td>
        </tr>
        <tr>
          <td align="left">Intelligence Squared Debates [<a class="tth_citation" href="./references.html#zhang2016conversational" id="CITEzhang2016conversational">Zhang et&nbsp;al., 2016</a>]</td>
          <td align="left">Debates</td>
          <td align="center">108</td>
          <td align="center">1.8M</td>
          <td align="center">200hrs*</td>
          <td align="left">Various topics in Oxford-style debates, each constrained to one subject. Audience opinions provided pre- and post-debates.</td>
          <td align="left"><a class="dllink" href="http://tisjune.github.io/research/iq2">Info and Download</a></td>
        </tr>
        <tr>
          <td align="left">The Corpus of Professional Spoken American English [<a class="tth_citation" href="./references.html#barlow2000corpus" id="CITEbarlow2000corpus">Barlow, 2000</a>]</td>
          <td align="left">Politics, Education</td>
          <td align="center">200</td>
          <td align="center">2M</td>
          <td align="center">220hrs*</td>
          <td align="left">Interactions from faculty meetings and White House press conferences.</td>
          <td align="left"><a class="dllink" href="http://www.athel.com/cpsa.html">Info and Download</a><br/>(Download may require purchase.)</td>
        </tr>
        <tr>
          <td align="left">MAHNOB Mimicry Database [<a class="tth_citation" href="./references.html#sun2011multimodal" id="CITEsun2011multimodal">Sun et&nbsp;al., 2011</a>]</td>
          <td align="left">Politics, Games</td>
          <td align="center">54</td>
          <td align="center">100k*</td>
          <td align="center">11hrs</td>
          <td align="left">Two experiments: a discussion on a political topic, and a role-playing game.</td>
          <td align="left"><a class="dllink" href="https://mahnob-db.eu/mimicry/">Info and Download</a></td>
        </tr>
        <tr>
          <td align="left">The IDIAP Wolf Corpus [<a class="tth_citation" href="./references.html#hung2010idiap" id="CITEhung2010idiap">Hung and Chittaranjan, 2010</a>]</td>
          <td align="left">Role-Playing Game</td>
          <td align="center">15</td>
          <td align="center">60k*</td>
          <td align="center">7hrs</td>
          <td align="left">A recording of the Werewolf role-playing game with annotations related to game progress.</td>
          <td align="left"><a class="dllink" href="http://www.idiap.ch/dataset/wolf-database">Info and Download</a></td>
        </tr>
        <tr>
          <td align="left">SEMAINE corpus [<a class="tth_citation" href="./references.html#mckeown2010semaine" id="CITEmckeown2010semaine">McKeown et&nbsp;al., 2010</a>]</td>
          <td align="left">Emotional Conversations</td>
          <td align="center">100</td>
          <td align="center">450k*</td>
          <td align="center">50hrs</td>
          <td align="left">Users were recorded while holding conversations with an operator who adopts roles designed to evoke emotional reactions.</td>
          <td align="left"><a class="dllink" href="https://semaine-db.eu/">Info and Download</a></td>
        </tr>
        <tr>
          <td align="left">DSTC4/DSTC5 Corpora [<a class="tth_citeref" href="./references.html#kim2015dialog">Kim et&nbsp;al., 2015</a>,<a class="tth_citation" href="./references.html#kim2016fifth" id="CITEkim2016fifth">Kim et&nbsp;al., 2016</a>]</td>
          <td align="left">Tourist</td>
          <td align="center">35</td>
          <td align="center">273k</td>
          <td align="center">21hrs</td>
          <td align="left">Tourist information exchange over Skype.</td>
          <td align="left"><a class="dllink" href="http://www.colips.org/workshop/dstc4/">DSTC4</a><br/><br/><a class="dllink" href="http://workshop.colips.org/dstc5/">DSTC5</a><br/><br/>(DSTC4 Training Set with Chinese lang. Test Set)</td>
        </tr>
        <tr>
          <td align="left">Loqui Dialogue Corpus [<a class="tth_citation" href="./references.html#passonneau2014loqui" id="CITEpassonneau2014loqui">Passonneau and Sachar, 2014</a>]</td>
          <td align="left">Library Inquiries</td>
          <td align="center">82</td>
          <td align="center">21K</td>
          <td align="center">140min*</td>
          <td align="left">Telephone interactions between librarians and patrons. Annotated dialogue acts, discussion topics, frames (discourse units), question-answer pairs.</td>
          <td align="left"><a class="dllink" href="https://academiccommons.columbia.edu/catalog/ac:176612">Info and Download</a></td>
        </tr>
        <tr>
          <td align="left">MRDA Corpus [<a class="tth_citation" href="./references.html#shriberg2004icsi" id="CITEshriberg2004icsi">Shriberg et&nbsp;al., 2004</a>]</td>
          <td align="left">ICSI Meetings</td>
          <td align="center">75</td>
          <td align="center">11K*</td>
          <td align="center">72hrs</td>
          <td align="left">Recordings of ICSI meetings. Topics include the ICSI meeting recorder project itself, automatic speech recognition, natural language processing, and neural theories of language. Annotated with dialogue acts, question-answer pairs, and hot spots.</td>
          <td align="left"><a class="dllink" href="http://www1.icsi.berkeley.edu/~ees/dadb/">Info and Download</a></td>
        </tr>
        <tr>
          <td align="left">TRAINS 93 Dialogues Corpus [<a class="tth_citation" href="./references.html#heeman1995trains" id="CITEheeman1995trains">Heeman and Allen, 1995</a>]</td>
          <td align="left">Railroad Freight Route Planning</td>
          <td align="center">98</td>
          <td align="center">55K</td>
          <td align="center">6.5hrs</td>
          <td align="left">Collaborative planning of railroad freight routes.</td>
          <td align="left"><a class="dllink" href="https://www.cs.rochester.edu/research/speech/trains.html">Info and Download</a></td>
        </tr>
        <tr>
          <td align="left">Verbmobil Corpus [<a class="tth_citation" href="./references.html#weilhammer2002multi" id="CITEweilhammer2002multi">Burger et&nbsp;al., 2000</a>]</td>
          <td align="left">Appointment Scheduling</td>
          <td align="center">726</td>
          <td align="center">270K</td>
          <td align="center">38hrs</td>
          <td align="left">Spontaneous speech data collected for the Verbmobil project. Full corpus is in English, German, and Japanese. We only show English statistics.</td>
          <td align="left"><a class="dllink" href="http://verbmobil.dfki.de/facts.html">Info</a><br/><a class="dllink" href="https://www.phonetik.uni-muenchen.de/Bas/BasVM1eng.html">Download I</a><br/><a class="dllink" href="https://www.phonetik.uni-muenchen.de/Bas/BasVM2eng.html">Download II</a></td>
        </tr>
        <tr>
          <td align="left">ICT Rapport Datasets [<a class="tth_citation" href="./references.html#gratch2007rapport" id="CITEgratch2008rapport">Gratch et&nbsp;al., 2007</a>]</td>
          <td align="left">Sexual Harassment Awareness</td>
          <td align="center">165</td>
          <td align="center">N/A</td>
          <td align="center">N/A</td>
          <td align="left">A speaker tells a story to a listener. The listener is asked not to speak during the storytelling. Contains audio-visual data, transcriptions, and annotations.</td>
          <td align="left"><a class="dllink" href="http://rapport.ict.usc.edu/">Info and Download</a></td>
        </tr>
      </table>
      <div style="text-align:left; font-size: 14px;">
        <span>Table 2: Human-human constrained spoken dialogue datasets. Starred (*) numbers are estimates based on the <a style="text-decoration: underline;" href="http://www.ncvs.org/ncvs/tutorials/voiceprod/tutorial/quality.html">average rate of English speech from the National Center for Voice and Speech</a>.</span>
      </div><a id="tab:hh_constrained_spoken"></a>
    </div>
  </div>

  <span><br>
  <br>
        <br> <div class="header"><span>Human-Human Spontaneous Dialogue Datasets</span></div></span>
  <div style="text-align:left; font-size: 14px;">
      <div class="footnotesize">
        <table border="1">
          <tr>
            <td align="left">Name</td>
            <td align="left">Topics</td>
            <td align="center">Total # of dialogues</td>
            <td align="center" style="width: 5%;">Total # of words</td>
            <td align="center">Total length</td>
            <td align="left">Description</td>
            <td align="left">Links</td>
          </tr>
          <tr>
            <td align="left">
              Switchboard [<a class="tth_citation" href="./references.html#godfrey1992switchboard" id="CITEgodfrey1992switchboard">Godfrey et&nbsp;al., 1992</a>]
            </td>
            <td align="left">Casual Topics</td>
            <td align="center">2,400</td>
            <td align="center">3M</td>
            <td align="center">300hrs*</td>
            <td align="left">Telephone conversations on pre-specified topics</td>
            <td align="left"><a class="dllink" href="http://groups.inf.ed.ac.uk/switchboard/">Info and Download</a></td>

          </tr>
          <tr>
            <td align="left">British National Corpus (BNC) [<a class="tth_citation" href="./references.html#leech1992100" id="CITEleech1992100">Leech, 1992</a>]</td>
            <td align="left">Casual Topics</td>
            <td align="center">854</td>
            <td align="center">10M</td>
            <td align="center">1,000hrs*</td>
            <td align="left">British dialogues from many contexts, ranging from formal business or government meetings to radio shows and phone-ins.</td>
            <td align="left"><a class="dllink" href="http://www.natcorp.ox.ac.uk/">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">CALLHOME American English Speech [<a class="tth_citation" href="./references.html#canavan1997callhome" id="CITEcanavan1997callhome">Canavan et&nbsp;al., 1997</a>]</td>
            <td align="left">Casual Topics</td>
            <td align="center">120</td>
            <td align="center">540k*</td>
            <td align="center">60hrs</td>
            <td align="left">Telephone conversations between family members or close friends.</td>
            <td align="left"><a class="dllink" href="https://ca.talkbank.org/access/CallHome/eng.html">Info and Download</a></td>
          </tr>

          <tr>
            <td align="left">CALLFRIEND American English Non-Southern Dialect [<a class="tth_citation" href="./references.html#canavan1996callfriend" id="CITEcanavan1996callfriend">Canavan and Zipperlen, 1996</a>]</td>
            <td align="left">Casual Topics</td>
            <td align="center">60</td>
            <td align="center">180k*</td>
            <td align="center">20hrs</td>
            <td align="left">Telephone conversations between Americans with a non-Southern accent.</td>
            <td align="left"><a class="dllink" href="https://catalog.ldc.upenn.edu/LDC96S46">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">The Bergen Corpus of London Teenage Language [<a class="tth_citation" href="./references.html#haslerud1995bergen" id="CITEhaslerud1995bergen">Haslerud and Stenstr&#246;m, 1995</a>]</td>
            <td align="left">Unrestricted</td>
            <td align="center">100</td>
            <td align="center">500k</td>
            <td align="center">55hrs</td>
            <td align="left">Spontaneous teenage talk recorded in 1993. Conversations were recorded secretly.</td>
            <td align="left"><a class="dllink" href="http://clu.uni.no/icame/colt/">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">The Cambridge and Nottingham Corpus of Discourse in English [<a class="tth_citation" href="./references.html#mccarthy1998spoken" id="CITEmccarthy1998spoken">McCarthy, 1998</a>]</td>
            <td align="left">Casual Topics</td>
            <td align="center">-</td>
            <td align="center">5M</td>
            <td align="center">550hrs*</td>
            <td align="left">British dialogues from a wide variety of informal contexts, such as hair salons, restaurants, etc.</td>
            <td align="left"><a class="dllink" href="http://www.cambridge.org/us/cambridgeenglish/about-cambridge-english/cambridge-english-corpus">Info and Download</a><br/>Note: CANCODE is a subset of the Cambridge English Corpus.</td>
          </tr>
          <tr>
            <td align="left">D64 Multimodal Conversation Corpus [<a class="tth_citation" href="./references.html#oertel2013d64" id="CITEoertel2013d64">Oertel et&nbsp;al., 2013</a>]</td>
            <td align="left">Unrestricted</td>
            <td align="center">2</td>
            <td align="center">70k*</td>
            <td align="center">8hrs</td>
            <td align="left">Several hours of natural interaction between a group of people.</td>
            <td align="left">Contact corpus authors for data.</td>
          </tr>
          <tr>
            <td align="left">AMI Meeting Corpus [<a class="tth_citation" href="./references.html#renals2007recognition" id="CITErenals2007recognition">Renals et&nbsp;al., 2007</a>]</td>
            <td align="left">Meetings</td>
            <td align="center">175</td>
            <td align="center">900k*</td>
            <td align="center">100hrs</td>
            <td align="left">Face-to-face meeting recordings.</td>
            <td align="left"><a class="dllink" href="http://groups.inf.ed.ac.uk/ami/download/">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">Cardiff Conversation Database (CCDb) [<a class="tth_citation" href="./references.html#aubrey2013cardiff" id="CITEaubrey2013cardiff">Aubrey et&nbsp;al., 2013</a>]</td>
            <td align="left">Unrestricted</td>
            <td align="center">30</td>
            <td align="center">20k*</td>
            <td align="center">150min</td>
            <td align="left">Audio-visual database with unscripted natural conversations, including visual annotations.</td>
            <td align="left"><a class="dllink" href="http://www.cs.cf.ac.uk/ccdb/">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">4D Cardiff Conversation Database (4D CCDb) [<a class="tth_citation" href="./references.html#vandeventer20154d" id="CITEvandeventer20154d">Vandeventer et&nbsp;al., 2015</a>]</td>
            <td align="left">Unrestricted</td>
            <td align="center">17</td>
            <td align="center">2.5k*</td>
            <td align="center">17min</td>
            <td align="left">A version of the CCDb with 3D video.</td>
            <td align="left"><a class="dllink" href="http://www.cs.cf.ac.uk/ccdb/">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">The Diachronic Corpus of Present-Day Spoken English [<a class="tth_citation" href="./references.html#aarts2006diachronic" id="CITEaarts2006diachronic">Aarts and Wallis, 2006</a>]</td>
            <td align="left">Casual Topics</td>
            <td align="center">280</td>
            <td align="center">800k</td>
            <td align="center">80hrs*</td>
            <td align="left">Selection of face-to-face, telephone, and public discussion dialogue from Britain.</td>
            <td align="left"><a class="dllink" href="http://www.ucl.ac.uk/english-usage/projects/dcpse/">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">The Spoken Corpus of the Survey of English Dialects [<a class="tth_citation" href="./references.html#beare1999spoken" id="CITEbeare1999spoken">Beare and Scott, 1999</a>]</td>
            <td align="left">Casual Topics</td>
            <td align="center">314</td>
            <td align="center">800k</td>
            <td align="center">60hrs</td>
            <td align="left">Dialogue of people aged 60 or above talking about their memories, families, work and the folklore of the countryside from a century ago.</td>
            <td align="left"><a class="dllink" href="http://www2.iath.virginia.edu/ach-allc.99/proceedings/scott.html">Info</a><br/>Contact corpus authors for download.</td>
          </tr>
          <tr>
            <td align="left">The Child Language Data Exchange System [<a class="tth_citation" href="./references.html#macwhinney1985child" id="CITEmacwhinney1985child">MacWhinney and Snow, 1985</a>]</td>
            <td align="left">Unrestricted</td>
            <td align="center">11K</td>
            <td align="center">10M</td>
            <td align="center">1,000hrs*</td>
            <td align="left">International database organized for the study of first and second language acquisition.</td>
            <td align="left"><a class="dllink" href="http://childes.psy.cmu.edu/">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">The Charlotte Narrative and Conversation Collection (CNCC) [<a class="tth_citation" href="./references.html#reppen2004american" id="CITEreppen2004american">Reppen and Ide, 2004</a>]</td>
            <td align="left">Casual Topics</td>
            <td align="center">95</td>
            <td align="center">20K</td>
            <td align="center">2hrs*</td>
            <td align="left">Narratives, conversations and interviews representative of the residents of Mecklenburg County, North Carolina.</td>
            <td align="left"><a class="dllink" href="http://nsv.uncc.edu/nsv/narratives/">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">The Group Affect and Performance (GAP) Corpus [<a class="tth_citation" href="./references.html#braley18gap" id="CITEbraley18gap">Braley and Murray, 2018</a>]</td>
            <td align="left">Survival</td>
            <td align="center">28</td>
            <td align="center">70K</td>
            <td align="center">4hrs+</td>
            <td align="left">A winter survival task.</td>
            <td align="left"><a class="dllink" href="https://sites.google.com/view/gap-corpus/home">Info and Download</a></td>
          </tr>
          <tr>
            <td align="left">The MULTISIMO Corpus [<a class="tth_citation" href="./references.html#koutsomm2018multi" id="CITEkoutsomm2018multi">Koutsombogera and Vogel, 2018</a>]</td>
            <td align="left">Game</td>
            <td align="center">18</td>
            <td align="center">26K</td>
            <td align="center">3hrs+</td>
            <td align="left">Family Feud-like game.</td>
            <td align="left"><a class="dllink" href="http://multisimo.eu/datasets.html">Info and Download</a></td>
          </tr>
        </table>
        <div style="text-align:left; font-size: 14px;">
          Table 3: Human-human spontaneous spoken dialogue datasets. Starred (*) numbers are estimates based on the <a style="text-decoration: underline;" href="http://www.ncvs.org/ncvs/tutorials/voiceprod/tutorial/quality.html">average rate of English speech from the National Center for Voice and Speech</a>.
        </div><a id="tab:hh_spontaneous_spoken"></a>
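        <p>As a rough illustration of how the starred values in Table 3 can be derived, the sketch below converts between recording length and word count. The caption does not state the speaking rate; 150 words per minute is an assumption inferred from the table itself (for example, 60 hours and 540k words for CALLHOME). The function names are hypothetical.</p>
<pre>
```python
# Assumed average speaking rate in words per minute.
# This value is inferred from the table rows, not quoted from the caption.
WORDS_PER_MINUTE = 150


def estimate_words(hours):
    """Estimate a word count from a recording length given in hours."""
    return round(hours * 60 * WORDS_PER_MINUTE)


def estimate_hours(words):
    """Estimate a recording length in hours from a word count."""
    return words / (WORDS_PER_MINUTE * 60)


if __name__ == "__main__":
    print(estimate_words(60))          # CALLHOME lists 60hrs of audio
    print(estimate_hours(5_000_000))   # CANCODE lists 5M words
```
</pre>
        <p>Under this assumption the CALLHOME and CALLFRIEND rows are reproduced exactly, while rows such as CANCODE appear to have been rounded.</p>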
      </div>
    </div>

    <span><br>
    <br>
          <br> <div class="header"><span>Human-Human Scripted Dialogue Datasets</span></div></span>
    <div style="text-align:left; font-size: 14px;">
  <div class="footnotesize">
    <table border="1">
      <tr>
        <td align="left">Name</td>
        <td align="left">Topics</td>
        <td align="center">Total # of utterances</td>
        <td align="center">Total # of dialogues</td>
        <td align="center">Total # of works</td>
        <td align="center" style="width: 5%;">Total # of words</td>
        <td align="left">Description</td>
        <td align="left">Links</td>
      </tr>
      <tr>
        <td align="left">Movie-DiC [<a class="tth_citation" href="./references.html#banchs2012movie" id="CITEbanchs2012movie">Banchs, 2012</a>]</td>
        <td align="left">Movie dialogues</td>
        <td align="center">764k</td>
        <td align="center">132K</td>
        <td align="center">753</td>
        <td align="center">6M</td>
        <td align="left">Movie scripts of American films.</td>
        <td align="left">Contact corpus authors for data.</td>
      </tr>
      <tr>
        <td align="left">Movie-Triples [<a class="tth_citeref" href="./references.html#2015arXiv150704808S">Serban et&nbsp;al., 2016</a>]</td>
        <td align="left">Movie dialogues</td>
        <td align="center">736k</td>
        <td align="center">245K</td>
        <td align="center">614</td>
        <td align="center">13M</td>
        <td align="left">Triples of utterances which are filtered to come from X-Y-X triples.</td>
        <td align="left">Contact corpus authors for data.</td>
      </tr>
      <tr>
        <td align="left">Film Scripts Online Series</td>
        <td align="left">Movie scripts</td>
        <td align="center">1M*</td>
        <td align="center">263K</td>
        <td align="center">1,500</td>
        <td align="center">16M*</td>
        <td align="left">Two subsets of scripts (1000 American films and 500 mixed British/American films).</td>
        <td align="left"><a class="dllink" href="https://alexanderstreet.com/products/film-scripts-online-series">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Cornell Movie-Dialogue Corpus [<a class="tth_citation" href="./references.html#Danescu-Niculescu-Mizil+Lee:11a" id="CITEDanescu-Niculescu-Mizil+Lee:11a">Danescu-Niculescu-Mizil and Lee, 2011</a>]</td>
        <td align="left">Movie dialogues</td>
        <td align="center">305K</td>
        <td align="center">220K</td>
        <td align="center">617</td>
        <td align="center">9M*</td>
        <td align="left">Short conversations from film scripts, annotated with character metadata.</td>
        <td align="left"><a class="dllink" href="https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Filtered Movie Script Corpus [<a class="tth_citation" href="./references.html#nio2014conversation" id="CITEnio2014conversation">Nio et&nbsp;al., 2014</a>]</td>
        <td align="left">Movie dialogues</td>
        <td align="center">173k</td>
        <td align="center">86K</td>
        <td align="center">1,786</td>
        <td align="center">2M*</td>
        <td align="left">Triples of utterances which are filtered to come from X-Y-X triples.</td>
        <td align="left"><a class="dllink" href="http://ahclab.naist.jp/resource/dialog_corpora/resources.html">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">American Soap Opera Corpus [<a class="tth_citation" href="./references.html#davies2012corpus" id="CITEdavies2012corpus">Davies, 2012b</a>]</td>
        <td align="left">TV show scripts</td>
        <td align="center">10M*</td>
        <td align="center">1.2M</td>
        <td align="center">22,000</td>
        <td align="center">100M</td>
        <td align="left">Transcripts of American soap operas.</td>
        <td align="left"><a class="dllink" href="http://corpus.byu.edu/soap/">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">TVD Corpus [<a class="tth_citation" href="./references.html#roy2014tvd" id="CITEroy2014tvd">Roy et&nbsp;al., 2014</a>]</td>
        <td align="left">TV show scripts</td>
        <td align="center">60k*</td>
        <td align="center">10K</td>
        <td align="center">191</td>
        <td align="center">600k*</td>
        <td align="left">TV scripts from a comedy (Big Bang Theory) and drama (Game of Thrones) show.</td>
        <td align="left"><a class="dllink" href="http://tvd.niderb.fr/corpus/">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Character Style from Film Corpus [<a class="tth_citation" href="./references.html#walker2012annotated" id="CITEwalker2012annotated">Walker et&nbsp;al., 2012a</a>]</td>
        <td align="left">Movie scripts</td>
        <td align="center">664k</td>
        <td align="center">151K</td>
        <td align="center">862</td>
        <td align="center">9.6M</td>
        <td align="left">Scripts from IMSDb, annotated for linguistic structures and character archetypes.</td>
        <td align="left">Contact corpus authors for data.</td>
      </tr>
      <tr>
        <td align="left">SubTle Corpus [<a class="tth_citation" href="./references.html#ameixa2013subtitles" id="CITEameixa2013subtitles">Ameixa and Coheur, 2013</a>]</td>
        <td align="left">Movie subtitles</td>
        <td align="center">6.7M</td>
        <td align="center">3.35M</td>
        <td align="center">6,184</td>
        <td align="center">20M</td>
        <td align="left">Aligned interaction-response pairs from movie subtitles.</td>
        <td align="left">Contact corpus authors for data.</td>
      </tr>
      <tr>
        <td align="left">OpenSubtitles [<a class="tth_citation" href="./references.html#tiedemann2012parallel" id="CITEtiedemann2012parallel">Tiedemann, 2012</a>]</td>
        <td align="left">Movie subtitles</td>
        <td align="center">140M*</td>
        <td align="center">36M</td>
        <td align="center">207,907</td>
        <td align="center">1B</td>
        <td align="left">Movie subtitles which are not speaker-aligned.</td>
        <td align="left"><a class="dllink" href="http://opus.lingfil.uu.se/OpenSubtitles2016.php">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">CED (1560-1760) Corpus [<a class="tth_citation" href="./references.html#kyto2006guide" id="CITEkyto2006guide">Kyt&#246; and Walker, 2006</a>]</td>
        <td align="left">Written Works &amp; Trial Proceedings</td>
        <td align="center">-</td>
        <td align="center">-</td>
        <td align="center">177</td>
        <td align="center">1.2M</td>
        <td align="left">Various scripted fictional works from (1560-1760) as well as court trial proceedings.</td>
        <td align="left"><a class="dllink" href="http://www.engelska.uu.se/forskning/engelska-spraket/elektroniska-resurser/a-corpus">Info and Download</a></td>
      </tr>
    </table>
    <div style="text-align:left; font-size: 14px;">
      Table 4: Human-human scripted dialogue datasets. Quantities denoted with () indicate estimates based on average dialogues per movie seen in [<a class="tth_citeref" href="./references.html#banchs2012movie">Banchs, 2012</a>] and the number of scripts or works. Dialogues may not be explicitly separated in these datasets. TV show datasets were adjusted based on the ratio of average film runtime (112 minutes) to average TV show runtime (36 minutes). This data was scraped from the IMDb database (<span class="small"><a href="http://www.imdb.com/interfaces"><tt>http://www.imdb.com/interfaces</tt></a>). Starred (*) quantities are estimated based on the average number of words and utterances per film, and the average lengths of films and TV shows. These estimates are derived from the Tameri Guide for Writers (<span class="small"><a href="http://www.tameri.com/format/wordcounts.html"><tt>http://www.tameri.com/format/wordcounts.html</tt></a>).</span></span>
    </div><a id="tab:hh_scripted"></a>
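    <p>The dialogue-count estimate described in the Table 4 caption can be sketched as follows. This is a reconstruction under stated assumptions, not the authors' script: it takes the dialogues-per-film average from the Movie-DiC row (132K dialogues over 753 films) and, for TV shows, scales by the 36/112 runtime ratio quoted in the caption. The function name and the <tt>is_tv</tt> flag are hypothetical.</p>
<pre>
```python
# Figures from the Movie-DiC row of Table 4 (Banchs, 2012).
BANCHS_DIALOGUES = 132_000
BANCHS_FILMS = 753

# Average runtimes quoted in the caption, in minutes.
FILM_MINUTES = 112
TV_MINUTES = 36


def estimate_dialogues(num_works, is_tv=False):
    """Estimate the number of dialogues in a corpus of scripts.

    Assumes every film contains the Movie-DiC average number of
    dialogues, and that a TV episode contains proportionally fewer
    according to the runtime ratio.
    """
    dialogues_per_film = BANCHS_DIALOGUES / BANCHS_FILMS
    scale = TV_MINUTES / FILM_MINUTES if is_tv else 1.0
    return num_works * dialogues_per_film * scale


if __name__ == "__main__":
    # Film Scripts Online Series: 1,500 scripts.
    print(round(estimate_dialogues(1500)))
    # American Soap Opera Corpus: 22,000 transcripts.
    print(round(estimate_dialogues(22000, is_tv=True)))
```
</pre>
    <p>With these inputs the results land close to the 263K and 1.2M figures listed in the table, which suggests the reconstruction is at least consistent with the published values.</p>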
  </div>
</div>
<span><br>
<br>
      <br> <div class="header"><span>Human-Human Written Dialogue Datasets</span></div></span>
<div style="text-align:left; font-size: 14px;">
  <div class="footnotesize">
    <table border="1">
      <tr>
        <td align="left">Name</td>
        <td align="left">Type</td>
        <td align="left">Topics</td>
        <td align="center">Avg. # of turns</td>
        <td align="center">Total # of dialogues</td>
        <td align="center" style="width: 5%;">Total # of words</td>
        <td align="left">Description</td>
        <td align="left">Links</td>
      </tr>
      <tr>
        <td align="left">NPS Chat Corpus [<a class="tth_citation" href="./references.html#forsyth2007lexical" id="CITEforsyth2007lexical">Forsyth and Martell, 2007</a>]</td>
        <td align="left">Chat</td>
        <td align="left">Unrestricted</td>
        <td align="center">&nbsp;704</td>
        <td align="center">15</td>
        <td align="center">100M</td>
        <td align="left">Posts from age-specific online chat rooms.</td>
        <td align="left"><a class="dllink" href="http://faculty.nps.edu/cmartell/NPSChat.htm">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Twitter Corpus [<a class="tth_citation" href="./references.html#Ritter:2010:UMT:1857999.1858019" id="CITERitter:2010:UMT:1857999.1858019">Ritter et&nbsp;al., 2010</a>]</td>
        <td align="left">Microblog</td>
        <td align="left">Unrestricted</td>
        <td align="center">2</td>
        <td align="center">1.3M</td>
        <td align="center">&nbsp;125M</td>
        <td align="left">Tweets and replies extracted from Twitter.</td>
        <td align="left">Contact corpus authors for data.</td>
      </tr>
      <tr>
        <td align="left">Twitter Triple Corpus [<a class="tth_citeref" href="./references.html#sordoni2015aneural">Sordoni et&nbsp;al., 2015</a>]</td>
        <td align="left">Microblog</td>
        <td align="left">Unrestricted</td>
        <td align="center">3</td>
        <td align="center">4,232</td>
        <td align="center">&nbsp;65K</td>
        <td align="left">A-B-A triples extracted from Twitter.</td>
        <td align="left"><a class="dllink" href="https://www.microsoft.com/en-us/download/details.aspx?id=52375">Info and Download</a></td>
      </tr>

      <tr>
        <td align="left">UseNet Corpus [<a class="tth_citation" href="./references.html#shaoul2009usenet" id="CITEshaoul2009usenet">Shaoul and Westbury, 2009</a>]</td>
        <td align="left">Microblog</td>
        <td align="left">Unrestricted</td>
        <td align="center">&nbsp;687</td>
        <td align="center">47,860</td>
        <td align="center">&nbsp;7B</td>
        <td align="left">UseNet forum postings.</td>
        <td align="left"><a class="dllink" href="http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html">Info and Download</a></td>
      </tr>

      <tr>
        <td align="left">NUS SMS Corpus [<a class="tth_citation" href="./references.html#chen2013creating" id="CITEchen2013creating">Chen and Kan, 2013</a>]</td>
        <td align="left">SMS messages</td>
        <td align="left">Unrestricted</td>
        <td align="center">&nbsp;18</td>
        <td align="center">&nbsp;3K</td>
        <td align="center">580,668*<sup><span style="font-size:x-small"><sup>[<u>&#175;</u>]</sup></span></sup></td>
        <td align="left">SMS messages collected between two users, with timing analysis.</td>
        <td align="left"><a class="dllink" href="http://www.comp.nus.edu.sg/~nlp/corpora.html">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Reddit Domestic Abuse Corpus [<a class="tth_citation" href="./references.html#ray4183analysis" id="CITEray4183analysis">Schrading et&nbsp;al., 2015</a>]</td>
        <td align="left">Forum</td>
        <td align="left">Abuse help</td>
        <td align="center">17.53</td>
        <td align="center">21,133</td>
        <td align="center">19M-103M <sup>&#9651;</sup></td>
        <td align="left">Reddit posts from either domestic abuse subreddits, or general chat.</td>
        <td align="left"><a class="dllink" href="http://nicschrading.com/data/">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Reddit All Comments Corpus</td>
        <td align="left">Forum</td>
        <td align="left">General</td>
        <td align="center">--</td>
        <td align="center">--</td>
        <td align="center">--</td>
        <td align="left">1.7 Billion Reddit comments.</td>
        <td align="left"><a class="dllink" href="https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Settlers of Catan [<a class="tth_citation" href="./references.html#afantenos2012developing" id="CITEafantenos2012developing">Afantenos et&nbsp;al., 2012</a>]</td>
        <td align="left">Chat</td>
        <td align="left">Game terms</td>
        <td align="center">&nbsp;95</td>
        <td align="center">21</td>
        <td align="center">-</td>
        <td align="left">Conversations between players in the game 'Settlers of Catan'.</td>
        <td align="left"><a class="dllink" href="http://settlers.inf.ed.ac.uk/socl/">Info</a><br/><br/>Contact corpus authors for download. </td>
      </tr>
      <tr>
        <td align="left">Cards Corpus [<a class="tth_citation" href="./references.html#djalali2012corpus" id="CITEdjalali2012corpus">Djalali et&nbsp;al., 2012</a>]</td>
        <td align="left">Chat</td>
        <td align="left">Game terms</td>
        <td align="center">38.1</td>
        <td align="center">1,266</td>
        <td align="center">282K</td>
        <td align="left">Conversations between players playing 'Cards world'.</td>
        <td align="left"><a class="dllink" href="http://cardscorpus.christopherpotts.net/">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Agreement in Wikipedia Talk Pages [<a class="tth_citation" href="./references.html#andreas2012annotating" id="CITEandreas2012annotating">Andreas et&nbsp;al., 2012</a>]</td>
        <td align="left">Forum</td>
        <td align="left">Unrestricted</td>
        <td align="center">2</td>
        <td align="center">822</td>
        <td align="center">110K</td>
        <td align="left">LiveJournal and Wikipedia Discussions forum threads. Agreement type and level annotated.</td>
        <td align="left"><a class="dllink" href="http://www.cs.columbia.edu/~sara/data.php">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Agreement by Create Debaters [<a class="tth_citation" href="./references.html#rosenthal2015couldn" id="CITErosenthal2015couldn">Rosenthal and McKeown, 2015</a>]</td>
        <td align="left">Forum</td>
        <td align="left">Unrestricted</td>
        <td align="center">2</td>
        <td align="center">10K</td>
        <td align="center">1.4M</td>
        <td align="left">Create Debate forum conversations. Annotated with the type of agreement (e.g. paraphrase) or disagreement.</td>
        <td align="left"><a class="dllink" href="http://www.cs.columbia.edu/~sara/data.php">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Internet Argument Corpus [<a class="tth_citation" href="./references.html#walker2012corpus" id="CITEwalker2012corpus">Walker et&nbsp;al., 2012b</a>]</td>
        <td align="left">Forum</td>
        <td align="left">Politics</td>
        <td align="center">&nbsp;35.45</td>
        <td align="center">&nbsp;11K</td>
        <td align="center">&nbsp;73M</td>
        <td align="left">Debates about specific political or moral positions. A separate corpus (Argumentative Summary Corpus, [<a class="tth_citation" href="./references.html#misra2015argsummary" id="CITEmisra2015argsummary">Misra et&nbsp;al., 2015</a>]) annotates a subset of this corpus with summaries of the arguments.</td>
        <td align="left"><a class="dllink" href="https://nlds.soe.ucsc.edu/iac2">Info and Download</a><br/><a class="dllink" href="https://nlds.soe.ucsc.edu/node/30">Argument Summary Corpus</a></td>
      </tr>
      <tr>
        <td align="left">MPC Corpus [<a class="tth_citation" href="./references.html#shaikh2010mpc" id="CITEshaikh2010mpc">Shaikh et&nbsp;al., 2010</a>]</td>
        <td align="left">Chat</td>
        <td align="left">Social tasks</td>
        <td align="center">520</td>
        <td align="center">14</td>
        <td align="center">58K</td>
        <td align="left">Conversations about general, political, and interview topics.</td>
        <td align="left"><a class="dllink" href="https://github.com/sashank06/MPC-Corpus">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Ubuntu Dialogue Corpus [<a class="tth_citeref" href="./references.html#lowe2015ubuntu">Lowe et&nbsp;al., 2015a</a>]</td>
        <td align="left">Chat</td>
        <td align="left">Ubuntu Operating System</td>
        <td align="center">7.71</td>
        <td align="center">930K</td>
        <td align="center">100M</td>
        <td align="left">Dialogues extracted from Ubuntu chat stream on IRC.</td>
        <td align="left"><a class="dllink" href="https://github.com/rkadlec/ubuntu-ranking-dataset-creator">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Ubuntu Chat Corpus [<a class="tth_citation" href="./references.html#uthus2013ubuntu" id="CITEuthus2013ubuntu">Uthus and Aha, 2013</a>]</td>
        <td align="left">Chat</td>
        <td align="left">Ubuntu Operating System</td>
        <td align="center">&nbsp;3381.6</td>
        <td align="center">10,665</td>
        <td align="center">&nbsp;2B*<sup><span style="font-size:x-small"><sup>[<u>&#175;</u>]</sup></span></sup></td>
        <td align="left">Chat stream scraped from IRC logs (no dialogues extracted).</td>
        <td align="left"><a class="dllink" href="http://daviduthus.org/UCC/">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">Movie Dialog Dataset [<a class="tth_citeref" href="./references.html#dodge2015evaluating">Dodge et&nbsp;al., 2015</a>]</td>
        <td align="left">Chat, QA &amp; Recommendation</td>
        <td align="left">Movies</td>
        <td align="center">&nbsp;3.3</td>
        <td align="center">&nbsp;3.1M<sup>&#9660;</sup></td>
        <td align="center">&nbsp;185M</td>
        <td align="left">For goal-driven dialogue systems. Includes movie metadata as knowledge triples.</td>
        <td align="left"><a class="dllink" href="https://research.fb.com/downloads/babi/">Info and Download</a></td>
      </tr>
      <tr>
        <td align="left">DailyDialog Dataset [<a class="tth_citeref" href="./references.html#li2017daily">Li et&nbsp;al., 2017</a>]</td>
        <td align="left">Chat</td>
        <td align="left">Daily Life</td>
        <td align="center">&nbsp;7.9</td>
        <td align="center">&nbsp;13K</td>
        <td align="center">&nbsp;1.5M</td>
        <td align="left">Conversations extracted from English language educational texts. Labeled with emotions.</td>
        <td align="left"><a class="dllink" href="http://yanran.li/dailydialog">Info and Download</a></td>
      </tr>
    </table>
    <div style="text-align:left; font-size: 14px;">
      Table 5: <span class="small">Human-human written dialogue datasets. Starred (*) quantities are computed using word counts based on spaces (i.e. a word must be a sequence of characters preceded and followed by a space), but for certain corpora, such as IRC and SMS corpora, proper English words are sometimes concatenated together due to slang usage. Triangle (<sup>&#9651;</sup>) indicates lower and upper bounds computed using average words per utterance estimated on a similar Reddit corpus <a class="tth_citation" href="./references.html#schrading2015analyzing" id="CITEschrading2015analyzing">Schrading [2015]</a>. Square (<sup><span style="font-size:x-small"><sup>[<u>&#175;</u>]</sup></span></sup>) indicates estimates based only on the English part of the corpus. Note that 2.1M dialogues from the Movie Dialog dataset (<sup>&#9660;</sup>) are in the form of simulated QA pairs. Dialogs indicated by () are contiguous blocks of recorded conversation in a multi-participant chat. In the case of UseNet, we note the total number of newsgroups and find the average turns as the average number of posts collected per newsgroup. () indicates an estimate based on a Twitter dataset of similar size and refers to tokens as well as words.</span>
    </div>
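    <p>The space-based word count mentioned in the Table 5 caption can be illustrated with a minimal sketch. It is not the authors' counting script; the function name and sample strings are hypothetical, and Python's <tt>str.split()</tt> splits on any run of whitespace, which is slightly broader than the single-space definition in the caption.</p>
<pre>
```python
def count_words(text):
    """Count whitespace-delimited tokens in a message."""
    return len(text.split())


if __name__ == "__main__":
    # Standard spelling: four separate words.
    print(count_words("see you later tonight"))  # 4
    # Chat slang fuses "see you" into one token, so the count drops.
    print(count_words("cya l8r 2nite"))  # 3
```
</pre>
    <p>The second call shows why such counts can understate the number of English words in IRC and SMS corpora, where several words are often written as one token.</p>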
  </div>
</div>

                      </div>
                    </div>
                </div>
            </div>
        </div>

    </div>

    <script src="./tfthing_files/underscore.min.js"></script>
    <script src="./tfthing_files/backbone.min.js"></script>
    <script src="./tfthing_files/d3.v3.min.js"></script>
    <script sync src="https://platform.twitter.com/widgets.js"></script>
    <script src="./tfthing_files/tweets.js"></script>
</body>

</html>