

# conversational-datasets
A collection of large datasets for conversational response selection.
This repository provides tools to create reproducible datasets for training and evaluating models of conversational response. This includes:
- Reddit - 3.7 billion comments structured in threaded conversations
- OpenSubtitles - over 400 million lines from movie and television subtitles (available in English and other languages)
- Amazon QA - over 3.6 million question-response pairs in the context of Amazon products
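In response selection, a model is given a conversational context and must pick the best response from a pool of candidates. The sketch below illustrates the task shape only; the word-overlap scorer is a toy stand-in for a trained model, and none of these names come from this repository:

```python
# Toy illustration of conversational response selection: score each
# (context, candidate) pair and return the highest-scoring candidate.
# A real system would replace `score` with a trained encoder model.

def score(context: str, response: str) -> float:
    """Toy relevance score: fraction of context words shared with the response."""
    context_words = set(context.lower().split())
    response_words = set(response.lower().split())
    if not context_words:
        return 0.0
    return len(context_words & response_words) / len(context_words)

def select_response(context: str, candidates: list) -> str:
    """Return the candidate response that scores highest for this context."""
    return max(candidates, key=lambda r: score(context, r))

example_context = "what time does the store open"
example_candidates = [
    "the store opens at nine",
    "i like pizza",
    "it is raining today",
]
print(select_response(example_context, example_candidates))
# → the store opens at nine
```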
Machine learning methods work best with large datasets such as these. At PolyAI we train models of conversational response on huge conversational datasets, and then adapt these models to domain-specific tasks in conversational AI. This general approach of pre-training large models on huge datasets has long been popular in the computer vision community, and is now taking off in NLP.
Rather than providing the raw processed data, we provide scripts and instructions to generate the data yourself. This allows you to view and potentially manipulate the pre-processing and filtering. The instructions define standard datasets, with deterministic train/test splits, which can be used to define reproducible evaluations in research papers.
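A common way to make a train/test split deterministic is to hash a stable key for each example, so that every run of the pipeline assigns each example to the same split without any random state. The sketch below shows the general technique; the choice of MD5, the key, and the 10% test fraction are illustrative assumptions, not necessarily what these scripts do:

```python
import hashlib

def assign_split(key: str, test_fraction_percent: int = 10) -> str:
    """Deterministically assign an example to 'train' or 'test'.

    Hashing a stable key (e.g. an example's context text) into one of
    100 buckets gives a reproducible split: re-running the pipeline on
    the same data always produces the same train/test assignment.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "test" if bucket < test_fraction_percent else "train"

# The same key always lands in the same split:
print(assign_split("hello world") == assign_split("hello world"))
# → True
```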
## Datasets
Each dataset has its own directory, which contains a dataflow script, instructions for running it, and unit tests.
Note that these are the dataset sizes after filtering and other processing. For instance, the Reddit dataset is based on a raw database of 3.7 billion comments, but consists of 726 million examples, because the script filters out long comments, short comments, uninformative comments (such as '[deleted]'), and comments with no replies.
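The comment-level part of that filtering can be sketched as a simple predicate. The length thresholds and placeholder strings below are assumptions for illustration, not the exact values in the Reddit script, and the "no replies" condition is omitted because it requires thread structure rather than a single comment:

```python
def keep_comment(text: str, min_chars: int = 9, max_chars: int = 127) -> bool:
    """Illustrative filter in the spirit described above: drop comments
    that are uninformative placeholders, too short, or too long.
    Thresholds and marker strings are assumed, not taken from the repo."""
    if text in ("[deleted]", "[removed]"):
        return False
    if len(text) < min_chars or len(text) > max_chars:
        return False
    return True

comments = [
    "[deleted]",
    "ok",
    "This looks like a reasonable informative comment.",
]
print([c for c in comments if keep_comment(c)])
# → ['This looks like a reasonable informative comment.']
```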