

# conversational-datasets
A collection of large datasets for conversational response selection.
This repository provides tools to create reproducible datasets for training and evaluating models of conversational response. This includes:
- Reddit - 3.7 billion comments structured in threaded conversations
- OpenSubtitles - over 400 million lines from movie and television subtitles (available in English and other languages)
- Amazon QA - over 3.6 million question-response pairs in the context of Amazon products
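In response selection, a model is given a conversational context and must pick the best response from a pool of candidates. The sketch below illustrates the task shape only; the word-overlap scorer is a toy stand-in for a trained model, and none of these names come from this repository:

```python
# Toy illustration of conversational response selection: score each
# (context, candidate) pair and return the highest-scoring candidate.
# A real system would replace `score` with a trained encoder model.

def score(context: str, response: str) -> float:
    """Toy relevance score: fraction of context words shared with the response."""
    context_words = set(context.lower().split())
    response_words = set(response.lower().split())
    if not context_words:
        return 0.0
    return len(context_words & response_words) / len(context_words)

def select_response(context: str, candidates: list) -> str:
    """Return the candidate response that scores highest for this context."""
    return max(candidates, key=lambda r: score(context, r))

example_context = "what time does the store open"
example_candidates = [
    "the store opens at nine",
    "i like pizza",
    "it is raining today",
]
print(select_response(example_context, example_candidates))
# → the store opens at nine
```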
Machine learning methods work best with large datasets such as these. At PolyAI we train models of conversational response on huge conversational datasets, and then adapt these models to domain-specific tasks in conversational AI. This general approach of pre-training large models on huge datasets has long been popular in the computer vision community, and is now taking off in NLP.
Rather than providing the raw processed data, we provide scripts and instructions to generate the data yourself. This allows you to view and potentially manipulate the pre-processing and filtering. The instructions define standard datasets, with deterministic train/test splits, which can be used to define reproducible evaluations in research papers.
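A common way to make a train/test split deterministic is to hash a stable key for each example, so that every run of the pipeline assigns each example to the same split without any random state. The sketch below shows the general technique; the choice of MD5, the key, and the 10% test fraction are illustrative assumptions, not necessarily what these scripts do:

```python
import hashlib

def assign_split(key: str, test_fraction_percent: int = 10) -> str:
    """Deterministically assign an example to 'train' or 'test'.

    Hashing a stable key (e.g. an example's context text) into one of
    100 buckets gives a reproducible split: re-running the pipeline on
    the same data always produces the same train/test assignment.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "test" if bucket < test_fraction_percent else "train"

# The same key always lands in the same split:
print(assign_split("hello world") == assign_split("hello world"))
# → True
```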
## Datasets
Each dataset has its own directory, which contains a dataflow script, instructions for running it, and unit tests.
Note that these are the dataset sizes after filtering and other processing. For instance, the Reddit dataset is based on a raw database of 3.7 billion comments, but consists of 726 million examples, because the script filters out long comments, short comments, uninformative comments (such as '[deleted]'), and comments with no replies.
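The comment-level part of that filtering can be sketched as a simple predicate. The length thresholds and placeholder strings below are assumptions for illustration, not the exact values in the Reddit script, and the "no replies" condition is omitted because it requires thread structure rather than a single comment:

```python
def keep_comment(text: str, min_chars: int = 9, max_chars: int = 127) -> bool:
    """Illustrative filter in the spirit described above: drop comments
    that are uninformative placeholders, too short, or too long.
    Thresholds and marker strings are assumed, not taken from the repo."""
    if text in ("[deleted]", "[removed]"):
        return False
    if len(text) < min_chars or len(text) > max_chars:
        return False
    return True

comments = [
    "[deleted]",
    "ok",
    "This looks like a reasonable informative comment.",
]
print([c for c in comments if keep_comment(c)])
# → ['This looks like a reasonable informative comment.']
```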