Deep Learning Datasets

MNIST (http://yann.lecun.com/exdb/mnist/) Handwritten digits

Google Street View House Numbers, SVHN (http://ufldl.stanford.edu/housenumbers/) House-number digits cropped from Street View images

CIFAR-10 and CIFAR-100 (http://www.cs.toronto.edu/~kriz/cifar.html)

ImageNet (http://www.image-net.org/)

Flickr Data (http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images) Yahoo dataset of 100 million Creative Commons Flickr images

UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/)

The AR Face Database (http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html) - Contains over 4,000 color images corresponding to 126 people's faces (70 men and 56 women). Frontal views with variations in facial expressions, illumination, and occlusions. (Formats: RAW (RGB 24-bit))

Yale Face Database (http://cvc.yale.edu/projects/yalefaces/yalefaces.html) - 165 images (15 individuals) with different lighting, expression, and occlusion configurations.

DeepMind QA Corpus (https://github.com/deepmind/rc-data) - Textual QA corpus from CNN and DailyMail. More than 300K documents in total. [Paper](http://arxiv.org/abs/1506.03340) for reference.

Microsoft COCO Dataset http://mscoco.org/

SimVerb https://arxiv.org/abs/1608.00869v4

Google Open Images https://research.googleblog.com/2016/09/introducing-open-images-dataset.html

The Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/

OpenSubtitles http://opus.lingfil.uu.se/OpenSubtitles2016.php

STL-10 https://cs.stanford.edu/~acoates/stl10/

LSUN https://arxiv.org/abs/1506.03365v3 one million labeled images for each of 10 scene categories and 20 object categories.

Yelp Dataset Challenge https://www.yelp.com/dataset_challenge 2.7M reviews and 649K tips by 687K users for 86K businesses

A Large Dataset of Object Scans http://redwood-data.org/3dscan/ More than ten thousand 3D scans of real objects.

Ubuntu Dialogue Corpus https://github.com/rkadlec/ubuntu-ranking-dataset-creator

iLab-20M http://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Borji_iLab-20M_A_Large-Scale_CVPR_2016_paper.html

Yahoo Datasets http://webscope.sandbox.yahoo.com/#datasets

MovieLens http://grouplens.org/ 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.

AWS Public Datasets https://aws.amazon.com/public-data-sets/

YouTube-8M https://research.google.com/youtube8m/ 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities.

MegaFace http://megaface.cs.washington.edu/

Allen Institute Datasets: http://allenai.org/data.html

Unreal Integration: https://arxiv.org/abs/1609.01326

MetaMind WikiText http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/

Google Trends Dataset: http://googletrends.github.io/data/

NewsQA https://arxiv.org/abs/1611.09830v1

http://research.criteo.com/dataset-release-evaluation-counterfactual-algorithms/

https://www.kaggle.com/benhamner/nips-papers

https://arxiv.org/pdf/1612.00837.pdf Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

https://arxiv.org/pdf/1612.00881v1.pdf Procedural Generation of Videos to Train Deep Action Recognition Networks

http://www.msmarco.org/ MS MARCO: Microsoft has released a dataset of 100,000 question-and-answer pairs written by humans, intended to help AI researchers train machines to extract information from websites and respond more naturally to users' questions.

https://arxiv.org/pdf/1612.06890v1.pdf CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

https://datasets.maluuba.com/Frames

Frames is meant to encourage research towards conversational agents that can support decision-making in complex settings, in this case booking a vacation including flights and a hotel. More than just searching a database, we believe the next generation of conversational agents will need to help users explore a database, compare items, and reach a decision.

https://arxiv.org/abs/1606.08513v3 This paper presents a new selection-based question answering dataset, SelQA.

https://arxiv.org/pdf/1612.05079v1.pdf SceneNet RGB-D: 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth

http://www.eecs.qmul.ac.uk/~hs308/qmul_toplogo10.html/

http://www.cl.cam.ac.uk/research/rainbow/projects/unityeyes/

https://www.microsoft.com/en-us/download/details.aspx?id=54689 Microsoft Speech Language Translation (MSLT) Corpus

http://suncg.cs.princeton.edu/ SUNCG: A Large 3D Model Repository for Indoor Scenes

https://arxiv.org/abs/1703.00564 MoleculeNet: A Benchmark for Molecular Machine Learning

https://research.google.com/pubs/pub45741.html HolStep: a Machine Learning Dataset for Higher-Order Logic Theorem Proving

Resources

http://mldata.org/repository/data/

http://opendatainception.io/

http://www.infoworld.com/article/3131515/artificial-intelligence/4-google-data-sets-to-kickstart-machine-learning.html

https://www.google.com/publicdata/directory

http://datasets.maluuba.com/NewsQA

https://medium.com/@olivercameron/20-weird-wonderful-datasets-for-machine-learning-c70fc89b73d5#.1qnk2ky7w

https://docs.google.com/spreadsheets/d/1AQvZ7-Kg0lSZtG1wlgbIsrm90HaTZrJGQMz-uKRRlFw/edit#gid=0

https://arxiv.org/pdf/1511.06348v2.pdf How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?

https://github.com/caesar0301/awesome-public-datasets#energy

http://cvgl.stanford.edu/projects/objectnet3d/

https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

http://blog.paralleldots.com/data-scientist/new-deep-learning-datasets-data-scientists/

https://arxiv.org/pdf/1704.04452.pdf Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps

https://arxiv.org/abs/1704.05179v1 SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

We publicly release a new large-scale dataset, called SearchQA, for machine comprehension, or question-answering. Unlike recently released datasets, such as DeepMind CNN/DailyMail and SQuAD, the proposed SearchQA was constructed to reflect a full pipeline of general question-answering. That is, we start not from an existing article and generate a question-answer pair, but start from an existing question-answer pair, crawled from J! Archive, and augment it with text snippets retrieved by Google. Following this approach, we built SearchQA, which consists of more than 140k question-answer pairs with each pair having 49.6 snippets on average. Each question-answer-context tuple of the SearchQA comes with additional meta-data such as the snippet's URL, which we believe will be valuable resources for future research. We conduct human evaluation as well as test two baseline methods, one simple word selection and the other deep learning based, on the SearchQA. We show that there is a meaningful gap between the human and machine performances. This suggests that the proposed dataset could well serve as a benchmark for question-answering.
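The question-answer-context tuple described above could be modeled roughly as follows. This is a minimal sketch, not the released SearchQA schema: the class names, field names, and toy records are all hypothetical and chosen only to illustrate a question-answer pair carrying several snippets, each with its URL metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snippet:
    """One retrieved context snippet (hypothetical field names)."""
    text: str  # snippet text returned by the search engine
    url: str   # metadata: the page the snippet came from


@dataclass
class QAExample:
    """A question-answer pair augmented with retrieved snippets."""
    question: str
    answer: str
    snippets: List[Snippet] = field(default_factory=list)


def mean_snippets_per_pair(examples: List[QAExample]) -> float:
    """Average number of snippets attached to each question-answer pair."""
    if not examples:
        return 0.0
    return sum(len(ex.snippets) for ex in examples) / len(examples)


# Toy records invented for illustration; they are not taken from the dataset.
examples = [
    QAExample(
        question="Placeholder clue one",
        answer="placeholder answer one",
        snippets=[
            Snippet(text="first snippet", url="http://example.com/a"),
            Snippet(text="second snippet", url="http://example.com/b"),
        ],
    ),
    QAExample(
        question="Placeholder clue two",
        answer="placeholder answer two",
        snippets=[Snippet(text="only snippet", url="http://example.com/c")],
    ),
]

average = mean_snippets_per_pair(examples)
print(f"{len(examples)} pairs, {average:.1f} snippets per pair on average")
```

On the real data, the same statistic is the one the abstract reports as 49.6 snippets per pair.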

http://textbookqa.org/

https://arxiv.org/abs/1704.05579 A Large Self-Annotated Corpus for Sarcasm

https://github.com/markriedl/WikiPlots

https://medium.com/@NYUDataScience/true-false-neutral-teaching-machines-to-understand-words-not-just-read-them-4098c7161e47

http://cs.stanford.edu/people/jcjohns/iep/

http://labs.semanticscholar.org/corpus/

https://arxiv.org/abs/1712.07040 The NarrativeQA Reading Comprehension Challenge

https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community

https://arxiv.org/abs/1801.07746v2 HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments

https://rit-public.github.io/HappyDB/

https://arxiv.org/abs/1710.01779v2 Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl

We present DepCC, the largest-to-date linguistically analyzed corpus in English, including 365 million documents composed of 252 billion tokens and 7.5 billion named entity occurrences in 14.3 billion sentences from a web-scale crawl of the Common Crawl project.
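To get a feel for the scale of these figures, the rough per-sentence and per-document averages can be derived from the totals quoted above. This is only back-of-the-envelope arithmetic on the rounded numbers in the abstract; the variable names are made up for this sketch and the results are approximations, not statistics reported by the authors.

```python
# Totals as quoted in the DepCC abstract (rounded figures).
documents = 365_000_000
tokens = 252_000_000_000
sentences = 14_300_000_000
named_entities = 7_500_000_000

# Derived averages; approximate because the inputs are rounded.
tokens_per_sentence = tokens / sentences
sentences_per_document = sentences / documents
tokens_per_document = tokens / documents
entities_per_sentence = named_entities / sentences

print(f"tokens per sentence:    {tokens_per_sentence:.1f}")
print(f"sentences per document: {sentences_per_document:.1f}")
print(f"tokens per document:    {tokens_per_document:.0f}")
print(f"entities per sentence:  {entities_per_sentence:.2f}")
```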

http://convai.io/

http://www.phrasebank.manchester.ac.uk/

http://xviewdataset.org/

https://arxiv.org/pdf/1804.04314v1.pdf A Large-scale Attribute Dataset for Zero-shot Learning

https://summari.es/

https://pbs.twimg.com/media/DcLJIlqX0AEaJ4x.jpg Logical Entailment Dataset https://github.com/deepmind/logical-entailment-dataset https://openreview.net/pdf?id=SkZxCk-0Z

https://conala-corpus.github.io/ CoNaLa: The Code/Natural Language Challenge

https://www.microsoft.com/en-us/research/blog/announcing-microsoft-research-open-data-datasets-by-microsoft-research-now-available-in-the-cloud/

https://interiornet.org/ InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset

https://arxiv.org/abs/1810.00415v1 Optical Illusions Images Dataset

Standard Benchmarks

https://gluebenchmark.com/

http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

https://20bn.com/datasets/something-something/v2

https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets Datasets of Annotated Semantic Relationships