Deep Learning Datasets
MNIST (http://yann.lecun.com/exdb/mnist/) Handwritten digits
Google House Numbers (http://ufldl.stanford.edu/housenumbers/) from Street View
CIFAR-10 and CIFAR-100 (http://www.cs.toronto.edu/~kriz/cifar.html)
ImageNet (http://www.image-net.org/)
Flickr Data (http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images) 100 million Creative Commons Flickr images released by Yahoo
UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/)
The AR Face Database (http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html) - Contains over 4,000 color images corresponding to 126 people's faces (70 men and 56 women). Frontal views with variations in facial expressions, illumination, and occlusions. (Formats: RAW (RGB 24-bit))
Yale Face Database (http://cvc.yale.edu/projects/yalefaces/yalefaces.html) - 165 images (15 individuals) with different lighting, expression, and occlusion configurations.
DeepMind QA Corpus (https://github.com/deepmind/rc-data) - Textual QA corpus from CNN and DailyMail. More than 300K documents in total. [Paper](http://arxiv.org/abs/1506.03340) for reference.
Microsoft COCO Dataset http://mscoco.org/
SimVerb https://arxiv.org/abs/1608.00869v4
Google Open Images https://research.googleblog.com/2016/09/introducing-open-images-dataset.html
The Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/
OpenSubtitles http://opus.lingfil.uu.se/OpenSubtitles2016.php
STL-10 https://cs.stanford.edu/~acoates/stl10/
LSUN https://arxiv.org/abs/1506.03365v3 one million labeled images for each of 10 scene categories and 20 object categories.
Yelp Dataset Challenge https://www.yelp.com/dataset_challenge 2.7M reviews and 649K tips by 687K users for 86K businesses
A Large Dataset of Object Scans http://redwood-data.org/3dscan/ More than ten thousand 3D scans of real objects.
Ubuntu Dialogue Corpus https://github.com/rkadlec/ubuntu-ranking-dataset-creator
Yahoo Datasets http://webscope.sandbox.yahoo.com/#datasets
MovieLens http://grouplens.org/ 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.
AWS Public Datasets https://aws.amazon.com/public-data-sets/
YouTube-8M https://research.google.com/youtube8m/ 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities.
MegaFace http://megaface.cs.washington.edu/
Allen Institute Datasets: http://allenai.org/data.html
Unreal Integration: https://arxiv.org/abs/1609.01326
MetaMind Wiki Text http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/
Google Trends Dataset: http://googletrends.github.io/data/
NewsQA https://arxiv.org/abs/1611.09830v1
http://research.criteo.com/dataset-release-evaluation-counterfactual-algorithms/
https://www.kaggle.com/benhamner/nips-papers
https://arxiv.org/pdf/1612.00837.pdf Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
https://arxiv.org/pdf/1612.00881v1.pdf Procedural Generation of Videos to Train Deep Action Recognition Networks
http://www.msmarco.org/ Microsoft has released a large dataset of 100,000 question-and-answer pairs written by humans, to help AI researchers train systems that extract information from websites more effectively and respond more naturally to users' questions.
https://arxiv.org/pdf/1612.06890v1.pdf CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
https://datasets.maluuba.com/Frames
Frames is meant to encourage research toward conversational agents that can support decision-making in complex settings, in this case booking a vacation, including flights and a hotel. More than just searching a database, we believe the next generation of conversational agents will need to help users explore a database, compare items, and reach a decision.
https://arxiv.org/abs/1606.08513v3 This paper presents a new selection-based question answering dataset, SelQA.
https://arxiv.org/pdf/1612.05079v1.pdf SceneNet RGB-D: 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth
http://www.eecs.qmul.ac.uk/~hs308/qmul_toplogo10.html/
http://www.cl.cam.ac.uk/research/rainbow/projects/unityeyes/
https://www.microsoft.com/en-us/download/details.aspx?id=54689 Microsoft Speech Language Translation (MSLT) Corpus
http://suncg.cs.princeton.edu/ SUNCG: A Large 3D Model Repository for Indoor Scenes
https://arxiv.org/abs/1703.00564 MoleculeNet: A Benchmark for Molecular Machine Learning
https://research.google.com/pubs/pub45741.html HolStep: a Machine Learning Dataset for Higher-Order Logic Theorem Proving
Resources
http://mldata.org/repository/data/
https://www.google.com/publicdata/directory
http://datasets.maluuba.com/NewsQA
https://docs.google.com/spreadsheets/d/1AQvZ7-Kg0lSZtG1wlgbIsrm90HaTZrJGQMz-uKRRlFw/edit#gid=0
https://arxiv.org/pdf/1511.06348v2.pdf How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?
https://github.com/caesar0301/awesome-public-datasets#energy
http://cvgl.stanford.edu/projects/objectnet3d/
https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
http://blog.paralleldots.com/data-scientist/new-deep-learning-datasets-data-scientists/
https://arxiv.org/pdf/1704.04452.pdf Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps
https://arxiv.org/abs/1704.05179v1 SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine
We publicly release a new large-scale dataset, called SearchQA, for machine comprehension, or question-answering. Unlike recently released datasets, such as DeepMind CNN/DailyMail and SQuAD, the proposed SearchQA was constructed to reflect a full pipeline of general question-answering. That is, we start not from an existing article and generate a question-answer pair, but start from an existing question-answer pair, crawled from J! Archive, and augment it with text snippets retrieved by Google. Following this approach, we built SearchQA, which consists of more than 140k question-answer pairs with each pair having 49.6 snippets on average. Each question-answer-context tuple of the SearchQA comes with additional meta-data such as the snippet's URL, which we believe will be valuable resources for future research. We conduct human evaluation as well as test two baseline methods, one simple word selection and the other deep learning based, on the SearchQA. We show that there is a meaningful gap between the human and machine performances. This suggests that the proposed dataset could well serve as a benchmark for question-answering.
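The question-answer-context tuple described in the SearchQA abstract can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class and field names (`Snippet`, `SearchQAExample`, `text`, `url`) are hypothetical and do not reflect the released file format, and the toy records are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snippet:
    """One retrieved search-result snippet (field names are hypothetical)."""
    text: str
    url: str  # the abstract mentions the snippet's URL as meta-data


@dataclass
class SearchQAExample:
    """A question-answer-context tuple; not the released file schema."""
    question: str
    answer: str
    snippets: List[Snippet] = field(default_factory=list)


def mean_snippets_per_pair(examples: List[SearchQAExample]) -> float:
    """Average number of context snippets attached to each question-answer pair."""
    if not examples:
        return 0.0
    return sum(len(ex.snippets) for ex in examples) / len(examples)


def answer_in_context(example: SearchQAExample) -> bool:
    """True if the answer appears (case-insensitively) in any snippet text."""
    needle = example.answer.lower()
    return any(needle in s.text.lower() for s in example.snippets)


# Toy, made-up records purely for illustration.
toy = [
    SearchQAExample(
        question="This planet is known as the Red Planet",
        answer="Mars",
        snippets=[
            Snippet("Mars is often called the Red Planet.", "https://example.org/a"),
            Snippet("The fourth planet from the Sun.", "https://example.org/b"),
        ],
    ),
    SearchQAExample(question="Placeholder clue", answer="nothing", snippets=[]),
]

print(mean_snippets_per_pair(toy))
print(answer_in_context(toy[0]))
```

On the real corpus, a statistic like `mean_snippets_per_pair` would correspond to the "49.6 snippets on average" figure quoted above, assuming the data were loaded into a comparable structure.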
https://arxiv.org/abs/1704.05579 A Large Self-Annotated Corpus for Sarcasm
https://github.com/markriedl/WikiPlots
http://cs.stanford.edu/people/jcjohns/iep/
http://labs.semanticscholar.org/corpus/
https://arxiv.org/abs/1712.07040 The NarrativeQA Reading Comprehension Challenge
https://arxiv.org/abs/1801.07746v2 HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments
https://rit-public.github.io/HappyDB/
https://arxiv.org/abs/1710.01779v2 Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl
We present DepCC, the largest-to-date linguistically analyzed corpus in English including 365 million documents, composed of 252 billion tokens and 7.5 billion of named entity occurrences in 14.3 billion sentences from a web-scale crawl of the Common Crawl project.
http://www.phrasebank.manchester.ac.uk/
https://arxiv.org/pdf/1804.04314v1.pdf A Large-scale Attribute Dataset for Zero-shot Learning
https://pbs.twimg.com/media/DcLJIlqX0AEaJ4x.jpg Logical Entailment Dataset https://github.com/deepmind/logical-entailment-dataset https://openreview.net/pdf?id=SkZxCk-0Z
https://conala-corpus.github.io/ CoNaLa: The Code/Natural Language Challenge
https://interiornet.org/ InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset
https://arxiv.org/abs/1810.00415v1 Optical Illusions Images Dataset
https://github.com/google/cog A dataset and architecture for visual reasoning with a working memory
http://decomp.io/ The Decompositional Semantics Initiative: rapid, simple, commonsensical annotations of meaning
http://www.msmarco.org/ MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering, and passage ranking.
https://homes.cs.washington.edu/~msap/atomic/ ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning
http://nlpprogress.com/english/simplification.html
https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml
https://linqs.soe.ucsc.edu/data
Standard Benchmark
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
https://20bn.com/datasets/something-something/v2
https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets Datasets of Annotated Semantic Relationships
https://ai.google.com/research/NaturalQuestions
https://www.kaggle.com/shujian/arxiv-nlp-papers-with-github-link