# Semi-Supervised Learning

**References**

http://arxiv.org/abs/1511.09123v1 A Short Survey on Data Clustering Algorithms

https://arxiv.org/abs/1511.01432 Semi-supervised Sequence Learning

http://arxiv.org/pdf/1606.06724v1.pdf Tagger: Deep Unsupervised Perceptual Grouping

Rather than being trained for any specific segmentation, our framework learns the grouping process in an unsupervised manner or alongside any supervised task. By enriching the representations of a neural network, we enable it to group the representations of different objects in an iterative manner. By allowing the system to amortize the iterative inference of the groupings, we achieve very fast convergence.

https://arxiv.org/abs/1604.00289 Building Machines That Learn and Think Like People

https://arxiv.org/pdf/1606.05579.pdf Early Visual Concept Learning with Unsupervised Deep Learning

By enforcing redundancy reduction, encouraging statistical independence, and exposure to data with transform continuities analogous to those to which human infants are exposed, we obtain a variational autoencoder (VAE) framework capable of learning disentangled factors. Our approach makes few assumptions and works well across a wide variety of datasets. Furthermore, our solution has useful emergent properties, such as zero-shot inference and an intuitive understanding of “objectness”.

https://arxiv.org/abs/1511.02251 Learning Visual Features from Large Weakly Supervised Data

In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and captions, and show that these networks produce features that perform well in a range of vision problems.

This study demonstrates that convolutional networks can be trained from scratch without any manual annotation and shows that good features can be learned from weakly supervised data. Indeed, our models learn features that are nearly on par with those learned from an image collection with over a million manually defined labels, and achieve good results on a variety of datasets. (Obtaining state-of-the-art results requires averaging predictions over many crops and models, which is outside the scope of this paper.) Moreover, our results show that weakly supervised models can learn semantic structure from image-word co-occurrences.

https://research.googleblog.com/2016/10/graph-powered-machine-learning-at-google.html Graph Powered Machine Learning at Google

https://arxiv.org/abs/1512.01752 Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation

Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification. Zhang, Y., Lee, K., & Lee, H. (2016) [29]

This paper starts out with a brief history of using unsupervised and semi-supervised methods in deep learning. The authors showed how such methods can be scaled to solve large-scale problems. Using their approach, existing neural network architectures for image classification can be augmented with unsupervised decoding pathways for image reconstruction. The decoding pathways consist of a deconvolutional network that mirrors the original network using autoencoders. They initialized the weights for the encoding pathway with the original network and for the decoding pathway with random values. Initially, they trained only the decoding pathway while keeping the encoding pathway fixed. Then they fine-tuned the full network with a reduced learning rate. Applying this method to a state-of-the-art image classification network boosted its performance significantly.

Deconstructing the Ladder Network Architecture. Pezeshki, M., Fan, L., Brakel, P., Courville, A., & Bengio, Y. (2016) [20]

A different approach for combining supervised and unsupervised training of deep neural networks is the Ladder Network architecture [21]. It also improves the performance of an existing classifier network by augmenting it with an auxiliary decoder network, but it has additional lateral connections between the original and decoder networks. The resultant network forms a deep stack of denoising autoencoders [26] that is trained to reconstruct each layer from a noisy version. In this paper, the authors studied the ladder architecture systematically by removing its components one at a time to see how much each component contributed to performance. They found that the lateral connections are the most important, followed by the injection of noise, and finally by the choice of the combinator function that combines the vertical and lateral connections. They also introduced a new combinator function that improved the already impressive performance of the ladder network on the Permutation-Invariant MNIST handwritten digit recognition task [15], both for the supervised and semi-supervised settings.

https://arxiv.org/abs/1611.09960 Attend in groups: a weakly-supervised deep learning framework for learning from web data

https://arxiv.org/pdf/1703.00848.pdf Unsupervised Image-to-Image Translation Networks

The proposed framework can learn the translation function without any corresponding images in two domains. We enable this learning capability by combining a weight-sharing constraint and an adversarial training objective.

We model each image domain using a VAE and a GAN. Through an adversarial training objective, an image fidelity function is implicitly defined for each domain. The adversarial training objective interacts with a weight-sharing constraint to generate corresponding images in two domains, while the variational autoencoders relate translated images with input images in the respective domains.

Based on the intuition that a pair of corresponding images in different domains should share the same high-level image representation, we enforce several weight-sharing constraints. The connection weights of the last few layers (high-level layers) in E1 and E2 are tied, the connection weights of the first few layers (high-level layers) in G1 and G2 are tied, and the connection weights of the last few layers (high-level layers) in D1 and D2 are tied.

https://arxiv.org/abs/1703.00854v1 Learning the Structure of Generative Models without Labeled Data

Recent frameworks address this bottleneck with generative models to synthesize labels at scale from weak supervision sources. The generative model's dependency structure directly affects the quality of the estimated labels, but selecting a structure automatically without any labeled data is a distinct challenge.

https://www.semanticscholar.org/paper/Distant-supervision-for-relation-extraction-Mintz-Bills/8f8139b63a2fc0b3ae8413acaef47acd35a356e0 Distant supervision for relation extraction without labeled data

We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large un-labeled corpus and extract textual features to train a relation classifier. Our algorithm combines the advantages of supervised IE (combining 400,000 noisy pattern features in a probabilistic classifier) and unsupervised IE (extracting large numbers of relations from large corpora of any domain). Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%. We also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distant in their expression.
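
The labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the knowledge base is a hypothetical dict standing in for Freebase, the function name is made up, and entity mentions are assumed to be already extracted per sentence.

```python
def distant_supervision_examples(kb, sentences):
    """Build noisy training examples from a knowledge base.

    kb        : dict mapping (entity1, entity2) -> relation name
                (hypothetical stand-in for Freebase)
    sentences : list of (text, set_of_entity_mentions)
    Returns a list of (text, entity1, entity2, relation) tuples.
    """
    examples = []
    for text, mentions in sentences:
        for (e1, e2), relation in kb.items():
            # Any sentence mentioning both entities is assumed to
            # express the relation; this is where label noise enters.
            if e1 in mentions and e2 in mentions:
                examples.append((text, e1, e2, relation))
    return examples


kb = {("Paris", "France"): "capital_of"}
sentences = [
    ("Paris is the capital of France.", {"Paris", "France"}),
    ("Paris hosted a film festival in France.", {"Paris", "France"}),
    ("Berlin is in Germany.", {"Berlin", "Germany"}),
]
examples = distant_supervision_examples(kb, sentences)
```

The second sentence receives the `capital_of` label even though it does not express that relation, which is the kind of noise the relation classifier has to tolerate.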

https://arxiv.org/abs/1703.02618v1 Bootstrapped Graph Diffusions: Exposing the Power of Nonlinearity

We place classic linear graph diffusions in a self-training framework. Surprisingly, we observe that SSL using the resulting *bootstrapped diffusions* not only significantly improves over the respective non-bootstrapped baselines but also outperforms state-of-the-art non-linear SSL methods. Moreover, since the self-training wrapper retains the scalability of the base method, we obtain both higher quality and better scalability.
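
A rough sketch of the idea of wrapping a linear diffusion in self-training. The abstract does not give the selection rule, so the margin-based confidence, the number of rounds, and the function names below are all assumptions made for illustration.

```python
def diffuse(neighbors, seeds, num_classes, steps=10):
    """Linear label propagation: seed nodes stay clamped to their label,
    every other node repeatedly averages its neighbors' scores."""
    scores = {n: [0.0] * num_classes for n in neighbors}
    for n, c in seeds.items():
        scores[n][c] = 1.0
    for _ in range(steps):
        new = {}
        for n, nbrs in neighbors.items():
            if n in seeds:
                new[n] = scores[n]
            else:
                new[n] = [sum(scores[m][c] for m in nbrs) / len(nbrs)
                          for c in range(num_classes)]
        scores = new
    return scores


def bootstrapped_diffusion(neighbors, seeds, num_classes, rounds=3, per_round=2):
    """Self-training wrapper (assumed rule): after each diffusion, promote
    the most confident unlabeled nodes to seeds, then diffuse again."""
    seeds = dict(seeds)
    for _ in range(rounds):
        scores = diffuse(neighbors, seeds, num_classes)
        candidates = []
        for n, s in scores.items():
            if n in seeds:
                continue
            ranked = sorted(s, reverse=True)
            margin = ranked[0] - ranked[1]  # assumes num_classes >= 2
            candidates.append((margin, n, s.index(ranked[0])))
        candidates.sort(reverse=True)
        for margin, n, c in candidates[:per_round]:
            if margin > 0:
                seeds[n] = c
    scores = diffuse(neighbors, seeds, num_classes)
    return {n: max(range(num_classes), key=lambda c: scores[n][c])
            for n in neighbors}


# Path graph 0-1-2-3-4-5 with one labeled node at each end.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
predictions = bootstrapped_diffusion(neighbors, {0: 0, 5: 1}, num_classes=2)
```

The hard promotion of confident nodes is what makes the overall procedure non-linear, while each round still costs only one run of the base diffusion.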

https://github.com/parthatalukdar/junto

https://arxiv.org/abs/1610.02242 Temporal Ensembling for Semi-Supervised Learning

In this paper, we present a simple and efficient method for training deep neural networks in a semi-supervised setting where only a small portion of training data is labeled. We introduce self-ensembling, where we form a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs, and most importantly, under different regularization and input augmentation conditions. This ensemble prediction can be expected to be a better predictor for the unknown labels than the output of the network at the most recent training epoch, and can thus be used as a target for training. Using our method, we set new records for two standard semi-supervised learning benchmarks, reducing the (non-augmented) classification error rate from 18.44% to 7.05% in SVHN with 500 labels and from 18.63% to 16.55% in CIFAR-10 with 4000 labels, and further to 5.12% and 12.16% by enabling the standard augmentations. We additionally demonstrate good tolerance to incorrect labels. https://github.com/smlaine2/tempens
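
A minimal sketch of how such a consensus target could be maintained per sample, assuming the ensemble is an exponential moving average of past predictions with a startup bias correction. The momentum value and the function names are illustrative, not taken from the abstract.

```python
def update_ensemble(Z, z, alpha, epoch):
    """One ensembling step for a single training sample.

    Z     : accumulated prediction so far (list of class probabilities,
            all zeros before the first epoch)
    z     : the network's prediction for this sample in the current epoch
    alpha : momentum of the moving average (assumed hyperparameter)
    epoch : 1-based index of the epoch that just finished
    Returns the new accumulator and the bias-corrected training target.
    """
    Z_new = [alpha * Zi + (1.0 - alpha) * zi for Zi, zi in zip(Z, z)]
    # Z starts at zero, so early averages are biased toward zero;
    # dividing by (1 - alpha^epoch) removes that bias.
    target = [Zi / (1.0 - alpha ** epoch) for Zi in Z_new]
    return Z_new, target


def consistency_loss(z, target):
    """Mean squared difference between current prediction and target."""
    return sum((a - b) ** 2 for a, b in zip(z, target)) / len(z)


Z, target = update_ensemble([0.0, 0.0], [0.8, 0.2], alpha=0.6, epoch=1)
```

The consistency term can be computed for every sample, labeled or not, which is how the unlabeled portion of the data contributes to training.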

http://arxiv.org/abs/1406.5298 Semi-Supervised Learning with Deep Generative Models https://github.com/saemundsson/semisupervised_vae

https://arxiv.org/pdf/1703.04818v1.pdf Neural Graph Machines: Learning Neural Networks Using Graphs

In this work, we propose a training framework with a graph-regularised objective, namely Neural Graph Machines, that can combine the power of neural networks and label propagation. This work generalises previous literature on graph-augmented training of neural networks, enabling it to be applied to multiple neural architectures (Feed-forward NNs, CNNs and LSTM RNNs) and a wide range of graphs. The new objective allows the neural networks to harness both labeled and unlabeled data by: (a) allowing the network to train using labeled data as in the supervised setting, (b) biasing the network to learn similar hidden representations for neighboring nodes on a graph, in the same vein as label propagation. Such architectures with the proposed objective can be trained efficiently using stochastic gradient descent and scaled to large graphs, with a runtime that is linear in the number of edges.
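
The two parts of the objective, (a) and (b), can be written down as a small sketch. Squared Euclidean distance between hidden representations and a single weight `alpha` are assumptions here; the paper distinguishes several edge types that this sketch lumps together.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def graph_regularised_loss(probs, hidden, labels, edges, alpha=0.1):
    """Supervised loss plus a graph smoothness penalty.

    probs  : dict node -> predicted class probabilities
    hidden : dict node -> hidden representation (list of floats)
    labels : dict node -> class index, for labeled nodes only
    edges  : list of (u, v, weight) tuples
    alpha  : strength of the graph term (assumed hyperparameter)
    """
    # (a) ordinary supervised term on the labeled nodes
    supervised = sum(cross_entropy(probs[n], y) for n, y in labels.items())
    # (b) pull hidden representations of neighboring nodes together;
    # one pass over the edge list, so the cost is linear in the edges
    smoothness = sum(w * squared_distance(hidden[u], hidden[v])
                     for u, v, w in edges)
    return supervised + alpha * smoothness


loss = graph_regularised_loss(
    probs={"a": [0.5, 0.5]},
    hidden={"a": [0.0, 0.0], "b": [1.0, 1.0]},
    labels={"a": 0},
    edges=[("a", "b", 1.0)],
    alpha=0.5,
)
```

Node `b` is unlabeled, yet it still influences the loss through its edge to `a`, which is how unlabeled data enters training.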

https://arxiv.org/pdf/1703.07464v1.pdf No Fuss Distance Metric Learning using Proxies

We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity. Traditionally, for this problem supervision is expressed in the form of sets of points that follow an ordinal relationship – an anchor point x is similar to a set of positive points Y, and dissimilar to a set of negative points Z, and a loss defined over these distances is minimized. While the specifics of the optimization differ, in this work we collectively call this type of supervision Triplets and all methods that follow this pattern Triplet-Based methods. These methods are challenging to optimize. A main issue is the need for finding informative triplets, which is usually achieved by a variety of tricks such as increasing the batch size, hard or semi-hard triplet mining, etc, but even with these tricks, the convergence rate of such methods is slow. In this paper we propose to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points. These proxies approximate the original data points, so that a triplet loss over the proxies is a tight upper bound of the original loss. This proxy-based loss is empirically better behaved. As a result, the proxy-loss improves on state-of-art results for three standard zero-shot learning datasets, by up to 15% points, while converging three times as fast as other triplet-based losses.
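
A small sketch of a proxy-based loss, assuming one learned proxy per class and an NCA-style formulation over squared Euclidean distances. The static assignment of each point to its class proxy and the function names are assumptions for illustration.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def proxy_nca_loss(anchor, label, proxies):
    """NCA-style loss of one anchor against class proxies.

    anchor  : embedding of the data point (list of floats)
    label   : class of the anchor
    proxies : dict class -> proxy embedding (learned jointly in practice);
              must contain at least one class other than `label`
    """
    # Attraction to the proxy of the anchor's own class ...
    pos = math.exp(-sq_dist(anchor, proxies[label]))
    # ... against repulsion from the proxies of all other classes.
    neg = sum(math.exp(-sq_dist(anchor, p))
              for c, p in proxies.items() if c != label)
    return -math.log(pos / neg)


proxies = {0: [0.0, 0.0], 1: [1.0, 0.0]}
loss_near_own_proxy = proxy_nca_loss([0.0, 0.0], 0, proxies)
loss_near_wrong_proxy = proxy_nca_loss([1.0, 0.0], 0, proxies)
```

Because every anchor is compared only against a handful of proxies rather than against sampled data points, no triplet mining is needed.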

https://arxiv.org/pdf/1704.05310v1.pdf Unsupervised Learning by Predicting Noise

We propose to fix a set of target representations, called Noise As Targets (NAT), and to constrain the deep features to align to them. This domain agnostic approach avoids the standard unsupervised learning issues of trivial solutions and collapsing of features. Thanks to a stochastic batch reassignment strategy and a separable square loss function, it scales to millions of images. The proposed approach produces representations that perform on par with state-of-the-art unsupervised methods on ImageNet and PASCAL VOC.

https://arxiv.org/abs/1707.00189v1 An Approach for Weakly-Supervised Deep Information Retrieval

We present an approach for generating weak supervision training data for use in a neural IR model. Specifically, we use a news corpus with article headlines acting as pseudo-queries and article content as pseudo-documents, and we propose a measure of interaction similarity to filter these pseudo-documents.

https://arxiv.org/abs/1706.00909 Learning by Association - A versatile semi-supervised training method for neural networks

We propose a new framework for semi-supervised training of deep neural networks inspired by learning in humans. “Associations” are made from embeddings of labeled samples to those of unlabeled ones and back. The optimization schedule encourages correct association cycles that end up at the same class from which the association was started and penalizes wrong associations ending at a different class. The implementation is easy to use and can be added to any existing end-to-end training setup.
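
The association cycle can be sketched as a round trip from labeled to unlabeled embeddings and back. This assumes dot-product similarities turned into transition probabilities by a softmax, and shows only the term that rewards cycles ending in the starting class; any further loss terms in the paper are omitted.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def round_trip_probabilities(labeled, unlabeled):
    """P[i][j]: probability of walking labeled i -> some unlabeled -> labeled j."""
    sims = [[sum(x * y for x, y in zip(a, b)) for b in unlabeled] for a in labeled]
    p_ab = [softmax(row) for row in sims]                    # labeled -> unlabeled
    p_ba = [softmax(list(col)) for col in zip(*sims)]        # unlabeled -> labeled
    n, m = len(labeled), len(unlabeled)
    return [[sum(p_ab[i][k] * p_ba[k][j] for k in range(m)) for j in range(n)]
            for i in range(n)]


def walker_loss(labeled, labels, unlabeled):
    """Cross-entropy between round-trip probabilities and a target that is
    uniform over labeled samples of the starting sample's class."""
    p_aba = round_trip_probabilities(labeled, unlabeled)
    loss = 0.0
    for i, yi in enumerate(labels):
        same = [j for j, yj in enumerate(labels) if yj == yi]
        loss -= sum(math.log(p_aba[i][j]) for j in same) / len(same)
    return loss / len(labels)


labeled, labels = [[5.0, 0.0], [0.0, 5.0]], [0, 1]
separated = walker_loss(labeled, labels, [[5.0, 0.0], [0.0, 5.0]])
confused = walker_loss(labeled, labels, [[1.0, 1.0], [1.0, 1.0]])
```

Unlabeled embeddings that sit close to a single class give a lower loss than embeddings that are equally similar to every class.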

http://dawn.cs.stanford.edu/2017/07/16/weak-supervision/ Weak Supervision: The New Programming Paradigm for Machine Learning

Getting labeled training data has become the key development bottleneck in supervised machine learning. We provide a broad, high-level overview of recent weak supervision approaches, where noisier or higher-level supervision is used as a more expedient and flexible way to get supervision signal, in particular from subject matter experts (SMEs). We provide a simple, broad definition of weak supervision as being comprised of one or more noisy conditional distributions over unlabeled data, and focus on the key technical challenge of unifying and modeling these sources.

https://arxiv.org/abs/1607.06854 Unsupervised Learning from Continuous Video in a Scalable Predictive Recurrent Network

https://arxiv.org/abs/1705.10694v2 Deep Learning is Robust to Massive Label Noise

In this paper, we investigate the behavior of deep neural networks on training sets with massively noisy labels. We show that successful learning is possible even with an essentially arbitrary amount of noise. For example, on MNIST we find that accuracy of above 90 percent is still attainable even when the dataset has been diluted with 100 noisy examples for each clean example.

https://arxiv.org/pdf/1710.02584.pdf Bag-Level Aggregation for Multiple Instance Active Learning in Instance Classification Problems

This paper focuses on AL methods for instance classification problems in multiple instance learning (MIL), where data is arranged into sets, called bags, that are weakly labeled. Most AL methods focus on single instance learning problems. These methods are not suitable for MIL problems because they cannot account for the bag structure of data. In this paper, new methods for bag-level aggregation of instance informativeness are proposed for multiple instance active learning (MIAL). The aggregated informativeness method identifies the most informative instances based on classifier uncertainty, and queries bags incorporating the most information. The other proposed method, called cluster-based aggregative sampling, clusters data hierarchically in the instance space. The informativeness of instances is assessed by considering bag labels, inferred instance labels, and the proportion of labels that remain to be discovered in clusters.

https://openreview.net/pdf?id=ByL48G-AW Simple Nearest Neighbor Policy Method for Continuous Control Tasks

We design a new policy, called a nearest neighbor policy, that does not require any optimization for simple, low-dimensional continuous control tasks. As this policy does not require any optimization, it allows us to investigate the underlying difficulty of a task without being distracted by optimization difficulty of a learning algorithm. We propose two variants, one that retrieves an entire trajectory based on a pair of initial and goal states, and the other retrieving a partial trajectory based on a pair of current and goal states.
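
A sketch of the first variant, which retrieves a whole stored trajectory from the nearest (initial state, goal state) pair. The buffer layout, the squared Euclidean distance, and the function name are assumptions made for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_trajectory(buffer, start, goal):
    """Return the stored action sequence whose (start, goal) pair is
    closest to the query. No parameters are optimized.

    buffer : list of (start_state, goal_state, actions) tuples
             collected beforehand (hypothetical layout)
    """
    def distance(entry):
        stored_start, stored_goal, _ = entry
        return sq_dist(stored_start, start) + sq_dist(stored_goal, goal)

    return min(buffer, key=distance)[2]


buffer = [
    ((0.0, 0.0), (1.0, 1.0), ["right", "up"]),
    ((5.0, 5.0), (0.0, 0.0), ["left", "down"]),
]
actions = nearest_trajectory(buffer, start=(0.2, 0.1), goal=(0.9, 1.0))
```

The second variant would run the same lookup at every step, keyed on the current state instead of the initial state.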

https://papers.nips.cc/paper/6931-deep-sets.pdf Deep Sets

In contrast to the traditional approach of operating on fixed dimensional vectors, we consider objective functions defined on sets that are invariant to permutations.
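
One way to obtain a permutation-invariant function is to transform each element independently, pool with a sum, and transform the pooled result. The concrete `phi` and `rho` below are toy placeholders; in practice both would be learned networks.

```python
def phi(x):
    """Per-element feature map (toy placeholder)."""
    return [x, x * x]


def rho(pooled):
    """Map from the pooled representation to the output (toy placeholder)."""
    return pooled[0] + 0.5 * pooled[1]


def set_function(elements):
    """Compute rho(sum of phi(x) over the set). The sum ignores order,
    so the result is the same for any permutation of the input."""
    pooled = [0.0, 0.0]
    for x in elements:
        for i, feature in enumerate(phi(x)):
            pooled[i] += feature
    return rho(pooled)


a = set_function([1.0, 2.0, 3.0])
b = set_function([3.0, 1.0, 2.0])
```

The same function also accepts sets of different sizes, since the pooled vector has a fixed dimension regardless of how many elements were summed.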