**This is an old revision of the document!**

# Semi-Supervised Learning

**References**

http://arxiv.org/abs/1511.09123v1 A Short Survey on Data Clustering Algorithms

https://arxiv.org/abs/1511.01432 Semi-Supervised Deep Learning

http://arxiv.org/pdf/1606.06724v1.pdf Tagger: Deep Unsupervised Perceptual Grouping

Rather than being trained for any specific segmentation, our framework learns the grouping process in an unsupervised manner or alongside any supervised task. By enriching the representations of a neural network, we enable it to group the representations of different objects in an iterative manner. By allowing the system to amortize the iterative inference of the groupings, we achieve very fast convergence

https://arxiv.org/abs/1604.00289

Building Machines That Learn and Think Like People

https://arxiv.org/pdf/1606.05579.pdf

Early Visual Concept Learning with Unsupervised Deep Learning

By enforcing redundancy reduction, encouraging statistical independence, and exposure to data with transform continuities analogous to those to which human infants are exposed, we obtain a variational autoencoder (VAE) framework capable of learning disentangled factors. Our approach makes few assumptions and works well across a wide variety of datasets. Furthermore, our solution has useful emergent properties, such as zero-shot inference and an intuitive understanding of “objectness”.

https://arxiv.org/abs/1511.02251 Learning Visual Features from Large Weakly Supervised Data

In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and captions, and show that these networks produce features that perform well in a range of vision problems.

This study demonstrates that convolutional networks can be trained from scratch without any manual annotation and shows that good features can be learned from weakly supervised data. Indeed, our models learn features that are nearly on par with those learned from an image collection with over a million manually defined labels, and achieve good results on a variety of datasets. (Obtaining state-of-the-art results requires averaging predictions over many crops and models, which is outside the scope of this paper.) Moreover, our results show that weakly supervised models can learn semantic structure from image-word co-occurrences

https://research.googleblog.com/2016/10/graph-powered-machine-learning-at-google.html Graph Powered Machine Learning at Google

https://arxiv.org/abs/1512.01752 Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation

Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification Zhang, Y., Lee, K., & Lee, H. (2016) [29]

This paper starts out with a brief history of using unsupervised and semi-supervised methods in deep learning. The authors showed how such methods can be scaled to solve large-scale problems. Using their approach, existing neural network architectures for image classification can be augmented with unsupervised decoding pathways for image reconstruction. The decoding pathways consist of a deconvolutional network that mirrors the original network using autoencoders. They initialized the weights for the encoding pathway with the original network and for the decoding pathway with random values. Initially, they trained only the decoding pathway while keeping the encoding pathway fixed. Then they fine-tuned the full network with a reduced learning rate. Applying this method to a state-of-the-art image classification network boosted its performance significantly.

Deconstructing the Ladder Network Architecture Pezeshki, M., Fan, L., Brakel, P., Courville, A., & Bengio, Y. (2016) [20]

A different approach for combining supervised and unsupervised training of deep neural networks is the Ladder Network architecture [21]. It also improves the performance of an existing classifier network by augmenting it with an auxiliary decoder network, but it has additional lateral connections between the original and decoder networks. The resultant network forms a deep stack of denoising autoencoders [26] that is trained to reconstruct each layer from a noisy version. In this paper, the authors studied the ladder architecture systematically by removing its components one at a time to see how much each component contributed to performance. They found that the lateral connections are the most important, followed by the injection of noise, and finally by the choice of the combinator function that combines the vertical and lateral connections. They also introduced a new combinator function that improved the already impressive performance of the ladder network on the Permutation-Invariant MNIST handwritten digit recognition task [15], both for the supervised and semi-supervised settings.

https://arxiv.org/abs/1611.09960 Attend in groups: a weakly-supervised deep learning framework for learning from web data

https://arxiv.org/pdf/1703.00848.pdf Unsupervised Image-to-Image Translation Networks

The proposed framework can learn the translation function without any corresponding images in two domains. We enable this learning capability by combining a weight-sharing constraint and an adversarial training objective.

We model each image domain using a VAE and a GAN. Through an adversarial training objective, an image fidelity function is implicitly defined for each domain. The adversarial training objective interacts with a weight-sharing constraint to generate corresponding images in two domains, while the variational autoencoders relate translated images with input images in the respective domains.

Based on the intuition that a pair of corresponding images in different domains should share a same high-level image representation, we enforce several weight sharing constraints. The connection weights of the last few layers (high-level layers) in E1 and E2 are tied, the connection weights of the first few layers (high-level layers) in G1 and G2 are tied, and the connection weights of the last few layers (high-level layers) in D1 and D2 are tied.

https://arxiv.org/abs/1703.00854v1 Learning the Structure of Generative Models without Labeled Data

Recent frameworks address this bottleneck with generative models to synthesize labels at scale from weak supervision sources. The generative model's dependency structure directly affects the quality of the estimated labels, but selecting a structure automatically without any labeled data is a distinct challenge.

https://www.semanticscholar.org/paper/Distant-supervision-for-relation-extraction-Mintz-Bills/8f8139b63a2fc0b3ae8413acaef47acd35a356e0 Distant supervision for relation extraction without labeled data

We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large un-labeled corpus and extract textual features to train a relation classifier. Our algorithm combines the advantages of supervised IE (combining 400,000 noisy pattern features in a probabilistic classifier) and unsupervised IE (extracting large numbers of relations from large corpora of any domain). Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%. We also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distant in their expression.

https://arxiv.org/abs/1703.02618v1 Bootstrapped Graph Diffusions: Exposing the Power of Nonlinearity

we place classic linear graph diffusions in a self-training framework. Surprisingly, we observe that SSL using the resulting {\em bootstrapped diffusions} not only significantly improves over the respective non-bootstrapped baselines but also outperform state-of-the-art non-linear SSL methods. Moreover, since the self-training wrapper retains the scalability of the base method, we obtain both higher quality and better scalability.

https://github.com/parthatalukdar/junto

https://arxiv.org/abs/1610.02242 Temporal Ensembling for Semi-Supervised Learning

In this paper, we present a simple and efficient method for training deep neural networks in a semi-supervised setting where only a small portion of training data is labeled. We introduce self-ensembling, where we form a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs, and most importantly, under different regularization and input augmentation conditions. This ensemble prediction can be expected to be a better predictor for the unknown labels than the output of the network at the most recent training epoch, and can thus be used as a target for training. Using our method, we set new records for two standard semi-supervised learning benchmarks, reducing the (non-augmented) classification error rate from 18.44% to 7.05% in SVHN with 500 labels and from 18.63% to 16.55% in CIFAR-10 with 4000 labels, and further to 5.12% and 12.16% by enabling the standard augmentations. We additionally demonstrate good tolerance to incorrect labels. https://github.com/smlaine2/tempens

http://arxiv.org/abs/1406.5298 Semi-Supervised Learning with Deep Generative Models https://github.com/saemundsson/semisupervised_vae

https://arxiv.org/pdf/1703.04818v1.pdf Neural Graph Machines: Learning Neural Networks Using Graphs

In this work, we propose a training framework with a graph-regularised objective, namely Neural Graph Machines, that can combine the power of neural networks and label propagation. This work generalises previous literature on graphaugmented training of neural networks, enabling it to be applied to multiple neural architectures (Feed-forward NNs, CNNs and LSTM RNNs) and a wide range of graphs. The new objective allows the neural networks to harness both labeled and unlabeled data by: (a) allowing the network to train using labeled data as in the supervised setting, (b) biasing the network to learn similar hidden representations for neighboring nodes on a graph, in the same vein as label propagation. Such architectures with the proposed objective can be trained efficiently using stochastic gradient descent and scaled to large graphs, with a runtime that is linear in the number of edges.

https://arxiv.org/pdf/1703.07464v1.pdf No Fuss Distance Metric Learning using Proxies

We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity. Traditionally, for this problem supervision is expressed in the form of sets of points that follow an ordinal relationship – an anchor point x is similar to a set of positive points Y, and dissimilar to a set of negative points Z, and a loss defined over these distances is minimized. While the specifics of the optimization differ, in this work we collectively call this type of supervision Triplets and all methods that follow this pattern Triplet-Based methods. These methods are challenging to optimize. A main issue is the need for finding informative triplets, which is usually achieved by a variety of tricks such as increasing the batch size, hard or semi-hard triplet mining, etc, but even with these tricks, the convergence rate of such methods is slow. In this paper we propose to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points. These proxies approximate the original data points, so that a triplet loss over the proxies is a tight upper bound of the original loss. This proxy-based loss is empirically better behaved. As a result, the proxy-loss improves on state-of-art results for three standard zero-shot learning datasets, by up to 15% points, while converging three times as fast as other triplet-based losses.

https://arxiv.org/pdf/1704.05310v1.pdf Unsupervised Learning by Predicting Noise

We propose to fix a set of target representations, called Noise As Targets (NAT), and to constrain the deep features to align to them. This domain agnostic approach avoids the standard unsupervised learning issues of trivial solutions and collapsing of features. Thanks to a stochastic batch reassignment strategy and a separable square loss function, it scales to millions of images. The proposed approach produces representations that perform on par with state-of-the-art unsupervised methods on ImageNet and PASCAL VOC.

https://arxiv.org/abs/1707.00189v1 An Approach for Weakly-Supervised Deep Information Retrieval

We present an approach for generating weak supervision training data for use in a neural IR model. Specifically, we use a news corpus with article headlines acting as pseudo-queries and article content as pseudo-documents, and we propose a measure of interaction similarity to filter these pseudo-documents.

https://arxiv.org/abs/1706.00909 Learning by Association - A versatile semi-supervised training method for neural networks

We propose a new framework for semi-supervised training of deep neural networks inspired by learning in humans. “Associations” are made from embeddings of labeled samples to those of unlabeled ones and back. The optimization schedule encourages correct association cycles that end up at the same class from which the association was started and penalizes wrong associations ending at a different class. The implementation is easy to use and can be added to any existing end-to-end training setup.

http://dawn.cs.stanford.edu/2017/07/16/weak-supervision/ Weak Supervision: The New Programming Paradigm for Machine Learning

Getting labeled training data has become the key development bottleneck in supervised machine learning. We provide a broad, high-level overview of recent weak supervision approaches, where noisier or higher-level supervision is used as a more expedient and flexible way to get supervision signal, in particular from subject matter experts (SMEs). We provide a simple, broad definition of weak supervision as being comprised of one or more noisy conditional distributions over unlabeled data, and focus on the key technical challenge of unifying and modeling these sources.

https://arxiv.org/abs/1607.06854 Unsupervised Learning from Continuous Video in a Scalable Predictive Recurrent Network

https://arxiv.org/abs/1705.10694v2 Deep Learning is Robust to Massive Label Noise

In this paper, we investigate the behavior of deep neural networks on training sets with massively noisy labels. We show that successful learning is possible even with an essentially arbitrary amount of noise. For example, on MNIST we find that accuracy of above 90 percent is still attainable even when the dataset has been diluted with 100 noisy examples for each clean example.

https://arxiv.org/pdf/1710.02584.pdf Bag-Level Aggregation for Multiple Instance Active Learning in Instance Classification Problems

This paper focuses on AL methods for instance classification problems in multiple instance learning (MIL), where data is arranged into sets, called bags, that are weakly labeled. Most AL methods focus on single instance learning problems. These methods are not suitable for MIL problems because they cannot account for the bag structure of data. In this paper, new methods for bag-level aggregation of instance informativeness are proposed for multiple instance active learning (MIAL). The aggregated informativeness method identifies the most informative instances based on classifier uncertainty, and queries bags incorporating the most information. The other proposed method, called clusterbased aggregative sampling, clusters data hierarchically in the instance space. The informativeness of instances is assessed by considering bag labels, inferred instance labels, and the proportion of labels that remain to be discovered in clusters.

https://openreview.net/pdf?id=ByL48G-AW SIMPLE NEAREST NEIGHBOR POLICY METHOD FOR CONTINUOUS CONTROL TASKS

We design a new policy, called a nearest neighbor policy, that does not require any optimization for simple, low-dimensional continuous control tasks. As this policy does not require any optimization, it allows us to investigate the underlying difficulty of a task without being distracted by optimization difficulty of a learning algorithm. We propose two variants, one that retrieves an entire trajectory based on a pair of initial and goal states, and the other retrieving a partial trajectory based on a pair of current and goal states.

https://papers.nips.cc/paper/6931-deep-sets.pdf Deep Sets

In contrast to traditional approach of operating on fixed dimensional vectors, we consider objective functions defined on sets that are invariant to permutations. We also derive the necessary and sufficient conditions for permutation equivariance in deep models. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and outlier detection.

https://arxiv.org/abs/1702.07817 Unsupervised Sequence Classification using Sequential Output Statistics

we propose an unsupervised learning cost function and study its properties. We show that, compared to earlier works, it is less inclined to be stuck in trivial solutions and avoids the need for a strong generative model. Although it is harder to optimize in its functional form, a stochastic primal-dual gradient method is developed to effectively solve the problem.

https://openreview.net/forum?id=B1X0mzZCW Fidelity-Weighted Learning

To this end, we propose “fidelity-weighted learning” (FWL), a semi-supervised student- teacher approach for training deep neural networks using weakly-labeled data. FWL modulates the parameter updates to a student network (trained on the task we care about) on a per-sample basis according to the posterior confidence of its label-quality estimated by a teacher (who has access to the high-quality labels). Both student and teacher are learned from the data. We evaluate FWL on two tasks in information retrieval and natural language processing where we outperform state-of-the-art alternative semi-supervised methods, indicating that our approach makes better use of strong and weak labels, and leads to better task-dependent data representations.

https://hazyresearch.github.io/snorkel/blog/ws_blog_post.html

https://arxiv.org/abs/1704.08803 Neural Ranking Models with Weak Supervision

Hence, in this paper, we propose to train a neural ranking model using weak supervision, where labels are obtained automatically without human annotators or any external resources (e.g., click data). To this aim, we use the output of an unsupervised ranking model, such as BM25, as a weak supervision signal.

http://metalearning.ml/papers/metalearn17_dehghani.pdf Learning to Learn from Weak Supervision by Full Supervision

In this paper, we propose a method for training neural networks when we have a large set of data with weak labels and a small amount of data with true labels. In our proposed model, we train two neural networks: a target network, the learner and a confidence network, the meta-learner. The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to control the magnitude of the gradient updates to the target network using the scores provided by the second confidence network, which is trained on a small amount of supervised data. Thus we avoid that the weight updates computed from noisy labels harm the quality of the target network model.

https://openreview.net/pdf?id=ByoT9Fkvz LEARNING TO LEARN WITHOUT LABELS

By recasting unsupervised learning as meta-learning, we treat the creation of the unsupervised update rule as a transfer learning problem. Instead of learning transferable features, such as done in (Vinyals et al., 2016; Ravi & Larochelle, 2016; Snell et al., 2017), we learn a transferable learning rule which does not require access to labels and generalizes across domains. Although we focus on the meta-objective of semi-supervised classification here, in principle a learning rule could be optimized to generate representations for any subsequent task.

https://papers.nips.cc/paper/7278-learning-to-model-the-tail Learning to Model the Tail

We describe an approach to learning from long-tailed, imbalanced datasets that are prevalent in real-world settings.

https://arxiv.org/abs/1804.00092 Iterative Learning with Open-set Noisy Labels

Large-scale datasets possessing clean label annotations are crucial for training Convolutional Neural Networks (CNNs). However, labeling large-scale data can be very costly and error-prone, and even high-quality datasets are likely to contain noisy (incorrect) labels. Existing works usually employ a closed-set assumption, whereby the samples associated with noisy labels possess a true class contained within the set of known classes in the training data. However, such an assumption is too restrictive for many applications, since samples associated with noisy labels might in fact possess a true class that is not present in the training data. We refer to this more complex scenario as the \textbf{open-set noisy label} problem and show that it is nontrivial in order to make accurate predictions. To address this problem, we propose a novel iterative learning framework for training CNNs on datasets with open-set noisy labels. Our approach detects noisy labels and learns deep discriminative features in an iterative fashion. To benefit from the noisy label detection, we design a Siamese network to encourage clean labels and noisy labels to be dissimilar. A reweighting module is also applied to simultaneously emphasize the learning from clean labels and reduce the effect caused by noisy labels. Experiments on CIFAR-10, ImageNet and real-world noisy (web-search) datasets demonstrate that our proposed model can robustly train CNNs in the presence of a high proportion of open-set as well as closed-set noisy labels.

https://arxiv.org/abs/1804.03273v1 On the Supermodularity of Active Graph-based Semi-supervised Learning with Stieltjes Matrix Regularization

https://papers.nips.cc/paper/6469-dual-learning-for-machine-translation.pdf Dual Learning for Machine Translation

This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop, and generate informative feedback signals to train the translation models, even if without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model for the primal task and the other agent to represent the model for the dual task, then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the languagemodel likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using the policy gradient methods). We call the corresponding approach to neural machine translation dual-NMT.

https://arxiv.org/abs/1606.04596 Semi-Supervised Learning for Neural Machine Translation

While end-to-end neural machine translation (NMT) has made remarkable progress recently, NMT systems only rely on parallel corpora for parameter estimation. Since parallel corpora are usually limited in quantity, quality, and coverage, especially for low-resource languages, it is appealing to exploit monolingual corpora to improve NMT. We propose a semi-supervised approach for training NMT models on the concatenation of labeled (parallel corpora) and unlabeled (monolingual corpora) data. The central idea is to reconstruct the monolingual corpora using an autoencoder, in which the source-to-target and target-to-source translation models serve as the encoder and decoder, respectively. Our approach can not only exploit the monolingual corpora of the target language, but also of the source language. Experiments on the Chinese-English dataset show that our approach achieves significant improvements over state-of-the-art SMT and NMT systems.

https://arxiv.org/abs/1804.09170v1 Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

we argue that these benchmarks fail to address many issues that these algorithms would face in real-world applications. After creating a unified reimplementation of various widely-used SSL techniques, we test them in a suite of experiments designed to address these issues. We find that the performance of simple baselines which do not use unlabeled data is often underreported, that SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and that performance can degrade substantially when the unlabeled dataset contains out-of-class examples. To help guide SSL research towards real-world applicability, we make our unified reimplemention and evaluation platform publicly available.

https://arxiv.org/abs/1808.08485v1 Deep Probabilistic Logic: A Unifying Framework for Indirect Supervision

. In this paper, we propose deep probabilistic logic (DPL) as a general framework for indirect supervision, by composing probabilistic logic with deep learning. DPL models label decisions as latent variables, represents prior knowledge on their relations using weighted first-order logical formulas, and alternates between learning a deep neural network for the end task and refining uncertain formula weights for indirect supervision, using variational EM. This framework subsumes prior indirect supervision methods as special cases, and enables novel combination via infusion of rich domain and linguistic knowledge. http://hanover.azurewebsites.net/

https://openreview.net/forum?id=r1g7y2RqYX Label Propagation Networks

https://arxiv.org/abs/1810.02840 Training Complex Models with Multi-Task Weak Supervision

We show that by solving a matrix completion-style problem, we can recover the accuracies of these multi-task sources given their dependency structure, but without any labeled data, leading to higher-quality supervision for training an end model. Theoretically, we show that the generalization error of models trained with this approach improves with the number of unlabeled data points, and characterize the scaling with respect to the task and dependency structures. On three fine-grained classification problems, we show that our approach leads to average gains of 20.2 points in accuracy over a traditional supervised approach, 6.8 points over a majority vote baseline, and 4.1 points over a previously proposed weak supervision method that models tasks separately.