http://www.ics.uci.edu/~welling/publications/papers/HerdingChaos2016.pdf

Herding defines a deterministic dynamical system at the edge of chaos. It generates a sequence of model states and parameters by alternating parameter perturbations with state maximizations, where the sequence of states can be interpreted as “samples” from an associated MRF model. Herding differs from maximum likelihood estimation in that the sequence of parameters does not converge to a fixed point and differs from an MCMC posterior sampling approach in that the sequence of states is generated deterministically. Herding may be interpreted as a“perturb and map” method where the parameter perturbations are generated using a deterministic nonlinear dynamical system rather than randomly from a Gumbel distribution. This chapter studies the distinct statistical characteristics of the herding algorithm and shows that the fast convergence rate of the controlled moments may be attributed to edge of chaos dynamics. The herding algorithm can also be generalized to models with latent variables and to a discriminative learning setting. The perceptron cycling theorem ensures that the fast moment matching property is preserved in the more general framework.

https://arxiv.org/pdf/1203.4523.pdf On the Equivalence between Herding and Conditional Gradient Algorithms

We show that the herding procedure of Welling (2009b) takes exactly the form of a standard convex optimization algorithm— namely a conditional gradient algorithm minimizing a quadratic moment discrepancy. This link enables us to invoke convergence results from convex optimization and to consider faster alternatives for the task of approximating integrals in a reproducing kernel Hilbert space. We study the behavior of the different variants through numerical simulations. Our experiments shed more light on the learning bias of herding: they indicate that while we can improve over herding on the task of approximating integrals, the original herding algorithm approaches more often the maximum entropy distribution.

We showed that herding generates a sequence of points which give in sequence the descent directions of a conditional gradient algorithm minimizing the quadratic error on the moment vector. Therefore, if herding is only assessed in terms of its ability to approximate the moment vector, it is outperformed by other more efficient algorithms. Clearly, herding was originally defined with another goal, which was to generate a pseudo-sample whose distribution could approach the maximum entropy distribution with a given moment vector. Our experiments suggest empirically, that while this is the case in certain cases, herding fails in other case, which are not chosen to be particularly pathological. This probably prompts for a further study of herding.

Our experiments also show that algorithms that are
more efficient than herding at approximating the moment
vector fail more blatantly to approach a maximum
entropy distribution and they present characteristics
which would rather suggest a minimization of the
entropy. This suggests the question of **whether there is
a tradeoff between approximating most efficiently the mean vector and approximating well the maximum entropy
distribution**, or if the two goals are in fact rather
aligned? In any case, we hope that formulating herding
as an optimization problem can help form a better
understanding of its goals and its properties.

Herding Dynamical Weights to Learn http://machinelearning.org/archive/icml2009/papers/447.pdf

A new “herding” algorithm is proposed which directly converts observed moments into a sequence of pseudo-samples. The pseudosamples respect the moment constraints and may be used to estimate (unobserved) quantities of interest. The procedure allows us to sidestep the usual approach of first learning a joint model (which is intractable) and then sampling from that model (which can easily get stuck in a local mode). Moreover, the algorithm is fully deterministic, avoiding random number generation) and does not need expensive operations such as exponentiation.

https://pdfs.semanticscholar.org/581e/bbabadf59ae0f07c796f2d3b3ecffa839cd5.pdf On Herding in Deep Networks

https://www.ics.uci.edu/~welling/people/thesisYutian.pdf Herding: Driving Deterministic Dynamics to Learn and Sample Probabilistic Models

https://arxiv.org/abs/1703.00887v1 How to Escape Saddle Points Efficiently

A perturbed form of gradient descent converges to a second-order stationary point in a number iterations which depends only poly-logarithmically on dimension (i.e., it is almost “dimension-free”). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free.

https://arxiv.org/pdf/1703.00810v1.pdf Opening the Black Box of Deep Neural Networks via Information

https://arxiv.org/pdf/1610.08401v3.pdf Universal adversarial perturbations

We showed the existence of small universal perturbations that can fool state-of-the-art classifiers on natural images. We proposed an iterative algorithm to generate universal perturbations, and highlighted several properties of such perturbations. In particular, we showed that universal perturbations generalize well across different classification models, resulting in doubly-universal perturbations (imageagnostic, network-agnostic). We further explained the existence of such perturbations with the correlation between different regions of the decision boundary. This provides insights on the geometry of the decision boundaries of deep neural networks, and contributes to a better understanding of such systems.

https://arxiv.org/pdf/1703.06975v1.pdf Learning to Generate Samples from Noise through Infusion Training

In this work, we investigate a novel training procedure to learn a generative model as the transition operator of a Markov chain, such that, when applied repeatedly on an unstructured random noise sample, it will denoise it into a sample that matches the target distribution from the training set. The novel training procedure to learn this progressive denoising operation involves sampling from a slightly different chain than the model chain used for generation in the absence of a denoising target. In the training chain we infuse information from the training target example that we would like the chains to reach with a high probability. The thus learned transition operator is able to produce quality and varied samples in a small number of steps. Experiments show competitive results compared to the samples generated with a basic Generative Adversarial Net.

https://arxiv.org/abs/1703.07928 Guided Perturbations: Self Corrective Behavior in Convolutional Neural Networks

n this work, we present an intriguing behavior: pre-trained CNNs can be made to improve their predictions by structurally perturbing the input. We observe that these perturbations - referred as Guided Perturbations - enable a trained network to improve its prediction performance without any learning or change in network weights. We perform various ablative experiments to understand how these perturbations affect the local context and feature representations. Furthermore, we demonstrate that this idea can improve performance of several existing approaches on semantic segmentation and scene labeling tasks on the PASCAL VOC dataset and supervised classification tasks on MNIST and CIFAR10 datasets.

https://arxiv.org/abs/1104.2018 Efficient learning of generalized linear and single index models with isotonic regression.

The algorithm is a variant of the “gradient-like” perceptron algorithm, with the added twist that on each update, an isotonic regression procedure is performed on the linear predictions. Recall that isotonic regression is a procedure which finds the best monotonic one dimensional regression function. Here, the well-known Pool Adjacent Violator (PAV) algorithm provides a computationally efficient method for this task.

As each iteration of the algorithm throws away all previous training data and requests new examples. Our intuition is that the underlying technical reasons for this are due to the fact that the PAV algorithm need not return a function with a bounded Lipschitz constant.