This is an old revision of the document!

# Random Projections

Aliases SimHash

Intent

Use random projections to construct a mapping that preserves locality.

Motivation

How to create an acceptable mapping between spaces while preserving locality of similar objects?

Structure

<Diagram>

Discussion

The J-L theorem provides a guide as to the minimum number of projections required to preserve the local distances between similar input vectors.

In the case of a Fully Connected Layer, we can see that each layer acts like a random projection. In practice a DNN is best initialized with Random Orthogonal Initialization. This results in each neuron performing a random projection and the collective of neurons, in the same layer, performing a number of projections that tend to map input vectors into a new space preserves the original local distances. A DNN consists of multiple layers of random projections. At each layer, as a consequence of the Activation functions, information that is invariant is ignored and only information that is relevant to the Fitness criteria is preserved over training.

Known Uses

Related Patterns

• Similarity enables random projections. Random projections are a collective of projection operators that lead to an overall vector to vector mapping that preserves locality.
• Random Matrix and a matrix multiplication against a feature vector is equivalent to random projections.
• Clustering is similar to random projections in that it captures similarities to points in multi-dimensional spaces.
• Random Orthogonal Initialization initializes the network to an optimal random projection.
• Ensembles is analogous in that each projection may be considered as an expert and all the projections as being a mixture of experts. This implies that random projections increase entropy.
• Distributed Representation is a consequence in that representation of is distributed among multiple projections.
• Disentangled Basis may be achievable if the projections are sparse.
• Compressed Sensing is a consequence of assuming a sparse random basis.

References

The random projection method of LSH due to Moses Charikar [4] called SimHash (also sometimes called arccos [17]) is designed to approximate the cosine distance between vectors. The basic idea of this technique is to choose a random hyperplane (defined by a normal unit vector r) at the outset and use the hyperplane to hash input vectors.

Given an input vector v and a hyperplane defined by r, we let h(v) = sgn(v \cdot r). That is, h(v) = \pm 1 depending on which side of the hyperplane v lies.

Each possible choice of r defines a single function. Let H be the set of all such functions and let D be the uniform distribution once again. It is not difficult to prove that, for two vectors u,v, Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}, where \theta(u,v) is the angle between u and v. 1 - \frac{\theta(u,v)}{\pi} is closely related to \cos(\theta(u,v)).

In this instance hashing produces only a single bit. Two vectors' bits match with probability proportional to the cosine of the angle between them.

Charikar, Moses S. (2002). “Similarity Estimation Techniques from Rounding Algorithms”. Proceedings of the 34th Annual ACM Symposium on Theory of Computing. pp. 380–388. doi:10.1145/509907.509965.

http://arxiv.org/pdf/1507.05910v3.pdf CLUSTERING IS EFFICIENT FOR APPROXIMATE MAXIMUM INNER PRODUCT SEARCH

http://arxiv.org/pdf/1312.6199v4.pdf Intriguing properties of neural networks

The inspection of individual units makes the implicit assumption that the units of the last feature layer form a distinguished basis which is particularly useful for extracting semantic information. Instead, we show in section 3 that random projections of φ(x) are semantically indistinguishable from the coordinates of φ(x). This puts into question the conjecture that neural networks disentangle variation factors across coordinates. Generally, it seems that it is the entire space of activations, rather than the individual units, that contains the bulk of the semantic information.

Our experiments show that any random direction v ∈ R^n gives rise to similarly interpretable semantic properties. More formally, we find that images x0 are semantically related to each other, for many x0 such that x0 = arg max x∈I hφ(x), vi This suggests that the natural basis is not better than a random basis for inspecting the properties of φ(x). This puts into question the notion that neural networks disentangle variation factors across coordinates.

Our main result is that for deep neural networks, the smoothness assumption that underlies many kernel methods does not hold. Specifically, we show that by using a simple optimization procedure, we are able to find adversarial examples, which are obtained by imperceptibly small perturbations to a correctly classified input image, so that it is no longer classified correctly.

http://arxiv.org/abs/1603.04733v3 Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors

We introduce a scalable variational Bayesian neural network where the parameters are governed by a probability distribution over random matrices: the matrix variate Gaussian.

http://arxiv.org/abs/1412.6572 EXPLAINING AND HARNESSING ADVERSARIAL EXAMPLES

Adversarial examples can be explained as a property of high-dimensional dot products. They are a result of models being too linear, rather than too nonlinear.

he generalization of adversarial examples across different models can be explained as a result of adversarial perturbations being highly aligned with the weight vectors of a model, and different models learning similar functions when trained to perform the same task.

The direction of perturbation, rather than the specific point in space, matters most. Space is not full of pockets of adversarial examples that finely tile the reals like the rational numbers.

Because it is the direction that matters most, adversarial perturbations generalize across different clean examples.

We have demonstrated that adversarial training can result in regularization; even further regularization than dropout.

We have run control experiments that failed to reproduce this effect with simpler but less efficient regularizers including L1 weight decay and adding noise.

Models that are easy to optimize are easy to perturb.

Linear models lack the capacity to resist adversarial perturbation; only structures with a hidden layer (where the universal approximator theorem applies) should be trained to resist adversarial perturbation.

RBF networks are resistant to adversarial examples. (!!)

Models trained to model the input distribution are not resistant to adversarial examples.

Ensembles are not resistant to adversarial examples

The existence of adversarial examples suggests that being able to explain the training data or even being able to correctly label the test data does not imply that our models truly understand the tasks we have asked them to perform. Instead, their linear responses are overly confident at points that do not occur in the data distribution, and these confident predictions are often highly incorrect. This work has shown we can partially correct for this problem by explicitly identifying problematic points and correcting the model at each of these points. However, one may also conclude that the model families we use are intrinsically flawed. Ease of optimization has come at the cost of models that are easily misled. This motivates the development of optimization procedures that are able to train models whose behavior is more locally stable.

Margin Preservation of Deep Neural Networks

The network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.

Input margin of a training sample is the distance of the training sample to the classification boundary in the input space. Note that the classification boundary in the input space is piecewise linear for DNN with ReLUs and it can not be optimized for directly.

Output margin of a training sample is the distance of the training sample transformed by the deep neural network to the classification boundary induced by the linear classifier in the feature space. The output margin is relevant because in practice it is much easier to compute and to be optimized for compared to the input margin

A Survey on Learning to Hash

Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this paper, we present a comprehensive survey of the learning to hash algorithms, and categorize them according to the manners of preserving the similarities into: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, as well as quantization, and discuss their relations.

Binary embeddings with structured hashed projections

Phase Transitions of the Regular Polytopes and Cone

http://www.fizyka.umk.pl/publications/kmk/09-aRPM.pdf Almost Random Projection Machine

Backpropagation of errors is not only hard to justify from biological perspective but also it fails to solve problems requiring complex logic. A simpler algorithm based on generation and filtering of useful random projections has better biological justification, is faster, easier to train and may in practice solve nonseparable problems of higher complexity than typical feedforward neural networks

http://arxiv.org/pdf/1511.05212v5.pdf Binary embeddings with structured hashed projections

We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input highdimensional vectors

http://arxiv.org/pdf/1606.04801v2.pdf A Powerful Generative Model Using Random Weights for the Deep Image Representation

First we invert representations in feature spaces and reconstruct images from white noise inputs. The reconstruction quality is statistically higher than that of the same method applied on well trained networks with the same architecture. Next we synthesize textures using scaled correlations of representations in multiple layers and our results are almost indistinguishable with the original natural texture and the synthesized textures based on the trained network. Third, by recasting the content of an image in the style of various artworks, we create artistic images with high perceptual quality, highly competitive to the prior work of Gatys et al. on pretrained networks. To our knowledge this is the first demonstration of image representations using untrained deep neural networks.

http://arxiv.org/abs/1412.0233v3 The Loss Surfaces of Multilayer Networks

Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network.

http://arxiv.org/pdf/1607.04331v1.pdf Random projections of random manifolds

Intuitively, a K dimensional linear subspace seems less complex than a K dimensional curved manifold, yet they both have the same intrinsic dimension. Despite their same intrinsic dimensionality, from a machine learning perspective, it can be harder to both learn functions on curved manifolds and compress them using dimensionality reduction, compared to linear subspaces.

https://arxiv.org/pdf/1610.03045v1.pdf Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data (see also: https://en.wikipedia.org/wiki/Duality_(optimization) )

http://openreview.net/pdf?id=Sy8gdB9xx UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION

Our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise.

At the heart of our methodology is a variant of the well-known randomization test from non-parametric statistics (Edgington & Onghena, 2007). In a first set of experiments, we train several standard architectures on a copy of the data where the true labels were replaced by random labels. Our central finding can be summarized as: Deep neural networks easily fit random labels.

More precisely, when trained on a completely random labeling of the true data, neural networks achieve 0 training error. The test error, of course, is no better than random chance as there is no correlation between the training labels and the test labels. In other words, by randomizing labels alone we can force the generalization of a model to jump up considerably without changing the model, its size, hyperparameters, or the optimizer.

Our results challenge the classical view of learning by showing that many successful neural networks easily have the effective capacity for sheer memorization. This leads us to believe that these models may very well make use of massive memorization when tackling the problems they are trained to solve. It is likely that learning in the traditional sense still occurs in part, but it appears to be deeply intertwined with massive memorization. Classical approaches are therefore poorly suited for reasoning about why these models generalize well.

http://web.cecs.pdx.edu/~mm/IJCNN2013.pdf On the Role of Shape Prototypes in Hierarchical Models of Vision

We investigate the role of learned shape-prototypes in an influential family of hierarchical neural-network models of vision. Central to these networks’ design is a dictionary of learned shapes, which are meant to respond to discriminative visual patterns in the input. While higher-level features based on such learned prototypes have been cited as key for viewpoint invariant object-recognition in these models [1], [2], we show that high performance on invariant object-recognition tasks can be obtained by using a simple set of unlearned, “shape-free” features. This behavior is robust to the size of the network. These results call into question the roles of learning and shapes pecificity in the success of such models on difficult vision tasks, and suggest that randomly constructed prototypes may provide a useful “universal” dictionary.

https://arxiv.org/pdf/1704.05310v1.pdf Unsupervised Learning by Predicting Noise

We propose to fix a set of target representations, called Noise As Targets (NAT), and to constrain the deep features to align to them. This domain agnostic approach avoids the standard unsupervised learning issues of trivial solutions and collapsing of features. Thanks to a stochastic batch reassignment strategy and a separable square loss function, it scales to millions of images. The proposed approach produces representations that perform on par with state-of-the-art unsupervised methods on ImageNet and PASCAL VOC.

https://arxiv.org/abs/1412.6616v2 Outperforming Word2Vec on Analogy Tasks with Random Projections

All that is involved is a sum of random vectors and their pointwise products. Despite the simplicity of this technique, it gives state-of-the-art results on analogy problems, in most cases better than Word2Vec. To explain this success, we interpret it as a dimension reduction via random projection.

https://arxiv.org/pdf/1705.07831.pdf Stabilizing GAN Training with Multiple Random Projections

We propose training a single generator simultaneously against an array of discriminators, each of which looks at a different random low-dimensional projection of the data. We show that individual discriminators then provide stable gradients to the generator, and that the generator learns to produce samples consistent with the full data distribution to satisfy all discriminators. We demonstrate the practical utility of this approach experimentally, and show that it is able to produce image samples with higher quality than traditional training with a single discriminator.

Overview of our approach. We train a single generator against an array of discriminators, each of which receives lower-dimensional projections—chosen randomly prior to training—as input. Individually, these discriminators are unable to perfectly separate real and generated samples, and thus provide stable gradients to the generator throughout training. In turn, by trying to fool all the discriminators simultaneously, the generator learns to match the true full data distribution.

http://delivery.acm.org/10.1145/3110000/3106504/p702-yokobayashi.pdf Modeling Random Projection for Tensor Objects

https://arxiv.org/abs/1706.04371v3 A survey of dimensionality reduction techniques based on random projection

http://www.deeplearningpatterns.com/doku.php?id=random_projections ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections