# Invariance

**Aliases**

*This identifies the pattern and should be representative of the concept that it describes. The name should be a noun that should be easily usable within a sentence. We would like the pattern to be easily referenceable in conversation between practitioners.*

**Intent**

*Describes in a single concise sentence the meaning of the pattern.*

**Motivation**

How can we train our networks to remove or ignore features? We would like to define features that remain constant across examples even as nuisance factors vary. For example, in the image domain, we would like to ignore differences in translation and rotation. Because these features are removed from the feature set, their absence cannot be expressed in a regularization term.

**Sketch**

*This section provides alternative descriptions of the pattern in the form of an illustration or alternative formal expression. By looking at the sketch a reader may quickly understand the essence of the pattern.*

**Discussion**

Einstein is said to have originally called his work an “Invariantentheorie,” or a “theory of invariance”. To his dismay, the theory became better known as “the theory of relativity”. Invariance is the notion of features remaining constant as things change. The study of invariance is prevalent in physics, a discipline that focuses on building models of physical reality. So one should not be surprised to find that mechanisms for identifying invariances are equally important in machine learning.

One of the key strengths of deep learning architectures over alternative forms of machine learning is the ability to ignore invariants in a feature set. Treating invariance mathematically is particularly difficult, so the detection of an invariant condition is usually handled indirectly: invariance is detected by identifying a set of transformations under which the resulting features remain constant. This is exactly what classifiers do; they are invariant feature detectors.

The importance of identifying invariance is that it leads to more parsimonious models. Furthermore, if we are able to ignore features that do not contribute to the final objective (say, classification), then we may obtain more robust models. Such models are robust in that they are less likely to be confused by features that should be ignored, and in that their basis is more likely to lie in a valid space.

Conventional statistical analysis starts by constructing parsimonious models of a system. The priors are constructed based on an expert's understanding of the system. For mathematical convenience, practitioners employ parametric distributions that are meant to approximate reality. However, we are constrained by what is available in our menu of distributions. In the case of the Gaussian distribution, we are limited to just the mean and the standard deviation as knobs that we can tweak. We are also limited by whatever extra baggage of assumptions the Gaussian distribution carries with it.

There are two common tools for removing invariant features. The first is by construction of the network. We can build in transformations that augment the data either during training or inside the network. These transformations condition the training so that objects differing only by an invariance are treated in the same manner. Here we know a priori that we should ignore certain features; for example, in image data, we can ignore translation or rotation. We build in transformations or similarity measures that ignore invariances. The second is Data Augmentation, where we synthetically expand the training set by generating new examples that differ from the originals only in features the network should ignore.
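
A minimal sketch of Data Augmentation, assuming square 2-D grayscale images stored as NumPy arrays and a task whose labels are unaffected by small translations and 90-degree rotations. The names `augment` and `expand_dataset` are illustrative and not taken from any library, and circular shifts via `np.roll` stand in for true translation.

```python
import numpy as np

def augment(image, max_shift=2, rng=None):
    """Return a randomly shifted and rotated copy of a square 2-D image.

    Assumption: the label of the image is unchanged by these transformations.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Random circular translation of up to max_shift pixels along each axis.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))
    # Random rotation by 0, 90, 180 or 270 degrees.
    k = int(rng.integers(0, 4))
    return np.rot90(shifted, k=k)

def expand_dataset(images, labels, copies=4, seed=0):
    """Append `copies` augmented variants of every image, reusing its label."""
    rng = np.random.default_rng(seed)
    out_x, out_y = list(images), list(labels)
    for img, lab in zip(images, labels):
        for _ in range(copies):
            out_x.append(augment(img, rng=rng))
            out_y.append(lab)
    return np.stack(out_x), np.array(out_y)
```

Training on the expanded set encourages, but does not guarantee, that the network maps an image and its transformed copies to the same output.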

Invariance is captured mathematically by analyzing the behavior of sequences of transformations. If the original feature vector is recovered through a sequence of non-trivial (that is, non-identity) transformations, then there exists some set of invariant features in the data. More generally, if a sequence of transforms always arrives at the same final feature, then we conclude that not all input features are relevant to the final feature, and the irrelevant ones can be ignored. We can see this mathematically in the following:

- An operator A with Ax = x, which leads back to the same initial feature vector.
- An operator A with Ax = z such that Bx = Bz; in other words, there exists an operator B that takes two different feature vectors into the same vector.
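
As an illustration of the case Bx = Bz, the sketch below makes two assumptions that are not in the text above: A is a circular shift of a 1-D signal, and B is the magnitude of its discrete Fourier transform. A shift changes only the phase of the Fourier coefficients, so B maps x and z = Ax to the same vector even though x and z differ.

```python
import numpy as np

def shift(x, k):
    """Operator A: circular shift by k positions (non-identity for k != 0)."""
    return np.roll(x, k)

def fourier_magnitude(x):
    """Operator B: DFT magnitude, which discards the phase that encodes position."""
    return np.abs(np.fft.fft(x))

rng = np.random.default_rng(0)
x = rng.normal(size=16)
z = shift(x, 3)  # z = Ax, a different feature vector

different_inputs = not np.allclose(x, z)
same_features = np.allclose(fourier_magnitude(x), fourier_magnitude(z))
print(different_inputs, same_features)
```

Here B is fixed by hand; in a trained network the analogous map is learned, and the invariance typically holds only approximately.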

Autoencoders that are trained to recreate an object are one method for removing invariant features from the data. Given sufficient sparsity constraints on the hidden layers, we may arrive at an encoder and decoder pair that filters out nuisance variables. We see this in Variational Autoencoders, where we drive a hidden layer towards a model that is composed of Gaussian distributions. This kind of unsupervised learning of a compression method is one way to automatically remove invariants.

Not all invariances can be captured mathematically. Take, for example, an image of a three-dimensional object. We are not going to be able to learn a two-dimensional transform that captures the behavior of three-dimensional rotation. Perhaps these kinds of invariances can be learned. We see this in Attention, where a network is trained to focus on just certain image patches while ignoring the rest. So invariance is about ignoring certain features; however, in the three-dimensional rotation example, it is also about understanding that, despite everything changing in the view, nothing about the object has changed.

**Known Uses**

http://arxiv.org/abs/1406.3884v1 Learning An Invariant Speech Representation

A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures.

http://arxiv.org/abs/1508.01983v4 Digging Deep into the layers of CNNs: In Search of How CNNs Achieve View Invariance

Does the learned CNN representation achieve viewpoint invariance? How does it achieve viewpoint invariance? Is it achieved by collapsing the view manifolds, or separating them while preserving them? At which layer is view invariance achieved? How can the structure of the view manifold at each layer of a deep convolutional neural network be quantified experimentally? How does fine-tuning of a pre-trained CNN on a multi-view dataset affect the representation at each layer of the network?

http://arxiv.org/abs/1609.04382v2 Warped Convolutions: Efficient Invariance to Spatial Transformations

We present a construction that is simple and exact, yet has the same computational complexity that standard convolutions enjoy. It consists of a constant image warp followed by a simple convolution, which are standard blocks in deep learning toolboxes. With a carefully crafted warp, the resulting architecture can be made invariant to one of a wide range of spatial transformations.

**Related Patterns**
*In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be precise or may be fuzzy, so we provide further explanation of the nature of each relationship. We also describe other patterns that may not be conceptually related but work well in combination with this pattern.*

Relationship to Canonical Patterns:

- Irreversibility serves as a mechanism to selectively remove nuisance features.
- Regularization: How can we express regularization that removes invariant features?
- Mutual Information: Does mutual information restrict the reduction of invariant variables?
- Disentangled Basis may be possible due to invariant or nuisance variables.
- Hierarchical Abstraction: How can complex invariant variables be removed if they are filtered in lower layers?
- Risk Minimization can be improved by invariant variable identification.
- Adversarial Features lead to misclassification when a classifier is unable to ignore invariant features.

Relationship to other Patterns:

Cited by these patterns:

**Further Reading**

*We provide here some additional external material that will help in exploring this pattern in more detail.*

**References**

http://arxiv.org/abs/1411.5908v2 Understanding image representations by measuring their equivariance and equivalence

We investigate three key mathematical properties of representations: equivariance, invariance, and equivalence. Equivariance studies how transformations of the input image are encoded by the representation, invariance being a special case where a transformation has no effect. Equivalence studies whether two representations, for example two different parametrisations of a CNN, capture the same visual information or not.

http://arxiv.org/pdf/1311.4158v5.pdf Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning?

from the analysis that, except of the last layer, the representation tries to achieve view invariance by separating individual instances’ view manifolds while preserving them, instead of collapsing the view manifolds to degenerate representations. This is violated at the last layer which enforces view invariance.

http://arxiv.org/abs/1512.08806v3 Common Variable Learning and Invariant Representation Learning using Siamese Neural Networks

We consider the statistical problem of learning common source of variability in data which are synchronously captured by multiple sensors, and demonstrate that Siamese neural networks can be naturally applied to this problem. This approach is useful in particular in exploratory, data-driven applications, where neither a model nor label information is available. In recent years, many researchers have successfully applied Siamese neural networks to obtain an embedding of data which corresponds to a “semantic similarity”. We present an interpretation of this “semantic similarity” as learning of equivalence classes. We discuss properties of the embedding obtained by Siamese networks and provide empirical results that demonstrate the ability of Siamese networks to learn common variability.

http://arxiv.org/abs/1503.05938v1 On Invariance and Selectivity in Representation Learning

We discuss data representation which can be learned automatically from data, are invariant to transformations, and at the same time selective, in the sense that two points have the same representation only if they are one the transformation of the other.

Two stages: group and non-group transformations. The core of the theory applies to compact groups such as rotations of the image in the image plane. Exact invariance for each module is equivalent to a localization condition which could be interpreted as a form of sparsity [3]. If the condition is relaxed to hold approximately it becomes a sparsity condition for the class of images w.r.t. the dictionary t^k under the group G when restricted to a subclass of similar images. This property, which is similar to compressive sensing “incoherence” (but in a group context), requires that I and t^k have a representation with rather sharply peaked autocorrelation (and correlation) and guarantees approximate invariance for transformations which do not have group structure.

http://arxiv.org/abs/1206.6418v1 Learning Invariant Representations with Local Transformations

http://arxiv.org/abs/1311.4158v5 Unsupervised Learning of Invariant Representations in Hierarchical Architectures

https://drive.google.com/a/codeaudit.com/file/d/0Bxf-khCt_eknVmNwNDJqaEJzcGs/view Tangent Propagation

Tangent propagation [3] is a method to regularize neural nets explicitly to be invariant to known transformations in the input space. It has been applied mainly for computer vision with transformations like translation and rotation with great success in the 90s.
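
A rough sketch of the idea, under several assumptions that do not come from the cited work: the input is a 1-D signal, the known transformation is a circular shift, the network is a single `tanh` layer, and the derivative along the transformation is approximated by finite differences instead of being propagated analytically. All function names are illustrative.

```python
import numpy as np

def tangent_vector(x):
    """Central-difference approximation of d/da shift(x, a) at a = 0."""
    return (np.roll(x, -1) - np.roll(x, 1)) / 2.0

def model(x, W):
    """Toy one-layer network standing in for the model being regularized."""
    return np.tanh(W @ x)

def tangent_penalty(x, W, eps=1e-4):
    """Squared norm of the output's directional derivative along the tangent.

    Adding this term to the training loss pushes the model towards being
    locally invariant to the transformation that generated the tangent.
    """
    t = tangent_vector(x)
    directional = (model(x + eps * t, W) - model(x - eps * t, W)) / (2 * eps)
    return float(np.sum(directional ** 2))
```

In training, a term such as `loss + lam * tangent_penalty(x, W)` would be minimized, with the hypothetical weight `lam` controlling how strongly invariance is enforced.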

http://journals.aps.org/prl/pdf/10.1103/PhysRevLett.114.220001 Can there be Physics of the brain?

Biological experiments show that while the values of many parameters constantly change, neural functions remain remarkably stable. In all these cases, diverse arrangements of microscopic variables produce the same macroscopic invariant.

https://arxiv.org/abs/1307.0998v3 A Unified Framework of Elementary Geometric Transformation Representation

As an extension of projective homology, stereohomology is proposed via an extension of Desargues theorem and the extended Desargues configuration. Geometric transformations such as reflection, translation, central symmetry, central projection, parallel projection, shearing, central dilation, scaling, and so on are all included in stereohomology and represented as Householder-Chen elementary matrices. Hence all these geometric transformations are called elementary. This makes it possible to represent these elementary geometric transformations in homogeneous square matrices independent of a particular choice of coordinate system.

https://arxiv.org/abs/1611.00740v1 Why and When Can Deep – but Not Shallow – Networks Avoid the Curse of Dimensionality

The paper reviews an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. Deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Explanation of a few key theorems is provided together with new results, open problems and conjectures.

The key property that makes deep nets exponentially better than shallow for compositional functions is the locality of the constituent functions that is their low dimensionality.

http://openreview.net/forum?id=Syfkm6cgx Improving Invariance and Equivariance Properties of Convolutional Neural Networks

We find that CNNs learn invariance wrt all 9 tested transformation types and that invariance extends to transformations outside the training range. Additionally, we also propose a loss function that aims to improve CNN equivariance.

http://openreview.net/pdf?id=BkmM8Dceg WARPED CONVOLUTIONS: EFFICIENT INVARIANCE TO SPATIAL TRANSFORMATIONS

We present a construction that is simple and exact, yet has the same computational complexity that standard convolutions enjoy. It consists of a constant image warp followed by a simple convolution, which are standard blocks in deep learning toolboxes. With a carefully crafted warp, the resulting architecture can be made invariant to one of a wide range of spatial transformations.

http://openreview.net/pdf?id=rk9eAFcxg VARIATIONAL RECURRENT ADVERSARIAL DEEP DOMAIN ADAPTATION

Our model termed as Variational Recurrent Adversarial Deep Domain Adaptation (VRADA) is built atop a variational recurrent neural network (VRNN) and trains adversarially to capture complex temporal relationships that are domain invariant. This is (as far as we know) the first to capture and transfer temporal latent dependencies in multivariate time-series data.

https://arxiv.org/abs/1611.01046 Learning to Pivot with Adversarial Networks

Robust inference is possible if it is based on a pivot – a quantity whose distribution is invariant to the unknown value of the (categorical or continuous) nuisance parameters that parametrizes this family of generation processes. In this work, we introduce a flexible training procedure based on adversarial networks for enforcing the pivotal property on a predictive model. We derive theoretical results showing that the proposed algorithm tends towards a minimax solution corresponding to a predictive model that is both optimal and independent of the nuisance parameters (if that models exists) or for which one can tune the trade-off between power and robustness.

In terms of applications, the proposed solution can be used in any situation where the training data may not be representative of the real data the predictive model will be applied to in practice.

http://openreview.net/pdf?id=Hyq4yhile LEARNING INVARIANT FEATURE SPACES TO TRANSFER SKILLS WITH REINFORCEMENT LEARNING

Our method uses the skills that were learned by both agents to train invariant feature spaces that can then be used to transfer other skills from one agent to another. The process of learning these invariant feature spaces can be viewed as a kind of “analogy making,” or implicit learning of partial correspondences between two distinct domains.

http://scholarworks.umt.edu/cgi/viewcontent.cgi?article=1007&context=cs_pubs

http://openreview.net/pdf?id=SJiFvr9el LINEAR TIME COMPLEXITY DEEP FOURIER SCATTERING NETWORK AND EXTENSION TO NONLINEAR INVARIANTS

A scalable version of a state-of-the-art deterministic time-invariant feature extraction approach based on consecutive changes of basis and nonlinearities, namely, the scattering network.

https://arxiv.org/pdf/1611.04500v1.pdf DEEP LEARNING WITH SETS AND POINT CLOUDS

We study a simple notion of structural invariance that readily suggests a parameter-sharing scheme in deep neural networks. In particular, we define structure as a collection of relations, and derive graph convolution and recurrent neural networks as special cases. We study composition of basic structures in defining models that are invariant to more complex “product” structures such as graph of graphs, sets of images or sequence of sets. For demonstration, our experimental results are focused on the setting where the discrete structure of interest is a set. We present results on several novel and non-trivial problems on sets, including point-cloud classification, set outlier detection and semi-supervised learning using clustering information.

https://arxiv.org/abs/1605.06743v2 Inductive Bias of Deep Convolutional Networks through Pooling Geometry

Our formal understanding of the inductive bias that drives the success of convolutional networks on computer vision tasks is limited. In particular, it is unclear what makes hypotheses spaces born from convolution and pooling operations so suitable for natural images. In this paper we study the ability of convolutional networks to model correlations among regions of their input. We theoretically analyze convolutional arithmetic circuits, and empirically validate our findings on other types of convolutional networks as well. Correlations are formalized through the notion of separation rank, which for a given partition of the input, measures how far a function is from being separable. We show that a polynomially sized deep network supports exponentially high separation ranks for certain input partitions, while being limited to polynomial separation ranks for others. The network's pooling geometry effectively determines which input partitions are favored, thus serves as a means for controlling the inductive bias. Contiguous pooling windows as commonly employed in practice favor interleaved partitions over coarse ones, orienting the inductive bias towards the statistics of natural images. Other pooling schemes lead to different preferences, and this allows tailoring the network to data that departs from the usual domain of natural imagery. In addition to analyzing deep networks, we show that shallow ones support only linear separation ranks, and by this gain insight into the benefit of functions brought forth by depth - they are able to efficiently model strong correlation under favored partitions of the input.

http://www.mathcs.emory.edu/~dzb/teaching/421Fall2014/VGT-Ch-1-2.pdf Visual Group Theory

https://arxiv.org/abs/1612.04642 Harmonic Networks: Deep Translation and Rotation Equivariance

Translating or rotating an input image should not affect the results of many computer vision tasks. Convolutional neural networks (CNNs) are already translation equivariant: input image translations produce proportionate feature map translations. This is not the case for rotations. Global rotation equivariance is typically sought through data augmentation, but patch-wise equivariance is more difficult. We present Harmonic Networks or H-Nets, a CNN exhibiting equivariance to patch-wise translation and 360-rotation. We achieve this by replacing regular CNN filters with circular harmonics, returning a maximal response and orientation for every receptive field patch. H-Nets use a rich, parameter-efficient and low computational complexity representation, and we show that deep feature maps within the network encode complicated rotational invariants. We demonstrate that our layers are general enough to be used in conjunction with the latest architectures and techniques, such as deep supervision and batch normalization. We also achieve state-of-the-art classification on rotated-MNIST, and competitive results on other benchmark challenges.

https://arxiv.org/abs/1611.05013v1 PixelVAE: A Latent Variable Model for Natural Images

Because the autoregressive conditional likelihood function of PixelVAE is expressive enough to model some properties of the image distribution, it isn’t forced to account for those properties through its latent variables as a standard VAE is. As a result, we can expect PixelVAE to learn latent representations which are invariant to textures, precise positions, and other attributes which are more efficiently modeled by the autoregressive decoder.

https://arxiv.org/abs/1612.08498v1 Steerable CNNs

It has long been recognized that the invariance and equivariance properties of a representation are critically important for success in many vision tasks. In this paper we present Steerable Convolutional Neural Networks, an efficient and flexible class of equivariant convolutional networks. We show that steerable CNNs achieve state of the art results on the CIFAR image classification benchmark. The mathematical theory of steerable representations reveals a type system in which any steerable representation is a composition of elementary feature types, each one associated with a particular kind of symmetry. We show how the parameter cost of a steerable filter bank depends on the types of the input and output features, and show how to use this knowledge to construct CNNs that utilize parameters effectively.

https://arxiv.org/abs/1605.01224v2 Learning Covariant Feature Detectors

Local covariant feature detection, namely the problem of extracting viewpoint invariant features from images, has so far largely resisted the application of machine learning techniques. In this paper, we propose the first fully general formulation for learning local covariant feature detectors. We propose to cast detection as a regression problem, enabling the use of powerful regressors such as deep neural networks. We then derive a covariance constraint that can be used to automatically learn which visual structures provide stable anchors for local feature detection. We support these ideas theoretically, proposing a novel analysis of local features in term of geometric transformations, and we show that all common and many uncommon detectors can be derived in this framework. Finally, we present empirical results on translation and rotation covariant detectors on standard feature benchmarks, showing the power and flexibility of the framework.

https://openreview.net/forum?id=rJQKYt5ll

https://www.semanticscholar.org/paper/Group-Integration-Techniques-in-Pattern-Analysis-A-Reisert/61eee39547bb5ece4f23c290ff152a7d72f51d4c Group Integration Techniques in Pattern Analysis - A Kernel View

https://github.com/carpedm20/visual-analogy-tensorflow

https://arxiv.org/abs/1604.08859 The Z-loss: a shift and scale invariant classification loss belonging to the Spherical Family

Furthermore, we show on a word language modeling task that it also outperforms the log-softmax with respect to certain ranking scores, such as top-k scores, suggesting that the Z-loss has the flexibility to better match the task loss. These qualities thus makes the Z-loss an appealing candidate to train very efficiently large output networks such as word-language models or other extreme classification problems. On the One Billion Word (Chelba et al., 2014) dataset, we are able to train a model with the Z-loss 40 times faster than the log-softmax and more than 4 times faster than the hierarchical softmax.

https://arxiv.org/abs/1703.00356v1 Graph-based Isometry Invariant Representation Learning

In this work we present a novel Transformation Invariant Graph-based Network (TIGraNet), which learns graph-based features that are inherently invariant to isometric transformations such as rotation and translation of input images. In particular, images are represented as signals on graphs, which permits to replace classical convolution and pooling layers in deep networks with graph spectral convolution and dynamic graph pooling layers that together contribute to invariance to isometric transformations.

Our new method is able to correctly classify rotated and translated images even if such transformed images do not appear in the training set. This confirms its high potential in practical settings where the training sets are limited but where the data is expected to present high variability.

https://arxiv.org/abs/1612.09346v2 Rotation equivariant vector field networks

We propose Rotation Equivariant vector field Networks (RotEqNet) to encode rotation equivariance and invariance into Convolutional Neural Networks (CNNs). Each convolutional filter is applied at multiple orientations and returns a vector field that represents the magnitude and angle of the highest scoring orientation at every spatial location. A modified convolution operator using vector fields as inputs and filters can then be applied to obtain deep architectures. We test RotEqNet on several problems requiring different responses with respect to the inputs' rotation: image classification, biomedical image segmentation, orientation estimation and patch matching. In all cases, we show that RotEqNet offers very compact models in terms of number of parameters and provides results in line to those of networks orders of magnitude larger.

https://arxiv.org/abs/1706.01350v1 On the Emergence of Invariance and Disentangling in Deep Representations

Using classical notions of statistical decision and information theory, we show that invariance in a deep neural network is equivalent to minimality of the representation it computes, and can be achieved by stacking layers and injecting noise in the computation, under realistic and empirically validated assumptions. We use an Information Decomposition of the empirical loss to show that overfitting can be reduced by limiting the information content stored in the weights. We then present a sharp inequality that relates the information content in the weights – which are a representation of the training set and inferred by generic optimization agnostic of invariance and disentanglement – and the minimality and total correlation of the activation functions, which are a representation of the test datum. This allows us to tackle recent puzzles concerning the generalization properties of deep networks and their relation to the geometry of the optimization residual.

https://arxiv.org/pdf/1710.11386v1.pdf PARAMETRIZING FILTERS OF A CNN WITH A GAN

In this work, we provided a tool allowing to extract transformations w.r.t. which a CNN has been trained to be invariant to, in such a way that these transformations can be both visualized in the image space, and potentially re-used in other computational structures, since they are parametrized by a generator. The generator has been shown to extract a smooth hidden structure lying behind the discrete set of possible filters. It is the first time that a method is proposed to extract the symmetries learned by a CNN in an explicit, parametrized manner.

https://papers.nips.cc/paper/6931-deep-sets.pdf Deep Sets

In contrast to traditional approach of operating on fixed dimensional vectors, we consider objective functions defined on sets that are invariant to permutations. We also derive the necessary and sufficient conditions for permutation equivariance in deep models. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and outlier detection.

https://arxiv.org/pdf/1801.02144v1.pdf Covariant Compositional Networks For Learning Graphs

Most existing neural networks for learning graphs address permutation invariance by conceiving of the network as a message passing scheme, where each node sums the feature vectors coming from its neighbors. We argue that this imposes a limitation on their representation power, and instead propose a new general architecture for representing objects consisting of a hierarchy of parts, which we call Covariant Compositional Networks (CCNs). Here, covariance means that the activation of each neuron must transform in a specific way under permutations, similarly to steerability in CNNs. We achieve covariance by making each activation transform according to a tensor representation of the permutation group, and derive the corresponding tensor aggregation rules that each neuron must implement. Experiments show that CCNs can outperform competing methods on standard graph learning benchmarks.

https://mathematical-coffees.github.io/slides/mc11-oyallon.pdf

https://link.springer.com/article/10.1007%2Fs11263-017-1001-2 Learning Image Representations Tied to Egomotion from Unlabeled Video

https://arxiv.org/pdf/1802.08219v2.pdf Tensor Field Networks: Rotation- and Translation-Equivariant Neural Networks for 3D Point Clouds

https://arxiv.org/abs/1803.02839 The emergent algebraic structure of RNNs and embeddings in NLP

We conclude that words naturally embed themselves in a Lie group and that RNNs form a nonlinear representation of the group. Appealing to these results, we propose a novel class of recurrent-like neural networks and a word embedding scheme.

https://arxiv.org/abs/1803.03234v1 Improving Optimization in Models With Continuous Symmetry Breaking

We use tools from gauge theory in physics to design an optimization algorithm that solves the slow convergence problem. Our algorithm leads to a fast decay of Goldstone modes, to orders of magnitude faster convergence, and to more interpretable representations, as we show for dynamic extensions of matrix factorization and word embedding models.

We identified a slow convergence problem in representation learning models with a continuous symmetry and a Markovian time series prior, and we solved the problem with a new optimization algorithm, Goldstone-GD. The algorithm separates the minimization in the symmetry subspace from the remaining coordinate directions. Our experiments showed that Goldstone-GD converges orders of magnitude faster and fits more interpretable embedding vectors, which can be compared across the time dimension of a model. We believe that continuous symmetries are common in representation learning and can guide model and algorithm design.

https://arxiv.org/pdf/1803.02879v1.pdf Deep Models of Interactions Across Sets

https://arxiv.org/abs/1803.10743v1 Intertwiners between Induced Representations (with Applications to the Theory of Equivariant Neural Networks)

In this paper we present a general mathematical framework for G-CNNs on homogeneous spaces like Euclidean space or the sphere. We show, using elementary methods, that the layers of an equivariant network are convolutional if and only if the input and output feature spaces transform according to an induced representation. This result, which follows from G.W. Mackey's abstract theory on induced representations, establishes G-CNNs as a universal class of equivariant network architectures, and generalizes the important recent work of Kondor & Trivedi on the intertwiners between regular representations.

https://arxiv.org/abs/1801.10130v3 Spherical CNNs

Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. However, a number of problems of recent interest have created a demand for models that can analyze spherical images. Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective. In this paper we introduce the building blocks for constructing spherical CNNs. We propose a definition for the spherical cross-correlation that is both expressive and rotation-equivariant. The spherical correlation satisfies a generalized Fourier theorem, which allows us to compute it efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs applied to 3D model recognition and atomization energy regression.

https://arxiv.org/abs/1804.04458 CubeNet: Equivariance to 3D Rotation and Translation

We introduce a Group Convolutional Neural Network with linear equivariance to translations and right angle rotations in three dimensions. We call this network CubeNet, reflecting its cube-like symmetry. By construction, this network helps preserve a 3D shape's global and local signature, as it is transformed through successive layers.

Another perspective on our approach is to think of it as global average pooling over rotations, where we expose a new ‘rotation-dimension.’ Without adhering to a defined group, it would be challenging to disentangle or orient a feature space (at any one layer, or across multiple layers) with respect to such a rotation dimension.
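The idea of average pooling over an explicit rotation dimension can be illustrated in a simplified setting. The sketch below is a 2D analogue using the four right-angle rotations of a square image, not the 3D construction of CubeNet; the filter `filt` and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
filt = rng.normal(size=(5, 5))       # hypothetical fixed filter with no rotational symmetry

def response(image):
    """A single filter response; on its own it changes when the image is rotated."""
    return float(np.sum(image * filt))

def rotation_pooled(image):
    """Average the response over the four right-angle rotations of the image."""
    # Stack the responses along an explicit rotation dimension of size 4.
    per_rotation = np.array([response(np.rot90(image, k)) for k in range(4)])
    return per_rotation.mean()

image = rng.normal(size=(5, 5))
rotated = np.rot90(image)            # the same image rotated by 90 degrees

print(np.isclose(response(image), response(rotated)))
print(np.isclose(rotation_pooled(image), rotation_pooled(rotated)))
```

The pooled value is invariant because rotating the input only reorders the four entries along the rotation dimension, and the mean does not depend on their order. This relies on the four rotations forming a group.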

https://arxiv.org/abs/1803.00502v3 PIP Distance: A Unitary-invariant Metric for Understanding Functionality and Dimensionality of Vector Embeddings

With tools from perturbation and stability theory, we provide an upper bound on the PIP loss using the signal spectrum and noise variance, both of which can be readily inferred from data. Our framework sheds light on many empirical phenomena, including the existence of an optimal dimension, and the robustness of embeddings against over-parametrization. The bias-variance tradeoff of PIP loss explicitly answers the fundamental open problem of dimensionality selection for vector embeddings. https://github.com/aaaasssddf/PIP-experiments

In this paper, we introduce a mathematically sound theory for vector embeddings, from a stability point of view. Our theory answers some open questions, in particular: 1. What is an appropriate metric for comparing different vector embeddings? 2. How to select dimensionality for vector embeddings? 3. Why do people choose different dimensionalities that all work well in practice? We present a theoretical analysis for embeddings starting from first principles. We first propose a novel objective, the Pairwise Inner Product (PIP) loss. The PIP loss is closely related to the functionality differences between the embeddings, and a small PIP loss means the two embeddings are close for all practical purposes. We then develop matrix perturbation tools that quantify the objective, for embeddings explicitly or implicitly obtained from matrix factorization. Practical, data-driven upper bounds will also be given. Finally, we conduct extensive empirical studies and validate our theory on real datasets. With this theory, we provide answers to three open questions about vector embeddings, namely the robustness to over-parametrization, forward stability, and dimensionality selection.
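To make the unitary invariance concrete, the sketch below assumes the PIP loss between two embedding matrices is the Frobenius norm of the difference of their pairwise inner product matrices, $\lVert E_1 E_1^\top - E_2 E_2^\top \rVert_F$. The function name `pip_loss` and the random data are illustrative only.

```python
import numpy as np

def pip_loss(E1, E2):
    """Frobenius norm of the difference between the two pairwise inner product matrices."""
    return np.linalg.norm(E1 @ E1.T - E2 @ E2.T, ord="fro")

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 10))                    # 50 items embedded in 10 dimensions

# A random orthogonal matrix, obtained from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
E_rot = E @ Q                                    # the same embedding in a rotated basis

E_noisy = E + 0.1 * rng.normal(size=E.shape)     # a genuinely different embedding

print(pip_loss(E, E_rot))                        # numerically zero
print(np.isclose(pip_loss(E, E_noisy), pip_loss(E_rot, E_noisy)))
```

Because $(EQ)(EQ)^\top = E E^\top$ for any orthogonal $Q$, two embeddings that differ only by a rotation or reflection are at distance zero, and their distances to any third embedding coincide.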

https://arxiv.org/abs/1802.03690v1 On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups

Convolutional neural networks have been extremely successful in the image recognition domain because they ensure equivariance to translations. There have been many recent attempts to generalize this framework to other domains, including graphs and data lying on manifolds. In this paper we give a rigorous, theoretical treatment of convolution and equivariance in neural networks with respect to not just translations, but the action of any compact group. Our main result is to prove that (given some natural constraints) convolutional structure is not just a sufficient, but also a necessary condition for equivariance to the action of a compact group. Our exposition makes use of concepts from representation theory and noncommutative harmonic analysis and derives new generalized convolution formulae.
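The simplest instance of the equivariance discussed here is translation on a periodic 1D signal: shifting the input and then convolving gives the same result as convolving and then shifting. The sketch below checks this numerically; the function `circular_conv` and the example signal are illustrative, and the example covers only the cyclic translation group, not general compact groups.

```python
import numpy as np

def circular_conv(signal, kernel):
    """Circular convolution: out[i] = sum_j kernel[j] * signal[(i - j) mod n]."""
    n = len(signal)
    return np.array([
        sum(kernel[j] * signal[(i - j) % n] for j in range(len(kernel)))
        for i in range(n)
    ])

rng = np.random.default_rng(0)
signal = rng.normal(size=12)
kernel = rng.normal(size=3)
shift = 4

# Shift the input first, then convolve.
conv_of_shifted = circular_conv(np.roll(signal, shift), kernel)
# Convolve first, then shift the output.
shifted_conv = np.roll(circular_conv(signal, kernel), shift)

print(np.allclose(conv_of_shifted, shifted_conv))
```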

https://arxiv.org/pdf/1805.06595.pdf Covariance-Insured Screening

Existing screening methods, which typically ignore correlation information, are likely to miss signals that are only weakly associated with outcomes. By incorporating the inter-feature dependence, we propose a covariance-insured screening methodology to identify predictors that are jointly informative but only marginally weakly associated with outcomes.

http://www.uvm.edu/~cdanfort/courses/237/schmidt-lipson-2009.pdf Distilling Free-Form Natural Laws from Experimental Data

https://arxiv.org/abs/1805.12491 Structure from noise: Mental errors yield abstract representations of events

https://arxiv.org/abs/1807.04689 Explorations in Homeomorphic Variational Auto-Encoding

In this paper we investigate the use of manifold-valued latent variables. Specifically, we focus on the important case of continuously differentiable symmetry groups (Lie groups), such as the group of 3D rotations SO(3). We show how a VAE with SO(3)-valued latent variables can be constructed, by extending the reparameterization trick to compact connected Lie groups. Our experiments show that choosing manifold-valued latent variables that match the topology of the latent data manifold, is crucial to preserve the topological structure and learn a well-behaved latent space.
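One way to picture a reparameterized sample on SO(3) is to draw Gaussian noise in the Lie algebra, push it through the exponential map, and compose it with a mean rotation. The sketch below follows that recipe using Rodrigues' formula. It is a minimal illustration under these assumptions, not the paper's implementation; the names `hat`, `exp_so3`, and `sample_rotation` are hypothetical.

```python
import numpy as np

def hat(v):
    """Map a 3-vector to the corresponding skew-symmetric matrix in so(3)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def exp_so3(v):
    """Exponential map from so(3) to SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(v)
    K = hat(v)
    if theta < 1e-8:
        return np.eye(3) + K         # first-order approximation near the identity
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def sample_rotation(R_mean, sigma, rng):
    """Reparameterized sample: noise is drawn in the Lie algebra, then mapped to the group."""
    eps = rng.normal(size=3)         # parameter-free noise
    return R_mean @ exp_so3(sigma * eps)

rng = np.random.default_rng(0)
R_mean = exp_so3(np.array([0.3, -0.2, 0.5]))     # an arbitrary mean rotation
R = sample_rotation(R_mean, 0.5, rng)

print(np.allclose(R.T @ R, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))
```

The sample stays on the manifold by construction: the product of two rotation matrices is again a rotation matrix, so no projection step is needed.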

https://arxiv.org/abs/1808.05563 Learning Invariances using the Marginal Likelihood

We argue that invariances should instead be incorporated in the model structure, and learned using the marginal likelihood, which correctly rewards the reduced complexity of invariant models.

https://arxiv.org/abs/1706.01350 Emergence of Invariance and Disentanglement in Deep Representations

We propose regularizing the loss by bounding such a term in two equivalent ways: One with a Kullback-Leibler term, which relates to a PAC-Bayes perspective; the other using the information in the weights as a measure of complexity of a learned model, yielding a novel Information Bottleneck for the weights. Finally, we show that invariance and independence of the components of the representation learned by the network are bounded above and below by the information in the weights, and therefore are implicitly optimized during training. The theory enables us to quantify and predict sharp phase transitions between underfitting and overfitting of random labels when using our regularized loss, which we verify in experiments, and sheds light on the relation between the geometry of the loss function, invariance properties of the learned representation, and generalization error.