**This is an old revision of the document!**

# Unitary Model

**Aliases**

*This identifies the pattern and should be representative of the concept that it describes. The name should be a noun that should be easily usable within a sentence. We would like the pattern to be easily referenceable in conversation between practitioners.
*

**Intent**

*Describes in a single concise sentence the meaning of the pattern.
*

**Motivation**

* This section describes the reason why this pattern is needed in practice. Other pattern languages indicate this as the Problem. In our pattern language, we express this in a question or several questions and then we provide further explanation behind the question.*

**Sketch**

*This section provides alternative descriptions of the pattern in the form of an illustration or alternative formal expression. By looking at the sketch a reader may quickly understand the essence of the pattern.
*
**Discussion**

*This is the main section of the pattern that goes in greater detail to explain the pattern. We leverage a vocabulary that we describe in the theory section of this book. We don’t go into intense detail into providing proofs but rather reference the sources of the proofs. How the motivation is addressed is expounded upon in this section. We also include additional questions that may be interesting topics for future research.*

**Known Uses**

*Here we review several projects or papers that have used this pattern.*

**Related Patterns**
*
In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be as precise or may be fuzzy, so we provide further explanation into the nature of the relationship. We also describe other patterns may not be conceptually related but work well in combination with this pattern.*

*Relationship to Canonical Patterns*

*Relationship to other Patterns*

**Further Reading**

*We provide here some additional external material that will help in exploring this pattern in more detail.*

**References**

*To aid in reading, we include sources that are referenced in the text in the pattern.*

http://arxiv.org/abs/1503.03438 A mathematical motivation for complex-valued convolutional networks

A complex-valued convolutional network (convnet) implements the repeated application of the following composition of three operations, recursively applying the composition to an input vector of nonnegative real numbers: (1) convolution with complex-valued vectors followed by (2) taking the absolute value of every entry of the resulting vectors followed by (3) local averaging. For processing real-valued random vectors, complex-valued convnets can be viewed as “data-driven multiscale windowed power spectra,” “data-driven multiscale windowed absolute spectra,” “data-driven multiwavelet absolute values,” or (in their most general configuration) “data-driven nonlinear multiwavelet packets.”

http://arxiv.org/pdf/1511.06464v4.pdf Unitary Evolution Recurrent Neural Networks

Recurrent neural networks (RNNs) are notoriously difficult to train. When the eigenvalues of the hidden to hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well studied issue of vanishing and exploding gradients, especially when trying to learn long-term dependencies. To circumvent this problem, we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1. The challenge we address is that of parametrizing unitary matrices in a way that does not require expensive computations (such as eigendecomposition) after each weight update. We construct an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned. Optimization with this parameterization becomes feasible only when considering hidden states in the complex domain. We demonstrate the potential of this architecture by achieving state of the art results in several hard tasks involving very long-term dependencies.

From 2sigma:

The problem of vanishing and exploding gradients occurs when the magnitude of the eigenvalues of weight matrices deviate from 1. Therefore, the authors use weight matrices that are unitary to guarantee that the eigenvalues have magnitude 1. The challenge with this constraint is to ensure that the matrices remain unitary when updating them during training without performing excessive computations. Their strategy is to decompose each unitary weight matrix into the product of several simple unitary matrices. The resulting parameterization makes it possible to learn the weights efficiently while providing sufficient expressiveness. They demonstrate state of the art performance on standard benchmark problems such as the copy and addition tasks. An additional benefit of their approach is that it is relatively insensitive to parameter initialization, since unitary matrices preserve norms.

https://arxiv.org/pdf/1602.03032.pdf Associative Long Short-Term Memory

The system has an associative memory based on complex-valued vectors and is closely related to Holographic Reduced Representations and Long Short-Term Memory networks.

http://arxiv.org/abs/1306.5532v2 Deep Learning by Scattering

Scattering defines a contractive representation of a high-dimensional probability distribution, which preserves its mean-square norm.

http://arxiv.org/abs/1602.06662v1 Orthogonal RNNs and Long-Memory Tasks

These constructions furthermore explain the success of recent methods that specify unitary initializations or constraints on the transition matrices.

http://arxiv.org/pdf/1203.1513.pdf Invariant Scattering Convolution Networks

A wavelet scattering network computes a translation invariant image representation, which is stable to deformations and preserves high frequency information for classification.

http://openreview.net/pdf?id=ry_4vpixl Rotation Plane Doubly Orthogonal Recurrent Neural Networks

Recurrent Neural Networks (RNNs) applied to long sequences suffer from the well known vanishing and exploding gradients problem. The recently proposed Unitary Evolution Recurrent Neural Network (uRNN) alleviates the exploding gradient problem and can learn very long dependencies, but its nonlinearities make it still affected by the vanishing gradient problem and so learning can break down for extremely long dependencies. We propose a new RNN transition architecture where the hidden state is updated multiplicatively by a time invariant orthogonal transformation followed by an input modulated orthogonal transformation. There are no additive interactions and so our architecture exactly preserves forward hid-den state activation norm and backwards gradient norm for all time steps, and is provably not affected by vanishing or exploding gradients. We propose using the rotation plane parameterization to represent the orthogonal matrices. We validate our model on a simplified memory copy task and see that our model can learn dependencies as long as 5,000 timesteps.

https://arxiv.org/abs/1611.00035v1 Full-Capacity Unitary Recurrent Neural Networks

Unitary recurrent neural networks (uRNNs), which use unitary recurrence matrices, have recently been proposed as a means to avoid these issues. However, in previous experiments, the recurrence matrices were restricted to be a product of parameterized unitary matrices, and an open question remains: when does such a parameterization fail to represent all unitary matrices, and how does this restricted representational capacity limit what can be learned? To address this question, we propose full-capacity uRNNs that optimize their recurrence matrix over all unitary matrices, leading to significantly improved performance over uRNNs that use a restricted-capacity recurrence matrix. Our contribution consists of two main components. First, we provide a theoretical argument to determine if a unitary parameterization has restricted capacity. Using this argument, we show that a recently proposed unitary parameterization has restricted capacity for hidden state dimension greater than 7. Second, we show how a complete, full-capacity unitary recurrence matrix can be optimized over the differentiable manifold of unitary matrices. The resulting multiplicative gradient step is very simple and does not require gradient clipping or learning rate adaptation.

https://www.youtube.com/watch?v=sAX2_l4Iv9E

https://arxiv.org/abs/1612.09199v1 Quantum Clustering and Gaussian Mixtures

The mixture of Gaussian distributions, a soft version of k-means , is considered a state-of-the-art clustering algorithm. It is widely used in computer vision for selecting classes, e.g., color, texture, and shapes. In this algorithm, each class is described by a Gaussian distribution, defined by its mean and covariance. The data is described by a weighted sum of these Gaussian distributions. We propose a new method, inspired by quantum interference in physics. Instead of modeling each class distribution directly, we model a class wave function such that its magnitude square is the class Gaussian distribution. We then mix the class wave functions to create the mixture wave function. The final mixture distribution is then the magnitude square of the mixture wave function. As a result, we observe the quantum class interference phenomena, not present in the Gaussian mixture model. We show that the quantum method outperforms the Gaussian mixture method in every aspect of the estimations. It provides more accurate estimations of all distribution parameters, with much less fluctuations, and it is also more robust to data deformations from the Gaussian assumptions. We illustrate our method for color segmentation as an example application.

https://arxiv.org/abs/1612.05231v1 Tunable Efficient Unitary Neural Networks (EUNN) and their application to RNN

We present a method for implementing an Efficient Unitary Neural Network (EUNN) whose computational complexity is merely (1) per parameter and has full tunability, from spanning part of unitary space to all of it. We apply the EUNN in Recurrent Neural Networks, and test its performance on the standard copying task and the MNIST digit recognition benchmark, finding that it significantly outperforms a non-unitary RNN, an LSTM network, an exclusively partial space URNN and a projective URNN with comparable parameter numbers.

https://arxiv.org/abs/1510.06664 Random Projections through multiple optical scattering: Approximating kernels at the speed of light

Random projections have proven extremely useful in many signal processing and machine learning applications. However, they often require either to store a very large random matrix, or to use a different, structured matrix to reduce the computational and memory costs. Here, we overcome this difficulty by proposing an analog, optical device, that performs the random projections literally at the speed of light without having to store any matrix in memory. This is achieved using the physical properties of multiple coherent scattering of coherent light in random media. We use this device on a simple task of classification with a kernel machine, and we show that, on the MNIST database, the experimental results closely match the theoretical performance of the corresponding kernel. This framework can help make kernel methods practical for applications that have large training sets and/or require real-time prediction. We discuss possible extensions of the method in terms of a class of kernels, speed, memory consumption and different problems.

https://openreview.net/pdf?id=rJQKYt5ll STEERABLE CNNS

The mathematical theory of steerable representations reveals a type system in which any steerable representation is a composition of elementary feature types, each one associated with a particular kind of symmetry. We show how the parameter cost of a steerable filter bank depends on the types of the input and output features, and show how to use this knowledge to construct CNNs that utilize parameters effectively.

https://arxiv.org/pdf/1612.05695v2.pdf Reinforcement Learning Using Quantum Boltzmann Machines

https://www.semanticscholar.org/paper/Space-Time-Local-Embeddings-Sun-Wang/1845c18f921a7e449fd98d15a26121bd04e79c14 Space Time Local Embeddings

https://arxiv.org/abs/1612.00188v4 Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections

s. We first show that constraining the transition matrix to be unitary is a special case of an orthogonal constraint. Therefore, it may not be necessary to work with complex-valued matrices. Then we present a new parametrisation of the transition matrix which allows efficient training of an RNN while ensuring that the matrix is always orthogonal.

https://arxiv.org/pdf/1703.06642v1.pdf Towards a Quantum World Wide Web

‘context and interference (quantum) effects’ are required to explain the probabilities calculated by counting the relative number of documents containing certain words and co-ocurrrences of words.

https://arxiv.org/abs/1603.08788 An Optimal Design for Universal Multiport Interferometers

Universal multiport interferometers, which can be programmed to implement any linear transformation between multiple channels, are emerging as a powerful tool for both classical and quantum photonics.

https://arxiv.org/abs/1704.02146 Quantum ensembles of quantum classifiers

https://arxiv.org/abs/1611.08350v2 Learning an Invariant Hilbert Space for Domain Adaptation

This paper introduces a learning scheme to construct a Hilbert space (i.e., a vector space along its inner product) to address both unsupervised and semi-supervised domain adaptation problems. This is achieved by learning projections from each domain to a latent space along the Mahalanobis metric of the latent space to simultaneously minimizing a notion of domain variance while maximizing a measure of discriminatory power. In particular, we make use of the Riemannian optimization techniques to match statistical properties (e.g., first and second order statistics) between samples projected into the latent space from different domains. Upon availability of class labels, we further deem samples sharing the same label to form more compact clusters while pulling away samples coming from different classes.We extensively evaluate and contrast our proposal against state-of-the-art methods for the task of visual domain adaptation using both handcrafted and deep-net features. Our experiments show that even with a simple nearest neighbor classifier, the proposed method can outperform several state-of-the-art methods benefitting from more involved classification schemes.

https://arxiv.org/pdf/1705.09792v1.pdf Deep Complex Networks

Recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms.

In this work, we provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks. More precisely, we rely on complex convolutions and present algorithms for complex batch-normalization, complex weight initialization strategies for complex-valued neural nets and we use them in experiments with end-to-end training schemes. We demonstrate that such complex-valued models are able to achieve comparable or better performance than their real-valued counterparts.

https://arxiv.org/abs/1705.10142v2 Kronecker Recurrent Units

We present a flexible recurrent neural network model called Kronecker Recurrent Units (KRU). KRU achieves parameter efficiency in RNNs through a Kronecker factored recurrent matrix. It overcomes the ill-conditioning of the recurrent matrix by enforcing soft unitary constraints on the factors. Thanks to the small dimensionality of the factors, maintaining these constraints is computationally efficient. Our experimental results on five standard data-sets reveal that KRU can reduce the number of parameters by three orders of magnitude in the recurrent weight matrix compared to the existing recurrent models, without trading the statistical performance.

https://arxiv.org/pdf/1705.10461v1.pdf The Numerics of GANs

Using the formalism of smooth two-player games we analyze the associated gradient vector field of GAN training objectives. Our findings suggest that the convergence of current algorithms suffers due to two factors: i) presence of eigenvalues of the Jacobian of the gradient vector field with zero real-part, and ii) eigenvalues with big imaginary part. Using these findings, we design a new algorithm that overcomes some of these limitations and has better convergence properties.

https://arxiv.org/pdf/1312.6115.pdf Neuronal Synchrony in Complex-Valued Deep Networks

https://arxiv.org/pdf/1712.07811v1.pdf Multi-dimensional Graph Fourier Transform

https://arxiv.org/abs/1802.08245v1 Arbitrarily Substantial Number Representation for Complex Number

Researchers are often perplexed when their machine learning algorithms are required to deal with complex number. Various strategies are commonly employed to project complex number into real number, although it is frequently sacrificing the information contained in the complex number. This paper proposes a new method and four techniques to represent complex number as real number, without having to sacrifice the information contained. The proposed techniques are also capable of retrieving the original complex number from the representing real number, with little to none of information loss. The promising applicability of the proposed techniques has been demonstrated and worth to receive further exploration in representing the complex number.

https://arxiv.org/abs/1802.08235v1 Vector Field Based Neural Networks

The data points are interpreted as particles moving along a flow defined by the vector field which intuitively represents the desired movement to enable classification. The architecture moves the data points from their original configuration to anew one following the streamlines of the vector field with the objective of achieving a final configuration where classes are separable. An optimization problem is solved through gradient descent to learn this vector field.

https://arxiv.org/abs/1803.04386v2 Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches

We introduce flipout, an efficient method for decorrelating the gradients within a mini-batch by implicitly sampling pseudo-independent weight perturbations for each example. Empirically, flipout achieves the ideal linear variance reduction for fully connected networks, convolutional networks, and RNNs. We find significant speedups in training neural networks with multiplicative Gaussian perturbations.

https://eng.uber.com/differentiable-plasticity/

https://arxiv.org/abs/1711.01297v1 Implicit Weight Uncertainty in Neural Networks