# Similarity Operator

**Aliases** Projection, Inner Product

**Intent**

A generalization of an operator that computes the similarity between a Model and a Feature.

**Motivation**

How do we calculate the similarity between the model and input? Features found in practice may require different kinds of measures to determine similarity.

**Sketch**

Similarity: $ \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R} $

<Diagram>

**Discussion**

In its most general sense, similarity is a measure of equivalence between two objects. For vectors, it is commonly computed as the inner product. For distributions, it can be described through the KL divergence between the two distributions, where a smaller divergence means greater similarity. There are many kinds of similarity measures; they are documented in a survey [Cha 2007], which classifies similarity functions into eight families.
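As a minimal sketch of the measures named above, the following computes an inner product, its length-normalized variant (cosine similarity), and a discrete KL divergence. The function names and sample values are illustrative assumptions, not part of the pattern itself.

```python
import math

def inner_product(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Inner product divided by both vector lengths; lies in [-1, 1]."""
    return inner_product(x, y) / (
        math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    )

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; 0 when p == q, larger means less similar."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(inner_product(x, y))      # 28.0
print(cosine_similarity(x, y))  # parallel vectors, so very close to 1

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
```

Note that the KL divergence is asymmetric and grows as the distributions differ, so it behaves as a dissimilarity rather than a similarity score.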

Similarities are also tightly related to hashing functions. Hash algorithms can be classified into several families: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization.

In its most general sense, a neuron can be thought of as a similarity function between its input and its parameters, whose result is fed through an activation function. The conventional neuron computes an inner product between the input vector and an internal weight vector. With randomly initialized weights, a layer of such neurons is equivalent to projecting the inputs through a random matrix of weight vectors.
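The view of a neuron as "activation applied to a similarity" can be sketched as a function with a pluggable similarity. The names `neuron`, `dot`, and `neg_sq_distance`, and the choice of a distance-based alternative, are assumptions made for illustration.

```python
import math

def dot(x, w):
    """Conventional similarity: inner product of input and weights."""
    return sum(a * b for a, b in zip(x, w))

def neg_sq_distance(x, w):
    """Alternative similarity: negative squared Euclidean distance (0 is most similar)."""
    return -sum((a - b) ** 2 for a, b in zip(x, w))

def relu(s):
    return max(0.0, s)

def neuron(x, w, similarity=dot, activation=relu):
    """Generalized neuron: activation(similarity(input, parameters))."""
    return activation(similarity(x, w))

x = [1.0, -2.0, 0.5]
w = [0.5, 0.25, 2.0]
print(neuron(x, w))  # conventional neuron: relu(x . w)
print(neuron(x, w, similarity=neg_sq_distance, activation=math.exp))  # RBF-style unit
```

Swapping the similarity while keeping the rest of the unit fixed is the design point here: the conventional neuron is just one choice among many.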

Convolution can be considered a generalization of the correlation operation. Convolution is equivalent to correlation when the kernel is symmetric.
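A small 1-D sketch can make the symmetric-kernel case concrete. It assumes "valid" (no padding) sliding windows; the helper names and the sample signal are illustrative.

```python
def correlate(signal, kernel):
    """'Valid' cross-correlation: slide the kernel over the signal without flipping it."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def convolve(signal, kernel):
    """'Valid' convolution: the same sliding sum, but with the kernel reversed."""
    return correlate(signal, kernel[::-1])

signal = [1, 2, 3, 4, 5]
symmetric = [1, 2, 1]
asymmetric = [1, 0, -1]

print(convolve(signal, symmetric) == correlate(signal, symmetric))    # True
print(convolve(signal, asymmetric) == correlate(signal, asymmetric))  # False
```

Reversing a symmetric kernel leaves it unchanged, which is why the two operations coincide in that case.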

Shannon's entropy is closely tied to a similarity measure: it equals $\log n$ minus the KL divergence between the observed distribution and the uniform ("random") distribution over $n$ outcomes.
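One way to make the link between entropy and KL divergence concrete, assuming the "random distribution" means the uniform distribution $u$ over $n$ outcomes, is the identity $H(p) = \log n - KL(p \| u)$. The sketch below checks it numerically on a made-up distribution.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]            # illustrative observed distribution
n = len(p)
uniform = [1.0 / n] * n

lhs = entropy(p)
rhs = math.log(n) - kl_divergence(p, uniform)
print(abs(lhs - rhs) < 1e-9)   # True
```

Under this reading, entropy is largest exactly when the observed distribution is closest to uniform.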

Fisher's Information Matrix (FIM) is a multi-dimensional generalization of the similarity measure. The metric resides in a non-Euclidean space.

Does the metric have to map to 1-dimensional space?

Does the metric have to be Euclidean?

What are the minimal characteristics for a metric?

Are neural embeddings favorable if they preserve a similarity measure?

**Known Uses**

**Related Patterns**

Pattern is related to the following Canonical Patterns:

- Irreversibility and Merge form the essential mechanisms of any DL system.
- Entropy is a global similarity measure that drives the evolution of the aggregate system. The local effect of a similarity operator is neutral(?) to entropy.
- Distance Measure generalizes the many ways we can define similarity beyond the vector dot product.
- Random Projections shows how a collection of similarity operators can lead to a mapping that is able to preserve distance.
- Clustering is a generalization of how space can be partitioned and at its core requires a heuristic for determining similarity.
- Geometry provides a framework for understanding information spaces.
- Random Orthogonal Initialization is a beneficial initialization that leads to good projections and clustering.
- Dissipative Adaptation, where the energy absorption is equivalent to similarity matching.
- Adversarial Features are a consequence of the use of a linear similarity measure.
- Anti-causality expresses the direction of predictability that is a consequence of performing a similarity measure.
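The Random Projections item in the list above can be illustrated with a rough sketch: a stack of random inner-product units, scaled by $1/\sqrt{k}$, approximately preserves the distance between two points. The dimensions, the Gaussian entries, and the fixed seed are assumptions chosen only for the example.

```python
import math
import random

def random_projection_matrix(k, n, rng):
    """k x n matrix with N(0, 1/k) entries; each row acts as one inner-product unit."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]

def project(matrix, x):
    """Apply every row (similarity operator) to the input vector."""
    return [sum(r * xi for r, xi in zip(row, x)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(0)   # fixed seed, only so the sketch is repeatable
n, k = 1000, 200         # hypothetical input and projected dimensions
a = [rng.gauss(0.0, 1.0) for _ in range(n)]
b = [rng.gauss(0.0, 1.0) for _ in range(n)]
R = random_projection_matrix(k, n, rng)

ratio = distance(project(R, a), project(R, b)) / distance(a, b)
print(ratio)  # expected to be near 1; the spread shrinks as k grows
```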

Pattern is cited in:

**References**

See Sung-Hyuk Cha, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” International Journal of Mathematical Models and Methods in Applied Sciences, Volume 1, Issue 4, 2007, pp. 300-307 for a survey. The author identifies 45 PDF distance functions and classifies them into eight families: Lp Minkowski, L1, intersection, inner product, fidelity (squared chord), squared L2 (χ2), Shannon’s entropy, and combinations.

http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446

http://elki.dbs.ifi.lmu.de/wiki/DistanceFunctions

http://tech.knime.org/wiki/distance-measure-developers-guide

http://turing.cs.washington.edu/papers/uai11-poon.pdf Sum-Product Networks: A New Deep Architecture

http://arxiv.org/pdf/1606.00185v1.pdf

A Survey on Learning to Hash

Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this paper, we present a comprehensive survey of the learning to hash algorithms, and categorize them according to the manners of preserving the similarities into: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, as well as quantization, and discuss their relations.

http://psl.umiacs.umd.edu/files/broecheler-uai10.pdf Probabilistic Similarity Logic

http://arxiv.org/pdf/1606.06086v1.pdf Uncertainty in Neural Network Word Embedding Exploration of Threshold for Similarity

http://arxiv.org/abs/1306.6709v4 A Survey on Metric Learning for Feature Vectors and Structured Data

https://arxiv.org/pdf/1602.01321.pdf A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks

http://openreview.net/pdf?id=r17RD2oxe DEEP NEURAL NETWORKS AND THE TREE OF LIFE

By applying the inner product similarity of the activation vectors at the last fully connected layer for different species, we can roughly build their tree of life. Our work provides a new perspective to the deep representation and sheds light on possible novel applications of deep representation to other areas like Bioinformatics.

http://www.skytree.net/2015/09/04/learning-with-similarity-search

Mercer kernels are essentially a generalization of the inner-product for any kind of data — they are symmetric though self-similarity may not be the maximum. They are quite popular in machine learning and Mercer kernels have been defined for text, graphs, time series, images.
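The remark that self-similarity need not be the maximum can be seen with the plain linear kernel, which is itself a Mercer kernel. The vectors below are made up for illustration.

```python
def linear_kernel(x, y):
    """The inner product itself, the simplest Mercer kernel."""
    return sum(a * b for a, b in zip(x, y))

x = [1.0, 0.0]
y = [3.0, 0.0]

print(linear_kernel(x, y) == linear_kernel(y, x))  # True: the kernel is symmetric
print(linear_kernel(x, x), linear_kernel(x, y))    # 1.0 3.0: x scores higher against y than against itself
```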

https://arxiv.org/abs/1702.05870 Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks

To bound dot product and decrease the variance, we propose to use cosine similarity instead of dot product in neural networks, which we call cosine normalization. Our experiments show that cosine normalization in fully-connected neural networks notably reduces the test err with lower divergence, compared to other normalization techniques. Applied to convolutional networks, cosine normalization also significantly enhances the accuracy of classification and accelerates the training.
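A minimal sketch of the idea in this abstract: replace the dot-product pre-activation with the cosine between input and weights, which stays bounded regardless of input scale. This is not the paper's implementation; the function name `cosine_pre_activation` and the `eps` guard are assumptions.

```python
import math

def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

def cosine_pre_activation(x, w, eps=1e-12):
    """(w . x) / (|w| |x|), bounded in [-1, 1]; eps guards against zero-length vectors."""
    norm = math.sqrt(dot(x, x)) * math.sqrt(dot(w, w))
    return dot(x, w) / max(norm, eps)

w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
x_scaled = [100.0 * v for v in x]

print(dot(x, w), dot(x_scaled, w))  # the dot product grows with the input scale
print(cosine_pre_activation(x, w), cosine_pre_activation(x_scaled, w))  # unchanged by scale
```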

https://arxiv.org/abs/1708.00138 The differential geometry of perceptual similarity

Human similarity judgments are inconsistent with Euclidean, Hamming, Mahalanobis, and the majority of measures used in the extensive literatures on similarity and dissimilarity. From intrinsic properties of brain circuitry, we derive principles of perceptual metrics, showing their conformance to Riemannian geometry. As a demonstration of their utility, the perceptual metrics are shown to outperform JPEG compression. Unlike machine-learning approaches, the outperformance uses no statistics, and no learning. Beyond the incidental application to compression, the metrics offer broad explanatory accounts of empirical perceptual findings such as Tversky's triangle inequality violations, contradictory human judgments of identical stimuli such as speech sounds, and a broad range of other phenomena on percepts and concepts that may initially appear unrelated. The findings constitute a set of fundamental principles underlying perceptual similarity.

https://arxiv.org/abs/1410.5792v1 Generalized Compression Dictionary Distance as Universal Similarity Measure

https://arxiv.org/abs/1804.08071v1 Decoupled Networks

we first reparametrize the inner product to a decoupled form and then generalize it to the decoupled convolution operator which serves as the building block of our decoupled networks. We present several effective instances of the decoupled convolution operator. Each decoupled operator is well motivated and has an intuitive geometric interpretation. Based on these decoupled operators, we further propose to directly learn the operator from data.

Decoupling the intra-class and interclass variation gives us the flexibility to design better models that are more suitable for a given task.

https://arxiv.org/pdf/1804.09458v1.pdf Dynamic Few-Shot Visual Learning without Forgetting

we propose a novel attention based few-shot classification weight generator as well as a cosine-similarity based ConvNet classifier. This allows to recognize in a unified way both novel and base categories and also leads to learn feature representations with better generalization capabilities

https://arxiv.org/abs/1712.07136 Low-Shot Learning with Imprinted Weights

by directly setting the final layer weights from novel training examples during low-shot learning. We call this process weight imprinting as it directly sets weights for a new category based on an appropriately scaled copy of the embedding layer activations for that training example.
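Weight imprinting as described in this excerpt can be sketched as appending a scaled copy of a new example's embedding to the classifier's weight rows. Here "appropriately scaled" is assumed to mean L2-normalized, scores are assumed to be cosine similarities, and all vectors are made up.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def imprint(weights, embedding):
    """Return the weights with one new class row: the L2-normalized embedding."""
    return weights + [l2_normalize(embedding)]

def scores(weights, embedding):
    """Cosine score per class, assuming every weight row is unit length."""
    e = l2_normalize(embedding)
    return [sum(w * x for w, x in zip(row, e)) for row in weights]

# Two hypothetical base-class weight rows, already unit length.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
novel_example = [0.1, 0.2, 2.0]       # made-up embedding of one new-category example
weights = imprint(weights, novel_example)

s = scores(weights, [0.0, 0.3, 1.9])  # a query close to the novel example
print(s.index(max(s)))                # 2, the imprinted class
```

No gradient step is involved: the new class becomes recognizable purely because its weight row is similar to embeddings of that class.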

https://arxiv.org/abs/1805.06576 A Spline Theory of Deep Networks (Extended Version)

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal with each other; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up new geometric avenue to study how DNs organize signals in a hierarchical fashion. To validate the utility of the VQ interpretation, we develop and validate a new distance metric for signals and images that quantifies the difference between their VQ encodings. (This paper is a significantly expanded version of a paper with the same title that will appear at ICML 2018.)

Orthogonality penalty: a term that penalizes non-zero off-diagonal entries in the matrix, yielding a new loss with an extra penalty term.
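A rough sketch of such a penalty, assuming "the matrix" refers to the matrix of pairwise inner products between templates (their Gram matrix). The squared-sum form, the function names, and the weight `lam` are illustrative assumptions rather than the paper's exact formulation.

```python
def gram(templates):
    """Matrix of pairwise inner products between template rows."""
    return [[sum(a * b for a, b in zip(u, v)) for v in templates] for u in templates]

def orthogonality_penalty(templates):
    """Sum of squared off-diagonal Gram entries; zero when templates are mutually orthogonal."""
    g = gram(templates)
    size = len(g)
    return sum(g[i][j] ** 2 for i in range(size) for j in range(size) if i != j)

def penalized_loss(base_loss, templates, lam=0.1):
    """Base loss plus the weighted penalty; lam is a hypothetical trade-off weight."""
    return base_loss + lam * orthogonality_penalty(templates)

orthogonal = [[1.0, 0.0], [0.0, 2.0]]
correlated = [[1.0, 1.0], [1.0, 0.0]]
print(orthogonality_penalty(orthogonal))  # 0.0
print(orthogonality_penalty(correlated))  # 2.0
```

Only the off-diagonal entries are penalized, so the templates are pushed apart without constraining their individual lengths.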

https://arxiv.org/abs/1807.02873v1 Separability is not the best goal for machine learning