
**Name**
Weight Quantization (aka Binarization)

**Intent**

Reduce memory requirements by using weights of lower precision.
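A minimal sketch of the intent, assuming simple symmetric per-tensor 8-bit quantization. This is one common scheme among many, and the helper names are hypothetical. Float32 weights are stored as int8 codes plus a single float scale, which cuts weight storage by a factor of four at the cost of a bounded rounding error.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(w.nbytes // q.nbytes)               # 4: one byte per weight instead of four
print(float(np.max(np.abs(w - w_hat))))   # rounding error, at most about scale / 2
```

Binarization is the extreme end of the same trade-off: one bit per weight instead of eight.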

**Motivation**

**Structure**

<Diagram>

**Discussion**

**Known Uses**

**Related Patterns**

<Diagram>

**References**

http://arxiv.org/pdf/1606.00185v1.pdf A Survey on Learning to Hash

The quantization approach can be derived from the distance-distance difference minimization criterion.

https://arxiv.org/abs/1606.01981 Deep neural networks are robust to weight binarization and other non-linear distortions

https://arxiv.org/pdf/1607.06450v1.pdf Layer Normalization

One way to reduce the training time is to normalize the activities of the neurons. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.

Invariance properties under the normalization methods.
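A minimal NumPy sketch of the layer normalization computation described above, assuming the summed inputs of one layer are laid out as a (batch, hidden) array. The small `eps` term is a numerical-stability assumption, not a value taken from the paper. Because the statistics come from a single training case, normalizing one row alone gives the same result as normalizing it inside a batch.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Layer normalization of summed inputs `a` with shape (batch, hidden).

    Mean and variance are taken over the hidden units of each training case
    (axis=-1), so the computation is identical at training and test time.
    """
    mu = a.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(a.var(axis=-1, keepdims=True) + eps)
    # Per-neuron adaptive gain and bias, applied after normalization.
    return gain * (a - mu) / sigma + bias

rng = np.random.default_rng(0)
a = rng.normal(loc=3.0, scale=5.0, size=(4, 16))   # summed inputs for 4 cases
gain, bias = np.ones(16), np.zeros(16)
h = np.tanh(layer_norm(a, gain, bias))             # non-linearity applied afterwards
```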

http://arxiv.org/abs/1608.06902v1 Recurrent Neural Networks With Limited Numerical Precision

This paper addresses the question of how to best reduce weight precision during training in the case of RNNs. We present results from the use of different stochastic and deterministic reduced precision training methods applied to three major RNN types which are then tested on several datasets. The results show that the **weight binarization methods do not work with the RNNs**. However, the stochastic and deterministic ternarization, and pow2-ternarization methods gave rise to low-precision RNNs that produce similar and even higher accuracy on certain datasets therefore providing a path towards training more efficient implementations of RNNs in specialized hardware.
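A sketch of deterministic and stochastic ternarization, assuming weights are clipped to [-1, 1], a deterministic threshold of 0.5, and stochastic sampling that is nonzero with probability |w| (so the expected ternary weight equals the clipped weight). These definitions are assumptions in the spirit of the methods named above; the paper's exact formulations, and its pow2-ternarization variant, may differ and are not reproduced here.

```python
import numpy as np

def ternarize_deterministic(w, threshold=0.5):
    """Map weights to {-1, 0, +1}: 0 where |w| <= threshold, sign(w) otherwise."""
    w = np.clip(w, -1.0, 1.0)
    return np.where(np.abs(w) > threshold, np.sign(w), 0.0)

def ternarize_stochastic(w, rng):
    """Map weights to {-1, 0, +1} at random so that E[t] == clip(w, -1, 1)."""
    w = np.clip(w, -1.0, 1.0)
    keep = rng.random(w.shape) < np.abs(w)   # nonzero with probability |w|
    return np.where(keep, np.sign(w), 0.0)

rng = np.random.default_rng(0)
w = np.array([-0.9, -0.2, 0.0, 0.3, 0.7])
print(ternarize_deterministic(w))      # values -1, 0, 0, 0, +1
print(ternarize_stochastic(w, rng))    # a random draw; unbiased on average
```

In reduced-precision training of this kind, the ternary weights are typically used in the forward and backward passes while a full-precision copy accumulates the gradient updates.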

http://arxiv.org/abs/1602.02830v3 Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

http://arxiv.org/abs/1603.05279 XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations.

A figure in the paper illustrates the procedure explained in Section 3.2 for approximating a convolution using binary operations.
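A sketch of the core trick for a single dot product: once both vectors are constrained to {-1, +1}, the dot product reduces to XNOR plus a bit count, and real-valued scale factors (the mean absolute value of the weights and of the inputs) restore the magnitude. Packing bits into a Python int stands in for hardware popcount; the helper names are hypothetical, and the per-location input scaling that XNOR-Net uses for full convolutions is not shown.

```python
import numpy as np

def pack_signs(v):
    """Pack signs into an int: bit i is 1 where v[i] >= 0 (read as +1), else 0 (read as -1)."""
    bits = 0
    for i, x in enumerate(v):
        if x >= 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1,+1} vectors of length n.

    Positions where the bits agree (XNOR == 1) contribute +1 and the rest -1,
    so the result is 2 * popcount(XNOR(a, b)) - n.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    return 2 * bin(xnor).count("1") - n

def xnor_dot(w, x):
    """Approximate w . x by beta * alpha * (sign(x) . sign(w))."""
    alpha = np.mean(np.abs(w))   # weight scale: ||w||_1 / n
    beta = np.mean(np.abs(x))    # input scale: ||x||_1 / n
    return beta * alpha * binary_dot(pack_signs(w), pack_signs(x), len(w))

rng = np.random.default_rng(0)
w, x = rng.standard_normal(64), rng.standard_normal(64)
print(np.dot(w, x), xnor_dot(w, x))   # exact vs. binary approximation
```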

https://arxiv.org/abs/1609.00222 Ternary Neural Networks for Resource-Efficient AI Applications

We train these TNNs using a teacher-student approach. Using only ternary weights and ternary neurons, with a step activation function of two-thresholds, the student ternary network learns to mimic the behaviour of its teacher network.
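A sketch of the ternary neuron described above: ternary inputs and ternary weights produce an integer pre-activation, and a step function with two thresholds maps it back to {-1, 0, +1}. The threshold values here are arbitrary illustrations, and the teacher-student training that would produce the weights and thresholds is not shown.

```python
import numpy as np

def ternary_step(z, delta_lo, delta_hi):
    """Two-threshold step activation: -1 below delta_lo, +1 above delta_hi, 0 in between."""
    return np.where(z > delta_hi, 1, np.where(z < delta_lo, -1, 0))

def ternary_layer(x, W, delta_lo, delta_hi):
    """Ternary inputs x and ternary weights W (entries in {-1, 0, +1}) give ternary outputs."""
    return ternary_step(x @ W, delta_lo, delta_hi)

x = np.array([1, -1, 0, 1])
W = np.array([[1, 0, -1],
              [-1, 1, 0],
              [0, 1, 1],
              [1, -1, 1]])
print(ternary_layer(x, W, delta_lo=-1, delta_hi=1))   # pre-activations 3, -2, 0 -> 1, -1, 0
```

Since every quantity is in {-1, 0, +1}, the multiply-accumulate needs no real multiplications, which is what makes such networks attractive for resource-constrained hardware.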

https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/

https://arxiv.org/abs/1609.08144v1 Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

http://openreview.net/pdf?id=S1_pAu9xl Trained Ternary Quantization


In this work, we extended a large deviation analysis of the solution space of a single layer neural network from the purely binary and balanced case [14] to the general discrete case.

https://arxiv.org/abs/1604.03058v5 Binarized Neural Networks on the ImageNet Classification Task

We trained Binarized Neural Networks (BNNs) on the high-resolution ImageNet ILSVRC-2012 classification task and achieved good performance. With a moderately sized network of 13 layers, we obtained a top-5 classification accuracy of 84.1% on the validation set through network distillation, much better than the previously published results of 73.2% for XNOR-Net and 69.1% for binarized GoogLeNet.

https://arxiv.org/abs/1701.03400v2 Scaling Binarized Neural Networks on Reconfigurable Logic

The FINN framework was recently proposed for building fast and flexible field-programmable gate array (FPGA) accelerators for BNNs. FINN utilized a novel set of optimizations that enable efficient mapping of BNNs to hardware and implemented fully connected, non-padded convolutional and pooling layers, with per-layer compute resources tailored to user-provided throughput requirements.

https://arxiv.org/abs/1702.03044v1 Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

We introduce three interdependent operations, namely weight partition, group-wise quantization and re-training.
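A sketch of the three operations for one weight tensor, assuming that the partition picks the largest-magnitude weights first, that quantized values are zero or powers of two with the top exponent set from the largest weight magnitude, and that each selected weight is snapped to its nearest level. The accumulated portions (50%, 75%, 87.5%, 100%) are an example schedule, the helper names are hypothetical, and the re-training step is only marked, not implemented.

```python
import numpy as np

def pow2_levels(w, num_levels):
    """Candidate values {0, +-2^n1, ..., +-2^(n1 - num_levels + 1)} with n1 set by max |w|."""
    n1 = int(np.floor(np.log2(4.0 * np.max(np.abs(w)) / 3.0)))
    mags = 2.0 ** np.arange(n1, n1 - num_levels, -1)
    return np.concatenate([-mags, [0.0], mags])

def inq_step(w, frozen, fraction, levels):
    """Quantize more weights until `fraction` of them are frozen at power-of-two levels."""
    k = int(round(fraction * w.size)) - int(frozen.sum())   # how many more to freeze
    # Weight partition: among not-yet-frozen weights, pick the largest magnitudes.
    scores = np.where(frozen, -np.inf, np.abs(w)).ravel()
    newly = np.zeros(w.size, dtype=bool)
    newly[np.argsort(-scores)[:max(k, 0)]] = True
    newly = newly.reshape(w.shape)
    # Group-wise quantization: snap the selected group to the nearest level.
    nearest = levels[np.argmin(np.abs(w[..., None] - levels), axis=-1)]
    return np.where(newly, nearest, w), frozen | newly

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(8, 8))
levels = pow2_levels(w, num_levels=4)
frozen = np.zeros(w.shape, dtype=bool)
for fraction in (0.5, 0.75, 0.875, 1.0):   # accumulated portions of quantized weights
    w, frozen = inq_step(w, frozen, fraction, levels)
    # Re-training would update only w[~frozen] here; it is omitted in this sketch.
```

Freezing incrementally leaves the still-full-precision weights free to compensate for the quantization error introduced at each step.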

https://arxiv.org/pdf/1705.09283v2.pdf Gated XNOR Networks: Deep Neural Networks with Ternary Weights and Activations under a Unified Discretization Framework

https://github.com/codeaudit/Gated-XNOR

https://arxiv.org/abs/1708.04788 BitNet: Bit-Regularized Deep Neural Networks

https://arxiv.org/abs/1709.06662v1 Verifying Properties of Binarized Deep Neural Networks

https://arxiv.org/pdf/1710.03740.pdf Mixed Precision Training

https://arxiv.org/abs/1711.01243v1 ResBinNet: Residual Binary Neural Network

https://arxiv.org/abs/1711.02213 Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks

https://las.inf.ethz.ch/files/djolonga17learning.pdf Differentiable Learning of Submodular Models

In this paper we focus on the problem of submodular minimization, for which we show that such layers are indeed possible. The key idea is that we can continuously relax the output without sacrificing guarantees. We provide an easily computable approximation to the Jacobian complemented with a complete theoretical analysis. Finally, these contributions let us experimentally learn probabilistic log-supermodular models via a bi-level variational inference formulation.

https://openreview.net/pdf?id=B1IDRdeCW The High-Dimensional Geometry of Binary Neural Networks

https://openreview.net/pdf?id=HJGXzmspb Training and Inference with Integers in Deep Neural Networks

https://openreview.net/forum?id=S19dR9x0b Alternating Multi-bit Quantization for Recurrent Neural Networks

In this work, we address these problems by quantizing the network, both weights and activations, into multiple binary codes {-1,+1}. We formulate the quantization as an optimization problem. Based on the key observation that, once the quantization coefficients are fixed, the binary codes can be derived efficiently by a binary search tree, alternating minimization is then applied.
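A sketch of the alternating scheme for a weight vector w approximated as a sum of k scaled binary codes. It assumes a greedy initialization (binarize the residual k times), then alternates two exact sub-steps: with the coefficients fixed, each weight picks the nearest of the 2^k representable values; with the codes fixed, the coefficients are refit by least squares. A brute-force nearest-value search replaces the paper's binary search tree here; for small k it selects the same value, just less efficiently. Helper names are hypothetical.

```python
import itertools
import numpy as np

def greedy_init(w, k):
    """Greedy start: repeatedly binarize the residual (alpha = mean |r|, b = sign(r))."""
    r = w.copy()
    alphas, codes = [], []
    for _ in range(k):
        b = np.where(r >= 0, 1.0, -1.0)
        a = np.mean(np.abs(r))
        alphas.append(a)
        codes.append(b)
        r = r - a * b
    return np.array(alphas), np.stack(codes, axis=1)   # shapes (k,), (n, k)

def alternating_quantize(w, k=2, iters=10):
    """Approximate w (shape (n,)) by codes @ alphas with codes in {-1,+1}^(n x k)."""
    alphas, codes = greedy_init(w, k)
    patterns = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))   # (2^k, k)
    for _ in range(iters):
        values = patterns @ alphas                       # the 2^k representable values
        idx = np.argmin(np.abs(w[:, None] - values[None, :]), axis=1)
        codes = patterns[idx]                            # codes update, alphas fixed
        alphas, *_ = np.linalg.lstsq(codes, w, rcond=None)  # alphas update, codes fixed
    return alphas, codes

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
a0, b0 = greedy_init(w, 2)
a, b = alternating_quantize(w, k=2)
print(np.linalg.norm(w - b0 @ a0), np.linalg.norm(w - b @ a))   # greedy vs. alternating error
```

Each sub-step solves its own sub-problem exactly, so the reconstruction error never increases from one step to the next.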

https://arxiv.org/pdf/1712.05877.pdf Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

https://openreview.net/forum?id=B1IDRdeCW The High-Dimensional Geometry of Binary Neural Networks

Neural networks with binary weights and activations have similar performance to their continuous counterparts with substantially reduced execution time and power usage. We provide an experimentally verified theory for understanding how one can get away with such a massive reduction in precision based on the geometry of HD vectors. First, we show that binarization of high-dimensional vectors preserves their direction in the sense that the angle between a random vector and its binarized version is much smaller than the angle between two random vectors (Angle Preservation Property). Second, we take the perspective of the network and show that binarization approximately preserves weight-activation dot products (Dot Product Proportionality Property). More generally, when using a network compression technique, we recommend looking at the weight activation dot product histograms as a heuristic to help localize the layers that are most responsible for performance degradation. Third, we discuss the impacts of the low effective dimensionality of the data on the first layer of the network. We recommend either using continuous weights for the first layer or a Generalized Binarization Transformation. Such a transformation may be useful for architectures like LSTMs where the update for the hidden state declares a particular set of axes to be important (e.g. by taking the pointwise multiply of the forget gates with the cell state). Finally, we show that neural networks with ternary weights and activations can also be understood with our approach. More broadly speaking, our theory is useful for analyzing a variety of neural network compression techniques that transform the weights, activations or both to reduce the execution cost without degrading performance.
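A quick numerical check of the Angle Preservation Property, assuming i.i.d. Gaussian vectors as a stand-in for weights (the paper's analysis of trained networks goes further than this). In high dimensions, the angle between a vector and its sign-binarized version concentrates near arccos(sqrt(2/pi)), about 37 degrees, while two independent random vectors are nearly orthogonal.

```python
import numpy as np

def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

rng = np.random.default_rng(0)
d = 10_000
x = rng.standard_normal(d)
y = rng.standard_normal(d)

print(angle_deg(x, np.sign(x)))   # close to 37 degrees
print(angle_deg(x, y))            # close to 90 degrees
print(np.degrees(np.arccos(np.sqrt(2 / np.pi))))   # the limiting value for the first angle
```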