https://arxiv.org/abs/1611.02252v1 Hierarchical compositional feature learning

https://arxiv.org/pdf/1505.05401.pdf

https://arxiv.org/abs/1605.06444 Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes

In artificial neural networks, learning from data is a computationally demanding task in which a large number of connection weights are iteratively tuned through stochastic-gradient-based heuristic processes over a cost-function. It is not well understood how learning occurs in these systems, in particular how they avoid getting trapped in configurations with poor computational performance. Here we study the difficult case of networks with discrete weights, where the optimization landscape is very rough even for simple architectures, and provide theoretical and numerical evidence of the existence of rare—but extremely dense and accessible—regions of configurations in the network weight space. We define a novel measure, which we call the robust ensemble (RE), which suppresses trapping by isolated configurations and amplifies the role of these dense regions. We analytically compute the RE in some exactly solvable models, and also provide a general algorithmic scheme which is straightforward to implement: define a cost-function given by a sum of a finite number of replicas of the original cost-function, with a constraint centering the replicas around a driving assignment. To illustrate this, we derive several powerful new algorithms, ranging from Markov Chains to message passing to gradient descent processes, where the algorithms target the robust dense states, resulting in substantial improvements in performance. The weak dependence on the number of precision bits of the weights leads us to conjecture that very similar reasoning applies to more conventional neural networks. Analogous algorithmic schemes can also be applied to other optimization problems.
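
The replicated cost described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the quadratic coupling term, the strength `gamma`, and the stand-in `toy_cost` are all assumptions chosen for readability.

```python
from typing import Callable, List

Vector = List[float]


def replicated_cost(cost: Callable[[Vector], float],
                    replicas: List[Vector],
                    center: Vector,
                    gamma: float) -> float:
    """Sum of per-replica costs plus a penalty tying each replica to the center.

    Assumption: the centering constraint is modeled as a quadratic penalty
    with strength gamma; the abstract does not fix a particular distance.
    """
    total = 0.0
    for w in replicas:
        distance_sq = sum((wi - ci) ** 2 for wi, ci in zip(w, center))
        total += cost(w) + 0.5 * gamma * distance_sq
    return total


def toy_cost(w: Vector) -> float:
    """Hypothetical stand-in for the original cost function."""
    return sum(wi ** 2 for wi in w)


# Three replicas placed around a driving assignment at the origin.
replicas = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
center = [0.0, 0.0]
print(replicated_cost(toy_cost, replicas, center, gamma=2.0))
```

With `gamma = 0` the replicas decouple and the value is just the sum of the individual costs; increasing `gamma` pulls the replicas together, which is what favors wide, dense regions over isolated minima.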

https://arxiv.org/pdf/cs/0212002v4.pdf

https://arxiv.org/abs/1509.05753 Subdominant Dense Clusters Allow for Simple Learning and High Computational Performance in Neural Networks with Discrete Synapses

https://arxiv.org/pdf/1711.08141.pdf Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions

https://openreview.net/forum?id=B1IDRdeCW The High-Dimensional Geometry of Binary Neural Networks

https://arxiv.org/abs/1803.03004v1 Learning Effective Binary Visual Representations with Deep Networks

This paper proposes Approximately Binary Clamping (ABC), which is non-saturating, end-to-end trainable, with fast convergence and can output true binary visual representations. ABC achieves comparable accuracy in ImageNet classification as its real-valued counterpart, and even generalizes better in object detection. On benchmark image retrieval datasets, ABC also outperforms existing hashing methods.

https://arxiv.org/abs/1803.07125v2

LBPNet uses local binary comparisons and random projection in place of conventional convolution (or approximation of convolution) operations.

We have built a convolution-free, end-to-end, and bitwise LBPNet from scratch for deep learning and verified its effectiveness on MNIST, SVHN, and CIFAR-10 with orders of magnitude speedup (hundred times) in testing and model size reduction (thousand times), when compared with the baseline and the binarized CNNs. The improvement in both size and speed is achieved due to our convolution-free design with logic bitwise operations that are learned directly from scratch.
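
To make "local binary comparisons" concrete, here is a sketch of the classic 8-neighbor local binary pattern code for one pixel. It is not LBPNet itself: LBPNet learns its sampling positions and adds random projection, neither of which is shown. The neighbor ordering and the `>=` comparison are assumptions.

```python
from typing import List

# Clockwise 3x3 neighborhood starting at the top-left pixel (an assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image: List[List[int]], row: int, col: int) -> int:
    """Return an 8-bit code; bit k is set when neighbor k >= the center pixel."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit  # only comparisons and bit operations, no multiplies
    return code


patch = [[9, 1, 9],
         [1, 5, 1],
         [9, 1, 9]]
print(lbp_code(patch, 1, 1))
```

The whole descriptor is built from comparisons and bit shifts, which is where the speed and size savings over multiply-accumulate convolutions come from.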

https://arxiv.org/abs/1711.06597v1 Deep Local Binary Patterns

https://arxiv.org/abs/1608.06049v2 Local Binary Convolutional Neural Networks

https://github.com/cair/TsetlinMachine The Tsetlin Machine - A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic

https://arxiv.org/pdf/1805.04908.pdf On the Practical Computational Power of Finite Precision RNNs for Language Recognition

In particular, we show that the LSTM and the Elman-RNN with ReLU activation are strictly stronger than the RNN with a squashing activation and the GRU. This is achieved because LSTMs and ReLU-RNNs can easily implement counting behavior. We show empirically that the LSTM does indeed learn to effectively use the counting mechanism.
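
The counting argument can be illustrated with hand-set weights. In this sketch (an assumption for illustration, not a trained network) each hidden unit has recurrent weight 1 and receives a one-hot input, so a ReLU unit accumulates an exact count while a tanh unit with the same weights saturates below 1.

```python
import math
from typing import List


def relu(x: float) -> float:
    return max(0.0, x)


def run_rnn(text: str, activation) -> List[float]:
    """Two hidden units: unit 0 is driven by 'a', unit 1 by 'b'.

    Update rule: h_t = activation(h_{t-1} + x_t), with x_t one-hot.
    """
    hidden = [0.0, 0.0]
    for symbol in text:
        x = [1.0 if symbol == "a" else 0.0,
             1.0 if symbol == "b" else 0.0]
        hidden = [activation(h + xi) for h, xi in zip(hidden, x)]
    return hidden


counts = run_rnn("aaabbb", relu)         # ReLU state holds the symbol counts
squashed = run_rnn("aaabbb", math.tanh)  # tanh state is bounded and loses the count
print(counts, squashed)
```

Comparing the two ReLU units at the end of the input is enough to check that a string has as many `a` symbols as `b` symbols; the squashed state cannot support that comparison for long inputs at finite precision.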

https://arxiv.org/abs/1806.07550v1 Binary Ensemble Neural Network: More Bits per Network or More Networks per Bit?

While ensemble techniques have been broadly believed to be only marginally helpful for strong classifiers such as deep neural networks, our analyses and experiments show that they are naturally a perfect fit to boost BNNs. We find that our BENN, which is faster and much more robust than state-of-the-art binary networks, can even surpass the accuracy of the full-precision floating number network with the same architecture.

https://arxiv.org/abs/1809.03368v1 Probabilistic Binary Neural Networks

Low bit-width weights and activations are an effective way of combating the increasing need for both memory and compute power of Deep Neural Networks. In this work, we present a probabilistic training method for Neural Network with both binary weights and activations, called BLRNet. By embracing stochasticity during training, we circumvent the need to approximate the gradient of non-differentiable functions such as sign(), while still obtaining a fully Binary Neural Network at test time. Moreover, it allows for anytime ensemble predictions for improved performance and uncertainty estimates by sampling from the weight distribution. Since all operations in a layer of the BLRNet operate on random variables, we introduce stochastic versions of Batch Normalization and max pooling, which transfer well to a deterministic network at test time. We evaluate the BLRNet on multiple standardized benchmarks.
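
A rough sketch of the idea of treating binary weights as random variables: each weight is +1 with probability `p` and -1 otherwise, so a layer's pre-activation has a closed-form mean and variance, and sampling the weights yields one member of an ensemble. The function names and the single-neuron setting are assumptions; the paper's stochastic batch normalization and max pooling are not shown.

```python
import random
from typing import List, Tuple


def preactivation_moments(inputs: List[float],
                          probs: List[float]) -> Tuple[float, float]:
    """Mean and variance of sum_i w_i * x_i when w_i = +1 with probability probs[i]."""
    weight_means = [2.0 * p - 1.0 for p in probs]
    mean = sum(x * m for x, m in zip(inputs, weight_means))
    # Var(w_i) = 1 - E[w_i]^2 because w_i^2 is always 1.
    variance = sum(x * x * (1.0 - m * m) for x, m in zip(inputs, weight_means))
    return mean, variance


def sample_binary_weights(probs: List[float], rng: random.Random) -> List[int]:
    """Draw one fully binary weight vector from the weight distribution."""
    return [1 if rng.random() < p else -1 for p in probs]


inputs = [1.0, -1.0, 1.0]
probs = [1.0, 0.0, 0.5]  # hypothetical learned probabilities
print(preactivation_moments(inputs, probs))
print(sample_binary_weights(probs, random.Random(0)))
```

Drawing several weight vectors and averaging their predictions gives the "anytime ensemble" behavior the abstract mentions; a single draw is an ordinary binary network.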

https://arxiv.org/abs/1809.04547 Using the Tsetlin Machine to Learn Human-Interpretable Rules for High-Accuracy Text Categorization with Medical Applications

In all brevity, we represent the terms of a text as propositional variables. From these, we capture categories using simple propositional formulae, such as: if “rash” and “reaction” and “penicillin” then Allergy. The Tsetlin Machine learns these formulae from a labelled text, utilizing conjunctive clauses to represent the particular facets of each category. Indeed, even the absence of terms (negated features) can be used for categorization purposes. Our empirical results are quite conclusive. The Tsetlin Machine either performs on par with or outperforms all of the evaluated methods on both the 20 Newsgroups and IMDb datasets, as well as on a non-public clinical dataset. On average, the Tsetlin Machine delivers the best recall and precision scores across the datasets. The GPU implementation of the Tsetlin Machine is further 8 times faster than the GPU implementation of the neural network.
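
The rule format described above can be shown with a small sketch that only evaluates a hand-written conjunctive clause over the terms of a text. The learning step (the Tsetlin automata that pick the literals) is not shown, and the helper name `clause_matches` and the example sentences are assumptions.

```python
from typing import FrozenSet, Set


def clause_matches(terms: Set[str],
                   required: FrozenSet[str],
                   forbidden: FrozenSet[str] = frozenset()) -> bool:
    """Conjunction of literals: every required term present, no forbidden term present."""
    return required <= terms and not (forbidden & terms)


# The rule from the text: if "rash" and "reaction" and "penicillin" then Allergy.
allergy_required = frozenset({"rash", "reaction", "penicillin"})
# A negated feature (absence of a term) added purely for illustration.
allergy_forbidden = frozenset({"no"})

note_a = set("patient developed a rash as a reaction to penicillin".split())
note_b = set("no rash or reaction to penicillin".split())

print(clause_matches(note_a, allergy_required, allergy_forbidden))
print(clause_matches(note_b, allergy_required, allergy_forbidden))
```

Because each clause is a plain conjunction of present and absent terms, the learned model can be read directly as a set of if-then rules.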

https://arxiv.org/pdf/1809.09244.pdf No Multiplication? No Floating Point? No Problem! Training Networks for Efficient Inference

First, we train deep networks that emit only a predefined, static number of discretized values. Despite reducing the number of values that can be emitted from 2^32 to only 32, there is little to no degradation in network performance across a variety of tasks. Compared to existing approaches for discretization, our approach is both conceptually and programmatically simple and has no stochastic component. Second, we provide a method to constrain the network’s weights to a small number of unique values (typically 100-1000) by employing a periodic adaptive clustering step during training.
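
A minimal sketch of the first idea, restricting a unit's output to 32 values. The uniform spacing of the levels on the range [-1, 1] is an assumption made for illustration; the paper may place its levels differently, and the weight-clustering step is not shown.

```python
def quantize_index(value: float, levels: int = 32,
                   low: float = -1.0, high: float = 1.0) -> int:
    """Map a real activation to the index of the nearest of `levels` fixed values."""
    clipped = min(max(value, low), high)
    step = (high - low) / (levels - 1)
    return round((clipped - low) / step)


def dequantize(index: int, levels: int = 32,
               low: float = -1.0, high: float = 1.0) -> float:
    """Recover the discrete activation value for a given index."""
    step = (high - low) / (levels - 1)
    return low + index * step


# Sweep many inputs and collect the distinct indices that can be emitted.
emitted = {quantize_index(-2.0 + 4.0 * i / 1000) for i in range(1001)}
print(len(emitted), dequantize(quantize_index(0.3)))
```

Since every activation is one of 32 indices (5 bits), downstream multiplications can in principle be replaced by table lookups, which is the motivation for inference without floating point.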