
https://arxiv.org/abs/1703.04140v1 Multiscale Hierarchical Convolutional Networks

Deep neural network algorithms are difficult to analyze because they lack a structure that makes it possible to understand the properties of the underlying transforms and invariants. Multiscale hierarchical convolutional networks are structured deep convolutional networks whose layers are indexed by progressively higher-dimensional attributes, which are learned from training data. Each new layer is computed with multidimensional convolutions along spatial and attribute variables. We introduce an efficient implementation of such networks in which the dimensionality is progressively reduced by averaging intermediate layers along attribute indices. Hierarchical networks are tested on CIFAR image databases, where they obtain accuracy comparable to state-of-the-art networks with far fewer parameters. We study some properties of the attributes learned from these databases.

Multiscale hierarchical convolutional networks provide a mathematical framework for studying the invariants computed by deep neural networks. Layers are parameterized in progressively higher-dimensional spaces of hierarchical attributes, which are learned from training data. All network operators are multidimensional convolutions along attribute indices, so invariants can be computed by summation along these attributes. https://github.com/jhjacobsen/HierarchicalCNN
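The core mechanics can be illustrated with a minimal NumPy sketch, not the authors' implementation: a layer `x[alpha, i, j]` indexed by an attribute `alpha` plus spatial coordinates is filtered with a joint convolution along attribute and space, and an invariant is then obtained by averaging along the attribute index. Treating the attribute axis as circular is an illustrative assumption made here so the attribute-shift invariance is exact.

```python
import numpy as np

def hconv(x, kernel):
    """Multidimensional convolution of a layer x[alpha, i, j] with a
    kernel k[da, di, dj]: circular along the attribute axis (an
    illustrative assumption) and 'valid' along the spatial axes."""
    A, H, W = x.shape
    dA, dH, dW = kernel.shape
    out = np.zeros((A, H - dH + 1, W - dW + 1))
    for da in range(dA):
        # shift the attribute axis circularly, then slide spatially
        shifted = np.roll(x, -da, axis=0)
        for di in range(dH):
            for dj in range(dW):
                out += kernel[da, di, dj] * shifted[
                    :, di:di + out.shape[1], dj:dj + out.shape[2]]
    return out

def attribute_average(x):
    """Reduce dimensionality by averaging along the attribute index;
    the result is invariant to circular shifts of the attribute."""
    return x.mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 16))   # 8 attribute values, 16x16 spatial grid
k = rng.normal(size=(3, 3, 3))     # joint attribute/spatial kernel
y = hconv(x, k)                    # shape (8, 14, 14)
inv = attribute_average(y)         # shape (14, 14), attribute-shift invariant
```

Because the convolution is equivariant to attribute shifts, rolling the input along the attribute axis only rolls the output, and the attribute average is unchanged; this is the sense in which summing along attributes yields invariants.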

https://github.com/kevinzakka/research-paper-notes/blob/master/snn.md Self-Normalizing Neural Networks

The authors introduce self-normalizing neural networks (SNNs), whose layer activations automatically converge towards zero mean and unit variance and are robust to noise and perturbations. Significance: removes the need for finicky batch normalization and permits training deeper networks with a robust training scheme.
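The mechanism is the SELU activation, a scaled ELU whose constants are derived in the paper so that (mean 0, variance 1) is an attractive fixed point of the layer-to-layer statistics. A small NumPy demonstration (not the paper's code) propagates deliberately off-target activations through random layers with normalized weights and watches the statistics drift toward (0, 1):

```python
import numpy as np

# SELU constants from Klambauer et al. (2017), chosen so that the
# fixed point of the activation statistics is mean 0, variance 1.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """Scaled exponential linear unit."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

n = 1024
rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=n)   # start far from zero mean / unit variance
for _ in range(20):
    # weights drawn N(0, 1/n), the initialization the paper assumes
    w = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    x = selu(w @ x)

print(x.mean(), x.var())           # both close to (0, 1) after 20 layers
```

No normalization layer appears anywhere in the loop: the activation function alone pulls the statistics back, which is the property that replaces batch normalization.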

https://arxiv.org/abs/1707.00762 Multiscale sequence modeling with a learned dictionary

Instead of predicting one symbol at a time, our multiscale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair encoding (BPE) compression algorithm is used to learn the dictionary of tokens that the model is trained with. When applied to language modeling, our model has the flexibility of character-level models while maintaining many of the performance benefits of word-level models. Our experiments show that this model performs better than a regular LSTM on language modeling tasks, especially for smaller models.
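The dictionary-learning step rests on the classic BPE merge rule: repeatedly fuse the most frequent adjacent pair of symbols into a new multi-character token. A minimal sketch of that base algorithm (the paper uses a variation of it; the toy corpus below is invented for illustration):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words: at each step, count all
    adjacent symbol pairs and fuse the most frequent pair into one token."""
    vocab = Counter(tuple(word) for word in corpus)   # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair
        merges.append(best)
        fused = best[0] + best[1]
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):                      # rewrite word with merge
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(fused)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, vocab = learn_bpe(corpus, 4)
print(merges)   # learned pair merges, most frequent first
```

Each learned merge produces a progressively longer token, which is what gives the model multi-symbol prediction targets at several scales while keeping a character-level fallback.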