Design Patterns for Deep Learning Architectures
Note to reader: Diving into this material can be a bit overwhelming. One way to get a feel for the thought process behind it is to follow the Intuition Machine blog: https://medium.com/intuitionmachine
Deep Learning Architecture can be described as a new method or style of building machine learning systems. Deep Learning is more than likely to lead to more advanced forms of artificial intelligence. The evidence for this is in the sheer number of breakthroughs that have occurred since the beginning of this decade. There is a newfound optimism in the air, and we are once again in a new AI spring. Unfortunately, the current state of deep learning appears in many ways to be akin to alchemy. Everybody seems to have their own black-magic methods of designing architectures. The field thus needs to move forward and strive towards chemistry, or perhaps even a periodic table for deep learning. Although deep learning is still in its infancy, this book strives towards some kind of unification of the ideas in deep learning. It leverages a method of description called pattern languages.
Pattern Languages are languages derived from entities called patterns that when combined form solutions to complex problems. Each pattern describes a problem and offers alternative solutions. Pattern languages are a way of expressing complex solutions that were derived from experience. The benefit of an improved language of expression is that other practitioners are able to gain a much better understanding of the complex subject as well as a better way of expressing a solution to problems.
In the majority of computer science literature, the phrase “design patterns” is used rather than “pattern language”. We purposely use “pattern language” to reflect that Deep Learning is a nascent, but rapidly evolving, field that is not as mature as other topics in computer science. Some of the patterns we describe may turn out not to be patterns at all, but rather something fundamental. We are never certain which are truly fundamental, and only further exploration and elucidation can bring about a common consensus in the field. Perhaps in the future, a true design patterns book will arise as a reflection of the maturity of this field.
Pattern Languages were originally promoted by Christopher Alexander to describe the architecture of buildings and towns. These ideas were later adopted by object-oriented programming (OOP) practitioners to describe the design of OOP programs. The seminal book “Design Patterns” by the GoF is evidence of the effectiveness of this approach. Pattern languages were extended further into other domains such as user interfaces, interaction design, enterprise integration, SOA and scalability design.
In the domain of machine learning (ML) there is an emerging practice called “Deep Learning”. In ML there are many new terms that one encounters, such as Artificial Neural Networks (ANN), Random Forests, Support Vector Machines (SVM) and Non-negative Matrix Factorization (NMF). These, however, usually refer to a specific kind of machine learning algorithm. Deep Learning (DL), in contrast, is not really one kind of algorithm; rather, it is a whole class of algorithms that tend to exhibit similar characteristics. DL systems are ANNs constructed with multiple layers (sometimes called Multi-layer Perceptrons). The idea is not entirely new, since it was first proposed back in the 1960s. However, interest in the domain has exploded with the help of advancing computational technology (i.e. GPUs) and bigger training data sources. Since 2011, DL systems have been exhibiting impressive results in the field of machine learning.
The confusion with DL arises when one realizes that there are actually many algorithms, not just a single kind. We find the conventional Feedforward Networks, also known as Fully Connected Networks (FCN), Convolutional Networks (ConvNet), Recurrent Neural Networks (RNN) and the less commonly used Restricted Boltzmann Machines (RBM). They all share a common trait: these networks are constructed using a hierarchy of layers. One common pattern, for example, is the employment of differentiable layers; this constraint on the construction of DL systems leads to an incremental way of evolving the machine into something that learns classification. Many patterns have been discovered recently, and it would be fruitful for practitioners to have at their disposal a compilation of these patterns.
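The differentiable-layers pattern can be sketched in a few lines of code. The sketch below is purely illustrative: the layer sizes, activation choices, learning rate and toy XOR data are assumptions, not anything prescribed by the text. What it shows is the essential point that because every layer is differentiable, gradients flow through the whole hierarchy and the machine can be evolved incrementally by gradient descent.

```python
import numpy as np

# Toy task and stacked differentiable layers (all choices are illustrative).
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # XOR targets

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)              # differentiable hidden layer
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))  # differentiable output layer
    return h, p

losses = []
for step in range(5000):
    h, p = forward(X)
    losses.append(float(np.mean((p - y) ** 2)))
    # Backpropagation: the chain rule applied through each layer in turn.
    dp = 2 * (p - y) / len(X)
    dz2 = dp * p * (1 - p)
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = dz2 @ W2.T
    dz1 = dh * (1 - h ** 2)
    dW1 = X.T @ dz1; db1 = dz1.sum(0)
    # Incremental evolution of the machine: small gradient-descent steps.
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= 0.5 * grad

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Note that nothing in the training loop depends on what the layers compute, only on their being differentiable; this is exactly why the pattern composes so freely across FCNs, ConvNets and RNNs.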
Pattern languages are an ideal vehicle for describing and understanding Deep Learning. One would like to believe that Deep Learning has a solid fundamental foundation based on advanced mathematics. Most academic research papers will conjure up highfalutin math such as path integrals, tensors, Hilbert spaces, and measure theory, but don't let the math distract you from the reality that our collective understanding remains minimal. Mathematics, you see, has its inherent limitations. Physical scientists have known this for centuries. We formulate theories in such a way that the expressions are mathematically convenient. Mathematical convenience means that the expressions we work with can be conveniently manipulated into other expressions. The Gaussian distribution, for example, is ubiquitous not because it is some magical construct that reality has gifted to us; it is ubiquitous because it is mathematically convenient.
Pattern languages have been leveraged in many fuzzy domains. The original pattern language revolved around the discussion of architecture (i.e. buildings and towns). There are pattern languages that focus on user interfaces, on usability, on interaction design and on software process. None of these have concise mathematical underpinnings, yet we extract real value from them. In fact, the specification of a pattern language is not too far off from the creation of a new algebra or a category theory in mathematics. Algebras are strictly consistent, but they are purely abstract and need not have any connection with reality. Pattern languages, however, are connected with reality, though their consistency rules are more relaxed than those of an algebra. In our attempt to understand the complex world of machine learning (or learning in general) we cannot always leapfrog into mathematics. The reality may be such that our current mathematics is woefully incapable of describing what is happening.
An additional confusion that many machine learning practitioners encounter when first presented with this idea of 'patterns' is that they mistakenly associate it with the usual use of the word 'pattern' in their own field. Machine learning involves the development of algorithms that perform pattern recognition, so when you google deep learning with patterns, you will find literature that covers the subject of pattern recognition. This book is not about pattern recognition in the conventional machine learning sense.
This chapter covers the motivations for the book (why Deep Learning?) and how the book is structured. The central theme of this book is that by understanding the many patterns we find in Deep Learning practice and their inter-relationships, we begin to understand how we can best compose solutions.
Pattern Languages are languages derived from entities called patterns that when combined form solutions to complex problems. Each pattern describes a problem and offers solutions. Pattern languages are a way of expressing complex solutions that were derived from experience such that others can gain a better understanding of the solution. This chapter explains the concept of Pattern Languages and how the Patterns in this book are structured.
This chapter covers some foundational mathematics that will be essential in understanding the framework. It provides some common terminology and notation that will be used throughout the book. The book does not cover introductory mathematics such as linear algebra or probability; that is already well covered in the “Deep Learning” book. However, the book will propose a mathematical framework that serves as a guide on how to reason about Neural Networks. This framework builds on ideas from category theory, dynamical systems, information theory, information geometry and game theory.
This chapter derives inspiration from other software development methodologies, such as agile development and lean methodology, and applies them in the Deep Learning space. Deep Learning is a new kind of architecture where the creation of a learning machine is performed similarly to software development. However, DL is different enough in that the system is able to develop itself. There is enough complexity that it becomes necessary to overlay a kind of structure to help guide practitioners in the practice.
This chapter is the recommended prerequisite to reading the other patterns chapters. Here we discuss patterns that do appear fundamental and form a foundation for understanding the other DL patterns.
This chapter covers various kinds of models found in practice.
This chapter covers collections of models and their behavior.
Previous model chapters explored the training of universal functions. In this chapter we explore how memory can be integrated to build even more powerful solutions.
This chapter covers different ways you can represent input and hidden data.
This chapter covers iterative learning methods found in practice.
This chapter covers more advanced methods of combining multiple neural networks to solve problems beyond classification.
This chapter covers different ways that a network can provide results and feedback to a user.
This chapter covers operational patterns found when neural networks are deployed in the field.
Audience and Coverage
The audience of this pattern language are readers who have prior exposure to Artificial Neural Networks (ANN). An introduction to ANNs and college-level mathematics are not covered. For a good introduction, one should read the comprehensive text “Deep Learning” by Ian Goodfellow, Yoshua Bengio and Aaron Courville. This patterns book also has a narrower coverage than “Deep Learning”: it focuses more on techniques that have worked well in practice. As an example, we don't cover Restricted Boltzmann Machines despite their historical importance. We also ignore several topics covered in that book, such as Structured Probabilistic Models and Monte Carlo Methods, despite their importance in other machine learning methods. Classical statistics offers an important perspective; however, we believe these classical concepts don't translate well to domains of high dimensionality. We do hope that this text is sufficiently focused and compact that readers will come away with a solid perspective on how to apply this emerging technology.
We will also strictly avoid any discussion of biological plausibility. That topic is out of the scope of this book, and we are also of the opinion that such discussions are moot considering that 60 years have elapsed since the original Perceptron proposal. The model proposed over half a century ago is at best a cartoonish depiction of how an actual biological neuron might work. The flailing, hand-wavy arguments from biological inspiration found in much of the neural network literature are, frankly, a waste of paper.
The typical approach to presenting artificial neural networks (ANN) or deep learning (DL) is to take a historical perspective, one that begins with the 1957 perceptron proposal. There is some insight to be gained by studying history and how ideas evolve over time. The history of research in ANN and DL is indeed interesting and does explain why DL became a hot topic in the last few years; however, the insight you get from studying history can only minimally improve one's understanding of this complex subject.
We take a very simple logical approach to DL that makes very few assumptions. The main assumption we make is that we have a learning machine that is a dynamical system with the goal of finding the best fitting model given the data it observes. So we only care about how we define that goal, which we define in terms of the entropy of the system: a learning system minimizes the relative entropy between what it observes and its own internal model. What we know from centuries of physics is that we can describe a dynamical system by its measured energy, and we have equations defined with respect to this energy that govern the evolution of the system. In an analogous way, we take the relative entropy and define equations of evolution (aka learning) for the machine's internal model. If the equations are unconstrained, the model can evolve in many different ways. We therefore throw in a bunch of constraints so that the machine is not only able to learn within reasonable time, but is also able to learn abstractions in its internal model. These abstractions are necessary to achieve generalization and allow a machine to accurately make predictions on observations it has never previously encountered.
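The idea of learning as relative-entropy minimization can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than the book's framework itself: the observed distribution `p` is a toy categorical, the internal model is a softmax over unconstrained parameters, and the update rule is plain gradient descent on the KL divergence (relative entropy) between observation and model.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # what the machine observes (toy distribution)
theta = np.zeros(3)             # unconstrained parameters of the internal model

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def kl(p, q):
    # Relative entropy KL(p || q) between observation and internal model.
    return float(np.sum(p * np.log(p / q)))

history = []
for step in range(500):
    q = softmax(theta)
    history.append(kl(p, q))
    # "Equation of evolution": the gradient of KL(p || q) with respect to
    # the softmax logits is simply (q - p), so learning is a flow that
    # drives the internal model toward the observed distribution.
    theta -= 0.5 * (q - p)

print(f"KL: {history[0]:.4f} -> {history[-1]:.2e}")
```

In this miniature setting the relative entropy flows monotonically toward zero; the constraints the text alludes to (architectural choices, regularization) would enter as restrictions on how `theta` is parameterized, shaping which of the many possible evolutions the model actually takes.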
This book does not cover the subject of performance optimization. There are other texts that explore faster algorithms as well as distributed algorithms. Ideally one would use the latest frameworks that already support optimized GPU algorithms as well as different options for parallel computation. TensorFlow, for example, already supports the latest cuDNN library from Nvidia. Furthermore, TensorFlow supports distributed computation across multiple computational nodes. There are alternative implementations of distributed computation supported by other frameworks; however, this is out of the scope of this book.
© 2016 Copyright - Carlos E. Perez http://www.linkedin.com/in/ceperez contact: ceperez at intuitionmachine dot com