This identifies the pattern and should be representative of the concept that it describes. The name should be a noun that should be easily usable within a sentence. We would like the pattern to be easily referenceable in conversation between practitioners.


Describes in a single concise sentence the meaning of the pattern.


This section describes the reason why this pattern is needed in practice. Other pattern languages indicate this as the Problem. In our pattern language, we express this in a question or several questions and then we provide further explanation behind the question.


This section provides alternative descriptions of the pattern in the form of an illustration or alternative formal expression. By looking at the sketch a reader may quickly understand the essence of the pattern.


This is the main section of the pattern that goes in greater detail to explain the pattern. We leverage a vocabulary that we describe in the theory section of this book. We don’t go into intense detail into providing proofs but rather reference the sources of the proofs. How the motivation is addressed is expounded upon in this section. We also include additional questions that may be interesting topics for future research.

Known Uses

Here we review several projects or papers that have used this pattern.

Related Patterns In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be as precise or may be fuzzy, so we provide further explanation into the nature of the relationship. We also describe other patterns may not be conceptually related but work well in combination with this pattern.

Relationship to Canonical Patterns

Relationship to other Patterns

Further Reading

We provide here some additional external material that will help in exploring this pattern in more detail.


To aid in reading, we include sources that are referenced in the text in the pattern.

References An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks Learning without Forgetting Overcoming catastrophic forgetting in neural networks

The ability to learn tasks in a sequential fashion is crucial to the development of artificial intelligence. Neural networks are not, in general, capable of this and it has been widely thought that catastrophic forgetting is an inevitable feature of connectionist models. We show that it is possible to overcome this limitation and train networks that can maintain expertise on tasks which they have not experienced for a long time. Our approach remembers old tasks by selectively slowing down learning on the weights important for those tasks. We demonstrate our approach is scalable and effective by solving a set of classification tasks based on the MNIST hand written digit dataset and by learning several Atari 2600 games sequentially. Towards Making Systems Forget with Machine Unlearning

We present a general, efficient unlearning approach by transforming learning algorithms used by a system into a summation form. To forget a training data sample, our approach simply updates a small number of summations – asymptotically faster than retraining from scratch. Our approach is general, because the summation form is from the statistical query learning in which many machine learning algorithms can be implemented. Our approach also applies to all stages of machine learning, including feature selection and modeling. Our evaluation, on four diverse learning systems and real-world workloads, shows that our approach is general, effective, fast, and easy to use. Neurogenesis Deep Learning

Extending deep networks to accommodate new classes Understanding Neural Networks through Representation Erasure

While neural networks have been successfully applied to many natural language processing tasks, they come at the cost of interpretability. In this paper, we propose a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words. We present several approaches to analyzing the effects of such erasure, from computing the relative difference in evaluation metrics, to using reinforcement learning to erase the minimum set of input words in order to flip a neural model’s decision. In a comprehensive analysis of multiple NLP tasks, including linguistic feature classification, sentence-level sentiment analysis, and document level sentiment aspect prediction, we show that the proposed methodology not only offers clear explanations about neural model decisions, but also provides a way to conduct error analysis on neural models. Improved multitask learning through synaptic intelligence

We introduce a model of intelligent synapses that accumulate task relevant information over time, and exploit this information to efficiently consolidate memories of old tasks to protect them from being overwritten as new tasks are learned. We apply our framework to learning sequences of related classification problems, and show that it dramatically reduces catastrophic forgetting while maintaining computational efficiency.

we aim to endow each individual synapse with a local measure of “importance” in solving tasks the network has been trained on in the past. When training on a new task we penalize changes to important parameters to avoid old “memories” from being overwritten. To that end, we developed a class of algorithms which keep track of an importance measure ω µ k which reflects past credit for improvements of the global objective Lµ for task µ to individual synapses or parameters θk.

However, in contrast to EWC, here we are putting forward a method which allows for online computation of the importance measure, whereas EWC relies on the diagonal of the Fisher information metric at the final parameters, which has to be computed during a separate phase at the end of each task.

In these frameworks, synapses have to be thought of as multi-dimensional objects rather than simple scalar quantities. This conceptual shift from scalar-valued synapses to higher-dimensional dynamical entities which have the ability to actively influence their fate during training is a phenomenon found ubiquitously in neurobiology. Neurogenesis Deep Learning

Here, inspired by the process of adult neurogenesis in the hippocampus, we explore the potential for adding new neurons to deep layers of artificial neural networks in order to facilitate their acquisition of novel information while preserving previously trained data representations. Our results on the MNIST handwritten digit dataset and the NIST SD 19 dataset, which includes lower and upper case letters and digits, demonstrate that neurogenesis is well suited for addressing the stability-plasticity dilemma that has long challenged adaptive machine learning algorithms. Encoder Based Lifelong Learning A Strategy for an Uncompromising Incremental Learner

Multi-class supervised learning systems require the knowledge of the entire range of labels they predict. Often when learnt incrementally, they suffer from catastrophic forgetting. To avoid this, generous leeways have to be made to the philosophy of incremental learning that either forces a part of the machine to not learn, or to retrain the machine again with a selection of the historic data. While these tricks work to various degrees, they do not adhere to the spirit of incremental learning. In this article, we redefine incremental learning with stringent conditions that do not allow for any undesirable relaxations and assumptions. We design a strategy involving generative models and the distillation of dark knowledge as a means of hallucinating data along with appropriate targets from past distributions. We call this technique phantom sampling. We show that phantom sampling helps avoid catastrophic forgetting during incremental learning. Using an implementation based on deep neural networks, we demonstrate that phantom sampling dramatically avoids catastrophic forgetting. We apply these strategies to competitive multi-class incremental learning of deep neural networks. Using various benchmark datasets through our strategy, we demonstrate that strict incremental learning could be achieved. Incremental Learning Through Deep Adaptation

Built into our method is the ability to easily switch the representation between the various learned tasks, enabling a single network to perform seamlessly on various domains. We find it surprising that using combinations of existing representations yield ones which are useful for other tasks almost as training the entire network from scratch. Continual Learning with Deep Generative Replay

Inspired by the generative nature of hippocampus as a short-term memory system in primate brain, we propose the Deep Generative Replay, a novel framework with a cooperative dual model architecture consisting of a deep generative model (“generator”) and a task solving model (“solver”). With only these two models, training data for previous tasks can easily be sampled and interleaved with those for a new task. Continual Learning in Generative Adversarial Nets

In this paper, we adapt recent work in reducing catastrophic forgetting to the task of training generative adversarial networks on a sequence of distinct distributions, enabling continual generative modeling.

Experimental results demonstrate that sequential training on different sets of conditional inputs utilizing an EWCaugmented loss counteracts catastrophic forgetting of previously learned distributions. The approach is general and applicable to any setting where the observed distribution of conditional inputs (e.g., class label, partially complete sample) changes over time, or where a conditional input representing the time of data capture can be appended to the data. Gated Orthogonal Recurrent Units: On Learning to Forget

We present a novel recurrent neural network (RNN) architecture that combines the remembering ability of unitary RNNs with the ability of gated RNNs to effectively forget redundant information in the input sequence. We achieve this by extending Unitary RNNs with a gating mechanism. Gradient Episodic Memory for Continuum Learning

We propose a model to learn over continuums of data, called Gradient of Episodic Memory (GEM), which alleviates forgetting while allowing beneficial transfer of knowledge to previous tasks. FEARNET: BRAIN-INSPIRED MODEL FOR INCREMENTAL LEARNING

We proposed a brain-inspired framework capable of incrementally learning data with different modalities and object classes. FearNet outperforms existing methods for incremental class learning on large image and audio classification benchmarks, demonstrating that FearNet is capable of recalling and consolidating recently learned information while also retaining old information. Memory Aware Synapses: Learning what (not) to forget

Inspired by neuroplasticity, we propose an online method to compute the importance of the parameters of a neural network, based on the data that the network is actively applied to, in an unsupervised manner. After learning a task, whenever a sample is fed to the network, we accumulate an importance measure for each parameter of the network, based on how sensitive the predicted output is to a change in this parameter. When learning a new task, changes to important parameters are penalized. We show that a local version of our method is a direct application of Hebb's rule in identifying the important connections between neurons. Overcoming Catastrophic Forgetting with Hard Attention to the Task Memory-based Parameter Adaptation

Our method, Memory-based Parameter Adaptation, stores examples in memory and then uses a context-based lookup to directly modify the weights of a neural network. Much higher learning rates can be used for this local adaptation, reneging the need for many iterations over similar data before good predictions can be made. As our method is memory-based, it alleviates several shortcomings of neural networks, such as catastrophic forgetting, fast, stable acquisition of new knowledge, learning with an imbalanced class labels, and fast learning during evaluation. We demonstrate this on a range of supervised tasks: large-scale image classification and language modelling. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning Theory of the superposition principle for randomized connectionist representations in neural networks Low-Shot Learning with Imprinted Weights

by directly setting the final layer weights from novel training examples during low-shot learning. We call this process weight imprinting as it directly sets weights for a new category based on an appropriately scaled copy of the embedding layer activations for that training example. Gradient Episodic Memory for Continual Learning Lifelong Learning with Dynamically Expandable Networks Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies

: Variational Autoencoder with Shared Embeddings (VASE). Based on the Minimum Description Length principle, VASE automatically detects shifts in the data distribution and allocates spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting. Our approach encourages the learnt representations to be disentangled, which imparts a number of desirable properties: VASE can deal sensibly with ambiguous inputs, it can enhance its own representations through imagination-based exploration, and most importantly, it exhibits semantically meaningful sharing of latents between different datasets. Compared to baselines with entangled representations, our approach is able to reason beyond surface-level statistics and perform semantically meaningful cross-domain inference. Lifelong Generative Modeling n this work we focus on a lifelong learning approach to generative modeling where we continuously incorporate newly observed distributions into our learnt model. We do so through a student-teacher Variational Autoencoder architecture which allows us to learn and preserve all the distributions seen so far without the need to retain the past data nor the past models. Through the introduction of a novel cross-model regularizer, inspired by a Bayesian update rule, the student model leverages the information learnt by the teacher, which acts as a summary of everything seen till now. The regularizer has the additional benefit of reducing the effect of catastrophic interference that appears when we learn over sequences of distributions. We demonstrate its efficacy in learning sequentially observed distributions as well as its ability to learn a common latent representation across a complex transfer learning scenario. Synthesis of Differentiable Functional Programs for Lifelong Learning Memory Replay GANs: learning to generate images from new categories without forgetting

We study two methods to prevent forgetting by leveraging these replays, namely joint training with replay and replay alignment. Qualitative and quantitative experimental results in MNIST, SVHN and LSUN datasets show that our memory replay approach can generate competitive images while significantly mitigating the forgetting of previous categories Memory Aware Synapses: Learning what (not) to forget Learning to remember: Dynamic Generative Memory for Continual Learning HC-Net: Memory-based Incremental Dual-Network System for Continual learning A comprehensive, application-oriented study of catastrophic forgetting in DNNs Overcoming Catastrophic Forgetting via Model Adaptation Differentiable plasticity: training plastic neural networks with backpropagation