**Name** Meta Learner aka Learning to Optimize, Learning to Learn

**Intent**

Use a DL network to learn and optimize the learning algorithm.

**Motivation**

How can we leverage DL to learn how to improve the training process?

**Structure**

<Diagram>

**Discussion**

This is one of the more intriguing questions in Deep Learning. We know that Stochastic Gradient Descent (SGD) works surprisingly well; however, can we train a network to learn a more efficient method? Can we train a network to learn how to optimize, or how to learn? This is a meta-level question. Extending it further, can we train a network to improve its own learning? There are two important metrics we would like to improve. Trainability is the most obvious, in that we would like to speed up training. The other is whether we can learn more generalized models.

“There are about a dozen primitive Model Free Methods (Generate-and-test, Enumeration, Table Lookup, Mindless Copying, Adaptation, Evolution, etc). Depending on context you can view them as meta-methods, machine learning, evolution-in-the-abstract, Models of Mind, Models of Saliency and Reduction, or Pattern Discovery methods. They can be combined in myriad ways, yielding more complex Model Free Methods.”

**Known Uses**

There are several approaches in the literature. One approach is to treat the optimization problem as a reinforcement learning (RL) problem, with a policy network that learns the optimization algorithm; this reduces the problem to discovering the optimal policy. In [LEARNOPT], researchers demonstrate that this method converges faster than conventional optimizers. A similar research effort [LEARNQ] employs the Deep Q-Learning RL algorithm to learn how to speed up optimization. There appears to be a proof that a network trained with fewer iterations may have better generalization ability [PROOFGEN]; this proof is used in the paper's argument that faster convergence implies improved generalization. Both approaches use an RL reward function chosen to learn a policy that reaches an optimal objective value in the fewest iterations.
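This RL framing can be sketched in a few lines: the state is the current parameters and gradient, the action is a parameter update, and the reward is the drop in the objective. The quadratic objective, the gradient-scaling policy class, and the candidate step sizes below are all illustrative assumptions, and the crude search stands in for the guided policy search used in the actual paper:

```python
import numpy as np

# Hypothetical sketch of "learning to optimize" as policy search:
# state = (parameters, gradient), action = parameter update,
# reward = decrease in the objective value.

def objective(w):
    return float(np.sum((w - 3.0) ** 2))

def final_loss(step_size, n_steps=20):
    # Roll out the simplest possible policy: action = -step_size * gradient.
    w = np.zeros(2)
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)
        w = w - step_size * grad
    return objective(w)

# Crude policy search: evaluate candidate policies over a fixed horizon
# and keep the one that ends with the lowest objective value.
candidates = [0.01, 0.05, 0.1, 0.3, 0.45]
best = min(candidates, key=final_loss)
print(best, final_loss(best))  # the discovered policy beats the weak baselines
```

The point of the sketch is only the reduction: once an optimizer is viewed as a policy, "designing a better optimizer" becomes "finding a higher-reward policy."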

Another approach is to treat the design of the optimization algorithm itself as a learning problem, implemented as an RNN using an LSTM [LEARNGRAD]. The DeepMind researchers show that their trained neural optimizers compare favorably against even state-of-the-art optimization methods. The measure of generalization proposed in this paper is based on a network's ability to capture, via transfer learning, the same capabilities as a much larger network. The paper argues that their LSTM-based approach led to an optimizer four times smaller than the originally trained network while still achieving high generalization.
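A minimal sketch of this optimizer interface, assuming a hand-set recurrent cell rather than the paper's trained LSTM (the toy recurrence below reduces to momentum; it only illustrates how per-parameter hidden state is carried across optimization steps):

```python
import numpy as np

# Toy stand-in for a learned recurrent optimizer: each update is produced
# by a recurrent cell fed the gradient, with hidden state carried between
# steps. Here the cell is hand-set (and equivalent to momentum); in
# [LEARNGRAD], an LSTM cell with meta-trained weights plays this role.

class RecurrentOptimizer:
    def __init__(self, decay=0.9, lr=0.1):
        self.decay, self.lr = decay, lr
        self.h = None  # per-coordinate hidden state

    def step(self, params, grad):
        if self.h is None:
            self.h = np.zeros_like(params)
        self.h = self.decay * self.h + grad  # recurrent state update
        return params - self.lr * self.h     # update = g(gradient, state)

opt = RecurrentOptimizer()
w = np.array([5.0, -4.0])
for _ in range(50):
    w = opt.step(w, 2.0 * w)  # gradient of ||w||^2
print(np.abs(w).max())        # w has been driven toward the minimum at 0
```

Replacing the hand-set recurrence with a trained LSTM, and training that LSTM by backpropagating the optimizee's losses through the unrolled updates, recovers the paper's setup.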

The very notion of using a neural network to learn the neural network optimization algorithm was introduced by Schmidhuber back in 1992, where he used RNNs for meta-learning. This did not work well in practice until LSTM networks were invented. Even earlier, in his 1987 thesis, Schmidhuber had proposed applying genetic algorithms to the problem of learning to learn.

**Related Patterns**

- Stochastic Gradient Descent
- Irreducible Computation
- Transfer Learning
- Compressed Sensing
- Disentangled Models
- Reinforcement Learning
- RNN

Transfer Learning - In our setting the examples are themselves problem instances, which means generalization corresponds to the ability to transfer knowledge between different problems. This reuse of problem structure is commonly known as transfer learning, and is often treated as a subject in its own right. However, by taking a meta-learning perspective, we can cast the problem of transfer learning as one of generalization, which is much better studied in the machine learning community.

<Diagram>

**References**

[LEARNOPT] http://arxiv.org/abs/1606.01885v1

Learning to Optimize

We believe this to be the first method that can automatically discover a better algorithm. We approach this problem from a reinforcement learning perspective and represent any particular optimization algorithm as a policy. We learn an optimization algorithm using guided policy search and demonstrate that the resulting algorithm outperforms existing hand-engineered algorithms in terms of convergence speed and/or the final objective value.

[LEARNQ] https://arxiv.org/abs/1606.01467

Deep Q-Networks for Accelerating the Training of Deep Neural Networks

http://arxiv.org/pdf/1606.04080v1.pdf Matching Networks for One Shot Learning

This is a form of meta-learning since the training procedure explicitly learns to learn from a given support set to minimise a loss over a batch.

[LEARNGRAD] https://arxiv.org/abs/1606.04474 Learning to learn by gradient descent by gradient descent

In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way.

We have shown how to cast the design of optimization algorithms as a learning problem, which enables us to train optimizers that are specialized to particular classes of functions. Our experiments have confirmed that learned neural optimizers compare favorably against state-of-the-art optimization methods used in deep learning. We witnessed a remarkable degree of transfer, with for example the LSTM optimizer trained on 12,288 parameter neural art tasks being able to generalize to tasks with 49,152 parameters, different styles, and different content images all at the same time. We observed similar impressive results when transferring to different architectures in the MNIST task.

[PROOFGEN] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

http://people.idsia.ch/~juergen/metalearner.html

https://arxiv.org/pdf/1606.05233.pdf Learning feed-forward one-shot learners

We construct the learner as a second deep network, called a learnet, which predicts the parameters of a pupil network from a single exemplar. In this manner we obtain an efficient feed-forward one-shot learner, trained end-to-end by minimizing a one-shot classification objective in a learning to learn formulation. In order to make the construction feasible, we propose a number of factorizations of the parameters of the pupil network.

http://arxiv.org/abs/1606.02580 Convolution by Evolution: Differentiable Pattern Producing Networks

Our main result is that DPPNs can be evolved/trained to compress the weights of a denoising autoencoder from 157684 to roughly 200 parameters, while achieving a reconstruction accuracy comparable to a fully connected network with more than two orders of magnitude more parameters.

https://arxiv.org/abs/1609.02228v2 Learning to learn with backpropagation of Hebbian plasticity

Here we derive analytical expressions for activity gradients in neural networks with Hebbian plastic connections. Using these expressions, we can use backpropagation to train not just the baseline weights of the connections, but also their plasticity. As a result, the networks “learn how to learn” in order to solve the problem at hand: the trained networks automatically perform fast learning of unpredictable environmental features during their lifetime, expanding the range of solvable problems.

https://arxiv.org/abs/1610.06402v1 A Growing Long-term Episodic & Semantic Memory

The long-term memory of most connectionist systems lies entirely in the weights of the system. Since the number of weights is typically fixed, this bounds the total amount of knowledge that can be learned and stored. Though this is not normally a problem for a neural network designed for a specific task, such a bound is undesirable for a system that continually learns over an open range of domains. To address this, we describe a lifelong learning system that leverages a fast, though non-differentiable, content-addressable memory which can be exploited to encode both a long history of sequential episodic knowledge and semantic knowledge over many episodes for an unbounded number of domains. This opens the door for investigation into transfer learning, and leveraging prior knowledge that has been learned over a lifetime of experiences to new domains.

https://arxiv.org/abs/1610.06072v1 Learning to Learn Neural Networks

We use a Long Short Term Memory (LSTM) based network to learn to compute on-line updates of the parameters of another neural network. These parameters are stored in the cell state of the LSTM. Our framework allows to compare learned algorithms to hand-made algorithms within the traditional train and test methodology. In an experiment, we learn a learning algorithm for a one-hidden layer Multi-Layer Perceptron (MLP) on non-linearly separable datasets. The learned algorithm is able to update parameters of both layers and generalise well on similar datasets.

https://arxiv.org/pdf/1606.04474v1.pdf Learning to learn by gradient descent by gradient descent

The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.

https://www.youtube.com/watch?v=x1kf4Zojtb0 Learning to learn and compositionality with deep recurrent neural networks

http://openreview.net/pdf?id=HyWG0H5ge NEURAL TAYLOR APPROXIMATION

http://openreview.net/pdf?id=Syg_lYixe DEEP REINFORCEMENT LEARNING FOR ACCELERATING THE CONVERGENCE RATE

In this paper, we propose a principled deep reinforcement learning (RL) approach that is able to accelerate the convergence rate of general deep neural networks (DNNs). With our approach, a deep RL agent (synonym for optimizer in this work) is used to automatically learn policies about how to schedule learning rates during the optimization of a DNN. The state features of the agent are learned from the weight statistics of the optimizee during training. The reward function of this agent is designed to learn policies that minimize the optimizee’s training time given a certain performance goal. The actions of the agent correspond to changing the learning rate for the optimizee during training.

https://github.com/bigaidreamprojects/qan

https://arxiv.org/abs/1611.05763v2 Learning to reinforcement learn

What emerges is a system that is trained using one RL algorithm, but whose recurrent dynamics implement a second, quite separate RL procedure. This second, learned RL algorithm can differ from the original one in arbitrary ways. Importantly, because it is learned, it is configured to exploit structure in the training domain.

https://arxiv.org/abs/1611.02779v2 Fast Reinforcement Learning via Slow Reinforcement Learning

Rather than designing a “fast” reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL2, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose (“slow”) RL algorithm.

https://arxiv.org/abs/1611.03824 Learning to Learn for Global Optimization of Black Box Functions

We present a learning to learn approach for training recurrent neural networks to perform black-box global optimization. In the meta-learning phase we use a large set of smooth target functions to learn a recurrent neural network (RNN) optimizer, which is either a long-short term memory network or a differentiable neural computer. After learning, the RNN can be applied to learn policies in reinforcement learning, as well as other black-box learning tasks, including continuous correlated bandits and experimental design. We compare this approach to Bayesian optimization, with emphasis on the issues of computation speed, horizon length, and exploration-exploitation trade-offs.

**The experiments have also shown that the RNNs are massively faster than Bayesian optimization. Hence, for applications involving a known horizon or where speed matters, we recommend the use of the RNN optimizers.**

RNNs outperform Spearmint, with DNC doing slightly better than the LSTMs, for a horizon less than T = 30; beyond this training horizon, Spearmint eventually achieves a lower error than the neural networks. This is because Spearmint has a mechanism for continued exploration that we have not yet incorporated into our neural network models. We are currently exploring extensions to our approach to improve exploration beyond the training horizon.

https://openreview.net/forum?id=Bk8BvDqex Metacontrol for Adaptive Imagination-Based Optimization

Many machine learning systems are built to solve the hardest examples of a particular task, which often makes them large and expensive to run—especially with respect to the easier examples, which might require much less computation. For an agent with a limited computational budget, this “one-size-fits-all” approach may result in the agent wasting valuable computation on easy examples, while not spending enough on hard examples. Rather than learning a single, fixed policy for solving all instances of a task, we introduce a metacontroller which learns to optimize a sequence of “imagined” internal simulations over predictive models of the world in order to construct a more informed, and more economical, solution. The metacontroller component is a model-free reinforcement learning agent, which decides both how many iterations of the optimization procedure to run, as well as which model to consult on each iteration. The models (which we call “experts”) can be state transition models, action-value functions, or any other mechanism that provides information useful for solving the task, and can be learned on-policy or off-policy in parallel with the metacontroller. When the metacontroller, controller, and experts were trained with “interaction networks” (Battaglia et al., 2016) as expert models, our approach was able to solve a challenging decision-making problem under complex non-linear dynamics. The metacontroller learned to adapt the amount of computation it performed to the difficulty of the task, and learned how to choose which experts to consult by factoring in both their reliability and individual computational resource costs. This allowed the metacontroller to achieve a lower overall cost (task loss plus computational cost) than more traditional fixed policy approaches. These results demonstrate that our approach is a powerful framework for using rich forward models for efficient model-based reinforcement learning.

https://arxiv.org/abs/1612.09030v2 Meta-Unsupervised-Learning: A supervised approach to unsupervised learning

We introduce a new paradigm to investigate unsupervised learning, reducing unsupervised learning to supervised learning. Specifically, we mitigate the subjectivity in unsupervised decision-making by leveraging knowledge acquired from prior, possibly heterogeneous, supervised learning tasks. We demonstrate the versatility of our framework via comprehensive expositions and detailed experiments on several unsupervised problems such as (a) clustering, (b) outlier detection, and (c) similarity prediction under a common umbrella of meta-unsupervised-learning. We also provide rigorous PAC-agnostic bounds to establish the theoretical foundations of our framework, and show that our framing of meta-clustering circumvents Kleinberg's impossibility theorem for clustering.

https://arxiv.org/pdf/1606.02492v3.pdf Convolutional Neural Fabrics

Despite the success of CNNs, selecting the optimal architecture for a given task remains an open problem. Instead of aiming to select a single optimal architecture, we propose a “fabric” that embeds an exponentially large number of architectures. The fabric consists of a 3D trellis that connects response maps at different layers, scales, and channels with a sparse homogeneous local connectivity pattern. The only hyper-parameters of a fabric are the number of channels and layers. While individual architectures can be recovered as paths, the fabric can in addition ensemble all embedded architectures together, sharing their weights where their paths overlap. Parameters can be learned using standard methods based on backpropagation, at a cost that scales linearly in the fabric size. We present benchmark results competitive with the state of the art for image classification on MNIST and CIFAR10, and for semantic segmentation on the Part Labels dataset.

https://openreview.net/pdf?id=rJY0-Kcll OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING

Though deep neural networks have shown great success in the large data domain, they generally perform poorly on few-shot learning tasks, where a classifier has to quickly generalize after seeing very few examples from each class. The general belief is that gradient-based optimization in high capacity classifiers requires many iterative steps over many examples to perform well. Here, we propose an LSTM-based meta-learner model to learn the exact optimization algorithm used to train another learner neural network classifier in the few-shot regime. The parametrization of our model allows it to learn appropriate parameter updates specifically for the scenario where a set amount of updates will be made, while also learning a general initialization of the learner (classifier) network that allows for quick convergence of training. We demonstrate that this meta-learning model is competitive with deep metric-learning techniques for few-shot learning.

https://arxiv.org/abs/1502.03492 Gradient-based Hyperparameter Optimization through Reversible Learning

Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum. https://github.com/HIPS/hypergrad
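The core trick, chaining derivatives of the final loss backwards through every SGD step, can be checked on a one-dimensional toy problem. The quadratic objective and single hyperparameter (the step size) are illustrative assumptions, and this sketch simply stores the trajectory rather than reversing the SGD dynamics as the paper does to save memory:

```python
# Hypergradient sketch: train w on f(w) = (w - 1)^2 for T steps of
# gradient descent, then backpropagate d(final loss)/d(learning rate)
# through the unrolled updates and compare with the closed form.

T, lr, w = 10, 0.05, 0.0
trajectory = [w]
for _ in range(T):                       # forward pass, recording w_t
    w = w - lr * 2.0 * (w - 1.0)
    trajectory.append(w)

# backward pass through w_{t+1} = w_t - lr * 2 * (w_t - 1)
d_w = 2.0 * (w - 1.0)                    # d(loss)/d(w_T)
d_lr = 0.0
for t in reversed(range(T)):
    d_lr += d_w * (-2.0 * (trajectory[t] - 1.0))  # direct dependence on lr
    d_w *= 1.0 - 2.0 * lr                          # chain to d(loss)/d(w_t)

# closed form: loss(lr) = (1 - 2*lr)**(2*T) * (w_0 - 1)**2, with w_0 = 0
closed = -4.0 * T * (1.0 - 2.0 * lr) ** (2 * T - 1)
print(d_lr, closed)  # the two hypergradients agree
```

With this gradient in hand, the learning rate itself can be optimized by an outer gradient descent loop, which is exactly what makes schedules, initializations, and regularizers tunable in the paper's framework.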

https://arxiv.org/abs/1703.03400v1 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on a few-shot image classification benchmark, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.

https://arxiv.org/abs/1606.02185 Towards a Neural Statistician

An efficient learner is one who reuses what they already know to tackle a new problem. For a machine learner, this means understanding the similarities amongst datasets. In order to do this, one must take seriously the idea of working with datasets, rather than datapoints, as the key objects to model. Towards this goal, we demonstrate an extension of a variational autoencoder that can learn a method for computing representations, or statistics, of datasets in an unsupervised fashion. The network is trained to produce statistics that encapsulate a generative model for each dataset. Hence the network enables efficient learning from new datasets for both unsupervised and supervised tasks. We show that we are able to learn statistics that can be used for: clustering datasets, transferring generative models to new datasets, selecting representative samples of datasets and classifying previously unseen classes.

https://arxiv.org/abs/1703.03633v2 Learning Gradient Descent: Better Generalization and Longer Horizons

Our optimizer outperforms generic, hand-crafted optimization algorithms and state-of-the-art learning-to-learn optimizers by DeepMind in many tasks. We demonstrate the effectiveness of our algorithms on a number of tasks, including deep MLPs, CNNs, and simple LSTMs. https://github.com/vfleaking/rnnprop

https://arxiv.org/abs/1703.04813v1 Learned Optimizers that Scale and Generalize

Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks. We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse, optimization tasks capturing common properties of loss landscapes. The optimizer learns to out-perform RMSProp/ADAM on problems in this corpus.

We have shown that RNN-based optimizers meta-trained on small problems can scale and generalize to training large problems like ResNet and Inception on the ImageNet dataset. To achieve these results, we introduced a novel hierarchical architecture that reduces memory overhead and allows communication across parameters, and augmented it with additional features shown to be useful in previous optimization and recurrent neural network literature. We also developed an ensemble of small optimization problems that capture common and diverse properties of loss landscapes.

In experiments, for small minibatches, we significantly underperform ADAM and RMSProp in terms of wall clock time. However, consistent with the prediction in 3.1.1, since our overhead is constant in terms of minibatch we see that the overhead can be made small by increasing the minibatch size.

https://arxiv.org/abs/1703.03400v2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on two few-shot image classification benchmarks, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.

We introduced a meta-learning algorithm based on learning easily adaptable network parameters through gradient descent. Our approach has a number of benefits. It is simple and does not introduce any additional learned parameters in the meta-learning. It can be combined with any model representation that is amenable to training through gradient descent, and any differentiable objective, including classification, regression, and reinforcement learning. Lastly, since our method merely produces a good weight initialization, adaptation can be performed with any amount of data and any number of gradient steps, though we demonstrate that it can achieve state-of-the-art results on classification with only one or five examples per class. We also show that our method can adapt an RL agent through policy gradients using a very modest amount of experience.
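MAML's two-level structure can be sketched on a toy family of one-dimensional linear tasks y = a·x. This is a first-order approximation (second derivatives are dropped), and the task distribution, learning rates, and sample sizes are illustrative assumptions:

```python
import numpy as np

# First-order MAML sketch: the inner loop adapts a copy of the
# initialization to a sampled task with one gradient step; the outer
# loop nudges the initialization using the gradient at the adapted point.

rng = np.random.default_rng(0)
inner_lr, meta_lr = 0.1, 0.05
w0 = 0.0  # the meta-learned initialization

def grad(w, a, x):
    # gradient of mean squared error for the task y = a * x
    return np.mean(2.0 * (w - a) * x * x)

for _ in range(500):
    a = rng.uniform(1.0, 3.0)              # sample a task (its slope)
    x = rng.uniform(-1.0, 1.0, size=10)    # a few training inputs
    w_task = w0 - inner_lr * grad(w0, a, x)    # inner adaptation step
    w0 = w0 - meta_lr * grad(w_task, a, x)     # first-order meta-update

print(w0)  # near 2.0, the centre of the task distribution
```

The meta-update drives the initialization toward the point from which one gradient step adapts best on average over tasks; full MAML additionally differentiates through the inner step, which requires second derivatives.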

https://arxiv.org/pdf/1705.03562.pdf Deep Episodic Value Iteration for Model-based Meta-Reinforcement Learning

We present a new deep meta reinforcement learner, which we call Deep Episodic Value Iteration (DEVI). DEVI uses a deep neural network to learn a similarity metric for a non-parametric model-based reinforcement learning algorithm. Our model is trained end-to-end via back-propagation. Despite being trained using the model-free Q-learning objective, we show that DEVI’s model-based internal structure provides ‘one-shot’ transfer to changes in reward and transition structure, even for tasks with very high-dimensional state spaces.

https://arxiv.org/abs/1706.09529v1 Learning to Learn: Meta-Critic Networks for Sample Efficient Learning

We propose a novel and flexible approach to meta-learning for learning-to-learn from only a few examples. Our framework is motivated by actor-critic reinforcement learning, but can be applied to both reinforcement and supervised learning. The key idea is to learn a meta-critic: an action-value function neural network that learns to criticise any actor trying to solve any specified task. For supervised learning, this corresponds to the novel idea of a trainable task-parametrised loss generator. This meta-critic approach provides a route to knowledge transfer that can flexibly deal with few-shot and semi-supervised conditions for both reinforcement and supervised learning. Promising results are shown on both reinforcement and supervised learning problems.

https://arxiv.org/abs/1707.03141 Meta-Learning with Temporal Convolutions

We propose a class of simple and generic meta-learner architectures, based on temporal convolutions, that is domain-agnostic and has no particular strategy or algorithm encoded into it.

We can view the TCML architecture as a flavor of RNN that can remember information through the activations of the network rather than through an explicit memory module. Because of its convolutional structure, the TCML better preserves the temporal structure of the inputs it receives, at the expense of only being able to remember information for a fixed amount of time. However, by exponentially increasing the dilation factors of the higher convolutional layers (as done by van den Oord et al. [27]), TCML architectures can tractably store information for long periods of time.

https://arxiv.org/pdf/1707.07012v1.pdf Learning Transferable Architectures for Scalable Image Recognition

In our experiments, we search for the best convolutional cell on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more of this cell. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves state-of-the-art accuracy of 82.3% top-1 and 96.0% top-5 on ImageNet, which is 0.8% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS. This cell can also be scaled down two orders of magnitude: a smaller network constructed from the best cell also achieves 74% top-1 accuracy, which is 3.1% better than the equivalently-sized, state-of-the-art models for mobile platforms.

https://arxiv.org/pdf/1710.10304.pdf FEW-SHOT AUTOREGRESSIVE DENSITY ESTIMATION: TOWARDS LEARNING TO LEARN DISTRIBUTIONS

https://arxiv.org/pdf/1707.09835v2.pdf Meta-SGD: Learning to Learn Quickly for Few-Shot Learning

https://openreview.net/pdf?id=B1DmUzWAW A SIMPLE NEURAL ATTENTIVE META-LEARNER

We propose a class of simple and generic meta-learner architectures that use a novel combination of temporal convolutions and soft attention; the former to aggregate information from past experience and the latter to pinpoint specific pieces of information. In the most extensive set of meta-learning experiments to date, we evaluate the resulting Simple Neural AttentIve Learner (or SNAIL) on several heavily-benchmarked tasks. On all tasks, in both supervised and reinforcement learning, SNAIL attains state-of-the-art performance by significant margins.

https://arxiv.org/abs/1801.05558 Meta-Learning with Adaptive Layerwise Metric and Subspace

A promising approach is the model-agnostic meta-learning (MAML) which embeds gradient descent into the meta-learner. It optimizes for the initial parameters of the learner to warm-start the gradient descent updates, such that new tasks can be solved using a small number of examples. In this paper we elaborate the gradient-based meta-learning, developing two new schemes. First, we present a feedforward neural network, referred to as T-net, where the linear transformation between two adjacent layers is decomposed as T W such that W is learned by task-specific learners and the transformation T, which is shared across tasks, is meta-learned to speed up the convergence of gradient updates for task-specific learners. Second, we present MT-net where gradient updates in the T-net are guided by a binary mask M that is meta-learned, restricting the updates to be performed in a subspace.

https://openreview.net/pdf?id=HyjC5yWCW META-LEARNING AND UNIVERSALITY: DEEP REPRESENTATIONS AND GRADIENT DESCENT CAN APPROXIMATE ANY LEARNING ALGORITHM

In particular, we seek to answer the following question: does deep representation combined with standard gradient descent have sufficient capacity to approximate any learning algorithm? We find that this is indeed true, and further find, in our experiments, that gradient-based meta-learning consistently leads to learning strategies that generalize more widely compared to those represented by recurrent models.