This is an old revision of the document!

.Hyper Parameter Tuning

Known Uses Machine learning automation: Many machine learning algorithms have numerous hyperparameters that can be optimized. At Facebook's scale, a 1 percent improvement in accuracy for many models can have a meaningful impact on people's experiences. So with Flow, we built support for large-scale parameter sweeps and other AutoML features that leverage idle cycles to further improve these models. We are investing further in this area to help scale Facebook's AI/ML experts across many products in Facebook.


Use grid search and checkpointing to scan many hyper-parameters, then continue training the most promising models.

Optunity provides a variety of solvers for hyperparameter tuning problems. A tuning problem is specified by an objective function that provides a score for some tuple of hyperparameters. Specifying the objective function must be done by the user. The software offers a diverse set of solvers to optimize the objective function. A solver determines a good set of hyperparameters.

A pure learning algorithm can be seen as a function taking training data as input and producing as output a function (e.g. a predictor) or model (i.e. a bunch of functions). However, in practice, many learning algorithms involve hyper-parameters, i.e., annoying knobs to be adjusted. In many algorithms such as Deep Learning algorithms the number of hyper-parameters (ten or more!) can make the idea of having to adjust all of them unappealing. In addition, it has been shown that the use of computer clusters for hyper-parameter selection can have an important effect on results (Pinto et al., 2009). Choosing hyper-parameter values is formally equivalent to the question of model selection, i.e., given a family or set of learning algorithms, how to pick the most appropriate one inside the set? We define a hyper- parameter for a learning algorithm A as a variable to be set prior to the actual application of A to the data, one that is not directly selected by the learning algorithm itself. It is basically an outside control knob. It can be discrete (as in model selection) or continuous (such as the learning rate discussed above). Of course, one can hide these hyper-parameters by wrapping another learning algorithm, say B, around A, to selects A’s hyper-parameters (e.g. to minimize validation set error). We can then call B a hyper-learner, and if B has no hyper-parameters itself then the composition of B over A could be a “pure” learning algorithm, with no hyper-parameter. In the end, to apply a learner to training data, one has to have a pure learning algorithm. The hyper-parameters can be fixed by hand or tuned by an algorithm, but their value has to be selected. The value of some hyperparameters can be selected based on the performance of A on its training data, but most cannot. For any hyper-parameter that has an impact on the effective capacity of a learner, it makes more sense to select its value based on out-of-sample data (outside the training set), e.g., a validation set performance, online error, or cross-validation error. Note that some learning algorithms (in particular unsupervised learning algorithms such as algorithms for training RBMs by approximate maximum likelihood) are problematic in this respect because we cannot directly measure the quantity that is to be optimized (e.g. the likelihood) because it is intractable. On the other hand, the expected denoising reconstruction error is easy to estimate (by just averaging the denoising error over a validation set). Once some out-of-sample data has been used for selecting hyper-parameter values, it cannot be used anymore to obtain an unbiased estimator of generalization performance, so one typically uses a test set (or double cross-validation14, in the case of small datasets) to estimate generalization error of the pure learning algorithm (with hyper-parameter selection hidden inside). In some of our earliest work to leverage AI and ML — such as delivering the most relevant content to each person — we noticed that the largest improvements in accuracy often came from quick experiments, feature engineering, and model tuning rather than applying fundamentally different algorithms. Bayesian Hyperparameter Optimization for Ensemble Learning ESDF: Ensemble Selection using Diversity and Frequency

We propose an efficient method of ensemble selection for a large ensemble. We prioritize the partitions in the ensemble based on diversity and frequency. Our method selects top K of the partitions in order of priority, where K is decided by the user. We observe that considering jointly the diversity and frequency helps in identifying few representative partitions whose consensus is qualitatively better than the consensus of the entire ensemble. Gradient-based Hyperparameter Optimization through Reversible Learning

Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.

Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters Luketina, J., Raiko, T., Berglund, M., & Greff, K. (2016) [16]

Tuning hyperparameters is often necessary to get good results with deep neural networks. Typically, the turning is performed either by manual trial-and-error, by using search, or by evaluating validation set performance. The authors propose a gradient based method that is less tedious and less computationally expensive to find good regularization hyperparameters. Unlike previous methods, their method is simpler and computationally lightweight, and it updates both hyperparameters and regular parameters using stochastic gradient descent in the same training run. The gradient of the hyperparameters is obtained from the cost of the unregularized model on the validation set. Although the authors show that their method is effective in finding good regularization hyperparameters, they haven't extended it to common training techniques such as dropout regularization and learning rate adaptation. Cyclical Learning Rates for Training Neural Networks

Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate “reasonable bounds” – linearly increasing the learning rate of the network for a few epochs Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

Hyperband provides five times to thirty times speedup over state-of-the-art Bayesian optimization algorithms on a variety of deep-learning and kernel-based learning problems. The observer-assisted method for adjusting hyper-parameters in deep learning algorithms

This paper presents a concept of a novel method for adjusting hyper-parameters in Deep Learning (DL) algorithms. An external agent-observer monitors a performance of a selected Deep Learning algorithm. The observer learns to model the DL algorithm using a series of random experiments. Consequently, it may be used for predicting a response of the DL algorithm in terms of a selected quality measurement to a set of hyper-parameters. This allows to construct an ensemble composed of a series of evaluators which constitute an observer-assisted architecture. The architecture may be used to gradually iterate towards to the best achievable quality score in tiny steps governed by a unit of progress. The algorithm is stopped when the maximum number of steps is reached or no further progress is made. Learning to Learn for Global Optimization of Black Box Functions

We present a learning to learn approach for training recurrent neural networks to perform black-box global optimization. In the meta-learning phase we use a large set of smooth target functions to learn a recurrent neural network (RNN) optimizer, which is either a long-short term memory network or a differentiable neural computer. After learning, the RNN can be applied to learn policies in reinforcement learning, as well as other black-box learning tasks, including continuous correlated bandits and experimental design. We compare this approach to Bayesian optimization, with emphasis on the issues of computation speed, horizon length, and exploration-exploitation trade-offs.

The experiments have also shown that the RNNs are massively faster than Bayesian optimization. Hence, for applications involving a known horizon or where speed matters, we recommend the use of the RNN optimizers.

RNNs outperform Spearmint, with DNC doing slightly better than the LSTMs, for a horizon less than T = 30, beyond this training horizon Spearmint eventually achieves a lower error than the neural networks. This is because Spearmint has a mechanism for continued exploration that we have not yet incorporated into our neural network models. We are currently exploring extensions to our approach to improve exploration beyond the training horizon. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures

Recent automated machine learning (AutoML) systems find the right algorithm and hyperparameters in a data-driven way without any human intervention. In this blog post we describe how this is done in our AutoML system Auto-sklearn that won the recent AutoML challenge. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

We formulate hyperparameter optimization as a pure-exploration non-stochastic infinitely many armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce Hyperband for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with state-of-the-art methods on a suite of hyperparameter optimization problems. We observe that Hyperband provides five times to thirty times speedup over state-of-the-art Bayesian optimization algorithms on a variety of deep-learning and kernel-based learning problems.

The idea behind the original SuccessiveHalving algorithm follows directly from its name: uniformly allocate a budget to a set of hyperparameter configurations, evaluate the performance of all configurations, throw out the worst half, and repeat until one configurations remains. Forward and Reverse Gradient-Based Hyperparameter Optimization

We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic gradient descent. These procedures mirror two methods of computing gradients for recurrent neural networks and have different trade-offs in terms of running time and space requirements. Our formulation of the reverse-mode procedure is linked to previous work by Maclaurin et al. [2015] but does not require reversible dynamics. The forward-mode procedure is suitable for real-time hyperparameter updates, which may significantly speed up hyperparameter optimization on large datasets. We present experiments on data cleaning and on learning task interactions. DeepArchitect: Automatically Designing and Training Deep Architectures

We propose an extensible and modular language that allows the human expert to compactly represent complex search spaces over architectures and their hyperparameters. The resulting search spaces are tree-structured and therefore easy to traverse. Models can be automatically compiled to computational graphs once values for all hyperparameters have been chosen. We can leverage the structure of the search space to introduce different model search algorithms, such as random search, Monte Carlo tree search (MCTS), and sequential model-based optimization (SMBO). Google Vizier: A Service for Black-Box Optimization

Any sufficiently complex system acts as a black box when it becomes easier to experiment with than to understand. Hence, black-box optimization has become increasingly important as systems have become more complex. In this paper we describe Google Vizier, a Google-internal service for performing black-box optimization that has become the de facto parameter tuning engine at Google. Google Vizier is used to optimize many of our machine learning models and other systems, and also provides core capabilities to Google’s Cloud Machine Learning HyperTune subsystem. We discuss our requirements, infrastructure design, underlying algorithms, and advanced features such as transfer learning and automated early stopping that the service provides. Learning Transferable Architectures for Scalable Image Recognition

In this work, we demonstrate how to learn scalable, convolutional cells from data that transfer to multiple image classification tasks. The key insight to this approach is to design a search space that decouples the complexity of an architecture from the depth of a network. This resulting search space permits identifying good architectures on a small dataset (i.e., CIFAR-10) and transferring the learned architecture to image classifications across a range of data and computational scales. The resulting architectures approach or exceed state-of-the-art performance in terms of CIFAR- 10, ImageNet classification with less computational demand than human-designed architectures. The ImageNet results are particularly important because many state-of-the-art computer vision problems (derive image features or architectures from ImageNet classification models. Finally, we demonstrate that we can employ the resulting learned architecture to perform ImageNet classification with reduced computational budgets that outperform streamlined architectures targeted to mobile and embedded platforms. Our results have strong implications for transfer learning and meta-learning as this is the first work to demonstrate state-of-the-art results using meta-learning on a large scale problem. This work also highlights that learned elements of network architectures, beyond model weights, can transfer across datasets. SIMPLE AND EFFICIENT ARCHITECTURE SEARCH FOR CONVOLUTIONAL NEURAL NETWORKS Progressive Neural Architecture Search

, we use a sequential model-based optimization (SMBO) strategy, in which we search for architectures in order of increasing complexity, while simultaneously learning a surrogate function to guide the search, similar to A* search. Finding Competitive Network Architectures Within a Day Using UCT A Bridge Between Hyperparameter Optimization and Larning-to-learn

We consider a class of a nested optimization problems involving inner and outer objectives. We observe that by taking into explicit account the optimization dynamics for the inner objective it is possible to derive a general framework that unifies gradient-based hyperparameter optimization and meta-learning (or learning-to-learn). Depending on the specific setting, the variables of the outer objective take either the meaning of hyperparameters in a supervised learning problem or parameters of a meta-learner. We show that some recently proposed methods in the latter setting can be instantiated in our framework and tackled with the same gradient-based algorithms. Finally, we discuss possible design patterns for learning-to-learn and present encouraging preliminary experiments for few-shot learning.

We observed that hyperparameter optimization and learning-to-learn share the same mathematical structure, captured by a bilevel programming problem, which consists in minimizing an outer objective function whose value implicitly depends on the solution of an inner problem. The objective function of this latter problem — whose optimization variables are identified with the parameters of (ground) models — is, in turn, parametrized by the outer problem variables, identified either as hyperparameters or parameters of a meta-model, depending on the context. Since the solution of the inner optimization problem does not have, in general, a closed form expression, we formulate a related constrained optimization problem by considering explicitly an optimization dynamics for the inner problem (e.g. gradient descent). In this way we are able to (A) compute the outer objective and optimize it by gradient descent and (B) optimize also variables that parametrize the learning dynamics. We discussed examples of the framework and present experiments on few-shots learning, introducing a method for learning a shared, cross-episode, representation. A domain specific language for automated rnn architecture search DENSER: Deep Evolutionary Network Structured Representation

The algorithm not only searches for the best network topology (e.g., number of layers, type of layers), but also tunes hyper-parameters, such as, learning parameters or data augmentation parameters. The automatic design is achieved using a representation with two distinct levels, where the outer level encodes the general structure of the network, i.e., the sequence of layers, and the inner level encodes the parameters associated with each layer. Generating Neural Networks with Neural Networks

We formulate the hypernetwork training objective as a compromise between accuracy and diversity, where the diversity takes into account trivial symmetry transformations of the target network. GitGraph - Architecture Search Space Creation through Frequent Computational Subgraph Mining

Concretely, we (a) extract and publish GitGraph, a corpus of neural architectures and their descriptions; (b) we create problem-specific neural architecture search spaces, implemented as a textual search mechanism over GitGraph; © we propose a method of identifying unique common subgraphs within the architectures solving each problem (e.g., image processing, reinforcement learning), that can then serve as modules in the newly created problem specific neural search space. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

We have developed a new distributed agent IMPALA (Importance-Weighted Actor Learner Architecture) that can scale to thousands of machines and achieve a throughput rate of 250,000 frames per second. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace, which was critical for achieving learning stability. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents, use less data and crucially exhibits positive transfer between tasks as a result of its multi-task approach.

To the best of our knowledge, IMPALA is the first Deep-RL agent that has been successfully tested in such large-scale multi-task settings and it has shown superior performance compared to A3C based agents (49.4% vs. 23.8% human normalised score). Most importantly, our experiments on DMLab-30 show that, in the multi-task setting, positive transfer between individual tasks lead IMPALA to achieve better performance compared to the expert training setting. We believe that IMPALA provides a very simple yet scalable and robust framework for building better Deep-RL agents and has the potential to enable research on new challenges.

Unlike the popular A3C-based agents, in which workers communicate gradients with respect to the parameters of the policy to a central parameter server, IMPALA actors communicate trajectories of experience (sequences of states, actions, and rewards) to a centralised learner.