Sunday, July 29, 2018

Hyperparameter tuning using Scikit-Optimize

One of my favourite academic discoveries this year was Scikit-Optimize, a nifty little Python automatic hyperparameter tuning library that comes with a lot of features I found missing in other similar libraries.

As explained in an earlier blog post, automatic hyperparameter tuning is about finding the right hyperparameters for a machine learning algorithm automatically. Usually this is done manually using human experience, but even simple Monte Carlo search (random guessing) can result in better performance than human tweaking (see here). So automatic methods were developed that explore the space of hyperparameters, observe the resulting performance after training the machine learning model, and then home in on the best performing hyperparameter combinations. Of course, each time you want to evaluate a new hyperparameter combination you need to retrain and evaluate your machine learning model, which might take a very long time, so it's important that a good combination is found with as few evaluations as possible.

To do this we'll use Bayesian optimization, a process where a separate, simpler model is trained to predict the performance of any point in the hyperparameter space. We query this trained model for the hyperparameters it predicts will perform best and actually evaluate them by training our machine learning model with them. The actual resulting performance is then used to update the hyperparameter space model so that it makes better predictions, and then we get a new promising hyperparameter combination from it. This is repeated a set number of times. The most common hyperparameter space model is a Gaussian process, which maps continuous numerical hyperparameters to a single number: the predicted performance. This is not very good when your hyperparameters contain categorical data such as a choice of activation function. There is a paper that suggests that random forests are much better at mapping general hyperparameter combinations to predicted performance.
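
To make that loop concrete, here is a minimal runnable sketch of the idea (not how Scikit-Optimize implements it): it tunes a single made-up hyperparameter of a toy objective, uses scikit-learn's RandomForestRegressor as the hyperparameter space model, and simply picks the candidate with the best predicted cost instead of a proper acquisition function such as expected improvement.

import random
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for training and evaluating a real model: lower is better, best at 0.3.
def train_and_evaluate(learning_rate):
    return (learning_rate - 0.3)**2

evaluated_x = []  # hyperparameter combinations tried so far (each is a list with one value)
evaluated_y = []  # their measured costs

# Initialization: evaluate a few randomly chosen hyperparameter combinations.
for _ in range(5):
    lr = random.uniform(0.0, 1.0)
    evaluated_x.append([lr])
    evaluated_y.append(train_and_evaluate(lr))

# Bayesian optimization loop: fit the space model, pick the most promising candidate, evaluate it.
for _ in range(20):
    space_model = RandomForestRegressor(n_estimators=50).fit(evaluated_x, evaluated_y)
    candidates = [[random.uniform(0.0, 1.0)] for _ in range(1000)]
    predicted_costs = space_model.predict(candidates)
    best_index = min(range(len(candidates)), key=lambda i: predicted_costs[i])
    evaluated_x.append(candidates[best_index])
    evaluated_y.append(train_and_evaluate(candidates[best_index][0]))

best_found_cost, best_found_hyperparams = min(zip(evaluated_y, evaluated_x))
print(best_found_hyperparams, best_found_cost)

Scikit-Optimize does essentially this, but with proper acquisition functions and support for mixed continuous, integer, and categorical spaces.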

Now that we've got the theory out of the way, let's see how to use the library. We'll apply it to a gradient descent algorithm that needs to minimize the square function x². For this we'll need 3 hyperparameters: the range from which the initial value is selected randomly, the learning rate, and the number of epochs to run. So we have two continuous values and one discrete integer value.

import random

def cost(x):
    # the function to minimize: f(x) = x^2
    return x**2

def cost_grad(x):
    # its derivative: f'(x) = 2x
    return 2*x

def graddesc(max_init_x, learning_rate, num_epochs):
    # start from a random point in [-max_init_x, max_init_x] and take num_epochs gradient steps
    x = random.uniform(-max_init_x, max_init_x)
    for _ in range(num_epochs):
        x = x - learning_rate*cost_grad(x)
    return x
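
Before any tuning, it might help to see a single hand-picked run; the hyperparameter values below are arbitrary choices just for illustration:

# A single run with hand-picked hyperparameters (arbitrary illustrative values).
random.seed(0)
x_final = graddesc(max_init_x=10.0, learning_rate=0.1, num_epochs=10)
print(x_final, cost(x_final))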

Now we need to define the skopt optimizer:

import skopt

opt = skopt.Optimizer(
            [
                skopt.space.Real(0.0, 10.0, name='max_init_x'),
                skopt.space.Real(1e-10, 1.0, 'log-uniform', name='learning_rate'),
                skopt.space.Integer(1, 20, name='num_epochs'),
            ],
            n_initial_points=5,
            base_estimator='RF',
            acq_func='EI',
            acq_optimizer='auto',
            random_state=0,
        )

The above code is specifying 3 hyperparameters:
  • the maximum initial value, a real (continuous) number that can be between 0 and 10
  • the learning rate, also a real number but on a logarithmic scale (so that you are equally likely to try very small values and very large values), which can be between 1e-10 and 1
  • the number of epochs, an integer (whole number) that can be between 1 and 20
It is also saying that:
  • the hyperparameter space model should be initialized from 5 random hyperparameter combinations (the model is first trained on 5 randomly chosen combinations so that it can make its first suggestion)
  • this model should be a random forest (RF)
  • the acquisition function (the function that decides which hyperparameter combination is the most promising to try next according to the hyperparameter space model) is the expected improvement (EI) of the combination
  • the acquisition optimizer (the optimizer that finds the next promising hyperparameter combination) is chosen automatically
  • the random state is set to a fixed number (zero) so that it always gives the same random values each time you run it

Next we will use the optimizer to find good hyperparameter combinations.

best_cost = 1e100
best_hyperparams = None
for trial in range(5 + 20):
    # ask the optimizer for the next hyperparameter combination to try
    [max_init_x, learning_rate, num_epochs] = opt.ask()
    # convert the numpy scalars returned by skopt into plain Python numbers
    [max_init_x, learning_rate, num_epochs] = [max_init_x.tolist(), learning_rate.tolist(), num_epochs.tolist()]
    next_hyperparams = [max_init_x, learning_rate, num_epochs]
    # evaluate the combination by actually running gradient descent
    next_cost = cost(graddesc(max_init_x, learning_rate, num_epochs))
    if next_cost < best_cost:
        best_cost = next_cost
        best_hyperparams = next_hyperparams
    print(trial+1, next_cost, next_hyperparams)
    # report the result back to the optimizer so it can update its model
    opt.tell(next_hyperparams, next_cost)
print(best_hyperparams)

The nice thing about this library is that it uses an 'ask/tell' system: you ask the library for the next hyperparameters to try, do something with them to get the actual performance value, and finally tell the library what that performance value is. This lets you do some nifty things, such as asking for another combination if the hyperparameters resulted in an invalid state in the machine learning model, or even saving your progress and continuing later.

In the above code we're running a for loop for the number of times we want to evaluate different hyperparameter combinations: the 5 random evaluations we specified before to initialize the hyperparameter space model, plus another 20 evaluations to actually optimize the hyperparameters. One quirk of skopt is that it does not return plain Python values for hyperparameters; each number is instead a numpy scalar. Because of this we convert each numpy scalar back into a plain Python float or int using ".tolist()". We ask for the next hyperparameters to try, convert them to plain Python values, get their resulting cost after running gradient descent, keep track of the best hyperparameters found so far, and tell the library what the given hyperparameters' resulting performance was. At the end we print the best hyperparameter combination found.

Some extra stuff:
  • You can ask for categorical hyperparameters using "skopt.space.Categorical(['option1', 'option2'], name='options')" which will return one of the values in the list when calling "ask".
  • You can handle an invalid hyperparameter combination by asking for several suggestions at once instead of just one, using "opt.ask(num_hyperpars)", and then trying each of them until one works (you can also incrementally ask for more values and always take the last one).
  • You can save and continue by saving all the evaluated hyperparameters and their corresponding performance values to a text file, and later resupplying each saved pair using "tell". This is much faster than actually evaluating them on the machine learning model, so replaying known values finishes quickly. Just be careful to also call "ask" before each "tell" in order to get the exact same behaviour from the optimizer, or else the next "ask" will give different values from what it would have given had you not loaded the previous ones manually. A rough sketch of this is shown after this list.
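
As a rough sketch of that save-and-continue idea (the file name, the JSON-lines format, and appending inside the evaluation loop are just one possible way to do it, not anything skopt prescribes):

import json

# Inside the evaluation loop from earlier: append each evaluated combination and its cost to a log file.
with open('evaluations.txt', 'a') as f:
    f.write(json.dumps([next_hyperparams, next_cost]) + '\n')

# Later, after creating a fresh optimizer: replay the log before asking for new combinations.
with open('evaluations.txt', 'r') as f:
    for line in f:
        [saved_hyperparams, saved_cost] = json.loads(line)
        opt.ask()  # keep ask/tell calls paired, as noted above
        opt.tell(saved_hyperparams, saved_cost)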