
Bayesian Optimization

Introduction

Bayesian optimization chooses new positions by calculating the expected improvement of every position in the search space, based on a Gaussian process that is trained on the already evaluated positions.

Example

from hyperactive import Hyperactive
from hyperactive.optimizers import BayesianOptimizer

...

optimizer = BayesianOptimizer(xi=0.15)

hyper = Hyperactive()
hyper.add_search(model, search_space, n_iter=50, optimizer=optimizer)
hyper.run()
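
For reference, a self-contained version of this example could look like the following sketch. The parabola objective function and the two-dimensional search space are illustrative assumptions and not part of the original snippet:

import numpy as np

from hyperactive import Hyperactive
from hyperactive.optimizers import BayesianOptimizer


# illustrative objective function: receives an accessor for the current
# parameter values and returns a score that Hyperactive maximizes
def model(opt):
    return -(opt["x1"] ** 2 + opt["x2"] ** 2)


# illustrative search space: one array of candidate values per dimension
search_space = {
    "x1": np.arange(-5, 5, 0.1),
    "x2": np.arange(-5, 5, 0.1),
}

optimizer = BayesianOptimizer(xi=0.15)

hyper = Hyperactive()
hyper.add_search(model, search_space, n_iter=50, optimizer=optimizer)
hyper.run()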

About the implementation

The Bayesian optimizer collects the position and score of each iteration. The Gaussian process regressor is fitted to the positions (features) and scores (target) and predicts the scores of all unknown positions. This is why Bayesian optimization needs at least one initial position. The Gaussian process returns the standard deviation in addition to the prediction (or mean), both of which are required to compute the acquisition function. The position with the highest acquisition value is evaluated next. The selected position and its true score are then collected, restarting the cycle. The acquisition function used in this algorithm is the expected improvement, which is calculated by the following equation:

\[ \text{expected improvement} = ( \mu - y_{sample, max} - \xi ) \cdot \Phi(Z) + \sigma \cdot \varphi(Z) \]

where:

\[ \mu, \sigma = \text{surrogate-model.predict}(...) \]

\[ Z = \frac{\mu - y_{sample, max} - \xi}{\sigma} \quad (\text{and } Z = 0 \text{ if } \sigma = 0) \]

and:

  • \(y_{sample, max}\) => best known score
  • \(\xi\) => xi-parameter
  • \(\varphi\) => Probability density function
  • \(\Phi\) => Cumulative distribution function

The surrogate model used in Bayesian optimization is the Gaussian process regressor. A crucial property of this model is that it returns the uncertainty of the prediction \(\sigma\) together with the predicted value \(\mu\).
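
As a rough illustration of the equation above (a sketch, not the library's internal code), the expected improvement can be computed from the surrogate model's output with NumPy and SciPy:

import numpy as np
from scipy.stats import norm


def expected_improvement(mu, sigma, y_max, xi=0.3):
    # mu, sigma: prediction and standard deviation of the surrogate model
    # y_max: best known score, xi: exploration parameter
    improvement = mu - y_max - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        Z = np.where(sigma > 0, improvement / sigma, 0.0)
    # Phi(Z) = cumulative distribution function, phi(Z) = probability density function
    ei = improvement * norm.cdf(Z) + sigma * norm.pdf(Z)
    return np.where(sigma > 0, ei, 0.0)

Here mu and sigma would come from a call like surrogate_model.predict(X, return_std=True), as described for the gpr-parameter below.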

Parameters

xi

Parameter of the expected-improvement acquisition-function that accounts for the expected uncertainty of the estimation. Larger values favor exploration, smaller values favor exploitation.

  • type: float
  • default: 0.3
  • typical range: 0.1 ... 0.9
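
The values below are only illustrative; a smaller xi leans toward exploitation of the current best region, a larger xi toward exploration:

from hyperactive.optimizers import BayesianOptimizer

exploit = BayesianOptimizer(xi=0.1)  # conservative, favors exploitation
explore = BayesianOptimizer(xi=0.9)  # favors exploration of uncertain regions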

gpr

Gives access to the surrogate model. A surrogate model passed via this parameter must have an interface similar to the following code:

from sklearn.gaussian_process import GaussianProcessRegressor


class GPR:
    def __init__(self):
        self.gpr = GaussianProcessRegressor()

    def fit(self, X, y):
        self.gpr.fit(X, y)

    def predict(self, X, return_std=False):
        return self.gpr.predict(X, return_std=return_std)

The predict-method must return only \(\mu\) if return_std=False and both \(\mu\) and \(\sigma\) if return_std=True. Note that you have to pass an instance of the class to the gpr-parameter:

surrogate_model = GPR()
opt = BayesianOptimizer(gpr=surrogate_model)
  • type: class
  • default: -
  • possible values: -
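
As a sketch of how this can be used, the same wrapper pattern also works with a differently configured Gaussian process; the Matern kernel below is just an illustrative choice:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

from hyperactive.optimizers import BayesianOptimizer


class MaternGPR:
    def __init__(self):
        # same interface as above, but with a Matern kernel
        self.gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5))

    def fit(self, X, y):
        self.gpr.fit(X, y)

    def predict(self, X, return_std=False):
        return self.gpr.predict(X, return_std=return_std)


optimizer = BayesianOptimizer(gpr=MaternGPR())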

max_sample_size

The max_sample_size-parameter controls a first pass of random sampling that happens before all possible positions are generated for the sequence-model-based optimization. It samples directly from the search-space and takes effect if the search-space is very large:

search_space = {
  "x1": np.arange(0, 1000, 0.01),
  "x2": np.arange(0, 1000, 0.01),
  "x3": np.arange(0, 1000, 0.01),
  "x4": np.arange(0, 1000, 0.01),
}

The max_sample_size-parameter is necessary to avoid a memory overload from generating all possible positions of the search-space. The search-space above corresponds to a list of \(100000^4 = 10^{20}\) positions (numpy arrays). This memory overload is expected for a sequence-model-based optimization algorithm, because the surrogate model has to make a prediction for every position in the search-space to calculate the acquisition-function. The max_sample_size-parameter was introduced to provide a better out-of-the-box experience when using SMBO optimizers.

  • type: int
  • default: 10000000
  • typical range: 1000000 ... 100000000
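
To see why the cap matters, the number of candidate positions can be counted directly; the lowered max_sample_size below is only an illustrative value:

import numpy as np

from hyperactive.optimizers import BayesianOptimizer

search_space = {
    "x1": np.arange(0, 1000, 0.01),
    "x2": np.arange(0, 1000, 0.01),
    "x3": np.arange(0, 1000, 0.01),
    "x4": np.arange(0, 1000, 0.01),
}

# 100000 values per dimension -> 100000 ** 4 = 1e20 possible positions
n_positions = np.prod([float(len(values)) for values in search_space.values()])

# cap the random pre-sampling of the search-space to limit memory usage
optimizer = BayesianOptimizer(max_sample_size=1_000_000)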

sampling

The sampling-parameter controls a second pass of random sampling. It samples from the list of all possible positions (not directly from the search-space). This might be necessary because calling the predict-method of the surrogate model on too many positions could overload the memory.

  • type: dict
  • default: {'random': 1000000}
  • typical range: -
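
A short sketch of adjusting this second sampling pass; the value is illustrative:

from hyperactive.optimizers import BayesianOptimizer

# draw at most 100000 of the generated positions before the surrogate model predicts
optimizer = BayesianOptimizer(sampling={"random": 100000})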

warm_start_smbo

The warm_start_smbo-parameter accepts a pandas dataframe containing the search-data (positions and scores) from a previous optimization run. Such a dataframe could look like this:

x1    x2    score
5     15    0.3
10    12    0.7
...   ...   ...

Where the corresponding search-space would look like this:

search_space = {
  "x1": np.arange(0, 20),
  "x2": np.arange(0, 20),
}

Before passing the search-data to the optimizer, make sure that its columns match the search-space of the new optimization run. You cannot, for example, add another dimension ("x3") to the search-space and expect the warm start to work. The dimensionality of the optimization must be preserved and fit the problem.

opt = BayesianOptimizer(warm_start_smbo=search_data)
  • type: pandas dataframe, None
  • default: None
  • possible values: -
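
A minimal sketch of passing such search-data, built here by hand to mirror the table above (in practice it would come from a previous optimization run):

import pandas as pd

from hyperactive.optimizers import BayesianOptimizer

# search-data: one column per dimension of the search-space plus the score
search_data = pd.DataFrame(
    {"x1": [5, 10], "x2": [15, 12], "score": [0.3, 0.7]}
)

opt = BayesianOptimizer(warm_start_smbo=search_data)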

rand_rest_p

Probability that the optimization algorithm jumps to a random position in an iteration step. It is set to 0 by default. The idea of this parameter is to make it possible to inject randomness into algorithms that don't normally support it.

  • type: float
  • default: 0
  • typical range: 0.01 ... 0.1
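
For example, a small value injects an occasional random jump; the value below is illustrative:

from hyperactive.optimizers import BayesianOptimizer

# roughly 5 percent of the iterations jump to a random position
optimizer = BayesianOptimizer(rand_rest_p=0.05)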