Forest Optimization
Introduction
The forest-optimizer calculates the expected improvement across the position space with a tree-based model. This optimization technique is very similar to bayesian-optimization in every respect except its surrogate model.
Example
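A minimal usage sketch, assuming the Gradient-Free-Optimizers package and its ForestOptimizer class; the objective function and search-space are made up for illustration:

import numpy as np
from gradient_free_optimizers import ForestOptimizer

# objective function: receives a parameter dict and returns a score (higher is better)
def parabola_function(para):
    return -para["x"] * para["x"]

search_space = {"x": np.arange(-10, 10, 0.1)}

opt = ForestOptimizer(search_space, tree_regressor="extra_tree")
opt.search(parabola_function, n_iter=50)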
About the implementation
The forest-optimizer shares most of its code base with the bayesian-optimizer. Only its model to calculate the expected score \(\mu\) and its standard deviation \(\sigma\) is different. Tree-based models do not provide a standard deviation by themselves, so a modification is necessary to determine the standard deviation from the impurity of the trees in the ensemble. This modification is taken from the paper "Algorithm Runtime Prediction: Methods & Evaluation", section 4.3.3.
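The sketch below illustrates the general idea of obtaining \(\mu\) and \(\sigma\) from a tree ensemble. Note that it uses the spread of the individual tree predictions as the uncertainty estimate, while the paper derives the variance from the impurity of the trees, so this is only an approximation of the actual method:

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# training data: positions evaluated so far and their scores
X_train = np.random.uniform(-5, 5, size=(30, 2))
y_train = -(X_train ** 2).sum(axis=1)

model = ExtraTreesRegressor(n_estimators=100).fit(X_train, y_train)

# candidate positions for which the acquisition-function is calculated
X_candidates = np.random.uniform(-5, 5, size=(1000, 2))

# predictions of every single tree in the ensemble
tree_preds = np.array([tree.predict(X_candidates) for tree in model.estimators_])

mu = tree_preds.mean(axis=0)    # expected score
sigma = tree_preds.std(axis=0)  # spread of the trees as an uncertainty estimate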
Parameters
xi
Parameter for the expected uncertainty of the estimation. It belongs to the expected-improvement acquisition-function, where a larger xi demands more improvement over the current best score and therefore shifts the balance towards exploration. A sketch of the acquisition-function is shown after the list below.
- type: float
- default: 0.3
- typical range: 0.1 ... 0.9
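A minimal sketch of how xi typically enters the expected-improvement calculation; the exact form used in the code base may differ:

from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.3):
    # a larger xi demands more improvement over the best score, favoring exploration
    imp = mu - y_best - xi
    z = imp / (sigma + 1e-12)
    return imp * norm.cdf(z) + sigma * norm.pdf(z)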
replacement
This parameter determines whether a position is put back into the list of possible positions after it has been selected and evaluated by the sequential model. A small illustration of the idea follows the list below.
- type: bool
- default: True
- possible values: True, False
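A small illustration of the idea; the candidate list and position values are made up and this is not the library's internal code:

candidate_positions = [(0.1,), (0.2,), (0.3,)]
replacement = False

selected = candidate_positions[0]   # position proposed by the surrogate model
# ... evaluate the objective function at `selected` ...
if not replacement:
    # the evaluated position cannot be proposed again
    candidate_positions.remove(selected)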
tree_regressor
The tree_regressor-parameter determines the type of surrogate model the forest-optimization algorithm will use to calculate \(\mu\) and \(\sigma\). A sketch of a possible mapping from these string values to regressor classes follows the list below.
- type: string
- default: extra_tree
- possible values: "extra_tree", "random_forest", "gradient_boost"
max_sample_size
The max_sample_size-parameter controls a first pass of random sampling that happens before all possible positions are generated for the sequence-model-based optimization. It samples the search-space directly and takes effect if the search-space is very large:
import numpy as np

# very large search-space: each dimension has 100000 possible values
search_space = {
    "x1": np.arange(0, 1000, 0.01),
    "x2": np.arange(0, 1000, 0.01),
    "x3": np.arange(0, 1000, 0.01),
    "x4": np.arange(0, 1000, 0.01),
}
The max_sample_size-parameter is necessary to avoid a memory overload from generating all possible positions of the search-space. The search-space above corresponds to \(100000^4 = 10^{20}\) possible positions. This memory overload is expected for a sequence-model-based optimization algorithm, because the surrogate model has to make a prediction for every position in the search-space to calculate the acquisition-function. The max_sample_size-parameter was introduced to provide a better out-of-the-box experience when using smb-optimizers. A sketch of this sampling idea follows the list below.
- type: int
- default: 10000000
- typical range: 1000000 ... 100000000
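A sketch of the idea behind max_sample_size: instead of enumerating every combination, a limited number of positions is drawn directly from the search-space. The variable names are illustrative, not the library's internals:

import numpy as np

search_space = {
    "x1": np.arange(0, 1000, 0.01),
    "x2": np.arange(0, 1000, 0.01),
    "x3": np.arange(0, 1000, 0.01),
    "x4": np.arange(0, 1000, 0.01),
}

# the full grid would contain 100000**4 = 10**20 positions
n_positions = np.prod([dim.size for dim in search_space.values()], dtype=float)

# instead, draw a limited number of positions directly from the search-space
max_sample_size = 10000  # kept small here; the default is 10000000
rng = np.random.default_rng()
sampled_positions = {
    name: rng.choice(dim, size=max_sample_size)
    for name, dim in search_space.items()
}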
sampling
The sampling-parameter controls a second pass of random sampling. It samples from the list of all possible positions (not directly from the search-space). This might be necessary, because the predict-method of the surrogate model could overload the memory. A usage example follows the list below.
- type: dict
- default: {'random': 1000000}
- typical range: -
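A hedged usage example, again assuming the ForestOptimizer class from Gradient-Free-Optimizers and that sampling is passed as a constructor argument:

import numpy as np
from gradient_free_optimizers import ForestOptimizer

def objective(para):
    return -(para["x1"] ** 2 + para["x2"] ** 2)

search_space = {
    "x1": np.arange(-100, 100, 0.01),
    "x2": np.arange(-100, 100, 0.01),
}

# let the surrogate model predict on at most 100000 randomly drawn positions
opt = ForestOptimizer(search_space, sampling={"random": 100000})
opt.search(objective, n_iter=30)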
warm_start_smbo
The warm_start_smbo-parameter accepts a pandas dataframe that contains search-data with the results from a previous optimization run. The dataframe containing the search-data could look like this:
x1 | x2 | score
---|----|------
5 | 15 | 0.3
10 | 12 | 0.7
... | ... | ...
The corresponding search-space would then have to contain the same dimensions, for example (the value ranges are illustrative):
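import numpy as np

search_space = {
    "x1": np.arange(0, 20, 1),
    "x2": np.arange(0, 20, 1),
}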
Before passing the search-data to the optimizer, make sure that the columns match the search-space of the new optimization run. You could not, for example, add another dimension ("x3") to the search-space and expect the warm-start to work. The dimensionality of the optimization must be preserved and fit the problem. A complete warm-start sketch follows the list below.
- type: pandas dataframe, None
- default: None
- possible values: -
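A hedged warm-start sketch that builds the search-data dataframe by hand and passes it to the (assumed) ForestOptimizer class:

import numpy as np
import pandas as pd
from gradient_free_optimizers import ForestOptimizer

def objective(para):
    return -(para["x1"] ** 2 + para["x2"] ** 2)

search_space = {
    "x1": np.arange(0, 20, 1),
    "x2": np.arange(0, 20, 1),
}

# search-data collected in a previous optimization run (values from the table above)
previous_results = pd.DataFrame(
    {"x1": [5, 10], "x2": [15, 12], "score": [0.3, 0.7]}
)

opt = ForestOptimizer(search_space, warm_start_smbo=previous_results)
opt.search(objective, n_iter=30)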
rand_rest_p
Probability for the optimization algorithm to jump to a random position in an iteration step. It is set to 0 by default. The idea of this parameter is to provide a way to inject randomness into algorithms that do not support it otherwise. A small sketch of the mechanism follows the list below.
- type: float
- default: 0
- typical range: 0.01 ... 0.1
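A small sketch of the mechanism, independent of the actual library internals:

import random

rand_rest_p = 0.05

def next_position(proposed_position, random_position):
    # with probability rand_rest_p the optimizer jumps to a random position,
    # otherwise it follows the position proposed by the algorithm
    if random.random() < rand_rest_p:
        return random_position
    return proposed_position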