======
CMA-ES
======

CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is a state-of-the-art
evolutionary algorithm for continuous optimization. It maintains a multivariate
normal distribution over the search space and adapts a full covariance matrix to
learn the correlation structure of the fitness landscape. Each generation, the
algorithm samples candidate solutions, ranks them by fitness, shifts the
distribution mean toward the best solutions, and updates the covariance matrix
using evolution paths. A cumulative step-size adaptation mechanism controls the
global step size.

CMA-ES is widely regarded as the default algorithm for continuous black-box
optimization in moderate dimensions (up to ~100). Unlike simpler evolution
strategies that use only a scalar or diagonal step size, CMA-ES learns
arbitrarily rotated and scaled ellipsoidal distributions. This makes it
particularly effective when parameters are correlated or have very different
sensitivities. For example, if increasing ``x`` should be accompanied by
decreasing ``y`` to improve the objective, CMA-ES will learn this relationship
and sample accordingly.
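
The effect can be sketched with plain NumPy (illustrative only, not the
library's internals): sampling from a Gaussian whose covariance encodes a
negative ``x``/``y`` correlation produces exactly this coupled behavior.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    # A covariance with a negative x-y correlation, as CMA-ES might learn
    # for an objective that improves when x rises while y falls.
    mean = np.array([0.0, 0.0])
    C = np.array([[1.0, -0.8],
                  [-0.8, 1.0]])
    sigma = 0.3

    # Sampling step of one generation: x_k ~ N(mean, sigma^2 * C)
    samples = rng.multivariate_normal(mean, sigma**2 * C, size=1000)

    # The empirical correlation of the samples reflects the learned structure.
    corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
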

For mixed search spaces with discrete or categorical dimensions, the
implementation samples in a normalized continuous space and maps back to valid
grid values via rounding, in the spirit of mixed-integer CMA-ES variants.
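
A minimal sketch of this mapping (the ``to_grid`` helper is hypothetical,
for illustration only):

.. code-block:: python

    import numpy as np

    def to_grid(value, grid):
        """Map a continuous sample to the nearest valid grid value."""
        return grid[int(np.argmin(np.abs(grid - value)))]

    grid = np.linspace(-5, 5, 11)   # valid values: -5.0, -4.0, ..., 5.0
    to_grid(1.73, grid)             # -> 2.0
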


Algorithm
---------

Each generation:

1. **Sample**: Draw ``population`` candidates from :math:`\mathcal{N}(m, \sigma^2 C)`
2. **Evaluate**: Score all candidates
3. **Rank**: Sort by fitness, select the best ``mu``
4. **Update mean**: Shift ``m`` toward the weighted mean of the best ``mu``
5. **Update evolution paths**: Accumulate step information (``p_sigma``, ``p_c``)
6. **Update covariance**: Rank-one update (from ``p_c``) + rank-mu update (from selected solutions)
7. **Adapt step size**: Increase sigma if steps are correlated, decrease if oscillating

.. code-block:: text

    x_k = mean + sigma * B @ D @ z_k   # sample (z_k ~ N(0, I))
    mean_new = sum(w_i * x_i:mu)       # weighted recombination
    p_sigma = (1-c_s) * p_sigma + ...  # evolution path for step size
    p_c = (1-c_c) * p_c + ...          # evolution path for covariance
    C = (1-c_1-c_mu) * C + c_1 * p_c @ p_c.T + c_mu * rank_mu_update
    sigma = sigma * exp(c_s/d_s * (||p_sigma|| / E||N(0,I)|| - 1))
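
The sample / rank / recombine steps translate directly into NumPy. The sketch
below runs one generation on a simple quadratic and omits the covariance and
step-size updates for brevity; it is illustrative, not the library's
implementation.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(2)

    # Maximize f(x) = -||x||^2; the optimum is at the origin.
    n, lam, mu, sigma = 2, 12, 6, 0.5
    mean = np.array([3.0, -2.0])
    C = np.eye(n)

    # 1. Sample lambda candidates from N(mean, sigma^2 C)
    X = mean + sigma * rng.multivariate_normal(np.zeros(n), C, size=lam)

    # 2-3. Evaluate and rank (higher score is better)
    scores = -np.sum(X**2, axis=1)
    best = X[np.argsort(scores)[::-1][:mu]]

    # 4. Weighted recombination: log-linear weights favor the top ranks
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    new_mean = w @ best  # moves toward the optimum
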

The covariance matrix ``C`` is decomposed as :math:`C = B D^2 B^T`, where ``B``
holds the eigenvectors (rotation) and ``D`` the square roots of the eigenvalues
(axis lengths).
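
The decomposition-based sampling step can be reproduced with NumPy (an
illustrative sketch, not the library's code):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(1)

    C = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

    # Eigendecomposition C = B @ D^2 @ B.T
    eigvals, B = np.linalg.eigh(C)      # eigvals = diagonal of D^2
    D = np.diag(np.sqrt(eigvals))

    # One candidate: x = mean + sigma * B @ D @ z with z ~ N(0, I)
    mean, sigma = np.zeros(2), 0.3
    z = rng.standard_normal(2)
    x = mean + sigma * B @ D @ z
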

.. note::

   CMA-ES automatically sets most internal parameters (learning rates,
   weights, damping) from the dimensionality and population size. You
   typically only need to set ``population``, ``sigma``, and optionally
   ``ipop_restart``.


Parameters
----------

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 50

   * - Parameter
     - Type
     - Default
     - Description
   * - ``population``
     - int | None
     - None
     - Candidates per generation (lambda). ``None`` uses ``4 + floor(3 * ln(n))``.
   * - ``mu``
     - int | None
     - None
     - Number of parents selected. ``None`` uses ``population // 2``.
   * - ``sigma``
     - float
     - 0.3
     - Initial step size as a fraction of the normalized space.
   * - ``ipop_restart``
     - bool
     - False
     - Enable IPOP restart on stagnation (doubles the population each restart).

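
The ``None`` default for ``population`` from the table is easy to verify
(illustrative helper, not the library's API):

.. code-block:: python

    import math

    def default_population(n_dims):
        """Default lambda = 4 + floor(3 * ln(n)) for an n-dimensional problem."""
        return 4 + int(math.floor(3 * math.log(n_dims)))

    default_population(2)    # -> 6
    default_population(10)   # -> 10
    default_population(100)  # -> 17
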

Step Size (sigma)
^^^^^^^^^^^^^^^^^

``sigma`` controls the initial spread of samples. CMA-ES adapts it
automatically, so the starting value is not critical.

.. code-block:: python

    # Conservative start (fine-tuning around a known good region)
    opt = CMAESOptimizer(search_space, sigma=0.1)

    # Broad initial exploration
    opt = CMAESOptimizer(search_space, sigma=0.5)


IPOP Restart
^^^^^^^^^^^^

When stagnation is detected, IPOP restarts with a doubled population and a new
random starting point. This is effective for multi-modal landscapes where
a single run may converge to a suboptimal local optimum.

.. code-block:: python

    opt = CMAESOptimizer(
        search_space,
        ipop_restart=True,
    )

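
The restart policy itself is simple: every restart doubles the population. A
schematic (the function name is illustrative, not part of the library):

.. code-block:: python

    def ipop_schedule(initial_population, n_restarts):
        """Population sizes for the initial run plus each IPOP restart."""
        return [initial_population * 2**k for k in range(n_restarts + 1)]

    ipop_schedule(10, 3)  # -> [10, 20, 40, 80]
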

Example
-------

.. code-block:: python

    import numpy as np
    from gradient_free_optimizers import CMAESOptimizer

    def rosenbrock(para):
        x, y = para["x"], para["y"]
        return -(100 * (y - x**2) ** 2 + (1 - x) ** 2)

    search_space = {
        "x": np.linspace(-5, 5, 1000),
        "y": np.linspace(-5, 5, 1000),
    }

    opt = CMAESOptimizer(
        search_space,
        population=20,
        sigma=0.3,
    )

    opt.search(rosenbrock, n_iter=500)
    print(f"Best: {opt.best_para}, Score: {opt.best_score}")


When to Use
-----------

**Good for:**

- Continuous optimization with correlated parameters
- Problems where parameter sensitivities differ strongly
- Moderate dimensionality (2-100 dimensions)
- Multi-modal landscapes (with ``ipop_restart=True``)

**Not ideal for:**

- Very high dimensions (>100), where the covariance matrix becomes expensive
- Purely discrete/combinatorial problems (GA or DE are better suited)
- Very tight iteration budgets (CMA-ES needs several generations to adapt)

**Compared to other population-based optimizers:**

- CMA-ES vs ES: CMA-ES adapts a full covariance matrix; ES uses scalar/diagonal step sizes
- CMA-ES vs PSO: CMA-ES models the landscape shape; PSO uses velocity/social dynamics
- CMA-ES vs DE: CMA-ES learns correlations explicitly; DE derives steps from population differences


High-Dimensional Example
------------------------

.. code-block:: python

    import numpy as np
    from gradient_free_optimizers import CMAESOptimizer

    def ellipsoid(para):
        # Ill-conditioned quadratic: axis i is weighted by 10^(2i/9)
        total = 0
        for i, key in enumerate(sorted(para)):
            total += (10 ** (2 * i / 9)) * para[key] ** 2
        return -total

    search_space = {
        f"x{i}": np.linspace(-5, 5, 200)
        for i in range(10)
    }

    opt = CMAESOptimizer(
        search_space,
        population=30,
        sigma=0.3,
        ipop_restart=True,
    )

    opt.search(ellipsoid, n_iter=2000)
    print(f"Best score: {opt.best_score}")


Trade-offs
----------

- **Exploration vs. exploitation**: ``sigma`` controls the initial spread; the
  algorithm self-adapts over time. IPOP restart adds macro-level exploration.
- **Computational overhead**: Each generation, CMA-ES performs an eigendecomposition
  of the covariance matrix (O(n^3)), making it expensive for very high dimensions.
- **Population size**: Larger populations improve robustness on multi-modal problems
  but require more evaluations per generation. The default heuristic is a good
  starting point.


Related Algorithms
------------------

- :doc:`evolution_strategy` - Simpler ES with mutation-based search
- :doc:`differential_evolution` - Self-adaptive step sizes from population differences
- :doc:`particle_swarm` - Swarm-based approach with velocity dynamics