
Commit e66c657

committed
Add StackHPC Ironic tunings
1 parent 4effb00 commit e66c657

13 files changed

Lines changed: 268 additions & 1 deletion


doc/source/configuration/index.rst

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,7 @@ the various features provided.
1010

1111
release-train
1212
host-images
13+
ironic
1314
lvm
1415
cephadm
1516
monitoring
@@ -22,3 +23,4 @@ the various features provided.
2223
ci-cd
2324
cloudkitty
2425
ipa
26+
stackhpc-mixin-environments
Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
======
2+
Ironic
3+
======
4+
5+
Mixin environments
6+
------------------
7+
8+
The following mixin environments are provided to customise the Ironic configuration:
9+
10+
* :ref:`mixin-baremetal` - StackHPC opinionated defaults for Ironic.
11+
* :ref:`mixin-baremetal-policy` - Policy tweaks for Ironic.
Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
.. _stackhpc-mixin-environments:
2+
3+
===========================
4+
StackHPC Mixin Environments
5+
===========================
6+
7+
StackHPC Kayobe configuration provides a set of mixin environments, which can
8+
be used to apply configuration in a modular way. These provide a mechanism where
9+
users can opt into new sets of configuration mid-cycle, at a time of their
10+
choosing, and thereby facilitate gradual adoption of new features. Config may
11+
be moved into the base configuration for the next major release.
12+
13+
For more information about Kayobe environments, please see the `upstream Kayobe
14+
documentation
15+
<https://docs.openstack.org/kayobe/latest/multiple-environments.html#defining-kayobe-environments>`__.
16+
17+
.. note::
18+
19+
   To override settings in mixin environments, you will need to define the
20+
   overrides in an environment that inherits from that one, rather than in the
21+
   base configuration.
22+
23+
.. _mixin-baremetal:
24+
25+
baremetal
26+
---------
27+
28+
.. include:: ../../../etc/kayobe/environments/baremetal/README.rst
29+
30+
.. _mixin-baremetal-policy:
31+
32+
baremetal-policy
33+
----------------
34+
35+
.. include:: ../../../etc/kayobe/environments/baremetal-policy/README.rst

etc/kayobe/environments/baremetal-policy/README.rst

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
Policy for a baremetaluser role
2-
===============================
2+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
33

44
When deploying Slurm on baremetal nodes, it is typical to select a specific
55
baremetal node, and give it the expected hostname. We allow this via a tweak to
@@ -11,3 +11,20 @@ not own the network.
1111

1212
We should never use the admin role to do these operations, as it has far too
1313
much privilege.
14+
15+
Consuming this environment
16+
^^^^^^^^^^^^^^^^^^^^^^^^^^
17+
18+
Add the ``baremetal-policy`` environment to your ``.kayobe-environment`` file:
19+
20+
.. code-block:: yaml
21+
   :caption: $KAYOBE_CONFIG_PATH/$KAYOBE_ENVIRONMENT/.kayobe-environment
22+
23+
   dependencies:
24+
     - baremetal-policy
25+
26+
Redeploy Neutron and Nova:
27+
28+
.. code-block:: console
29+
30+
   kayobe overcloud service deploy -kt neutron,nova
Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
StackHPC Ironic environment
2+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
3+
4+
Mixin that adds StackHPC opinionated defaults for Ironic.
5+
6+
Consuming this environment
7+
^^^^^^^^^^^^^^^^^^^^^^^^^^
8+
9+
Add the ``baremetal`` environment to your ``.kayobe-environment`` file:
10+
11+
.. code-block:: yaml
12+
   :caption: $KAYOBE_CONFIG_PATH/$KAYOBE_ENVIRONMENT/.kayobe-environment
13+
14+
   dependencies:
15+
     - baremetal
16+
17+
Redeploy the loadbalancer, Neutron, Nova, and Ironic:
18+
19+
.. code-block:: console
20+
21+
   kayobe overcloud service deploy -kt loadbalancer,neutron,ironic,nova
22+
23+
Cleaning
24+
^^^^^^^^
25+
26+
Storage
27+
"""""""
28+
29+
Hardware assisted secure erase, i.e. the ``erase_devices`` clean step, is
30+
enabled by default. This is normally dependent on the `Hardware Manager
31+
<https://docs.openstack.org/ironic-python-agent/latest/contributor/hardware_managers.html>`__
32+
in use. For example, when using the GenericHardwareManager the priority would
33+
be 10, whereas if using the `ProliantHardwareManager
34+
<https://docs.openstack.org/ironic/latest/admin/drivers/ilo.html#disk-erase-support>`__
35+
it would be 0. The idea is to prevent the catastrophic case where data could
36+
be leaked to another tenant; you must explicitly relax this setting if that
37+
is a risk you are willing to take. This can be customised by
38+
editing the following variables:
39+
40+
.. code-block:: ini
41+
   :caption: $KAYOBE_CONFIG_PATH/$KAYOBE_ENVIRONMENT/kolla/config/ironic/ironic-conductor.conf
42+
43+
   [deploy]
44+
   erase_devices_priority=10
45+
   erase_devices_metadata_priority=0
46+
47+
See `Ironic documentation
48+
<https://docs.openstack.org/ironic/latest/admin/cleaning.html>`__ for more
49+
details.
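
As a rough illustration of how these priorities behave, the sketch below parses the ``[deploy]`` snippet above with Python's standard ``configparser`` and reports which clean steps remain active. It relies on the Ironic convention that a priority of 0 disables a clean step. The ``enabled_clean_steps`` helper is hypothetical and is not part of Ironic or Kayobe; it also assumes that every ``[deploy]`` option ending in ``_priority`` is a clean step priority.

```python
import configparser

# The [deploy] overrides shown above, inlined so the example is self-contained.
IRONIC_CONDUCTOR_CONF = """
[deploy]
erase_devices_priority=10
erase_devices_metadata_priority=0
"""


def enabled_clean_steps(conf_text):
    """Return {step_name: priority} for steps with a non-zero priority.

    Hypothetical helper for illustration only. Assumes every option in
    [deploy] ending in "_priority" is a clean step priority, and that a
    priority of 0 means the step is disabled.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    steps = {}
    for option, value in parser["deploy"].items():
        if option.endswith("_priority") and int(value) > 0:
            # Strip the "_priority" suffix to recover the step name.
            steps[option[: -len("_priority")]] = int(value)
    return steps


print(enabled_clean_steps(IRONIC_CONDUCTOR_CONF))
```

With the values from this environment, only ``erase_devices`` is reported; setting ``erase_devices_priority`` to 0 in an inheriting environment would remove it from the result.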
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
[DEFAULT]
2+
timeout = 0
3+
{% if "genericswitch" in kolla_neutron_ml2_mechanism_drivers %}
4+
# We are increasing the RPC response timeouts to 6 minutes due to the neutron
5+
# generic switch driver, which synchronously applies switch configuration for
6+
# each ironic port during node provisioning and tear down.
7+
# The specific API calls that require this long timeout are:
8+
# - Creation and deletion of VLAN networks.
9+
# - Creation or update of ports, adding binding information.
10+
# - Update of ports, removing binding information.
11+
# - Deletion of ports.
12+
rpc_response_timeout = 360
13+
{% endif %}
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
[DEFAULT]
2+
# Avoid some timeouts of heartbeats and vif deletes
3+
rpc_response_timeout = 360
4+
5+
[neutron]
6+
timeout = 300
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
[DEFAULT]
2+
# Make direct deploy faster, transfer sparse qcow2 images
3+
force_raw_images = False
4+
# Avoid some rpc timeouts
5+
rpc_response_timeout = 360
6+
7+
[conductor]
8+
automated_clean=true
9+
# We have busy conductors failing to heartbeat
10+
# Default is 10 secs
11+
heartbeat_interval = 30
12+
# Default is 60 seconds
13+
heartbeat_timeout = 360
14+
sync_local_state_interval = 360
15+
16+
# Normally this is 100. We see eventlet threads
17+
# not making much progress, so for safety reduce
18+
# this by half, which should leave work on the rabbit queue
19+
workers_pool_size = 50
20+
# Normally this is 8, keep it the same
21+
period_max_workers = 8
22+
23+
# Increase power sync interval to reduce load
24+
sync_power_state_interval = 120
25+
power_failure_recovery_interval = 120
26+
# Check for orphan allocations less often for now
27+
check_allocations_interval = 120
28+
29+
# Wait much longer before provision timeout check, to reduce background load
30+
# The default is 60 seconds
31+
check_provision_state_interval = 120
32+
check_rescue_state_interval = 120
33+
34+
[database]
35+
# Usually this is 50, reduce to stop DB connection timeouts
36+
# and instead just make eventlet threads wait a bit longer
37+
max_overflow = 5
38+
# By default this is 30 seconds, but as we reduce
39+
# the pool overflow, some people will need to wait longer
40+
pool_timeout = 60
41+
42+
[deploy]
43+
# Force Hardware assisted secure erase by default.
44+
erase_devices_priority=10
45+
erase_devices_metadata_priority=0
46+
47+
[pxe]
48+
# Increase cache size to 120 GiB (122880 MiB) and TTL to 70 days (100800 minutes)
49+
image_cache_size = 122880
50+
image_cache_ttl = 100800
51+
52+
[neutron]
53+
# Increase the neutron client timeout to allow for the slow management
54+
# switches.
55+
timeout = 300
56+
request_timeout = 300
57+
58+
[glance]
59+
# Retry image download at least once on failure
60+
num_retries = 1
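
To make the conductor heartbeat settings above easier to reason about, here is a small sketch that derives how many consecutive heartbeats can be missed before the timeout elapses. The values are copied from the ``[conductor]`` section of this file; the ``missed_heartbeats_tolerated`` helper is hypothetical, and the calculation is a simple ratio rather than a description of how Ironic evaluates conductor liveness internally.

```python
import configparser

# Heartbeat settings copied from the [conductor] section above.
CONDUCTOR_CONF = """
[conductor]
heartbeat_interval = 30
heartbeat_timeout = 360
"""


def missed_heartbeats_tolerated(conf_text):
    """Return how many heartbeat intervals fit inside the heartbeat timeout.

    Hypothetical helper for illustration only; both options are taken to
    be expressed in seconds, as the comments in the file state.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    section = parser["conductor"]
    interval = section.getint("heartbeat_interval")
    timeout = section.getint("heartbeat_timeout")
    return timeout // interval


print(missed_heartbeats_tolerated(CONDUCTOR_CONF))
```

With the defaults quoted in the comments (10 and 60 seconds), the same calculation gives 6, so this tuning doubles the tolerance for busy conductors while sending heartbeats less often.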
Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
[DEFAULT]
2+
{% if kolla_enable_ironic | bool and "genericswitch" in kolla_neutron_ml2_mechanism_drivers %}
3+
# We are increasing the RPC response timeouts to 6 minutes due to the neutron
4+
# generic switch driver, which synchronously applies switch configuration for
5+
# each ironic port during node provisioning and tear down.
6+
# The specific API calls that require this long timeout are:
7+
# - Creation and deletion of VLAN networks.
8+
# - Creation or update of ports, adding binding information.
9+
# - Update of ports, removing binding information.
10+
# - Deletion of ports.
11+
rpc_response_timeout = 360
12+
{% endif %}
Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
[DEFAULT]
2+
{% if kolla_enable_ironic | bool and "genericswitch" in kolla_neutron_ml2_mechanism_drivers %}
3+
# We are increasing the RPC response timeouts to 6 minutes due to the neutron
4+
# generic switch driver, which synchronously applies switch configuration for
5+
# each ironic port during node provisioning and tear down.
6+
# The specific API calls that require this long timeout are:
7+
# - Creation and deletion of VLAN networks.
8+
# - Creation or update of ports, adding binding information.
9+
# - Update of ports, removing binding information.
10+
# - Deletion of ports.
11+
rpc_response_timeout = 360
12+
{% endif %}
