From 5ca6b834bbb03fb99b3ca3f203e7da47da50e3e8 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 22 May 2024 16:36:41 +0100 Subject: [PATCH 01/42] Add openstack projects & users management doc --- ...penstack-projects-and-users-management.rst | 38 +++++++++++++++++++ 1 file changed, 38 insertions(+) create mode 100644 doc/source/operations/openstack-projects-and-users-management.rst diff --git a/doc/source/operations/openstack-projects-and-users-management.rst b/doc/source/operations/openstack-projects-and-users-management.rst new file mode 100644 index 0000000000..676db20cab --- /dev/null +++ b/doc/source/operations/openstack-projects-and-users-management.rst @@ -0,0 +1,38 @@ +======================================= +Openstack Projects and Users Management +======================================= + +Projects (in OpenStack) can be defined in the ``openstack-config`` repository + +To initialise the working environment for ``openstack-config``: + +.. code-block:: console + + git clone ~/src/openstack-config + python3 -m venv ~/venvs/openstack-config-venv + source ~/venvs/openstack-config-venv/bin/activate + cd ~/src/openstack-config + pip install -U pip + pip install -r requirements.txt + ansible-galaxy collection install \ + -p ansible/collections \ + -r requirements.yml + +To define a new project, add a new project to +``etc/openstack-config/openstack-config.yml``: + +Example invocation: + +.. code-block:: console + + source ~/src/kayobe-config/etc/kolla/public-openrc.sh + source ~/venvs/openstack-config-venv/bin/activate + cd ~/src/openstack-config + tools/openstack-config -- --vault-password-file + +Deleting Users and Projects +--------------------------- + +Ansible is designed for adding configuration that is not present; removing +state is less easy. To remove a project or user, the configuration should be +manually removed. From 6b835765cae53facf7eb366f121ed6d0834b73bc Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 23 May 2024 15:47:31 +0100 Subject: [PATCH 02/42] Add horizon customisation doc --- doc/source/operations/customising_horizon.rst | 167 ++++++++++++++++++ 1 file changed, 167 insertions(+) create mode 100644 doc/source/operations/customising_horizon.rst diff --git a/doc/source/operations/customising_horizon.rst b/doc/source/operations/customising_horizon.rst new file mode 100644 index 0000000000..1f8977a31e --- /dev/null +++ b/doc/source/operations/customising_horizon.rst @@ -0,0 +1,167 @@ +.. include:: vars.rst + +==================================== +Customising Horizon +==================================== + +Horizon is the most frequent site-specific container customisation required: +other customisations tend to be common across deployments, but personalisation +of Horizon is unique to each institution. + +This describes a simple process for customising the Horizon theme. + +Creating a custom Horizon theme +------------------------------- + +A simple custom theme for Horizon can be implemented as small modifications of +an existing theme, such as the `Default +`__ +one. + +A theme contains at least two files: ``static/_styles.scss``, which can be empty, and +``static/_variables.scss``, which can reference another theme like this: + +.. code-block:: scss + + @import "/themes/default/variables"; + @import "/themes/default/styles"; + +Some resources such as logos can be overridden by dropping SVG image files into +``static/img`` (since the Ocata release, files must be SVG instead of PNG). See +`the Horizon documentation +`__ +for more details. 
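+
+For orientation, a minimal theme of this kind might end up with a layout
+similar to the following (``mytheme`` is an illustrative name; ``logo.svg``
+and ``logo-splash.svg`` mirror the file names used by the default theme):
+
+.. code-block:: console
+
+   mytheme/
+   └── static/
+       ├── _styles.scss
+       ├── _variables.scss
+       └── img/
+           ├── logo.svg          # replaces the top-left navigation logo
+           └── logo-splash.svg   # replaces the splash (login) screen logo
+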
+ +Content on some pages such as the splash (login) screen can be updated using +templates. + +See `our example horizon-theme `__ +which inherits from the default theme and includes: + +* a custom splash screen logo +* a custom top-left logo +* a custom message on the splash screen + +Further reading: + +* https://docs.openstack.org/horizon/latest/configuration/customizing.html +* https://docs.openstack.org/horizon/latest/configuration/themes.html +* https://docs.openstack.org/horizon/latest/configuration/branding.html + +Building a Horizon container image with custom theme +---------------------------------------------------- + +Building a custom container image for Horizon can be done by modifying +``kolla.yml`` to fetch the custom theme and include it in the image: + +.. code-block:: yaml + :substitutions: + + kolla_sources: + horizon-additions-theme-: + type: "git" + location: + reference: master + + kolla_build_blocks: + horizon_footer: | + # Binary images cannot use the additions mechanism. + {% raw %} + {% if install_type == 'source' %} + ADD additions-archive / + RUN mkdir -p /etc/openstack-dashboard/themes/ \ + && cp -R /additions/horizon-additions-theme--archive-master/* /etc/openstack-dashboard/themes// \ + && chown -R horizon: /etc/openstack-dashboard/themes + {% endif %} + {% endraw %} + +If using a specific container image tag, don't forget to set: + +.. code-block:: yaml + + kolla_tag: mytag + +Build the image with: + +.. code-block:: console + + kayobe overcloud container image build horizon -e kolla_install_type=source --push + +Pull the new Horizon container to the controller: + +.. code-block:: console + + kayobe overcloud container image pull --kolla-tags horizon + +Deploy and use the custom theme +------------------------------- + +Switch to source image type in ``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: + +.. code-block:: yaml + + horizon_install_type: source + +You may also need to update the container image tag: + +.. code-block:: yaml + + horizon_tag: mytag + +Configure Horizon to include the custom theme and use it by default: + +.. code-block:: console + + mkdir -p ${KAYOBE_CONFIG_PATH}/kolla/config/horizon/ + +Add to ``${KAYOBE_CONFIG_PATH}/kolla/config/horizon/custom_local_settings``: + +.. code-block:: console + + AVAILABLE_THEMES = [ + ('default', 'Default', 'themes/default'), + ('material', 'Material', 'themes/material'), + ('', '', '/etc/openstack-dashboard/themes/'), + ] + DEFAULT_THEME = '' + +You can also set other customisations in this file, such as the HTML title of the page: + +.. code-block:: console + + SITE_BRANDING = "" + +Deploy with: + +.. code-block:: console + + kayobe overcloud service reconfigure --kolla-tags horizon + +Troubleshooting +--------------- + +Make sure you build source images, as binary images cannot use the addition +mechanism used here. + +If the theme is selected but the logo doesn’t load, try running these commands +inside the ``horizon`` container: + +.. code-block:: console + + /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py collectstatic --noinput --clear + /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py compress --force + settings_bundle | md5sum > /var/lib/kolla/.settings.md5sum.txt + +Alternatively, try changing anything in ``custom_local_settings`` and restarting +the ``horizon`` container. + +If the ``horizon`` container is restarting with the following error: + +.. 
code-block:: console + + /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py compress --force + CommandError: An error occurred during rendering /var/lib/kolla/venv/lib/python3.6/site-packages/openstack_dashboard/templates/horizon/_scripts.html: Couldn't find any precompiler in COMPRESS_PRECOMPILERS setting for mimetype '\'text/javascript\''. + +It can be resolved by dropping cached content with ``docker restart +memcached``. Note this will log out users from Horizon, as Django sessions are +stored in Memcached. From ac8bcbadbbe29524c29c95d35e55f4c84b8d54d8 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 23 May 2024 16:05:38 +0100 Subject: [PATCH 03/42] Add ceph management doc --- doc/source/operations/ceph-management.rst | 123 ++++++++++++++++++++++ 1 file changed, 123 insertions(+) create mode 100644 doc/source/operations/ceph-management.rst diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst new file mode 100644 index 0000000000..754a6deb9c --- /dev/null +++ b/doc/source/operations/ceph-management.rst @@ -0,0 +1,123 @@ +========================== +Managing Ceph with Cephadm +========================== + +cephadm configuration location +============================== + +In kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific +Kayobe environment when using multiple environment, e.g. +``etc/kayobe/environments/production/cephadm.yml``) + +StackHPC's cephadm Ansible collection relies on multiple inventory groups: + +- ``mons`` +- ``mgrs`` +- ``osds`` +- ``rgws`` (optional) + +Those groups are usually defined in ``etc/kayobe/inventory/groups``. + +Running cephadm playbooks +========================= + +In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of +cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. + +- ``cephadm.yml`` - runs the end to end process starting with deployment and + defining EC profiles/crush rules/pools and users +- ``cephadm-crush-rules.yml`` - defines Ceph crush rules according +- ``cephadm-deploy.yml`` - runs the bootstrap/deploy playbook without the + additional playbooks +- ``cephadm-ec-profiles.yml`` - defines Ceph EC profiles +- ``cephadm-gather-keys.yml`` - gather Ceph configuration and keys and populate + kayobe-config +- ``cephadm-keys.yml`` - defines Ceph users/keys +- ``cephadm-pools.yml`` - defines Ceph pools\ + +Running Ceph commands +===================== + +Ceph commands are usually run inside a ``cephadm shell`` utility container: + +.. code-block:: console + + # From the node that runs Ceph + ceph# sudo cephadm shell + +Operating a cluster requires a keyring with an admin access to be available for Ceph +commands. Cephadm will copy such keyring to the nodes carrying +`_admin `__ +label - present on MON servers by default when using +`StackHPC Cephadm collection `__. + +Adding a new storage node +========================= + +Add a node to a respective group (e.g. osds) and run ``cephadm-deploy.yml`` +playbook. + +.. note:: + To add other node types than osds (mons, mgrs, etc) you need to specify + ``-e cephadm_bootstrap=True`` on playbook run. + +Removing a storage node +======================= + +First drain the node + +.. code-block:: console + + ceph# cephadm shell + ceph# ceph orch host drain + +Once all daemons are removed - you can remove the host: + +.. 
code-block:: console + + ceph# cephadm shell + ceph# ceph orch host rm + +And then remove the host from inventory (usually in +``etc/kayobe/inventory/overcloud``) + +Additional options/commands may be found in +`Host management `_ + +Replacing a Failed Ceph Drive +============================= + +Once an OSD has been identified as having a hardware failure, +the affected drive will need to be replaced. + +If rebooting a Ceph node, first set ``noout`` to prevent excess data +movement: + +.. code-block:: console + + ceph# cephadm shell + ceph# ceph osd set noout + +Reboot the node and replace the drive + +Unset noout after the node is back online + +.. code-block:: console + + ceph# cephadm shell + ceph# ceph osd unset noout + +Remove the OSD using Ceph orchestrator command: + +.. code-block:: console + + ceph# cephadm shell + ceph# ceph orch osd rm --replace + +After removing OSDs, if the drives the OSDs were deployed on once again become +available, cephadm may automatically try to deploy more OSDs on these drives if +they match an existing drivegroup spec. +If this is not your desired action plan - it's best to modify the drivegroup +spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). +Either set ``unmanaged: true`` to stop cephadm from picking up new disks or +modify it in some way that it no longer matches the drives you want to remove. From e5b7a77275496173eb0ba06cf67c246fdd6177f7 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 28 May 2024 13:19:20 +0100 Subject: [PATCH 04/42] Add ceph operation doc --- doc/source/operations/ceph-management.rst | 178 ++++++++++++++++++++-- 1 file changed, 167 insertions(+), 11 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 754a6deb9c..8e3d1f4e94 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -1,13 +1,16 @@ -========================== -Managing Ceph with Cephadm -========================== +=========================== +Managing and Operating Ceph +=========================== + +Working with Cephadm +==================== cephadm configuration location -============================== +------------------------------ In kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific Kayobe environment when using multiple environment, e.g. -``etc/kayobe/environments/production/cephadm.yml``) +``etc/kayobe/environments//cephadm.yml``) StackHPC's cephadm Ansible collection relies on multiple inventory groups: @@ -19,7 +22,7 @@ StackHPC's cephadm Ansible collection relies on multiple inventory groups: Those groups are usually defined in ``etc/kayobe/inventory/groups``. Running cephadm playbooks -========================= +------------------------- In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. @@ -36,7 +39,7 @@ cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. - ``cephadm-pools.yml`` - defines Ceph pools\ Running Ceph commands -===================== +--------------------- Ceph commands are usually run inside a ``cephadm shell`` utility container: @@ -47,12 +50,12 @@ Ceph commands are usually run inside a ``cephadm shell`` utility container: Operating a cluster requires a keyring with an admin access to be available for Ceph commands. 
Cephadm will copy such keyring to the nodes carrying -`_admin `__ +`_admin `__ label - present on MON servers by default when using `StackHPC Cephadm collection `__. Adding a new storage node -========================= +------------------------- Add a node to a respective group (e.g. osds) and run ``cephadm-deploy.yml`` playbook. @@ -62,7 +65,7 @@ playbook. ``-e cephadm_bootstrap=True`` on playbook run. Removing a storage node -======================= +----------------------- First drain the node @@ -85,7 +88,7 @@ Additional options/commands may be found in `Host management `_ Replacing a Failed Ceph Drive -============================= +----------------------------- Once an OSD has been identified as having a hardware failure, the affected drive will need to be replaced. @@ -121,3 +124,156 @@ If this is not your desired action plan - it's best to modify the drivegroup spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). Either set ``unmanaged: true`` to stop cephadm from picking up new disks or modify it in some way that it no longer matches the drives you want to remove. + + +Operations +========== + +Replacing drive +--------------- + +See upstream documentation: +https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd + +In case where disk holding DB and/or WAL fails, it is necessary to recreate +(using replacement procedure above) all OSDs that are associated with this +disk - usually NVMe drive. The following single command is sufficient to +identify which OSDs are tied to which physical disks: + +.. code-block:: console + + ceph# ceph device ls + +Host maintenance +---------------- + +https://docs.ceph.com/en/latest/cephadm/host-management/#maintenance-mode + +Upgrading +--------- + +https://docs.ceph.com/en/latest/cephadm/upgrade/ + + +Troubleshooting +=============== + +Investigating a Failed Ceph Drive +--------------------------------- + +A failing drive in a Ceph cluster will cause OSD daemon to crash. +In this case Ceph will go into `HEALTH_WARN` state. +Ceph can report details about failed OSDs by running: + +.. code-block:: console + + ceph# ceph health detail + +.. note :: + + Remember to run ceph/rbd commands from within ``cephadm shell`` + (preferred method) or after installing Ceph client. Details in the + official `documentation `__. + It is also required that the host where commands are executed has admin + Ceph keyring present - easiest to achieve by applying + `_admin `__ + label (Ceph MON servers have it by default when using + `StackHPC Cephadm collection `__). + +A failed OSD will also be reported as down by running: + +.. code-block:: console + + ceph# ceph osd tree + +Note the ID of the failed OSD. + +The failed disk is usually logged by the Linux kernel too: + +.. code-block:: console + + storage-0# dmesg -T + +Cross-reference the hardware device and OSD ID to ensure they match. +(Using `pvs` and `lvs` may help make this connection). + +Inspecting a Ceph Block Device for a VM +--------------------------------------- + +To find out what block devices are attached to a VM, go to the hypervisor that +it is running on (an admin-level user can see this from ``openstack server +show``). + +On this hypervisor, enter the libvirt container: + +.. code-block:: console + :substitutions: + + |hypervisor_hostname|# docker exec -it nova_libvirt /bin/bash + +Find the VM name using libvirt: + +.. 
code-block:: console + :substitutions: + + (nova-libvirt)[root@|hypervisor_hostname| /]# virsh list + Id Name State + ------------------------------------ + 1 instance-00000001 running + +Now inspect the properties of the VM using ``virsh dumpxml``: + +.. code-block:: console + :substitutions: + + (nova-libvirt)[root@|hypervisor_hostname| /]# virsh dumpxml instance-00000001 | grep rbd + + +On a Ceph node, the RBD pool can be inspected and the volume extracted as a RAW +block image: + +.. code-block:: console + :substitutions: + + ceph# rbd ls |nova_rbd_pool| + ceph# rbd export |nova_rbd_pool|/51206278-e797-4153-b720-8255381228da_disk blob.raw + +The raw block device (blob.raw above) can be mounted using the loopback device. + +Inspecting a QCOW Image using LibGuestFS +---------------------------------------- + +The virtual machine's root image can be inspected by installing +libguestfs-tools and using the guestfish command: + +.. code-block:: console + + ceph# export LIBGUESTFS_BACKEND=direct + ceph# guestfish -a blob.qcow + > run + 100% [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 00:00 + > list-filesystems + /dev/sda1: ext4 + > mount /dev/sda1 / + > ls / + bin + boot + dev + etc + home + lib + lib64 + lost+found + media + mnt + opt + proc + root + run + sbin + srv + sys + tmp + usr + var + > quit From f4b2630557c284c24bf6513344d8b3120cacff80 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 28 May 2024 13:19:59 +0100 Subject: [PATCH 05/42] Add openstack operation docs --- .../operations/control-plane-operation.rst | 391 ++++++++++++++++++ doc/source/operations/migrating-vm.rst | 22 + .../operations/openstack-reconfiguration.rst | 186 +++++++++ 3 files changed, 599 insertions(+) create mode 100644 doc/source/operations/control-plane-operation.rst create mode 100644 doc/source/operations/migrating-vm.rst create mode 100644 doc/source/operations/openstack-reconfiguration.rst diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst new file mode 100644 index 0000000000..c5c629d52f --- /dev/null +++ b/doc/source/operations/control-plane-operation.rst @@ -0,0 +1,391 @@ +======================= +Operating Control Plane +======================= + +Backup of the OpenStack Control Plane +===================================== + +As the backup procedure is constantly changing, it is normally best to check +the upstream documentation for an up to date procedure. Here is a high level +overview of the key things you need to backup: + +Controllers +----------- + +* `Back up SQL databases `__ +* `Back up configuration in /etc/kolla `__ + +Compute +------- + +The compute nodes can largely be thought of as ephemeral, but you do need to +make sure you have migrated any instances and disabled the hypervisor before +decommissioning or making any disruptive configuration change. + +Monitoring +---------- + +* `Back up InfluxDB `__ +* `Back up ElasticSearch `__ +* `Back up Prometheus `__ + +Seed +---- + +* `Back up bifrost `__ + +Ansible control host +-------------------- + +* Back up service VMs such as the seed VM + +Control Plane Monitoring +======================== + +The control plane has been configured to collect logs centrally using the EFK +stack (Elasticsearch, Fluentd and Kibana). + +Telemetry monitoring of the control plane is performed by Prometheus. Metrics +are collected by Prometheus exporters, which are either running on all hosts +(e.g. 
node exporter), on specific hosts (e.g. controllers for the memcached +exporter or monitoring hosts for the OpenStack exporter). These exporters are +scraped by the Prometheus server. + +Configuring Prometheus Alerts +----------------------------- + +Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` +files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add +custom rules. + +Silencing Prometheus Alerts +--------------------------- + +Sometimes alerts must be silenced because the root cause cannot be resolved +right away, such as when hardware is faulty. For example, an unreachable +hypervisor will produce several alerts: + +* ``InstanceDown`` from Node Exporter +* ``OpenStackServiceDown`` from the OpenStack exporter, which reports status of + the ``nova-compute`` agent on the host +* ``PrometheusTargetMissing`` from several Prometheus exporters + +Rather than silencing each alert one by one for a specific host, a silence can +apply to multiple alerts using a reduced list of labels. :ref:`Log into +Alertmanager `, click on the ``Silence`` button next +to an alert and adjust the matcher list to keep only ``instance=`` +label. Then, create another silence to match ``hostname=`` (this is +required because, for the OpenStack exporter, the instance is the host running +the monitoring service rather than the host being monitored). + +.. note:: + + After creating the silence, you may get redirected to a 404 page. This is a + `known issue `__ + when running several Alertmanager instances behind HAProxy. + +Generating Alerts from Metrics +++++++++++++++++++++++++++++++ + +Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` +files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add +custom rules. + +Control Plane Shutdown Procedure +================================ + +Overview +-------- + +* Verify integrity of clustered components (RabbitMQ, Galera, Keepalived). They + should all report a healthy status. +* Put node into maintenance mode in bifrost to prevent it from automatically + powering back on +* Shutdown down nodes one at a time gracefully using systemctl poweroff + +Controllers +----------- + +If you are restarting the controllers, it is best to do this one controller at +a time to avoid the clustered components losing quorum. + +Checking Galera state ++++++++++++++++++++++ + +On each controller perform the following: + +.. code-block:: console + + [stack@controller0 ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" + Variable_name Value + wsrep_local_state_comment Synced + +The password can be found using: + +.. code-block:: console + + kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml \ + --vault-password-file | grep ^database + +Checking RabbitMQ ++++++++++++++++++ + +RabbitMQ health is determined using the command ``rabbitmqctl cluster_status``: + +.. code-block:: console + + [stack@controller0 ~]$ docker exec rabbitmq rabbitmqctl cluster_status + Cluster status of node rabbit@controller0 ... + [{nodes,[{disc,['rabbit@controller0','rabbit@controller1', + 'rabbit@controller2']}]}, + {running_nodes,['rabbit@controller1','rabbit@controller2', + 'rabbit@controller0']}, + {cluster_name,<<"rabbit@controller2">>}, + {partitions,[]}, + {alarms,[{'rabbit@controller1',[]}, + {'rabbit@controller2',[]}, + {'rabbit@controller0',[]}]}] + +Checking Keepalived ++++++++++++++++++++ + +On (for example) three controllers: + +.. 
code-block:: console + + [stack@controller0 ~]$ docker logs keepalived + +Two instances should show: + +.. code-block:: console + + VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE + +and the other: + +.. code-block:: console + + VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE + +Ansible Control Host +-------------------- + +The Ansible control host is not enrolled in bifrost. This node may run services +such as the seed virtual machine which will need to be gracefully powered down. + +Compute +------- + +If you are shutting down a single hypervisor, to avoid down time to tenants it +is advisable to migrate all of the instances to another machine. See +:ref:`evacuating-all-instances`. + +.. ifconfig:: deployment['ceph_managed'] + + Ceph + ---- + + The following guide provides a good overview: + https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph + +Shutting down the seed VM +------------------------- + +.. code-block:: console + + kayobe# virsh shutdown + +.. _full-shutdown: + +Full shutdown +------------- + +In case a full shutdown of the system is required, we advise to use the +following order: + +* Perform a graceful shutdown of all virtual machine instances +* Shut down compute nodes +* Shut down monitoring node +* Shut down network nodes (if separate from controllers) +* Shut down controllers +* Shut down Ceph nodes (if applicable) +* Shut down seed VM +* Shut down Ansible control host + +Rebooting a node +---------------- + +Example: Reboot all compute hosts apart from compute0: + +.. code-block:: console + + kayobe# kayobe overcloud host command run --limit 'compute:!compute0' -b --command "shutdown -r" + +References +---------- + +* https://galeracluster.com/library/training/tutorials/restarting-cluster.html + +Control Plane Power on Procedure +================================ + +Overview +-------- + +* Remove the node from maintenance mode in bifrost +* Bifrost should automatically power on the node via IPMI +* Check that all docker containers are running +* Check Kibana for any messages with log level ERROR or equivalent + +Controllers +----------- + +If all of the servers were shut down at the same time, it is necessary to run a +script to recover the database once they have all started up. This can be done +with the following command: + +.. code-block:: console + + kayobe# kayobe overcloud database recover + +Ansible Control Host +-------------------- + +The Ansible control host is not enrolled in Bifrost and will have to be powered +on manually. + +Seed VM +------- + +The seed VM (and any other service VM) should start automatically when the seed +hypervisor is powered on. If it does not, it can be started with: + +.. code-block:: console + + kayobe# virsh start seed-0 + +Full power on +------------- + +Follow the order in :ref:`full-shutdown`, but in reverse order. + +Shutting Down / Restarting Monitoring Services +---------------------------------------------- + +Shutting down ++++++++++++++ + +Log into the monitoring host(s): + +.. code-block:: console + + kayobe# ssh stack@monitoring0 + +Stop all Docker containers: + +.. code-block:: console + + monitoring0# for i in `docker ps -q`; do docker stop $i; done + +Shut down the node: + +.. code-block:: console + + monitoring0# sudo shutdown -h + +Starting up ++++++++++++ + +The monitoring services containers will automatically start when the monitoring +node is powered back on. 
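+
+To confirm that everything has come back cleanly, a quick check of the
+container state on the monitoring host is usually enough (container names
+vary between deployments):
+
+.. code-block:: console
+
+   monitoring0# docker ps --format '{{.Names}}: {{.Status}}'
+
+Any container stuck in a restart loop will show up here with a ``Restarting``
+status.
+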
+ +Software Updates +================ + +Update Packages on Control Plane +-------------------------------- + +OS packages can be updated with: + +.. code-block:: console + + kayobe# kayobe overcloud host package update --limit --packages '*' + kayobe# kayobe overcloud seed package update --packages '*' + +See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages + +Minor Upgrades to OpenStack Services +------------------------------------ + +* Pull latest changes from upstream stable branch to your own ``kolla`` fork (if applicable) +* Update ``kolla_openstack_release`` in ``etc/kayobe/kolla.yml`` (unless using default) +* Update tags for the images in ``etc/kayobe/kolla/globals.yml`` to use the new value of ``kolla_openstack_release`` +* Rebuild container images +* Pull container images to overcloud hosts +* Run kayobe overcloud service upgrade + +For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html + +Troubleshooting +=============== + +Deploying to a Specific Hypervisor +---------------------------------- + +To test creating an instance on a specific hypervisor, *as an admin-level user* +you can specify the hypervisor name as part of an extended availability zone +description. + +To see the list of hypervisor names: + +.. code-block:: console + + admin# openstack hypervisor list + +To boot an instance on a specific hypervisor + +.. code-block:: console + + admin# openstack server create --flavor --network --key-name --image --availability-zone nova:: + +Cleanup Procedures +================== + +OpenStack services can sometimes fail to remove all resources correctly. This +is the case with Magnum, which fails to clean up users in its domain after +clusters are deleted. `A patch has been submitted to stable branches +`__. +Until this fix becomes available, if Magnum is in use, administrators can +perform the following cleanup procedure regularly: + +.. code-block:: console + + admin# for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do + if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then + echo "$user still in use, not deleting" + else + openstack user delete --domain magnum $user + fi + done + +OpenSearch indexes retention +============================= + +To alter default rotation values for OpenSearch, edit + +``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: + +.. code-block:: console + # Duration after which index is closed (default 30) + opensearch_soft_retention_period_days: 90 + # Duration after which index is deleted (default 60) + opensearch_hard_retention_period_days: 180 + +Reconfigure Opensearch with new values: + +.. code-block:: console + kayobe overcloud service reconfigure --kolla-tags opensearch + +For more information see the `upstream documentation + +`__. diff --git a/doc/source/operations/migrating-vm.rst b/doc/source/operations/migrating-vm.rst new file mode 100644 index 0000000000..784abe74a4 --- /dev/null +++ b/doc/source/operations/migrating-vm.rst @@ -0,0 +1,22 @@ +========================== +Migrating virtual machines +========================== + +To see where all virtual machines are running on the hypervisors: + +.. code-block:: console + + admin# openstack server list --all-projects --long + +To move a virtual machine with shared storage or booted from volume from one hypervisor to another, for example to +hypervisor-01: + +.. 
code-block:: console + + admin# openstack server migrate --live-migration --host hypervisor-01 + +To move a virtual machine with local disks: + +.. code-block:: console + + admin# openstack server migrate --live-migration --block-migration --host hypervisor-01 diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst new file mode 100644 index 0000000000..dfba372f26 --- /dev/null +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -0,0 +1,186 @@ +========================= +OpenStack Reconfiguration +========================= + +Disabling a Service +------------------- + +Ansible is oriented towards adding or reconfiguring services, but removing a +service is handled less well, because of Ansible's imperative style. + +To remove a service, it is disabled in Kayobe's Kolla config, which prevents +other services from communicating with it. For example, to disable +``cinder-backup``, edit ``${KAYOBE_CONFIG_PATH}/kolla.yml``: + +.. code-block:: diff + + -enable_cinder_backup: true + +enable_cinder_backup: false + +Then, reconfigure Cinder services with Kayobe: + +.. code-block:: console + + kayobe# kayobe overcloud service reconfigure --kolla-tags cinder + +However, the service itself, no longer in Ansible's manifest of managed state, +must be manually stopped and prevented from restarting. + +On each controller: + +.. code-block:: console + + kayobe# docker rm -f cinder_backup + +Some services may store data in a dedicated Docker volume, which can be removed +with ``docker volume rm``. + +Installing TLS Certificates +--------------------------- + +To configure TLS for the first time, we write the contents of a PEM +file to the ``secrets.yml`` file as ``secrets_kolla_external_tls_cert``. +Use a command of this form: + +.. code-block:: console + + kayobe# ansible-vault edit ${KAYOBE_CONFIG_PATH}/secrets.yml --vault-password-file= + +Concatenate the contents of the certificate and key files to create +``secrets_kolla_external_tls_cert``. The certificates should be installed in +this order: + +* TLS certificate for the public endpoint FQDN +* Any intermediate certificates +* The TLS certificate private key + +In ``${KAYOBE_CONFIG_PATH}/kolla.yml``, set the following: + +.. code-block:: yaml + + kolla_enable_tls_external: True + kolla_external_tls_cert: "{{ secrets_kolla_external_tls_cert }}" + +To apply TLS configuration, we need to reconfigure all services, as endpoint URLs need to +be updated in Keystone: + +.. code-block:: console + + kayobe# kayobe overcloud service reconfigure + +Alternative Configuration ++++++++++++++++++++++++++ + +As an alternative to writing the certificates as a variable to +``secrets.yml``, it is also possible to write the same data to a file, +``etc/kayobe/kolla/certificates/haproxy.pem``. The file should be +vault-encrypted in the same manner as secrets.yml. In this instance, +variable ``kolla_external_tls_cert`` does not need to be defined. + +See `Kolla-Ansible TLS guide +`__ for +further details. + +Updating TLS Certificates +------------------------- + +Check the expiry date on an installed TLS certificate from a host that can +reach the OpenStack APIs: + +.. code-block:: console + :substitutions: + + openstack# openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates + +*NOTE*: Prometheus Blackbox monitoring can check certificates automatically +and alert when expiry is approaching. 
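+
+Before installing a replacement bundle, it can be worth sanity-checking the
+PEM file locally. This is a sketch, assuming the unencrypted bundle is
+available as ``haproxy.pem`` (``openssl x509`` only reads the first
+certificate in the file, which should be the public endpoint certificate):
+
+.. code-block:: console
+
+   openstack# openssl x509 -in haproxy.pem -noout -subject -dates
+   openstack# diff <(openssl x509 -in haproxy.pem -noout -pubkey) \
+                   <(openssl pkey -in haproxy.pem -pubout)
+
+The second command produces no output when the private key in the bundle
+matches the certificate.
+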
+ +To update an existing certificate, for example when it has reached expiration, +change the value of ``secrets_kolla_external_tls_cert``, in the same order as +above. Run the following command: + +.. code-block:: console + + kayobe# kayobe overcloud service reconfigure --kolla-tags haproxy + +.. _taking-a-hypervisor-out-of-service: + +Taking a Hypervisor out of Service +---------------------------------- + +To take a hypervisor out of Nova scheduling: + +.. code-block:: console + + admin# openstack compute service set --disable \ + nova-compute + +Running instances on the hypervisor will not be affected, but new instances +will not be deployed on it. + +A reason for disabling a hypervisor can be documented with the +``--disable-reason`` flag: + +.. code-block:: console + + admin# openstack compute service set --disable \ + --disable-reason "Broken drive" nova-compute + +Details about all hypervisors and the reasons they are disabled can be +displayed with: + +.. code-block:: console + + admin# openstack compute service list --long + +And then to enable a hypervisor again: + +.. code-block:: console + + admin# openstack compute service set --enable \ + nova-compute + +Managing Space in the Docker Registry +------------------------------------- + +If the Docker registry becomes full, this can prevent container updates and +(depending on the storage configuration of the seed host) could lead to other +problems with services provided by the seed host. + +To remove container images from the Docker Registry, follow this process: + +* Reconfigure the registry container to allow deleting containers. This can be + done in ``docker-registry.yml`` with Kayobe: + +.. code-block:: yaml + + docker_registry_env: + REGISTRY_STORAGE_DELETE_ENABLED: "true" + +* For the change to take effect, run: + +.. code-block:: console + + kayobe seed host configure + +* A helper script is useful, such as https://github.com/byrnedo/docker-reg-tool + (this requires ``jq``). To delete all images with a specific tag, use: + +.. code-block:: console + + for repo in `./docker_reg_tool http://registry-ip:4000 list`; do + ./docker_reg_tool http://registry-ip:4000 delete $repo $tag + done + +* Deleting the tag does not actually release the space. To actually free up + space, run garbage collection: + +.. code-block:: console + + seed# docker exec docker_registry bin/registry garbage-collect /etc/docker/registry/config.yml + +The seed host can also accrue a lot of data from building container images. +The images stored locally in the seed host can be seen using ``docker image ls``. + +Old and redundant images can be identified from their names and tags, and +removed using ``docker image rm``. From 8f304839c478550e7eaf2eb0d534e33ce123aa18 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 28 May 2024 13:20:17 +0100 Subject: [PATCH 06/42] Add wazuh operation docs --- doc/source/operations/wazuh-operation.rst | 89 +++++++++++++++++++++++ 1 file changed, 89 insertions(+) create mode 100644 doc/source/operations/wazuh-operation.rst diff --git a/doc/source/operations/wazuh-operation.rst b/doc/source/operations/wazuh-operation.rst new file mode 100644 index 0000000000..23800ff849 --- /dev/null +++ b/doc/source/operations/wazuh-operation.rst @@ -0,0 +1,89 @@ +======================= +Wazuh Security Platform +======================= + +`Wazuh `_ is a security monitoring platform. +It monitors for: + +* Security-related system events. +* Known vulnerabilities (CVEs) in versions of installed software. 
+* Misconfigurations in system security. + +One method for deploying and maintaining Wazuh is the `official +Ansible playbooks `_. These +can be integrated into ``kayobe-config`` as a custom playbook. + +Configuring Wazuh Manager +------------------------- + +Wazuh Manager is configured by editing the ``wazuh-manager.yml`` +groups vars file found at +``etc/kayobe/inventory/group_vars/wazuh-manager/``. This file +controls various aspects of Wazuh Manager configuration. +Most notably: + +*domain_name*: + The domain used by Search Guard CE when generating certificates. + +*wazuh_manager_ip*: + The IP address that the Wazuh Manager shall reside on for communicating with the agents. + +*wazuh_manager_connection*: + Used to define port and protocol for the manager to be listening on. + +*wazuh_manager_authd*: + Connection settings for the daemon responsible for registering new agents. + +Running ``kayobe playbook run +$KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml`` will deploy these +changes. + +Secrets +------- + +Wazuh requires that secrets or passwords are set for itself and the services with which it communiticates. +The playbook ``etc/kayobe/ansible/wazuh-secrets.yml`` automates the creation of these secrets, which should then be encrypted with Ansible Vault. + +To update the secrets you can execute the following two commands + +.. code-block:: shell + + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml \ + -e wazuh_user_pass=$(uuidgen) \ + -e wazuh_admin_pass=$(uuidgen) + kayobe# ansible-vault encrypt --vault-password-file \ + $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml + +Once generated, run ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml`` which copies the secrets into place. + +.. note:: Use ``ansible-vault`` to view the secrets: + + ``ansible-vault view --vault-password-file ~/vault.password $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml`` + +Adding a New Agent +------------------ +The Wazuh Agent is deployed to all hosts in the ``wazuh-agent`` +inventory group, comprising the ``seed`` group +plus the ``overcloud`` group (containing all hosts in the +OpenStack control plane). + +.. code-block:: ini + + [wazuh-agent:children] + seed + overcloud + +The following playbook deploys the Wazuh Agent to all hosts in the +``wazuh-agent`` group: + +.. code-block:: shell + + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml + +The hosts running Wazuh Agent should automatically be registered +and visible within the Wazuh Manager dashboard. + +.. note:: It is good practice to use a `Kayobe deploy hook + `_ + to automate deployment and configuration of the Wazuh Agent + following a run of ``kayobe overcloud host configure``. 
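+
+For example, such a hook can be created by symlinking the playbook into the
+relevant hooks directory. This is a sketch assuming the standard Kayobe hook
+layout; the ``50-`` prefix only controls ordering and is arbitrary:
+
+.. code-block:: shell
+
+   mkdir -p ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/post.d
+   ln -s ../../../ansible/wazuh-agent.yml \
+       ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/post.d/50-wazuh-agent.yml
+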
From 7b29ccd1f58619c8ffcb8ce0f41ba09de6632640 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 28 May 2024 16:35:11 +0100 Subject: [PATCH 07/42] Add hardware inventory management doc --- .../hardware-inventory-management.rst | 253 ++++++++++++++++++ 1 file changed, 253 insertions(+) create mode 100644 doc/source/operations/hardware-inventory-management.rst diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/hardware-inventory-management.rst new file mode 100644 index 0000000000..43fcb73aff --- /dev/null +++ b/doc/source/operations/hardware-inventory-management.rst @@ -0,0 +1,253 @@ +============================= +Hardware Inventory Management +============================= + +At its lowest level, hardware inventory is managed in the Bifrost service. + +Reconfiguring Control Plane Hardware +------------------------------------ + +If a server's hardware or firmware configuration is changed, it should be +re-inspected in Bifrost before it is redeployed into service. A single server +can be reinspected like this: + +.. code-block:: console + + kayobe# kayobe overcloud hardware inspect --limit + +.. _enrolling-new-hypervisors: + +Enrolling New Hypervisors +------------------------- + +New hypervisors can be added to the Bifrost inventory by using its discovery +capabilities. Assuming that new hypervisors have IPMI enabled and are +configured to network boot on the provisioning network, the following commands +will instruct them to PXE boot. The nodes will boot on the Ironic Python Agent +kernel and ramdisk, which is configured to extract hardware information and +send it to Bifrost. Note that IPMI credentials can be found in the encrypted +file located at ``${KAYOBE_CONFIG_PATH}/secrets.yml``. + +.. code-block:: console + + bifrost# ipmitool -I lanplus -U -H -ipmi chassis bootdev pxe + +If node is are off, power them on: + +.. code-block:: console + + bifrost# ipmitool -I lanplus -U -H -ipmi power on + +If nodes is on, reset them: + +.. code-block:: console + + bifrost# ipmitool -I lanplus -U -H -ipmi power reset + +Once node have booted and have completed introspection, they should be visible +in Bifrost: + +.. code-block:: console + + bifrost# baremetal node list --provision-state enroll + +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ + | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance | + +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ + | da0c61af-b411-41b9-8909-df2509f2059b | example-hypervisor-01 | None | power off | enroll | False | + +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ + +After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` to add these new hosts to +the correct groups, import them in Kayobe's inventory with: + +.. code-block:: console + + kayobe# kayobe overcloud inventory discover + +We can then provision and configure them: + +.. 
code-block:: console + + kayobe# kayobe overcloud provision --limit + kayobe# kayobe overcloud host configure --limit + kayobe# kayobe overcloud service deploy --limit --kolla-limit + +Replacing a Failing Hypervisor +------------------------------ + +To replace a failing hypervisor, proceed as follows: + +* :ref:`Disable the hypervisor to avoid scheduling any new instance on it ` +* :ref:`Evacuate all instances ` +* :ref:`Set the node to maintenance mode in Bifrost ` +* Physically fix or replace the node +* It may be necessary to reinspect the node if hardware was changed (this will require deprovisioning and reprovisioning) +* If the node was replaced or reprovisioned, follow :ref:`enrolling-new-hypervisors` + +To deprovision an existing hypervisor, run: + +.. code-block:: console + + kayobe# kayobe overcloud deprovision --limit + +.. warning:: + + Always use ``--limit`` with ``kayobe overcloud deprovision`` on a production + system. Running this command without a limit will deprovision all overcloud + hosts. + +.. _evacuating-all-instances: + +Evacuating all instances +------------------------ + +.. code-block:: console + + admin# openstack server evacuate $(openstack server list --host --format value --column ID) + +You should now check the status of all the instances that were running on that +hypervisor. They should all show the status ACTIVE. This can be verified with: + +.. code-block:: console + + admin# openstack server show + +Troubleshooting ++++++++++++++++ + +Servers that have been shut down +******************************** + +If there are any instances that are SHUTOFF they won’t be migrated, but you can +use ``openstack server migrate`` for them once the live migration is finished. + +Also if a VM does heavy memory access, it may take ages to migrate (Nova tries +to incrementally increase the expected downtime, but is quite conservative). +You can use ``openstack server migration force complete --os-compute-api-version 2.22 +`` to trigger the final move. + +You get the migration ID via ``openstack server migration list --server ``. + +For more details see: +http://www.danplanet.com/blog/2016/03/03/evacuate-in-nova-one-command-to-confuse-us-all/ + +Flavors have changed +******************** + +If the size of the flavors has changed, some instances will also fail to +migrate as the process needs manual confirmation. You can do this with: + +.. code-block:: console + + openstack # openstack server resize confirm + +The symptom to look out for is that the server is showing a status of ``VERIFY +RESIZE`` as shown in this snippet of ``openstack server show ``: + +.. code-block:: console + + | status | VERIFY_RESIZE | + +.. _set-bifrost-maintenance-mode: + +Set maintenance mode on a node in Bifrost ++++++++++++++++++++++++++++++++++++++++++ + +.. code-block:: console + + seed# docker exec -it bifrost_deploy /bin/bash + (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost + (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance set + +.. _unset-bifrost-maintenance-mode: + +Unset maintenance mode on a node in Bifrost ++++++++++++++++++++++++++++++++++++++++++++ + +.. 
code-block:: console + + seed# docker exec -it bifrost_deploy /bin/bash + (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost + (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance unset + +Detect hardware differences with ADVise +======================================= + +Extract Bifrost introspection data +---------------------------------- + +The ADVise tool assumes that hardware introspection data has already been gathered in JSON format. +The ``extra-hardware`` disk builder element enabled when building the IPA image for the required data to be available. + +To build ipa image with extra-hardware you need to edit ``ipa.yml`` and add this: +.. code-block:: console + + # Whether to build IPA images from source. + ipa_build_images: true + + # List of additional Diskimage Builder (DIB) elements to use when building IPA + images. Default is none. + ipa_build_dib_elements_extra: + - "extra-hardware" + + # List of additional inspection collectors to run. + ipa_collectors_extra: + - "extra-hardware" + +Extract introspection data from Bifrost with Kayobe. JSON files will be created +into ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``: + +.. code-block:: console + + kayobe# kayobe overcloud introspection data save + +Using ADVise +------------ + +Hardware information captured during the Ironic introspection process can be +analysed to detect hardware differences, such as mismatches in firmware +versions or missing storage devices. The `ADVise `__ +tool can be used for this purpose. + +The Ansible playbook ``advise-run.yml`` can be found at ``${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml``. + +The playbook will: + +1. Install ADVise and dependencies +2. Run the mungetout utility for extracting the required information from the introspection data ready for use with ADVise. +3. Run ADVise on the data. + +.. code-block:: console + + cd ${KAYOBE_CONFIG_PATH} + ansible-playbook ${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml + +The playbook has the following optional parameters: + +- venv : path to the virtual environment to use. Default: ``"~/venvs/advise-review"`` +- input_dir: path to the hardware introspection data. Default: ``"{{ lookup('env', 'PWD') }}/overcloud-introspection-data"`` +- output_dir: path to where results should be saved. Default: ``"{{ lookup('env', 'PWD') }}/review"`` +- advise-pattern: regular expression to specify what introspection data should be analysed. Default: ``".*.eval"`` + +Example command to run the tool on data about the compute nodes in a system, where compute nodes are named cpt01, cpt02, cpt03…: + +.. code-block:: console + + ansible-playbook advise-run.yml -e advise_pattern=’(cpt)(.*)(.eval)’ + + +.. note:: + The mungetout utility will always use the file extension .eval + +Using the results +----------------- + +The ADVise tool will output a selection of results found under output_dir/results these include: + +- ``.html`` files to display network visualisations of any hardware differences. +- The folder ``Paired_Comparisons`` which contains information on the shared and differing fields found between the systems. This is a reflection of the network visualisation webpage, with more detail as to what the differences are. +- ``_summary``, a listing of how the systems can be grouped into sets of identical hardware. +- ``_performance``, the results of analysing the benchmarking data gathered. 
+- ``_perf_summary``, a subset of the performance metrics, just showing any potentially anomalous data such as where variance is too high, or individual nodes have been found to over/underperform. + +To get visuallised result, It is recommanded to copy instrospection data and review directories to your +local machine then run ADVise playbook locally with the data. From 3dc7a2c5adb4d2372cc37a83a411a02a80fdbddb Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 29 May 2024 10:44:10 +0100 Subject: [PATCH 08/42] Move advise tool intro --- .../operations/hardware-inventory-management.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/hardware-inventory-management.rst index 43fcb73aff..0d6fd8adf1 100644 --- a/doc/source/operations/hardware-inventory-management.rst +++ b/doc/source/operations/hardware-inventory-management.rst @@ -172,6 +172,11 @@ Unset maintenance mode on a node in Bifrost Detect hardware differences with ADVise ======================================= +Hardware information captured during the Ironic introspection process can be +analysed to detect hardware differences, such as mismatches in firmware +versions or missing storage devices. The `ADVise `__ +tool can be used for this purpose. + Extract Bifrost introspection data ---------------------------------- @@ -203,11 +208,6 @@ into ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``: Using ADVise ------------ -Hardware information captured during the Ironic introspection process can be -analysed to detect hardware differences, such as mismatches in firmware -versions or missing storage devices. The `ADVise `__ -tool can be used for this purpose. - The Ansible playbook ``advise-run.yml`` can be found at ``${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml``. The playbook will: From 21aaa5a24492efa33b124a9e402550ef6c236cbd Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 29 May 2024 13:55:43 +0100 Subject: [PATCH 09/42] Add baremetal node management doc --- .../operations/baremetal-node-management.rst | 277 ++++++++++++++++++ 1 file changed, 277 insertions(+) create mode 100644 doc/source/operations/baremetal-node-management.rst diff --git a/doc/source/operations/baremetal-node-management.rst b/doc/source/operations/baremetal-node-management.rst new file mode 100644 index 0000000000..f45903dad9 --- /dev/null +++ b/doc/source/operations/baremetal-node-management.rst @@ -0,0 +1,277 @@ +====================================== +Bare Metal Compute Hardware Management +====================================== + +Bare metal compute nodes are managed by the Ironic services. +This section describes elements of the configuration of this service. + +.. _ironic-node-lifecycle: + +Ironic node life cycle +---------------------- + +The deployment process is documented in the `Ironic User Guide `__. +OpenStack deployment uses the +`direct deploy method `__. + +The Ironic state machine can be found `here `__. The rest of +this documentation refers to these states and assumes that you have familiarity. + +High level overview of state transitions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following section attempts to describe the state transitions for various Ironic operations at a high level. +It focuses on trying to describe the steps where dynamic switch reconfiguration is triggered. +For a more detailed overview, refer to the :ref:`ironic-node-lifecycle` section. 
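+
+While any of these operations are in progress, a node's position in the state
+machine can be followed with the Ironic CLI, for example (``<node>`` is the
+node name or UUID):
+
+.. code-block:: console
+
+   admin# openstack baremetal node list --provision-state deploying
+   admin# openstack baremetal node show <node> -f value -c provision_state -c power_state
+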
+ +Provisioning +~~~~~~~~~~~~ + +Provisioning starts when an instance is created in Nova using a bare metal flavor. + +- Node starts in the available state (available) +- User provisions an instance (deploying) +- Ironic will switch the node onto the provisioning network (deploying) +- Ironic will power on the node and will await a callback (wait-callback) +- Ironic will image the node with an operating system using the image provided at creation (deploying) +- Ironic switches the node onto the tenant network(s) via neutron (deploying) +- Transition node to active state (active) + +.. _baremetal-management-deprovisioning: + +Deprovisioning +~~~~~~~~~~~~~~ + +Deprovisioning starts when an instance created in Nova using a bare metal flavor is destroyed. + +If automated cleaning is enabled, it occurs when nodes are deprovisioned. + +- Node starts in active state (active) +- User deletes instance (deleting) +- Ironic will remove the node from any tenant network(s) (deleting) +- Ironic will switch the node onto the cleaning network (deleting) +- Ironic will power on the node and will await a callback (clean-wait) +- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) +- Ironic removes node from cleaning network (cleaning) +- Node transitions to available (available) + +If automated cleaning is disabled. + +- Node starts in active state (active) +- User deletes instance (deleting) +- Ironic will remove the node from any tenant network(s) (deleting) +- Node transitions to available (available) + +Cleaning +~~~~~~~~ + +Manual cleaning is not part of the regular state transitions when using Nova, however nodes can be manually cleaned by administrators. + +- Node starts in the manageable state (manageable) +- User triggers cleaning with API (cleaning) +- Ironic will switch the node onto the cleaning network (cleaning) +- Ironic will power on the node and will await a callback (clean-wait) +- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) +- Ironic removes node from cleaning network (cleaning) +- Node transitions back to the manageable state (manageable) + +Rescuing +~~~~~~~~ + +Feature not used. The required rescue network is not currently configured. + +Baremetal networking +-------------------- + +Baremetal networking with the Neutron Networking Generic Switch ML2 driver requires a combination of static and dynamic switch configuration. + +.. _static-switch-config: + +Static switch configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Static physical network configuration is managed via Kayobe. + +.. TODO: Fill in the switch configuration + +- Some initial switch configuration is required before networking generic switch can take over the management of an interface. + First, LACP must be configured on the switch ports attached to the baremetal node, e.g: + + .. code-block:: shell + + The interface is then partially configured: + + .. code-block:: shell + + For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network: + + .. code-block:: shell + + **NOTE**: You only need to do this if Ironic isn't aware of the node. + +Configuration with kayobe +^^^^^^^^^^^^^^^^^^^^^^^^^ + +Kayobe can be used to apply the :ref:`static-switch-config`. + +- Upstream documentation can be found `here `__. +- Kayobe does all the switch configuration that isn't :ref:`dynamically updated using Ironic `. 
+- Optionally switches the node onto the provisioning network (when using ``--enable-discovery``) + + + NOTE: This is a dangerous operation as it can wipe out the dynamic VLAN configuration applied by neutron/ironic. + You should only run this when initially enrolling a node, and should always use the ``interface-description-limit`` option. For example: + + .. code-block:: + + kayobe physical network configure --interface-description-limit --group switches --display --enable-discovery + + In this example, ``--display`` is used to preview the switch configuration without applying it. + +.. TODO: Fill in information about how switches are configured in kayobe-config, with links + +- Configuration is done using a combination of ``group_vars`` and ``host_vars`` + +.. _dynamic-switch-configuration: + +Dynamic switch configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Ironic dynamically configures the switches using the Neutron `Networking Generic Switch `_ ML2 driver. + +- Used to toggle the baremetal nodes onto different networks + + + Can use any VLAN network defined in OpenStack, providing that the VLAN has been trunked to the controllers + as this is required for DHCP to function. + + See :ref:`ironic-node-lifecycle`. This attempts to illustrate when any switch reconfigurations happen. + +- Only configures VLAN membership of the switch interfaces or port groups. To prevent conflicts with the static switch configuration, + the convention used is: after the node is in service in Ironic, VLAN membership should not be manually adjusted and + should be left to be controlled by ironic i.e *don't* use ``--enable-discovery`` without an interface limit when configuring the + switches with kayobe. +- Ironic is configured to use the neutron networking driver. + +.. _ngs-commands: + +Commands that NGS will execute +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Networking Generic Switch is mainly concerned with toggling the ports onto different VLANs. It +cannot fully configure the switch. + +.. TODO: Fill in the switch configuration + +- Switching the port onto the provisioning network + + .. code-block:: shell + +- Switching the port onto the tenant network. + + .. code-block:: shell + +- When deleting the instance, the VLANs are removed from the port. Using: + + .. code-block:: shell + +NGS will save the configuration after each reconfiguration (by default). + +Ports managed by NGS +^^^^^^^^^^^^^^^^^^^^ + +The command below extracts a list of port UUID, node UUID and switch port information. + +.. code-block:: bash + + admin# openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value + +NGS will manage VLAN membership for ports when the ``local_link_connection`` fields match one of the switches in ``ml2_conf.ini``. +The rest of the switch configuration is static. +The switch configuration that NGS will apply to these ports is detailed in :ref:`dynamic-switch-configuration`. + +.. _ironic-node-discovery: + +Ironic node discovery +--------------------- + +Discovery is a process used to automatically enrol new nodes in Ironic. +It works by PXE booting the nodes into the Ironic Python Agent (IPA) ramdisk. +This ramdisk will collect hardware and networking configuration from the node in a process known as introspection. +This data is used to populate the baremetal node object in Ironic. +The series of steps you need to take to enrol a new node is as follows: + +- Configure credentials on the BMC. These are needed for Ironic to be able to perform power control actions. 
+
+- Controllers should have network connectivity with the target BMC.
+
+- (If kayobe manages physical network) Add any additional switch configuration to kayobe config.
+  The minimal switch configuration that kayobe needs to know about is described in :ref:`tor-switch-configuration`.
+
+- Apply any :ref:`static switch configuration <static-switch-config>`. This performs the initial
+  setup of the switchports that is needed before Ironic can take over. The static configuration
+  will not be modified by Ironic, so it should be safe to reapply at any point. See :ref:`ngs-commands`
+  for details about the switch configuration that Networking Generic Switch will apply.
+
+- (If kayobe manages physical network) Put the node onto the provisioning network by using the
+  ``--enable-discovery`` flag and either ``--interface-description-limit`` or ``--interface-limit``
+  (do not run this command without one of these limits). See :ref:`static-switch-config`.
+
+  * This is only necessary to initially discover the node. Once the node is registered in Ironic,
+    it will take over control of the VLAN membership. See :ref:`dynamic-switch-configuration`.
+
+  * This provides ethernet connectivity with the controllers over the `workload provisioning` network.
+
+- (If kayobe doesn't manage physical network) Put the node onto the provisioning network.
+
+.. TODO: link to the relevant file in kayobe config
+
+- Add node to the kayobe inventory.
+
+.. TODO: Fill in details about necessary BIOS & RAID config
+
+- Apply any necessary BIOS & RAID configuration.
+
+.. TODO: Fill in details about how to trigger a PXE boot
+
+- PXE boot the node.
+
+- If the discovery process is successful, the node will appear in Ironic and will get populated with the necessary information from the hardware inspection process.
+
+.. TODO: Link to the Kayobe inventory in the repo
+
+- Add node to the Kayobe inventory in the ``baremetal-compute`` group.
+
+- The node will begin in the ``enroll`` state, and must be moved first to ``manageable``, then ``available`` before it can be used.
+
+  If Ironic automated cleaning is enabled, the node must complete a cleaning process before it can reach the available state.
+
+  * Use Kayobe to attempt to move the node to the ``available`` state.
+
+  .. code-block:: console
+
+     source etc/kolla/public-openrc.sh
+     kayobe baremetal compute provide --limit <node>
+
+- Once the node is in the ``available`` state, Nova will make the node available for scheduling. This happens periodically, and typically takes around a minute.
+
+.. _tor-switch-configuration:
+
+Top of Rack (ToR) switch configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Networking Generic Switch must be aware of the Top-of-Rack switch connected to the new node.
+Switches managed by NGS are configured in ``ml2_conf.ini``.
+
+.. TODO: Fill in details about how switches are added to NGS config in kayobe-config
+
+After adding switches to the NGS configuration, Neutron must be redeployed.
+
+Considerations when booting baremetal compared to VMs
+------------------------------------------------------
+
+- You can only use networks of type: vlan
+- Without using trunk ports, it is only possible to directly attach one network to each port or port group of an instance.
+
+  * To access other networks you can use routers
+  * You can still attach floating IPs
+
+- Instances take much longer to provision (expect at least 15 mins)
+- When booting an instance, use one of the flavors that maps to a baremetal node via the RESOURCE_CLASS configured on the flavor (see the example below).
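+
+As an illustrative sketch only (the flavor name, sizes and resource class below are assumptions, not values taken
+from this deployment), such a flavor zeroes out the standard resource classes and requests one unit of the node's
+custom resource class:
+
+.. code-block:: console
+
+   openstack flavor create --ram 196608 --disk 480 --vcpus 64 my-baremetal-flavor
+   openstack flavor set my-baremetal-flavor \
+     --property resources:VCPU=0 \
+     --property resources:MEMORY_MB=0 \
+     --property resources:DISK_GB=0 \
+     --property resources:CUSTOM_BAREMETAL_GENERAL=1
+
+The ``CUSTOM_`` resource class name must correspond to the ``resource_class`` set on the Ironic nodes, upper-cased
+and prefixed with ``CUSTOM_``.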
From 5482670eb448fb2bd6567a2f28015f15f5e63416 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 29 May 2024 14:44:19 +0100 Subject: [PATCH 10/42] Add gpu doc --- doc/source/operations/gpu-in-openstack.rst | 1124 ++++++++++++++++++++ 1 file changed, 1124 insertions(+) create mode 100644 doc/source/operations/gpu-in-openstack.rst diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst new file mode 100644 index 0000000000..66170c6800 --- /dev/null +++ b/doc/source/operations/gpu-in-openstack.rst @@ -0,0 +1,1124 @@ +.. include:: vars.rst + +============================= +Support for GPUs in OpenStack +============================= + +NVIDIA Virtual GPU +################## + +BIOS configuration +------------------ + +Intel +^^^^^ + +* Enable `VT-x` in the BIOS for virtualisation support. +* Enable `VT-d` in the BIOS for IOMMU support. + +Dell +^^^^ + +Enabling SR-IOV with `racadm`: + +.. code:: shell + + /opt/dell/srvadmin/bin/idracadm7 set BIOS.IntegratedDevices.SriovGlobalEnable Enabled + /opt/dell/srvadmin/bin/idracadm7 jobqueue create BIOS.Setup.1-1 + + + +Obtain driver from NVIDIA licensing portal +------------------------------------------- + +Download Nvidia GRID driver from `here `__ +(This requires a login). The file can either be placed on the :ref:`ansible control host` or :ref:`uploaded to pulp`. + +.. _NVIDIA Pulp: + +Uploading the GRID driver to pulp +--------------------------------- + +Uploading the driver to pulp will make it possible to run kayobe from any host. This can be useful when +running in a CI environment. + +.. code:: shell + + pulp artifact upload --file ~/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip + pulp file content create --relative-path "NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" --sha256 c8e12c15b881df35e618bdee1f141cbfcc7e112358f0139ceaa95b48e20761e0 + pulp file repository create --name nvidia + pulp file repository content add --repository nvidia --sha256 c8e12c15b881df35e618bdee1f141cbfcc7e112358f0139ceaa95b48e20761e0 --relative-path "NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" + pulp file publication create --repository nvidia + pulp file distribution create --name nvidia --base-path nvidia --repository nvidia + +The file will then be available at ``/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip``. You +will need to set the ``vgpu_driver_url`` configuration option to this value: + +.. code:: yaml + + # URL of GRID driver in pulp + vgpu_driver_url: "{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" + +See :ref:`NVIDIA Role Configuration`. + +.. _NVIDIA control host: + +Placing the GRID driver on the ansible control host +--------------------------------------------------- + +Copy the driver bundle to a known location on the ansible control host. Set the ``vgpu_driver_url`` configuration variable to reference this +path using ``file`` as the url scheme e.g: + +.. code:: yaml + + # Location of NVIDIA GRID driver on localhost + vgpu_driver_url: "file://{{ lookup('env', 'HOME') }}/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" + +See :ref:`NVIDIA Role Configuration`. + +.. _NVIDIA OS Configuration: + +OS Configuration +---------------- + +Host OS configuration is done by using roles in the `stackhpc.linux `_ ansible collection. + +Add the following to your ansible ``requirements.yml``: + +.. 
code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/ansible/requirements.yml + + #FIXME: Update to known release When VGPU and IOMMU roles have landed + collections: + - name: stackhpc.linux + source: git+https://github.com/stackhpc/ansible-collection-linux.git,preemptive/vgpu-iommu + type: git + +Create a new playbook or update an existing on to apply the roles: + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/ansible/host-configure.yml + + --- + + - hosts: iommu + tags: + - iommu + tasks: + - import_role: + name: stackhpc.linux.iommu + handlers: + - name: reboot + set_fact: + kayobe_needs_reboot: true + + - hosts: vgpu + tags: + - vgpu + tasks: + - import_role: + name: stackhpc.linux.vgpu + handlers: + - name: reboot + set_fact: + kayobe_needs_reboot: true + + - name: Reboot when required + hosts: iommu:vgpu + tags: + - reboot + tasks: + - name: Reboot + reboot: + reboot_timeout: 3600 + become: true + when: kayobe_needs_reboot | default(false) | bool + +Ansible Inventory Configuration +------------------------------- + +Add some hosts into the ``vgpu`` group. The example below maps two custom +compute groups, ``compute_multi_instance_gpu`` and ``compute_vgpu``, +into the ``vgpu`` group: + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/inventory/custom + + [compute] + [compute_multi_instance_gpu] + [compute_vgpu] + + [vgpu:children] + compute_multi_instance_gpu + compute_vgpu + + [iommu:children] + vgpu + +Having multiple groups is useful if you want to be able to do conditional +templating in ``nova.conf`` (see :ref:`NVIDIA Kolla Ansible +Configuration`). Since the vgpu role requires iommu to be enabled, all of the +hosts in the ``vgpu`` group are also added to the ``iommu`` group. + +If using bifrost and the ``kayobe overcloud inventory discover`` mechanism, +hosts can automatically be mapped to these groups by configuring +``overcloud_group_hosts_map``: + +.. code-block:: yaml + :caption: ``$KAYOBE_CONFIG_PATH/overcloud.yml`` + + overcloud_group_hosts_map: + compute_vgpu: + - "computegpu000" + compute_mutli_instance_gpu: + - "computegpu001" + +.. _NVIDIA Role Configuration: + +Role Configuration +^^^^^^^^^^^^^^^^^^ + +Configure the location of the NVIDIA driver: + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/vgpu.yml + + --- + + vgpu_driver_url: "http://{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" + +Configure the VGPU devices: + +.. 
code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/inventory/group_vars/compute_vgpu/vgpu + + #nvidia-692 GRID A100D-4C + #nvidia-693 GRID A100D-8C + #nvidia-694 GRID A100D-10C + #nvidia-695 GRID A100D-16C + #nvidia-696 GRID A100D-20C + #nvidia-697 GRID A100D-40C + #nvidia-698 GRID A100D-80C + #nvidia-699 GRID A100D-1-10C + #nvidia-700 GRID A100D-2-20C + #nvidia-701 GRID A100D-3-40C + #nvidia-702 GRID A100D-4-40C + #nvidia-703 GRID A100D-7-80C + #nvidia-707 GRID A100D-1-10CME + vgpu_definitions: + # Configuring a MIG backed VGPU + - pci_address: "0000:17:00.0" + virtual_functions: + - mdev_type: nvidia-700 + index: 0 + - mdev_type: nvidia-700 + index: 1 + - mdev_type: nvidia-700 + index: 2 + - mdev_type: nvidia-699 + index: 3 + mig_devices: + "1g.10gb": 1 + "2g.20gb": 3 + # Configuring a card in a time-sliced configuration (non-MIG backed) + - pci_address: "0000:65:00.0" + virtual_functions: + - mdev_type: nvidia-697 + index: 0 + - mdev_type: nvidia-697 + index: 1 + +Running the playbook +^^^^^^^^^^^^^^^^^^^^ + +The playbook defined in the :ref:`previous step` +should be run after `kayobe overcloud host configure` has completed. This will +ensure the host has been fully bootstrapped. With default settings, internet +connectivity is required to download `MIG Partition Editor for NVIDIA GPUs`. If +this is not desirable, you can override the one of the following variables +(depending on host OS): + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/inventory/group_vars/compute_vgpu/vgpu + + vgpu_nvidia_mig_manager_rpm_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager-0.5.1-1.x86_64.rpm" + vgpu_nvidia_mig_manager_deb_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager_0.5.1-1_amd64.deb" + +For example, you may wish to upload these artifacts to the local pulp. + +Run the playbook that you defined earlier: + +.. code-block:: shell + + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml + +Note: This will reboot the hosts on first run. + +The playbook may be added as a hook in ``$KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d``; this will +ensure you do not forget to run it when hosts are enrolled in the future. + +.. _NVIDIA Kolla Ansible Configuration: + +Kolla-Ansible configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To use the mdev devices that were created, modify nova.conf to add a list of mdev devices that +can be passed through to guests: + +.. code-block:: + :caption: $KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf + + {% if inventory_hostname in groups['compute_multi_instance_gpu'] %} + [devices] + enabled_mdev_types = nvidia-700, nvidia-699 + + [mdev_nvidia-700] + device_addresses = 0000:21:00.4,0000:21:00.5,0000:21:00.6,0000:81:00.4,0000:81:00.5,0000:81:00.6 + mdev_class = CUSTOM_NVIDIA_700 + + [mdev_nvidia-699] + device_addresses = 0000:21:00.7,0000:81:00.7 + mdev_class = CUSTOM_NVIDIA_699 + + {% elif inventory_hostname in groups['compute_vgpu'] %} + [devices] + enabled_mdev_types = nvidia-697 + + [mdev_nvidia-697] + device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 + # Custom resource classes don't work when you only have single resource type. + mdev_class = VGPU + + {% endif %} + +You will need to adjust the PCI addresses to match the virtual function +addresses. These can be obtained by checking the mdevctl configuration after +running the role: + +.. 
code-block:: shell + + # mdevctl list + + 73269d0f-b2c9-438d-8f28-f9e4bc6c6995 0000:17:00.4 nvidia-700 manual (defined) + dc352ef3-efeb-4a5d-a48e-912eb230bc76 0000:17:00.5 nvidia-700 manual (defined) + a464fbae-1f89-419a-a7bd-3a79c7b2eef4 0000:17:00.6 nvidia-700 manual (defined) + f3b823d3-97c8-4e0a-ae1b-1f102dcb3bce 0000:17:00.7 nvidia-699 manual (defined) + 330be289-ba3f-4416-8c8a-b46ba7e51284 0000:65:00.4 nvidia-700 manual (defined) + 1ba5392c-c61f-4f48-8fb1-4c6b2bbb0673 0000:65:00.5 nvidia-700 manual (defined) + f6868020-eb3a-49c6-9701-6c93e4e3fa9c 0000:65:00.6 nvidia-700 manual (defined) + 00501f37-c468-5ba4-8be2-8d653c4604ed 0000:65:00.7 nvidia-699 manual (defined) + +The mdev_class maps to a resource class that you can set in your flavor definition. +Note that if you only define a single mdev type on a given hypervisor, then the +mdev_class configuration option is silently ignored and it will use the ``VGPU`` +resource class (bug?). + +Map through the kayobe inventory groups into kolla: + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/kolla.yml + + kolla_overcloud_inventory_top_level_group_map: + control: + groups: + - controllers + network: + groups: + - network + compute_cpu: + groups: + - compute_cpu + compute_gpu: + groups: + - compute_gpu + compute_multi_instance_gpu: + groups: + - compute_multi_instance_gpu + compute_vgpu: + groups: + - compute_vgpu + compute: + groups: + - compute + monitoring: + groups: + - monitoring + storage: + groups: + "{{ kolla_overcloud_inventory_storage_groups }}" + +Where the ``compute_`` groups have been added to the kayobe defaults. + +You will need to reconfigure nova for this change to be applied: + +.. code-block:: shell + + kayobe overcloud service deploy -kt nova --kolla-limit compute_vgpu + +Openstack flavors +^^^^^^^^^^^^^^^^^ + +Define some flavors that request the resource class that was configured in nova.conf. +An example definition, that can be used with ``openstack.cloud.compute_flavor`` Ansible module, +is shown below: + +.. code-block:: yaml + + vgpu_a100_2g_20gb: + name: "vgpu.a100.2g.20gb" + ram: 65536 + disk: 30 + vcpus: 8 + is_public: false + extra_specs: + hw:cpu_policy: "dedicated" + hw:cpu_thread_policy: "prefer" + hw:mem_page_size: "1GB" + hw:cpu_sockets: 2 + hw:numa_nodes: 8 + hw_rng:allowed: "True" + resources:CUSTOM_NVIDIA_700: "1" + +You now should be able to launch a VM with this flavor. + +NVIDIA License Server +^^^^^^^^^^^^^^^^^^^^^ + +The Nvidia delegated license server is a virtual machine based appliance. You simply need to boot an instance +using the image supplied on the NVIDIA Licensing portal. This can be done on the OpenStack cloud itself. The +requirements are: + +* All tenants wishing to use GPU based instances must have network connectivity to this machine. (network licensing) + - It is possible to configure node locked licensing where tenants do not need access to the license server +* Satisfy minimum requirements detailed `here `__. + +The official documentation for configuring the instance +can be found `here `__. + +Below is a snippet of openstack-config for defining a project, and a security group that can be used for a non-HA deployment: + +.. code-block:: yaml + + secgroup_rules_nvidia_dls: + # Allow ICMP (for ping, etc.). + - ethertype: IPv4 + protocol: icmp + # Allow SSH. 
+ - ethertype: IPv4 + protocol: tcp + port_range_min: 22 + port_range_max: 22 + # https://docs.nvidia.com/license-system/latest/nvidia-license-system-user-guide/index.html + - ethertype: IPv4 + protocol: tcp + port_range_min: 443 + port_range_max: 443 + - ethertype: IPv4 + protocol: tcp + port_range_min: 80 + port_range_max: 80 + - ethertype: IPv4 + protocol: tcp + port_range_min: 7070 + port_range_max: 7070 + + secgroup_nvidia_dls: + name: nvidia-dls + project: "{{ project_cloud_services.name }}" + rules: "{{ secgroup_rules_nvidia_dls }}" + + openstack_security_groups: + - "{{ secgroup_nvidia_dls }}" + + project_cloud_services: + name: "cloud-services" + description: "Internal Cloud services" + project_domain: default + user_domain: default + users: [] + quotas: "{{ quotas_project }}" + +Booting the VM: + +.. code-block:: shell + + # Uploading the image and making it available in the cloud services project + $ openstack image create --file nls-3.0.0-bios.qcow2 nls-3.0.0-bios --disk-format qcow2 + $ openstack image add project nls-3.0.0-bios cloud-services + $ openstack image set --accept nls-3.0.0-bios --project cloud-services + $ openstack image member list nls-3.0.0-bios + + # Booting a server as the admin user in the cloud-services project. We pre-create the port so that + # we can recreate it without changing the MAC address. + $ openstack port create --mac-address fa:16:3e:a3:fd:19 --network external nvidia-dls-1 --project cloud-services + $ openstack role add member --project cloud-services --user admin + $ export OS_PROJECT_NAME=cloud-services + $ openstack server group create nvidia-dls --policy anti-affinity + $ openstack server create --flavor 8cpu-8gbmem-30gbdisk --image nls-3.0.0-bios --port nvidia-dls-1 --hint group=179dfa59-0947-4925-a0ff-b803bc0e58b2 nvidia-dls-cci1-1 --security-group nvidia-dls + $ openstack server add security group nvidia-dls-1 nvidia-dls + + +Manual VM driver and licence configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +vGPU client VMs need to be configured with Nvidia drivers to run GPU workloads. +The host drivers should already be applied to the hypervisor. + +GCP hosts compatible client drivers `here +`__. + +Find the correct version (when in doubt, use the same version as the host) and +download it to the VM. The exact dependencies will depend on the base image you +are using but at a minimum, you will need GCC installed. + +Ubuntu Jammy example: + +.. code-block:: bash + + sudo apt update + sudo apt install -y make gcc wget + wget https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU17.1/NVIDIA-Linux-x86_64-550.54.15-grid.run + sudo sh NVIDIA-Linux-x86_64-550.54.15-grid.run + +Check the ``nvidia-smi`` client is available: + +.. code-block:: bash + + nvidia-smi + +Generate a token from the licence server, and copy the token file to the client +VM. + +On the client, create an Nvidia grid config file from the template: + +.. code-block:: bash + + sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf + +Edit it to set ``FeatureType=1`` and leave the rest of the settings as default. + +Copy the client configuration token into the ``/etc/nvidia/ClientConfigToken`` +directory. + +Ensure the correct permissions are set: + +.. code-block:: bash + + sudo chmod 744 /etc/nvidia/ClientConfigToken/client_configuration_token_.tok + +Restart the ``nvidia-gridd`` service: + +.. code-block:: bash + + sudo systemctl restart nvidia-gridd + +Check that the token has been recognised: + +.. 
code-block:: bash + + nvidia-smi -q | grep 'License Status' + +If not, an error should appear in the journal: + +.. code-block:: bash + + sudo journalctl -xeu nvidia-gridd + +A successfully licenced VM can be snapshotted to create an image in Glance that +includes the drivers and licencing token. Alternatively, an image can be +created using Diskimage Builder. + +Disk image builder recipe to automatically license VGPU on boot +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +`stackhpc-image-elements `__ provides a ``nvidia-vgpu`` +element to configure the nvidia-gridd service in VGPU mode. This allows you to boot VMs that automatically license themselves. +Snippets of ``openstack-config`` that allow you to do this are shown below: + +.. code-block:: shell + + image_rocky9_nvidia: + name: "Rocky9-NVIDIA" + type: raw + elements: + - "rocky-container" + - "rpm" + - "nvidia-vgpu" + - "cloud-init" + - "epel" + - "cloud-init-growpart" + - "selinux-permissive" + - "dhcp-all-interfaces" + - "vm" + - "extra-repos" + - "grub2" + - "stable-interface-names" + - "openssh-server" + is_public: True + packages: + - "dkms" + - "git" + - "tmux" + - "cuda-minimal-build-12-1" + - "cuda-demo-suite-12-1" + - "cuda-libraries-12-1" + - "cuda-toolkit" + - "vim-enhanced" + env: + DIB_CONTAINERFILE_NETWORK_DRIVER: host + DIB_CONTAINERFILE_RUNTIME: docker + DIB_RPMS: "http://192.168.1.2:80/pulp/content/nvidia/nvidia-linux-grid-525-525.105.17-1.x86_64.rpm" + YUM: dnf + DIB_EXTRA_REPOS: "https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo" + DIB_NVIDIA_VGPU_CLIENT_TOKEN: "{{ lookup('file' , 'secrets/client_configuration_token_05-30-2023-12-41-40.tok') }}" + DIB_CLOUD_INIT_GROWPART_DEVICES: + - "/" + DIB_RELEASE: "9" + properties: + os_type: "linux" + os_distro: "rocky" + os_version: "9" + + openstack_images: + - "{{ image_rocky9_nvidia }}" + + openstack_image_git_elements: + - repo: "https://github.com/stackhpc/stackhpc-image-elements" + local: "{{ playbook_dir }}/stackhpc-image-elements" + version: master + elements_path: elements + +The gridd driver was uploaded pulp using the following procedure: + +.. code-block:: shell + + $ unzip NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip + $ pulp artifact upload --file ~/nvidia-linux-grid-525-525.105.17-1.x86_64.rpm + $ pulp file content create --relative-path "nvidia-linux-grid-525-525.105.17-1.x86_64.rpm" --sha256 58fda68d01f00ea76586c9fd5f161c9fbb907f627b7e4f4059a309d8112ec5f5 + $ pulp file repository add --name nvidia --sha256 58fda68d01f00ea76586c9fd5f161c9fbb907f627b7e4f4059a309d8112ec5f5 --relative-path "nvidia-linux-grid-525-525.105.17-1.x86_64.rpm" + $ pulp file publication create --repository nvidia + $ pulp file distribution update --name nvidia --base-path nvidia --repository nvidia + +This is the file we reference in ``DIB_RPMS``. It is important to keep the driver versions aligned between hypervisor and guest VM. + +The client token can be downloaded from the web interface of the licensing portal. Care should be taken +when copying the contents as it can contain invisible characters. It is best to copy the file directly +into your openstack-config repository and vault encrypt it. The ``file`` lookup plugin can be used to decrypt +the file (as shown in the example above). + +Testing vGPU VMs +^^^^^^^^^^^^^^^^ + +vGPU VMs can be validated using the following test workload. The test should +succeed if the VM is correctly licenced and drivers are correctly installed for +both the host and client VM. 
+ +Install ``cuda-toolkit`` using the instructions `here +`__. + +Ubuntu Jammy example: + +.. code-block:: bash + + wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb + sudo dpkg -i cuda-keyring_1.1-1_all.deb + sudo apt update -y + sudo apt install -y cuda-toolkit make + +The VM may require a reboot at this point. + +Clone the ``cuda-samples`` repo: + +.. code-block:: bash + + git clone https://github.com/NVIDIA/cuda-samples.git + +Build and run a test workload: + +.. code-block:: bash + + cd cuda-samples/Samples/6_Performance/transpose + make + ./transpose + +Example output: + +.. code-block:: + + Transpose Starting... + + GPU Device 0: "Ampere" with compute capability 8.0 + + > Device 0: "GRID A100D-1-10C MIG 1g.10gb" + > SM Capability 8.0 detected: + > [GRID A100D-1-10C MIG 1g.10gb] has 14 MP(s) x 64 (Cores/MP) = 896 (Cores) + > Compute performance scaling factor = 1.00 + + Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16 + + transpose simple copy , Throughput = 159.1779 GB/s, Time = 0.04908 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose shared memory copy, Throughput = 152.1922 GB/s, Time = 0.05133 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose naive , Throughput = 117.2670 GB/s, Time = 0.06662 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose coalesced , Throughput = 135.0813 GB/s, Time = 0.05784 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose optimized , Throughput = 145.4326 GB/s, Time = 0.05372 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose coarse-grained , Throughput = 145.2941 GB/s, Time = 0.05377 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose fine-grained , Throughput = 150.5703 GB/s, Time = 0.05189 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose diagonal , Throughput = 117.6831 GB/s, Time = 0.06639 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + Test passed + +Changing VGPU device types +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Converting the second card to an NVIDIA-698 (whole card). The hypervisor +is empty so we can freely delete mdevs. First clean up the mdev +definition: + +.. code:: shell + + [stack@computegpu007 ~]$ sudo mdevctl list + 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (defined) + eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (defined) + 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual (defined) + 0a47ffd1-392e-5373-8428-707a4e0ce31a 0000:81:00.5 nvidia-697 manual (defined) + + [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 72291b01-689b-5b7a-9171-6b3480deabf4 + [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a + + [stack@computegpu007 ~]$ sudo mdevctl undefine --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a + + [stack@computegpu007 ~]$ sudo mdevctl list --defined + 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (active) + eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (active) + 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual + + # We can re-use the first virtual function + +Secondly remove the systemd unit that starts the mdev device: + +.. 
code:: shell + + [stack@computegpu007 ~]$ sudo rm /etc/systemd/system/multi-user.target.wants/nvidia-mdev@0a47ffd1-392e-5373-8428-707a4e0ce31a.service + +Example config change: + +.. code:: shell + + diff --git a/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu + new file mode 100644 + index 0000000..6cea9bf + --- /dev/null + +++ b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu + @@ -0,0 +1,12 @@ + +--- + +vgpu_definitions: + + - pci_address: "0000:21:00.0" + + virtual_functions: + + - mdev_type: nvidia-697 + + index: 0 + + - mdev_type: nvidia-697 + + index: 1 + + - pci_address: "0000:81:00.0" + + virtual_functions: + + - mdev_type: nvidia-698 + + index: 0 + diff --git a/etc/kayobe/kolla/config/nova/nova-compute.conf b/etc/kayobe/kolla/config/nova/nova-compute.conf + index 6f680cb..e663ec4 100644 + --- a/etc/kayobe/kolla/config/nova/nova-compute.conf + +++ b/etc/kayobe/kolla/config/nova/nova-compute.conf + @@ -39,7 +39,19 @@ cpu_mode = host-model + {% endraw %} + + {% raw %} + -{% if inventory_hostname in groups['compute_multi_instance_gpu'] %} + +{% if inventory_hostname == "computegpu007" %} + +[devices] + +enabled_mdev_types = nvidia-697, nvidia-698 + + + +[mdev_nvidia-697] + +device_addresses = 0000:21:00.4,0000:21:00.5 + +mdev_class = VGPU + + + +[mdev_nvidia-698] + +device_addresses = 0000:81:00.4 + +mdev_class = CUSTOM_NVIDIA_698 + + + +{% elif inventory_hostname in groups['compute_multi_instance_gpu'] %} + [devices] + enabled_mdev_types = nvidia-700, nvidia-699 + + @@ -50,15 +62,14 @@ mdev_class = CUSTOM_NVIDIA_700 + [mdev_nvidia-699] + device_addresses = 0000:21:00.7,0000:81:00.7 + mdev_class = CUSTOM_NVIDIA_699 + -{% endif %} + + -{% if inventory_hostname in groups['compute_vgpu'] %} + +{% elif inventory_hostname in groups['compute_vgpu'] %} + [devices] + enabled_mdev_types = nvidia-697 + + [mdev_nvidia-697] + device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 + -# Custom resource classes don't seem to work for this card. + +# Custom resource classes don't work when you only have single resource type. + mdev_class = VGPU + + {% endif %} + +Re-run the configure playbook: + +.. code:: shell + + (kayobe) [stack@ansiblenode1 kayobe]$ kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml --tags vgpu --limit computegpu007 + +Check the result: + +.. code:: shell + + [stack@computegpu007 ~]$ mdevctl list + 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual + eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual + 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-698 manual + +Reconfigure nova to match the change: + +.. code:: shell + + kayobe overcloud service reconfigure -kt nova --kolla-limit computegpu007 --skip-prechecks + + +PCI Passthrough +############### + +This guide has been developed for Nvidia GPUs and CentOS 8. + +See `Kayobe Ops `_ for +a playbook implementation of host setup for GPU. + +BIOS Configuration Requirements +------------------------------- + +On an Intel system: + +* Enable `VT-x` in the BIOS for virtualisation support. +* Enable `VT-d` in the BIOS for IOMMU support. + +Hypervisor Configuration Requirements +------------------------------------- + +Find the GPU device IDs +^^^^^^^^^^^^^^^^^^^^^^^ + +From the host OS, use ``lspci -nn`` to find the PCI vendor ID and +device ID for the GPU device and supporting components. These are +4-digit hex numbers. + +For example: + +.. 
code-block:: text + + 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204M [GeForce GTX 980M] [10de:13d7] (rev a1) (prog-if 00 [VGA controller]) + 01:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1) + +In this case the vendor ID is ``10de``, display ID is ``13d7`` and audio ID is ``0fbb``. + +Alternatively, for an Nvidia Quadro RTX 6000: + +.. code-block:: yaml + + # NVIDIA Quadro RTX 6000/8000 PCI device IDs + vendor_id: "10de" + display_id: "1e30" + audio_id: "10f7" + usba_id: "1ad6" + usba_class: "0c0330" + usbc_id: "1ad7" + usbc_class: "0c8000" + +These parameters will be used for device-specific configuration. + +Kernel Ramdisk Reconfiguration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ramdisk loaded during kernel boot can be extended to include the +vfio PCI drivers and ensure they are loaded early in system boot. + +.. code-block:: yaml + + - name: Template dracut config + blockinfile: + path: /etc/dracut.conf.d/gpu-vfio.conf + block: | + add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd" + owner: root + group: root + mode: 0660 + create: true + become: true + notify: + - Regenerate initramfs + - reboot + +The handler for regenerating the Dracut initramfs is: + +.. code-block:: yaml + + - name: Regenerate initramfs + shell: |- + #!/bin/bash + set -eux + dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r) + become: true + +Kernel Boot Parameters +^^^^^^^^^^^^^^^^^^^^^^ + +Set the following kernel parameters by adding to +``GRUB_CMDLINE_LINUX_DEFAULT`` or ``GRUB_CMDLINE_LINUX`` in +``/etc/default/grub.conf``. We can use the +`stackhpc.grubcmdline `_ +role from Ansible Galaxy: + +.. code-block:: yaml + + - name: Add vfio-pci.ids kernel args + include_role: + name: stackhpc.grubcmdline + vars: + kernel_cmdline: + - intel_iommu=on + - iommu=pt + - "vfio-pci.ids={{ vendor_id }}:{{ display_id }},{{ vendor_id }}:{{ audio_id }}" + kernel_cmdline_remove: + - iommu + - intel_iommu + - vfio-pci.ids + +Kernel Device Management +^^^^^^^^^^^^^^^^^^^^^^^^ + +In the hypervisor, we must prevent kernel device initialisation of +the GPU and prevent drivers from loading for binding the GPU in the +host OS. We do this using ``udev`` rules: + +.. code-block:: yaml + + - name: Template udev rules to blacklist GPU usb controllers + blockinfile: + # We want this to execute as soon as possible + path: /etc/udev/rules.d/99-gpu.rules + block: | + #Remove NVIDIA USB xHCI Host Controller Devices, if present + ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usba_class }}", ATTR{remove}="1" + #Remove NVIDIA USB Type-C UCSI devices, if present + ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usbc_class }}", ATTR{remove}="1" + owner: root + group: root + mode: 0644 + create: true + become: true + +Kernel Drivers +^^^^^^^^^^^^^^ + +Prevent the ``nouveau`` kernel driver from loading by +blacklisting the module: + +.. code-block:: yaml + + - name: Blacklist nouveau + blockinfile: + path: /etc/modprobe.d/blacklist-nouveau.conf + block: | + blacklist nouveau + options nouveau modeset=0 + mode: 0664 + owner: root + group: root + create: true + become: true + notify: + - reboot + - Regenerate initramfs + +Ensure that the ``vfio`` drivers are loaded into the kernel on boot: + +.. 
code-block:: yaml + + - name: Add vfio to modules-load.d + blockinfile: + path: /etc/modules-load.d/vfio.conf + block: | + vfio + vfio_iommu_type1 + vfio_pci + vfio_virqfd + owner: root + group: root + mode: 0664 + create: true + become: true + notify: reboot + +Once this code has taken effect (after a reboot), the VFIO kernel drivers should be loaded on boot: + +.. code-block:: text + + # lsmod | grep vfio + vfio_pci 49152 0 + vfio_virqfd 16384 1 vfio_pci + vfio_iommu_type1 28672 0 + vfio 32768 2 vfio_iommu_type1,vfio_pci + irqbypass 16384 5 vfio_pci,kvm + + # lspci -nnk -s 3d:00.0 + 3d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Tesla M10] [10de:13bd] (rev a2) + Subsystem: NVIDIA Corporation Tesla M10 [10de:1160] + Kernel driver in use: vfio-pci + Kernel modules: nouveau + +IOMMU should be enabled at kernel level as well - we can verify that on the compute host: + +.. code-block:: text + + # docker exec -it nova_libvirt virt-host-validate | grep IOMMU + QEMU: Checking for device assignment IOMMU support : PASS + QEMU: Checking if IOMMU is enabled by kernel : PASS + +OpenStack Nova configuration +---------------------------- + +Configure nova-scheduler +^^^^^^^^^^^^^^^^^^^^^^^^ + +The nova-scheduler service must be configured to enable the ``PciPassthroughFilter`` +To enable it add it to the list of filters to Kolla-Ansible configuration file: +``etc/kayobe/kolla/config/nova.conf``, for instance: + +.. code-block:: yaml + + [filter_scheduler] + available_filters = nova.scheduler.filters.all_filters + enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, PciPassthroughFilter + +Configure nova-compute +^^^^^^^^^^^^^^^^^^^^^^ + +Configuration can be applied in flexible ways using Kolla-Ansible's +methods for `inventory-driven customisation of configuration +`_. +The following configuration could be added to +``etc/kayobe/kolla/config/nova/nova-compute.conf`` to enable PCI +passthrough of GPU devices for hosts in a group named ``compute_gpu``. +Again, the 4-digit PCI Vendor ID and Device ID extracted from ``lspci +-nn`` can be used here to specify the GPU device(s). + +.. code-block:: jinja + + [pci] + {% raw %} + {% if inventory_hostname in groups['compute_gpu'] %} + # We could support multiple models of GPU. + # This can be done more selectively using different inventory groups. + # GPU models defined here: + # NVidia Tesla V100 16GB + # NVidia Tesla V100 32GB + # NVidia Tesla P100 16GB + passthrough_whitelist = [{ "vendor_id":"10de", "product_id":"1db4" }, + { "vendor_id":"10de", "product_id":"1db5" }, + { "vendor_id":"10de", "product_id":"15f8" }] + alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } + alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } + alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } + {% endif %} + {% endraw %} + +Configure nova-api +^^^^^^^^^^^^^^^^^^ + +pci.alias also needs to be configured on the controller. +This configuration should match the configuration found on the compute nodes. +Add it to Kolla-Ansible configuration file: +``etc/kayobe/kolla/config/nova/nova-api.conf``, for instance: + +.. 
code-block:: yaml + + [pci] + alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } + alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } + alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } + +Reconfigure nova service +^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: text + + kayobe overcloud service reconfigure --kolla-tags nova --kolla-skip-tags common --skip-prechecks + +Configure a flavor +^^^^^^^^^^^^^^^^^^ + +For example, to request two of the GPUs with alias gpu-p100 + +.. code-block:: text + + openstack flavor set m1.medium --property "pci_passthrough:alias"="gpu-p100:2" + + +This can be also defined in the openstack-config repository + +add extra_specs to flavor in etc/openstack-config/openstack-config.yml: + +.. code-block:: console + + admin# cd src/openstack-config + admin# vim etc/openstack-config/openstack-config.yml + + name: "m1.medium" + ram: 4096 + disk: 40 + vcpus: 2 + extra_specs: + "pci_passthrough:alias": "gpu-p100:2" + +Invoke configuration playbooks afterwards: + +.. code-block:: console + + admin# source src/kayobe-config/etc/kolla/public-openrc.sh + admin# source venvs/openstack/bin/activate + admin# tools/openstack-config --vault-password-file + +Create instance with GPU passthrough +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: text + + openstack server create --flavor m1.medium --image ubuntu2004 --wait test-pci + +Testing GPU in a Guest VM +------------------------- + +The Nvidia drivers must be installed first. For example, on an Ubuntu guest: + +.. code-block:: text + + sudo apt install nvidia-headless-440 nvidia-utils-440 nvidia-compute-utils-440 + +The ``nvidia-smi`` command will generate detailed output if the driver has loaded +successfully. 
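+
+As a quick check that the passed-through GPU is visible to the driver, list the detected devices (illustrative
+output; the exact model and UUID will differ):
+
+.. code-block:: text
+
+   $ nvidia-smi -L
+   GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)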
+ +Further Reference +----------------- + +For PCI Passthrough and GPUs in OpenStack: + +* Consumer-grade GPUs: https://gist.github.com/claudiok/890ab6dfe76fa45b30081e58038a9215 +* https://www.jimmdenton.com/gpu-offloading-openstack/ +* https://docs.openstack.org/nova/latest/admin/pci-passthrough.html +* https://docs.openstack.org/nova/latest/admin/virtual-gpu.html (vGPU only) +* Tesla models in OpenStack: https://egallen.com/openstack-nvidia-tesla-gpu-passthrough/ +* https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF +* https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt +* https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough +* https://www.gresearch.co.uk/article/utilising-the-openstack-placement-service-to-schedule-gpu-and-nvme-workloads-alongside-general-purpose-instances/ From 05cff81681bd1f1b0b72908aafc13589ad490a49 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 29 May 2024 16:32:20 +0100 Subject: [PATCH 11/42] Fix errors and add to index --- .../operations/baremetal-node-management.rst | 2 +- doc/source/operations/ceph-management.rst | 62 ++++++++++--------- .../operations/control-plane-operation.rst | 47 +++++++------- ...ng_horizon.rst => customising-horizon.rst} | 7 +-- .../hardware-inventory-management.rst | 12 ++-- doc/source/operations/index.rst | 10 +++ .../operations/openstack-reconfiguration.rst | 11 ++-- doc/source/operations/wazuh-operation.rst | 2 +- 8 files changed, 83 insertions(+), 70 deletions(-) rename doc/source/operations/{customising_horizon.rst => customising-horizon.rst} (97%) diff --git a/doc/source/operations/baremetal-node-management.rst b/doc/source/operations/baremetal-node-management.rst index f45903dad9..f5c82bd5ce 100644 --- a/doc/source/operations/baremetal-node-management.rst +++ b/doc/source/operations/baremetal-node-management.rst @@ -181,7 +181,7 @@ The command below extracts a list of port UUID, node UUID and switch port inform .. code-block:: bash - admin# openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value + openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value NGS will manage VLAN membership for ports when the ``local_link_connection`` fields match one of the switches in ``ml2_conf.ini``. The rest of the switch configuration is static. diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 8e3d1f4e94..3c82a3ffe0 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -45,8 +45,8 @@ Ceph commands are usually run inside a ``cephadm shell`` utility container: .. code-block:: console - # From the node that runs Ceph - ceph# sudo cephadm shell + # From storage host + sudo cephadm shell Operating a cluster requires a keyring with an admin access to be available for Ceph commands. Cephadm will copy such keyring to the nodes carrying @@ -71,15 +71,17 @@ First drain the node .. code-block:: console - ceph# cephadm shell - ceph# ceph orch host drain + # From storage host + sudo cephadm shell + ceph orch host drain Once all daemons are removed - you can remove the host: .. code-block:: console - ceph# cephadm shell - ceph# ceph orch host rm + # From storage host + sudo cephadm shell + ceph orch host rm And then remove the host from inventory (usually in ``etc/kayobe/inventory/overcloud``) @@ -98,8 +100,9 @@ movement: .. 
code-block:: console - ceph# cephadm shell - ceph# ceph osd set noout + # From storage host + sudo cephadm shell + ceph osd set noout Reboot the node and replace the drive @@ -107,15 +110,17 @@ Unset noout after the node is back online .. code-block:: console - ceph# cephadm shell - ceph# ceph osd unset noout + # From storage host + sudo cephadm shell + ceph osd unset noout Remove the OSD using Ceph orchestrator command: .. code-block:: console - ceph# cephadm shell - ceph# ceph orch osd rm --replace + # From storage host + sudo cephadm shell + ceph orch osd rm --replace After removing OSDs, if the drives the OSDs were deployed on once again become available, cephadm may automatically try to deploy more OSDs on these drives if @@ -142,7 +147,7 @@ identify which OSDs are tied to which physical disks: .. code-block:: console - ceph# ceph device ls + ceph device ls Host maintenance ---------------- @@ -167,7 +172,7 @@ Ceph can report details about failed OSDs by running: .. code-block:: console - ceph# ceph health detail + ceph health detail .. note :: @@ -184,7 +189,7 @@ A failed OSD will also be reported as down by running: .. code-block:: console - ceph# ceph osd tree + ceph osd tree Note the ID of the failed OSD. @@ -192,7 +197,8 @@ The failed disk is usually logged by the Linux kernel too: .. code-block:: console - storage-0# dmesg -T + # From storage host + dmesg -T Cross-reference the hardware device and OSD ID to ensure they match. (Using `pvs` and `lvs` may help make this connection). @@ -207,16 +213,15 @@ show``). On this hypervisor, enter the libvirt container: .. code-block:: console - :substitutions: - |hypervisor_hostname|# docker exec -it nova_libvirt /bin/bash + # From hypervisor host + docker exec -it nova_libvirt /bin/bash Find the VM name using libvirt: .. code-block:: console - :substitutions: - (nova-libvirt)[root@|hypervisor_hostname| /]# virsh list + (nova-libvirt)[root@compute-01 /]# virsh list Id Name State ------------------------------------ 1 instance-00000001 running @@ -224,19 +229,19 @@ Find the VM name using libvirt: Now inspect the properties of the VM using ``virsh dumpxml``: .. code-block:: console - :substitutions: - (nova-libvirt)[root@|hypervisor_hostname| /]# virsh dumpxml instance-00000001 | grep rbd - + (nova-libvirt)[root@compute-01 /]# virsh dumpxml instance-00000001 | grep rbd + On a Ceph node, the RBD pool can be inspected and the volume extracted as a RAW block image: .. code-block:: console - :substitutions: - ceph# rbd ls |nova_rbd_pool| - ceph# rbd export |nova_rbd_pool|/51206278-e797-4153-b720-8255381228da_disk blob.raw + # From storage host + sudo cephadm shell + rbd ls + rbd export /51206278-e797-4153-b720-8255381228da_disk blob.raw The raw block device (blob.raw above) can be mounted using the loopback device. @@ -248,8 +253,9 @@ libguestfs-tools and using the guestfish command: .. 
code-block:: console - ceph# export LIBGUESTFS_BACKEND=direct - ceph# guestfish -a blob.qcow + # From storage host + export LIBGUESTFS_BACKEND=direct + guestfish -a blob.qcow > run 100% [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 00:00 > list-filesystems diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index c5c629d52f..ffd5299ce3 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -55,7 +55,7 @@ Configuring Prometheus Alerts ----------------------------- Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` -files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add +files in ``$KAYOBE_CONFIG_PATH/kolla/config/prometheus`` as a model to add custom rules. Silencing Prometheus Alerts @@ -88,7 +88,7 @@ Generating Alerts from Metrics ++++++++++++++++++++++++++++++ Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` -files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add +files in ``$KAYOBE_CONFIG_PATH/kolla/config/prometheus`` as a model to add custom rules. Control Plane Shutdown Procedure @@ -124,7 +124,7 @@ The password can be found using: .. code-block:: console - kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml \ + kayobe# ansible-vault view $KAYOBE_CONFIG_PATH/kolla/passwords.yml \ --vault-password-file | grep ^database Checking RabbitMQ @@ -135,6 +135,7 @@ RabbitMQ health is determined using the command ``rabbitmqctl cluster_status``: .. code-block:: console [stack@controller0 ~]$ docker exec rabbitmq rabbitmqctl cluster_status + Cluster status of node rabbit@controller0 ... [{nodes,[{disc,['rabbit@controller0','rabbit@controller1', 'rabbit@controller2']}]}, @@ -180,20 +181,18 @@ If you are shutting down a single hypervisor, to avoid down time to tenants it is advisable to migrate all of the instances to another machine. See :ref:`evacuating-all-instances`. -.. ifconfig:: deployment['ceph_managed'] - - Ceph - ---- +Ceph +---- - The following guide provides a good overview: - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph +The following guide provides a good overview: +https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph Shutting down the seed VM ------------------------- .. code-block:: console - kayobe# virsh shutdown + kayobe# virsh shutdown .. _full-shutdown: @@ -262,7 +261,7 @@ hypervisor is powered on. If it does not, it can be started with: .. code-block:: console - kayobe# virsh start seed-0 + kayobe# virsh start Full power on ------------- @@ -340,13 +339,14 @@ To see the list of hypervisor names: .. code-block:: console - admin# openstack hypervisor list + # From host that can reach Openstack + openstack hypervisor list To boot an instance on a specific hypervisor .. code-block:: console - admin# openstack server create --flavor --network --key-name --image --availability-zone nova:: + openstack server create --flavor --network --key-name --image --availability-zone nova:: Cleanup Procedures ================== @@ -360,22 +360,23 @@ perform the following cleanup procedure regularly: .. 
code-block:: console - admin# for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do - if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then - echo "$user still in use, not deleting" - else - openstack user delete --domain magnum $user - fi - done + for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do + if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then + echo "$user still in use, not deleting" + else + openstack user delete --domain magnum $user + fi + done OpenSearch indexes retention ============================= To alter default rotation values for OpenSearch, edit -``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: +``$KAYOBE_CONFIG_PATH/kolla/globals.yml``: .. code-block:: console + # Duration after which index is closed (default 30) opensearch_soft_retention_period_days: 90 # Duration after which index is deleted (default 60) @@ -384,8 +385,8 @@ To alter default rotation values for OpenSearch, edit Reconfigure Opensearch with new values: .. code-block:: console - kayobe overcloud service reconfigure --kolla-tags opensearch -For more information see the `upstream documentation + kayobe# kayobe overcloud service reconfigure --kolla-tags opensearch +For more information see the `upstream documentation `__. diff --git a/doc/source/operations/customising_horizon.rst b/doc/source/operations/customising-horizon.rst similarity index 97% rename from doc/source/operations/customising_horizon.rst rename to doc/source/operations/customising-horizon.rst index 1f8977a31e..4b2e157d86 100644 --- a/doc/source/operations/customising_horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -1,8 +1,6 @@ -.. include:: vars.rst - -==================================== +=================== Customising Horizon -==================================== +=================== Horizon is the most frequent site-specific container customisation required: other customisations tend to be common across deployments, but personalisation @@ -55,7 +53,6 @@ Building a custom container image for Horizon can be done by modifying ``kolla.yml`` to fetch the custom theme and include it in the image: .. code-block:: yaml - :substitutions: kolla_sources: horizon-additions-theme-: diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/hardware-inventory-management.rst index 0d6fd8adf1..8636d5b562 100644 --- a/doc/source/operations/hardware-inventory-management.rst +++ b/doc/source/operations/hardware-inventory-management.rst @@ -5,7 +5,7 @@ Hardware Inventory Management At its lowest level, hardware inventory is managed in the Bifrost service. Reconfiguring Control Plane Hardware ------------------------------------- +==================================== If a server's hardware or firmware configuration is changed, it should be re-inspected in Bifrost before it is redeployed into service. A single server @@ -112,10 +112,10 @@ hypervisor. They should all show the status ACTIVE. This can be verified with: admin# openstack server show Troubleshooting -+++++++++++++++ +=============== Servers that have been shut down -******************************** +-------------------------------- If there are any instances that are SHUTOFF they won’t be migrated, but you can use ``openstack server migrate`` for them once the live migration is finished. 
@@ -131,7 +131,7 @@ For more details see: http://www.danplanet.com/blog/2016/03/03/evacuate-in-nova-one-command-to-confuse-us-all/ Flavors have changed -******************** +-------------------- If the size of the flavors has changed, some instances will also fail to migrate as the process needs manual confirmation. You can do this with: @@ -150,7 +150,7 @@ RESIZE`` as shown in this snippet of ``openstack server show ``: .. _set-bifrost-maintenance-mode: Set maintenance mode on a node in Bifrost -+++++++++++++++++++++++++++++++++++++++++ +----------------------------------------- .. code-block:: console @@ -161,7 +161,7 @@ Set maintenance mode on a node in Bifrost .. _unset-bifrost-maintenance-mode: Unset maintenance mode on a node in Bifrost -+++++++++++++++++++++++++++++++++++++++++++ +------------------------------------------- .. code-block:: console diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index 825384c4bf..aea4139980 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -7,11 +7,21 @@ This guide is for operators of the StackHPC Kayobe configuration project. .. toctree:: :maxdepth: 1 + baremetal-node-management + ceph-management + control-plane-operation + customsing-horizon + gpu-in-openstack + hardware-inventory-management hotfix-playbook + migrating-vm nova-compute-ironic octavia + openstack-projects-and-users-management + openstack-reconfiguration rabbitmq secret-rotation tempest upgrading-openstack upgrading-ceph + wazuh-operation diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index dfba372f26..db157f6a91 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -3,7 +3,7 @@ OpenStack Reconfiguration ========================= Disabling a Service -------------------- +=================== Ansible is oriented towards adding or reconfiguring services, but removing a service is handled less well, because of Ansible's imperative style. @@ -36,7 +36,7 @@ Some services may store data in a dedicated Docker volume, which can be removed with ``docker volume rm``. Installing TLS Certificates ---------------------------- +=========================== To configure TLS for the first time, we write the contents of a PEM file to the ``secrets.yml`` file as ``secrets_kolla_external_tls_cert``. @@ -69,7 +69,7 @@ be updated in Keystone: kayobe# kayobe overcloud service reconfigure Alternative Configuration -+++++++++++++++++++++++++ +------------------------- As an alternative to writing the certificates as a variable to ``secrets.yml``, it is also possible to write the same data to a file, @@ -88,7 +88,6 @@ Check the expiry date on an installed TLS certificate from a host that can reach the OpenStack APIs: .. code-block:: console - :substitutions: openstack# openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates @@ -106,7 +105,7 @@ above. Run the following command: .. 
_taking-a-hypervisor-out-of-service: Taking a Hypervisor out of Service ----------------------------------- +================================== To take a hypervisor out of Nova scheduling: @@ -141,7 +140,7 @@ And then to enable a hypervisor again: nova-compute Managing Space in the Docker Registry -------------------------------------- +===================================== If the Docker registry becomes full, this can prevent container updates and (depending on the storage configuration of the seed host) could lead to other diff --git a/doc/source/operations/wazuh-operation.rst b/doc/source/operations/wazuh-operation.rst index 23800ff849..3f56c24603 100644 --- a/doc/source/operations/wazuh-operation.rst +++ b/doc/source/operations/wazuh-operation.rst @@ -14,7 +14,7 @@ Ansible playbooks `_. These can be integrated into ``kayobe-config`` as a custom playbook. Configuring Wazuh Manager -------------------------- +========================= Wazuh Manager is configured by editing the ``wazuh-manager.yml`` groups vars file found at From 4fced04b38dc8c6e0109bc2e205e1151fa392b6d Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 30 May 2024 09:22:10 +0100 Subject: [PATCH 12/42] Remove repeating section --- doc/source/operations/control-plane-operation.rst | 7 ------- 1 file changed, 7 deletions(-) diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index ffd5299ce3..8212b95fba 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -84,13 +84,6 @@ the monitoring service rather than the host being monitored). `known issue `__ when running several Alertmanager instances behind HAProxy. -Generating Alerts from Metrics -++++++++++++++++++++++++++++++ - -Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` -files in ``$KAYOBE_CONFIG_PATH/kolla/config/prometheus`` as a model to add -custom rules. - Control Plane Shutdown Procedure ================================ From 879f8dc57a66a70a806a13a096e70614368c9534 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 30 May 2024 09:37:43 +0100 Subject: [PATCH 13/42] Add more instruction for ADVise tool --- .../hardware-inventory-management.rst | 30 ++++++++++++++++--- 1 file changed, 26 insertions(+), 4 deletions(-) diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/hardware-inventory-management.rst index 8636d5b562..f88626a827 100644 --- a/doc/source/operations/hardware-inventory-management.rst +++ b/doc/source/operations/hardware-inventory-management.rst @@ -228,6 +228,8 @@ The playbook has the following optional parameters: - output_dir: path to where results should be saved. Default: ``"{{ lookup('env', 'PWD') }}/review"`` - advise-pattern: regular expression to specify what introspection data should be analysed. Default: ``".*.eval"`` +You can override them by provide new values with ``-e =`` + Example command to run the tool on data about the compute nodes in a system, where compute nodes are named cpt01, cpt02, cpt03…: .. code-block:: console @@ -244,10 +246,30 @@ Using the results The ADVise tool will output a selection of results found under output_dir/results these include: - ``.html`` files to display network visualisations of any hardware differences. -- The folder ``Paired_Comparisons`` which contains information on the shared and differing fields found between the systems. 
This is a reflection of the network visualisation webpage, with more detail as to what the differences are.
+- The folder ``Paired_Comparisons`` which contains information on the shared and differing fields found between the systems.
+  This is a reflection of the network visualisation webpage, with more detail as to what the differences are.
 - ``_summary``, a listing of how the systems can be grouped into sets of identical hardware.
 - ``_performance``, the results of analysing the benchmarking data gathered.
-- ``_perf_summary``, a subset of the performance metrics, just showing any potentially anomalous data such as where variance is too high, or individual nodes have been found to over/underperform.
+- ``_perf_summary``, a subset of the performance metrics, just showing any potentially anomalous data such as where variance
+  is too high, or individual nodes have been found to over/underperform.
+
+The ADVise tool will also launch an interactive Dash webpage, which displays the network visualisations,
+tables with information on the differing hardware attributes, the performance metrics as a range of box-plots,
+and specifies which individual nodes may be anomalous via box-plot outliers. This can be accessed at ``localhost:8050``.
+To close this service, simply press ``Ctrl+C`` in the terminal where you ran the playbook.
+
+To get the visualised results, it is recommended to copy the introspection data to your local machine and then run the ADVise playbook locally.
+
+Recommended Workflow
+--------------------
 
-To get visuallised result, It is recommanded to copy instrospection data and review directories to your
-local machine then run ADVise playbook locally with the data.
+1. Run the playbook as outlined above.
+2. Open the Dash webpage at ``localhost:8050``.
+3. Review the hardware differences. Note that hovering over a group will display the nodes it contains.
+4. Identify any unexpected differences in the systems. If multiple differing fields exist they will be graphed separately.
+   For example, with the compute nodes targeted in the command above, all nodes would be expected to be identical.
+5. Use the dropdown menu beneath each graph to show a table of the differences found between two sets of groups.
+   If required, information on shared fields can be found under ``output_dir/results/Paired_Comparisons``.
+6. Scroll down the webpage to the performance review. Identify if any of the discovered performance results could be
+   indicative of a larger issue.
+7. Examine the ``_performance`` and ``_perf_summary`` files if you require any more information.

From 2ddc2a2e5a9ab8c77b5ad22ae200c27a31f52ec2 Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Thu, 30 May 2024 10:02:22 +0100
Subject: [PATCH 14/42] Fix formatting

---
 doc/source/operations/baremetal-node-management.rst | 6 ++----
 doc/source/operations/gpu-in-openstack.rst | 2 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/doc/source/operations/baremetal-node-management.rst b/doc/source/operations/baremetal-node-management.rst
index f5c82bd5ce..25a12760d6 100644
--- a/doc/source/operations/baremetal-node-management.rst
+++ b/doc/source/operations/baremetal-node-management.rst
@@ -99,13 +99,11 @@ Static physical network configuration is managed via Kayobe.
 
       .. code-block:: shell
 
-      The interface is then partially configured:
+   The interface is then partially configured:
 
       .. code-block:: shell
 
-      For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network:
-
-      .. 
code-block:: shell + For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network: **NOTE**: You only need to do this if Ironic isn't aware of the node. diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 66170c6800..259e39e8c1 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -1,5 +1,3 @@ -.. include:: vars.rst - ============================= Support for GPUs in OpenStack ============================= From 828f42c5b223f482adddad91073b5608a8430038 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Fri, 14 Jun 2024 16:24:55 +0200 Subject: [PATCH 15/42] Merge drive replacement related sections into one --- doc/source/operations/ceph-management.rst | 116 ++++++++++------------ 1 file changed, 53 insertions(+), 63 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 3c82a3ffe0..46de62fc83 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -89,11 +89,60 @@ And then remove the host from inventory (usually in Additional options/commands may be found in `Host management `_ -Replacing a Failed Ceph Drive ------------------------------ +Replacing failing drive +----------------------- -Once an OSD has been identified as having a hardware failure, -the affected drive will need to be replaced. +A failing drive in a Ceph cluster will cause OSD daemon to crash. +In this case Ceph will go into `HEALTH_WARN` state. +Ceph can report details about failed OSDs by running: + +.. code-block:: console + # From storage host + sudo cephadm shell + ceph health detail + +.. note :: + + Remember to run ceph/rbd commands from within ``cephadm shell`` + (preferred method) or after installing Ceph client. Details in the + official `documentation `__. + It is also required that the host where commands are executed has admin + Ceph keyring present - easiest to achieve by applying + `_admin `__ + label (Ceph MON servers have it by default when using + `StackHPC Cephadm collection `__). + +A failed OSD will also be reported as down by running: + +.. code-block:: console + + ceph osd tree + +Note the ID of the failed OSD. + +The failed disk is usually logged by the Linux kernel too: + +.. code-block:: console + + # From storage host + dmesg -T + +Cross-reference the hardware device and OSD ID to ensure they match. +(Using `pvs` and `lvs` may help make this connection). + +See upstream documentation: +https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd + +In case where disk holding DB and/or WAL fails, it is necessary to recreate +all OSDs that are associated with this disk - usually NVMe drive. The +following single command is sufficient to identify which OSDs are tied to +which physical disks: + +.. code-block:: console + + ceph device ls + +Once OSDs on failed disks are identified, follow procedure below. If rebooting a Ceph node, first set ``noout`` to prevent excess data movement: @@ -130,25 +179,6 @@ spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). Either set ``unmanaged: true`` to stop cephadm from picking up new disks or modify it in some way that it no longer matches the drives you want to remove. 
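For illustration only, a minimal sketch of what such a change might look like in
``etc/kayobe/cephadm.yml`` — the service ID, placement and device filter below are
placeholders rather than values taken from this configuration:

.. code-block:: yaml

   cephadm_osd_spec:
     service_type: osd
     service_id: osd_spec_default
     placement:
       host_pattern: "*"
     data_devices:
       all: true
     # Stop cephadm from automatically creating OSDs on newly visible disks.
     unmanaged: true
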
- -Operations -========== - -Replacing drive ---------------- - -See upstream documentation: -https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd - -In case where disk holding DB and/or WAL fails, it is necessary to recreate -(using replacement procedure above) all OSDs that are associated with this -disk - usually NVMe drive. The following single command is sufficient to -identify which OSDs are tied to which physical disks: - -.. code-block:: console - - ceph device ls - Host maintenance ---------------- @@ -163,46 +193,6 @@ https://docs.ceph.com/en/latest/cephadm/upgrade/ Troubleshooting =============== -Investigating a Failed Ceph Drive ---------------------------------- - -A failing drive in a Ceph cluster will cause OSD daemon to crash. -In this case Ceph will go into `HEALTH_WARN` state. -Ceph can report details about failed OSDs by running: - -.. code-block:: console - - ceph health detail - -.. note :: - - Remember to run ceph/rbd commands from within ``cephadm shell`` - (preferred method) or after installing Ceph client. Details in the - official `documentation `__. - It is also required that the host where commands are executed has admin - Ceph keyring present - easiest to achieve by applying - `_admin `__ - label (Ceph MON servers have it by default when using - `StackHPC Cephadm collection `__). - -A failed OSD will also be reported as down by running: - -.. code-block:: console - - ceph osd tree - -Note the ID of the failed OSD. - -The failed disk is usually logged by the Linux kernel too: - -.. code-block:: console - - # From storage host - dmesg -T - -Cross-reference the hardware device and OSD ID to ensure they match. -(Using `pvs` and `lvs` may help make this connection). - Inspecting a Ceph Block Device for a VM --------------------------------------- From a092cba3296c5c4ee25ddcc11a33b5b7774f37cc Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Fri, 14 Jun 2024 16:26:30 +0200 Subject: [PATCH 16/42] Reference Cephadm & Kayobe doc as deployment guide --- doc/source/configuration/cephadm.rst | 8 +++++--- doc/source/operations/ceph-management.rst | 3 +++ 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/doc/source/configuration/cephadm.rst b/doc/source/configuration/cephadm.rst index bcb13cd6ce..19112a0839 100644 --- a/doc/source/configuration/cephadm.rst +++ b/doc/source/configuration/cephadm.rst @@ -1,6 +1,8 @@ -==== -Ceph -==== +.. _cephadm-kayobe: + +================ +Cephadm & Kayobe +================ This section describes how to use the Cephadm integration included in StackHPC Kayobe configuration to deploy Ceph. diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 46de62fc83..fc17571278 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -5,6 +5,9 @@ Managing and Operating Ceph Working with Cephadm ==================== +This documentation provides guide for Ceph operations. For deploying Ceph, +please refer to :ref:`cephadm-kayobe` documentation. 
+ cephadm configuration location ------------------------------ From 5cf8890a604af87fc75f776ca0126f91e968e150 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 12:18:59 +0100 Subject: [PATCH 17/42] Merge Wazuh documents --- doc/source/configuration/wazuh.rst | 35 ++++++++- doc/source/operations/index.rst | 4 +- doc/source/operations/wazuh-operation.rst | 89 ----------------------- 3 files changed, 36 insertions(+), 92 deletions(-) delete mode 100644 doc/source/operations/wazuh-operation.rst diff --git a/doc/source/configuration/wazuh.rst b/doc/source/configuration/wazuh.rst index cd57716d34..ca6e519b17 100644 --- a/doc/source/configuration/wazuh.rst +++ b/doc/source/configuration/wazuh.rst @@ -2,13 +2,20 @@ Wazuh ===== +`Wazuh `_ is a security monitoring platform. +It monitors for: + +* Security-related system events. +* Known vulnerabilities (CVEs) in versions of installed software. +* Misconfigurations in system security. + The short version ================= #. Create an infrastructure VM for the Wazuh manager, and add it to the wazuh-manager group #. Configure the infrastructure VM with kayobe: ``kayobe infra vm host configure`` #. Edit your config under - ``etc/kayobe/inventory/group_vars/wazuh-manager/wazuh-manager``, in + ``$KAYOBE_CONFIG_PATHinventory/group_vars/wazuh-manager/wazuh-manager``, in particular the defaults assume that the ``provision_oc_net`` network will be used. #. Generate secrets: ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml`` @@ -233,9 +240,12 @@ You may need to modify some of the variables, including: - etc/kayobe/wazuh-manager.yml - etc/kayobe/inventory/group_vars/wazuh/wazuh-agent/wazuh-agent +You'll need to run ``wazuh-manager.yml`` playbook again to apply customisation. + Secrets ------- +Wazuh requires that secrets or passwords are set for itself and the services with which it communiticates. Wazuh secrets playbook is located in ``etc/kayobe/ansible/wazuh-secrets.yml``. Running this playbook will generate and put pertinent security items into secrets vault file which will be placed in ``$KAYOBE_CONFIG_PATH/wazuh-secrets.yml``. @@ -250,6 +260,10 @@ It will be used by wazuh secrets playbook to generate wazuh secrets vault file. kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml +.. note:: Use ``ansible-vault`` to view the secrets: + + ``ansible-vault view --vault-password-file ~/vault.password $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml`` + Configure Wazuh Dashboard's Server Host --------------------------------------- @@ -390,6 +404,25 @@ Deploy the Wazuh agents: ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml`` +The Wazuh Agent is deployed to all hosts in the ``wazuh-agent`` +inventory group, comprising the ``seed`` group +plus the ``overcloud`` group (containing all hosts in the +OpenStack control plane). + +.. code-block:: ini + + [wazuh-agent:children] + seed + overcloud + +The hosts running Wazuh Agent should automatically be registered +and visible within the Wazuh Manager dashboard. + +.. note:: It is good practice to use a `Kayobe deploy hook + `_ + to automate deployment and configuration of the Wazuh Agent + following a run of ``kayobe overcloud host configure``. 
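As an illustration of that note, one possible way to register such a hook, assuming
the standard Kayobe hook layout (the ``50-`` prefix is only an ordering convention
chosen here):

.. code-block:: console

   mkdir -p $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d
   cd $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d
   ln -s ../../../ansible/wazuh-agent.yml 50-wazuh-agent.yml
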
+ Verification ------------ diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index aea4139980..ae8a71901e 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -1,6 +1,6 @@ -================= +============== Operator Guide -================= +============== This guide is for operators of the StackHPC Kayobe configuration project. diff --git a/doc/source/operations/wazuh-operation.rst b/doc/source/operations/wazuh-operation.rst deleted file mode 100644 index 3f56c24603..0000000000 --- a/doc/source/operations/wazuh-operation.rst +++ /dev/null @@ -1,89 +0,0 @@ -======================= -Wazuh Security Platform -======================= - -`Wazuh `_ is a security monitoring platform. -It monitors for: - -* Security-related system events. -* Known vulnerabilities (CVEs) in versions of installed software. -* Misconfigurations in system security. - -One method for deploying and maintaining Wazuh is the `official -Ansible playbooks `_. These -can be integrated into ``kayobe-config`` as a custom playbook. - -Configuring Wazuh Manager -========================= - -Wazuh Manager is configured by editing the ``wazuh-manager.yml`` -groups vars file found at -``etc/kayobe/inventory/group_vars/wazuh-manager/``. This file -controls various aspects of Wazuh Manager configuration. -Most notably: - -*domain_name*: - The domain used by Search Guard CE when generating certificates. - -*wazuh_manager_ip*: - The IP address that the Wazuh Manager shall reside on for communicating with the agents. - -*wazuh_manager_connection*: - Used to define port and protocol for the manager to be listening on. - -*wazuh_manager_authd*: - Connection settings for the daemon responsible for registering new agents. - -Running ``kayobe playbook run -$KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml`` will deploy these -changes. - -Secrets -------- - -Wazuh requires that secrets or passwords are set for itself and the services with which it communiticates. -The playbook ``etc/kayobe/ansible/wazuh-secrets.yml`` automates the creation of these secrets, which should then be encrypted with Ansible Vault. - -To update the secrets you can execute the following two commands - -.. code-block:: shell - - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml \ - -e wazuh_user_pass=$(uuidgen) \ - -e wazuh_admin_pass=$(uuidgen) - kayobe# ansible-vault encrypt --vault-password-file \ - $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml - -Once generated, run ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml`` which copies the secrets into place. - -.. note:: Use ``ansible-vault`` to view the secrets: - - ``ansible-vault view --vault-password-file ~/vault.password $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml`` - -Adding a New Agent ------------------- -The Wazuh Agent is deployed to all hosts in the ``wazuh-agent`` -inventory group, comprising the ``seed`` group -plus the ``overcloud`` group (containing all hosts in the -OpenStack control plane). - -.. code-block:: ini - - [wazuh-agent:children] - seed - overcloud - -The following playbook deploys the Wazuh Agent to all hosts in the -``wazuh-agent`` group: - -.. code-block:: shell - - kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml - -The hosts running Wazuh Agent should automatically be registered -and visible within the Wazuh Manager dashboard. - -.. 
note:: It is good practice to use a `Kayobe deploy hook - `_ - to automate deployment and configuration of the Wazuh Agent - following a run of ``kayobe overcloud host configure``. From 342a4a2d9740fe89d665805484f1daf24ca5e103 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 12:43:53 +0100 Subject: [PATCH 18/42] Update old contents --- .../operations/control-plane-operation.rst | 62 +++++++++++++---- doc/source/operations/customising-horizon.rst | 67 +++---------------- .../operations/openstack-reconfiguration.rst | 45 ------------- 3 files changed, 58 insertions(+), 116 deletions(-) diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index 8212b95fba..c7b10e75f4 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -294,27 +294,61 @@ node is powered back on. Software Updates ================ -Update Packages on Control Plane --------------------------------- +Update Host Packages on Control Plane +------------------------------------- -OS packages can be updated with: +The host packages and Kolla container images are distributed from `StackHPC Release Train +`__ to ensure tested and reliable +software releases are provided. + +Syncing new StackHPC Release Train contents to local Pulp server is needed before updating +host packages and/or Kolla services. + +To sync host packages: + +.. code-block:: console + + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml + +If the system is production environment and want to use packages tested in test/staging +environment, you can promote them by: + +.. code-block:: console + + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml + +To sync container images: .. code-block:: console - kayobe# kayobe overcloud host package update --limit --packages '*' - kayobe# kayobe overcloud seed package update --packages '*' + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml + +Once sync with StackHPC Release Train is done, new contents will be accessible from local +Pulp server. + +Host packages can be updated with: + +.. code-block:: console + + kayobe# kayobe overcloud host package update --limit --packages '*' + kayobe# kayobe seed host package update --packages '*' See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages -Minor Upgrades to OpenStack Services ------------------------------------- +Upgrading OpenStack Services +---------------------------- + +* Update tags for the images in ``etc/kayobe/kolla-image-tags.yml`` to use the new value of ``kolla_openstack_release`` +* Pull container images to overcloud hosts with ``kayobe overcloud container image pull`` +* Run ``kayobe overcloud service upgrade`` + +You can update the subset of containers or hosts by + +.. 
code-block:: console -* Pull latest changes from upstream stable branch to your own ``kolla`` fork (if applicable) -* Update ``kolla_openstack_release`` in ``etc/kayobe/kolla.yml`` (unless using default) -* Update tags for the images in ``etc/kayobe/kolla/globals.yml`` to use the new value of ``kolla_openstack_release`` -* Rebuild container images -* Pull container images to overcloud hosts -* Run kayobe overcloud service upgrade + kayobe# kayobe overcloud service upgrade --kolla-tags --limit --kolla-limit For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html @@ -339,7 +373,7 @@ To boot an instance on a specific hypervisor .. code-block:: console - openstack server create --flavor --network --key-name --image --availability-zone nova:: + openstack server create --flavor --network --key-name --image --os-compute-api-version 2.74 --host Cleanup Procedures ================== diff --git a/doc/source/operations/customising-horizon.rst b/doc/source/operations/customising-horizon.rst index 4b2e157d86..d1fcd5e65b 100644 --- a/doc/source/operations/customising-horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -46,79 +46,35 @@ Further reading: * https://docs.openstack.org/horizon/latest/configuration/themes.html * https://docs.openstack.org/horizon/latest/configuration/branding.html -Building a Horizon container image with custom theme ----------------------------------------------------- +Adding the custom theme +----------------------- -Building a custom container image for Horizon can be done by modifying -``kolla.yml`` to fetch the custom theme and include it in the image: +Create a directory and transfer custom theme files to it ``$KAYOBE_CONFIG_PATH/kolla/config/horizon/themes/``. -.. code-block:: yaml - - kolla_sources: - horizon-additions-theme-: - type: "git" - location: - reference: master - - kolla_build_blocks: - horizon_footer: | - # Binary images cannot use the additions mechanism. - {% raw %} - {% if install_type == 'source' %} - ADD additions-archive / - RUN mkdir -p /etc/openstack-dashboard/themes/ \ - && cp -R /additions/horizon-additions-theme--archive-master/* /etc/openstack-dashboard/themes// \ - && chown -R horizon: /etc/openstack-dashboard/themes - {% endif %} - {% endraw %} - -If using a specific container image tag, don't forget to set: +Define the custom theme in ``etc/kayobe/kolla/globals.yml`` .. code-block:: yaml - - kolla_tag: mytag - -Build the image with: - -.. code-block:: console - - kayobe overcloud container image build horizon -e kolla_install_type=source --push - -Pull the new Horizon container to the controller: - -.. code-block:: console - - kayobe overcloud container image pull --kolla-tags horizon + horizon_custom_themes: + - name: + label: # This will be the visible name to users Deploy and use the custom theme ------------------------------- -Switch to source image type in ``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: - -.. code-block:: yaml - - horizon_install_type: source - -You may also need to update the container image tag: - -.. code-block:: yaml - - horizon_tag: mytag - Configure Horizon to include the custom theme and use it by default: .. code-block:: console - mkdir -p ${KAYOBE_CONFIG_PATH}/kolla/config/horizon/ + mkdir -p $KAYOBE_CONFIG_PATH/kolla/config/horizon/ -Add to ``${KAYOBE_CONFIG_PATH}/kolla/config/horizon/custom_local_settings``: +Create file ``$KAYOBE_CONFIG_PATH/kolla/config/horizon/custom_local_settings`` and add followings .. 
code-block:: console AVAILABLE_THEMES = [ ('default', 'Default', 'themes/default'), ('material', 'Material', 'themes/material'), - ('', '', '/etc/openstack-dashboard/themes/'), + ('', '', 'themes/'), ] DEFAULT_THEME = '' @@ -137,9 +93,6 @@ Deploy with: Troubleshooting --------------- -Make sure you build source images, as binary images cannot use the addition -mechanism used here. - If the theme is selected but the logo doesn’t load, try running these commands inside the ``horizon`` container: diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index db157f6a91..712a0f779e 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -138,48 +138,3 @@ And then to enable a hypervisor again: admin# openstack compute service set --enable \ nova-compute - -Managing Space in the Docker Registry -===================================== - -If the Docker registry becomes full, this can prevent container updates and -(depending on the storage configuration of the seed host) could lead to other -problems with services provided by the seed host. - -To remove container images from the Docker Registry, follow this process: - -* Reconfigure the registry container to allow deleting containers. This can be - done in ``docker-registry.yml`` with Kayobe: - -.. code-block:: yaml - - docker_registry_env: - REGISTRY_STORAGE_DELETE_ENABLED: "true" - -* For the change to take effect, run: - -.. code-block:: console - - kayobe seed host configure - -* A helper script is useful, such as https://github.com/byrnedo/docker-reg-tool - (this requires ``jq``). To delete all images with a specific tag, use: - -.. code-block:: console - - for repo in `./docker_reg_tool http://registry-ip:4000 list`; do - ./docker_reg_tool http://registry-ip:4000 delete $repo $tag - done - -* Deleting the tag does not actually release the space. To actually free up - space, run garbage collection: - -.. code-block:: console - - seed# docker exec docker_registry bin/registry garbage-collect /etc/docker/registry/config.yml - -The seed host can also accrue a lot of data from building container images. -The images stored locally in the seed host can be seen using ``docker image ls``. - -Old and redundant images can be identified from their names and tags, and -removed using ``docker image rm``. From e5a0f509788252bd9d2a14277b8ec1ada9cb707e Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 14:31:33 +0100 Subject: [PATCH 19/42] Attach Release Train document for more info --- doc/source/configuration/release-train.rst | 2 +- doc/source/operations/control-plane-operation.rst | 11 ++++++++--- doc/source/operations/upgrading-openstack.rst | 2 +- 3 files changed, 10 insertions(+), 5 deletions(-) diff --git a/doc/source/configuration/release-train.rst b/doc/source/configuration/release-train.rst index f77109aff9..5ed9b50c74 100644 --- a/doc/source/configuration/release-train.rst +++ b/doc/source/configuration/release-train.rst @@ -1,4 +1,4 @@ -.. _stackhpc_release_train: +.. _stackhpc-release-train: ====================== StackHPC Release Train diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index c7b10e75f4..ebacc0568a 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -294,8 +294,8 @@ node is powered back on. 
Software Updates ================ -Update Host Packages on Control Plane -------------------------------------- +Sync local Pulp server with StackHPC Release Train +-------------------------------------------------- The host packages and Kolla container images are distributed from `StackHPC Release Train `__ to ensure tested and reliable @@ -325,9 +325,14 @@ To sync container images: kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml +For more information about StackHPC Release Train, see :ref:`stackhpc-release-train` documentation. + Once sync with StackHPC Release Train is done, new contents will be accessible from local Pulp server. +Update Host Packages on Control Plane +------------------------------------- + Host packages can be updated with: .. code-block:: console @@ -340,7 +345,7 @@ See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updat Upgrading OpenStack Services ---------------------------- -* Update tags for the images in ``etc/kayobe/kolla-image-tags.yml`` to use the new value of ``kolla_openstack_release`` +* Update tags for the images in ``etc/kayobe/kolla-image-tags.yml`` * Pull container images to overcloud hosts with ``kayobe overcloud container image pull`` * Run ``kayobe overcloud service upgrade`` diff --git a/doc/source/operations/upgrading-openstack.rst b/doc/source/operations/upgrading-openstack.rst index 0b0df50563..647ac2702c 100644 --- a/doc/source/operations/upgrading-openstack.rst +++ b/doc/source/operations/upgrading-openstack.rst @@ -459,7 +459,7 @@ To upgrade the Ansible control host: Syncing Release Train artifacts ------------------------------- -New :ref:`stackhpc_release_train` content should be synced to the local Pulp +New :ref:`stackhpc-release-train` content should be synced to the local Pulp server. This includes host packages (Deb/RPM) and container images. .. _sync-rt-package-repos: From d170a9e7481891253fbd6d3a6e3cec5fa62cbccd Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 15:13:19 +0100 Subject: [PATCH 20/42] Remove baremetal management doc This doc will need client specific infromation to be helpful --- .../operations/baremetal-node-management.rst | 275 ------------------ doc/source/operations/index.rst | 1 - 2 files changed, 276 deletions(-) delete mode 100644 doc/source/operations/baremetal-node-management.rst diff --git a/doc/source/operations/baremetal-node-management.rst b/doc/source/operations/baremetal-node-management.rst deleted file mode 100644 index 25a12760d6..0000000000 --- a/doc/source/operations/baremetal-node-management.rst +++ /dev/null @@ -1,275 +0,0 @@ -====================================== -Bare Metal Compute Hardware Management -====================================== - -Bare metal compute nodes are managed by the Ironic services. -This section describes elements of the configuration of this service. - -.. _ironic-node-lifecycle: - -Ironic node life cycle ----------------------- - -The deployment process is documented in the `Ironic User Guide `__. -OpenStack deployment uses the -`direct deploy method `__. - -The Ironic state machine can be found `here `__. The rest of -this documentation refers to these states and assumes that you have familiarity. - -High level overview of state transitions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The following section attempts to describe the state transitions for various Ironic operations at a high level. 
-It focuses on trying to describe the steps where dynamic switch reconfiguration is triggered. -For a more detailed overview, refer to the :ref:`ironic-node-lifecycle` section. - -Provisioning -~~~~~~~~~~~~ - -Provisioning starts when an instance is created in Nova using a bare metal flavor. - -- Node starts in the available state (available) -- User provisions an instance (deploying) -- Ironic will switch the node onto the provisioning network (deploying) -- Ironic will power on the node and will await a callback (wait-callback) -- Ironic will image the node with an operating system using the image provided at creation (deploying) -- Ironic switches the node onto the tenant network(s) via neutron (deploying) -- Transition node to active state (active) - -.. _baremetal-management-deprovisioning: - -Deprovisioning -~~~~~~~~~~~~~~ - -Deprovisioning starts when an instance created in Nova using a bare metal flavor is destroyed. - -If automated cleaning is enabled, it occurs when nodes are deprovisioned. - -- Node starts in active state (active) -- User deletes instance (deleting) -- Ironic will remove the node from any tenant network(s) (deleting) -- Ironic will switch the node onto the cleaning network (deleting) -- Ironic will power on the node and will await a callback (clean-wait) -- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) -- Ironic removes node from cleaning network (cleaning) -- Node transitions to available (available) - -If automated cleaning is disabled. - -- Node starts in active state (active) -- User deletes instance (deleting) -- Ironic will remove the node from any tenant network(s) (deleting) -- Node transitions to available (available) - -Cleaning -~~~~~~~~ - -Manual cleaning is not part of the regular state transitions when using Nova, however nodes can be manually cleaned by administrators. - -- Node starts in the manageable state (manageable) -- User triggers cleaning with API (cleaning) -- Ironic will switch the node onto the cleaning network (cleaning) -- Ironic will power on the node and will await a callback (clean-wait) -- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) -- Ironic removes node from cleaning network (cleaning) -- Node transitions back to the manageable state (manageable) - -Rescuing -~~~~~~~~ - -Feature not used. The required rescue network is not currently configured. - -Baremetal networking --------------------- - -Baremetal networking with the Neutron Networking Generic Switch ML2 driver requires a combination of static and dynamic switch configuration. - -.. _static-switch-config: - -Static switch configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Static physical network configuration is managed via Kayobe. - -.. TODO: Fill in the switch configuration - -- Some initial switch configuration is required before networking generic switch can take over the management of an interface. - First, LACP must be configured on the switch ports attached to the baremetal node, e.g: - - .. code-block:: shell - - The interface is then partially configured: - - .. code-block:: shell - - For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network: - - **NOTE**: You only need to do this if Ironic isn't aware of the node. - -Configuration with kayobe -^^^^^^^^^^^^^^^^^^^^^^^^^ - -Kayobe can be used to apply the :ref:`static-switch-config`. - -- Upstream documentation can be found `here `__. 
-- Kayobe does all the switch configuration that isn't :ref:`dynamically updated using Ironic `. -- Optionally switches the node onto the provisioning network (when using ``--enable-discovery``) - - + NOTE: This is a dangerous operation as it can wipe out the dynamic VLAN configuration applied by neutron/ironic. - You should only run this when initially enrolling a node, and should always use the ``interface-description-limit`` option. For example: - - .. code-block:: - - kayobe physical network configure --interface-description-limit --group switches --display --enable-discovery - - In this example, ``--display`` is used to preview the switch configuration without applying it. - -.. TODO: Fill in information about how switches are configured in kayobe-config, with links - -- Configuration is done using a combination of ``group_vars`` and ``host_vars`` - -.. _dynamic-switch-configuration: - -Dynamic switch configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Ironic dynamically configures the switches using the Neutron `Networking Generic Switch `_ ML2 driver. - -- Used to toggle the baremetal nodes onto different networks - - + Can use any VLAN network defined in OpenStack, providing that the VLAN has been trunked to the controllers - as this is required for DHCP to function. - + See :ref:`ironic-node-lifecycle`. This attempts to illustrate when any switch reconfigurations happen. - -- Only configures VLAN membership of the switch interfaces or port groups. To prevent conflicts with the static switch configuration, - the convention used is: after the node is in service in Ironic, VLAN membership should not be manually adjusted and - should be left to be controlled by ironic i.e *don't* use ``--enable-discovery`` without an interface limit when configuring the - switches with kayobe. -- Ironic is configured to use the neutron networking driver. - -.. _ngs-commands: - -Commands that NGS will execute -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Networking Generic Switch is mainly concerned with toggling the ports onto different VLANs. It -cannot fully configure the switch. - -.. TODO: Fill in the switch configuration - -- Switching the port onto the provisioning network - - .. code-block:: shell - -- Switching the port onto the tenant network. - - .. code-block:: shell - -- When deleting the instance, the VLANs are removed from the port. Using: - - .. code-block:: shell - -NGS will save the configuration after each reconfiguration (by default). - -Ports managed by NGS -^^^^^^^^^^^^^^^^^^^^ - -The command below extracts a list of port UUID, node UUID and switch port information. - -.. code-block:: bash - - openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value - -NGS will manage VLAN membership for ports when the ``local_link_connection`` fields match one of the switches in ``ml2_conf.ini``. -The rest of the switch configuration is static. -The switch configuration that NGS will apply to these ports is detailed in :ref:`dynamic-switch-configuration`. - -.. _ironic-node-discovery: - -Ironic node discovery ---------------------- - -Discovery is a process used to automatically enrol new nodes in Ironic. -It works by PXE booting the nodes into the Ironic Python Agent (IPA) ramdisk. -This ramdisk will collect hardware and networking configuration from the node in a process known as introspection. -This data is used to populate the baremetal node object in Ironic. 
-The series of steps you need to take to enrol a new node is as follows: - -- Configure credentials on the BMC. These are needed for Ironic to be able to perform power control actions. - -- Controllers should have network connectivity with the target BMC. - -- (If kayobe manages physical network) Add any additional switch configuration to kayobe config. - The minimal switch configuration that kayobe needs to know about is described in :ref:`tor-switch-configuration`. - -- Apply any :ref:`static switch configration `. This performs the initial - setup of the switchports that is needed before Ironic can take over. The static configuration - will not be modified by Ironic, so it should be safe to reapply at any point. See :ref:`ngs-commands` - for details about the switch configuation that Networking Generic Switch will apply. - -- (If kayobe manages physical network) Put the node onto the provisioning network by using the - ``--enable-discovery`` flag and either ``--interface-description-limit`` or ``--interface-limit`` - (do not run this command without one of these limits). See :ref:`static-switch-config`. - - * This is only necessary to initially discover the node. Once the node is in registered in Ironic, - it will take over control of the the VLAN membership. See :ref:`dynamic-switch-configuration`. - - * This provides ethernet connectivity with the controllers over the `workload provisioning` network - -- (If kayobe doesn't manage physical network) Put the node onto the provisioning network. - -.. TODO: link to the relevant file in kayobe config - -- Add node to the kayobe inventory. - -.. TODO: Fill in details about necessary BIOS & RAID config - -- Apply any necesary BIOS & RAID configuration. - -.. TODO: Fill in details about how to trigger a PXE boot - -- PXE boot the node. - -- If the discovery process is successful, the node will appear in Ironic and will get populated with the necessary information from the hardware inspection process. - -.. TODO: Link to the Kayobe inventory in the repo - -- Add node to the Kayobe inventory in the ``baremetal-compute`` group. - -- The node will begin in the ``enroll`` state, and must be moved first to ``manageable``, then ``available`` before it can be used. - - If Ironic automated cleaning is enabled, the node must complete a cleaning process before it can reach the available state. - - * Use Kayobe to attempt to move the node to the ``available`` state. - - .. code-block:: console - - source etc/kolla/public-openrc.sh - kayobe baremetal compute provide --limit - -- Once the node is in the ``available`` state, Nova will make the node available for scheduling. This happens periodically, and typically takes around a minute. - -.. _tor-switch-configuration: - -Top of Rack (ToR) switch configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Networking Generic Switch must be aware of the Top-of-Rack switch connected to the new node. -Switches managed by NGS are configured in ``ml2_conf.ini``. - -.. TODO: Fill in details about how switches are added to NGS config in kayobe-config - -After adding switches to the NGS configuration, Neutron must be redeployed. - -Considerations when booting baremetal compared to VMs ------------------------------------------------------- - -- You can only use networks of type: vlan -- Without using trunk ports, it is only possible to directly attach one network to each port or port group of an instance. 
- - * To access other networks you can use routers - * You can still attach floating IPs - -- Instances take much longer to provision (expect at least 15 mins) -- When booting an instance use one of the flavors that maps to a baremetal node via the RESOURCE_CLASS configured on the flavor. diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index ae8a71901e..f130cee9f9 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -7,7 +7,6 @@ This guide is for operators of the StackHPC Kayobe configuration project. .. toctree:: :maxdepth: 1 - baremetal-node-management ceph-management control-plane-operation customsing-horizon From a2833f5eba0ea04bba0920192747e4e97504dccb Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 15:26:49 +0100 Subject: [PATCH 21/42] Fix formatting --- doc/source/operations/ceph-management.rst | 1 + doc/source/operations/control-plane-operation.rst | 8 ++++---- doc/source/operations/customising-horizon.rst | 1 + doc/source/operations/gpu-in-openstack.rst | 6 +++--- doc/source/operations/index.rst | 2 +- 5 files changed, 10 insertions(+), 8 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index fc17571278..aac620e33d 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -100,6 +100,7 @@ In this case Ceph will go into `HEALTH_WARN` state. Ceph can report details about failed OSDs by running: .. code-block:: console + # From storage host sudo cephadm shell ceph health detail diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index ebacc0568a..f81111dc01 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -71,10 +71,10 @@ hypervisor will produce several alerts: * ``PrometheusTargetMissing`` from several Prometheus exporters Rather than silencing each alert one by one for a specific host, a silence can -apply to multiple alerts using a reduced list of labels. :ref:`Log into -Alertmanager `, click on the ``Silence`` button next -to an alert and adjust the matcher list to keep only ``instance=`` -label. Then, create another silence to match ``hostname=`` (this is +apply to multiple alerts using a reduced list of labels. Log into Alertmanager, +click on the ``Silence`` button next to an alert and adjust the matcher list +to keep only ``instance=`` label. +Then, create another silence to match ``hostname=`` (this is required because, for the OpenStack exporter, the instance is the host running the monitoring service rather than the host being monitored). diff --git a/doc/source/operations/customising-horizon.rst b/doc/source/operations/customising-horizon.rst index d1fcd5e65b..096ce2e561 100644 --- a/doc/source/operations/customising-horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -54,6 +54,7 @@ Create a directory and transfer custom theme files to it ``$KAYOBE_CONFIG_PATH/k Define the custom theme in ``etc/kayobe/kolla/globals.yml`` .. 
code-block:: yaml + horizon_custom_themes: - name: label: # This will be the visible name to users diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 259e39e8c1..4979d3f25b 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -971,9 +971,9 @@ Once this code has taken effect (after a reboot), the VFIO kernel drivers should # lspci -nnk -s 3d:00.0 3d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Tesla M10] [10de:13bd] (rev a2) - Subsystem: NVIDIA Corporation Tesla M10 [10de:1160] - Kernel driver in use: vfio-pci - Kernel modules: nouveau + Subsystem: NVIDIA Corporation Tesla M10 [10de:1160] + Kernel driver in use: vfio-pci + Kernel modules: nouveau IOMMU should be enabled at kernel level as well - we can verify that on the compute host: diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index f130cee9f9..2408b1a36e 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -9,7 +9,7 @@ This guide is for operators of the StackHPC Kayobe configuration project. ceph-management control-plane-operation - customsing-horizon + customising-horizon gpu-in-openstack hardware-inventory-management hotfix-playbook From 124a2a38d6dece71abbbf4e10a0087b222b74844 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 9 Jul 2024 12:05:36 +0100 Subject: [PATCH 22/42] Adding missing / --- doc/source/configuration/wazuh.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/configuration/wazuh.rst b/doc/source/configuration/wazuh.rst index ca6e519b17..49462f86ea 100644 --- a/doc/source/configuration/wazuh.rst +++ b/doc/source/configuration/wazuh.rst @@ -15,7 +15,7 @@ The short version #. Create an infrastructure VM for the Wazuh manager, and add it to the wazuh-manager group #. Configure the infrastructure VM with kayobe: ``kayobe infra vm host configure`` #. Edit your config under - ``$KAYOBE_CONFIG_PATHinventory/group_vars/wazuh-manager/wazuh-manager``, in + ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-manager``, in particular the defaults assume that the ``provision_oc_net`` network will be used. #. Generate secrets: ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml`` From e12e8fa16f22a97e8aa79e8340e11e2b284e701b Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 12 Sep 2024 13:49:59 +0100 Subject: [PATCH 23/42] Update content to Antelope and misc changes --- doc/source/configuration/wazuh.rst | 36 +++++++------- doc/source/operations/ceph-management.rst | 12 ++--- .../operations/control-plane-operation.rst | 47 +++---------------- doc/source/operations/customising-horizon.rst | 6 +-- 4 files changed, 33 insertions(+), 68 deletions(-) diff --git a/doc/source/configuration/wazuh.rst b/doc/source/configuration/wazuh.rst index 49462f86ea..40b8a973ca 100644 --- a/doc/source/configuration/wazuh.rst +++ b/doc/source/configuration/wazuh.rst @@ -34,14 +34,14 @@ Provisioning an infra VM for Wazuh Manager. Kayobe supports :kayobe-doc:`provisioning infra VMs `. The following configuration may be used as a guide. Config for infra VMs is documented :kayobe-doc:`here `. -Add a Wazuh Manager host to the ``wazuh-manager`` group in ``etc/kayobe/inventory/hosts``. +Add a Wazuh Manager host to the ``wazuh-manager`` group in ``$KAYOBE_CONFIG_PATH/inventory/hosts``. .. 
code-block:: ini [wazuh-manager] os-wazuh -Add the ``wazuh-manager`` group to the ``infra-vms`` group in ``etc/kayobe/inventory/groups``. +Add the ``wazuh-manager`` group to the ``infra-vms`` group in ``$KAYOBE_CONFIG_PATH/inventory/groups``. .. code-block:: ini @@ -50,7 +50,7 @@ Add the ``wazuh-manager`` group to the ``infra-vms`` group in ``etc/kayobe/inven [infra-vms:children] wazuh-manager -Define VM sizing in ``etc/kayobe/inventory/group_vars/wazuh-manager/infra-vms``: +Define VM sizing in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/infra-vms``: .. code-block:: yaml @@ -64,7 +64,7 @@ Define VM sizing in ``etc/kayobe/inventory/group_vars/wazuh-manager/infra-vms``: # Capacity of the infra VM data volume. infra_vm_data_capacity: "200G" -Optional: define LVM volumes in ``etc/kayobe/inventory/group_vars/wazuh-manager/lvm``. +Optional: define LVM volumes in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/lvm``. ``/var/ossec`` often requires greater storage space, and ``/var/lib/wazuh-indexer`` may be beneficial too. @@ -86,7 +86,7 @@ may be beneficial too. create: true -Define network interfaces ``etc/kayobe/inventory/group_vars/wazuh-manager/network-interfaces``: +Define network interfaces ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/network-interfaces``: (The following is an example - the names will depend on your particular network configuration.) @@ -98,7 +98,7 @@ Define network interfaces ``etc/kayobe/inventory/group_vars/wazuh-manager/networ The Wazuh manager may need to be exposed externally, in which case it may require another interface. -This can be done as follows in ``etc/kayobe/inventory/group_vars/wazuh-manager/network-interfaces``, +This can be done as follows in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/network-interfaces``, with the network defined in ``networks.yml`` as usual. .. code-block:: yaml @@ -190,7 +190,7 @@ Deploying Wazuh Manager services Setup ----- -To install a specific version modify the wazuh-ansible entry in ``etc/kayobe/ansible/requirements.yml``: +To install a specific version modify the wazuh-ansible entry in ``$KAYOBE_CONFIG_PATH/ansible/requirements.yml``: .. code-block:: yaml @@ -211,7 +211,7 @@ Edit the playbook and variables to your needs: Wazuh manager configuration --------------------------- -Wazuh manager playbook is located in ``etc/kayobe/ansible/wazuh-manager.yml``. +Wazuh manager playbook is located in ``$KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml``. Running this playbook will: * generate certificates for wazuh-manager @@ -221,7 +221,7 @@ Running this playbook will: * setup and deploy wazuh-dashboard on wazuh-manager vm * copy certificates over to wazuh-manager vm -Wazuh manager variables file is located in ``etc/kayobe/inventory/group_vars/wazuh-manager/wazuh-manager``. +Wazuh manager variables file is located in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-manager``. You may need to modify some of the variables, including: @@ -232,13 +232,13 @@ You may need to modify some of the variables, including: If you are using multiple environments, and you need to customise Wazuh in each environment, create override files in an appropriate directory, - for example ``etc/kayobe/environments/production/inventory/group_vars/``. + for example ``$KAYOBE_CONFIG_PATH/environments/production/inventory/group_vars/``. 
Files which values can be overridden (in the context of Wazuh): - - etc/kayobe/inventory/group_vars/wazuh/wazuh-manager/wazuh-manager - - etc/kayobe/wazuh-manager.yml - - etc/kayobe/inventory/group_vars/wazuh/wazuh-agent/wazuh-agent + - $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh/wazuh-manager/wazuh-manager + - $KAYOBE_CONFIG_PATH/wazuh-manager.yml + - $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh/wazuh-agent/wazuh-agent You'll need to run ``wazuh-manager.yml`` playbook again to apply customisation. @@ -246,13 +246,13 @@ Secrets ------- Wazuh requires that secrets or passwords are set for itself and the services with which it communiticates. -Wazuh secrets playbook is located in ``etc/kayobe/ansible/wazuh-secrets.yml``. +Wazuh secrets playbook is located in ``$KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml``. Running this playbook will generate and put pertinent security items into secrets vault file which will be placed in ``$KAYOBE_CONFIG_PATH/wazuh-secrets.yml``. If using environments it ends up in ``$KAYOBE_CONFIG_PATH/environments//wazuh-secrets.yml`` Remember to encrypt! -Wazuh secrets template is located in ``etc/kayobe/ansible/templates/wazuh-secrets.yml.j2``. +Wazuh secrets template is located in ``$KAYOBE_CONFIG_PATH/ansible/templates/wazuh-secrets.yml.j2``. It will be used by wazuh secrets playbook to generate wazuh secrets vault file. @@ -380,7 +380,7 @@ Verification ------------ The Wazuh portal should be accessible on port 443 of the Wazuh -manager’s IPs (using HTTPS, with the root CA cert in ``etc/kayobe/ansible/wazuh/certificates/wazuh-certificates/root-ca.pem``). +manager’s IPs (using HTTPS, with the root CA cert in ``$KAYOBE_CONFIG_PATH/ansible/wazuh/certificates/wazuh-certificates/root-ca.pem``). The first login should be as the admin user, with the opendistro_admin_password password in ``$KAYOBE_CONFIG_PATH/wazuh-secrets.yml``. This will create the necessary indices. @@ -392,9 +392,9 @@ Logs are in ``/var/log/wazuh-indexer/wazuh.log``. There are also logs in the jou Wazuh agents ============ -Wazuh agent playbook is located in ``etc/kayobe/ansible/wazuh-agent.yml``. +Wazuh agent playbook is located in ``$KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml``. -Wazuh agent variables file is located in ``etc/kayobe/inventory/group_vars/wazuh-agent/wazuh-agent``. +Wazuh agent variables file is located in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-agent/wazuh-agent``. You may need to modify some variables, including: diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index aac620e33d..67e3a5899f 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -8,14 +8,14 @@ Working with Cephadm This documentation provides guide for Ceph operations. For deploying Ceph, please refer to :ref:`cephadm-kayobe` documentation. -cephadm configuration location +Cephadm configuration location ------------------------------ In kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific Kayobe environment when using multiple environment, e.g. ``etc/kayobe/environments//cephadm.yml``) -StackHPC's cephadm Ansible collection relies on multiple inventory groups: +StackHPC's Cephadm Ansible collection relies on multiple inventory groups: - ``mons`` - ``mgrs`` @@ -24,11 +24,11 @@ StackHPC's cephadm Ansible collection relies on multiple inventory groups: Those groups are usually defined in ``etc/kayobe/inventory/groups``. 
-Running cephadm playbooks +Running Cephadm playbooks ------------------------- In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of -cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. +Cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. - ``cephadm.yml`` - runs the end to end process starting with deployment and defining EC profiles/crush rules/pools and users @@ -176,11 +176,11 @@ Remove the OSD using Ceph orchestrator command: ceph orch osd rm --replace After removing OSDs, if the drives the OSDs were deployed on once again become -available, cephadm may automatically try to deploy more OSDs on these drives if +available, Cephadm may automatically try to deploy more OSDs on these drives if they match an existing drivegroup spec. If this is not your desired action plan - it's best to modify the drivegroup spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). -Either set ``unmanaged: true`` to stop cephadm from picking up new disks or +Either set ``unmanaged: true`` to stop Cephadm from picking up new disks or modify it in some way that it no longer matches the drives you want to remove. Host maintenance diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index f81111dc01..d3440c97db 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -26,7 +26,7 @@ Monitoring ---------- * `Back up InfluxDB `__ -* `Back up ElasticSearch `__ +* `Back up OpenSearch `__ * `Back up Prometheus `__ Seed @@ -42,8 +42,8 @@ Ansible control host Control Plane Monitoring ======================== -The control plane has been configured to collect logs centrally using the EFK -stack (Elasticsearch, Fluentd and Kibana). +The control plane has been configured to collect logs centrally using Fluentd, +OpenSearch and OpenSearch Dashboards. Telemetry monitoring of the control plane is performed by Prometheus. Metrics are collected by Prometheus exporters, which are either running on all hosts @@ -227,7 +227,7 @@ Overview * Remove the node from maintenance mode in bifrost * Bifrost should automatically power on the node via IPMI * Check that all docker containers are running -* Check Kibana for any messages with log level ERROR or equivalent +* Check OpenSearch Dashboards for any messages with log level ERROR or equivalent Controllers ----------- @@ -277,7 +277,7 @@ Stop all Docker containers: .. code-block:: console - monitoring0# for i in `docker ps -q`; do docker stop $i; done + monitoring0# for i in `docker ps -a`; do systemctl stop kolla-$i-container; done Shut down the node: @@ -342,21 +342,6 @@ Host packages can be updated with: See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages -Upgrading OpenStack Services ----------------------------- - -* Update tags for the images in ``etc/kayobe/kolla-image-tags.yml`` -* Pull container images to overcloud hosts with ``kayobe overcloud container image pull`` -* Run ``kayobe overcloud service upgrade`` - -You can update the subset of containers or hosts by - -.. code-block:: console - - kayobe# kayobe overcloud service upgrade --kolla-tags --limit --kolla-limit - -For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html - Troubleshooting =============== @@ -378,27 +363,7 @@ To boot an instance on a specific hypervisor .. 
code-block:: console - openstack server create --flavor --network --key-name --image --os-compute-api-version 2.74 --host - -Cleanup Procedures -================== - -OpenStack services can sometimes fail to remove all resources correctly. This -is the case with Magnum, which fails to clean up users in its domain after -clusters are deleted. `A patch has been submitted to stable branches -`__. -Until this fix becomes available, if Magnum is in use, administrators can -perform the following cleanup procedure regularly: - -.. code-block:: console - - for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do - if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then - echo "$user still in use, not deleting" - else - openstack user delete --domain magnum $user - fi - done + openstack server create --flavor --network --key-name --image --os-compute-api-version 2.74 --host OpenSearch indexes retention ============================= diff --git a/doc/source/operations/customising-horizon.rst b/doc/source/operations/customising-horizon.rst index 096ce2e561..a39973e0f9 100644 --- a/doc/source/operations/customising-horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -113,6 +113,6 @@ If the ``horizon`` container is restarting with the following error: /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py compress --force CommandError: An error occurred during rendering /var/lib/kolla/venv/lib/python3.6/site-packages/openstack_dashboard/templates/horizon/_scripts.html: Couldn't find any precompiler in COMPRESS_PRECOMPILERS setting for mimetype '\'text/javascript\''. -It can be resolved by dropping cached content with ``docker restart -memcached``. Note this will log out users from Horizon, as Django sessions are -stored in Memcached. +It can be resolved by dropping cached content with ``systemctl restart +kolla-memcached-container``. Note this will log out users from Horizon, as Django +sessions are stored in Memcached. From 7d41f5b206649a0da2a94e2de30a2fa54a3c0581 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 16 Sep 2024 14:03:13 +0100 Subject: [PATCH 24/42] Update Cephadm playbook info --- doc/source/operations/ceph-management.rst | 33 ++++++++++++++++------- 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 67e3a5899f..3132c42bae 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -30,16 +30,31 @@ Running Cephadm playbooks In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of Cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. -- ``cephadm.yml`` - runs the end to end process starting with deployment and - defining EC profiles/crush rules/pools and users -- ``cephadm-crush-rules.yml`` - defines Ceph crush rules according -- ``cephadm-deploy.yml`` - runs the bootstrap/deploy playbook without the +``cephadm.yml`` runs the end to end process of Cephadm deployment and +configuration. It is composed with following list of other Cephadm playbooks +and they can be run separately. 
+
+- ``cephadm-deploy.yml`` - Runs the bootstrap/deploy playbook without the
  additional playbooks
-- ``cephadm-ec-profiles.yml`` - defines Ceph EC profiles
-- ``cephadm-gather-keys.yml`` - gather Ceph configuration and keys and populate
-  kayobe-config
-- ``cephadm-keys.yml`` - defines Ceph users/keys
-- ``cephadm-pools.yml`` - defines Ceph pools\
+
+- ``cephadm-commands-pre.yml`` - Runs Ceph commands before post-deployment
+  configuration (You can set a list of commands at ``cephadm_commands_pre_extra``
+  in ``cephadm.yml``)
+- ``cephadm-ec-profiles.yml`` - Defines Ceph EC profiles
+- ``cephadm-crush-rules.yml`` - Defines Ceph crush rules according
+- ``cephadm-pools.yml`` - Defines Ceph pools
+- ``cephadm-keys.yml`` - Defines Ceph users/keys
+- ``cephadm-commands-post.yml`` - Runs Ceph commands after post-deployment
+  configuration (You can set a list of commands at ``cephadm_commands_post_extra``
+  in ``cephadm.yml``)
+
+There are also other Ceph playbooks that are not part of ``cephadm.yml``
+
+- ``cephadm-gather-keys.yml`` - Populate ``ceph.conf`` in kayobe-config by
+  gathering Ceph configuration and keys
+- ``ceph-enter-maintenance.yml`` - Set Ceph to maintenance mode for storage
+  hosts (Can limit the hosts with ``-l ``)
+- ``ceph-exit-maintenance.yml`` - Unset Ceph to maintenance mode for storage
+  hosts (Can limit the hosts with ``-l ``)
Running Ceph commands
---------------------
From 109ca13c62d978c90682d495a592e470b73927c4 Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Mon, 16 Sep 2024 16:26:20 +0100
Subject: [PATCH 25/42] Replace etc/kayobe to $KAYOBE_CONFIG_PATH
---
 doc/source/operations/ceph-management.rst     | 16 ++++++++--------
 doc/source/operations/customising-horizon.rst |  2 +-
 doc/source/operations/gpu-in-openstack.rst    |  6 +++---
 doc/source/operations/nova-compute-ironic.rst |  8 ++++----
 .../operations/openstack-reconfiguration.rst  |  2 +-
 doc/source/operations/secret-rotation.rst     |  2 +-
 6 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst
index 67e3a5899f..6db44ad23f 100644
--- a/doc/source/operations/ceph-management.rst
+++ b/doc/source/operations/ceph-management.rst
@@ -11,9 +11,9 @@ please refer to :ref:`cephadm-kayobe` documentation.
Cephadm configuration location
------------------------------
-In kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific
+In kayobe-config repository, under ``$KAYOBE_CONFIG_PATH/cephadm.yml`` (or in a specific
Kayobe environment when using multiple environment, e.g.
-``etc/kayobe/environments//cephadm.yml``)
+``$KAYOBE_CONFIG_PATH/environments//cephadm.yml``)
StackHPC's Cephadm Ansible collection relies on multiple inventory groups:
@@ -22,12 +22,12 @@ StackHPC's Cephadm Ansible collection relies on multiple inventory groups:
- ``osds``
- ``rgws`` (optional)
-Those groups are usually defined in ``etc/kayobe/inventory/groups``.
+Those groups are usually defined in ``$KAYOBE_CONFIG_PATH/inventory/groups``.
Running Cephadm playbooks
-------------------------
-In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of
+In kayobe-config repository, under ``$KAYOBE_CONFIG_PATH/ansible`` there is a set of
Cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection.
``cephadm.yml`` runs the end to end process of Cephadm deployment and
@@ -38,14 +38,14 @@ and they can be run separately.
additional playbooks - ``cephadm-commands-pre.yml`` - Runs Ceph commands before post-deployment configuration (You can set a list of commands at ``cephadm_commands_pre_extra`` - in ``cephadm.yml``) + variable in ``$KAYOBE_CONFIG_PATH/cephadm.yml``) - ``cephadm-ec-profiles.yml`` - Defines Ceph EC profiles - ``cephadm-crush-rules.yml`` - Defines Ceph crush rules according - ``cephadm-pools.yml`` - Defines Ceph pools - ``cephadm-keys.yml`` - Defines Ceph users/keys - ``cephadm-commands-post.yml`` - Runs Ceph commands after post-deployment configuration (You can set a list of commands at ``cephadm_commands_post_extra`` - in ``cephadm.yml``) + variable in ``$KAYOBE_CONFIG_PATH/cephadm.yml``) There are also other Ceph playbooks that are not part of ``cephadm.yml`` @@ -102,7 +102,7 @@ Once all daemons are removed - you can remove the host: ceph orch host rm And then remove the host from inventory (usually in -``etc/kayobe/inventory/overcloud``) +``$KAYOBE_CONFIG_PATH/inventory/overcloud``) Additional options/commands may be found in `Host management `_ @@ -194,7 +194,7 @@ After removing OSDs, if the drives the OSDs were deployed on once again become available, Cephadm may automatically try to deploy more OSDs on these drives if they match an existing drivegroup spec. If this is not your desired action plan - it's best to modify the drivegroup -spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). +spec before (``cephadm_osd_spec`` variable in ``$KAYOBE_CONFIG_PATH/cephadm.yml``). Either set ``unmanaged: true`` to stop Cephadm from picking up new disks or modify it in some way that it no longer matches the drives you want to remove. diff --git a/doc/source/operations/customising-horizon.rst b/doc/source/operations/customising-horizon.rst index a39973e0f9..586bb242cd 100644 --- a/doc/source/operations/customising-horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -51,7 +51,7 @@ Adding the custom theme Create a directory and transfer custom theme files to it ``$KAYOBE_CONFIG_PATH/kolla/config/horizon/themes/``. -Define the custom theme in ``etc/kayobe/kolla/globals.yml`` +Define the custom theme in ``$KAYOBE_CONFIG_PATH/kolla/globals.yml`` .. code-block:: yaml diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 4979d3f25b..9270817187 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -991,7 +991,7 @@ Configure nova-scheduler The nova-scheduler service must be configured to enable the ``PciPassthroughFilter`` To enable it add it to the list of filters to Kolla-Ansible configuration file: -``etc/kayobe/kolla/config/nova.conf``, for instance: +``$KAYOBE_CONFIG_PATH/kolla/config/nova.conf``, for instance: .. code-block:: yaml @@ -1006,7 +1006,7 @@ Configuration can be applied in flexible ways using Kolla-Ansible's methods for `inventory-driven customisation of configuration `_. The following configuration could be added to -``etc/kayobe/kolla/config/nova/nova-compute.conf`` to enable PCI +``$KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf`` to enable PCI passthrough of GPU devices for hosts in a group named ``compute_gpu``. Again, the 4-digit PCI Vendor ID and Device ID extracted from ``lspci -nn`` can be used here to specify the GPU device(s). @@ -1037,7 +1037,7 @@ Configure nova-api pci.alias also needs to be configured on the controller. This configuration should match the configuration found on the compute nodes. 
Add it to Kolla-Ansible configuration file: -``etc/kayobe/kolla/config/nova/nova-api.conf``, for instance: +``$KAYOBE_CONFIG_PATH/kolla/config/nova/nova-api.conf``, for instance: .. code-block:: yaml diff --git a/doc/source/operations/nova-compute-ironic.rst b/doc/source/operations/nova-compute-ironic.rst index 6cbe00550f..9b9af76ffb 100644 --- a/doc/source/operations/nova-compute-ironic.rst +++ b/doc/source/operations/nova-compute-ironic.rst @@ -68,7 +68,7 @@ Moving from multiple Nova Compute Instances to a single instance 1. Decide where the single instance should run. This should normally be one of the three OpenStack control plane hosts. For convention, pick the first one, unless you can think of a good reason not to. Once you - have chosen, set the following variable in ``etc/kayobe/nova.yml``. + have chosen, set the following variable in ``$KAYOBE_CONFIG_PATH/nova.yml``. Here we have picked ``controller1``. .. code-block:: yaml @@ -196,7 +196,7 @@ constant, such that the new Nova Compute Ironic instance comes up with the same name as the one it replaces. For example, if the original instance resides on ``controller1``, then set the -following in ``etc/kayobe/nova.yml``: +following in ``$KAYOBE_CONFIG_PATH/nova.yml``: .. code-block:: yaml @@ -270,8 +270,8 @@ Current host is not accessible ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In this case you will need to remove the inaccessible host from the inventory. -For example, in ``etc/kayobe/inventory/hosts``, remove ``controller1`` from -the ``controllers`` group. +For example, in ``$KAYOBE_CONFIG_PATH/inventory/hosts``, remove ``controller1`` +from the ``controllers`` group. Adjust the ``kolla_nova_compute_ironic_host`` variable to point to the new host, eg. diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 712a0f779e..ab2a01fc8e 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -73,7 +73,7 @@ Alternative Configuration As an alternative to writing the certificates as a variable to ``secrets.yml``, it is also possible to write the same data to a file, -``etc/kayobe/kolla/certificates/haproxy.pem``. The file should be +``$KAYOBE_CONFIG_PATH/kolla/certificates/haproxy.pem``. The file should be vault-encrypted in the same manner as secrets.yml. In this instance, variable ``kolla_external_tls_cert`` does not need to be defined. diff --git a/doc/source/operations/secret-rotation.rst b/doc/source/operations/secret-rotation.rst index a01f66fa9c..34fd33a72c 100644 --- a/doc/source/operations/secret-rotation.rst +++ b/doc/source/operations/secret-rotation.rst @@ -105,7 +105,7 @@ Full method 3. Navigate to the directory containing your ``passwords.yml`` file (``kayobe-config/etc/kolla/passwords.yml`` OR - ``kayobe-config/etc/kayobe/environments/envname/kolla/passwords.yml``) + ``kayobe-config/etc/kayobe/environments//kolla/passwords.yml``) 4. 
Create a file called ``deletelist.txt`` and populate it with this content (including all whitespace): From 8edf08f55031c073f55ff4f9bcb270d7d8998067 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 9 Oct 2024 12:11:16 +0100 Subject: [PATCH 26/42] specify keyring is populated --- doc/source/operations/ceph-management.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 6db44ad23f..e48a8d3e02 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -49,7 +49,7 @@ and they can be run separately. There are also other Ceph playbooks that are not part of ``cephadm.yml`` -- ``cephadm-gather-keys.yml`` - Populate ``ceph.conf`` in kayobe-config by +- ``cephadm-gather-keys.yml`` - Populate ``ceph.conf`` and keyrings in kayobe-config by gathering Ceph configuration and keys - ``ceph-enter-maintenance.yml`` - Set Ceph to maintenance mode for storage hosts (Can limit the hosts with ``-l ``) From 20e46a3e62c3744fadd122397dd1cfcd521ef8d0 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 9 Oct 2024 12:11:25 +0100 Subject: [PATCH 27/42] Add rebooting case --- doc/source/operations/control-plane-operation.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index d3440c97db..cdb328a469 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -20,7 +20,7 @@ Compute The compute nodes can largely be thought of as ephemeral, but you do need to make sure you have migrated any instances and disabled the hypervisor before -decommissioning or making any disruptive configuration change. +rebooting, decommissioning or making any disruptive configuration change. Monitoring ---------- @@ -197,7 +197,7 @@ following order: * Perform a graceful shutdown of all virtual machine instances * Shut down compute nodes -* Shut down monitoring node +* Shut down monitoring node (if separate from controllers) * Shut down network nodes (if separate from controllers) * Shut down controllers * Shut down Ceph nodes (if applicable) From 275ce2c494e855bac846fd918a588c21eb918df9 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:26:47 +0000 Subject: [PATCH 28/42] Remove missing document --- doc/source/operations/index.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index 2408b1a36e..3bfc7e976a 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -23,4 +23,3 @@ This guide is for operators of the StackHPC Kayobe configuration project. 
tempest upgrading-openstack upgrading-ceph - wazuh-operation From c24dc1ddd4a73723cadf4cbe30c021c7c74ff8ad Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:28:31 +0000 Subject: [PATCH 29/42] Make hardware inventory doc bifrost specific --- ...t => bifrost-hardware-inventory-management.rst} | 14 ++++++++------ doc/source/operations/index.rst | 2 +- 2 files changed, 9 insertions(+), 7 deletions(-) rename doc/source/operations/{hardware-inventory-management.rst => bifrost-hardware-inventory-management.rst} (96%) diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/bifrost-hardware-inventory-management.rst similarity index 96% rename from doc/source/operations/hardware-inventory-management.rst rename to doc/source/operations/bifrost-hardware-inventory-management.rst index f88626a827..970e828aee 100644 --- a/doc/source/operations/hardware-inventory-management.rst +++ b/doc/source/operations/bifrost-hardware-inventory-management.rst @@ -1,8 +1,8 @@ -============================= -Hardware Inventory Management -============================= +===================================== +Bifrost Hardware Inventory Management +===================================== -At its lowest level, hardware inventory is managed in the Bifrost service. +In most deployments, hardware inventory is managed by the Bifrost service. Reconfiguring Control Plane Hardware ==================================== @@ -56,7 +56,9 @@ in Bifrost: | da0c61af-b411-41b9-8909-df2509f2059b | example-hypervisor-01 | None | power off | enroll | False | +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ -After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` to add these new hosts to +After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` (or +``${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/overcloud.yml`` +if Kayobe environment is used) to add these new hosts to the correct groups, import them in Kayobe's inventory with: .. code-block:: console @@ -138,7 +140,7 @@ migrate as the process needs manual confirmation. You can do this with: .. code-block:: console - openstack # openstack server resize confirm + openstack# openstack server resize confirm The symptom to look out for is that the server is showing a status of ``VERIFY RESIZE`` as shown in this snippet of ``openstack server show ``: diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index 3bfc7e976a..fd21c8c690 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -11,7 +11,7 @@ This guide is for operators of the StackHPC Kayobe configuration project. control-plane-operation customising-horizon gpu-in-openstack - hardware-inventory-management + bifrost-hardware-inventory-management hotfix-playbook migrating-vm nova-compute-ironic From 5b12bd0a4eed855a9ab469f97cea0bf6f8639cdb Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:29:36 +0000 Subject: [PATCH 30/42] Add reference to monitoring doc --- doc/source/configuration/monitoring.rst | 2 ++ doc/source/operations/control-plane-operation.rst | 3 +++ 2 files changed, 5 insertions(+) diff --git a/doc/source/configuration/monitoring.rst b/doc/source/configuration/monitoring.rst index 8215dc48bf..77c5e47f77 100644 --- a/doc/source/configuration/monitoring.rst +++ b/doc/source/configuration/monitoring.rst @@ -2,6 +2,8 @@ Monitoring ========== +.. 
_monitoring-service-configuration:
+
Monitoring Configuration
========================
diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst
index cdb328a469..d056a95105 100644
--- a/doc/source/operations/control-plane-operation.rst
+++ b/doc/source/operations/control-plane-operation.rst
@@ -42,6 +42,9 @@ Ansible control host
Control Plane Monitoring
========================
+This section is a user guide for monitoring the control plane. To see how to
+configure monitoring services, read :ref:`monitoring-service-configuration`.
+
The control plane has been configured to collect logs centrally using Fluentd,
OpenSearch and OpenSearch Dashboards.
From b7b776f3ea655d1dfe71ecdc9633f2b218d12356 Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Mon, 4 Nov 2024 13:30:31 +0000
Subject: [PATCH 31/42] Use reboot playbook rather than shutdown command
---
 doc/source/operations/control-plane-operation.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst
index d056a95105..7e7711004d 100644
--- a/doc/source/operations/control-plane-operation.rst
+++ b/doc/source/operations/control-plane-operation.rst
@@ -210,11 +210,12 @@ following order:
Rebooting a node
----------------
+Use the ``reboot.yml`` playbook to reboot nodes.
Example: Reboot all compute hosts apart from compute0:
.. code-block:: console
- kayobe# kayobe overcloud host command run --limit 'compute:!compute0' -b --command "shutdown -r"
+ kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit 'compute:!compute0'
References
----------
From f7018d5c99a09955568d2fa365a1f010dc8fc34a Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Mon, 4 Nov 2024 13:31:03 +0000
Subject: [PATCH 32/42] Use env variable
---
 doc/source/operations/secret-rotation.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/doc/source/operations/secret-rotation.rst b/doc/source/operations/secret-rotation.rst
index 34fd33a72c..127635ab4d 100644
--- a/doc/source/operations/secret-rotation.rst
+++ b/doc/source/operations/secret-rotation.rst
@@ -104,8 +104,8 @@ Full method
3. Navigate to the directory containing your ``passwords.yml`` file
- (``kayobe-config/etc/kolla/passwords.yml`` OR
- ``kayobe-config/etc/kayobe/environments//kolla/passwords.yml``)
+ (``$KOLLA_CONFIG_PATH/passwords.yml`` OR
+ ``$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/kolla/passwords.yml``)
4. Create a file called ``deletelist.txt`` and populate it with this content (including all whitespace):
From 9e324bad5da08cd960353e57b6c736c22966ab87 Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Mon, 4 Nov 2024 13:32:16 +0000
Subject: [PATCH 33/42] Make Vault and Openstack reconfig doc refer each other
---
 doc/source/configuration/vault.rst            |  5 +++++
 .../operations/openstack-reconfiguration.rst  | 14 ++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/doc/source/configuration/vault.rst b/doc/source/configuration/vault.rst
index 893af246c3..d05513632e 100644
--- a/doc/source/configuration/vault.rst
+++ b/doc/source/configuration/vault.rst
@@ -1,3 +1,5 @@
+.. _hashicorp-vault:
+
================================
Hashicorp Vault for internal PKI
================================
@@ -111,6 +113,9 @@ Certificates generation
Create the external TLS certificates (testing only)
---------------------------------------------------
+This method should only be used for testing.
For external certificates on production system, +See `Installing External TLS Certificates `__. + Typically external API TLS certificates should be generated by a organisation's trusted internal or third-party CA. For test and development purposes it is possible to use Vault as a CA for the external API. diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index ab2a01fc8e..35729c8c5f 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -35,8 +35,14 @@ On each controller: Some services may store data in a dedicated Docker volume, which can be removed with ``docker volume rm``. -Installing TLS Certificates -=========================== +.. _installing-external-tls-certificates: + +Installing External TLS Certificates +==================================== + +This section explains the process of deploying external TLS. +For internal and backend TLS, see `Hashicorp Vault for internal PKI +`__. To configure TLS for the first time, we write the contents of a PEM file to the ``secrets.yml`` file as ``secrets_kolla_external_tls_cert``. @@ -81,8 +87,8 @@ See `Kolla-Ansible TLS guide `__ for further details. -Updating TLS Certificates -------------------------- +Updating External TLS Certificates +---------------------------------- Check the expiry date on an installed TLS certificate from a host that can reach the OpenStack APIs: From b9cefdf35df4d89399657ae436e87225347c9b67 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:32:54 +0000 Subject: [PATCH 34/42] Fix: Use RST syntax of Note --- doc/source/operations/openstack-reconfiguration.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 35729c8c5f..31c1c5acc8 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -97,8 +97,10 @@ reach the OpenStack APIs: openstack# openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates -*NOTE*: Prometheus Blackbox monitoring can check certificates automatically -and alert when expiry is approaching. +.. note:: + + Prometheus Blackbox monitoring can check certificates automatically + and alert when expiry is approaching. To update an existing certificate, for example when it has reached expiration, change the value of ``secrets_kolla_external_tls_cert``, in the same order as From 106b14f870cb026e5b0b5870c84f7fa3f39eb7e3 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:33:23 +0000 Subject: [PATCH 35/42] Update to use some of upstream doc --- doc/source/operations/gpu-in-openstack.rst | 324 ++------------------- 1 file changed, 19 insertions(+), 305 deletions(-) diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 9270817187..330edfbb1d 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -2,35 +2,18 @@ Support for GPUs in OpenStack ============================= -NVIDIA Virtual GPU -################## +Virtual GPUs +############ BIOS configuration ------------------ -Intel -^^^^^ - -* Enable `VT-x` in the BIOS for virtualisation support. -* Enable `VT-d` in the BIOS for IOMMU support. - -Dell -^^^^ - -Enabling SR-IOV with `racadm`: - -.. 
code:: shell - - /opt/dell/srvadmin/bin/idracadm7 set BIOS.IntegratedDevices.SriovGlobalEnable Enabled - /opt/dell/srvadmin/bin/idracadm7 jobqueue create BIOS.Setup.1-1 - - +See upstream documentation: `BIOS configuration `__ Obtain driver from NVIDIA licensing portal -------------------------------------------- +------------------------------------------ -Download Nvidia GRID driver from `here `__ -(This requires a login). The file can either be placed on the :ref:`ansible control host` or :ref:`uploaded to pulp`. +See upstream documentation: `Obtain driver from NVIDIA licencing portal `__ .. _NVIDIA Pulp: @@ -52,7 +35,8 @@ running in a CI environment. The file will then be available at ``/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip``. You will need to set the ``vgpu_driver_url`` configuration option to this value: -.. code:: yaml +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/vgpu.yml # URL of GRID driver in pulp vgpu_driver_url: "{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" @@ -67,7 +51,8 @@ Placing the GRID driver on the ansible control host Copy the driver bundle to a known location on the ansible control host. Set the ``vgpu_driver_url`` configuration variable to reference this path using ``file`` as the url scheme e.g: -.. code:: yaml +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/vgpu.yml # Location of NVIDIA GRID driver on localhost vgpu_driver_url: "file://{{ lookup('env', 'HOME') }}/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" @@ -81,24 +66,12 @@ OS Configuration Host OS configuration is done by using roles in the `stackhpc.linux `_ ansible collection. -Add the following to your ansible ``requirements.yml``: - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/ansible/requirements.yml - - #FIXME: Update to known release When VGPU and IOMMU roles have landed - collections: - - name: stackhpc.linux - source: git+https://github.com/stackhpc/ansible-collection-linux.git,preemptive/vgpu-iommu - type: git - Create a new playbook or update an existing on to apply the roles: .. code-block:: yaml :caption: $KAYOBE_CONFIG_PATH/ansible/host-configure.yml --- - - hosts: iommu tags: - iommu @@ -176,15 +149,6 @@ hosts can automatically be mapped to these groups by configuring Role Configuration ^^^^^^^^^^^^^^^^^^ -Configure the location of the NVIDIA driver: - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/vgpu.yml - - --- - - vgpu_driver_url: "http://{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" - Configure the VGPU devices: .. code-block:: yaml @@ -260,56 +224,8 @@ ensure you do not forget to run it when hosts are enrolled in the future. Kolla-Ansible configuration ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -To use the mdev devices that were created, modify nova.conf to add a list of mdev devices that -can be passed through to guests: - -.. 
code-block:: - :caption: $KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf - - {% if inventory_hostname in groups['compute_multi_instance_gpu'] %} - [devices] - enabled_mdev_types = nvidia-700, nvidia-699 - - [mdev_nvidia-700] - device_addresses = 0000:21:00.4,0000:21:00.5,0000:21:00.6,0000:81:00.4,0000:81:00.5,0000:81:00.6 - mdev_class = CUSTOM_NVIDIA_700 - - [mdev_nvidia-699] - device_addresses = 0000:21:00.7,0000:81:00.7 - mdev_class = CUSTOM_NVIDIA_699 - - {% elif inventory_hostname in groups['compute_vgpu'] %} - [devices] - enabled_mdev_types = nvidia-697 - - [mdev_nvidia-697] - device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 - # Custom resource classes don't work when you only have single resource type. - mdev_class = VGPU - - {% endif %} - -You will need to adjust the PCI addresses to match the virtual function -addresses. These can be obtained by checking the mdevctl configuration after -running the role: - -.. code-block:: shell - - # mdevctl list - - 73269d0f-b2c9-438d-8f28-f9e4bc6c6995 0000:17:00.4 nvidia-700 manual (defined) - dc352ef3-efeb-4a5d-a48e-912eb230bc76 0000:17:00.5 nvidia-700 manual (defined) - a464fbae-1f89-419a-a7bd-3a79c7b2eef4 0000:17:00.6 nvidia-700 manual (defined) - f3b823d3-97c8-4e0a-ae1b-1f102dcb3bce 0000:17:00.7 nvidia-699 manual (defined) - 330be289-ba3f-4416-8c8a-b46ba7e51284 0000:65:00.4 nvidia-700 manual (defined) - 1ba5392c-c61f-4f48-8fb1-4c6b2bbb0673 0000:65:00.5 nvidia-700 manual (defined) - f6868020-eb3a-49c6-9701-6c93e4e3fa9c 0000:65:00.6 nvidia-700 manual (defined) - 00501f37-c468-5ba4-8be2-8d653c4604ed 0000:65:00.7 nvidia-699 manual (defined) - -The mdev_class maps to a resource class that you can set in your flavor definition. -Note that if you only define a single mdev type on a given hypervisor, then the -mdev_class configuration option is silently ignored and it will use the ``VGPU`` -resource class (bug?). +See upstream documentation: `Kolla Ansible configuration `__ +then follow the rest. Map through the kayobe inventory groups into kolla: @@ -356,28 +272,7 @@ You will need to reconfigure nova for this change to be applied: Openstack flavors ^^^^^^^^^^^^^^^^^ -Define some flavors that request the resource class that was configured in nova.conf. -An example definition, that can be used with ``openstack.cloud.compute_flavor`` Ansible module, -is shown below: - -.. code-block:: yaml - - vgpu_a100_2g_20gb: - name: "vgpu.a100.2g.20gb" - ram: 65536 - disk: 30 - vcpus: 8 - is_public: false - extra_specs: - hw:cpu_policy: "dedicated" - hw:cpu_thread_policy: "prefer" - hw:mem_page_size: "1GB" - hw:cpu_sockets: 2 - hw:numa_nodes: 8 - hw_rng:allowed: "True" - resources:CUSTOM_NVIDIA_700: "1" - -You now should be able to launch a VM with this flavor. +See upstream documentation: `OpenStack flavors `__ NVIDIA License Server ^^^^^^^^^^^^^^^^^^^^^ @@ -667,123 +562,7 @@ Example output: Changing VGPU device types ^^^^^^^^^^^^^^^^^^^^^^^^^^ -Converting the second card to an NVIDIA-698 (whole card). The hypervisor -is empty so we can freely delete mdevs. First clean up the mdev -definition: - -.. 
code:: shell - - [stack@computegpu007 ~]$ sudo mdevctl list - 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (defined) - eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (defined) - 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual (defined) - 0a47ffd1-392e-5373-8428-707a4e0ce31a 0000:81:00.5 nvidia-697 manual (defined) - - [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 72291b01-689b-5b7a-9171-6b3480deabf4 - [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a - - [stack@computegpu007 ~]$ sudo mdevctl undefine --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a - - [stack@computegpu007 ~]$ sudo mdevctl list --defined - 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (active) - eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (active) - 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual - - # We can re-use the first virtual function - -Secondly remove the systemd unit that starts the mdev device: - -.. code:: shell - - [stack@computegpu007 ~]$ sudo rm /etc/systemd/system/multi-user.target.wants/nvidia-mdev@0a47ffd1-392e-5373-8428-707a4e0ce31a.service - -Example config change: - -.. code:: shell - - diff --git a/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu - new file mode 100644 - index 0000000..6cea9bf - --- /dev/null - +++ b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu - @@ -0,0 +1,12 @@ - +--- - +vgpu_definitions: - + - pci_address: "0000:21:00.0" - + virtual_functions: - + - mdev_type: nvidia-697 - + index: 0 - + - mdev_type: nvidia-697 - + index: 1 - + - pci_address: "0000:81:00.0" - + virtual_functions: - + - mdev_type: nvidia-698 - + index: 0 - diff --git a/etc/kayobe/kolla/config/nova/nova-compute.conf b/etc/kayobe/kolla/config/nova/nova-compute.conf - index 6f680cb..e663ec4 100644 - --- a/etc/kayobe/kolla/config/nova/nova-compute.conf - +++ b/etc/kayobe/kolla/config/nova/nova-compute.conf - @@ -39,7 +39,19 @@ cpu_mode = host-model - {% endraw %} - - {% raw %} - -{% if inventory_hostname in groups['compute_multi_instance_gpu'] %} - +{% if inventory_hostname == "computegpu007" %} - +[devices] - +enabled_mdev_types = nvidia-697, nvidia-698 - + - +[mdev_nvidia-697] - +device_addresses = 0000:21:00.4,0000:21:00.5 - +mdev_class = VGPU - + - +[mdev_nvidia-698] - +device_addresses = 0000:81:00.4 - +mdev_class = CUSTOM_NVIDIA_698 - + - +{% elif inventory_hostname in groups['compute_multi_instance_gpu'] %} - [devices] - enabled_mdev_types = nvidia-700, nvidia-699 - - @@ -50,15 +62,14 @@ mdev_class = CUSTOM_NVIDIA_700 - [mdev_nvidia-699] - device_addresses = 0000:21:00.7,0000:81:00.7 - mdev_class = CUSTOM_NVIDIA_699 - -{% endif %} - - -{% if inventory_hostname in groups['compute_vgpu'] %} - +{% elif inventory_hostname in groups['compute_vgpu'] %} - [devices] - enabled_mdev_types = nvidia-697 - - [mdev_nvidia-697] - device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 - -# Custom resource classes don't seem to work for this card. - +# Custom resource classes don't work when you only have single resource type. - mdev_class = VGPU - - {% endif %} - -Re-run the configure playbook: - -.. code:: shell - - (kayobe) [stack@ansiblenode1 kayobe]$ kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml --tags vgpu --limit computegpu007 - -Check the result: - -.. 
code:: shell - - [stack@computegpu007 ~]$ mdevctl list - 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual - eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual - 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-698 manual - -Reconfigure nova to match the change: - -.. code:: shell - - kayobe overcloud service reconfigure -kt nova --kolla-limit computegpu007 --skip-prechecks - +See upstream documentation: `Changing VGPU device types `__ PCI Passthrough ############### @@ -986,81 +765,16 @@ IOMMU should be enabled at kernel level as well - we can verify that on the comp OpenStack Nova configuration ---------------------------- -Configure nova-scheduler -^^^^^^^^^^^^^^^^^^^^^^^^ - -The nova-scheduler service must be configured to enable the ``PciPassthroughFilter`` -To enable it add it to the list of filters to Kolla-Ansible configuration file: -``$KAYOBE_CONFIG_PATH/kolla/config/nova.conf``, for instance: - -.. code-block:: yaml - - [filter_scheduler] - available_filters = nova.scheduler.filters.all_filters - enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, PciPassthroughFilter - -Configure nova-compute -^^^^^^^^^^^^^^^^^^^^^^ - -Configuration can be applied in flexible ways using Kolla-Ansible's -methods for `inventory-driven customisation of configuration -`_. -The following configuration could be added to -``$KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf`` to enable PCI -passthrough of GPU devices for hosts in a group named ``compute_gpu``. -Again, the 4-digit PCI Vendor ID and Device ID extracted from ``lspci --nn`` can be used here to specify the GPU device(s). - -.. code-block:: jinja - - [pci] - {% raw %} - {% if inventory_hostname in groups['compute_gpu'] %} - # We could support multiple models of GPU. - # This can be done more selectively using different inventory groups. - # GPU models defined here: - # NVidia Tesla V100 16GB - # NVidia Tesla V100 32GB - # NVidia Tesla P100 16GB - passthrough_whitelist = [{ "vendor_id":"10de", "product_id":"1db4" }, - { "vendor_id":"10de", "product_id":"1db5" }, - { "vendor_id":"10de", "product_id":"15f8" }] - alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } - alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } - alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } - {% endif %} - {% endraw %} - -Configure nova-api -^^^^^^^^^^^^^^^^^^ - -pci.alias also needs to be configured on the controller. -This configuration should match the configuration found on the compute nodes. -Add it to Kolla-Ansible configuration file: -``$KAYOBE_CONFIG_PATH/kolla/config/nova/nova-api.conf``, for instance: - -.. code-block:: yaml - - [pci] - alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } - alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } - alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } - -Reconfigure nova service -^^^^^^^^^^^^^^^^^^^^^^^^ - -.. 
code-block:: text - - kayobe overcloud service reconfigure --kolla-tags nova --kolla-skip-tags common --skip-prechecks +See upsteram Nova documentation: `Attaching physical PCI devices to guests `__ Configure a flavor ^^^^^^^^^^^^^^^^^^ -For example, to request two of the GPUs with alias gpu-p100 +For example, to request two of the GPUs with alias **a1** .. code-block:: text - openstack flavor set m1.medium --property "pci_passthrough:alias"="gpu-p100:2" + openstack flavor set m1.medium --property "pci_passthrough:alias"="a1:2" This can be also defined in the openstack-config repository @@ -1072,12 +786,12 @@ add extra_specs to flavor in etc/openstack-config/openstack-config.yml: admin# cd src/openstack-config admin# vim etc/openstack-config/openstack-config.yml - name: "m1.medium" + name: "m1.medium-gpu" ram: 4096 disk: 40 vcpus: 2 extra_specs: - "pci_passthrough:alias": "gpu-p100:2" + "pci_passthrough:alias": "a1:2" Invoke configuration playbooks afterwards: @@ -1092,7 +806,7 @@ Create instance with GPU passthrough .. code-block:: text - openstack server create --flavor m1.medium --image ubuntu2004 --wait test-pci + openstack server create --flavor m1.medium-gpu --image ubuntu22.04 --wait test-pci Testing GPU in a Guest VM ------------------------- From 15b575fa646c34b45218a7837c428261038d3b5d Mon Sep 17 00:00:00 2001 From: Seunghun Lee <45145778+seunghun1ee@users.noreply.github.com> Date: Thu, 7 Nov 2024 10:53:30 +0000 Subject: [PATCH 36/42] Better wordings on section intro Co-authored-by: Alex-Welsh <112560678+Alex-Welsh@users.noreply.github.com> --- doc/source/configuration/vault.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/configuration/vault.rst b/doc/source/configuration/vault.rst index d05513632e..61a5ab1c9e 100644 --- a/doc/source/configuration/vault.rst +++ b/doc/source/configuration/vault.rst @@ -113,7 +113,7 @@ Certificates generation Create the external TLS certificates (testing only) --------------------------------------------------- -This method should only be used for testing. For external certificates on production system, +This method should only be used for testing. For external TLS on production systems, See `Installing External TLS Certificates `__. Typically external API TLS certificates should be generated by a organisation's trusted internal or third-party CA. From 7703f9314dd7bbb87b42a12fff01830e20657a3b Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 11:08:02 +0000 Subject: [PATCH 37/42] Remove unnecessary curly brackets --- doc/source/configuration/ci-cd.rst | 14 +++++++------- doc/source/configuration/lvm.rst | 4 ++-- doc/source/configuration/swap.rst | 4 ++-- doc/source/contributor/pre-commit.rst | 6 +++--- .../bifrost-hardware-inventory-management.rst | 14 +++++++------- doc/source/operations/hotfix-playbook.rst | 4 ++-- doc/source/operations/octavia.rst | 4 ++-- .../operations/openstack-reconfiguration.rst | 6 +++--- 8 files changed, 28 insertions(+), 28 deletions(-) diff --git a/doc/source/configuration/ci-cd.rst b/doc/source/configuration/ci-cd.rst index 6e495c2e81..dcf86350e6 100644 --- a/doc/source/configuration/ci-cd.rst +++ b/doc/source/configuration/ci-cd.rst @@ -57,26 +57,26 @@ Runner Deployment Ideally an Infra VM could be used here or failing that the control host. Wherever it is deployed the host will need access to the :code:`admin_network`, :code:`public_network` and the :code:`pulp registry` on the seed. -2. 
Edit the environment's :code:`${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/inventory/groups` to add the predefined :code:`github-runners` group to :code:`infra-vms` +2. Edit the environment's :code:`$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/inventory/groups` to add the predefined :code:`github-runners` group to :code:`infra-vms` .. code-block:: ini [infra-vms:children] github-runners -3. Edit the environment's :code:`${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/inventory/hosts` to define the host(s) that will host the runners. +3. Edit the environment's :code:`$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/inventory/hosts` to define the host(s) that will host the runners. .. code-block:: ini [github-runners] prod-runner-01 -4. Provide all the relevant Kayobe :code:`group_vars` for :code:`github-runners` under :code:`${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/inventory/group_vars/github-runners` +4. Provide all the relevant Kayobe :code:`group_vars` for :code:`github-runners` under :code:`$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/inventory/group_vars/github-runners` * `infra-vms` ensuring all required `infra_vm_extra_network_interfaces` are defined * `network-interfaces` * `python-interpreter.yml` ensuring that `ansible_python_interpreter: /usr/bin/python3` has been set -5. Edit the ``${KAYOBE_CONFIG_PATH}/inventory/group_vars/github-runners/runners.yml`` file which will contain the variables required to deploy a series of runners. +5. Edit the ``$KAYOBE_CONFIG_PATH/inventory/group_vars/github-runners/runners.yml`` file which will contain the variables required to deploy a series of runners. Below is a core set of variables that will require consideration and modification for successful deployment of the runners. The number of runners deployed can be configured by removing and extending the dict :code:`github-runners`. As for how many runners present three is suitable number as this would prevent situations where long running jobs could halt progress other tasks whilst waiting for a free runner. @@ -120,7 +120,7 @@ Runner Deployment 7. If the host is an actual Infra VM then please refer to upstream `Infrastructure VMs `__ documentation for additional configuration and steps. -8. Run :code:`kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/deploy-github-runner.yml` +8. Run :code:`kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/deploy-github-runner.yml` 9. Check runners have registered properly by visiting the repository's :code:`Action` tab -> :code:`Runners` -> :code:`Self-hosted runners` @@ -130,9 +130,9 @@ Runner Deployment Workflow Deployment ------------------- -1. Edit :code:`${KAYOBE_CONFIG_PATH}/inventory/group_vars/github-writer/writer.yml` in the base configuration making the appropriate changes to your deployments specific needs. See documentation for `stackhpc.kayobe_workflows.github `__. +1. Edit :code:`$KAYOBE_CONFIG_PATH/inventory/group_vars/github-writer/writer.yml` in the base configuration making the appropriate changes to your deployments specific needs. See documentation for `stackhpc.kayobe_workflows.github `__. -2. Run :code:`kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/write-github-workflows.yml` +2. Run :code:`kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/write-github-workflows.yml` 3. 
Add all required secrets and variables to repository either via the GitHub UI or GitHub CLI (may require repository owner) diff --git a/doc/source/configuration/lvm.rst b/doc/source/configuration/lvm.rst index a96ca8db99..bb2b7862c4 100644 --- a/doc/source/configuration/lvm.rst +++ b/doc/source/configuration/lvm.rst @@ -93,6 +93,6 @@ hosts: .. code-block:: console - mkdir -p ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/pre.d - cd ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/pre.d + mkdir -p $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/pre.d + cd $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/pre.d ln -s ../../../ansible/growroot.yml 30-growroot.yml diff --git a/doc/source/configuration/swap.rst b/doc/source/configuration/swap.rst index 58545e9066..2419195744 100644 --- a/doc/source/configuration/swap.rst +++ b/doc/source/configuration/swap.rst @@ -23,6 +23,6 @@ hosts: .. code-block:: console - mkdir -p ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/post.d - cd ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/post.d + mkdir -p $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d + cd $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d ln -s ../../../ansible/swap.yml 10-swap.yml diff --git a/doc/source/contributor/pre-commit.rst b/doc/source/contributor/pre-commit.rst index 3afffc11b4..dc9f691bf6 100644 --- a/doc/source/contributor/pre-commit.rst +++ b/doc/source/contributor/pre-commit.rst @@ -29,12 +29,12 @@ Once done you should find `pre-commit` is available within the `kayobe` virtuale To run the playbook using the following command -- ``kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/install-pre-commit-hooks.yml`` +- ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/install-pre-commit-hooks.yml`` Whereas to run the playbook when control host bootstrap runs ensure it registered as symlink using the following command -- ``mkdir -p ${KAYOBE_CONFIG_PATH}/hooks/control-host-bootstrap/post.d`` -- ``ln -s ${KAYOBE_CONFIG_PATH}/ansible/install-pre-commit-hooks.yml ${KAYOBE_CONFIG_PATH}/hooks/control-host-bootstrap/post.d/install-pre-commit-hooks.yml`` +- ``mkdir -p $KAYOBE_CONFIG_PATH/hooks/control-host-bootstrap/post.d`` +- ``ln -s $KAYOBE_CONFIG_PATH/ansible/install-pre-commit-hooks.yml $KAYOBE_CONFIG_PATH/hooks/control-host-bootstrap/post.d/install-pre-commit-hooks.yml`` All that remains is the installation of the hooks themselves which can be accomplished either by running `pre-commit run` or using `git commit` when you have changes that need to be committed. diff --git a/doc/source/operations/bifrost-hardware-inventory-management.rst b/doc/source/operations/bifrost-hardware-inventory-management.rst index 970e828aee..ba6e6bb253 100644 --- a/doc/source/operations/bifrost-hardware-inventory-management.rst +++ b/doc/source/operations/bifrost-hardware-inventory-management.rst @@ -26,7 +26,7 @@ configured to network boot on the provisioning network, the following commands will instruct them to PXE boot. The nodes will boot on the Ironic Python Agent kernel and ramdisk, which is configured to extract hardware information and send it to Bifrost. Note that IPMI credentials can be found in the encrypted -file located at ``${KAYOBE_CONFIG_PATH}/secrets.yml``. +file located at ``$KAYOBE_CONFIG_PATH/secrets.yml``. .. 
code-block:: console @@ -56,8 +56,8 @@ in Bifrost: | da0c61af-b411-41b9-8909-df2509f2059b | example-hypervisor-01 | None | power off | enroll | False | +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ -After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` (or -``${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/overcloud.yml`` +After editing ``$KAYOBE_CONFIG_PATH/overcloud.yml`` (or +``$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/overcloud.yml`` if Kayobe environment is used) to add these new hosts to the correct groups, import them in Kayobe's inventory with: @@ -201,7 +201,7 @@ To build ipa image with extra-hardware you need to edit ``ipa.yml`` and add thi - "extra-hardware" Extract introspection data from Bifrost with Kayobe. JSON files will be created -into ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``: +into ``$KAYOBE_CONFIG_PATH/overcloud-introspection-data``: .. code-block:: console @@ -210,7 +210,7 @@ into ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``: Using ADVise ------------ -The Ansible playbook ``advise-run.yml`` can be found at ``${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml``. +The Ansible playbook ``advise-run.yml`` can be found at ``$KAYOBE_CONFIG_PATH/ansible/advise-run.yml``. The playbook will: @@ -220,8 +220,8 @@ The playbook will: .. code-block:: console - cd ${KAYOBE_CONFIG_PATH} - ansible-playbook ${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml + cd $KAYOBE_CONFIG_PATH + ansible-playbook $KAYOBE_CONFIG_PATH/ansible/advise-run.yml The playbook has the following optional parameters: diff --git a/doc/source/operations/hotfix-playbook.rst b/doc/source/operations/hotfix-playbook.rst index ee4d9df012..8f7c6145e3 100644 --- a/doc/source/operations/hotfix-playbook.rst +++ b/doc/source/operations/hotfix-playbook.rst @@ -20,7 +20,7 @@ The playbook can be invoked with: .. code-block:: console - kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/hotfix-containers.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/hotfix-containers.yml Playbook variables: ------------------- @@ -49,7 +49,7 @@ to a file, then add them as an extra var. e.g: .. code-block:: console - kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/hotfix-containers.yml -e "@~/vars.yml" + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/hotfix-containers.yml -e "@~/vars.yml" Example Variables file diff --git a/doc/source/operations/octavia.rst b/doc/source/operations/octavia.rst index f884d130f1..e13b0a1b3a 100644 --- a/doc/source/operations/octavia.rst +++ b/doc/source/operations/octavia.rst @@ -12,7 +12,7 @@ With your kayobe environment activated, you can build a new amphora image with: .. code-block:: console - kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/octavia-amphora-image-build.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/octavia-amphora-image-build.yml The resultant image is based on Ubuntu. By default the image will be built on the seed, but it is possible to change the group in the ansible inventory using the @@ -29,7 +29,7 @@ You can then run the playbook to upload the image: .. code-block:: console - kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/octavia-amphora-image-register.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/octavia-amphora-image-register.yml This will rename the old image by adding a timestamp suffix, before uploading a new image with the name, ``amphora-x64-haproxy``. 
Octavia should be configured diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 31c1c5acc8..7a86a7e707 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -10,7 +10,7 @@ service is handled less well, because of Ansible's imperative style. To remove a service, it is disabled in Kayobe's Kolla config, which prevents other services from communicating with it. For example, to disable -``cinder-backup``, edit ``${KAYOBE_CONFIG_PATH}/kolla.yml``: +``cinder-backup``, edit ``$KAYOBE_CONFIG_PATH/kolla.yml``: .. code-block:: diff @@ -50,7 +50,7 @@ Use a command of this form: .. code-block:: console - kayobe# ansible-vault edit ${KAYOBE_CONFIG_PATH}/secrets.yml --vault-password-file= + kayobe# ansible-vault edit $KAYOBE_CONFIG_PATH/secrets.yml --vault-password-file= Concatenate the contents of the certificate and key files to create ``secrets_kolla_external_tls_cert``. The certificates should be installed in @@ -60,7 +60,7 @@ this order: * Any intermediate certificates * The TLS certificate private key -In ``${KAYOBE_CONFIG_PATH}/kolla.yml``, set the following: +In ``$KAYOBE_CONFIG_PATH/kolla.yml``, set the following: .. code-block:: yaml From 3eb65370b34e30ea3a0613faaa561492cf10e6fd Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 11:15:02 +0000 Subject: [PATCH 38/42] Add note of reconfiguring monitoring service --- .../operations/bifrost-hardware-inventory-management.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/doc/source/operations/bifrost-hardware-inventory-management.rst b/doc/source/operations/bifrost-hardware-inventory-management.rst index ba6e6bb253..3ec0dd2653 100644 --- a/doc/source/operations/bifrost-hardware-inventory-management.rst +++ b/doc/source/operations/bifrost-hardware-inventory-management.rst @@ -73,6 +73,11 @@ We can then provision and configure them: kayobe# kayobe overcloud host configure --limit kayobe# kayobe overcloud service deploy --limit --kolla-limit +.. note:: + + Reconfiguring monitoring services on controllers is required after provisioning them. + Otherwise, they will not show up. + Replacing a Failing Hypervisor ------------------------------ From 8f001c4c4d976371ff768b151b2c96ad4153fdf5 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 11:16:29 +0000 Subject: [PATCH 39/42] Fix spacing --- doc/source/operations/migrating-vm.rst | 2 +- doc/source/operations/openstack-reconfiguration.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/source/operations/migrating-vm.rst b/doc/source/operations/migrating-vm.rst index 784abe74a4..fd5260286f 100644 --- a/doc/source/operations/migrating-vm.rst +++ b/doc/source/operations/migrating-vm.rst @@ -19,4 +19,4 @@ To move a virtual machine with local disks: .. 
code-block:: console - admin# openstack server migrate --live-migration --block-migration --host hypervisor-01 + admin# openstack server migrate --live-migration --block-migration --host hypervisor-01 diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 7a86a7e707..771103cac3 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -104,7 +104,7 @@ reach the OpenStack APIs: To update an existing certificate, for example when it has reached expiration, change the value of ``secrets_kolla_external_tls_cert``, in the same order as -above. Run the following command: +above. Run the following command: .. code-block:: console From 869fb57696efbaf2a93f9151c3ce14348bbecf04 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 11:41:17 +0000 Subject: [PATCH 40/42] Remove command prefixes --- .../bifrost-hardware-inventory-management.rst | 32 +++++++++---------- .../operations/control-plane-operation.rst | 28 ++++++++-------- doc/source/operations/gpu-in-openstack.rst | 10 +++--- doc/source/operations/migrating-vm.rst | 6 ++-- .../operations/openstack-reconfiguration.rst | 24 +++++++------- 5 files changed, 49 insertions(+), 51 deletions(-) diff --git a/doc/source/operations/bifrost-hardware-inventory-management.rst b/doc/source/operations/bifrost-hardware-inventory-management.rst index 3ec0dd2653..6730418192 100644 --- a/doc/source/operations/bifrost-hardware-inventory-management.rst +++ b/doc/source/operations/bifrost-hardware-inventory-management.rst @@ -13,7 +13,7 @@ can be reinspected like this: .. code-block:: console - kayobe# kayobe overcloud hardware inspect --limit + kayobe overcloud hardware inspect --limit .. _enrolling-new-hypervisors: @@ -30,26 +30,26 @@ file located at ``$KAYOBE_CONFIG_PATH/secrets.yml``. .. code-block:: console - bifrost# ipmitool -I lanplus -U -H -ipmi chassis bootdev pxe + ipmitool -I lanplus -U -H -ipmi chassis bootdev pxe If node is are off, power them on: .. code-block:: console - bifrost# ipmitool -I lanplus -U -H -ipmi power on + ipmitool -I lanplus -U -H -ipmi power on If nodes is on, reset them: .. code-block:: console - bifrost# ipmitool -I lanplus -U -H -ipmi power reset + ipmitool -I lanplus -U -H -ipmi power reset Once node have booted and have completed introspection, they should be visible in Bifrost: .. code-block:: console - bifrost# baremetal node list --provision-state enroll + baremetal node list --provision-state enroll +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance | +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ @@ -63,15 +63,15 @@ the correct groups, import them in Kayobe's inventory with: .. code-block:: console - kayobe# kayobe overcloud inventory discover + kayobe overcloud inventory discover We can then provision and configure them: .. code-block:: console - kayobe# kayobe overcloud provision --limit - kayobe# kayobe overcloud host configure --limit - kayobe# kayobe overcloud service deploy --limit --kolla-limit + kayobe overcloud provision --limit + kayobe overcloud host configure --limit + kayobe overcloud service deploy --limit --kolla-limit .. note:: @@ -94,7 +94,7 @@ To deprovision an existing hypervisor, run: .. 
code-block:: console - kayobe# kayobe overcloud deprovision --limit + kayobe overcloud deprovision --limit .. warning:: @@ -109,14 +109,14 @@ Evacuating all instances .. code-block:: console - admin# openstack server evacuate $(openstack server list --host --format value --column ID) + openstack server evacuate $(openstack server list --host --format value --column ID) You should now check the status of all the instances that were running on that hypervisor. They should all show the status ACTIVE. This can be verified with: .. code-block:: console - admin# openstack server show + openstack server show Troubleshooting =============== @@ -145,7 +145,7 @@ migrate as the process needs manual confirmation. You can do this with: .. code-block:: console - openstack# openstack server resize confirm + openstack server resize confirm The symptom to look out for is that the server is showing a status of ``VERIFY RESIZE`` as shown in this snippet of ``openstack server show ``: @@ -161,7 +161,7 @@ Set maintenance mode on a node in Bifrost .. code-block:: console - seed# docker exec -it bifrost_deploy /bin/bash + docker exec -it bifrost_deploy /bin/bash (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance set @@ -172,7 +172,7 @@ Unset maintenance mode on a node in Bifrost .. code-block:: console - seed# docker exec -it bifrost_deploy /bin/bash + docker exec -it bifrost_deploy /bin/bash (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance unset @@ -210,7 +210,7 @@ into ``$KAYOBE_CONFIG_PATH/overcloud-introspection-data``: .. code-block:: console - kayobe# kayobe overcloud introspection data save + kayobe overcloud introspection data save Using ADVise ------------ diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index 7e7711004d..e2de527095 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -120,7 +120,7 @@ The password can be found using: .. code-block:: console - kayobe# ansible-vault view $KAYOBE_CONFIG_PATH/kolla/passwords.yml \ + ansible-vault view $KAYOBE_CONFIG_PATH/kolla/passwords.yml \ --vault-password-file | grep ^database Checking RabbitMQ @@ -188,7 +188,7 @@ Shutting down the seed VM .. code-block:: console - kayobe# virsh shutdown + virsh shutdown .. _full-shutdown: @@ -215,7 +215,7 @@ Example: Reboot all compute hosts apart from compute0: .. code-block:: console - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit 'compute:!compute0' + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit 'compute:!compute0' References ---------- @@ -242,7 +242,7 @@ with the following command: .. code-block:: console - kayobe# kayobe overcloud database recover + kayobe overcloud database recover Ansible Control Host -------------------- @@ -258,7 +258,7 @@ hypervisor is powered on. If it does not, it can be started with: .. code-block:: console - kayobe# virsh start + virsh start Full power on ------------- @@ -275,7 +275,7 @@ Log into the monitoring host(s): .. code-block:: console - kayobe# ssh stack@monitoring0 + ssh stack@monitoring0 Stop all Docker containers: @@ -312,22 +312,22 @@ To sync host packages: .. 
code-block:: console - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml If the system is production environment and want to use packages tested in test/staging environment, you can promote them by: .. code-block:: console - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml To sync container images: .. code-block:: console - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml For more information about StackHPC Release Train, see :ref:`stackhpc-release-train` documentation. @@ -341,8 +341,8 @@ Host packages can be updated with: .. code-block:: console - kayobe# kayobe overcloud host package update --limit --packages '*' - kayobe# kayobe seed host package update --packages '*' + kayobe overcloud host package update --limit --packages '*' + kayobe seed host package update --packages '*' See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages @@ -387,7 +387,7 @@ Reconfigure Opensearch with new values: .. code-block:: console - kayobe# kayobe overcloud service reconfigure --kolla-tags opensearch + kayobe overcloud service reconfigure --kolla-tags opensearch For more information see the `upstream documentation `__. diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 330edfbb1d..2d7e30dee1 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -783,8 +783,8 @@ add extra_specs to flavor in etc/openstack-config/openstack-config.yml: .. code-block:: console - admin# cd src/openstack-config - admin# vim etc/openstack-config/openstack-config.yml + cd src/openstack-config + vim etc/openstack-config/openstack-config.yml name: "m1.medium-gpu" ram: 4096 @@ -797,9 +797,9 @@ Invoke configuration playbooks afterwards: .. code-block:: console - admin# source src/kayobe-config/etc/kolla/public-openrc.sh - admin# source venvs/openstack/bin/activate - admin# tools/openstack-config --vault-password-file + source src/kayobe-config/etc/kolla/public-openrc.sh + source venvs/openstack/bin/activate + tools/openstack-config --vault-password-file Create instance with GPU passthrough ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/doc/source/operations/migrating-vm.rst b/doc/source/operations/migrating-vm.rst index fd5260286f..031df1b609 100644 --- a/doc/source/operations/migrating-vm.rst +++ b/doc/source/operations/migrating-vm.rst @@ -6,17 +6,17 @@ To see where all virtual machines are running on the hypervisors: .. code-block:: console - admin# openstack server list --all-projects --long + openstack server list --all-projects --long To move a virtual machine with shared storage or booted from volume from one hypervisor to another, for example to hypervisor-01: .. 
code-block:: console - admin# openstack server migrate --live-migration --host hypervisor-01 + openstack server migrate --live-migration --host hypervisor-01 To move a virtual machine with local disks: .. code-block:: console - admin# openstack server migrate --live-migration --block-migration --host hypervisor-01 + openstack server migrate --live-migration --block-migration --host hypervisor-01 diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 771103cac3..36bcece666 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -21,7 +21,7 @@ Then, reconfigure Cinder services with Kayobe: .. code-block:: console - kayobe# kayobe overcloud service reconfigure --kolla-tags cinder + kayobe overcloud service reconfigure --kolla-tags cinder However, the service itself, no longer in Ansible's manifest of managed state, must be manually stopped and prevented from restarting. @@ -30,7 +30,7 @@ On each controller: .. code-block:: console - kayobe# docker rm -f cinder_backup + docker rm -f cinder_backup Some services may store data in a dedicated Docker volume, which can be removed with ``docker volume rm``. @@ -50,7 +50,7 @@ Use a command of this form: .. code-block:: console - kayobe# ansible-vault edit $KAYOBE_CONFIG_PATH/secrets.yml --vault-password-file= + ansible-vault edit $KAYOBE_CONFIG_PATH/secrets.yml --vault-password-file= Concatenate the contents of the certificate and key files to create ``secrets_kolla_external_tls_cert``. The certificates should be installed in @@ -72,7 +72,7 @@ be updated in Keystone: .. code-block:: console - kayobe# kayobe overcloud service reconfigure + kayobe overcloud service reconfigure Alternative Configuration ------------------------- @@ -95,7 +95,7 @@ reach the OpenStack APIs: .. code-block:: console - openstack# openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates + openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates .. note:: @@ -108,7 +108,7 @@ above. Run the following command: .. code-block:: console - kayobe# kayobe overcloud service reconfigure --kolla-tags haproxy + kayobe overcloud service reconfigure --kolla-tags haproxy .. _taking-a-hypervisor-out-of-service: @@ -119,8 +119,7 @@ To take a hypervisor out of Nova scheduling: .. code-block:: console - admin# openstack compute service set --disable \ - nova-compute + openstack compute service set --disable nova-compute Running instances on the hypervisor will not be affected, but new instances will not be deployed on it. @@ -130,19 +129,18 @@ A reason for disabling a hypervisor can be documented with the .. code-block:: console - admin# openstack compute service set --disable \ - --disable-reason "Broken drive" nova-compute + openstack compute service set --disable \ + --disable-reason "Broken drive" nova-compute Details about all hypervisors and the reasons they are disabled can be displayed with: .. code-block:: console - admin# openstack compute service list --long + openstack compute service list --long And then to enable a hypervisor again: .. 
code-block:: console - admin# openstack compute service set --enable \ - nova-compute + openstack compute service set --enable nova-compute From 1788f7b60f063434b9117e6d218a3ddc2cdd1e51 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 12:02:26 +0000 Subject: [PATCH 41/42] Add warning of brief downtime --- doc/source/operations/openstack-reconfiguration.rst | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 36bcece666..392b92421d 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -106,6 +106,10 @@ To update an existing certificate, for example when it has reached expiration, change the value of ``secrets_kolla_external_tls_cert``, in the same order as above. Run the following command: +.. warning:: + + Services can be briefly unavailable during reconfiguring HAProxy. + .. code-block:: console kayobe overcloud service reconfigure --kolla-tags haproxy From a6872b57c75cae27d4e2cb51e0622202bb6eee6a Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 5 Dec 2024 11:34:45 +0000 Subject: [PATCH 42/42] Remove outdated information --- doc/source/operations/ceph-management.rst | 2 +- .../operations/control-plane-operation.rst | 9 +----- doc/source/operations/gpu-in-openstack.rst | 29 ------------------- 3 files changed, 2 insertions(+), 38 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index e48a8d3e02..98988959b7 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -120,7 +120,7 @@ Ceph can report details about failed OSDs by running: sudo cephadm shell ceph health detail -.. note :: +.. note:: Remember to run ceph/rbd commands from within ``cephadm shell`` (preferred method) or after installing Ceph client. Details in the diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index e2de527095..3dfd1ec44b 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -81,12 +81,6 @@ Then, create another silence to match ``hostname=`` (this is required because, for the OpenStack exporter, the instance is the host running the monitoring service rather than the host being monitored). -.. note:: - - After creating the silence, you may get redirected to a 404 page. This is a - `known issue `__ - when running several Alertmanager instances behind HAProxy. - Control Plane Shutdown Procedure ================================ @@ -353,8 +347,7 @@ Deploying to a Specific Hypervisor ---------------------------------- To test creating an instance on a specific hypervisor, *as an admin-level user* -you can specify the hypervisor name as part of an extended availability zone -description. +you can specify the hypervisor name. To see the list of hypervisor names: diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 2d7e30dee1..1fd99d30d1 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -190,35 +190,6 @@ Configure the VGPU devices: - mdev_type: nvidia-697 index: 1 -Running the playbook -^^^^^^^^^^^^^^^^^^^^ - -The playbook defined in the :ref:`previous step` -should be run after `kayobe overcloud host configure` has completed. 
This will -ensure the host has been fully bootstrapped. With default settings, internet -connectivity is required to download `MIG Partition Editor for NVIDIA GPUs`. If -this is not desirable, you can override the one of the following variables -(depending on host OS): - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/inventory/group_vars/compute_vgpu/vgpu - - vgpu_nvidia_mig_manager_rpm_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager-0.5.1-1.x86_64.rpm" - vgpu_nvidia_mig_manager_deb_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager_0.5.1-1_amd64.deb" - -For example, you may wish to upload these artifacts to the local pulp. - -Run the playbook that you defined earlier: - -.. code-block:: shell - - kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml - -Note: This will reboot the hosts on first run. - -The playbook may be added as a hook in ``$KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d``; this will -ensure you do not forget to run it when hosts are enrolled in the future. - .. _NVIDIA Kolla Ansible Configuration: Kolla-Ansible configuration