From 5ca6b834bbb03fb99b3ca3f203e7da47da50e3e8 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 22 May 2024 16:36:41 +0100 Subject: [PATCH 01/42] Add openstack projects & users management doc --- ...penstack-projects-and-users-management.rst | 38 +++++++++++++++++++ 1 file changed, 38 insertions(+) create mode 100644 doc/source/operations/openstack-projects-and-users-management.rst diff --git a/doc/source/operations/openstack-projects-and-users-management.rst b/doc/source/operations/openstack-projects-and-users-management.rst new file mode 100644 index 0000000000..676db20cab --- /dev/null +++ b/doc/source/operations/openstack-projects-and-users-management.rst @@ -0,0 +1,38 @@ +======================================= +Openstack Projects and Users Management +======================================= + +Projects (in OpenStack) can be defined in the ``openstack-config`` repository + +To initialise the working environment for ``openstack-config``: + +.. code-block:: console + + git clone ~/src/openstack-config + python3 -m venv ~/venvs/openstack-config-venv + source ~/venvs/openstack-config-venv/bin/activate + cd ~/src/openstack-config + pip install -U pip + pip install -r requirements.txt + ansible-galaxy collection install \ + -p ansible/collections \ + -r requirements.yml + +To define a new project, add a new project to +``etc/openstack-config/openstack-config.yml``: + +Example invocation: + +.. code-block:: console + + source ~/src/kayobe-config/etc/kolla/public-openrc.sh + source ~/venvs/openstack-config-venv/bin/activate + cd ~/src/openstack-config + tools/openstack-config -- --vault-password-file + +Deleting Users and Projects +--------------------------- + +Ansible is designed for adding configuration that is not present; removing +state is less easy. To remove a project or user, the configuration should be +manually removed. From 6b835765cae53facf7eb366f121ed6d0834b73bc Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 23 May 2024 15:47:31 +0100 Subject: [PATCH 02/42] Add horizon customisation doc --- doc/source/operations/customising_horizon.rst | 167 ++++++++++++++++++ 1 file changed, 167 insertions(+) create mode 100644 doc/source/operations/customising_horizon.rst diff --git a/doc/source/operations/customising_horizon.rst b/doc/source/operations/customising_horizon.rst new file mode 100644 index 0000000000..1f8977a31e --- /dev/null +++ b/doc/source/operations/customising_horizon.rst @@ -0,0 +1,167 @@ +.. include:: vars.rst + +==================================== +Customising Horizon +==================================== + +Horizon is the most frequent site-specific container customisation required: +other customisations tend to be common across deployments, but personalisation +of Horizon is unique to each institution. + +This describes a simple process for customising the Horizon theme. + +Creating a custom Horizon theme +------------------------------- + +A simple custom theme for Horizon can be implemented as small modifications of +an existing theme, such as the `Default +`__ +one. + +A theme contains at least two files: ``static/_styles.scss``, which can be empty, and +``static/_variables.scss``, which can reference another theme like this: + +.. code-block:: scss + + @import "/themes/default/variables"; + @import "/themes/default/styles"; + +Some resources such as logos can be overridden by dropping SVG image files into +``static/img`` (since the Ocata release, files must be SVG instead of PNG). See +`the Horizon documentation +`__ +for more details. 
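+
+For orientation, a minimal theme of this kind might end up with a layout
+similar to the following (``mytheme`` is an illustrative name; ``logo.svg``
+and ``logo-splash.svg`` mirror the file names used by the default theme):
+
+.. code-block:: console
+
+   mytheme/
+   └── static/
+       ├── _styles.scss
+       ├── _variables.scss
+       └── img/
+           ├── logo.svg          # replaces the top-left navigation logo
+           └── logo-splash.svg   # replaces the splash (login) screen logo
+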
+ +Content on some pages such as the splash (login) screen can be updated using +templates. + +See `our example horizon-theme `__ +which inherits from the default theme and includes: + +* a custom splash screen logo +* a custom top-left logo +* a custom message on the splash screen + +Further reading: + +* https://docs.openstack.org/horizon/latest/configuration/customizing.html +* https://docs.openstack.org/horizon/latest/configuration/themes.html +* https://docs.openstack.org/horizon/latest/configuration/branding.html + +Building a Horizon container image with custom theme +---------------------------------------------------- + +Building a custom container image for Horizon can be done by modifying +``kolla.yml`` to fetch the custom theme and include it in the image: + +.. code-block:: yaml + :substitutions: + + kolla_sources: + horizon-additions-theme-: + type: "git" + location: + reference: master + + kolla_build_blocks: + horizon_footer: | + # Binary images cannot use the additions mechanism. + {% raw %} + {% if install_type == 'source' %} + ADD additions-archive / + RUN mkdir -p /etc/openstack-dashboard/themes/ \ + && cp -R /additions/horizon-additions-theme--archive-master/* /etc/openstack-dashboard/themes// \ + && chown -R horizon: /etc/openstack-dashboard/themes + {% endif %} + {% endraw %} + +If using a specific container image tag, don't forget to set: + +.. code-block:: yaml + + kolla_tag: mytag + +Build the image with: + +.. code-block:: console + + kayobe overcloud container image build horizon -e kolla_install_type=source --push + +Pull the new Horizon container to the controller: + +.. code-block:: console + + kayobe overcloud container image pull --kolla-tags horizon + +Deploy and use the custom theme +------------------------------- + +Switch to source image type in ``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: + +.. code-block:: yaml + + horizon_install_type: source + +You may also need to update the container image tag: + +.. code-block:: yaml + + horizon_tag: mytag + +Configure Horizon to include the custom theme and use it by default: + +.. code-block:: console + + mkdir -p ${KAYOBE_CONFIG_PATH}/kolla/config/horizon/ + +Add to ``${KAYOBE_CONFIG_PATH}/kolla/config/horizon/custom_local_settings``: + +.. code-block:: console + + AVAILABLE_THEMES = [ + ('default', 'Default', 'themes/default'), + ('material', 'Material', 'themes/material'), + ('', '', '/etc/openstack-dashboard/themes/'), + ] + DEFAULT_THEME = '' + +You can also set other customisations in this file, such as the HTML title of the page: + +.. code-block:: console + + SITE_BRANDING = "" + +Deploy with: + +.. code-block:: console + + kayobe overcloud service reconfigure --kolla-tags horizon + +Troubleshooting +--------------- + +Make sure you build source images, as binary images cannot use the addition +mechanism used here. + +If the theme is selected but the logo doesn’t load, try running these commands +inside the ``horizon`` container: + +.. code-block:: console + + /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py collectstatic --noinput --clear + /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py compress --force + settings_bundle | md5sum > /var/lib/kolla/.settings.md5sum.txt + +Alternatively, try changing anything in ``custom_local_settings`` and restarting +the ``horizon`` container. + +If the ``horizon`` container is restarting with the following error: + +.. 
code-block:: console + + /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py compress --force + CommandError: An error occurred during rendering /var/lib/kolla/venv/lib/python3.6/site-packages/openstack_dashboard/templates/horizon/_scripts.html: Couldn't find any precompiler in COMPRESS_PRECOMPILERS setting for mimetype '\'text/javascript\''. + +It can be resolved by dropping cached content with ``docker restart +memcached``. Note this will log out users from Horizon, as Django sessions are +stored in Memcached. From ac8bcbadbbe29524c29c95d35e55f4c84b8d54d8 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 23 May 2024 16:05:38 +0100 Subject: [PATCH 03/42] Add ceph management doc --- doc/source/operations/ceph-management.rst | 123 ++++++++++++++++++++++ 1 file changed, 123 insertions(+) create mode 100644 doc/source/operations/ceph-management.rst diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst new file mode 100644 index 0000000000..754a6deb9c --- /dev/null +++ b/doc/source/operations/ceph-management.rst @@ -0,0 +1,123 @@ +========================== +Managing Ceph with Cephadm +========================== + +cephadm configuration location +============================== + +In kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific +Kayobe environment when using multiple environment, e.g. +``etc/kayobe/environments/production/cephadm.yml``) + +StackHPC's cephadm Ansible collection relies on multiple inventory groups: + +- ``mons`` +- ``mgrs`` +- ``osds`` +- ``rgws`` (optional) + +Those groups are usually defined in ``etc/kayobe/inventory/groups``. + +Running cephadm playbooks +========================= + +In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of +cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. + +- ``cephadm.yml`` - runs the end to end process starting with deployment and + defining EC profiles/crush rules/pools and users +- ``cephadm-crush-rules.yml`` - defines Ceph crush rules according +- ``cephadm-deploy.yml`` - runs the bootstrap/deploy playbook without the + additional playbooks +- ``cephadm-ec-profiles.yml`` - defines Ceph EC profiles +- ``cephadm-gather-keys.yml`` - gather Ceph configuration and keys and populate + kayobe-config +- ``cephadm-keys.yml`` - defines Ceph users/keys +- ``cephadm-pools.yml`` - defines Ceph pools\ + +Running Ceph commands +===================== + +Ceph commands are usually run inside a ``cephadm shell`` utility container: + +.. code-block:: console + + # From the node that runs Ceph + ceph# sudo cephadm shell + +Operating a cluster requires a keyring with an admin access to be available for Ceph +commands. Cephadm will copy such keyring to the nodes carrying +`_admin `__ +label - present on MON servers by default when using +`StackHPC Cephadm collection `__. + +Adding a new storage node +========================= + +Add a node to a respective group (e.g. osds) and run ``cephadm-deploy.yml`` +playbook. + +.. note:: + To add other node types than osds (mons, mgrs, etc) you need to specify + ``-e cephadm_bootstrap=True`` on playbook run. + +Removing a storage node +======================= + +First drain the node + +.. code-block:: console + + ceph# cephadm shell + ceph# ceph orch host drain + +Once all daemons are removed - you can remove the host: + +.. 
code-block:: console + + ceph# cephadm shell + ceph# ceph orch host rm + +And then remove the host from inventory (usually in +``etc/kayobe/inventory/overcloud``) + +Additional options/commands may be found in +`Host management `_ + +Replacing a Failed Ceph Drive +============================= + +Once an OSD has been identified as having a hardware failure, +the affected drive will need to be replaced. + +If rebooting a Ceph node, first set ``noout`` to prevent excess data +movement: + +.. code-block:: console + + ceph# cephadm shell + ceph# ceph osd set noout + +Reboot the node and replace the drive + +Unset noout after the node is back online + +.. code-block:: console + + ceph# cephadm shell + ceph# ceph osd unset noout + +Remove the OSD using Ceph orchestrator command: + +.. code-block:: console + + ceph# cephadm shell + ceph# ceph orch osd rm --replace + +After removing OSDs, if the drives the OSDs were deployed on once again become +available, cephadm may automatically try to deploy more OSDs on these drives if +they match an existing drivegroup spec. +If this is not your desired action plan - it's best to modify the drivegroup +spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). +Either set ``unmanaged: true`` to stop cephadm from picking up new disks or +modify it in some way that it no longer matches the drives you want to remove. From e5b7a77275496173eb0ba06cf67c246fdd6177f7 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 28 May 2024 13:19:20 +0100 Subject: [PATCH 04/42] Add ceph operation doc --- doc/source/operations/ceph-management.rst | 178 ++++++++++++++++++++-- 1 file changed, 167 insertions(+), 11 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 754a6deb9c..8e3d1f4e94 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -1,13 +1,16 @@ -========================== -Managing Ceph with Cephadm -========================== +=========================== +Managing and Operating Ceph +=========================== + +Working with Cephadm +==================== cephadm configuration location -============================== +------------------------------ In kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific Kayobe environment when using multiple environment, e.g. -``etc/kayobe/environments/production/cephadm.yml``) +``etc/kayobe/environments//cephadm.yml``) StackHPC's cephadm Ansible collection relies on multiple inventory groups: @@ -19,7 +22,7 @@ StackHPC's cephadm Ansible collection relies on multiple inventory groups: Those groups are usually defined in ``etc/kayobe/inventory/groups``. Running cephadm playbooks -========================= +------------------------- In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. @@ -36,7 +39,7 @@ cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. - ``cephadm-pools.yml`` - defines Ceph pools\ Running Ceph commands -===================== +--------------------- Ceph commands are usually run inside a ``cephadm shell`` utility container: @@ -47,12 +50,12 @@ Ceph commands are usually run inside a ``cephadm shell`` utility container: Operating a cluster requires a keyring with an admin access to be available for Ceph commands. 
Cephadm will copy such keyring to the nodes carrying -`_admin `__ +`_admin `__ label - present on MON servers by default when using `StackHPC Cephadm collection `__. Adding a new storage node -========================= +------------------------- Add a node to a respective group (e.g. osds) and run ``cephadm-deploy.yml`` playbook. @@ -62,7 +65,7 @@ playbook. ``-e cephadm_bootstrap=True`` on playbook run. Removing a storage node -======================= +----------------------- First drain the node @@ -85,7 +88,7 @@ Additional options/commands may be found in `Host management `_ Replacing a Failed Ceph Drive -============================= +----------------------------- Once an OSD has been identified as having a hardware failure, the affected drive will need to be replaced. @@ -121,3 +124,156 @@ If this is not your desired action plan - it's best to modify the drivegroup spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). Either set ``unmanaged: true`` to stop cephadm from picking up new disks or modify it in some way that it no longer matches the drives you want to remove. + + +Operations +========== + +Replacing drive +--------------- + +See upstream documentation: +https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd + +In case where disk holding DB and/or WAL fails, it is necessary to recreate +(using replacement procedure above) all OSDs that are associated with this +disk - usually NVMe drive. The following single command is sufficient to +identify which OSDs are tied to which physical disks: + +.. code-block:: console + + ceph# ceph device ls + +Host maintenance +---------------- + +https://docs.ceph.com/en/latest/cephadm/host-management/#maintenance-mode + +Upgrading +--------- + +https://docs.ceph.com/en/latest/cephadm/upgrade/ + + +Troubleshooting +=============== + +Investigating a Failed Ceph Drive +--------------------------------- + +A failing drive in a Ceph cluster will cause OSD daemon to crash. +In this case Ceph will go into `HEALTH_WARN` state. +Ceph can report details about failed OSDs by running: + +.. code-block:: console + + ceph# ceph health detail + +.. note :: + + Remember to run ceph/rbd commands from within ``cephadm shell`` + (preferred method) or after installing Ceph client. Details in the + official `documentation `__. + It is also required that the host where commands are executed has admin + Ceph keyring present - easiest to achieve by applying + `_admin `__ + label (Ceph MON servers have it by default when using + `StackHPC Cephadm collection `__). + +A failed OSD will also be reported as down by running: + +.. code-block:: console + + ceph# ceph osd tree + +Note the ID of the failed OSD. + +The failed disk is usually logged by the Linux kernel too: + +.. code-block:: console + + storage-0# dmesg -T + +Cross-reference the hardware device and OSD ID to ensure they match. +(Using `pvs` and `lvs` may help make this connection). + +Inspecting a Ceph Block Device for a VM +--------------------------------------- + +To find out what block devices are attached to a VM, go to the hypervisor that +it is running on (an admin-level user can see this from ``openstack server +show``). + +On this hypervisor, enter the libvirt container: + +.. code-block:: console + :substitutions: + + |hypervisor_hostname|# docker exec -it nova_libvirt /bin/bash + +Find the VM name using libvirt: + +.. 
code-block:: console + :substitutions: + + (nova-libvirt)[root@|hypervisor_hostname| /]# virsh list + Id Name State + ------------------------------------ + 1 instance-00000001 running + +Now inspect the properties of the VM using ``virsh dumpxml``: + +.. code-block:: console + :substitutions: + + (nova-libvirt)[root@|hypervisor_hostname| /]# virsh dumpxml instance-00000001 | grep rbd + + +On a Ceph node, the RBD pool can be inspected and the volume extracted as a RAW +block image: + +.. code-block:: console + :substitutions: + + ceph# rbd ls |nova_rbd_pool| + ceph# rbd export |nova_rbd_pool|/51206278-e797-4153-b720-8255381228da_disk blob.raw + +The raw block device (blob.raw above) can be mounted using the loopback device. + +Inspecting a QCOW Image using LibGuestFS +---------------------------------------- + +The virtual machine's root image can be inspected by installing +libguestfs-tools and using the guestfish command: + +.. code-block:: console + + ceph# export LIBGUESTFS_BACKEND=direct + ceph# guestfish -a blob.qcow + > run + 100% [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 00:00 + > list-filesystems + /dev/sda1: ext4 + > mount /dev/sda1 / + > ls / + bin + boot + dev + etc + home + lib + lib64 + lost+found + media + mnt + opt + proc + root + run + sbin + srv + sys + tmp + usr + var + > quit From f4b2630557c284c24bf6513344d8b3120cacff80 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 28 May 2024 13:19:59 +0100 Subject: [PATCH 05/42] Add openstack operation docs --- .../operations/control-plane-operation.rst | 391 ++++++++++++++++++ doc/source/operations/migrating-vm.rst | 22 + .../operations/openstack-reconfiguration.rst | 186 +++++++++ 3 files changed, 599 insertions(+) create mode 100644 doc/source/operations/control-plane-operation.rst create mode 100644 doc/source/operations/migrating-vm.rst create mode 100644 doc/source/operations/openstack-reconfiguration.rst diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst new file mode 100644 index 0000000000..c5c629d52f --- /dev/null +++ b/doc/source/operations/control-plane-operation.rst @@ -0,0 +1,391 @@ +======================= +Operating Control Plane +======================= + +Backup of the OpenStack Control Plane +===================================== + +As the backup procedure is constantly changing, it is normally best to check +the upstream documentation for an up to date procedure. Here is a high level +overview of the key things you need to backup: + +Controllers +----------- + +* `Back up SQL databases `__ +* `Back up configuration in /etc/kolla `__ + +Compute +------- + +The compute nodes can largely be thought of as ephemeral, but you do need to +make sure you have migrated any instances and disabled the hypervisor before +decommissioning or making any disruptive configuration change. + +Monitoring +---------- + +* `Back up InfluxDB `__ +* `Back up ElasticSearch `__ +* `Back up Prometheus `__ + +Seed +---- + +* `Back up bifrost `__ + +Ansible control host +-------------------- + +* Back up service VMs such as the seed VM + +Control Plane Monitoring +======================== + +The control plane has been configured to collect logs centrally using the EFK +stack (Elasticsearch, Fluentd and Kibana). + +Telemetry monitoring of the control plane is performed by Prometheus. Metrics +are collected by Prometheus exporters, which are either running on all hosts +(e.g. 
node exporter), on specific hosts (e.g. controllers for the memcached +exporter or monitoring hosts for the OpenStack exporter). These exporters are +scraped by the Prometheus server. + +Configuring Prometheus Alerts +----------------------------- + +Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` +files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add +custom rules. + +Silencing Prometheus Alerts +--------------------------- + +Sometimes alerts must be silenced because the root cause cannot be resolved +right away, such as when hardware is faulty. For example, an unreachable +hypervisor will produce several alerts: + +* ``InstanceDown`` from Node Exporter +* ``OpenStackServiceDown`` from the OpenStack exporter, which reports status of + the ``nova-compute`` agent on the host +* ``PrometheusTargetMissing`` from several Prometheus exporters + +Rather than silencing each alert one by one for a specific host, a silence can +apply to multiple alerts using a reduced list of labels. :ref:`Log into +Alertmanager `, click on the ``Silence`` button next +to an alert and adjust the matcher list to keep only ``instance=`` +label. Then, create another silence to match ``hostname=`` (this is +required because, for the OpenStack exporter, the instance is the host running +the monitoring service rather than the host being monitored). + +.. note:: + + After creating the silence, you may get redirected to a 404 page. This is a + `known issue `__ + when running several Alertmanager instances behind HAProxy. + +Generating Alerts from Metrics +++++++++++++++++++++++++++++++ + +Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` +files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add +custom rules. + +Control Plane Shutdown Procedure +================================ + +Overview +-------- + +* Verify integrity of clustered components (RabbitMQ, Galera, Keepalived). They + should all report a healthy status. +* Put node into maintenance mode in bifrost to prevent it from automatically + powering back on +* Shutdown down nodes one at a time gracefully using systemctl poweroff + +Controllers +----------- + +If you are restarting the controllers, it is best to do this one controller at +a time to avoid the clustered components losing quorum. + +Checking Galera state ++++++++++++++++++++++ + +On each controller perform the following: + +.. code-block:: console + + [stack@controller0 ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" + Variable_name Value + wsrep_local_state_comment Synced + +The password can be found using: + +.. code-block:: console + + kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml \ + --vault-password-file | grep ^database + +Checking RabbitMQ ++++++++++++++++++ + +RabbitMQ health is determined using the command ``rabbitmqctl cluster_status``: + +.. code-block:: console + + [stack@controller0 ~]$ docker exec rabbitmq rabbitmqctl cluster_status + Cluster status of node rabbit@controller0 ... + [{nodes,[{disc,['rabbit@controller0','rabbit@controller1', + 'rabbit@controller2']}]}, + {running_nodes,['rabbit@controller1','rabbit@controller2', + 'rabbit@controller0']}, + {cluster_name,<<"rabbit@controller2">>}, + {partitions,[]}, + {alarms,[{'rabbit@controller1',[]}, + {'rabbit@controller2',[]}, + {'rabbit@controller0',[]}]}] + +Checking Keepalived ++++++++++++++++++++ + +On (for example) three controllers: + +.. 
code-block:: console + + [stack@controller0 ~]$ docker logs keepalived + +Two instances should show: + +.. code-block:: console + + VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE + +and the other: + +.. code-block:: console + + VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE + +Ansible Control Host +-------------------- + +The Ansible control host is not enrolled in bifrost. This node may run services +such as the seed virtual machine which will need to be gracefully powered down. + +Compute +------- + +If you are shutting down a single hypervisor, to avoid down time to tenants it +is advisable to migrate all of the instances to another machine. See +:ref:`evacuating-all-instances`. + +.. ifconfig:: deployment['ceph_managed'] + + Ceph + ---- + + The following guide provides a good overview: + https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph + +Shutting down the seed VM +------------------------- + +.. code-block:: console + + kayobe# virsh shutdown + +.. _full-shutdown: + +Full shutdown +------------- + +In case a full shutdown of the system is required, we advise to use the +following order: + +* Perform a graceful shutdown of all virtual machine instances +* Shut down compute nodes +* Shut down monitoring node +* Shut down network nodes (if separate from controllers) +* Shut down controllers +* Shut down Ceph nodes (if applicable) +* Shut down seed VM +* Shut down Ansible control host + +Rebooting a node +---------------- + +Example: Reboot all compute hosts apart from compute0: + +.. code-block:: console + + kayobe# kayobe overcloud host command run --limit 'compute:!compute0' -b --command "shutdown -r" + +References +---------- + +* https://galeracluster.com/library/training/tutorials/restarting-cluster.html + +Control Plane Power on Procedure +================================ + +Overview +-------- + +* Remove the node from maintenance mode in bifrost +* Bifrost should automatically power on the node via IPMI +* Check that all docker containers are running +* Check Kibana for any messages with log level ERROR or equivalent + +Controllers +----------- + +If all of the servers were shut down at the same time, it is necessary to run a +script to recover the database once they have all started up. This can be done +with the following command: + +.. code-block:: console + + kayobe# kayobe overcloud database recover + +Ansible Control Host +-------------------- + +The Ansible control host is not enrolled in Bifrost and will have to be powered +on manually. + +Seed VM +------- + +The seed VM (and any other service VM) should start automatically when the seed +hypervisor is powered on. If it does not, it can be started with: + +.. code-block:: console + + kayobe# virsh start seed-0 + +Full power on +------------- + +Follow the order in :ref:`full-shutdown`, but in reverse order. + +Shutting Down / Restarting Monitoring Services +---------------------------------------------- + +Shutting down ++++++++++++++ + +Log into the monitoring host(s): + +.. code-block:: console + + kayobe# ssh stack@monitoring0 + +Stop all Docker containers: + +.. code-block:: console + + monitoring0# for i in `docker ps -q`; do docker stop $i; done + +Shut down the node: + +.. code-block:: console + + monitoring0# sudo shutdown -h + +Starting up ++++++++++++ + +The monitoring services containers will automatically start when the monitoring +node is powered back on. 
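+
+To confirm that everything has come back cleanly, a quick check of the
+container state on the monitoring host is usually enough (container names
+vary between deployments):
+
+.. code-block:: console
+
+   monitoring0# docker ps --format '{{.Names}}: {{.Status}}'
+
+Any container stuck in a restart loop will show up here with a ``Restarting``
+status.
+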
+ +Software Updates +================ + +Update Packages on Control Plane +-------------------------------- + +OS packages can be updated with: + +.. code-block:: console + + kayobe# kayobe overcloud host package update --limit --packages '*' + kayobe# kayobe overcloud seed package update --packages '*' + +See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages + +Minor Upgrades to OpenStack Services +------------------------------------ + +* Pull latest changes from upstream stable branch to your own ``kolla`` fork (if applicable) +* Update ``kolla_openstack_release`` in ``etc/kayobe/kolla.yml`` (unless using default) +* Update tags for the images in ``etc/kayobe/kolla/globals.yml`` to use the new value of ``kolla_openstack_release`` +* Rebuild container images +* Pull container images to overcloud hosts +* Run kayobe overcloud service upgrade + +For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html + +Troubleshooting +=============== + +Deploying to a Specific Hypervisor +---------------------------------- + +To test creating an instance on a specific hypervisor, *as an admin-level user* +you can specify the hypervisor name as part of an extended availability zone +description. + +To see the list of hypervisor names: + +.. code-block:: console + + admin# openstack hypervisor list + +To boot an instance on a specific hypervisor + +.. code-block:: console + + admin# openstack server create --flavor --network --key-name --image --availability-zone nova:: + +Cleanup Procedures +================== + +OpenStack services can sometimes fail to remove all resources correctly. This +is the case with Magnum, which fails to clean up users in its domain after +clusters are deleted. `A patch has been submitted to stable branches +`__. +Until this fix becomes available, if Magnum is in use, administrators can +perform the following cleanup procedure regularly: + +.. code-block:: console + + admin# for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do + if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then + echo "$user still in use, not deleting" + else + openstack user delete --domain magnum $user + fi + done + +OpenSearch indexes retention +============================= + +To alter default rotation values for OpenSearch, edit + +``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: + +.. code-block:: console + # Duration after which index is closed (default 30) + opensearch_soft_retention_period_days: 90 + # Duration after which index is deleted (default 60) + opensearch_hard_retention_period_days: 180 + +Reconfigure Opensearch with new values: + +.. code-block:: console + kayobe overcloud service reconfigure --kolla-tags opensearch + +For more information see the `upstream documentation + +`__. diff --git a/doc/source/operations/migrating-vm.rst b/doc/source/operations/migrating-vm.rst new file mode 100644 index 0000000000..784abe74a4 --- /dev/null +++ b/doc/source/operations/migrating-vm.rst @@ -0,0 +1,22 @@ +========================== +Migrating virtual machines +========================== + +To see where all virtual machines are running on the hypervisors: + +.. code-block:: console + + admin# openstack server list --all-projects --long + +To move a virtual machine with shared storage or booted from volume from one hypervisor to another, for example to +hypervisor-01: + +.. 
code-block:: console + + admin# openstack server migrate --live-migration --host hypervisor-01 + +To move a virtual machine with local disks: + +.. code-block:: console + + admin# openstack server migrate --live-migration --block-migration --host hypervisor-01 diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst new file mode 100644 index 0000000000..dfba372f26 --- /dev/null +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -0,0 +1,186 @@ +========================= +OpenStack Reconfiguration +========================= + +Disabling a Service +------------------- + +Ansible is oriented towards adding or reconfiguring services, but removing a +service is handled less well, because of Ansible's imperative style. + +To remove a service, it is disabled in Kayobe's Kolla config, which prevents +other services from communicating with it. For example, to disable +``cinder-backup``, edit ``${KAYOBE_CONFIG_PATH}/kolla.yml``: + +.. code-block:: diff + + -enable_cinder_backup: true + +enable_cinder_backup: false + +Then, reconfigure Cinder services with Kayobe: + +.. code-block:: console + + kayobe# kayobe overcloud service reconfigure --kolla-tags cinder + +However, the service itself, no longer in Ansible's manifest of managed state, +must be manually stopped and prevented from restarting. + +On each controller: + +.. code-block:: console + + kayobe# docker rm -f cinder_backup + +Some services may store data in a dedicated Docker volume, which can be removed +with ``docker volume rm``. + +Installing TLS Certificates +--------------------------- + +To configure TLS for the first time, we write the contents of a PEM +file to the ``secrets.yml`` file as ``secrets_kolla_external_tls_cert``. +Use a command of this form: + +.. code-block:: console + + kayobe# ansible-vault edit ${KAYOBE_CONFIG_PATH}/secrets.yml --vault-password-file= + +Concatenate the contents of the certificate and key files to create +``secrets_kolla_external_tls_cert``. The certificates should be installed in +this order: + +* TLS certificate for the public endpoint FQDN +* Any intermediate certificates +* The TLS certificate private key + +In ``${KAYOBE_CONFIG_PATH}/kolla.yml``, set the following: + +.. code-block:: yaml + + kolla_enable_tls_external: True + kolla_external_tls_cert: "{{ secrets_kolla_external_tls_cert }}" + +To apply TLS configuration, we need to reconfigure all services, as endpoint URLs need to +be updated in Keystone: + +.. code-block:: console + + kayobe# kayobe overcloud service reconfigure + +Alternative Configuration ++++++++++++++++++++++++++ + +As an alternative to writing the certificates as a variable to +``secrets.yml``, it is also possible to write the same data to a file, +``etc/kayobe/kolla/certificates/haproxy.pem``. The file should be +vault-encrypted in the same manner as secrets.yml. In this instance, +variable ``kolla_external_tls_cert`` does not need to be defined. + +See `Kolla-Ansible TLS guide +`__ for +further details. + +Updating TLS Certificates +------------------------- + +Check the expiry date on an installed TLS certificate from a host that can +reach the OpenStack APIs: + +.. code-block:: console + :substitutions: + + openstack# openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates + +*NOTE*: Prometheus Blackbox monitoring can check certificates automatically +and alert when expiry is approaching. 
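+
+Before installing a replacement bundle, it can be worth sanity-checking the
+PEM file locally. This is a sketch, assuming the unencrypted bundle is
+available as ``haproxy.pem`` (``openssl x509`` only reads the first
+certificate in the file, which should be the public endpoint certificate):
+
+.. code-block:: console
+
+   openstack# openssl x509 -in haproxy.pem -noout -subject -dates
+   openstack# diff <(openssl x509 -in haproxy.pem -noout -pubkey) \
+                   <(openssl pkey -in haproxy.pem -pubout)
+
+The second command produces no output when the private key in the bundle
+matches the certificate.
+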
+ +To update an existing certificate, for example when it has reached expiration, +change the value of ``secrets_kolla_external_tls_cert``, in the same order as +above. Run the following command: + +.. code-block:: console + + kayobe# kayobe overcloud service reconfigure --kolla-tags haproxy + +.. _taking-a-hypervisor-out-of-service: + +Taking a Hypervisor out of Service +---------------------------------- + +To take a hypervisor out of Nova scheduling: + +.. code-block:: console + + admin# openstack compute service set --disable \ + nova-compute + +Running instances on the hypervisor will not be affected, but new instances +will not be deployed on it. + +A reason for disabling a hypervisor can be documented with the +``--disable-reason`` flag: + +.. code-block:: console + + admin# openstack compute service set --disable \ + --disable-reason "Broken drive" nova-compute + +Details about all hypervisors and the reasons they are disabled can be +displayed with: + +.. code-block:: console + + admin# openstack compute service list --long + +And then to enable a hypervisor again: + +.. code-block:: console + + admin# openstack compute service set --enable \ + nova-compute + +Managing Space in the Docker Registry +------------------------------------- + +If the Docker registry becomes full, this can prevent container updates and +(depending on the storage configuration of the seed host) could lead to other +problems with services provided by the seed host. + +To remove container images from the Docker Registry, follow this process: + +* Reconfigure the registry container to allow deleting containers. This can be + done in ``docker-registry.yml`` with Kayobe: + +.. code-block:: yaml + + docker_registry_env: + REGISTRY_STORAGE_DELETE_ENABLED: "true" + +* For the change to take effect, run: + +.. code-block:: console + + kayobe seed host configure + +* A helper script is useful, such as https://github.com/byrnedo/docker-reg-tool + (this requires ``jq``). To delete all images with a specific tag, use: + +.. code-block:: console + + for repo in `./docker_reg_tool http://registry-ip:4000 list`; do + ./docker_reg_tool http://registry-ip:4000 delete $repo $tag + done + +* Deleting the tag does not actually release the space. To actually free up + space, run garbage collection: + +.. code-block:: console + + seed# docker exec docker_registry bin/registry garbage-collect /etc/docker/registry/config.yml + +The seed host can also accrue a lot of data from building container images. +The images stored locally in the seed host can be seen using ``docker image ls``. + +Old and redundant images can be identified from their names and tags, and +removed using ``docker image rm``. From 8f304839c478550e7eaf2eb0d534e33ce123aa18 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 28 May 2024 13:20:17 +0100 Subject: [PATCH 06/42] Add wazuh operation docs --- doc/source/operations/wazuh-operation.rst | 89 +++++++++++++++++++++++ 1 file changed, 89 insertions(+) create mode 100644 doc/source/operations/wazuh-operation.rst diff --git a/doc/source/operations/wazuh-operation.rst b/doc/source/operations/wazuh-operation.rst new file mode 100644 index 0000000000..23800ff849 --- /dev/null +++ b/doc/source/operations/wazuh-operation.rst @@ -0,0 +1,89 @@ +======================= +Wazuh Security Platform +======================= + +`Wazuh `_ is a security monitoring platform. +It monitors for: + +* Security-related system events. +* Known vulnerabilities (CVEs) in versions of installed software. 
+* Misconfigurations in system security. + +One method for deploying and maintaining Wazuh is the `official +Ansible playbooks `_. These +can be integrated into ``kayobe-config`` as a custom playbook. + +Configuring Wazuh Manager +------------------------- + +Wazuh Manager is configured by editing the ``wazuh-manager.yml`` +groups vars file found at +``etc/kayobe/inventory/group_vars/wazuh-manager/``. This file +controls various aspects of Wazuh Manager configuration. +Most notably: + +*domain_name*: + The domain used by Search Guard CE when generating certificates. + +*wazuh_manager_ip*: + The IP address that the Wazuh Manager shall reside on for communicating with the agents. + +*wazuh_manager_connection*: + Used to define port and protocol for the manager to be listening on. + +*wazuh_manager_authd*: + Connection settings for the daemon responsible for registering new agents. + +Running ``kayobe playbook run +$KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml`` will deploy these +changes. + +Secrets +------- + +Wazuh requires that secrets or passwords are set for itself and the services with which it communiticates. +The playbook ``etc/kayobe/ansible/wazuh-secrets.yml`` automates the creation of these secrets, which should then be encrypted with Ansible Vault. + +To update the secrets you can execute the following two commands + +.. code-block:: shell + + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml \ + -e wazuh_user_pass=$(uuidgen) \ + -e wazuh_admin_pass=$(uuidgen) + kayobe# ansible-vault encrypt --vault-password-file \ + $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml + +Once generated, run ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml`` which copies the secrets into place. + +.. note:: Use ``ansible-vault`` to view the secrets: + + ``ansible-vault view --vault-password-file ~/vault.password $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml`` + +Adding a New Agent +------------------ +The Wazuh Agent is deployed to all hosts in the ``wazuh-agent`` +inventory group, comprising the ``seed`` group +plus the ``overcloud`` group (containing all hosts in the +OpenStack control plane). + +.. code-block:: ini + + [wazuh-agent:children] + seed + overcloud + +The following playbook deploys the Wazuh Agent to all hosts in the +``wazuh-agent`` group: + +.. code-block:: shell + + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml + +The hosts running Wazuh Agent should automatically be registered +and visible within the Wazuh Manager dashboard. + +.. note:: It is good practice to use a `Kayobe deploy hook + `_ + to automate deployment and configuration of the Wazuh Agent + following a run of ``kayobe overcloud host configure``. 
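+
+For example, such a hook can be created by symlinking the playbook into the
+relevant hooks directory. This is a sketch assuming the standard Kayobe hook
+layout; the ``50-`` prefix only controls ordering and is arbitrary:
+
+.. code-block:: shell
+
+   mkdir -p ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/post.d
+   ln -s ../../../ansible/wazuh-agent.yml \
+       ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/post.d/50-wazuh-agent.yml
+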
From 7b29ccd1f58619c8ffcb8ce0f41ba09de6632640 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 28 May 2024 16:35:11 +0100 Subject: [PATCH 07/42] Add hardware inventory management doc --- .../hardware-inventory-management.rst | 253 ++++++++++++++++++ 1 file changed, 253 insertions(+) create mode 100644 doc/source/operations/hardware-inventory-management.rst diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/hardware-inventory-management.rst new file mode 100644 index 0000000000..43fcb73aff --- /dev/null +++ b/doc/source/operations/hardware-inventory-management.rst @@ -0,0 +1,253 @@ +============================= +Hardware Inventory Management +============================= + +At its lowest level, hardware inventory is managed in the Bifrost service. + +Reconfiguring Control Plane Hardware +------------------------------------ + +If a server's hardware or firmware configuration is changed, it should be +re-inspected in Bifrost before it is redeployed into service. A single server +can be reinspected like this: + +.. code-block:: console + + kayobe# kayobe overcloud hardware inspect --limit + +.. _enrolling-new-hypervisors: + +Enrolling New Hypervisors +------------------------- + +New hypervisors can be added to the Bifrost inventory by using its discovery +capabilities. Assuming that new hypervisors have IPMI enabled and are +configured to network boot on the provisioning network, the following commands +will instruct them to PXE boot. The nodes will boot on the Ironic Python Agent +kernel and ramdisk, which is configured to extract hardware information and +send it to Bifrost. Note that IPMI credentials can be found in the encrypted +file located at ``${KAYOBE_CONFIG_PATH}/secrets.yml``. + +.. code-block:: console + + bifrost# ipmitool -I lanplus -U -H -ipmi chassis bootdev pxe + +If node is are off, power them on: + +.. code-block:: console + + bifrost# ipmitool -I lanplus -U -H -ipmi power on + +If nodes is on, reset them: + +.. code-block:: console + + bifrost# ipmitool -I lanplus -U -H -ipmi power reset + +Once node have booted and have completed introspection, they should be visible +in Bifrost: + +.. code-block:: console + + bifrost# baremetal node list --provision-state enroll + +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ + | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance | + +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ + | da0c61af-b411-41b9-8909-df2509f2059b | example-hypervisor-01 | None | power off | enroll | False | + +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ + +After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` to add these new hosts to +the correct groups, import them in Kayobe's inventory with: + +.. code-block:: console + + kayobe# kayobe overcloud inventory discover + +We can then provision and configure them: + +.. 
code-block:: console + + kayobe# kayobe overcloud provision --limit + kayobe# kayobe overcloud host configure --limit + kayobe# kayobe overcloud service deploy --limit --kolla-limit + +Replacing a Failing Hypervisor +------------------------------ + +To replace a failing hypervisor, proceed as follows: + +* :ref:`Disable the hypervisor to avoid scheduling any new instance on it ` +* :ref:`Evacuate all instances ` +* :ref:`Set the node to maintenance mode in Bifrost ` +* Physically fix or replace the node +* It may be necessary to reinspect the node if hardware was changed (this will require deprovisioning and reprovisioning) +* If the node was replaced or reprovisioned, follow :ref:`enrolling-new-hypervisors` + +To deprovision an existing hypervisor, run: + +.. code-block:: console + + kayobe# kayobe overcloud deprovision --limit + +.. warning:: + + Always use ``--limit`` with ``kayobe overcloud deprovision`` on a production + system. Running this command without a limit will deprovision all overcloud + hosts. + +.. _evacuating-all-instances: + +Evacuating all instances +------------------------ + +.. code-block:: console + + admin# openstack server evacuate $(openstack server list --host --format value --column ID) + +You should now check the status of all the instances that were running on that +hypervisor. They should all show the status ACTIVE. This can be verified with: + +.. code-block:: console + + admin# openstack server show + +Troubleshooting ++++++++++++++++ + +Servers that have been shut down +******************************** + +If there are any instances that are SHUTOFF they won’t be migrated, but you can +use ``openstack server migrate`` for them once the live migration is finished. + +Also if a VM does heavy memory access, it may take ages to migrate (Nova tries +to incrementally increase the expected downtime, but is quite conservative). +You can use ``openstack server migration force complete --os-compute-api-version 2.22 +`` to trigger the final move. + +You get the migration ID via ``openstack server migration list --server ``. + +For more details see: +http://www.danplanet.com/blog/2016/03/03/evacuate-in-nova-one-command-to-confuse-us-all/ + +Flavors have changed +******************** + +If the size of the flavors has changed, some instances will also fail to +migrate as the process needs manual confirmation. You can do this with: + +.. code-block:: console + + openstack # openstack server resize confirm + +The symptom to look out for is that the server is showing a status of ``VERIFY +RESIZE`` as shown in this snippet of ``openstack server show ``: + +.. code-block:: console + + | status | VERIFY_RESIZE | + +.. _set-bifrost-maintenance-mode: + +Set maintenance mode on a node in Bifrost ++++++++++++++++++++++++++++++++++++++++++ + +.. code-block:: console + + seed# docker exec -it bifrost_deploy /bin/bash + (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost + (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance set + +.. _unset-bifrost-maintenance-mode: + +Unset maintenance mode on a node in Bifrost ++++++++++++++++++++++++++++++++++++++++++++ + +.. 
code-block:: console + + seed# docker exec -it bifrost_deploy /bin/bash + (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost + (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance unset + +Detect hardware differences with ADVise +======================================= + +Extract Bifrost introspection data +---------------------------------- + +The ADVise tool assumes that hardware introspection data has already been gathered in JSON format. +The ``extra-hardware`` disk builder element enabled when building the IPA image for the required data to be available. + +To build ipa image with extra-hardware you need to edit ``ipa.yml`` and add this: +.. code-block:: console + + # Whether to build IPA images from source. + ipa_build_images: true + + # List of additional Diskimage Builder (DIB) elements to use when building IPA + images. Default is none. + ipa_build_dib_elements_extra: + - "extra-hardware" + + # List of additional inspection collectors to run. + ipa_collectors_extra: + - "extra-hardware" + +Extract introspection data from Bifrost with Kayobe. JSON files will be created +into ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``: + +.. code-block:: console + + kayobe# kayobe overcloud introspection data save + +Using ADVise +------------ + +Hardware information captured during the Ironic introspection process can be +analysed to detect hardware differences, such as mismatches in firmware +versions or missing storage devices. The `ADVise `__ +tool can be used for this purpose. + +The Ansible playbook ``advise-run.yml`` can be found at ``${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml``. + +The playbook will: + +1. Install ADVise and dependencies +2. Run the mungetout utility for extracting the required information from the introspection data ready for use with ADVise. +3. Run ADVise on the data. + +.. code-block:: console + + cd ${KAYOBE_CONFIG_PATH} + ansible-playbook ${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml + +The playbook has the following optional parameters: + +- venv : path to the virtual environment to use. Default: ``"~/venvs/advise-review"`` +- input_dir: path to the hardware introspection data. Default: ``"{{ lookup('env', 'PWD') }}/overcloud-introspection-data"`` +- output_dir: path to where results should be saved. Default: ``"{{ lookup('env', 'PWD') }}/review"`` +- advise-pattern: regular expression to specify what introspection data should be analysed. Default: ``".*.eval"`` + +Example command to run the tool on data about the compute nodes in a system, where compute nodes are named cpt01, cpt02, cpt03…: + +.. code-block:: console + + ansible-playbook advise-run.yml -e advise_pattern=’(cpt)(.*)(.eval)’ + + +.. note:: + The mungetout utility will always use the file extension .eval + +Using the results +----------------- + +The ADVise tool will output a selection of results found under output_dir/results these include: + +- ``.html`` files to display network visualisations of any hardware differences. +- The folder ``Paired_Comparisons`` which contains information on the shared and differing fields found between the systems. This is a reflection of the network visualisation webpage, with more detail as to what the differences are. +- ``_summary``, a listing of how the systems can be grouped into sets of identical hardware. +- ``_performance``, the results of analysing the benchmarking data gathered. 
+- ``_perf_summary``, a subset of the performance metrics, just showing any potentially anomalous data such as where variance is too high, or individual nodes have been found to over/underperform. + +To get visuallised result, It is recommanded to copy instrospection data and review directories to your +local machine then run ADVise playbook locally with the data. From 3dc7a2c5adb4d2372cc37a83a411a02a80fdbddb Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 29 May 2024 10:44:10 +0100 Subject: [PATCH 08/42] Move advise tool intro --- .../operations/hardware-inventory-management.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/hardware-inventory-management.rst index 43fcb73aff..0d6fd8adf1 100644 --- a/doc/source/operations/hardware-inventory-management.rst +++ b/doc/source/operations/hardware-inventory-management.rst @@ -172,6 +172,11 @@ Unset maintenance mode on a node in Bifrost Detect hardware differences with ADVise ======================================= +Hardware information captured during the Ironic introspection process can be +analysed to detect hardware differences, such as mismatches in firmware +versions or missing storage devices. The `ADVise `__ +tool can be used for this purpose. + Extract Bifrost introspection data ---------------------------------- @@ -203,11 +208,6 @@ into ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``: Using ADVise ------------ -Hardware information captured during the Ironic introspection process can be -analysed to detect hardware differences, such as mismatches in firmware -versions or missing storage devices. The `ADVise `__ -tool can be used for this purpose. - The Ansible playbook ``advise-run.yml`` can be found at ``${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml``. The playbook will: From 21aaa5a24492efa33b124a9e402550ef6c236cbd Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 29 May 2024 13:55:43 +0100 Subject: [PATCH 09/42] Add baremetal node management doc --- .../operations/baremetal-node-management.rst | 277 ++++++++++++++++++ 1 file changed, 277 insertions(+) create mode 100644 doc/source/operations/baremetal-node-management.rst diff --git a/doc/source/operations/baremetal-node-management.rst b/doc/source/operations/baremetal-node-management.rst new file mode 100644 index 0000000000..f45903dad9 --- /dev/null +++ b/doc/source/operations/baremetal-node-management.rst @@ -0,0 +1,277 @@ +====================================== +Bare Metal Compute Hardware Management +====================================== + +Bare metal compute nodes are managed by the Ironic services. +This section describes elements of the configuration of this service. + +.. _ironic-node-lifecycle: + +Ironic node life cycle +---------------------- + +The deployment process is documented in the `Ironic User Guide `__. +OpenStack deployment uses the +`direct deploy method `__. + +The Ironic state machine can be found `here `__. The rest of +this documentation refers to these states and assumes that you have familiarity. + +High level overview of state transitions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following section attempts to describe the state transitions for various Ironic operations at a high level. +It focuses on trying to describe the steps where dynamic switch reconfiguration is triggered. +For a more detailed overview, refer to the :ref:`ironic-node-lifecycle` section. 
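+
+While any of these operations are in progress, a node's position in the state
+machine can be followed with the Ironic CLI, for example (``<node>`` is the
+node name or UUID):
+
+.. code-block:: console
+
+   admin# openstack baremetal node list --provision-state deploying
+   admin# openstack baremetal node show <node> -f value -c provision_state -c power_state
+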
+ +Provisioning +~~~~~~~~~~~~ + +Provisioning starts when an instance is created in Nova using a bare metal flavor. + +- Node starts in the available state (available) +- User provisions an instance (deploying) +- Ironic will switch the node onto the provisioning network (deploying) +- Ironic will power on the node and will await a callback (wait-callback) +- Ironic will image the node with an operating system using the image provided at creation (deploying) +- Ironic switches the node onto the tenant network(s) via neutron (deploying) +- Transition node to active state (active) + +.. _baremetal-management-deprovisioning: + +Deprovisioning +~~~~~~~~~~~~~~ + +Deprovisioning starts when an instance created in Nova using a bare metal flavor is destroyed. + +If automated cleaning is enabled, it occurs when nodes are deprovisioned. + +- Node starts in active state (active) +- User deletes instance (deleting) +- Ironic will remove the node from any tenant network(s) (deleting) +- Ironic will switch the node onto the cleaning network (deleting) +- Ironic will power on the node and will await a callback (clean-wait) +- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) +- Ironic removes node from cleaning network (cleaning) +- Node transitions to available (available) + +If automated cleaning is disabled. + +- Node starts in active state (active) +- User deletes instance (deleting) +- Ironic will remove the node from any tenant network(s) (deleting) +- Node transitions to available (available) + +Cleaning +~~~~~~~~ + +Manual cleaning is not part of the regular state transitions when using Nova, however nodes can be manually cleaned by administrators. + +- Node starts in the manageable state (manageable) +- User triggers cleaning with API (cleaning) +- Ironic will switch the node onto the cleaning network (cleaning) +- Ironic will power on the node and will await a callback (clean-wait) +- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) +- Ironic removes node from cleaning network (cleaning) +- Node transitions back to the manageable state (manageable) + +Rescuing +~~~~~~~~ + +Feature not used. The required rescue network is not currently configured. + +Baremetal networking +-------------------- + +Baremetal networking with the Neutron Networking Generic Switch ML2 driver requires a combination of static and dynamic switch configuration. + +.. _static-switch-config: + +Static switch configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Static physical network configuration is managed via Kayobe. + +.. TODO: Fill in the switch configuration + +- Some initial switch configuration is required before networking generic switch can take over the management of an interface. + First, LACP must be configured on the switch ports attached to the baremetal node, e.g: + + .. code-block:: shell + + The interface is then partially configured: + + .. code-block:: shell + + For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network: + + .. code-block:: shell + + **NOTE**: You only need to do this if Ironic isn't aware of the node. + +Configuration with kayobe +^^^^^^^^^^^^^^^^^^^^^^^^^ + +Kayobe can be used to apply the :ref:`static-switch-config`. + +- Upstream documentation can be found `here `__. +- Kayobe does all the switch configuration that isn't :ref:`dynamically updated using Ironic `. 
+- Optionally switches the node onto the provisioning network (when using ``--enable-discovery``) + + + NOTE: This is a dangerous operation as it can wipe out the dynamic VLAN configuration applied by neutron/ironic. + You should only run this when initially enrolling a node, and should always use the ``interface-description-limit`` option. For example: + + .. code-block:: + + kayobe physical network configure --interface-description-limit --group switches --display --enable-discovery + + In this example, ``--display`` is used to preview the switch configuration without applying it. + +.. TODO: Fill in information about how switches are configured in kayobe-config, with links + +- Configuration is done using a combination of ``group_vars`` and ``host_vars`` + +.. _dynamic-switch-configuration: + +Dynamic switch configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Ironic dynamically configures the switches using the Neutron `Networking Generic Switch `_ ML2 driver. + +- Used to toggle the baremetal nodes onto different networks + + + Can use any VLAN network defined in OpenStack, providing that the VLAN has been trunked to the controllers + as this is required for DHCP to function. + + See :ref:`ironic-node-lifecycle`. This attempts to illustrate when any switch reconfigurations happen. + +- Only configures VLAN membership of the switch interfaces or port groups. To prevent conflicts with the static switch configuration, + the convention used is: after the node is in service in Ironic, VLAN membership should not be manually adjusted and + should be left to be controlled by ironic i.e *don't* use ``--enable-discovery`` without an interface limit when configuring the + switches with kayobe. +- Ironic is configured to use the neutron networking driver. + +.. _ngs-commands: + +Commands that NGS will execute +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Networking Generic Switch is mainly concerned with toggling the ports onto different VLANs. It +cannot fully configure the switch. + +.. TODO: Fill in the switch configuration + +- Switching the port onto the provisioning network + + .. code-block:: shell + +- Switching the port onto the tenant network. + + .. code-block:: shell + +- When deleting the instance, the VLANs are removed from the port. Using: + + .. code-block:: shell + +NGS will save the configuration after each reconfiguration (by default). + +Ports managed by NGS +^^^^^^^^^^^^^^^^^^^^ + +The command below extracts a list of port UUID, node UUID and switch port information. + +.. code-block:: bash + + admin# openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value + +NGS will manage VLAN membership for ports when the ``local_link_connection`` fields match one of the switches in ``ml2_conf.ini``. +The rest of the switch configuration is static. +The switch configuration that NGS will apply to these ports is detailed in :ref:`dynamic-switch-configuration`. + +.. _ironic-node-discovery: + +Ironic node discovery +--------------------- + +Discovery is a process used to automatically enrol new nodes in Ironic. +It works by PXE booting the nodes into the Ironic Python Agent (IPA) ramdisk. +This ramdisk will collect hardware and networking configuration from the node in a process known as introspection. +This data is used to populate the baremetal node object in Ironic. +The series of steps you need to take to enrol a new node is as follows: + +- Configure credentials on the BMC. These are needed for Ironic to be able to perform power control actions. 
+
+- Controllers should have network connectivity with the target BMC.
+
+- (If kayobe manages physical network) Add any additional switch configuration to kayobe config.
+  The minimal switch configuration that kayobe needs to know about is described in :ref:`tor-switch-configuration`.
+
+- Apply any :ref:`static switch configuration <static-switch-config>`. This performs the initial
+  setup of the switchports that is needed before Ironic can take over. The static configuration
+  will not be modified by Ironic, so it should be safe to reapply at any point. See :ref:`ngs-commands`
+  for details about the switch configuration that Networking Generic Switch will apply.
+
+- (If kayobe manages physical network) Put the node onto the provisioning network by using the
+  ``--enable-discovery`` flag and either ``--interface-description-limit`` or ``--interface-limit``
+  (do not run this command without one of these limits). See :ref:`static-switch-config`.
+
+  * This is only necessary to initially discover the node. Once the node is registered in Ironic,
+    it will take over control of the VLAN membership. See :ref:`dynamic-switch-configuration`.
+
+  * This provides ethernet connectivity with the controllers over the `workload provisioning` network.
+
+- (If kayobe doesn't manage physical network) Put the node onto the provisioning network.
+
+.. TODO: link to the relevant file in kayobe config
+
+- Add node to the kayobe inventory.
+
+.. TODO: Fill in details about necessary BIOS & RAID config
+
+- Apply any necessary BIOS & RAID configuration.
+
+.. TODO: Fill in details about how to trigger a PXE boot
+
+- PXE boot the node.
+
+- If the discovery process is successful, the node will appear in Ironic and will get populated with the necessary information from the hardware inspection process.
+
+.. TODO: Link to the Kayobe inventory in the repo
+
+- Add node to the Kayobe inventory in the ``baremetal-compute`` group.
+
+- The node will begin in the ``enroll`` state, and must be moved first to ``manageable``, then ``available`` before it can be used.
+
+  If Ironic automated cleaning is enabled, the node must complete a cleaning process before it can reach the available state.
+
+  * Use Kayobe to attempt to move the node to the ``available`` state.
+
+  .. code-block:: console
+
+     source etc/kolla/public-openrc.sh
+     kayobe baremetal compute provide --limit <node>
+
+- Once the node is in the ``available`` state, Nova will make the node available for scheduling. This happens periodically, and typically takes around a minute.
+
+.. _tor-switch-configuration:
+
+Top of Rack (ToR) switch configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Networking Generic Switch must be aware of the Top-of-Rack switch connected to the new node.
+Switches managed by NGS are configured in ``ml2_conf.ini``.
+
+.. TODO: Fill in details about how switches are added to NGS config in kayobe-config
+
+After adding switches to the NGS configuration, Neutron must be redeployed.
+
+Considerations when booting baremetal compared to VMs
+------------------------------------------------------
+
+- You can only use networks of type: vlan
+- Without using trunk ports, it is only possible to directly attach one network to each port or port group of an instance.
+
+  * To access other networks you can use routers
+  * You can still attach floating IPs
+
+- Instances take much longer to provision (expect at least 15 mins)
+- When booting an instance, use one of the flavors that maps to a baremetal node via the RESOURCE_CLASS configured on the flavor (see the example below).
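+
+As an illustrative sketch only (the flavor name, sizes and resource class below are assumptions, not values taken
+from this deployment), such a flavor zeroes out the standard resource classes and requests one unit of the node's
+custom resource class:
+
+.. code-block:: console
+
+   openstack flavor create --ram 196608 --disk 480 --vcpus 64 my-baremetal-flavor
+   openstack flavor set my-baremetal-flavor \
+     --property resources:VCPU=0 \
+     --property resources:MEMORY_MB=0 \
+     --property resources:DISK_GB=0 \
+     --property resources:CUSTOM_BAREMETAL_GENERAL=1
+
+The ``CUSTOM_`` resource class name must correspond to the ``resource_class`` set on the Ironic nodes, upper-cased
+and prefixed with ``CUSTOM_``.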
From 5482670eb448fb2bd6567a2f28015f15f5e63416 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 29 May 2024 14:44:19 +0100 Subject: [PATCH 10/42] Add gpu doc --- doc/source/operations/gpu-in-openstack.rst | 1124 ++++++++++++++++++++ 1 file changed, 1124 insertions(+) create mode 100644 doc/source/operations/gpu-in-openstack.rst diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst new file mode 100644 index 0000000000..66170c6800 --- /dev/null +++ b/doc/source/operations/gpu-in-openstack.rst @@ -0,0 +1,1124 @@ +.. include:: vars.rst + +============================= +Support for GPUs in OpenStack +============================= + +NVIDIA Virtual GPU +################## + +BIOS configuration +------------------ + +Intel +^^^^^ + +* Enable `VT-x` in the BIOS for virtualisation support. +* Enable `VT-d` in the BIOS for IOMMU support. + +Dell +^^^^ + +Enabling SR-IOV with `racadm`: + +.. code:: shell + + /opt/dell/srvadmin/bin/idracadm7 set BIOS.IntegratedDevices.SriovGlobalEnable Enabled + /opt/dell/srvadmin/bin/idracadm7 jobqueue create BIOS.Setup.1-1 + + + +Obtain driver from NVIDIA licensing portal +------------------------------------------- + +Download Nvidia GRID driver from `here `__ +(This requires a login). The file can either be placed on the :ref:`ansible control host` or :ref:`uploaded to pulp`. + +.. _NVIDIA Pulp: + +Uploading the GRID driver to pulp +--------------------------------- + +Uploading the driver to pulp will make it possible to run kayobe from any host. This can be useful when +running in a CI environment. + +.. code:: shell + + pulp artifact upload --file ~/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip + pulp file content create --relative-path "NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" --sha256 c8e12c15b881df35e618bdee1f141cbfcc7e112358f0139ceaa95b48e20761e0 + pulp file repository create --name nvidia + pulp file repository content add --repository nvidia --sha256 c8e12c15b881df35e618bdee1f141cbfcc7e112358f0139ceaa95b48e20761e0 --relative-path "NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" + pulp file publication create --repository nvidia + pulp file distribution create --name nvidia --base-path nvidia --repository nvidia + +The file will then be available at ``/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip``. You +will need to set the ``vgpu_driver_url`` configuration option to this value: + +.. code:: yaml + + # URL of GRID driver in pulp + vgpu_driver_url: "{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" + +See :ref:`NVIDIA Role Configuration`. + +.. _NVIDIA control host: + +Placing the GRID driver on the ansible control host +--------------------------------------------------- + +Copy the driver bundle to a known location on the ansible control host. Set the ``vgpu_driver_url`` configuration variable to reference this +path using ``file`` as the url scheme e.g: + +.. code:: yaml + + # Location of NVIDIA GRID driver on localhost + vgpu_driver_url: "file://{{ lookup('env', 'HOME') }}/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" + +See :ref:`NVIDIA Role Configuration`. + +.. _NVIDIA OS Configuration: + +OS Configuration +---------------- + +Host OS configuration is done by using roles in the `stackhpc.linux `_ ansible collection. + +Add the following to your ansible ``requirements.yml``: + +.. 
code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/ansible/requirements.yml + + #FIXME: Update to known release When VGPU and IOMMU roles have landed + collections: + - name: stackhpc.linux + source: git+https://github.com/stackhpc/ansible-collection-linux.git,preemptive/vgpu-iommu + type: git + +Create a new playbook or update an existing on to apply the roles: + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/ansible/host-configure.yml + + --- + + - hosts: iommu + tags: + - iommu + tasks: + - import_role: + name: stackhpc.linux.iommu + handlers: + - name: reboot + set_fact: + kayobe_needs_reboot: true + + - hosts: vgpu + tags: + - vgpu + tasks: + - import_role: + name: stackhpc.linux.vgpu + handlers: + - name: reboot + set_fact: + kayobe_needs_reboot: true + + - name: Reboot when required + hosts: iommu:vgpu + tags: + - reboot + tasks: + - name: Reboot + reboot: + reboot_timeout: 3600 + become: true + when: kayobe_needs_reboot | default(false) | bool + +Ansible Inventory Configuration +------------------------------- + +Add some hosts into the ``vgpu`` group. The example below maps two custom +compute groups, ``compute_multi_instance_gpu`` and ``compute_vgpu``, +into the ``vgpu`` group: + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/inventory/custom + + [compute] + [compute_multi_instance_gpu] + [compute_vgpu] + + [vgpu:children] + compute_multi_instance_gpu + compute_vgpu + + [iommu:children] + vgpu + +Having multiple groups is useful if you want to be able to do conditional +templating in ``nova.conf`` (see :ref:`NVIDIA Kolla Ansible +Configuration`). Since the vgpu role requires iommu to be enabled, all of the +hosts in the ``vgpu`` group are also added to the ``iommu`` group. + +If using bifrost and the ``kayobe overcloud inventory discover`` mechanism, +hosts can automatically be mapped to these groups by configuring +``overcloud_group_hosts_map``: + +.. code-block:: yaml + :caption: ``$KAYOBE_CONFIG_PATH/overcloud.yml`` + + overcloud_group_hosts_map: + compute_vgpu: + - "computegpu000" + compute_mutli_instance_gpu: + - "computegpu001" + +.. _NVIDIA Role Configuration: + +Role Configuration +^^^^^^^^^^^^^^^^^^ + +Configure the location of the NVIDIA driver: + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/vgpu.yml + + --- + + vgpu_driver_url: "http://{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" + +Configure the VGPU devices: + +.. 
code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/inventory/group_vars/compute_vgpu/vgpu + + #nvidia-692 GRID A100D-4C + #nvidia-693 GRID A100D-8C + #nvidia-694 GRID A100D-10C + #nvidia-695 GRID A100D-16C + #nvidia-696 GRID A100D-20C + #nvidia-697 GRID A100D-40C + #nvidia-698 GRID A100D-80C + #nvidia-699 GRID A100D-1-10C + #nvidia-700 GRID A100D-2-20C + #nvidia-701 GRID A100D-3-40C + #nvidia-702 GRID A100D-4-40C + #nvidia-703 GRID A100D-7-80C + #nvidia-707 GRID A100D-1-10CME + vgpu_definitions: + # Configuring a MIG backed VGPU + - pci_address: "0000:17:00.0" + virtual_functions: + - mdev_type: nvidia-700 + index: 0 + - mdev_type: nvidia-700 + index: 1 + - mdev_type: nvidia-700 + index: 2 + - mdev_type: nvidia-699 + index: 3 + mig_devices: + "1g.10gb": 1 + "2g.20gb": 3 + # Configuring a card in a time-sliced configuration (non-MIG backed) + - pci_address: "0000:65:00.0" + virtual_functions: + - mdev_type: nvidia-697 + index: 0 + - mdev_type: nvidia-697 + index: 1 + +Running the playbook +^^^^^^^^^^^^^^^^^^^^ + +The playbook defined in the :ref:`previous step` +should be run after `kayobe overcloud host configure` has completed. This will +ensure the host has been fully bootstrapped. With default settings, internet +connectivity is required to download `MIG Partition Editor for NVIDIA GPUs`. If +this is not desirable, you can override the one of the following variables +(depending on host OS): + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/inventory/group_vars/compute_vgpu/vgpu + + vgpu_nvidia_mig_manager_rpm_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager-0.5.1-1.x86_64.rpm" + vgpu_nvidia_mig_manager_deb_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager_0.5.1-1_amd64.deb" + +For example, you may wish to upload these artifacts to the local pulp. + +Run the playbook that you defined earlier: + +.. code-block:: shell + + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml + +Note: This will reboot the hosts on first run. + +The playbook may be added as a hook in ``$KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d``; this will +ensure you do not forget to run it when hosts are enrolled in the future. + +.. _NVIDIA Kolla Ansible Configuration: + +Kolla-Ansible configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To use the mdev devices that were created, modify nova.conf to add a list of mdev devices that +can be passed through to guests: + +.. code-block:: + :caption: $KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf + + {% if inventory_hostname in groups['compute_multi_instance_gpu'] %} + [devices] + enabled_mdev_types = nvidia-700, nvidia-699 + + [mdev_nvidia-700] + device_addresses = 0000:21:00.4,0000:21:00.5,0000:21:00.6,0000:81:00.4,0000:81:00.5,0000:81:00.6 + mdev_class = CUSTOM_NVIDIA_700 + + [mdev_nvidia-699] + device_addresses = 0000:21:00.7,0000:81:00.7 + mdev_class = CUSTOM_NVIDIA_699 + + {% elif inventory_hostname in groups['compute_vgpu'] %} + [devices] + enabled_mdev_types = nvidia-697 + + [mdev_nvidia-697] + device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 + # Custom resource classes don't work when you only have single resource type. + mdev_class = VGPU + + {% endif %} + +You will need to adjust the PCI addresses to match the virtual function +addresses. These can be obtained by checking the mdevctl configuration after +running the role: + +.. 
code-block:: shell + + # mdevctl list + + 73269d0f-b2c9-438d-8f28-f9e4bc6c6995 0000:17:00.4 nvidia-700 manual (defined) + dc352ef3-efeb-4a5d-a48e-912eb230bc76 0000:17:00.5 nvidia-700 manual (defined) + a464fbae-1f89-419a-a7bd-3a79c7b2eef4 0000:17:00.6 nvidia-700 manual (defined) + f3b823d3-97c8-4e0a-ae1b-1f102dcb3bce 0000:17:00.7 nvidia-699 manual (defined) + 330be289-ba3f-4416-8c8a-b46ba7e51284 0000:65:00.4 nvidia-700 manual (defined) + 1ba5392c-c61f-4f48-8fb1-4c6b2bbb0673 0000:65:00.5 nvidia-700 manual (defined) + f6868020-eb3a-49c6-9701-6c93e4e3fa9c 0000:65:00.6 nvidia-700 manual (defined) + 00501f37-c468-5ba4-8be2-8d653c4604ed 0000:65:00.7 nvidia-699 manual (defined) + +The mdev_class maps to a resource class that you can set in your flavor definition. +Note that if you only define a single mdev type on a given hypervisor, then the +mdev_class configuration option is silently ignored and it will use the ``VGPU`` +resource class (bug?). + +Map through the kayobe inventory groups into kolla: + +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/kolla.yml + + kolla_overcloud_inventory_top_level_group_map: + control: + groups: + - controllers + network: + groups: + - network + compute_cpu: + groups: + - compute_cpu + compute_gpu: + groups: + - compute_gpu + compute_multi_instance_gpu: + groups: + - compute_multi_instance_gpu + compute_vgpu: + groups: + - compute_vgpu + compute: + groups: + - compute + monitoring: + groups: + - monitoring + storage: + groups: + "{{ kolla_overcloud_inventory_storage_groups }}" + +Where the ``compute_`` groups have been added to the kayobe defaults. + +You will need to reconfigure nova for this change to be applied: + +.. code-block:: shell + + kayobe overcloud service deploy -kt nova --kolla-limit compute_vgpu + +Openstack flavors +^^^^^^^^^^^^^^^^^ + +Define some flavors that request the resource class that was configured in nova.conf. +An example definition, that can be used with ``openstack.cloud.compute_flavor`` Ansible module, +is shown below: + +.. code-block:: yaml + + vgpu_a100_2g_20gb: + name: "vgpu.a100.2g.20gb" + ram: 65536 + disk: 30 + vcpus: 8 + is_public: false + extra_specs: + hw:cpu_policy: "dedicated" + hw:cpu_thread_policy: "prefer" + hw:mem_page_size: "1GB" + hw:cpu_sockets: 2 + hw:numa_nodes: 8 + hw_rng:allowed: "True" + resources:CUSTOM_NVIDIA_700: "1" + +You now should be able to launch a VM with this flavor. + +NVIDIA License Server +^^^^^^^^^^^^^^^^^^^^^ + +The Nvidia delegated license server is a virtual machine based appliance. You simply need to boot an instance +using the image supplied on the NVIDIA Licensing portal. This can be done on the OpenStack cloud itself. The +requirements are: + +* All tenants wishing to use GPU based instances must have network connectivity to this machine. (network licensing) + - It is possible to configure node locked licensing where tenants do not need access to the license server +* Satisfy minimum requirements detailed `here `__. + +The official documentation for configuring the instance +can be found `here `__. + +Below is a snippet of openstack-config for defining a project, and a security group that can be used for a non-HA deployment: + +.. code-block:: yaml + + secgroup_rules_nvidia_dls: + # Allow ICMP (for ping, etc.). + - ethertype: IPv4 + protocol: icmp + # Allow SSH. 
+ - ethertype: IPv4 + protocol: tcp + port_range_min: 22 + port_range_max: 22 + # https://docs.nvidia.com/license-system/latest/nvidia-license-system-user-guide/index.html + - ethertype: IPv4 + protocol: tcp + port_range_min: 443 + port_range_max: 443 + - ethertype: IPv4 + protocol: tcp + port_range_min: 80 + port_range_max: 80 + - ethertype: IPv4 + protocol: tcp + port_range_min: 7070 + port_range_max: 7070 + + secgroup_nvidia_dls: + name: nvidia-dls + project: "{{ project_cloud_services.name }}" + rules: "{{ secgroup_rules_nvidia_dls }}" + + openstack_security_groups: + - "{{ secgroup_nvidia_dls }}" + + project_cloud_services: + name: "cloud-services" + description: "Internal Cloud services" + project_domain: default + user_domain: default + users: [] + quotas: "{{ quotas_project }}" + +Booting the VM: + +.. code-block:: shell + + # Uploading the image and making it available in the cloud services project + $ openstack image create --file nls-3.0.0-bios.qcow2 nls-3.0.0-bios --disk-format qcow2 + $ openstack image add project nls-3.0.0-bios cloud-services + $ openstack image set --accept nls-3.0.0-bios --project cloud-services + $ openstack image member list nls-3.0.0-bios + + # Booting a server as the admin user in the cloud-services project. We pre-create the port so that + # we can recreate it without changing the MAC address. + $ openstack port create --mac-address fa:16:3e:a3:fd:19 --network external nvidia-dls-1 --project cloud-services + $ openstack role add member --project cloud-services --user admin + $ export OS_PROJECT_NAME=cloud-services + $ openstack server group create nvidia-dls --policy anti-affinity + $ openstack server create --flavor 8cpu-8gbmem-30gbdisk --image nls-3.0.0-bios --port nvidia-dls-1 --hint group=179dfa59-0947-4925-a0ff-b803bc0e58b2 nvidia-dls-cci1-1 --security-group nvidia-dls + $ openstack server add security group nvidia-dls-1 nvidia-dls + + +Manual VM driver and licence configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +vGPU client VMs need to be configured with Nvidia drivers to run GPU workloads. +The host drivers should already be applied to the hypervisor. + +GCP hosts compatible client drivers `here +`__. + +Find the correct version (when in doubt, use the same version as the host) and +download it to the VM. The exact dependencies will depend on the base image you +are using but at a minimum, you will need GCC installed. + +Ubuntu Jammy example: + +.. code-block:: bash + + sudo apt update + sudo apt install -y make gcc wget + wget https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU17.1/NVIDIA-Linux-x86_64-550.54.15-grid.run + sudo sh NVIDIA-Linux-x86_64-550.54.15-grid.run + +Check the ``nvidia-smi`` client is available: + +.. code-block:: bash + + nvidia-smi + +Generate a token from the licence server, and copy the token file to the client +VM. + +On the client, create an Nvidia grid config file from the template: + +.. code-block:: bash + + sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf + +Edit it to set ``FeatureType=1`` and leave the rest of the settings as default. + +Copy the client configuration token into the ``/etc/nvidia/ClientConfigToken`` +directory. + +Ensure the correct permissions are set: + +.. code-block:: bash + + sudo chmod 744 /etc/nvidia/ClientConfigToken/client_configuration_token_.tok + +Restart the ``nvidia-gridd`` service: + +.. code-block:: bash + + sudo systemctl restart nvidia-gridd + +Check that the token has been recognised: + +.. 
code-block:: bash + + nvidia-smi -q | grep 'License Status' + +If not, an error should appear in the journal: + +.. code-block:: bash + + sudo journalctl -xeu nvidia-gridd + +A successfully licenced VM can be snapshotted to create an image in Glance that +includes the drivers and licencing token. Alternatively, an image can be +created using Diskimage Builder. + +Disk image builder recipe to automatically license VGPU on boot +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +`stackhpc-image-elements `__ provides a ``nvidia-vgpu`` +element to configure the nvidia-gridd service in VGPU mode. This allows you to boot VMs that automatically license themselves. +Snippets of ``openstack-config`` that allow you to do this are shown below: + +.. code-block:: shell + + image_rocky9_nvidia: + name: "Rocky9-NVIDIA" + type: raw + elements: + - "rocky-container" + - "rpm" + - "nvidia-vgpu" + - "cloud-init" + - "epel" + - "cloud-init-growpart" + - "selinux-permissive" + - "dhcp-all-interfaces" + - "vm" + - "extra-repos" + - "grub2" + - "stable-interface-names" + - "openssh-server" + is_public: True + packages: + - "dkms" + - "git" + - "tmux" + - "cuda-minimal-build-12-1" + - "cuda-demo-suite-12-1" + - "cuda-libraries-12-1" + - "cuda-toolkit" + - "vim-enhanced" + env: + DIB_CONTAINERFILE_NETWORK_DRIVER: host + DIB_CONTAINERFILE_RUNTIME: docker + DIB_RPMS: "http://192.168.1.2:80/pulp/content/nvidia/nvidia-linux-grid-525-525.105.17-1.x86_64.rpm" + YUM: dnf + DIB_EXTRA_REPOS: "https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo" + DIB_NVIDIA_VGPU_CLIENT_TOKEN: "{{ lookup('file' , 'secrets/client_configuration_token_05-30-2023-12-41-40.tok') }}" + DIB_CLOUD_INIT_GROWPART_DEVICES: + - "/" + DIB_RELEASE: "9" + properties: + os_type: "linux" + os_distro: "rocky" + os_version: "9" + + openstack_images: + - "{{ image_rocky9_nvidia }}" + + openstack_image_git_elements: + - repo: "https://github.com/stackhpc/stackhpc-image-elements" + local: "{{ playbook_dir }}/stackhpc-image-elements" + version: master + elements_path: elements + +The gridd driver was uploaded pulp using the following procedure: + +.. code-block:: shell + + $ unzip NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip + $ pulp artifact upload --file ~/nvidia-linux-grid-525-525.105.17-1.x86_64.rpm + $ pulp file content create --relative-path "nvidia-linux-grid-525-525.105.17-1.x86_64.rpm" --sha256 58fda68d01f00ea76586c9fd5f161c9fbb907f627b7e4f4059a309d8112ec5f5 + $ pulp file repository add --name nvidia --sha256 58fda68d01f00ea76586c9fd5f161c9fbb907f627b7e4f4059a309d8112ec5f5 --relative-path "nvidia-linux-grid-525-525.105.17-1.x86_64.rpm" + $ pulp file publication create --repository nvidia + $ pulp file distribution update --name nvidia --base-path nvidia --repository nvidia + +This is the file we reference in ``DIB_RPMS``. It is important to keep the driver versions aligned between hypervisor and guest VM. + +The client token can be downloaded from the web interface of the licensing portal. Care should be taken +when copying the contents as it can contain invisible characters. It is best to copy the file directly +into your openstack-config repository and vault encrypt it. The ``file`` lookup plugin can be used to decrypt +the file (as shown in the example above). + +Testing vGPU VMs +^^^^^^^^^^^^^^^^ + +vGPU VMs can be validated using the following test workload. The test should +succeed if the VM is correctly licenced and drivers are correctly installed for +both the host and client VM. 
+ +Install ``cuda-toolkit`` using the instructions `here +`__. + +Ubuntu Jammy example: + +.. code-block:: bash + + wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb + sudo dpkg -i cuda-keyring_1.1-1_all.deb + sudo apt update -y + sudo apt install -y cuda-toolkit make + +The VM may require a reboot at this point. + +Clone the ``cuda-samples`` repo: + +.. code-block:: bash + + git clone https://github.com/NVIDIA/cuda-samples.git + +Build and run a test workload: + +.. code-block:: bash + + cd cuda-samples/Samples/6_Performance/transpose + make + ./transpose + +Example output: + +.. code-block:: + + Transpose Starting... + + GPU Device 0: "Ampere" with compute capability 8.0 + + > Device 0: "GRID A100D-1-10C MIG 1g.10gb" + > SM Capability 8.0 detected: + > [GRID A100D-1-10C MIG 1g.10gb] has 14 MP(s) x 64 (Cores/MP) = 896 (Cores) + > Compute performance scaling factor = 1.00 + + Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16 + + transpose simple copy , Throughput = 159.1779 GB/s, Time = 0.04908 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose shared memory copy, Throughput = 152.1922 GB/s, Time = 0.05133 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose naive , Throughput = 117.2670 GB/s, Time = 0.06662 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose coalesced , Throughput = 135.0813 GB/s, Time = 0.05784 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose optimized , Throughput = 145.4326 GB/s, Time = 0.05372 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose coarse-grained , Throughput = 145.2941 GB/s, Time = 0.05377 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose fine-grained , Throughput = 150.5703 GB/s, Time = 0.05189 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + transpose diagonal , Throughput = 117.6831 GB/s, Time = 0.06639 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256 + Test passed + +Changing VGPU device types +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Converting the second card to an NVIDIA-698 (whole card). The hypervisor +is empty so we can freely delete mdevs. First clean up the mdev +definition: + +.. code:: shell + + [stack@computegpu007 ~]$ sudo mdevctl list + 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (defined) + eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (defined) + 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual (defined) + 0a47ffd1-392e-5373-8428-707a4e0ce31a 0000:81:00.5 nvidia-697 manual (defined) + + [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 72291b01-689b-5b7a-9171-6b3480deabf4 + [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a + + [stack@computegpu007 ~]$ sudo mdevctl undefine --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a + + [stack@computegpu007 ~]$ sudo mdevctl list --defined + 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (active) + eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (active) + 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual + + # We can re-use the first virtual function + +Secondly remove the systemd unit that starts the mdev device: + +.. 
code:: shell + + [stack@computegpu007 ~]$ sudo rm /etc/systemd/system/multi-user.target.wants/nvidia-mdev@0a47ffd1-392e-5373-8428-707a4e0ce31a.service + +Example config change: + +.. code:: shell + + diff --git a/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu + new file mode 100644 + index 0000000..6cea9bf + --- /dev/null + +++ b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu + @@ -0,0 +1,12 @@ + +--- + +vgpu_definitions: + + - pci_address: "0000:21:00.0" + + virtual_functions: + + - mdev_type: nvidia-697 + + index: 0 + + - mdev_type: nvidia-697 + + index: 1 + + - pci_address: "0000:81:00.0" + + virtual_functions: + + - mdev_type: nvidia-698 + + index: 0 + diff --git a/etc/kayobe/kolla/config/nova/nova-compute.conf b/etc/kayobe/kolla/config/nova/nova-compute.conf + index 6f680cb..e663ec4 100644 + --- a/etc/kayobe/kolla/config/nova/nova-compute.conf + +++ b/etc/kayobe/kolla/config/nova/nova-compute.conf + @@ -39,7 +39,19 @@ cpu_mode = host-model + {% endraw %} + + {% raw %} + -{% if inventory_hostname in groups['compute_multi_instance_gpu'] %} + +{% if inventory_hostname == "computegpu007" %} + +[devices] + +enabled_mdev_types = nvidia-697, nvidia-698 + + + +[mdev_nvidia-697] + +device_addresses = 0000:21:00.4,0000:21:00.5 + +mdev_class = VGPU + + + +[mdev_nvidia-698] + +device_addresses = 0000:81:00.4 + +mdev_class = CUSTOM_NVIDIA_698 + + + +{% elif inventory_hostname in groups['compute_multi_instance_gpu'] %} + [devices] + enabled_mdev_types = nvidia-700, nvidia-699 + + @@ -50,15 +62,14 @@ mdev_class = CUSTOM_NVIDIA_700 + [mdev_nvidia-699] + device_addresses = 0000:21:00.7,0000:81:00.7 + mdev_class = CUSTOM_NVIDIA_699 + -{% endif %} + + -{% if inventory_hostname in groups['compute_vgpu'] %} + +{% elif inventory_hostname in groups['compute_vgpu'] %} + [devices] + enabled_mdev_types = nvidia-697 + + [mdev_nvidia-697] + device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 + -# Custom resource classes don't seem to work for this card. + +# Custom resource classes don't work when you only have single resource type. + mdev_class = VGPU + + {% endif %} + +Re-run the configure playbook: + +.. code:: shell + + (kayobe) [stack@ansiblenode1 kayobe]$ kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml --tags vgpu --limit computegpu007 + +Check the result: + +.. code:: shell + + [stack@computegpu007 ~]$ mdevctl list + 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual + eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual + 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-698 manual + +Reconfigure nova to match the change: + +.. code:: shell + + kayobe overcloud service reconfigure -kt nova --kolla-limit computegpu007 --skip-prechecks + + +PCI Passthrough +############### + +This guide has been developed for Nvidia GPUs and CentOS 8. + +See `Kayobe Ops `_ for +a playbook implementation of host setup for GPU. + +BIOS Configuration Requirements +------------------------------- + +On an Intel system: + +* Enable `VT-x` in the BIOS for virtualisation support. +* Enable `VT-d` in the BIOS for IOMMU support. + +Hypervisor Configuration Requirements +------------------------------------- + +Find the GPU device IDs +^^^^^^^^^^^^^^^^^^^^^^^ + +From the host OS, use ``lspci -nn`` to find the PCI vendor ID and +device ID for the GPU device and supporting components. These are +4-digit hex numbers. + +For example: + +.. 
code-block:: text + + 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204M [GeForce GTX 980M] [10de:13d7] (rev a1) (prog-if 00 [VGA controller]) + 01:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1) + +In this case the vendor ID is ``10de``, display ID is ``13d7`` and audio ID is ``0fbb``. + +Alternatively, for an Nvidia Quadro RTX 6000: + +.. code-block:: yaml + + # NVIDIA Quadro RTX 6000/8000 PCI device IDs + vendor_id: "10de" + display_id: "1e30" + audio_id: "10f7" + usba_id: "1ad6" + usba_class: "0c0330" + usbc_id: "1ad7" + usbc_class: "0c8000" + +These parameters will be used for device-specific configuration. + +Kernel Ramdisk Reconfiguration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ramdisk loaded during kernel boot can be extended to include the +vfio PCI drivers and ensure they are loaded early in system boot. + +.. code-block:: yaml + + - name: Template dracut config + blockinfile: + path: /etc/dracut.conf.d/gpu-vfio.conf + block: | + add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd" + owner: root + group: root + mode: 0660 + create: true + become: true + notify: + - Regenerate initramfs + - reboot + +The handler for regenerating the Dracut initramfs is: + +.. code-block:: yaml + + - name: Regenerate initramfs + shell: |- + #!/bin/bash + set -eux + dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r) + become: true + +Kernel Boot Parameters +^^^^^^^^^^^^^^^^^^^^^^ + +Set the following kernel parameters by adding to +``GRUB_CMDLINE_LINUX_DEFAULT`` or ``GRUB_CMDLINE_LINUX`` in +``/etc/default/grub.conf``. We can use the +`stackhpc.grubcmdline `_ +role from Ansible Galaxy: + +.. code-block:: yaml + + - name: Add vfio-pci.ids kernel args + include_role: + name: stackhpc.grubcmdline + vars: + kernel_cmdline: + - intel_iommu=on + - iommu=pt + - "vfio-pci.ids={{ vendor_id }}:{{ display_id }},{{ vendor_id }}:{{ audio_id }}" + kernel_cmdline_remove: + - iommu + - intel_iommu + - vfio-pci.ids + +Kernel Device Management +^^^^^^^^^^^^^^^^^^^^^^^^ + +In the hypervisor, we must prevent kernel device initialisation of +the GPU and prevent drivers from loading for binding the GPU in the +host OS. We do this using ``udev`` rules: + +.. code-block:: yaml + + - name: Template udev rules to blacklist GPU usb controllers + blockinfile: + # We want this to execute as soon as possible + path: /etc/udev/rules.d/99-gpu.rules + block: | + #Remove NVIDIA USB xHCI Host Controller Devices, if present + ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usba_class }}", ATTR{remove}="1" + #Remove NVIDIA USB Type-C UCSI devices, if present + ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x{{ vendor_id }}", ATTR{class}=="0x{{ usbc_class }}", ATTR{remove}="1" + owner: root + group: root + mode: 0644 + create: true + become: true + +Kernel Drivers +^^^^^^^^^^^^^^ + +Prevent the ``nouveau`` kernel driver from loading by +blacklisting the module: + +.. code-block:: yaml + + - name: Blacklist nouveau + blockinfile: + path: /etc/modprobe.d/blacklist-nouveau.conf + block: | + blacklist nouveau + options nouveau modeset=0 + mode: 0664 + owner: root + group: root + create: true + become: true + notify: + - reboot + - Regenerate initramfs + +Ensure that the ``vfio`` drivers are loaded into the kernel on boot: + +.. 
code-block:: yaml + + - name: Add vfio to modules-load.d + blockinfile: + path: /etc/modules-load.d/vfio.conf + block: | + vfio + vfio_iommu_type1 + vfio_pci + vfio_virqfd + owner: root + group: root + mode: 0664 + create: true + become: true + notify: reboot + +Once this code has taken effect (after a reboot), the VFIO kernel drivers should be loaded on boot: + +.. code-block:: text + + # lsmod | grep vfio + vfio_pci 49152 0 + vfio_virqfd 16384 1 vfio_pci + vfio_iommu_type1 28672 0 + vfio 32768 2 vfio_iommu_type1,vfio_pci + irqbypass 16384 5 vfio_pci,kvm + + # lspci -nnk -s 3d:00.0 + 3d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Tesla M10] [10de:13bd] (rev a2) + Subsystem: NVIDIA Corporation Tesla M10 [10de:1160] + Kernel driver in use: vfio-pci + Kernel modules: nouveau + +IOMMU should be enabled at kernel level as well - we can verify that on the compute host: + +.. code-block:: text + + # docker exec -it nova_libvirt virt-host-validate | grep IOMMU + QEMU: Checking for device assignment IOMMU support : PASS + QEMU: Checking if IOMMU is enabled by kernel : PASS + +OpenStack Nova configuration +---------------------------- + +Configure nova-scheduler +^^^^^^^^^^^^^^^^^^^^^^^^ + +The nova-scheduler service must be configured to enable the ``PciPassthroughFilter`` +To enable it add it to the list of filters to Kolla-Ansible configuration file: +``etc/kayobe/kolla/config/nova.conf``, for instance: + +.. code-block:: yaml + + [filter_scheduler] + available_filters = nova.scheduler.filters.all_filters + enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, PciPassthroughFilter + +Configure nova-compute +^^^^^^^^^^^^^^^^^^^^^^ + +Configuration can be applied in flexible ways using Kolla-Ansible's +methods for `inventory-driven customisation of configuration +`_. +The following configuration could be added to +``etc/kayobe/kolla/config/nova/nova-compute.conf`` to enable PCI +passthrough of GPU devices for hosts in a group named ``compute_gpu``. +Again, the 4-digit PCI Vendor ID and Device ID extracted from ``lspci +-nn`` can be used here to specify the GPU device(s). + +.. code-block:: jinja + + [pci] + {% raw %} + {% if inventory_hostname in groups['compute_gpu'] %} + # We could support multiple models of GPU. + # This can be done more selectively using different inventory groups. + # GPU models defined here: + # NVidia Tesla V100 16GB + # NVidia Tesla V100 32GB + # NVidia Tesla P100 16GB + passthrough_whitelist = [{ "vendor_id":"10de", "product_id":"1db4" }, + { "vendor_id":"10de", "product_id":"1db5" }, + { "vendor_id":"10de", "product_id":"15f8" }] + alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } + alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } + alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } + {% endif %} + {% endraw %} + +Configure nova-api +^^^^^^^^^^^^^^^^^^ + +pci.alias also needs to be configured on the controller. +This configuration should match the configuration found on the compute nodes. +Add it to Kolla-Ansible configuration file: +``etc/kayobe/kolla/config/nova/nova-api.conf``, for instance: + +.. 
code-block:: yaml + + [pci] + alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } + alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } + alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } + +Reconfigure nova service +^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: text + + kayobe overcloud service reconfigure --kolla-tags nova --kolla-skip-tags common --skip-prechecks + +Configure a flavor +^^^^^^^^^^^^^^^^^^ + +For example, to request two of the GPUs with alias gpu-p100 + +.. code-block:: text + + openstack flavor set m1.medium --property "pci_passthrough:alias"="gpu-p100:2" + + +This can be also defined in the openstack-config repository + +add extra_specs to flavor in etc/openstack-config/openstack-config.yml: + +.. code-block:: console + + admin# cd src/openstack-config + admin# vim etc/openstack-config/openstack-config.yml + + name: "m1.medium" + ram: 4096 + disk: 40 + vcpus: 2 + extra_specs: + "pci_passthrough:alias": "gpu-p100:2" + +Invoke configuration playbooks afterwards: + +.. code-block:: console + + admin# source src/kayobe-config/etc/kolla/public-openrc.sh + admin# source venvs/openstack/bin/activate + admin# tools/openstack-config --vault-password-file + +Create instance with GPU passthrough +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. code-block:: text + + openstack server create --flavor m1.medium --image ubuntu2004 --wait test-pci + +Testing GPU in a Guest VM +------------------------- + +The Nvidia drivers must be installed first. For example, on an Ubuntu guest: + +.. code-block:: text + + sudo apt install nvidia-headless-440 nvidia-utils-440 nvidia-compute-utils-440 + +The ``nvidia-smi`` command will generate detailed output if the driver has loaded +successfully. 
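+
+As a quick check that the passed-through GPU is visible to the driver, list the detected devices (illustrative
+output; the exact model and UUID will differ):
+
+.. code-block:: text
+
+   $ nvidia-smi -L
+   GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)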
+ +Further Reference +----------------- + +For PCI Passthrough and GPUs in OpenStack: + +* Consumer-grade GPUs: https://gist.github.com/claudiok/890ab6dfe76fa45b30081e58038a9215 +* https://www.jimmdenton.com/gpu-offloading-openstack/ +* https://docs.openstack.org/nova/latest/admin/pci-passthrough.html +* https://docs.openstack.org/nova/latest/admin/virtual-gpu.html (vGPU only) +* Tesla models in OpenStack: https://egallen.com/openstack-nvidia-tesla-gpu-passthrough/ +* https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF +* https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt +* https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough +* https://www.gresearch.co.uk/article/utilising-the-openstack-placement-service-to-schedule-gpu-and-nvme-workloads-alongside-general-purpose-instances/ From 05cff81681bd1f1b0b72908aafc13589ad490a49 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 29 May 2024 16:32:20 +0100 Subject: [PATCH 11/42] Fix errors and add to index --- .../operations/baremetal-node-management.rst | 2 +- doc/source/operations/ceph-management.rst | 62 ++++++++++--------- .../operations/control-plane-operation.rst | 47 +++++++------- ...ng_horizon.rst => customising-horizon.rst} | 7 +-- .../hardware-inventory-management.rst | 12 ++-- doc/source/operations/index.rst | 10 +++ .../operations/openstack-reconfiguration.rst | 11 ++-- doc/source/operations/wazuh-operation.rst | 2 +- 8 files changed, 83 insertions(+), 70 deletions(-) rename doc/source/operations/{customising_horizon.rst => customising-horizon.rst} (97%) diff --git a/doc/source/operations/baremetal-node-management.rst b/doc/source/operations/baremetal-node-management.rst index f45903dad9..f5c82bd5ce 100644 --- a/doc/source/operations/baremetal-node-management.rst +++ b/doc/source/operations/baremetal-node-management.rst @@ -181,7 +181,7 @@ The command below extracts a list of port UUID, node UUID and switch port inform .. code-block:: bash - admin# openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value + openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value NGS will manage VLAN membership for ports when the ``local_link_connection`` fields match one of the switches in ``ml2_conf.ini``. The rest of the switch configuration is static. diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 8e3d1f4e94..3c82a3ffe0 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -45,8 +45,8 @@ Ceph commands are usually run inside a ``cephadm shell`` utility container: .. code-block:: console - # From the node that runs Ceph - ceph# sudo cephadm shell + # From storage host + sudo cephadm shell Operating a cluster requires a keyring with an admin access to be available for Ceph commands. Cephadm will copy such keyring to the nodes carrying @@ -71,15 +71,17 @@ First drain the node .. code-block:: console - ceph# cephadm shell - ceph# ceph orch host drain + # From storage host + sudo cephadm shell + ceph orch host drain Once all daemons are removed - you can remove the host: .. code-block:: console - ceph# cephadm shell - ceph# ceph orch host rm + # From storage host + sudo cephadm shell + ceph orch host rm And then remove the host from inventory (usually in ``etc/kayobe/inventory/overcloud``) @@ -98,8 +100,9 @@ movement: .. 
code-block:: console - ceph# cephadm shell - ceph# ceph osd set noout + # From storage host + sudo cephadm shell + ceph osd set noout Reboot the node and replace the drive @@ -107,15 +110,17 @@ Unset noout after the node is back online .. code-block:: console - ceph# cephadm shell - ceph# ceph osd unset noout + # From storage host + sudo cephadm shell + ceph osd unset noout Remove the OSD using Ceph orchestrator command: .. code-block:: console - ceph# cephadm shell - ceph# ceph orch osd rm --replace + # From storage host + sudo cephadm shell + ceph orch osd rm --replace After removing OSDs, if the drives the OSDs were deployed on once again become available, cephadm may automatically try to deploy more OSDs on these drives if @@ -142,7 +147,7 @@ identify which OSDs are tied to which physical disks: .. code-block:: console - ceph# ceph device ls + ceph device ls Host maintenance ---------------- @@ -167,7 +172,7 @@ Ceph can report details about failed OSDs by running: .. code-block:: console - ceph# ceph health detail + ceph health detail .. note :: @@ -184,7 +189,7 @@ A failed OSD will also be reported as down by running: .. code-block:: console - ceph# ceph osd tree + ceph osd tree Note the ID of the failed OSD. @@ -192,7 +197,8 @@ The failed disk is usually logged by the Linux kernel too: .. code-block:: console - storage-0# dmesg -T + # From storage host + dmesg -T Cross-reference the hardware device and OSD ID to ensure they match. (Using `pvs` and `lvs` may help make this connection). @@ -207,16 +213,15 @@ show``). On this hypervisor, enter the libvirt container: .. code-block:: console - :substitutions: - |hypervisor_hostname|# docker exec -it nova_libvirt /bin/bash + # From hypervisor host + docker exec -it nova_libvirt /bin/bash Find the VM name using libvirt: .. code-block:: console - :substitutions: - (nova-libvirt)[root@|hypervisor_hostname| /]# virsh list + (nova-libvirt)[root@compute-01 /]# virsh list Id Name State ------------------------------------ 1 instance-00000001 running @@ -224,19 +229,19 @@ Find the VM name using libvirt: Now inspect the properties of the VM using ``virsh dumpxml``: .. code-block:: console - :substitutions: - (nova-libvirt)[root@|hypervisor_hostname| /]# virsh dumpxml instance-00000001 | grep rbd - + (nova-libvirt)[root@compute-01 /]# virsh dumpxml instance-00000001 | grep rbd + On a Ceph node, the RBD pool can be inspected and the volume extracted as a RAW block image: .. code-block:: console - :substitutions: - ceph# rbd ls |nova_rbd_pool| - ceph# rbd export |nova_rbd_pool|/51206278-e797-4153-b720-8255381228da_disk blob.raw + # From storage host + sudo cephadm shell + rbd ls + rbd export /51206278-e797-4153-b720-8255381228da_disk blob.raw The raw block device (blob.raw above) can be mounted using the loopback device. @@ -248,8 +253,9 @@ libguestfs-tools and using the guestfish command: .. 
code-block:: console - ceph# export LIBGUESTFS_BACKEND=direct - ceph# guestfish -a blob.qcow + # From storage host + export LIBGUESTFS_BACKEND=direct + guestfish -a blob.qcow > run 100% [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 00:00 > list-filesystems diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index c5c629d52f..ffd5299ce3 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -55,7 +55,7 @@ Configuring Prometheus Alerts ----------------------------- Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` -files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add +files in ``$KAYOBE_CONFIG_PATH/kolla/config/prometheus`` as a model to add custom rules. Silencing Prometheus Alerts @@ -88,7 +88,7 @@ Generating Alerts from Metrics ++++++++++++++++++++++++++++++ Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` -files in ``${KAYOBE_CONFIG_PATH}/kolla/config/prometheus`` as a model to add +files in ``$KAYOBE_CONFIG_PATH/kolla/config/prometheus`` as a model to add custom rules. Control Plane Shutdown Procedure @@ -124,7 +124,7 @@ The password can be found using: .. code-block:: console - kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml \ + kayobe# ansible-vault view $KAYOBE_CONFIG_PATH/kolla/passwords.yml \ --vault-password-file | grep ^database Checking RabbitMQ @@ -135,6 +135,7 @@ RabbitMQ health is determined using the command ``rabbitmqctl cluster_status``: .. code-block:: console [stack@controller0 ~]$ docker exec rabbitmq rabbitmqctl cluster_status + Cluster status of node rabbit@controller0 ... [{nodes,[{disc,['rabbit@controller0','rabbit@controller1', 'rabbit@controller2']}]}, @@ -180,20 +181,18 @@ If you are shutting down a single hypervisor, to avoid down time to tenants it is advisable to migrate all of the instances to another machine. See :ref:`evacuating-all-instances`. -.. ifconfig:: deployment['ceph_managed'] - - Ceph - ---- +Ceph +---- - The following guide provides a good overview: - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph +The following guide provides a good overview: +https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph Shutting down the seed VM ------------------------- .. code-block:: console - kayobe# virsh shutdown + kayobe# virsh shutdown .. _full-shutdown: @@ -262,7 +261,7 @@ hypervisor is powered on. If it does not, it can be started with: .. code-block:: console - kayobe# virsh start seed-0 + kayobe# virsh start Full power on ------------- @@ -340,13 +339,14 @@ To see the list of hypervisor names: .. code-block:: console - admin# openstack hypervisor list + # From host that can reach Openstack + openstack hypervisor list To boot an instance on a specific hypervisor .. code-block:: console - admin# openstack server create --flavor --network --key-name --image --availability-zone nova:: + openstack server create --flavor --network --key-name --image --availability-zone nova:: Cleanup Procedures ================== @@ -360,22 +360,23 @@ perform the following cleanup procedure regularly: .. 
code-block:: console - admin# for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do - if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then - echo "$user still in use, not deleting" - else - openstack user delete --domain magnum $user - fi - done + for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do + if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then + echo "$user still in use, not deleting" + else + openstack user delete --domain magnum $user + fi + done OpenSearch indexes retention ============================= To alter default rotation values for OpenSearch, edit -``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: +``$KAYOBE_CONFIG_PATH/kolla/globals.yml``: .. code-block:: console + # Duration after which index is closed (default 30) opensearch_soft_retention_period_days: 90 # Duration after which index is deleted (default 60) @@ -384,8 +385,8 @@ To alter default rotation values for OpenSearch, edit Reconfigure Opensearch with new values: .. code-block:: console - kayobe overcloud service reconfigure --kolla-tags opensearch -For more information see the `upstream documentation + kayobe# kayobe overcloud service reconfigure --kolla-tags opensearch +For more information see the `upstream documentation `__. diff --git a/doc/source/operations/customising_horizon.rst b/doc/source/operations/customising-horizon.rst similarity index 97% rename from doc/source/operations/customising_horizon.rst rename to doc/source/operations/customising-horizon.rst index 1f8977a31e..4b2e157d86 100644 --- a/doc/source/operations/customising_horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -1,8 +1,6 @@ -.. include:: vars.rst - -==================================== +=================== Customising Horizon -==================================== +=================== Horizon is the most frequent site-specific container customisation required: other customisations tend to be common across deployments, but personalisation @@ -55,7 +53,6 @@ Building a custom container image for Horizon can be done by modifying ``kolla.yml`` to fetch the custom theme and include it in the image: .. code-block:: yaml - :substitutions: kolla_sources: horizon-additions-theme-: diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/hardware-inventory-management.rst index 0d6fd8adf1..8636d5b562 100644 --- a/doc/source/operations/hardware-inventory-management.rst +++ b/doc/source/operations/hardware-inventory-management.rst @@ -5,7 +5,7 @@ Hardware Inventory Management At its lowest level, hardware inventory is managed in the Bifrost service. Reconfiguring Control Plane Hardware ------------------------------------- +==================================== If a server's hardware or firmware configuration is changed, it should be re-inspected in Bifrost before it is redeployed into service. A single server @@ -112,10 +112,10 @@ hypervisor. They should all show the status ACTIVE. This can be verified with: admin# openstack server show Troubleshooting -+++++++++++++++ +=============== Servers that have been shut down -******************************** +-------------------------------- If there are any instances that are SHUTOFF they won’t be migrated, but you can use ``openstack server migrate`` for them once the live migration is finished. 
@@ -131,7 +131,7 @@ For more details see: http://www.danplanet.com/blog/2016/03/03/evacuate-in-nova-one-command-to-confuse-us-all/ Flavors have changed -******************** +-------------------- If the size of the flavors has changed, some instances will also fail to migrate as the process needs manual confirmation. You can do this with: @@ -150,7 +150,7 @@ RESIZE`` as shown in this snippet of ``openstack server show ``: .. _set-bifrost-maintenance-mode: Set maintenance mode on a node in Bifrost -+++++++++++++++++++++++++++++++++++++++++ +----------------------------------------- .. code-block:: console @@ -161,7 +161,7 @@ Set maintenance mode on a node in Bifrost .. _unset-bifrost-maintenance-mode: Unset maintenance mode on a node in Bifrost -+++++++++++++++++++++++++++++++++++++++++++ +------------------------------------------- .. code-block:: console diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index 825384c4bf..aea4139980 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -7,11 +7,21 @@ This guide is for operators of the StackHPC Kayobe configuration project. .. toctree:: :maxdepth: 1 + baremetal-node-management + ceph-management + control-plane-operation + customsing-horizon + gpu-in-openstack + hardware-inventory-management hotfix-playbook + migrating-vm nova-compute-ironic octavia + openstack-projects-and-users-management + openstack-reconfiguration rabbitmq secret-rotation tempest upgrading-openstack upgrading-ceph + wazuh-operation diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index dfba372f26..db157f6a91 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -3,7 +3,7 @@ OpenStack Reconfiguration ========================= Disabling a Service -------------------- +=================== Ansible is oriented towards adding or reconfiguring services, but removing a service is handled less well, because of Ansible's imperative style. @@ -36,7 +36,7 @@ Some services may store data in a dedicated Docker volume, which can be removed with ``docker volume rm``. Installing TLS Certificates ---------------------------- +=========================== To configure TLS for the first time, we write the contents of a PEM file to the ``secrets.yml`` file as ``secrets_kolla_external_tls_cert``. @@ -69,7 +69,7 @@ be updated in Keystone: kayobe# kayobe overcloud service reconfigure Alternative Configuration -+++++++++++++++++++++++++ +------------------------- As an alternative to writing the certificates as a variable to ``secrets.yml``, it is also possible to write the same data to a file, @@ -88,7 +88,6 @@ Check the expiry date on an installed TLS certificate from a host that can reach the OpenStack APIs: .. code-block:: console - :substitutions: openstack# openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates @@ -106,7 +105,7 @@ above. Run the following command: .. 
_taking-a-hypervisor-out-of-service: Taking a Hypervisor out of Service ----------------------------------- +================================== To take a hypervisor out of Nova scheduling: @@ -141,7 +140,7 @@ And then to enable a hypervisor again: nova-compute Managing Space in the Docker Registry -------------------------------------- +===================================== If the Docker registry becomes full, this can prevent container updates and (depending on the storage configuration of the seed host) could lead to other diff --git a/doc/source/operations/wazuh-operation.rst b/doc/source/operations/wazuh-operation.rst index 23800ff849..3f56c24603 100644 --- a/doc/source/operations/wazuh-operation.rst +++ b/doc/source/operations/wazuh-operation.rst @@ -14,7 +14,7 @@ Ansible playbooks `_. These can be integrated into ``kayobe-config`` as a custom playbook. Configuring Wazuh Manager -------------------------- +========================= Wazuh Manager is configured by editing the ``wazuh-manager.yml`` groups vars file found at From 4fced04b38dc8c6e0109bc2e205e1151fa392b6d Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 30 May 2024 09:22:10 +0100 Subject: [PATCH 12/42] Remove repeating section --- doc/source/operations/control-plane-operation.rst | 7 ------- 1 file changed, 7 deletions(-) diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index ffd5299ce3..8212b95fba 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -84,13 +84,6 @@ the monitoring service rather than the host being monitored). `known issue `__ when running several Alertmanager instances behind HAProxy. -Generating Alerts from Metrics -++++++++++++++++++++++++++++++ - -Alerts are defined in code and stored in Kayobe configuration. See ``*.rules`` -files in ``$KAYOBE_CONFIG_PATH/kolla/config/prometheus`` as a model to add -custom rules. - Control Plane Shutdown Procedure ================================ From 879f8dc57a66a70a806a13a096e70614368c9534 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 30 May 2024 09:37:43 +0100 Subject: [PATCH 13/42] Add more instruction for ADVise tool --- .../hardware-inventory-management.rst | 30 ++++++++++++++++--- 1 file changed, 26 insertions(+), 4 deletions(-) diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/hardware-inventory-management.rst index 8636d5b562..f88626a827 100644 --- a/doc/source/operations/hardware-inventory-management.rst +++ b/doc/source/operations/hardware-inventory-management.rst @@ -228,6 +228,8 @@ The playbook has the following optional parameters: - output_dir: path to where results should be saved. Default: ``"{{ lookup('env', 'PWD') }}/review"`` - advise-pattern: regular expression to specify what introspection data should be analysed. Default: ``".*.eval"`` +You can override them by provide new values with ``-e =`` + Example command to run the tool on data about the compute nodes in a system, where compute nodes are named cpt01, cpt02, cpt03…: .. code-block:: console @@ -244,10 +246,30 @@ Using the results The ADVise tool will output a selection of results found under output_dir/results these include: - ``.html`` files to display network visualisations of any hardware differences. -- The folder ``Paired_Comparisons`` which contains information on the shared and differing fields found between the systems. 
This is a reflection of the network visualisation webpage, with more detail as to what the differences are.
+- The folder ``Paired_Comparisons`` which contains information on the shared and differing fields found between the systems.
+  This is a reflection of the network visualisation webpage, with more detail as to what the differences are.
 - ``_summary``, a listing of how the systems can be grouped into sets of identical hardware.
 - ``_performance``, the results of analysing the benchmarking data gathered.
-- ``_perf_summary``, a subset of the performance metrics, just showing any potentially anomalous data such as where variance is too high, or individual nodes have been found to over/underperform.
+- ``_perf_summary``, a subset of the performance metrics, just showing any potentially anomalous data such as where variance
+  is too high, or individual nodes have been found to over/underperform.
+
+The ADVise tool will also launch an interactive Dash webpage, which displays the network visualisations,
+tables with information on the differing hardware attributes, the performance metrics as a range of box-plots,
+and specifies which individual nodes may be anomalous via box-plot outliers. This can be accessed at ``localhost:8050``.
+To close this service, simply press ``Ctrl+C`` in the terminal where you ran the playbook.
+
+To get the visualised results, it is recommended to copy the introspection data to your local machine and then run the ADVise playbook locally.
+
+Recommended Workflow
+--------------------
 
-To get visuallised result, It is recommanded to copy instrospection data and review directories to your
-local machine then run ADVise playbook locally with the data.
+1. Run the playbook as outlined above.
+2. Open the Dash webpage at ``localhost:8050``.
+3. Review the hardware differences. Note that hovering over a group will display the nodes it contains.
+4. Identify any unexpected differences in the systems. If multiple differing fields exist they will be graphed separately.
+   For example, with the compute nodes targeted in the command above, all nodes would be expected to be identical.
+5. Use the dropdown menu beneath each graph to show a table of the differences found between two sets of groups.
+   If required, information on shared fields can be found under ``output_dir/results/Paired_Comparisons``.
+6. Scroll down the webpage to the performance review. Identify if any of the discovered performance results could be
+   indicative of a larger issue.
+7. Examine the ``_performance`` and ``_perf_summary`` files if you require any more information.

From 2ddc2a2e5a9ab8c77b5ad22ae200c27a31f52ec2 Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Thu, 30 May 2024 10:02:22 +0100
Subject: [PATCH 14/42] Fix formatting

---
 doc/source/operations/baremetal-node-management.rst | 6 ++----
 doc/source/operations/gpu-in-openstack.rst | 2 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/doc/source/operations/baremetal-node-management.rst b/doc/source/operations/baremetal-node-management.rst
index f5c82bd5ce..25a12760d6 100644
--- a/doc/source/operations/baremetal-node-management.rst
+++ b/doc/source/operations/baremetal-node-management.rst
@@ -99,13 +99,11 @@ Static physical network configuration is managed via Kayobe.
 
       .. code-block:: shell
 
-      The interface is then partially configured:
+   The interface is then partially configured:
 
       .. code-block:: shell
 
-      For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network:
-
-      .. 
code-block:: shell + For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network: **NOTE**: You only need to do this if Ironic isn't aware of the node. diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 66170c6800..259e39e8c1 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -1,5 +1,3 @@ -.. include:: vars.rst - ============================= Support for GPUs in OpenStack ============================= From 828f42c5b223f482adddad91073b5608a8430038 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Fri, 14 Jun 2024 16:24:55 +0200 Subject: [PATCH 15/42] Merge drive replacement related sections into one --- doc/source/operations/ceph-management.rst | 116 ++++++++++------------ 1 file changed, 53 insertions(+), 63 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 3c82a3ffe0..46de62fc83 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -89,11 +89,60 @@ And then remove the host from inventory (usually in Additional options/commands may be found in `Host management `_ -Replacing a Failed Ceph Drive ------------------------------ +Replacing failing drive +----------------------- -Once an OSD has been identified as having a hardware failure, -the affected drive will need to be replaced. +A failing drive in a Ceph cluster will cause OSD daemon to crash. +In this case Ceph will go into `HEALTH_WARN` state. +Ceph can report details about failed OSDs by running: + +.. code-block:: console + # From storage host + sudo cephadm shell + ceph health detail + +.. note :: + + Remember to run ceph/rbd commands from within ``cephadm shell`` + (preferred method) or after installing Ceph client. Details in the + official `documentation `__. + It is also required that the host where commands are executed has admin + Ceph keyring present - easiest to achieve by applying + `_admin `__ + label (Ceph MON servers have it by default when using + `StackHPC Cephadm collection `__). + +A failed OSD will also be reported as down by running: + +.. code-block:: console + + ceph osd tree + +Note the ID of the failed OSD. + +The failed disk is usually logged by the Linux kernel too: + +.. code-block:: console + + # From storage host + dmesg -T + +Cross-reference the hardware device and OSD ID to ensure they match. +(Using `pvs` and `lvs` may help make this connection). + +See upstream documentation: +https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd + +In case where disk holding DB and/or WAL fails, it is necessary to recreate +all OSDs that are associated with this disk - usually NVMe drive. The +following single command is sufficient to identify which OSDs are tied to +which physical disks: + +.. code-block:: console + + ceph device ls + +Once OSDs on failed disks are identified, follow procedure below. If rebooting a Ceph node, first set ``noout`` to prevent excess data movement: @@ -130,25 +179,6 @@ spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). Either set ``unmanaged: true`` to stop cephadm from picking up new disks or modify it in some way that it no longer matches the drives you want to remove. 
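For illustration only, a minimal sketch of what such a change might look like in
``etc/kayobe/cephadm.yml`` — the service ID, placement and device filter below are
placeholders rather than values taken from this configuration:

.. code-block:: yaml

   cephadm_osd_spec:
     service_type: osd
     service_id: osd_spec_default
     placement:
       host_pattern: "*"
     data_devices:
       all: true
     # Stop cephadm from automatically creating OSDs on newly visible disks.
     unmanaged: true
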
- -Operations -========== - -Replacing drive ---------------- - -See upstream documentation: -https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd - -In case where disk holding DB and/or WAL fails, it is necessary to recreate -(using replacement procedure above) all OSDs that are associated with this -disk - usually NVMe drive. The following single command is sufficient to -identify which OSDs are tied to which physical disks: - -.. code-block:: console - - ceph device ls - Host maintenance ---------------- @@ -163,46 +193,6 @@ https://docs.ceph.com/en/latest/cephadm/upgrade/ Troubleshooting =============== -Investigating a Failed Ceph Drive ---------------------------------- - -A failing drive in a Ceph cluster will cause OSD daemon to crash. -In this case Ceph will go into `HEALTH_WARN` state. -Ceph can report details about failed OSDs by running: - -.. code-block:: console - - ceph health detail - -.. note :: - - Remember to run ceph/rbd commands from within ``cephadm shell`` - (preferred method) or after installing Ceph client. Details in the - official `documentation `__. - It is also required that the host where commands are executed has admin - Ceph keyring present - easiest to achieve by applying - `_admin `__ - label (Ceph MON servers have it by default when using - `StackHPC Cephadm collection `__). - -A failed OSD will also be reported as down by running: - -.. code-block:: console - - ceph osd tree - -Note the ID of the failed OSD. - -The failed disk is usually logged by the Linux kernel too: - -.. code-block:: console - - # From storage host - dmesg -T - -Cross-reference the hardware device and OSD ID to ensure they match. -(Using `pvs` and `lvs` may help make this connection). - Inspecting a Ceph Block Device for a VM --------------------------------------- From a092cba3296c5c4ee25ddcc11a33b5b7774f37cc Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Fri, 14 Jun 2024 16:26:30 +0200 Subject: [PATCH 16/42] Reference Cephadm & Kayobe doc as deployment guide --- doc/source/configuration/cephadm.rst | 8 +++++--- doc/source/operations/ceph-management.rst | 3 +++ 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/doc/source/configuration/cephadm.rst b/doc/source/configuration/cephadm.rst index bcb13cd6ce..19112a0839 100644 --- a/doc/source/configuration/cephadm.rst +++ b/doc/source/configuration/cephadm.rst @@ -1,6 +1,8 @@ -==== -Ceph -==== +.. _cephadm-kayobe: + +================ +Cephadm & Kayobe +================ This section describes how to use the Cephadm integration included in StackHPC Kayobe configuration to deploy Ceph. diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 46de62fc83..fc17571278 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -5,6 +5,9 @@ Managing and Operating Ceph Working with Cephadm ==================== +This documentation provides guide for Ceph operations. For deploying Ceph, +please refer to :ref:`cephadm-kayobe` documentation. 
+ cephadm configuration location ------------------------------ From 5cf8890a604af87fc75f776ca0126f91e968e150 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 12:18:59 +0100 Subject: [PATCH 17/42] Merge Wazuh documents --- doc/source/configuration/wazuh.rst | 35 ++++++++- doc/source/operations/index.rst | 4 +- doc/source/operations/wazuh-operation.rst | 89 ----------------------- 3 files changed, 36 insertions(+), 92 deletions(-) delete mode 100644 doc/source/operations/wazuh-operation.rst diff --git a/doc/source/configuration/wazuh.rst b/doc/source/configuration/wazuh.rst index cd57716d34..ca6e519b17 100644 --- a/doc/source/configuration/wazuh.rst +++ b/doc/source/configuration/wazuh.rst @@ -2,13 +2,20 @@ Wazuh ===== +`Wazuh `_ is a security monitoring platform. +It monitors for: + +* Security-related system events. +* Known vulnerabilities (CVEs) in versions of installed software. +* Misconfigurations in system security. + The short version ================= #. Create an infrastructure VM for the Wazuh manager, and add it to the wazuh-manager group #. Configure the infrastructure VM with kayobe: ``kayobe infra vm host configure`` #. Edit your config under - ``etc/kayobe/inventory/group_vars/wazuh-manager/wazuh-manager``, in + ``$KAYOBE_CONFIG_PATHinventory/group_vars/wazuh-manager/wazuh-manager``, in particular the defaults assume that the ``provision_oc_net`` network will be used. #. Generate secrets: ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml`` @@ -233,9 +240,12 @@ You may need to modify some of the variables, including: - etc/kayobe/wazuh-manager.yml - etc/kayobe/inventory/group_vars/wazuh/wazuh-agent/wazuh-agent +You'll need to run ``wazuh-manager.yml`` playbook again to apply customisation. + Secrets ------- +Wazuh requires that secrets or passwords are set for itself and the services with which it communiticates. Wazuh secrets playbook is located in ``etc/kayobe/ansible/wazuh-secrets.yml``. Running this playbook will generate and put pertinent security items into secrets vault file which will be placed in ``$KAYOBE_CONFIG_PATH/wazuh-secrets.yml``. @@ -250,6 +260,10 @@ It will be used by wazuh secrets playbook to generate wazuh secrets vault file. kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml +.. note:: Use ``ansible-vault`` to view the secrets: + + ``ansible-vault view --vault-password-file ~/vault.password $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml`` + Configure Wazuh Dashboard's Server Host --------------------------------------- @@ -390,6 +404,25 @@ Deploy the Wazuh agents: ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml`` +The Wazuh Agent is deployed to all hosts in the ``wazuh-agent`` +inventory group, comprising the ``seed`` group +plus the ``overcloud`` group (containing all hosts in the +OpenStack control plane). + +.. code-block:: ini + + [wazuh-agent:children] + seed + overcloud + +The hosts running Wazuh Agent should automatically be registered +and visible within the Wazuh Manager dashboard. + +.. note:: It is good practice to use a `Kayobe deploy hook + `_ + to automate deployment and configuration of the Wazuh Agent + following a run of ``kayobe overcloud host configure``. 
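As an illustration of that note, one possible way to register such a hook, assuming
the standard Kayobe hook layout (the ``50-`` prefix is only an ordering convention
chosen here):

.. code-block:: console

   mkdir -p $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d
   cd $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d
   ln -s ../../../ansible/wazuh-agent.yml 50-wazuh-agent.yml
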
+ Verification ------------ diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index aea4139980..ae8a71901e 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -1,6 +1,6 @@ -================= +============== Operator Guide -================= +============== This guide is for operators of the StackHPC Kayobe configuration project. diff --git a/doc/source/operations/wazuh-operation.rst b/doc/source/operations/wazuh-operation.rst deleted file mode 100644 index 3f56c24603..0000000000 --- a/doc/source/operations/wazuh-operation.rst +++ /dev/null @@ -1,89 +0,0 @@ -======================= -Wazuh Security Platform -======================= - -`Wazuh `_ is a security monitoring platform. -It monitors for: - -* Security-related system events. -* Known vulnerabilities (CVEs) in versions of installed software. -* Misconfigurations in system security. - -One method for deploying and maintaining Wazuh is the `official -Ansible playbooks `_. These -can be integrated into ``kayobe-config`` as a custom playbook. - -Configuring Wazuh Manager -========================= - -Wazuh Manager is configured by editing the ``wazuh-manager.yml`` -groups vars file found at -``etc/kayobe/inventory/group_vars/wazuh-manager/``. This file -controls various aspects of Wazuh Manager configuration. -Most notably: - -*domain_name*: - The domain used by Search Guard CE when generating certificates. - -*wazuh_manager_ip*: - The IP address that the Wazuh Manager shall reside on for communicating with the agents. - -*wazuh_manager_connection*: - Used to define port and protocol for the manager to be listening on. - -*wazuh_manager_authd*: - Connection settings for the daemon responsible for registering new agents. - -Running ``kayobe playbook run -$KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml`` will deploy these -changes. - -Secrets -------- - -Wazuh requires that secrets or passwords are set for itself and the services with which it communiticates. -The playbook ``etc/kayobe/ansible/wazuh-secrets.yml`` automates the creation of these secrets, which should then be encrypted with Ansible Vault. - -To update the secrets you can execute the following two commands - -.. code-block:: shell - - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml \ - -e wazuh_user_pass=$(uuidgen) \ - -e wazuh_admin_pass=$(uuidgen) - kayobe# ansible-vault encrypt --vault-password-file \ - $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml - -Once generated, run ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml`` which copies the secrets into place. - -.. note:: Use ``ansible-vault`` to view the secrets: - - ``ansible-vault view --vault-password-file ~/vault.password $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-secrets.yml`` - -Adding a New Agent ------------------- -The Wazuh Agent is deployed to all hosts in the ``wazuh-agent`` -inventory group, comprising the ``seed`` group -plus the ``overcloud`` group (containing all hosts in the -OpenStack control plane). - -.. code-block:: ini - - [wazuh-agent:children] - seed - overcloud - -The following playbook deploys the Wazuh Agent to all hosts in the -``wazuh-agent`` group: - -.. code-block:: shell - - kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml - -The hosts running Wazuh Agent should automatically be registered -and visible within the Wazuh Manager dashboard. - -.. 
note:: It is good practice to use a `Kayobe deploy hook - `_ - to automate deployment and configuration of the Wazuh Agent - following a run of ``kayobe overcloud host configure``. From 342a4a2d9740fe89d665805484f1daf24ca5e103 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 12:43:53 +0100 Subject: [PATCH 18/42] Update old contents --- .../operations/control-plane-operation.rst | 62 +++++++++++++---- doc/source/operations/customising-horizon.rst | 67 +++---------------- .../operations/openstack-reconfiguration.rst | 45 ------------- 3 files changed, 58 insertions(+), 116 deletions(-) diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index 8212b95fba..c7b10e75f4 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -294,27 +294,61 @@ node is powered back on. Software Updates ================ -Update Packages on Control Plane --------------------------------- +Update Host Packages on Control Plane +------------------------------------- -OS packages can be updated with: +The host packages and Kolla container images are distributed from `StackHPC Release Train +`__ to ensure tested and reliable +software releases are provided. + +Syncing new StackHPC Release Train contents to local Pulp server is needed before updating +host packages and/or Kolla services. + +To sync host packages: + +.. code-block:: console + + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml + +If the system is production environment and want to use packages tested in test/staging +environment, you can promote them by: + +.. code-block:: console + + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml + +To sync container images: .. code-block:: console - kayobe# kayobe overcloud host package update --limit --packages '*' - kayobe# kayobe overcloud seed package update --packages '*' + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml + kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml + +Once sync with StackHPC Release Train is done, new contents will be accessible from local +Pulp server. + +Host packages can be updated with: + +.. code-block:: console + + kayobe# kayobe overcloud host package update --limit --packages '*' + kayobe# kayobe seed host package update --packages '*' See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages -Minor Upgrades to OpenStack Services ------------------------------------- +Upgrading OpenStack Services +---------------------------- + +* Update tags for the images in ``etc/kayobe/kolla-image-tags.yml`` to use the new value of ``kolla_openstack_release`` +* Pull container images to overcloud hosts with ``kayobe overcloud container image pull`` +* Run ``kayobe overcloud service upgrade`` + +You can update the subset of containers or hosts by + +.. 
code-block:: console -* Pull latest changes from upstream stable branch to your own ``kolla`` fork (if applicable) -* Update ``kolla_openstack_release`` in ``etc/kayobe/kolla.yml`` (unless using default) -* Update tags for the images in ``etc/kayobe/kolla/globals.yml`` to use the new value of ``kolla_openstack_release`` -* Rebuild container images -* Pull container images to overcloud hosts -* Run kayobe overcloud service upgrade + kayobe# kayobe overcloud service upgrade --kolla-tags --limit --kolla-limit For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html @@ -339,7 +373,7 @@ To boot an instance on a specific hypervisor .. code-block:: console - openstack server create --flavor --network --key-name --image --availability-zone nova:: + openstack server create --flavor --network --key-name --image --os-compute-api-version 2.74 --host Cleanup Procedures ================== diff --git a/doc/source/operations/customising-horizon.rst b/doc/source/operations/customising-horizon.rst index 4b2e157d86..d1fcd5e65b 100644 --- a/doc/source/operations/customising-horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -46,79 +46,35 @@ Further reading: * https://docs.openstack.org/horizon/latest/configuration/themes.html * https://docs.openstack.org/horizon/latest/configuration/branding.html -Building a Horizon container image with custom theme ----------------------------------------------------- +Adding the custom theme +----------------------- -Building a custom container image for Horizon can be done by modifying -``kolla.yml`` to fetch the custom theme and include it in the image: +Create a directory and transfer custom theme files to it ``$KAYOBE_CONFIG_PATH/kolla/config/horizon/themes/``. -.. code-block:: yaml - - kolla_sources: - horizon-additions-theme-: - type: "git" - location: - reference: master - - kolla_build_blocks: - horizon_footer: | - # Binary images cannot use the additions mechanism. - {% raw %} - {% if install_type == 'source' %} - ADD additions-archive / - RUN mkdir -p /etc/openstack-dashboard/themes/ \ - && cp -R /additions/horizon-additions-theme--archive-master/* /etc/openstack-dashboard/themes// \ - && chown -R horizon: /etc/openstack-dashboard/themes - {% endif %} - {% endraw %} - -If using a specific container image tag, don't forget to set: +Define the custom theme in ``etc/kayobe/kolla/globals.yml`` .. code-block:: yaml - - kolla_tag: mytag - -Build the image with: - -.. code-block:: console - - kayobe overcloud container image build horizon -e kolla_install_type=source --push - -Pull the new Horizon container to the controller: - -.. code-block:: console - - kayobe overcloud container image pull --kolla-tags horizon + horizon_custom_themes: + - name: + label: # This will be the visible name to users Deploy and use the custom theme ------------------------------- -Switch to source image type in ``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``: - -.. code-block:: yaml - - horizon_install_type: source - -You may also need to update the container image tag: - -.. code-block:: yaml - - horizon_tag: mytag - Configure Horizon to include the custom theme and use it by default: .. code-block:: console - mkdir -p ${KAYOBE_CONFIG_PATH}/kolla/config/horizon/ + mkdir -p $KAYOBE_CONFIG_PATH/kolla/config/horizon/ -Add to ``${KAYOBE_CONFIG_PATH}/kolla/config/horizon/custom_local_settings``: +Create file ``$KAYOBE_CONFIG_PATH/kolla/config/horizon/custom_local_settings`` and add followings .. 
code-block:: console AVAILABLE_THEMES = [ ('default', 'Default', 'themes/default'), ('material', 'Material', 'themes/material'), - ('', '', '/etc/openstack-dashboard/themes/'), + ('', '', 'themes/'), ] DEFAULT_THEME = '' @@ -137,9 +93,6 @@ Deploy with: Troubleshooting --------------- -Make sure you build source images, as binary images cannot use the addition -mechanism used here. - If the theme is selected but the logo doesn’t load, try running these commands inside the ``horizon`` container: diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index db157f6a91..712a0f779e 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -138,48 +138,3 @@ And then to enable a hypervisor again: admin# openstack compute service set --enable \ nova-compute - -Managing Space in the Docker Registry -===================================== - -If the Docker registry becomes full, this can prevent container updates and -(depending on the storage configuration of the seed host) could lead to other -problems with services provided by the seed host. - -To remove container images from the Docker Registry, follow this process: - -* Reconfigure the registry container to allow deleting containers. This can be - done in ``docker-registry.yml`` with Kayobe: - -.. code-block:: yaml - - docker_registry_env: - REGISTRY_STORAGE_DELETE_ENABLED: "true" - -* For the change to take effect, run: - -.. code-block:: console - - kayobe seed host configure - -* A helper script is useful, such as https://github.com/byrnedo/docker-reg-tool - (this requires ``jq``). To delete all images with a specific tag, use: - -.. code-block:: console - - for repo in `./docker_reg_tool http://registry-ip:4000 list`; do - ./docker_reg_tool http://registry-ip:4000 delete $repo $tag - done - -* Deleting the tag does not actually release the space. To actually free up - space, run garbage collection: - -.. code-block:: console - - seed# docker exec docker_registry bin/registry garbage-collect /etc/docker/registry/config.yml - -The seed host can also accrue a lot of data from building container images. -The images stored locally in the seed host can be seen using ``docker image ls``. - -Old and redundant images can be identified from their names and tags, and -removed using ``docker image rm``. From e5a0f509788252bd9d2a14277b8ec1ada9cb707e Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 14:31:33 +0100 Subject: [PATCH 19/42] Attach Release Train document for more info --- doc/source/configuration/release-train.rst | 2 +- doc/source/operations/control-plane-operation.rst | 11 ++++++++--- doc/source/operations/upgrading-openstack.rst | 2 +- 3 files changed, 10 insertions(+), 5 deletions(-) diff --git a/doc/source/configuration/release-train.rst b/doc/source/configuration/release-train.rst index f77109aff9..5ed9b50c74 100644 --- a/doc/source/configuration/release-train.rst +++ b/doc/source/configuration/release-train.rst @@ -1,4 +1,4 @@ -.. _stackhpc_release_train: +.. _stackhpc-release-train: ====================== StackHPC Release Train diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index c7b10e75f4..ebacc0568a 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -294,8 +294,8 @@ node is powered back on. 
Software Updates ================ -Update Host Packages on Control Plane -------------------------------------- +Sync local Pulp server with StackHPC Release Train +-------------------------------------------------- The host packages and Kolla container images are distributed from `StackHPC Release Train `__ to ensure tested and reliable @@ -325,9 +325,14 @@ To sync container images: kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml +For more information about StackHPC Release Train, see :ref:`stackhpc-release-train` documentation. + Once sync with StackHPC Release Train is done, new contents will be accessible from local Pulp server. +Update Host Packages on Control Plane +------------------------------------- + Host packages can be updated with: .. code-block:: console @@ -340,7 +345,7 @@ See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updat Upgrading OpenStack Services ---------------------------- -* Update tags for the images in ``etc/kayobe/kolla-image-tags.yml`` to use the new value of ``kolla_openstack_release`` +* Update tags for the images in ``etc/kayobe/kolla-image-tags.yml`` * Pull container images to overcloud hosts with ``kayobe overcloud container image pull`` * Run ``kayobe overcloud service upgrade`` diff --git a/doc/source/operations/upgrading-openstack.rst b/doc/source/operations/upgrading-openstack.rst index 0b0df50563..647ac2702c 100644 --- a/doc/source/operations/upgrading-openstack.rst +++ b/doc/source/operations/upgrading-openstack.rst @@ -459,7 +459,7 @@ To upgrade the Ansible control host: Syncing Release Train artifacts ------------------------------- -New :ref:`stackhpc_release_train` content should be synced to the local Pulp +New :ref:`stackhpc-release-train` content should be synced to the local Pulp server. This includes host packages (Deb/RPM) and container images. .. _sync-rt-package-repos: From d170a9e7481891253fbd6d3a6e3cec5fa62cbccd Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 15:13:19 +0100 Subject: [PATCH 20/42] Remove baremetal management doc This doc will need client specific infromation to be helpful --- .../operations/baremetal-node-management.rst | 275 ------------------ doc/source/operations/index.rst | 1 - 2 files changed, 276 deletions(-) delete mode 100644 doc/source/operations/baremetal-node-management.rst diff --git a/doc/source/operations/baremetal-node-management.rst b/doc/source/operations/baremetal-node-management.rst deleted file mode 100644 index 25a12760d6..0000000000 --- a/doc/source/operations/baremetal-node-management.rst +++ /dev/null @@ -1,275 +0,0 @@ -====================================== -Bare Metal Compute Hardware Management -====================================== - -Bare metal compute nodes are managed by the Ironic services. -This section describes elements of the configuration of this service. - -.. _ironic-node-lifecycle: - -Ironic node life cycle ----------------------- - -The deployment process is documented in the `Ironic User Guide `__. -OpenStack deployment uses the -`direct deploy method `__. - -The Ironic state machine can be found `here `__. The rest of -this documentation refers to these states and assumes that you have familiarity. - -High level overview of state transitions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The following section attempts to describe the state transitions for various Ironic operations at a high level. 
-It focuses on trying to describe the steps where dynamic switch reconfiguration is triggered. -For a more detailed overview, refer to the :ref:`ironic-node-lifecycle` section. - -Provisioning -~~~~~~~~~~~~ - -Provisioning starts when an instance is created in Nova using a bare metal flavor. - -- Node starts in the available state (available) -- User provisions an instance (deploying) -- Ironic will switch the node onto the provisioning network (deploying) -- Ironic will power on the node and will await a callback (wait-callback) -- Ironic will image the node with an operating system using the image provided at creation (deploying) -- Ironic switches the node onto the tenant network(s) via neutron (deploying) -- Transition node to active state (active) - -.. _baremetal-management-deprovisioning: - -Deprovisioning -~~~~~~~~~~~~~~ - -Deprovisioning starts when an instance created in Nova using a bare metal flavor is destroyed. - -If automated cleaning is enabled, it occurs when nodes are deprovisioned. - -- Node starts in active state (active) -- User deletes instance (deleting) -- Ironic will remove the node from any tenant network(s) (deleting) -- Ironic will switch the node onto the cleaning network (deleting) -- Ironic will power on the node and will await a callback (clean-wait) -- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) -- Ironic removes node from cleaning network (cleaning) -- Node transitions to available (available) - -If automated cleaning is disabled. - -- Node starts in active state (active) -- User deletes instance (deleting) -- Ironic will remove the node from any tenant network(s) (deleting) -- Node transitions to available (available) - -Cleaning -~~~~~~~~ - -Manual cleaning is not part of the regular state transitions when using Nova, however nodes can be manually cleaned by administrators. - -- Node starts in the manageable state (manageable) -- User triggers cleaning with API (cleaning) -- Ironic will switch the node onto the cleaning network (cleaning) -- Ironic will power on the node and will await a callback (clean-wait) -- Node boots into Ironic Python Agent and issues callback, Ironic starts cleaning (cleaning) -- Ironic removes node from cleaning network (cleaning) -- Node transitions back to the manageable state (manageable) - -Rescuing -~~~~~~~~ - -Feature not used. The required rescue network is not currently configured. - -Baremetal networking --------------------- - -Baremetal networking with the Neutron Networking Generic Switch ML2 driver requires a combination of static and dynamic switch configuration. - -.. _static-switch-config: - -Static switch configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Static physical network configuration is managed via Kayobe. - -.. TODO: Fill in the switch configuration - -- Some initial switch configuration is required before networking generic switch can take over the management of an interface. - First, LACP must be configured on the switch ports attached to the baremetal node, e.g: - - .. code-block:: shell - - The interface is then partially configured: - - .. code-block:: shell - - For :ref:`ironic-node-discovery` to work, you need to manually switch the port to the provisioning network: - - **NOTE**: You only need to do this if Ironic isn't aware of the node. - -Configuration with kayobe -^^^^^^^^^^^^^^^^^^^^^^^^^ - -Kayobe can be used to apply the :ref:`static-switch-config`. - -- Upstream documentation can be found `here `__. 
-- Kayobe does all the switch configuration that isn't :ref:`dynamically updated using Ironic `. -- Optionally switches the node onto the provisioning network (when using ``--enable-discovery``) - - + NOTE: This is a dangerous operation as it can wipe out the dynamic VLAN configuration applied by neutron/ironic. - You should only run this when initially enrolling a node, and should always use the ``interface-description-limit`` option. For example: - - .. code-block:: - - kayobe physical network configure --interface-description-limit --group switches --display --enable-discovery - - In this example, ``--display`` is used to preview the switch configuration without applying it. - -.. TODO: Fill in information about how switches are configured in kayobe-config, with links - -- Configuration is done using a combination of ``group_vars`` and ``host_vars`` - -.. _dynamic-switch-configuration: - -Dynamic switch configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Ironic dynamically configures the switches using the Neutron `Networking Generic Switch `_ ML2 driver. - -- Used to toggle the baremetal nodes onto different networks - - + Can use any VLAN network defined in OpenStack, providing that the VLAN has been trunked to the controllers - as this is required for DHCP to function. - + See :ref:`ironic-node-lifecycle`. This attempts to illustrate when any switch reconfigurations happen. - -- Only configures VLAN membership of the switch interfaces or port groups. To prevent conflicts with the static switch configuration, - the convention used is: after the node is in service in Ironic, VLAN membership should not be manually adjusted and - should be left to be controlled by ironic i.e *don't* use ``--enable-discovery`` without an interface limit when configuring the - switches with kayobe. -- Ironic is configured to use the neutron networking driver. - -.. _ngs-commands: - -Commands that NGS will execute -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Networking Generic Switch is mainly concerned with toggling the ports onto different VLANs. It -cannot fully configure the switch. - -.. TODO: Fill in the switch configuration - -- Switching the port onto the provisioning network - - .. code-block:: shell - -- Switching the port onto the tenant network. - - .. code-block:: shell - -- When deleting the instance, the VLANs are removed from the port. Using: - - .. code-block:: shell - -NGS will save the configuration after each reconfiguration (by default). - -Ports managed by NGS -^^^^^^^^^^^^^^^^^^^^ - -The command below extracts a list of port UUID, node UUID and switch port information. - -.. code-block:: bash - - openstack baremetal port list --field uuid --field node_uuid --field local_link_connection --format value - -NGS will manage VLAN membership for ports when the ``local_link_connection`` fields match one of the switches in ``ml2_conf.ini``. -The rest of the switch configuration is static. -The switch configuration that NGS will apply to these ports is detailed in :ref:`dynamic-switch-configuration`. - -.. _ironic-node-discovery: - -Ironic node discovery ---------------------- - -Discovery is a process used to automatically enrol new nodes in Ironic. -It works by PXE booting the nodes into the Ironic Python Agent (IPA) ramdisk. -This ramdisk will collect hardware and networking configuration from the node in a process known as introspection. -This data is used to populate the baremetal node object in Ironic. 
-The series of steps you need to take to enrol a new node is as follows: - -- Configure credentials on the BMC. These are needed for Ironic to be able to perform power control actions. - -- Controllers should have network connectivity with the target BMC. - -- (If kayobe manages physical network) Add any additional switch configuration to kayobe config. - The minimal switch configuration that kayobe needs to know about is described in :ref:`tor-switch-configuration`. - -- Apply any :ref:`static switch configration `. This performs the initial - setup of the switchports that is needed before Ironic can take over. The static configuration - will not be modified by Ironic, so it should be safe to reapply at any point. See :ref:`ngs-commands` - for details about the switch configuation that Networking Generic Switch will apply. - -- (If kayobe manages physical network) Put the node onto the provisioning network by using the - ``--enable-discovery`` flag and either ``--interface-description-limit`` or ``--interface-limit`` - (do not run this command without one of these limits). See :ref:`static-switch-config`. - - * This is only necessary to initially discover the node. Once the node is in registered in Ironic, - it will take over control of the the VLAN membership. See :ref:`dynamic-switch-configuration`. - - * This provides ethernet connectivity with the controllers over the `workload provisioning` network - -- (If kayobe doesn't manage physical network) Put the node onto the provisioning network. - -.. TODO: link to the relevant file in kayobe config - -- Add node to the kayobe inventory. - -.. TODO: Fill in details about necessary BIOS & RAID config - -- Apply any necesary BIOS & RAID configuration. - -.. TODO: Fill in details about how to trigger a PXE boot - -- PXE boot the node. - -- If the discovery process is successful, the node will appear in Ironic and will get populated with the necessary information from the hardware inspection process. - -.. TODO: Link to the Kayobe inventory in the repo - -- Add node to the Kayobe inventory in the ``baremetal-compute`` group. - -- The node will begin in the ``enroll`` state, and must be moved first to ``manageable``, then ``available`` before it can be used. - - If Ironic automated cleaning is enabled, the node must complete a cleaning process before it can reach the available state. - - * Use Kayobe to attempt to move the node to the ``available`` state. - - .. code-block:: console - - source etc/kolla/public-openrc.sh - kayobe baremetal compute provide --limit - -- Once the node is in the ``available`` state, Nova will make the node available for scheduling. This happens periodically, and typically takes around a minute. - -.. _tor-switch-configuration: - -Top of Rack (ToR) switch configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Networking Generic Switch must be aware of the Top-of-Rack switch connected to the new node. -Switches managed by NGS are configured in ``ml2_conf.ini``. - -.. TODO: Fill in details about how switches are added to NGS config in kayobe-config - -After adding switches to the NGS configuration, Neutron must be redeployed. - -Considerations when booting baremetal compared to VMs ------------------------------------------------------- - -- You can only use networks of type: vlan -- Without using trunk ports, it is only possible to directly attach one network to each port or port group of an instance. 
- - * To access other networks you can use routers - * You can still attach floating IPs - -- Instances take much longer to provision (expect at least 15 mins) -- When booting an instance use one of the flavors that maps to a baremetal node via the RESOURCE_CLASS configured on the flavor. diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index ae8a71901e..f130cee9f9 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -7,7 +7,6 @@ This guide is for operators of the StackHPC Kayobe configuration project. .. toctree:: :maxdepth: 1 - baremetal-node-management ceph-management control-plane-operation customsing-horizon From a2833f5eba0ea04bba0920192747e4e97504dccb Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 17 Jun 2024 15:26:49 +0100 Subject: [PATCH 21/42] Fix formatting --- doc/source/operations/ceph-management.rst | 1 + doc/source/operations/control-plane-operation.rst | 8 ++++---- doc/source/operations/customising-horizon.rst | 1 + doc/source/operations/gpu-in-openstack.rst | 6 +++--- doc/source/operations/index.rst | 2 +- 5 files changed, 10 insertions(+), 8 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index fc17571278..aac620e33d 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -100,6 +100,7 @@ In this case Ceph will go into `HEALTH_WARN` state. Ceph can report details about failed OSDs by running: .. code-block:: console + # From storage host sudo cephadm shell ceph health detail diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index ebacc0568a..f81111dc01 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -71,10 +71,10 @@ hypervisor will produce several alerts: * ``PrometheusTargetMissing`` from several Prometheus exporters Rather than silencing each alert one by one for a specific host, a silence can -apply to multiple alerts using a reduced list of labels. :ref:`Log into -Alertmanager `, click on the ``Silence`` button next -to an alert and adjust the matcher list to keep only ``instance=`` -label. Then, create another silence to match ``hostname=`` (this is +apply to multiple alerts using a reduced list of labels. Log into Alertmanager, +click on the ``Silence`` button next to an alert and adjust the matcher list +to keep only ``instance=`` label. +Then, create another silence to match ``hostname=`` (this is required because, for the OpenStack exporter, the instance is the host running the monitoring service rather than the host being monitored). diff --git a/doc/source/operations/customising-horizon.rst b/doc/source/operations/customising-horizon.rst index d1fcd5e65b..096ce2e561 100644 --- a/doc/source/operations/customising-horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -54,6 +54,7 @@ Create a directory and transfer custom theme files to it ``$KAYOBE_CONFIG_PATH/k Define the custom theme in ``etc/kayobe/kolla/globals.yml`` .. 
code-block:: yaml + horizon_custom_themes: - name: label: # This will be the visible name to users diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 259e39e8c1..4979d3f25b 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -971,9 +971,9 @@ Once this code has taken effect (after a reboot), the VFIO kernel drivers should # lspci -nnk -s 3d:00.0 3d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Tesla M10] [10de:13bd] (rev a2) - Subsystem: NVIDIA Corporation Tesla M10 [10de:1160] - Kernel driver in use: vfio-pci - Kernel modules: nouveau + Subsystem: NVIDIA Corporation Tesla M10 [10de:1160] + Kernel driver in use: vfio-pci + Kernel modules: nouveau IOMMU should be enabled at kernel level as well - we can verify that on the compute host: diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index f130cee9f9..2408b1a36e 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -9,7 +9,7 @@ This guide is for operators of the StackHPC Kayobe configuration project. ceph-management control-plane-operation - customsing-horizon + customising-horizon gpu-in-openstack hardware-inventory-management hotfix-playbook From 124a2a38d6dece71abbbf4e10a0087b222b74844 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Tue, 9 Jul 2024 12:05:36 +0100 Subject: [PATCH 22/42] Adding missing / --- doc/source/configuration/wazuh.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/configuration/wazuh.rst b/doc/source/configuration/wazuh.rst index ca6e519b17..49462f86ea 100644 --- a/doc/source/configuration/wazuh.rst +++ b/doc/source/configuration/wazuh.rst @@ -15,7 +15,7 @@ The short version #. Create an infrastructure VM for the Wazuh manager, and add it to the wazuh-manager group #. Configure the infrastructure VM with kayobe: ``kayobe infra vm host configure`` #. Edit your config under - ``$KAYOBE_CONFIG_PATHinventory/group_vars/wazuh-manager/wazuh-manager``, in + ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-manager``, in particular the defaults assume that the ``provision_oc_net`` network will be used. #. Generate secrets: ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml`` From e12e8fa16f22a97e8aa79e8340e11e2b284e701b Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 12 Sep 2024 13:49:59 +0100 Subject: [PATCH 23/42] Update content to Antelope and misc changes --- doc/source/configuration/wazuh.rst | 36 +++++++------- doc/source/operations/ceph-management.rst | 12 ++--- .../operations/control-plane-operation.rst | 47 +++---------------- doc/source/operations/customising-horizon.rst | 6 +-- 4 files changed, 33 insertions(+), 68 deletions(-) diff --git a/doc/source/configuration/wazuh.rst b/doc/source/configuration/wazuh.rst index 49462f86ea..40b8a973ca 100644 --- a/doc/source/configuration/wazuh.rst +++ b/doc/source/configuration/wazuh.rst @@ -34,14 +34,14 @@ Provisioning an infra VM for Wazuh Manager. Kayobe supports :kayobe-doc:`provisioning infra VMs `. The following configuration may be used as a guide. Config for infra VMs is documented :kayobe-doc:`here `. -Add a Wazuh Manager host to the ``wazuh-manager`` group in ``etc/kayobe/inventory/hosts``. +Add a Wazuh Manager host to the ``wazuh-manager`` group in ``$KAYOBE_CONFIG_PATH/inventory/hosts``. .. 
code-block:: ini [wazuh-manager] os-wazuh -Add the ``wazuh-manager`` group to the ``infra-vms`` group in ``etc/kayobe/inventory/groups``. +Add the ``wazuh-manager`` group to the ``infra-vms`` group in ``$KAYOBE_CONFIG_PATH/inventory/groups``. .. code-block:: ini @@ -50,7 +50,7 @@ Add the ``wazuh-manager`` group to the ``infra-vms`` group in ``etc/kayobe/inven [infra-vms:children] wazuh-manager -Define VM sizing in ``etc/kayobe/inventory/group_vars/wazuh-manager/infra-vms``: +Define VM sizing in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/infra-vms``: .. code-block:: yaml @@ -64,7 +64,7 @@ Define VM sizing in ``etc/kayobe/inventory/group_vars/wazuh-manager/infra-vms``: # Capacity of the infra VM data volume. infra_vm_data_capacity: "200G" -Optional: define LVM volumes in ``etc/kayobe/inventory/group_vars/wazuh-manager/lvm``. +Optional: define LVM volumes in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/lvm``. ``/var/ossec`` often requires greater storage space, and ``/var/lib/wazuh-indexer`` may be beneficial too. @@ -86,7 +86,7 @@ may be beneficial too. create: true -Define network interfaces ``etc/kayobe/inventory/group_vars/wazuh-manager/network-interfaces``: +Define network interfaces ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/network-interfaces``: (The following is an example - the names will depend on your particular network configuration.) @@ -98,7 +98,7 @@ Define network interfaces ``etc/kayobe/inventory/group_vars/wazuh-manager/networ The Wazuh manager may need to be exposed externally, in which case it may require another interface. -This can be done as follows in ``etc/kayobe/inventory/group_vars/wazuh-manager/network-interfaces``, +This can be done as follows in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/network-interfaces``, with the network defined in ``networks.yml`` as usual. .. code-block:: yaml @@ -190,7 +190,7 @@ Deploying Wazuh Manager services Setup ----- -To install a specific version modify the wazuh-ansible entry in ``etc/kayobe/ansible/requirements.yml``: +To install a specific version modify the wazuh-ansible entry in ``$KAYOBE_CONFIG_PATH/ansible/requirements.yml``: .. code-block:: yaml @@ -211,7 +211,7 @@ Edit the playbook and variables to your needs: Wazuh manager configuration --------------------------- -Wazuh manager playbook is located in ``etc/kayobe/ansible/wazuh-manager.yml``. +Wazuh manager playbook is located in ``$KAYOBE_CONFIG_PATH/ansible/wazuh-manager.yml``. Running this playbook will: * generate certificates for wazuh-manager @@ -221,7 +221,7 @@ Running this playbook will: * setup and deploy wazuh-dashboard on wazuh-manager vm * copy certificates over to wazuh-manager vm -Wazuh manager variables file is located in ``etc/kayobe/inventory/group_vars/wazuh-manager/wazuh-manager``. +Wazuh manager variables file is located in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-manager/wazuh-manager``. You may need to modify some of the variables, including: @@ -232,13 +232,13 @@ You may need to modify some of the variables, including: If you are using multiple environments, and you need to customise Wazuh in each environment, create override files in an appropriate directory, - for example ``etc/kayobe/environments/production/inventory/group_vars/``. + for example ``$KAYOBE_CONFIG_PATH/environments/production/inventory/group_vars/``. 
Files which values can be overridden (in the context of Wazuh): - - etc/kayobe/inventory/group_vars/wazuh/wazuh-manager/wazuh-manager - - etc/kayobe/wazuh-manager.yml - - etc/kayobe/inventory/group_vars/wazuh/wazuh-agent/wazuh-agent + - $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh/wazuh-manager/wazuh-manager + - $KAYOBE_CONFIG_PATH/wazuh-manager.yml + - $KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh/wazuh-agent/wazuh-agent You'll need to run ``wazuh-manager.yml`` playbook again to apply customisation. @@ -246,13 +246,13 @@ Secrets ------- Wazuh requires that secrets or passwords are set for itself and the services with which it communiticates. -Wazuh secrets playbook is located in ``etc/kayobe/ansible/wazuh-secrets.yml``. +Wazuh secrets playbook is located in ``$KAYOBE_CONFIG_PATH/ansible/wazuh-secrets.yml``. Running this playbook will generate and put pertinent security items into secrets vault file which will be placed in ``$KAYOBE_CONFIG_PATH/wazuh-secrets.yml``. If using environments it ends up in ``$KAYOBE_CONFIG_PATH/environments//wazuh-secrets.yml`` Remember to encrypt! -Wazuh secrets template is located in ``etc/kayobe/ansible/templates/wazuh-secrets.yml.j2``. +Wazuh secrets template is located in ``$KAYOBE_CONFIG_PATH/ansible/templates/wazuh-secrets.yml.j2``. It will be used by wazuh secrets playbook to generate wazuh secrets vault file. @@ -380,7 +380,7 @@ Verification ------------ The Wazuh portal should be accessible on port 443 of the Wazuh -manager’s IPs (using HTTPS, with the root CA cert in ``etc/kayobe/ansible/wazuh/certificates/wazuh-certificates/root-ca.pem``). +manager’s IPs (using HTTPS, with the root CA cert in ``$KAYOBE_CONFIG_PATH/ansible/wazuh/certificates/wazuh-certificates/root-ca.pem``). The first login should be as the admin user, with the opendistro_admin_password password in ``$KAYOBE_CONFIG_PATH/wazuh-secrets.yml``. This will create the necessary indices. @@ -392,9 +392,9 @@ Logs are in ``/var/log/wazuh-indexer/wazuh.log``. There are also logs in the jou Wazuh agents ============ -Wazuh agent playbook is located in ``etc/kayobe/ansible/wazuh-agent.yml``. +Wazuh agent playbook is located in ``$KAYOBE_CONFIG_PATH/ansible/wazuh-agent.yml``. -Wazuh agent variables file is located in ``etc/kayobe/inventory/group_vars/wazuh-agent/wazuh-agent``. +Wazuh agent variables file is located in ``$KAYOBE_CONFIG_PATH/inventory/group_vars/wazuh-agent/wazuh-agent``. You may need to modify some variables, including: diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index aac620e33d..67e3a5899f 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -8,14 +8,14 @@ Working with Cephadm This documentation provides guide for Ceph operations. For deploying Ceph, please refer to :ref:`cephadm-kayobe` documentation. -cephadm configuration location +Cephadm configuration location ------------------------------ In kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific Kayobe environment when using multiple environment, e.g. ``etc/kayobe/environments//cephadm.yml``) -StackHPC's cephadm Ansible collection relies on multiple inventory groups: +StackHPC's Cephadm Ansible collection relies on multiple inventory groups: - ``mons`` - ``mgrs`` @@ -24,11 +24,11 @@ StackHPC's cephadm Ansible collection relies on multiple inventory groups: Those groups are usually defined in ``etc/kayobe/inventory/groups``. 
-Running cephadm playbooks +Running Cephadm playbooks ------------------------- In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of -cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. +Cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. - ``cephadm.yml`` - runs the end to end process starting with deployment and defining EC profiles/crush rules/pools and users @@ -176,11 +176,11 @@ Remove the OSD using Ceph orchestrator command: ceph orch osd rm --replace After removing OSDs, if the drives the OSDs were deployed on once again become -available, cephadm may automatically try to deploy more OSDs on these drives if +available, Cephadm may automatically try to deploy more OSDs on these drives if they match an existing drivegroup spec. If this is not your desired action plan - it's best to modify the drivegroup spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). -Either set ``unmanaged: true`` to stop cephadm from picking up new disks or +Either set ``unmanaged: true`` to stop Cephadm from picking up new disks or modify it in some way that it no longer matches the drives you want to remove. Host maintenance diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index f81111dc01..d3440c97db 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -26,7 +26,7 @@ Monitoring ---------- * `Back up InfluxDB `__ -* `Back up ElasticSearch `__ +* `Back up OpenSearch `__ * `Back up Prometheus `__ Seed @@ -42,8 +42,8 @@ Ansible control host Control Plane Monitoring ======================== -The control plane has been configured to collect logs centrally using the EFK -stack (Elasticsearch, Fluentd and Kibana). +The control plane has been configured to collect logs centrally using Fluentd, +OpenSearch and OpenSearch Dashboards. Telemetry monitoring of the control plane is performed by Prometheus. Metrics are collected by Prometheus exporters, which are either running on all hosts @@ -227,7 +227,7 @@ Overview * Remove the node from maintenance mode in bifrost * Bifrost should automatically power on the node via IPMI * Check that all docker containers are running -* Check Kibana for any messages with log level ERROR or equivalent +* Check OpenSearch Dashboards for any messages with log level ERROR or equivalent Controllers ----------- @@ -277,7 +277,7 @@ Stop all Docker containers: .. code-block:: console - monitoring0# for i in `docker ps -q`; do docker stop $i; done + monitoring0# for i in `docker ps -a`; do systemctl stop kolla-$i-container; done Shut down the node: @@ -342,21 +342,6 @@ Host packages can be updated with: See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages -Upgrading OpenStack Services ----------------------------- - -* Update tags for the images in ``etc/kayobe/kolla-image-tags.yml`` -* Pull container images to overcloud hosts with ``kayobe overcloud container image pull`` -* Run ``kayobe overcloud service upgrade`` - -You can update the subset of containers or hosts by - -.. code-block:: console - - kayobe# kayobe overcloud service upgrade --kolla-tags --limit --kolla-limit - -For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html - Troubleshooting =============== @@ -378,27 +363,7 @@ To boot an instance on a specific hypervisor .. 
code-block:: console - openstack server create --flavor --network --key-name --image --os-compute-api-version 2.74 --host - -Cleanup Procedures -================== - -OpenStack services can sometimes fail to remove all resources correctly. This -is the case with Magnum, which fails to clean up users in its domain after -clusters are deleted. `A patch has been submitted to stable branches -`__. -Until this fix becomes available, if Magnum is in use, administrators can -perform the following cleanup procedure regularly: - -.. code-block:: console - - for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do - if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then - echo "$user still in use, not deleting" - else - openstack user delete --domain magnum $user - fi - done + openstack server create --flavor --network --key-name --image --os-compute-api-version 2.74 --host OpenSearch indexes retention ============================= diff --git a/doc/source/operations/customising-horizon.rst b/doc/source/operations/customising-horizon.rst index 096ce2e561..a39973e0f9 100644 --- a/doc/source/operations/customising-horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -113,6 +113,6 @@ If the ``horizon`` container is restarting with the following error: /var/lib/kolla/venv/bin/python /var/lib/kolla/venv/bin/manage.py compress --force CommandError: An error occurred during rendering /var/lib/kolla/venv/lib/python3.6/site-packages/openstack_dashboard/templates/horizon/_scripts.html: Couldn't find any precompiler in COMPRESS_PRECOMPILERS setting for mimetype '\'text/javascript\''. -It can be resolved by dropping cached content with ``docker restart -memcached``. Note this will log out users from Horizon, as Django sessions are -stored in Memcached. +It can be resolved by dropping cached content with ``systemctl restart +kolla-memcached-container``. Note this will log out users from Horizon, as Django +sessions are stored in Memcached. From 7d41f5b206649a0da2a94e2de30a2fa54a3c0581 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 16 Sep 2024 14:03:13 +0100 Subject: [PATCH 24/42] Update Cephadm playbook info --- doc/source/operations/ceph-management.rst | 33 ++++++++++++++++------- 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 67e3a5899f..3132c42bae 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -30,16 +30,31 @@ Running Cephadm playbooks In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of Cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection. -- ``cephadm.yml`` - runs the end to end process starting with deployment and - defining EC profiles/crush rules/pools and users -- ``cephadm-crush-rules.yml`` - defines Ceph crush rules according -- ``cephadm-deploy.yml`` - runs the bootstrap/deploy playbook without the +``cephadm.yml`` runs the end to end process of Cephadm deployment and +configuration. It is composed with following list of other Cephadm playbooks +and they can be run separately. 
+
+- ``cephadm-deploy.yml`` - Runs the bootstrap/deploy playbook without the
  additional playbooks
-- ``cephadm-ec-profiles.yml`` - defines Ceph EC profiles
-- ``cephadm-gather-keys.yml`` - gather Ceph configuration and keys and populate
-  kayobe-config
-- ``cephadm-keys.yml`` - defines Ceph users/keys
-- ``cephadm-pools.yml`` - defines Ceph pools\
+
+- ``cephadm-commands-pre.yml`` - Runs Ceph commands before post-deployment
+  configuration (You can set a list of commands at ``cephadm_commands_pre_extra``
+  in ``cephadm.yml``)
+- ``cephadm-ec-profiles.yml`` - Defines Ceph EC profiles
+- ``cephadm-crush-rules.yml`` - Defines Ceph crush rules according
+- ``cephadm-pools.yml`` - Defines Ceph pools
+- ``cephadm-keys.yml`` - Defines Ceph users/keys
+- ``cephadm-commands-post.yml`` - Runs Ceph commands after post-deployment
+  configuration (You can set a list of commands at ``cephadm_commands_post_extra``
+  in ``cephadm.yml``)
+
+There are also other Ceph playbooks that are not part of ``cephadm.yml``
+
+- ``cephadm-gather-keys.yml`` - Populate ``ceph.conf`` in kayobe-config by
+  gathering Ceph configuration and keys
+- ``ceph-enter-maintenance.yml`` - Set Ceph to maintenance mode for storage
+  hosts (Can limit the hosts with ``-l ``)
+- ``ceph-exit-maintenance.yml`` - Unset Ceph to maintenance mode for storage
+  hosts (Can limit the hosts with ``-l ``)
Running Ceph commands
---------------------
From 109ca13c62d978c90682d495a592e470b73927c4 Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Mon, 16 Sep 2024 16:26:20 +0100
Subject: [PATCH 25/42] Replace etc/kayobe to $KAYOBE_CONFIG_PATH
---
 doc/source/operations/ceph-management.rst     | 16 ++++++++--------
 doc/source/operations/customising-horizon.rst |  2 +-
 doc/source/operations/gpu-in-openstack.rst    |  6 +++---
 doc/source/operations/nova-compute-ironic.rst |  8 ++++----
 .../operations/openstack-reconfiguration.rst  |  2 +-
 doc/source/operations/secret-rotation.rst     |  2 +-
 6 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst
index 67e3a5899f..6db44ad23f 100644
--- a/doc/source/operations/ceph-management.rst
+++ b/doc/source/operations/ceph-management.rst
@@ -11,9 +11,9 @@ please refer to :ref:`cephadm-kayobe` documentation.
Cephadm configuration location
------------------------------
-In kayobe-config repository, under ``etc/kayobe/cephadm.yml`` (or in a specific
+In kayobe-config repository, under ``$KAYOBE_CONFIG_PATH/cephadm.yml`` (or in a specific
Kayobe environment when using multiple environment, e.g.
-``etc/kayobe/environments//cephadm.yml``)
+``$KAYOBE_CONFIG_PATH/environments//cephadm.yml``)
StackHPC's Cephadm Ansible collection relies on multiple inventory groups:
@@ -22,12 +22,12 @@ StackHPC's Cephadm Ansible collection relies on multiple inventory groups:
- ``osds``
- ``rgws`` (optional)
-Those groups are usually defined in ``etc/kayobe/inventory/groups``.
+Those groups are usually defined in ``$KAYOBE_CONFIG_PATH/inventory/groups``.
Running Cephadm playbooks
-------------------------
-In kayobe-config repository, under ``etc/kayobe/ansible`` there is a set of
+In kayobe-config repository, under ``$KAYOBE_CONFIG_PATH/ansible`` there is a set of
Cephadm based playbooks utilising stackhpc.cephadm Ansible Galaxy collection.
``cephadm.yml`` runs the end to end process of Cephadm deployment and
@@ -38,14 +38,14 @@ and they can be run separately.
additional playbooks - ``cephadm-commands-pre.yml`` - Runs Ceph commands before post-deployment configuration (You can set a list of commands at ``cephadm_commands_pre_extra`` - in ``cephadm.yml``) + variable in ``$KAYOBE_CONFIG_PATH/cephadm.yml``) - ``cephadm-ec-profiles.yml`` - Defines Ceph EC profiles - ``cephadm-crush-rules.yml`` - Defines Ceph crush rules according - ``cephadm-pools.yml`` - Defines Ceph pools - ``cephadm-keys.yml`` - Defines Ceph users/keys - ``cephadm-commands-post.yml`` - Runs Ceph commands after post-deployment configuration (You can set a list of commands at ``cephadm_commands_post_extra`` - in ``cephadm.yml``) + variable in ``$KAYOBE_CONFIG_PATH/cephadm.yml``) There are also other Ceph playbooks that are not part of ``cephadm.yml`` @@ -102,7 +102,7 @@ Once all daemons are removed - you can remove the host: ceph orch host rm And then remove the host from inventory (usually in -``etc/kayobe/inventory/overcloud``) +``$KAYOBE_CONFIG_PATH/inventory/overcloud``) Additional options/commands may be found in `Host management `_ @@ -194,7 +194,7 @@ After removing OSDs, if the drives the OSDs were deployed on once again become available, Cephadm may automatically try to deploy more OSDs on these drives if they match an existing drivegroup spec. If this is not your desired action plan - it's best to modify the drivegroup -spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``). +spec before (``cephadm_osd_spec`` variable in ``$KAYOBE_CONFIG_PATH/cephadm.yml``). Either set ``unmanaged: true`` to stop Cephadm from picking up new disks or modify it in some way that it no longer matches the drives you want to remove. diff --git a/doc/source/operations/customising-horizon.rst b/doc/source/operations/customising-horizon.rst index a39973e0f9..586bb242cd 100644 --- a/doc/source/operations/customising-horizon.rst +++ b/doc/source/operations/customising-horizon.rst @@ -51,7 +51,7 @@ Adding the custom theme Create a directory and transfer custom theme files to it ``$KAYOBE_CONFIG_PATH/kolla/config/horizon/themes/``. -Define the custom theme in ``etc/kayobe/kolla/globals.yml`` +Define the custom theme in ``$KAYOBE_CONFIG_PATH/kolla/globals.yml`` .. code-block:: yaml diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 4979d3f25b..9270817187 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -991,7 +991,7 @@ Configure nova-scheduler The nova-scheduler service must be configured to enable the ``PciPassthroughFilter`` To enable it add it to the list of filters to Kolla-Ansible configuration file: -``etc/kayobe/kolla/config/nova.conf``, for instance: +``$KAYOBE_CONFIG_PATH/kolla/config/nova.conf``, for instance: .. code-block:: yaml @@ -1006,7 +1006,7 @@ Configuration can be applied in flexible ways using Kolla-Ansible's methods for `inventory-driven customisation of configuration `_. The following configuration could be added to -``etc/kayobe/kolla/config/nova/nova-compute.conf`` to enable PCI +``$KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf`` to enable PCI passthrough of GPU devices for hosts in a group named ``compute_gpu``. Again, the 4-digit PCI Vendor ID and Device ID extracted from ``lspci -nn`` can be used here to specify the GPU device(s). @@ -1037,7 +1037,7 @@ Configure nova-api pci.alias also needs to be configured on the controller. This configuration should match the configuration found on the compute nodes. 
Add it to Kolla-Ansible configuration file: -``etc/kayobe/kolla/config/nova/nova-api.conf``, for instance: +``$KAYOBE_CONFIG_PATH/kolla/config/nova/nova-api.conf``, for instance: .. code-block:: yaml diff --git a/doc/source/operations/nova-compute-ironic.rst b/doc/source/operations/nova-compute-ironic.rst index 6cbe00550f..9b9af76ffb 100644 --- a/doc/source/operations/nova-compute-ironic.rst +++ b/doc/source/operations/nova-compute-ironic.rst @@ -68,7 +68,7 @@ Moving from multiple Nova Compute Instances to a single instance 1. Decide where the single instance should run. This should normally be one of the three OpenStack control plane hosts. For convention, pick the first one, unless you can think of a good reason not to. Once you - have chosen, set the following variable in ``etc/kayobe/nova.yml``. + have chosen, set the following variable in ``$KAYOBE_CONFIG_PATH/nova.yml``. Here we have picked ``controller1``. .. code-block:: yaml @@ -196,7 +196,7 @@ constant, such that the new Nova Compute Ironic instance comes up with the same name as the one it replaces. For example, if the original instance resides on ``controller1``, then set the -following in ``etc/kayobe/nova.yml``: +following in ``$KAYOBE_CONFIG_PATH/nova.yml``: .. code-block:: yaml @@ -270,8 +270,8 @@ Current host is not accessible ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In this case you will need to remove the inaccessible host from the inventory. -For example, in ``etc/kayobe/inventory/hosts``, remove ``controller1`` from -the ``controllers`` group. +For example, in ``$KAYOBE_CONFIG_PATH/inventory/hosts``, remove ``controller1`` +from the ``controllers`` group. Adjust the ``kolla_nova_compute_ironic_host`` variable to point to the new host, eg. diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 712a0f779e..ab2a01fc8e 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -73,7 +73,7 @@ Alternative Configuration As an alternative to writing the certificates as a variable to ``secrets.yml``, it is also possible to write the same data to a file, -``etc/kayobe/kolla/certificates/haproxy.pem``. The file should be +``$KAYOBE_CONFIG_PATH/kolla/certificates/haproxy.pem``. The file should be vault-encrypted in the same manner as secrets.yml. In this instance, variable ``kolla_external_tls_cert`` does not need to be defined. diff --git a/doc/source/operations/secret-rotation.rst b/doc/source/operations/secret-rotation.rst index a01f66fa9c..34fd33a72c 100644 --- a/doc/source/operations/secret-rotation.rst +++ b/doc/source/operations/secret-rotation.rst @@ -105,7 +105,7 @@ Full method 3. Navigate to the directory containing your ``passwords.yml`` file (``kayobe-config/etc/kolla/passwords.yml`` OR - ``kayobe-config/etc/kayobe/environments/envname/kolla/passwords.yml``) + ``kayobe-config/etc/kayobe/environments//kolla/passwords.yml``) 4. 
Create a file called ``deletelist.txt`` and populate it with this content (including all whitespace): From 8edf08f55031c073f55ff4f9bcb270d7d8998067 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 9 Oct 2024 12:11:16 +0100 Subject: [PATCH 26/42] specify keyring is populated --- doc/source/operations/ceph-management.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index 6db44ad23f..e48a8d3e02 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -49,7 +49,7 @@ and they can be run separately. There are also other Ceph playbooks that are not part of ``cephadm.yml`` -- ``cephadm-gather-keys.yml`` - Populate ``ceph.conf`` in kayobe-config by +- ``cephadm-gather-keys.yml`` - Populate ``ceph.conf`` and keyrings in kayobe-config by gathering Ceph configuration and keys - ``ceph-enter-maintenance.yml`` - Set Ceph to maintenance mode for storage hosts (Can limit the hosts with ``-l ``) From 20e46a3e62c3744fadd122397dd1cfcd521ef8d0 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Wed, 9 Oct 2024 12:11:25 +0100 Subject: [PATCH 27/42] Add rebooting case --- doc/source/operations/control-plane-operation.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index d3440c97db..cdb328a469 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -20,7 +20,7 @@ Compute The compute nodes can largely be thought of as ephemeral, but you do need to make sure you have migrated any instances and disabled the hypervisor before -decommissioning or making any disruptive configuration change. +rebooting, decommissioning or making any disruptive configuration change. Monitoring ---------- @@ -197,7 +197,7 @@ following order: * Perform a graceful shutdown of all virtual machine instances * Shut down compute nodes -* Shut down monitoring node +* Shut down monitoring node (if separate from controllers) * Shut down network nodes (if separate from controllers) * Shut down controllers * Shut down Ceph nodes (if applicable) From 275ce2c494e855bac846fd918a588c21eb918df9 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:26:47 +0000 Subject: [PATCH 28/42] Remove missing document --- doc/source/operations/index.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index 2408b1a36e..3bfc7e976a 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -23,4 +23,3 @@ This guide is for operators of the StackHPC Kayobe configuration project. 
tempest upgrading-openstack upgrading-ceph - wazuh-operation From c24dc1ddd4a73723cadf4cbe30c021c7c74ff8ad Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:28:31 +0000 Subject: [PATCH 29/42] Make hardware inventory doc bifrost specific --- ...t => bifrost-hardware-inventory-management.rst} | 14 ++++++++------ doc/source/operations/index.rst | 2 +- 2 files changed, 9 insertions(+), 7 deletions(-) rename doc/source/operations/{hardware-inventory-management.rst => bifrost-hardware-inventory-management.rst} (96%) diff --git a/doc/source/operations/hardware-inventory-management.rst b/doc/source/operations/bifrost-hardware-inventory-management.rst similarity index 96% rename from doc/source/operations/hardware-inventory-management.rst rename to doc/source/operations/bifrost-hardware-inventory-management.rst index f88626a827..970e828aee 100644 --- a/doc/source/operations/hardware-inventory-management.rst +++ b/doc/source/operations/bifrost-hardware-inventory-management.rst @@ -1,8 +1,8 @@ -============================= -Hardware Inventory Management -============================= +===================================== +Bifrost Hardware Inventory Management +===================================== -At its lowest level, hardware inventory is managed in the Bifrost service. +In most deployments, hardware inventory is managed by the Bifrost service. Reconfiguring Control Plane Hardware ==================================== @@ -56,7 +56,9 @@ in Bifrost: | da0c61af-b411-41b9-8909-df2509f2059b | example-hypervisor-01 | None | power off | enroll | False | +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ -After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` to add these new hosts to +After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` (or +``${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/overcloud.yml`` +if Kayobe environment is used) to add these new hosts to the correct groups, import them in Kayobe's inventory with: .. code-block:: console @@ -138,7 +140,7 @@ migrate as the process needs manual confirmation. You can do this with: .. code-block:: console - openstack # openstack server resize confirm + openstack# openstack server resize confirm The symptom to look out for is that the server is showing a status of ``VERIFY RESIZE`` as shown in this snippet of ``openstack server show ``: diff --git a/doc/source/operations/index.rst b/doc/source/operations/index.rst index 3bfc7e976a..fd21c8c690 100644 --- a/doc/source/operations/index.rst +++ b/doc/source/operations/index.rst @@ -11,7 +11,7 @@ This guide is for operators of the StackHPC Kayobe configuration project. control-plane-operation customising-horizon gpu-in-openstack - hardware-inventory-management + bifrost-hardware-inventory-management hotfix-playbook migrating-vm nova-compute-ironic From 5b12bd0a4eed855a9ab469f97cea0bf6f8639cdb Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:29:36 +0000 Subject: [PATCH 30/42] Add reference to monitoring doc --- doc/source/configuration/monitoring.rst | 2 ++ doc/source/operations/control-plane-operation.rst | 3 +++ 2 files changed, 5 insertions(+) diff --git a/doc/source/configuration/monitoring.rst b/doc/source/configuration/monitoring.rst index 8215dc48bf..77c5e47f77 100644 --- a/doc/source/configuration/monitoring.rst +++ b/doc/source/configuration/monitoring.rst @@ -2,6 +2,8 @@ Monitoring ========== +.. 
_monitoring-service-configuration:
+
Monitoring Configuration
========================
diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst
index cdb328a469..d056a95105 100644
--- a/doc/source/operations/control-plane-operation.rst
+++ b/doc/source/operations/control-plane-operation.rst
@@ -42,6 +42,9 @@ Ansible control host
Control Plane Monitoring
========================
+This section is a user guide for monitoring the control plane. To see how to
+configure monitoring services, read :ref:`monitoring-service-configuration`.
+
The control plane has been configured to collect logs centrally using Fluentd,
OpenSearch and OpenSearch Dashboards.
From b7b776f3ea655d1dfe71ecdc9633f2b218d12356 Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Mon, 4 Nov 2024 13:30:31 +0000
Subject: [PATCH 31/42] Use reboot playbook rather than shutdown command
---
 doc/source/operations/control-plane-operation.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst
index d056a95105..7e7711004d 100644
--- a/doc/source/operations/control-plane-operation.rst
+++ b/doc/source/operations/control-plane-operation.rst
@@ -210,11 +210,12 @@ following order:
Rebooting a node
----------------
+Use the ``reboot.yml`` playbook to reboot nodes.
Example: Reboot all compute hosts apart from compute0:
.. code-block:: console
- kayobe# kayobe overcloud host command run --limit 'compute:!compute0' -b --command "shutdown -r"
+ kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit 'compute:!compute0'
References
----------
From f7018d5c99a09955568d2fa365a1f010dc8fc34a Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Mon, 4 Nov 2024 13:31:03 +0000
Subject: [PATCH 32/42] Use env variable
---
 doc/source/operations/secret-rotation.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/doc/source/operations/secret-rotation.rst b/doc/source/operations/secret-rotation.rst
index 34fd33a72c..127635ab4d 100644
--- a/doc/source/operations/secret-rotation.rst
+++ b/doc/source/operations/secret-rotation.rst
@@ -104,8 +104,8 @@ Full method
3. Navigate to the directory containing your ``passwords.yml`` file
- (``kayobe-config/etc/kolla/passwords.yml`` OR
- ``kayobe-config/etc/kayobe/environments//kolla/passwords.yml``)
+ (``$KOLLA_CONFIG_PATH/passwords.yml`` OR
+ ``$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/kolla/passwords.yml``)
4. Create a file called ``deletelist.txt`` and populate it with this content (including all whitespace):
From 9e324bad5da08cd960353e57b6c736c22966ab87 Mon Sep 17 00:00:00 2001
From: Seunghun Lee
Date: Mon, 4 Nov 2024 13:32:16 +0000
Subject: [PATCH 33/42] Make Vault and Openstack reconfig doc refer each other
---
 doc/source/configuration/vault.rst            |  5 +++++
 .../operations/openstack-reconfiguration.rst  | 14 ++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/doc/source/configuration/vault.rst b/doc/source/configuration/vault.rst
index 893af246c3..d05513632e 100644
--- a/doc/source/configuration/vault.rst
+++ b/doc/source/configuration/vault.rst
@@ -1,3 +1,5 @@
+.. _hashicorp-vault:
+
================================
Hashicorp Vault for internal PKI
================================
@@ -111,6 +113,9 @@ Certificates generation
Create the external TLS certificates (testing only)
---------------------------------------------------
+This method should only be used for testing.
For external certificates on production system, +See `Installing External TLS Certificates `__. + Typically external API TLS certificates should be generated by a organisation's trusted internal or third-party CA. For test and development purposes it is possible to use Vault as a CA for the external API. diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index ab2a01fc8e..35729c8c5f 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -35,8 +35,14 @@ On each controller: Some services may store data in a dedicated Docker volume, which can be removed with ``docker volume rm``. -Installing TLS Certificates -=========================== +.. _installing-external-tls-certificates: + +Installing External TLS Certificates +==================================== + +This section explains the process of deploying external TLS. +For internal and backend TLS, see `Hashicorp Vault for internal PKI +`__. To configure TLS for the first time, we write the contents of a PEM file to the ``secrets.yml`` file as ``secrets_kolla_external_tls_cert``. @@ -81,8 +87,8 @@ See `Kolla-Ansible TLS guide `__ for further details. -Updating TLS Certificates -------------------------- +Updating External TLS Certificates +---------------------------------- Check the expiry date on an installed TLS certificate from a host that can reach the OpenStack APIs: From b9cefdf35df4d89399657ae436e87225347c9b67 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:32:54 +0000 Subject: [PATCH 34/42] Fix: Use RST syntax of Note --- doc/source/operations/openstack-reconfiguration.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 35729c8c5f..31c1c5acc8 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -97,8 +97,10 @@ reach the OpenStack APIs: openstack# openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates -*NOTE*: Prometheus Blackbox monitoring can check certificates automatically -and alert when expiry is approaching. +.. note:: + + Prometheus Blackbox monitoring can check certificates automatically + and alert when expiry is approaching. To update an existing certificate, for example when it has reached expiration, change the value of ``secrets_kolla_external_tls_cert``, in the same order as From 106b14f870cb026e5b0b5870c84f7fa3f39eb7e3 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Mon, 4 Nov 2024 13:33:23 +0000 Subject: [PATCH 35/42] Update to use some of upstream doc --- doc/source/operations/gpu-in-openstack.rst | 324 ++------------------- 1 file changed, 19 insertions(+), 305 deletions(-) diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 9270817187..330edfbb1d 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -2,35 +2,18 @@ Support for GPUs in OpenStack ============================= -NVIDIA Virtual GPU -################## +Virtual GPUs +############ BIOS configuration ------------------ -Intel -^^^^^ - -* Enable `VT-x` in the BIOS for virtualisation support. -* Enable `VT-d` in the BIOS for IOMMU support. - -Dell -^^^^ - -Enabling SR-IOV with `racadm`: - -.. 
code:: shell - - /opt/dell/srvadmin/bin/idracadm7 set BIOS.IntegratedDevices.SriovGlobalEnable Enabled - /opt/dell/srvadmin/bin/idracadm7 jobqueue create BIOS.Setup.1-1 - - +See upstream documentation: `BIOS configuration `__ Obtain driver from NVIDIA licensing portal -------------------------------------------- +------------------------------------------ -Download Nvidia GRID driver from `here `__ -(This requires a login). The file can either be placed on the :ref:`ansible control host` or :ref:`uploaded to pulp`. +See upstream documentation: `Obtain driver from NVIDIA licencing portal `__ .. _NVIDIA Pulp: @@ -52,7 +35,8 @@ running in a CI environment. The file will then be available at ``/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip``. You will need to set the ``vgpu_driver_url`` configuration option to this value: -.. code:: yaml +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/vgpu.yml # URL of GRID driver in pulp vgpu_driver_url: "{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" @@ -67,7 +51,8 @@ Placing the GRID driver on the ansible control host Copy the driver bundle to a known location on the ansible control host. Set the ``vgpu_driver_url`` configuration variable to reference this path using ``file`` as the url scheme e.g: -.. code:: yaml +.. code-block:: yaml + :caption: $KAYOBE_CONFIG_PATH/vgpu.yml # Location of NVIDIA GRID driver on localhost vgpu_driver_url: "file://{{ lookup('env', 'HOME') }}/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" @@ -81,24 +66,12 @@ OS Configuration Host OS configuration is done by using roles in the `stackhpc.linux `_ ansible collection. -Add the following to your ansible ``requirements.yml``: - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/ansible/requirements.yml - - #FIXME: Update to known release When VGPU and IOMMU roles have landed - collections: - - name: stackhpc.linux - source: git+https://github.com/stackhpc/ansible-collection-linux.git,preemptive/vgpu-iommu - type: git - Create a new playbook or update an existing on to apply the roles: .. code-block:: yaml :caption: $KAYOBE_CONFIG_PATH/ansible/host-configure.yml --- - - hosts: iommu tags: - iommu @@ -176,15 +149,6 @@ hosts can automatically be mapped to these groups by configuring Role Configuration ^^^^^^^^^^^^^^^^^^ -Configure the location of the NVIDIA driver: - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/vgpu.yml - - --- - - vgpu_driver_url: "http://{{ pulp_url }}/pulp/content/nvidia/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip" - Configure the VGPU devices: .. code-block:: yaml @@ -260,56 +224,8 @@ ensure you do not forget to run it when hosts are enrolled in the future. Kolla-Ansible configuration ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -To use the mdev devices that were created, modify nova.conf to add a list of mdev devices that -can be passed through to guests: - -.. 
code-block:: - :caption: $KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf - - {% if inventory_hostname in groups['compute_multi_instance_gpu'] %} - [devices] - enabled_mdev_types = nvidia-700, nvidia-699 - - [mdev_nvidia-700] - device_addresses = 0000:21:00.4,0000:21:00.5,0000:21:00.6,0000:81:00.4,0000:81:00.5,0000:81:00.6 - mdev_class = CUSTOM_NVIDIA_700 - - [mdev_nvidia-699] - device_addresses = 0000:21:00.7,0000:81:00.7 - mdev_class = CUSTOM_NVIDIA_699 - - {% elif inventory_hostname in groups['compute_vgpu'] %} - [devices] - enabled_mdev_types = nvidia-697 - - [mdev_nvidia-697] - device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 - # Custom resource classes don't work when you only have single resource type. - mdev_class = VGPU - - {% endif %} - -You will need to adjust the PCI addresses to match the virtual function -addresses. These can be obtained by checking the mdevctl configuration after -running the role: - -.. code-block:: shell - - # mdevctl list - - 73269d0f-b2c9-438d-8f28-f9e4bc6c6995 0000:17:00.4 nvidia-700 manual (defined) - dc352ef3-efeb-4a5d-a48e-912eb230bc76 0000:17:00.5 nvidia-700 manual (defined) - a464fbae-1f89-419a-a7bd-3a79c7b2eef4 0000:17:00.6 nvidia-700 manual (defined) - f3b823d3-97c8-4e0a-ae1b-1f102dcb3bce 0000:17:00.7 nvidia-699 manual (defined) - 330be289-ba3f-4416-8c8a-b46ba7e51284 0000:65:00.4 nvidia-700 manual (defined) - 1ba5392c-c61f-4f48-8fb1-4c6b2bbb0673 0000:65:00.5 nvidia-700 manual (defined) - f6868020-eb3a-49c6-9701-6c93e4e3fa9c 0000:65:00.6 nvidia-700 manual (defined) - 00501f37-c468-5ba4-8be2-8d653c4604ed 0000:65:00.7 nvidia-699 manual (defined) - -The mdev_class maps to a resource class that you can set in your flavor definition. -Note that if you only define a single mdev type on a given hypervisor, then the -mdev_class configuration option is silently ignored and it will use the ``VGPU`` -resource class (bug?). +See upstream documentation: `Kolla Ansible configuration `__ +then follow the rest. Map through the kayobe inventory groups into kolla: @@ -356,28 +272,7 @@ You will need to reconfigure nova for this change to be applied: Openstack flavors ^^^^^^^^^^^^^^^^^ -Define some flavors that request the resource class that was configured in nova.conf. -An example definition, that can be used with ``openstack.cloud.compute_flavor`` Ansible module, -is shown below: - -.. code-block:: yaml - - vgpu_a100_2g_20gb: - name: "vgpu.a100.2g.20gb" - ram: 65536 - disk: 30 - vcpus: 8 - is_public: false - extra_specs: - hw:cpu_policy: "dedicated" - hw:cpu_thread_policy: "prefer" - hw:mem_page_size: "1GB" - hw:cpu_sockets: 2 - hw:numa_nodes: 8 - hw_rng:allowed: "True" - resources:CUSTOM_NVIDIA_700: "1" - -You now should be able to launch a VM with this flavor. +See upstream documentation: `OpenStack flavors `__ NVIDIA License Server ^^^^^^^^^^^^^^^^^^^^^ @@ -667,123 +562,7 @@ Example output: Changing VGPU device types ^^^^^^^^^^^^^^^^^^^^^^^^^^ -Converting the second card to an NVIDIA-698 (whole card). The hypervisor -is empty so we can freely delete mdevs. First clean up the mdev -definition: - -.. 
code:: shell - - [stack@computegpu007 ~]$ sudo mdevctl list - 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (defined) - eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (defined) - 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual (defined) - 0a47ffd1-392e-5373-8428-707a4e0ce31a 0000:81:00.5 nvidia-697 manual (defined) - - [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 72291b01-689b-5b7a-9171-6b3480deabf4 - [stack@computegpu007 ~]$ sudo mdevctl stop --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a - - [stack@computegpu007 ~]$ sudo mdevctl undefine --uuid 0a47ffd1-392e-5373-8428-707a4e0ce31a - - [stack@computegpu007 ~]$ sudo mdevctl list --defined - 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual (active) - eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual (active) - 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-697 manual - - # We can re-use the first virtual function - -Secondly remove the systemd unit that starts the mdev device: - -.. code:: shell - - [stack@computegpu007 ~]$ sudo rm /etc/systemd/system/multi-user.target.wants/nvidia-mdev@0a47ffd1-392e-5373-8428-707a4e0ce31a.service - -Example config change: - -.. code:: shell - - diff --git a/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu - new file mode 100644 - index 0000000..6cea9bf - --- /dev/null - +++ b/etc/kayobe/environments/cci1/inventory/host_vars/computegpu007/vgpu - @@ -0,0 +1,12 @@ - +--- - +vgpu_definitions: - + - pci_address: "0000:21:00.0" - + virtual_functions: - + - mdev_type: nvidia-697 - + index: 0 - + - mdev_type: nvidia-697 - + index: 1 - + - pci_address: "0000:81:00.0" - + virtual_functions: - + - mdev_type: nvidia-698 - + index: 0 - diff --git a/etc/kayobe/kolla/config/nova/nova-compute.conf b/etc/kayobe/kolla/config/nova/nova-compute.conf - index 6f680cb..e663ec4 100644 - --- a/etc/kayobe/kolla/config/nova/nova-compute.conf - +++ b/etc/kayobe/kolla/config/nova/nova-compute.conf - @@ -39,7 +39,19 @@ cpu_mode = host-model - {% endraw %} - - {% raw %} - -{% if inventory_hostname in groups['compute_multi_instance_gpu'] %} - +{% if inventory_hostname == "computegpu007" %} - +[devices] - +enabled_mdev_types = nvidia-697, nvidia-698 - + - +[mdev_nvidia-697] - +device_addresses = 0000:21:00.4,0000:21:00.5 - +mdev_class = VGPU - + - +[mdev_nvidia-698] - +device_addresses = 0000:81:00.4 - +mdev_class = CUSTOM_NVIDIA_698 - + - +{% elif inventory_hostname in groups['compute_multi_instance_gpu'] %} - [devices] - enabled_mdev_types = nvidia-700, nvidia-699 - - @@ -50,15 +62,14 @@ mdev_class = CUSTOM_NVIDIA_700 - [mdev_nvidia-699] - device_addresses = 0000:21:00.7,0000:81:00.7 - mdev_class = CUSTOM_NVIDIA_699 - -{% endif %} - - -{% if inventory_hostname in groups['compute_vgpu'] %} - +{% elif inventory_hostname in groups['compute_vgpu'] %} - [devices] - enabled_mdev_types = nvidia-697 - - [mdev_nvidia-697] - device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5 - -# Custom resource classes don't seem to work for this card. - +# Custom resource classes don't work when you only have single resource type. - mdev_class = VGPU - - {% endif %} - -Re-run the configure playbook: - -.. code:: shell - - (kayobe) [stack@ansiblenode1 kayobe]$ kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml --tags vgpu --limit computegpu007 - -Check the result: - -.. 
code:: shell - - [stack@computegpu007 ~]$ mdevctl list - 5c630867-a673-5d75-aa31-a499e6c7cb19 0000:21:00.4 nvidia-697 manual - eaa6e018-308e-58e2-b351-aadbcf01f5a8 0000:21:00.5 nvidia-697 manual - 72291b01-689b-5b7a-9171-6b3480deabf4 0000:81:00.4 nvidia-698 manual - -Reconfigure nova to match the change: - -.. code:: shell - - kayobe overcloud service reconfigure -kt nova --kolla-limit computegpu007 --skip-prechecks - +See upstream documentation: `Changing VGPU device types `__ PCI Passthrough ############### @@ -986,81 +765,16 @@ IOMMU should be enabled at kernel level as well - we can verify that on the comp OpenStack Nova configuration ---------------------------- -Configure nova-scheduler -^^^^^^^^^^^^^^^^^^^^^^^^ - -The nova-scheduler service must be configured to enable the ``PciPassthroughFilter`` -To enable it add it to the list of filters to Kolla-Ansible configuration file: -``$KAYOBE_CONFIG_PATH/kolla/config/nova.conf``, for instance: - -.. code-block:: yaml - - [filter_scheduler] - available_filters = nova.scheduler.filters.all_filters - enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, PciPassthroughFilter - -Configure nova-compute -^^^^^^^^^^^^^^^^^^^^^^ - -Configuration can be applied in flexible ways using Kolla-Ansible's -methods for `inventory-driven customisation of configuration -`_. -The following configuration could be added to -``$KAYOBE_CONFIG_PATH/kolla/config/nova/nova-compute.conf`` to enable PCI -passthrough of GPU devices for hosts in a group named ``compute_gpu``. -Again, the 4-digit PCI Vendor ID and Device ID extracted from ``lspci --nn`` can be used here to specify the GPU device(s). - -.. code-block:: jinja - - [pci] - {% raw %} - {% if inventory_hostname in groups['compute_gpu'] %} - # We could support multiple models of GPU. - # This can be done more selectively using different inventory groups. - # GPU models defined here: - # NVidia Tesla V100 16GB - # NVidia Tesla V100 32GB - # NVidia Tesla P100 16GB - passthrough_whitelist = [{ "vendor_id":"10de", "product_id":"1db4" }, - { "vendor_id":"10de", "product_id":"1db5" }, - { "vendor_id":"10de", "product_id":"15f8" }] - alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } - alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } - alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } - {% endif %} - {% endraw %} - -Configure nova-api -^^^^^^^^^^^^^^^^^^ - -pci.alias also needs to be configured on the controller. -This configuration should match the configuration found on the compute nodes. -Add it to Kolla-Ansible configuration file: -``$KAYOBE_CONFIG_PATH/kolla/config/nova/nova-api.conf``, for instance: - -.. code-block:: yaml - - [pci] - alias = { "vendor_id":"10de", "product_id":"1db4", "device_type":"type-PCI", "name":"gpu-v100-16" } - alias = { "vendor_id":"10de", "product_id":"1db5", "device_type":"type-PCI", "name":"gpu-v100-32" } - alias = { "vendor_id":"10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100" } - -Reconfigure nova service -^^^^^^^^^^^^^^^^^^^^^^^^ - -.. 
code-block:: text - - kayobe overcloud service reconfigure --kolla-tags nova --kolla-skip-tags common --skip-prechecks +See upsteram Nova documentation: `Attaching physical PCI devices to guests `__ Configure a flavor ^^^^^^^^^^^^^^^^^^ -For example, to request two of the GPUs with alias gpu-p100 +For example, to request two of the GPUs with alias **a1** .. code-block:: text - openstack flavor set m1.medium --property "pci_passthrough:alias"="gpu-p100:2" + openstack flavor set m1.medium --property "pci_passthrough:alias"="a1:2" This can be also defined in the openstack-config repository @@ -1072,12 +786,12 @@ add extra_specs to flavor in etc/openstack-config/openstack-config.yml: admin# cd src/openstack-config admin# vim etc/openstack-config/openstack-config.yml - name: "m1.medium" + name: "m1.medium-gpu" ram: 4096 disk: 40 vcpus: 2 extra_specs: - "pci_passthrough:alias": "gpu-p100:2" + "pci_passthrough:alias": "a1:2" Invoke configuration playbooks afterwards: @@ -1092,7 +806,7 @@ Create instance with GPU passthrough .. code-block:: text - openstack server create --flavor m1.medium --image ubuntu2004 --wait test-pci + openstack server create --flavor m1.medium-gpu --image ubuntu22.04 --wait test-pci Testing GPU in a Guest VM ------------------------- From 15b575fa646c34b45218a7837c428261038d3b5d Mon Sep 17 00:00:00 2001 From: Seunghun Lee <45145778+seunghun1ee@users.noreply.github.com> Date: Thu, 7 Nov 2024 10:53:30 +0000 Subject: [PATCH 36/42] Better wordings on section intro Co-authored-by: Alex-Welsh <112560678+Alex-Welsh@users.noreply.github.com> --- doc/source/configuration/vault.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/configuration/vault.rst b/doc/source/configuration/vault.rst index d05513632e..61a5ab1c9e 100644 --- a/doc/source/configuration/vault.rst +++ b/doc/source/configuration/vault.rst @@ -113,7 +113,7 @@ Certificates generation Create the external TLS certificates (testing only) --------------------------------------------------- -This method should only be used for testing. For external certificates on production system, +This method should only be used for testing. For external TLS on production systems, See `Installing External TLS Certificates `__. Typically external API TLS certificates should be generated by a organisation's trusted internal or third-party CA. From 7703f9314dd7bbb87b42a12fff01830e20657a3b Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 11:08:02 +0000 Subject: [PATCH 37/42] Remove unnecessary curly brackets --- doc/source/configuration/ci-cd.rst | 14 +++++++------- doc/source/configuration/lvm.rst | 4 ++-- doc/source/configuration/swap.rst | 4 ++-- doc/source/contributor/pre-commit.rst | 6 +++--- .../bifrost-hardware-inventory-management.rst | 14 +++++++------- doc/source/operations/hotfix-playbook.rst | 4 ++-- doc/source/operations/octavia.rst | 4 ++-- .../operations/openstack-reconfiguration.rst | 6 +++--- 8 files changed, 28 insertions(+), 28 deletions(-) diff --git a/doc/source/configuration/ci-cd.rst b/doc/source/configuration/ci-cd.rst index 6e495c2e81..dcf86350e6 100644 --- a/doc/source/configuration/ci-cd.rst +++ b/doc/source/configuration/ci-cd.rst @@ -57,26 +57,26 @@ Runner Deployment Ideally an Infra VM could be used here or failing that the control host. Wherever it is deployed the host will need access to the :code:`admin_network`, :code:`public_network` and the :code:`pulp registry` on the seed. -2. 
Edit the environment's :code:`${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/inventory/groups` to add the predefined :code:`github-runners` group to :code:`infra-vms` +2. Edit the environment's :code:`$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/inventory/groups` to add the predefined :code:`github-runners` group to :code:`infra-vms` .. code-block:: ini [infra-vms:children] github-runners -3. Edit the environment's :code:`${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/inventory/hosts` to define the host(s) that will host the runners. +3. Edit the environment's :code:`$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/inventory/hosts` to define the host(s) that will host the runners. .. code-block:: ini [github-runners] prod-runner-01 -4. Provide all the relevant Kayobe :code:`group_vars` for :code:`github-runners` under :code:`${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/inventory/group_vars/github-runners` +4. Provide all the relevant Kayobe :code:`group_vars` for :code:`github-runners` under :code:`$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/inventory/group_vars/github-runners` * `infra-vms` ensuring all required `infra_vm_extra_network_interfaces` are defined * `network-interfaces` * `python-interpreter.yml` ensuring that `ansible_python_interpreter: /usr/bin/python3` has been set -5. Edit the ``${KAYOBE_CONFIG_PATH}/inventory/group_vars/github-runners/runners.yml`` file which will contain the variables required to deploy a series of runners. +5. Edit the ``$KAYOBE_CONFIG_PATH/inventory/group_vars/github-runners/runners.yml`` file which will contain the variables required to deploy a series of runners. Below is a core set of variables that will require consideration and modification for successful deployment of the runners. The number of runners deployed can be configured by removing and extending the dict :code:`github-runners`. As for how many runners present three is suitable number as this would prevent situations where long running jobs could halt progress other tasks whilst waiting for a free runner. @@ -120,7 +120,7 @@ Runner Deployment 7. If the host is an actual Infra VM then please refer to upstream `Infrastructure VMs `__ documentation for additional configuration and steps. -8. Run :code:`kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/deploy-github-runner.yml` +8. Run :code:`kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/deploy-github-runner.yml` 9. Check runners have registered properly by visiting the repository's :code:`Action` tab -> :code:`Runners` -> :code:`Self-hosted runners` @@ -130,9 +130,9 @@ Runner Deployment Workflow Deployment ------------------- -1. Edit :code:`${KAYOBE_CONFIG_PATH}/inventory/group_vars/github-writer/writer.yml` in the base configuration making the appropriate changes to your deployments specific needs. See documentation for `stackhpc.kayobe_workflows.github `__. +1. Edit :code:`$KAYOBE_CONFIG_PATH/inventory/group_vars/github-writer/writer.yml` in the base configuration making the appropriate changes to your deployments specific needs. See documentation for `stackhpc.kayobe_workflows.github `__. -2. Run :code:`kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/write-github-workflows.yml` +2. Run :code:`kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/write-github-workflows.yml` 3. 
Add all required secrets and variables to repository either via the GitHub UI or GitHub CLI (may require repository owner) diff --git a/doc/source/configuration/lvm.rst b/doc/source/configuration/lvm.rst index a96ca8db99..bb2b7862c4 100644 --- a/doc/source/configuration/lvm.rst +++ b/doc/source/configuration/lvm.rst @@ -93,6 +93,6 @@ hosts: .. code-block:: console - mkdir -p ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/pre.d - cd ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/pre.d + mkdir -p $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/pre.d + cd $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/pre.d ln -s ../../../ansible/growroot.yml 30-growroot.yml diff --git a/doc/source/configuration/swap.rst b/doc/source/configuration/swap.rst index 58545e9066..2419195744 100644 --- a/doc/source/configuration/swap.rst +++ b/doc/source/configuration/swap.rst @@ -23,6 +23,6 @@ hosts: .. code-block:: console - mkdir -p ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/post.d - cd ${KAYOBE_CONFIG_PATH}/hooks/overcloud-host-configure/post.d + mkdir -p $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d + cd $KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d ln -s ../../../ansible/swap.yml 10-swap.yml diff --git a/doc/source/contributor/pre-commit.rst b/doc/source/contributor/pre-commit.rst index 3afffc11b4..dc9f691bf6 100644 --- a/doc/source/contributor/pre-commit.rst +++ b/doc/source/contributor/pre-commit.rst @@ -29,12 +29,12 @@ Once done you should find `pre-commit` is available within the `kayobe` virtuale To run the playbook using the following command -- ``kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/install-pre-commit-hooks.yml`` +- ``kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/install-pre-commit-hooks.yml`` Whereas to run the playbook when control host bootstrap runs ensure it registered as symlink using the following command -- ``mkdir -p ${KAYOBE_CONFIG_PATH}/hooks/control-host-bootstrap/post.d`` -- ``ln -s ${KAYOBE_CONFIG_PATH}/ansible/install-pre-commit-hooks.yml ${KAYOBE_CONFIG_PATH}/hooks/control-host-bootstrap/post.d/install-pre-commit-hooks.yml`` +- ``mkdir -p $KAYOBE_CONFIG_PATH/hooks/control-host-bootstrap/post.d`` +- ``ln -s $KAYOBE_CONFIG_PATH/ansible/install-pre-commit-hooks.yml $KAYOBE_CONFIG_PATH/hooks/control-host-bootstrap/post.d/install-pre-commit-hooks.yml`` All that remains is the installation of the hooks themselves which can be accomplished either by running `pre-commit run` or using `git commit` when you have changes that need to be committed. diff --git a/doc/source/operations/bifrost-hardware-inventory-management.rst b/doc/source/operations/bifrost-hardware-inventory-management.rst index 970e828aee..ba6e6bb253 100644 --- a/doc/source/operations/bifrost-hardware-inventory-management.rst +++ b/doc/source/operations/bifrost-hardware-inventory-management.rst @@ -26,7 +26,7 @@ configured to network boot on the provisioning network, the following commands will instruct them to PXE boot. The nodes will boot on the Ironic Python Agent kernel and ramdisk, which is configured to extract hardware information and send it to Bifrost. Note that IPMI credentials can be found in the encrypted -file located at ``${KAYOBE_CONFIG_PATH}/secrets.yml``. +file located at ``$KAYOBE_CONFIG_PATH/secrets.yml``. .. 
code-block:: console @@ -56,8 +56,8 @@ in Bifrost: | da0c61af-b411-41b9-8909-df2509f2059b | example-hypervisor-01 | None | power off | enroll | False | +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ -After editing ``${KAYOBE_CONFIG_PATH}/overcloud.yml`` (or -``${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/overcloud.yml`` +After editing ``$KAYOBE_CONFIG_PATH/overcloud.yml`` (or +``$KAYOBE_CONFIG_PATH/environments/$KAYOBE_ENVIRONMENT/overcloud.yml`` if Kayobe environment is used) to add these new hosts to the correct groups, import them in Kayobe's inventory with: @@ -201,7 +201,7 @@ To build ipa image with extra-hardware you need to edit ``ipa.yml`` and add thi - "extra-hardware" Extract introspection data from Bifrost with Kayobe. JSON files will be created -into ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``: +into ``$KAYOBE_CONFIG_PATH/overcloud-introspection-data``: .. code-block:: console @@ -210,7 +210,7 @@ into ``${KAYOBE_CONFIG_PATH}/overcloud-introspection-data``: Using ADVise ------------ -The Ansible playbook ``advise-run.yml`` can be found at ``${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml``. +The Ansible playbook ``advise-run.yml`` can be found at ``$KAYOBE_CONFIG_PATH/ansible/advise-run.yml``. The playbook will: @@ -220,8 +220,8 @@ The playbook will: .. code-block:: console - cd ${KAYOBE_CONFIG_PATH} - ansible-playbook ${KAYOBE_CONFIG_PATH}/ansible/advise-run.yml + cd $KAYOBE_CONFIG_PATH + ansible-playbook $KAYOBE_CONFIG_PATH/ansible/advise-run.yml The playbook has the following optional parameters: diff --git a/doc/source/operations/hotfix-playbook.rst b/doc/source/operations/hotfix-playbook.rst index ee4d9df012..8f7c6145e3 100644 --- a/doc/source/operations/hotfix-playbook.rst +++ b/doc/source/operations/hotfix-playbook.rst @@ -20,7 +20,7 @@ The playbook can be invoked with: .. code-block:: console - kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/hotfix-containers.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/hotfix-containers.yml Playbook variables: ------------------- @@ -49,7 +49,7 @@ to a file, then add them as an extra var. e.g: .. code-block:: console - kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/hotfix-containers.yml -e "@~/vars.yml" + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/hotfix-containers.yml -e "@~/vars.yml" Example Variables file diff --git a/doc/source/operations/octavia.rst b/doc/source/operations/octavia.rst index f884d130f1..e13b0a1b3a 100644 --- a/doc/source/operations/octavia.rst +++ b/doc/source/operations/octavia.rst @@ -12,7 +12,7 @@ With your kayobe environment activated, you can build a new amphora image with: .. code-block:: console - kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/octavia-amphora-image-build.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/octavia-amphora-image-build.yml The resultant image is based on Ubuntu. By default the image will be built on the seed, but it is possible to change the group in the ansible inventory using the @@ -29,7 +29,7 @@ You can then run the playbook to upload the image: .. code-block:: console - kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/octavia-amphora-image-register.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/octavia-amphora-image-register.yml This will rename the old image by adding a timestamp suffix, before uploading a new image with the name, ``amphora-x64-haproxy``. 
Octavia should be configured diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 31c1c5acc8..7a86a7e707 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -10,7 +10,7 @@ service is handled less well, because of Ansible's imperative style. To remove a service, it is disabled in Kayobe's Kolla config, which prevents other services from communicating with it. For example, to disable -``cinder-backup``, edit ``${KAYOBE_CONFIG_PATH}/kolla.yml``: +``cinder-backup``, edit ``$KAYOBE_CONFIG_PATH/kolla.yml``: .. code-block:: diff @@ -50,7 +50,7 @@ Use a command of this form: .. code-block:: console - kayobe# ansible-vault edit ${KAYOBE_CONFIG_PATH}/secrets.yml --vault-password-file= + kayobe# ansible-vault edit $KAYOBE_CONFIG_PATH/secrets.yml --vault-password-file= Concatenate the contents of the certificate and key files to create ``secrets_kolla_external_tls_cert``. The certificates should be installed in @@ -60,7 +60,7 @@ this order: * Any intermediate certificates * The TLS certificate private key -In ``${KAYOBE_CONFIG_PATH}/kolla.yml``, set the following: +In ``$KAYOBE_CONFIG_PATH/kolla.yml``, set the following: .. code-block:: yaml From 3eb65370b34e30ea3a0613faaa561492cf10e6fd Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 11:15:02 +0000 Subject: [PATCH 38/42] Add note of reconfiguring monitoring service --- .../operations/bifrost-hardware-inventory-management.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/doc/source/operations/bifrost-hardware-inventory-management.rst b/doc/source/operations/bifrost-hardware-inventory-management.rst index ba6e6bb253..3ec0dd2653 100644 --- a/doc/source/operations/bifrost-hardware-inventory-management.rst +++ b/doc/source/operations/bifrost-hardware-inventory-management.rst @@ -73,6 +73,11 @@ We can then provision and configure them: kayobe# kayobe overcloud host configure --limit kayobe# kayobe overcloud service deploy --limit --kolla-limit +.. note:: + + Reconfiguring monitoring services on controllers is required after provisioning them. + Otherwise, they will not show up. + Replacing a Failing Hypervisor ------------------------------ From 8f001c4c4d976371ff768b151b2c96ad4153fdf5 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 11:16:29 +0000 Subject: [PATCH 39/42] Fix spacing --- doc/source/operations/migrating-vm.rst | 2 +- doc/source/operations/openstack-reconfiguration.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/source/operations/migrating-vm.rst b/doc/source/operations/migrating-vm.rst index 784abe74a4..fd5260286f 100644 --- a/doc/source/operations/migrating-vm.rst +++ b/doc/source/operations/migrating-vm.rst @@ -19,4 +19,4 @@ To move a virtual machine with local disks: .. 
code-block:: console - admin# openstack server migrate --live-migration --block-migration --host hypervisor-01 + admin# openstack server migrate --live-migration --block-migration --host hypervisor-01 diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 7a86a7e707..771103cac3 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -104,7 +104,7 @@ reach the OpenStack APIs: To update an existing certificate, for example when it has reached expiration, change the value of ``secrets_kolla_external_tls_cert``, in the same order as -above. Run the following command: +above. Run the following command: .. code-block:: console From 869fb57696efbaf2a93f9151c3ce14348bbecf04 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 11:41:17 +0000 Subject: [PATCH 40/42] Remove command prefixes --- .../bifrost-hardware-inventory-management.rst | 32 +++++++++---------- .../operations/control-plane-operation.rst | 28 ++++++++-------- doc/source/operations/gpu-in-openstack.rst | 10 +++--- doc/source/operations/migrating-vm.rst | 6 ++-- .../operations/openstack-reconfiguration.rst | 24 +++++++------- 5 files changed, 49 insertions(+), 51 deletions(-) diff --git a/doc/source/operations/bifrost-hardware-inventory-management.rst b/doc/source/operations/bifrost-hardware-inventory-management.rst index 3ec0dd2653..6730418192 100644 --- a/doc/source/operations/bifrost-hardware-inventory-management.rst +++ b/doc/source/operations/bifrost-hardware-inventory-management.rst @@ -13,7 +13,7 @@ can be reinspected like this: .. code-block:: console - kayobe# kayobe overcloud hardware inspect --limit + kayobe overcloud hardware inspect --limit .. _enrolling-new-hypervisors: @@ -30,26 +30,26 @@ file located at ``$KAYOBE_CONFIG_PATH/secrets.yml``. .. code-block:: console - bifrost# ipmitool -I lanplus -U -H -ipmi chassis bootdev pxe + ipmitool -I lanplus -U -H -ipmi chassis bootdev pxe If node is are off, power them on: .. code-block:: console - bifrost# ipmitool -I lanplus -U -H -ipmi power on + ipmitool -I lanplus -U -H -ipmi power on If nodes is on, reset them: .. code-block:: console - bifrost# ipmitool -I lanplus -U -H -ipmi power reset + ipmitool -I lanplus -U -H -ipmi power reset Once node have booted and have completed introspection, they should be visible in Bifrost: .. code-block:: console - bifrost# baremetal node list --provision-state enroll + baremetal node list --provision-state enroll +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance | +--------------------------------------+-----------------------+---------------+-------------+--------------------+-------------+ @@ -63,15 +63,15 @@ the correct groups, import them in Kayobe's inventory with: .. code-block:: console - kayobe# kayobe overcloud inventory discover + kayobe overcloud inventory discover We can then provision and configure them: .. code-block:: console - kayobe# kayobe overcloud provision --limit - kayobe# kayobe overcloud host configure --limit - kayobe# kayobe overcloud service deploy --limit --kolla-limit + kayobe overcloud provision --limit + kayobe overcloud host configure --limit + kayobe overcloud service deploy --limit --kolla-limit .. note:: @@ -94,7 +94,7 @@ To deprovision an existing hypervisor, run: .. 
code-block:: console - kayobe# kayobe overcloud deprovision --limit + kayobe overcloud deprovision --limit .. warning:: @@ -109,14 +109,14 @@ Evacuating all instances .. code-block:: console - admin# openstack server evacuate $(openstack server list --host --format value --column ID) + openstack server evacuate $(openstack server list --host --format value --column ID) You should now check the status of all the instances that were running on that hypervisor. They should all show the status ACTIVE. This can be verified with: .. code-block:: console - admin# openstack server show + openstack server show Troubleshooting =============== @@ -145,7 +145,7 @@ migrate as the process needs manual confirmation. You can do this with: .. code-block:: console - openstack# openstack server resize confirm + openstack server resize confirm The symptom to look out for is that the server is showing a status of ``VERIFY RESIZE`` as shown in this snippet of ``openstack server show ``: @@ -161,7 +161,7 @@ Set maintenance mode on a node in Bifrost .. code-block:: console - seed# docker exec -it bifrost_deploy /bin/bash + docker exec -it bifrost_deploy /bin/bash (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance set @@ -172,7 +172,7 @@ Unset maintenance mode on a node in Bifrost .. code-block:: console - seed# docker exec -it bifrost_deploy /bin/bash + docker exec -it bifrost_deploy /bin/bash (bifrost-deploy)[root@seed bifrost-base]# export OS_CLOUD=bifrost (bifrost-deploy)[root@seed bifrost-base]# baremetal node maintenance unset @@ -210,7 +210,7 @@ into ``$KAYOBE_CONFIG_PATH/overcloud-introspection-data``: .. code-block:: console - kayobe# kayobe overcloud introspection data save + kayobe overcloud introspection data save Using ADVise ------------ diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index 7e7711004d..e2de527095 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -120,7 +120,7 @@ The password can be found using: .. code-block:: console - kayobe# ansible-vault view $KAYOBE_CONFIG_PATH/kolla/passwords.yml \ + ansible-vault view $KAYOBE_CONFIG_PATH/kolla/passwords.yml \ --vault-password-file | grep ^database Checking RabbitMQ @@ -188,7 +188,7 @@ Shutting down the seed VM .. code-block:: console - kayobe# virsh shutdown + virsh shutdown .. _full-shutdown: @@ -215,7 +215,7 @@ Example: Reboot all compute hosts apart from compute0: .. code-block:: console - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit 'compute:!compute0' + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/reboot.yml --limit 'compute:!compute0' References ---------- @@ -242,7 +242,7 @@ with the following command: .. code-block:: console - kayobe# kayobe overcloud database recover + kayobe overcloud database recover Ansible Control Host -------------------- @@ -258,7 +258,7 @@ hypervisor is powered on. If it does not, it can be started with: .. code-block:: console - kayobe# virsh start + virsh start Full power on ------------- @@ -275,7 +275,7 @@ Log into the monitoring host(s): .. code-block:: console - kayobe# ssh stack@monitoring0 + ssh stack@monitoring0 Stop all Docker containers: @@ -312,22 +312,22 @@ To sync host packages: .. 
code-block:: console - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-sync.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-publish.yml If the system is production environment and want to use packages tested in test/staging environment, you can promote them by: .. code-block:: console - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-repo-promote-production.yml To sync container images: .. code-block:: console - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml - kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-sync.yml + kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/pulp-container-publish.yml For more information about StackHPC Release Train, see :ref:`stackhpc-release-train` documentation. @@ -341,8 +341,8 @@ Host packages can be updated with: .. code-block:: console - kayobe# kayobe overcloud host package update --limit --packages '*' - kayobe# kayobe seed host package update --packages '*' + kayobe overcloud host package update --limit --packages '*' + kayobe seed host package update --packages '*' See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages @@ -387,7 +387,7 @@ Reconfigure Opensearch with new values: .. code-block:: console - kayobe# kayobe overcloud service reconfigure --kolla-tags opensearch + kayobe overcloud service reconfigure --kolla-tags opensearch For more information see the `upstream documentation `__. diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 330edfbb1d..2d7e30dee1 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -783,8 +783,8 @@ add extra_specs to flavor in etc/openstack-config/openstack-config.yml: .. code-block:: console - admin# cd src/openstack-config - admin# vim etc/openstack-config/openstack-config.yml + cd src/openstack-config + vim etc/openstack-config/openstack-config.yml name: "m1.medium-gpu" ram: 4096 @@ -797,9 +797,9 @@ Invoke configuration playbooks afterwards: .. code-block:: console - admin# source src/kayobe-config/etc/kolla/public-openrc.sh - admin# source venvs/openstack/bin/activate - admin# tools/openstack-config --vault-password-file + source src/kayobe-config/etc/kolla/public-openrc.sh + source venvs/openstack/bin/activate + tools/openstack-config --vault-password-file Create instance with GPU passthrough ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/doc/source/operations/migrating-vm.rst b/doc/source/operations/migrating-vm.rst index fd5260286f..031df1b609 100644 --- a/doc/source/operations/migrating-vm.rst +++ b/doc/source/operations/migrating-vm.rst @@ -6,17 +6,17 @@ To see where all virtual machines are running on the hypervisors: .. code-block:: console - admin# openstack server list --all-projects --long + openstack server list --all-projects --long To move a virtual machine with shared storage or booted from volume from one hypervisor to another, for example to hypervisor-01: .. 
code-block:: console - admin# openstack server migrate --live-migration --host hypervisor-01 + openstack server migrate --live-migration --host hypervisor-01 To move a virtual machine with local disks: .. code-block:: console - admin# openstack server migrate --live-migration --block-migration --host hypervisor-01 + openstack server migrate --live-migration --block-migration --host hypervisor-01 diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 771103cac3..36bcece666 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -21,7 +21,7 @@ Then, reconfigure Cinder services with Kayobe: .. code-block:: console - kayobe# kayobe overcloud service reconfigure --kolla-tags cinder + kayobe overcloud service reconfigure --kolla-tags cinder However, the service itself, no longer in Ansible's manifest of managed state, must be manually stopped and prevented from restarting. @@ -30,7 +30,7 @@ On each controller: .. code-block:: console - kayobe# docker rm -f cinder_backup + docker rm -f cinder_backup Some services may store data in a dedicated Docker volume, which can be removed with ``docker volume rm``. @@ -50,7 +50,7 @@ Use a command of this form: .. code-block:: console - kayobe# ansible-vault edit $KAYOBE_CONFIG_PATH/secrets.yml --vault-password-file= + ansible-vault edit $KAYOBE_CONFIG_PATH/secrets.yml --vault-password-file= Concatenate the contents of the certificate and key files to create ``secrets_kolla_external_tls_cert``. The certificates should be installed in @@ -72,7 +72,7 @@ be updated in Keystone: .. code-block:: console - kayobe# kayobe overcloud service reconfigure + kayobe overcloud service reconfigure Alternative Configuration ------------------------- @@ -95,7 +95,7 @@ reach the OpenStack APIs: .. code-block:: console - openstack# openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates + openssl s_client -connect :443 2> /dev/null | openssl x509 -noout -dates .. note:: @@ -108,7 +108,7 @@ above. Run the following command: .. code-block:: console - kayobe# kayobe overcloud service reconfigure --kolla-tags haproxy + kayobe overcloud service reconfigure --kolla-tags haproxy .. _taking-a-hypervisor-out-of-service: @@ -119,8 +119,7 @@ To take a hypervisor out of Nova scheduling: .. code-block:: console - admin# openstack compute service set --disable \ - nova-compute + openstack compute service set --disable nova-compute Running instances on the hypervisor will not be affected, but new instances will not be deployed on it. @@ -130,19 +129,18 @@ A reason for disabling a hypervisor can be documented with the .. code-block:: console - admin# openstack compute service set --disable \ - --disable-reason "Broken drive" nova-compute + openstack compute service set --disable \ + --disable-reason "Broken drive" nova-compute Details about all hypervisors and the reasons they are disabled can be displayed with: .. code-block:: console - admin# openstack compute service list --long + openstack compute service list --long And then to enable a hypervisor again: .. 
code-block:: console - admin# openstack compute service set --enable \ - nova-compute + openstack compute service set --enable nova-compute From 1788f7b60f063434b9117e6d218a3ddc2cdd1e51 Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 7 Nov 2024 12:02:26 +0000 Subject: [PATCH 41/42] Add warning of brief downtime --- doc/source/operations/openstack-reconfiguration.rst | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/doc/source/operations/openstack-reconfiguration.rst b/doc/source/operations/openstack-reconfiguration.rst index 36bcece666..392b92421d 100644 --- a/doc/source/operations/openstack-reconfiguration.rst +++ b/doc/source/operations/openstack-reconfiguration.rst @@ -106,6 +106,10 @@ To update an existing certificate, for example when it has reached expiration, change the value of ``secrets_kolla_external_tls_cert``, in the same order as above. Run the following command: +.. warning:: + + Services can be briefly unavailable during reconfiguring HAProxy. + .. code-block:: console kayobe overcloud service reconfigure --kolla-tags haproxy From a6872b57c75cae27d4e2cb51e0622202bb6eee6a Mon Sep 17 00:00:00 2001 From: Seunghun Lee Date: Thu, 5 Dec 2024 11:34:45 +0000 Subject: [PATCH 42/42] Remove outdated information --- doc/source/operations/ceph-management.rst | 2 +- .../operations/control-plane-operation.rst | 9 +----- doc/source/operations/gpu-in-openstack.rst | 29 ------------------- 3 files changed, 2 insertions(+), 38 deletions(-) diff --git a/doc/source/operations/ceph-management.rst b/doc/source/operations/ceph-management.rst index e48a8d3e02..98988959b7 100644 --- a/doc/source/operations/ceph-management.rst +++ b/doc/source/operations/ceph-management.rst @@ -120,7 +120,7 @@ Ceph can report details about failed OSDs by running: sudo cephadm shell ceph health detail -.. note :: +.. note:: Remember to run ceph/rbd commands from within ``cephadm shell`` (preferred method) or after installing Ceph client. Details in the diff --git a/doc/source/operations/control-plane-operation.rst b/doc/source/operations/control-plane-operation.rst index e2de527095..3dfd1ec44b 100644 --- a/doc/source/operations/control-plane-operation.rst +++ b/doc/source/operations/control-plane-operation.rst @@ -81,12 +81,6 @@ Then, create another silence to match ``hostname=`` (this is required because, for the OpenStack exporter, the instance is the host running the monitoring service rather than the host being monitored). -.. note:: - - After creating the silence, you may get redirected to a 404 page. This is a - `known issue `__ - when running several Alertmanager instances behind HAProxy. - Control Plane Shutdown Procedure ================================ @@ -353,8 +347,7 @@ Deploying to a Specific Hypervisor ---------------------------------- To test creating an instance on a specific hypervisor, *as an admin-level user* -you can specify the hypervisor name as part of an extended availability zone -description. +you can specify the hypervisor name. To see the list of hypervisor names: diff --git a/doc/source/operations/gpu-in-openstack.rst b/doc/source/operations/gpu-in-openstack.rst index 2d7e30dee1..1fd99d30d1 100644 --- a/doc/source/operations/gpu-in-openstack.rst +++ b/doc/source/operations/gpu-in-openstack.rst @@ -190,35 +190,6 @@ Configure the VGPU devices: - mdev_type: nvidia-697 index: 1 -Running the playbook -^^^^^^^^^^^^^^^^^^^^ - -The playbook defined in the :ref:`previous step` -should be run after `kayobe overcloud host configure` has completed. 
This will -ensure the host has been fully bootstrapped. With default settings, internet -connectivity is required to download `MIG Partition Editor for NVIDIA GPUs`. If -this is not desirable, you can override the one of the following variables -(depending on host OS): - -.. code-block:: yaml - :caption: $KAYOBE_CONFIG_PATH/inventory/group_vars/compute_vgpu/vgpu - - vgpu_nvidia_mig_manager_rpm_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager-0.5.1-1.x86_64.rpm" - vgpu_nvidia_mig_manager_deb_url: "https://github.com/NVIDIA/mig-parted/releases/download/v0.5.1/nvidia-mig-manager_0.5.1-1_amd64.deb" - -For example, you may wish to upload these artifacts to the local pulp. - -Run the playbook that you defined earlier: - -.. code-block:: shell - - kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/host-configure.yml - -Note: This will reboot the hosts on first run. - -The playbook may be added as a hook in ``$KAYOBE_CONFIG_PATH/hooks/overcloud-host-configure/post.d``; this will -ensure you do not forget to run it when hosts are enrolled in the future. - .. _NVIDIA Kolla Ansible Configuration: Kolla-Ansible configuration