 68fbace8af
			
		
	
	68fbace8af
	
	
	
		
			
			This change fixes duplicate consecutive words from docs as well as code. Signed-off-by: Rajesh Tailor <ratailor@redhat.com> Change-Id: I236ff41fccf831023b6f85840097148a30e84743
		
			
				
	
	
		
			754 lines
		
	
	
		
			33 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			754 lines
		
	
	
		
			33 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| ========================================
 | |
| Attaching physical PCI devices to guests
 | |
| ========================================
 | |
| 
 | |
| The PCI passthrough feature in OpenStack allows full access and direct control
 | |
| of a physical PCI device in guests. This mechanism is generic for any kind of
 | |
| PCI device, and runs with a Network Interface Card (NIC), Graphics Processing
 | |
| Unit (GPU), or any other devices that can be attached to a PCI bus. Correct
 | |
| driver installation is the only requirement for the guest to properly use the
 | |
| devices.
 | |
| 
 | |
| Some PCI devices provide Single Root I/O Virtualization and Sharing (SR-IOV)
 | |
| capabilities. When SR-IOV is used, a physical device is virtualized and appears
 | |
| as multiple PCI devices. Virtual PCI devices are assigned to the same or
 | |
| different guests. In the case of PCI passthrough, the full physical device is
 | |
| assigned to only one guest and cannot be shared.
 | |
| 
 | |
| PCI devices are requested through flavor extra specs, specifically via the
 | |
| :nova:extra-spec:`pci_passthrough:alias` flavor extra spec.
 | |
| This guide demonstrates how to enable PCI passthrough for a type of PCI device
 | |
| with a vendor ID of ``8086`` and a product ID of ``154d`` - an Intel X520
 | |
| Network Adapter - by mapping them to the alias ``a1``.
 | |
| You should adjust the instructions for other devices with potentially different
 | |
| capabilities.
 | |
| 
 | |
| .. note::
 | |
| 
 | |
|    For information on creating servers with SR-IOV network interfaces, refer to
 | |
|    the :neutron-doc:`Networking Guide <admin/config-sriov>`.
 | |
| 
 | |
|    **Limitations**
 | |
| 
 | |
|    * Attaching SR-IOV ports to existing servers was not supported until the
 | |
|      22.0.0 Victoria release. Due to various bugs in libvirt and qemu we
 | |
|      recommend to use at least libvirt version 6.0.0 and at least qemu version
 | |
|      4.2.
 | |
|    * Cold migration (resize) of servers with SR-IOV devices attached was not
 | |
|      supported until the 14.0.0 Newton release, see
 | |
|      `bug 1512800 <https://bugs.launchpad.net/nova/+bug/1512880>`_ for details.
 | |
| 
 | |
| .. note::
 | |
| 
 | |
|    Nova only supports PCI addresses where the fields are restricted to the
 | |
|    following maximum value:
 | |
| 
 | |
|    * domain - 0xFFFF
 | |
|    * bus - 0xFF
 | |
|    * slot - 0x1F
 | |
|    * function - 0x7
 | |
| 
 | |
|    Nova will ignore PCI devices reported by the hypervisor if the address is
 | |
|    outside of these ranges.
 | |
| 
 | |
| .. versionchanged:: 26.0.0 (Zed):
 | |
|   PCI passthrough device inventories now can be tracked in Placement.
 | |
|   For more information, refer to :ref:`pci-tracking-in-placement`.
 | |
| 
 | |
| .. versionchanged:: 26.0.0 (Zed):
 | |
|   The nova-compute service will refuse to start if both the parent PF and its
 | |
|   children VFs are configured in :oslo.config:option:`pci.device_spec`.
 | |
|   For more information, refer to :ref:`pci-tracking-in-placement`.
 | |
| 
 | |
| .. versionchanged:: 26.0.0 (Zed):
 | |
|   The nova-compute service will refuse to start with
 | |
|   :oslo.config:option:`pci.device_spec` configuration that uses the
 | |
|   ``devname`` field.
 | |
| 
 | |
| .. versionchanged:: 27.0.0 (2023.1 Antelope):
 | |
|    Nova provides Placement based scheduling support for servers with flavor
 | |
|    based PCI requests. This support is disable by default.
 | |
| 
 | |
| .. versionchanged:: 31.0.0 (2025.1 Epoxy):
 | |
| 
 | |
|    * Add managed tag to define if the PCI device is managed (attached/detached
 | |
|      from the host) by libvirt. This is required to support SR-IOV devices
 | |
|      using the new kernel variant driver interface.
 | |
|    * Add a live_migratable tag to define whether a PCI device supports live
 | |
|      migration.
 | |
|    * Add a live_migratable tag to alias definitions to allow requesting either
 | |
|      a live-migratable or non-live-migratable device.
 | |
| 
 | |
| Enabling PCI passthrough
 | |
| ------------------------
 | |
| 
 | |
| Configure compute host
 | |
| ~~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| To enable PCI passthrough on an x86, Linux-based compute node, the following
 | |
| are required:
 | |
| 
 | |
| * VT-d enabled in the BIOS
 | |
| * IOMMU enabled on the host OS, e.g. by adding the ``intel_iommu=on`` or
 | |
|   ``amd_iommu=on`` parameter to the kernel parameters
 | |
| * Assignable PCIe devices
 | |
| 
 | |
| Configure ``nova-compute``
 | |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| Once PCI passthrough has been configured for the host, :program:`nova-compute`
 | |
| must be configured to allow the PCI device to pass through to VMs. This is done
 | |
| using the :oslo.config:option:`pci.device_spec` option. For example,
 | |
| assuming our sample PCI device has a PCI address of ``41:00.0`` on each host:
 | |
| 
 | |
| .. code-block:: ini
 | |
| 
 | |
|    [pci]
 | |
|    device_spec = { "address": "0000:41:00.0" }
 | |
| 
 | |
| Refer to :oslo.config:option:`pci.device_spec` for syntax information.
 | |
| 
 | |
| Alternatively, to enable passthrough of all devices with the same product and
 | |
| vendor ID:
 | |
| 
 | |
| .. code-block:: ini
 | |
| 
 | |
|    [pci]
 | |
|    device_spec = { "vendor_id": "8086", "product_id": "154d" }
 | |
| 
 | |
| If using vendor and product IDs, all PCI devices matching the ``vendor_id`` and
 | |
| ``product_id`` are added to the pool of PCI devices available for passthrough
 | |
| to VMs.
 | |
| 
 | |
| In addition, it is necessary to configure the :oslo.config:option:`pci.alias`
 | |
| option, which is a JSON-style configuration option that allows you to map a
 | |
| given device type, identified by the standard PCI ``vendor_id`` and (optional)
 | |
| ``product_id`` fields, to an arbitrary name or *alias*. This alias can then be
 | |
| used to request a PCI device using the :nova:extra-spec:`pci_passthrough:alias`
 | |
| flavor extra spec, as discussed previously.
 | |
| For our sample device with a vendor ID of ``0x8086`` and a product ID of
 | |
| ``0x154d``, this would be:
 | |
| 
 | |
| .. code-block:: ini
 | |
| 
 | |
|    [pci]
 | |
|    alias = { "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", "name":"a1" }
 | |
| 
 | |
| It's important to note the addition of the ``device_type`` field. This is
 | |
| necessary because this PCI device supports SR-IOV. The ``nova-compute`` service
 | |
| categorizes devices into one of three types, depending on the capabilities the
 | |
| devices report:
 | |
| 
 | |
| ``type-PF``
 | |
|   The device supports SR-IOV and is the parent or root device.
 | |
| 
 | |
| ``type-VF``
 | |
|   The device is a child device of a device that supports SR-IOV.
 | |
| 
 | |
| ``type-PCI``
 | |
|   The device does not support SR-IOV.
 | |
| 
 | |
| By default, it is only possible to attach ``type-PCI`` devices using PCI
 | |
| passthrough. If you wish to attach ``type-PF`` or ``type-VF`` devices, you must
 | |
| specify the ``device_type`` field in the config option. If the device was a
 | |
| device that did not support SR-IOV, the ``device_type`` field could be omitted.
 | |
| 
 | |
| Refer to :oslo.config:option:`pci.alias` for syntax information.
 | |
| 
 | |
| .. important::
 | |
| 
 | |
|    This option must also be configured on controller nodes. This is discussed later
 | |
|    in this document.
 | |
| 
 | |
| Once configured, restart the :program:`nova-compute` service.
 | |
| 
 | |
| Special Tags
 | |
| ^^^^^^^^^^^^
 | |
| 
 | |
| When specified in :oslo.config:option:`pci.device_spec` some tags
 | |
| have special meaning:
 | |
| 
 | |
| ``physical_network``
 | |
|   Associates a device with a physical network label which corresponds to the
 | |
|   ``physical_network`` attribute of a network segment object in Neutron. For
 | |
|   virtual networks such as overlays a value of ``null`` should be specified
 | |
|   as follows: ``"physical_network": null``. In the case of physical networks,
 | |
|   this tag is used to supply the metadata necessary for identifying a switched
 | |
|   fabric to which a PCI device belongs and associate the port with the correct
 | |
|   network segment in the networking backend. Besides typical SR-IOV scenarios,
 | |
|   this tag can be used for remote-managed devices in conjunction with the
 | |
|   ``remote_managed`` tag.
 | |
| 
 | |
| .. note::
 | |
| 
 | |
|    The use of ``"physical_network": null`` is only supported in single segment
 | |
|    networks. This is due to Nova not supporting multisegment networks for
 | |
|    SR-IOV ports. See
 | |
|    `bug 1983570 <https://bugs.launchpad.net/nova/+bug/1983570>`_ for details.
 | |
| 
 | |
| ``remote_managed``
 | |
|   Used to specify whether a PCI device is managed remotely or not. By default,
 | |
|   devices are implicitly tagged as ``"remote_managed": "false"`` but and they
 | |
|   must be tagged as ``"remote_managed": "true"`` if ports with
 | |
|   ``VNIC_TYPE_REMOTE_MANAGED`` are intended to be used. Once that is done,
 | |
|   those PCI devices will not be available for allocation for regular
 | |
|   PCI passthrough use. Specifying ``"remote_managed": "true"`` is only valid
 | |
|   for SR-IOV VFs and specifying it for PFs is prohibited.
 | |
| 
 | |
|   .. important::
 | |
|      It is recommended that PCI VFs that are meant to be remote-managed
 | |
|      (e.g. the ones provided by SmartNIC DPUs) are tagged as remote-managed in
 | |
|      order to prevent them from being allocated for regular PCI passthrough since
 | |
|      they have to be programmed accordingly at the host that has access to the
 | |
|      NIC switch control plane. If this is not done, instances requesting regular
 | |
|      SR-IOV ports may get a device that will not be configured correctly and
 | |
|      will not be usable for sending network traffic.
 | |
| 
 | |
|   .. important::
 | |
|      For the Libvirt virt driver, clearing a VLAN by programming VLAN 0 must not
 | |
|      result in errors in the VF kernel driver at the compute host. Before v8.1.0
 | |
|      Libvirt clears a VLAN before passing a VF through to the guest which may
 | |
|      result in an error depending on your driver and kernel version (see, for
 | |
|      example, `this bug <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1957753>`_
 | |
|      which discusses a case relevant to one driver). As of Libvirt v8.1.0, EPERM
 | |
|      errors encountered while programming a VLAN are ignored if VLAN clearing is
 | |
|      not explicitly requested in the device XML.
 | |
| 
 | |
| ``trusted``
 | |
|   If a port is requested to be trusted by specifying an extra option during
 | |
|   port creation via ``--binding-profile trusted=true``, only devices tagged as
 | |
|   ``trusted: "true"`` will be allocated to instances. Nova will then configure
 | |
|   those devices as trusted by the network controller through its PF device driver.
 | |
|   The specific set of features allowed by the trusted mode of a VF will differ
 | |
|   depending on the network controller itself, its firmware version and what a PF
 | |
|   device driver version allows to pass to the NIC. Common features to be affected
 | |
|   by this tag are changing the VF MAC address, enabling promiscuous mode or
 | |
|   multicast promiscuous mode.
 | |
| 
 | |
|   .. important::
 | |
|      While the ``trusted tag`` does not directly conflict with the
 | |
|      ``remote_managed`` tag, network controllers in SmartNIC DPUs may prohibit
 | |
|      setting the ``trusted`` mode on a VF via a PF device driver in the first
 | |
|      place. It is recommended to test specific devices, drivers and firmware
 | |
|      versions before assuming this feature can be used.
 | |
| 
 | |
| ``managed``
 | |
|   Users must specify whether the PCI device is managed by libvirt to allow
 | |
|   detachment from the host and assignment to the guest, or vice versa.
 | |
|   The managed mode of a device depends on the specific device and the support
 | |
|   provided by its driver.
 | |
| 
 | |
|   - ``managed='yes'`` means that nova will let libvirt to detach the device
 | |
|     from the host before attaching it to the guest and re-attach it to the host
 | |
|     after the guest is deleted.
 | |
| 
 | |
|   - ``managed='no'`` means that Nova will not request libvirt to attach
 | |
|     or detach the device from the host. Instead, Nova assumes that
 | |
|     the operator has pre-configured the host so that the devices are
 | |
|     already bound to vfio-pci or an appropriate variant driver. This
 | |
|     setup allows the devices to be directly usable by QEMU without
 | |
|     requiring any additional operations to enable passthrough.
 | |
| 
 | |
|   .. note::
 | |
|     If not set, the default value is managed='yes' to preserve the existing
 | |
|     behavior, primarily for upgrade purposes.
 | |
| 
 | |
|   .. warning::
 | |
|      Incorrect configuration of this parameter may result in compute
 | |
|      node crashes.
 | |
| 
 | |
| 
 | |
| Configure ``nova-scheduler``
 | |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| The :program:`nova-scheduler` service must be configured to enable the
 | |
| ``PciPassthroughFilter``. To do this, add this filter to the list of filters
 | |
| specified in :oslo.config:option:`filter_scheduler.enabled_filters` and set
 | |
| :oslo.config:option:`filter_scheduler.available_filters` to the default of
 | |
| ``nova.scheduler.filters.all_filters``. For example:
 | |
| 
 | |
| .. code-block:: ini
 | |
| 
 | |
|    [filter_scheduler]
 | |
|    enabled_filters = ...,PciPassthroughFilter
 | |
|    available_filters = nova.scheduler.filters.all_filters
 | |
| 
 | |
| Once done, restart the :program:`nova-scheduler` service.
 | |
| 
 | |
| Configure ``nova-api``
 | |
| ~~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| It is necessary to also configure the :oslo.config:option:`pci.alias` config
 | |
| option on the controller. This configuration should match the configuration
 | |
| found on the compute nodes. For example:
 | |
| 
 | |
| .. code-block:: ini
 | |
| 
 | |
|    [pci]
 | |
|    alias = { "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", "name":"a1", "numa_policy":"preferred" }
 | |
| 
 | |
| Refer to :oslo.config:option:`pci.alias` for syntax information.
 | |
| Refer to :ref:`Affinity  <pci-numa-affinity-policy>` for ``numa_policy``
 | |
| information.
 | |
| 
 | |
| Once configured, restart the :program:`nova-api-wsgi` service.
 | |
| 
 | |
| 
 | |
| Configuring a flavor or image
 | |
| -----------------------------
 | |
| 
 | |
| Once the alias has been configured, it can be used for an flavor extra spec.
 | |
| For example, to request two of the PCI devices referenced by alias ``a1``, run:
 | |
| 
 | |
| .. code-block:: console
 | |
| 
 | |
|    $ openstack flavor set m1.large --property "pci_passthrough:alias"="a1:2"
 | |
| 
 | |
| For more information about the syntax for ``pci_passthrough:alias``, refer to
 | |
| :doc:`the documentation </configuration/extra-specs>`.
 | |
| 
 | |
| 
 | |
| .. _pci-numa-affinity-policy:
 | |
| 
 | |
| PCI-NUMA affinity policies
 | |
| --------------------------
 | |
| 
 | |
| By default, the libvirt driver enforces strict NUMA affinity for PCI devices,
 | |
| be they PCI passthrough devices or neutron SR-IOV interfaces. This means that
 | |
| by default a PCI device must be allocated from the same host NUMA node as at
 | |
| least one of the instance's CPUs. This isn't always necessary, however, and you
 | |
| can configure this policy using the
 | |
| :nova:extra-spec:`hw:pci_numa_affinity_policy` flavor extra spec or equivalent
 | |
| image metadata property. There are three possible values allowed:
 | |
| 
 | |
| **required**
 | |
|     This policy means that nova will boot instances with PCI devices **only**
 | |
|     if at least one of the NUMA nodes of the instance is associated with these
 | |
|     PCI devices. It means that if NUMA node info for some PCI devices could not
 | |
|     be determined, those PCI devices wouldn't be consumable by the instance.
 | |
|     This provides maximum performance.
 | |
| 
 | |
| **socket**
 | |
|     This policy means that the PCI device must be affined to the same host
 | |
|     socket as at least one of the guest NUMA nodes. For example, consider a
 | |
|     system with two sockets, each with two NUMA nodes, numbered node 0 and node
 | |
|     1 on socket 0, and node 2 and node 3 on socket 1. There is a PCI device
 | |
|     affined to node 0. An PCI instance with two guest NUMA nodes and the
 | |
|     ``socket`` policy can be affined to either:
 | |
| 
 | |
|     * node 0 and node 1
 | |
|     * node 0 and node 2
 | |
|     * node 0 and node 3
 | |
|     * node 1 and node 2
 | |
|     * node 1 and node 3
 | |
| 
 | |
|     The instance cannot be affined to node 2 and node 3, as neither of those
 | |
|     are on the same socket as the PCI device. If the other nodes are consumed
 | |
|     by other instances and only nodes 2 and 3 are available, the instance
 | |
|     will not boot.
 | |
| 
 | |
| **preferred**
 | |
|     This policy means that ``nova-scheduler`` will choose a compute host
 | |
|     with minimal consideration for the NUMA affinity of PCI devices.
 | |
|     ``nova-compute`` will attempt a best effort selection of PCI devices
 | |
|     based on NUMA affinity, however, if this is not possible then
 | |
|     ``nova-compute`` will fall back to scheduling on a NUMA node that is not
 | |
|     associated with the PCI device.
 | |
| 
 | |
| **legacy**
 | |
|     This is the default policy and it describes the current nova behavior.
 | |
|     Usually we have information about association of PCI devices with NUMA
 | |
|     nodes. However, some PCI devices do not provide such information. The
 | |
|     ``legacy`` value will mean that nova will boot instances with PCI device
 | |
|     if either:
 | |
| 
 | |
|     * The PCI device is associated with at least one NUMA nodes on which the
 | |
|       instance will be booted
 | |
| 
 | |
|     * There is no information about PCI-NUMA affinity available
 | |
| 
 | |
| For example, to configure a flavor to use the ``preferred`` PCI NUMA affinity
 | |
| policy for any neutron SR-IOV interfaces attached by the user:
 | |
| 
 | |
| .. code-block:: console
 | |
| 
 | |
|    $ openstack flavor set $FLAVOR \
 | |
|        --property hw:pci_numa_affinity_policy=preferred
 | |
| 
 | |
| You can also configure this for PCI passthrough devices by specifying the
 | |
| policy in the alias configuration via :oslo.config:option:`pci.alias`. For more
 | |
| information, refer to :oslo.config:option:`the documentation <pci.alias>`.
 | |
| 
 | |
| .. _pci-tracking-in-placement:
 | |
| 
 | |
| PCI tracking in Placement
 | |
| -------------------------
 | |
| .. note::
 | |
|    The feature described below are optional and disabled by default in nova
 | |
|    26.0.0. (Zed). The legacy PCI tracker code path is still supported and
 | |
|    enabled. The Placement PCI tracking can be enabled via the
 | |
|    :oslo.config:option:`pci.report_in_placement` configuration.
 | |
| 
 | |
| .. warning::
 | |
|    Please note that once it is enabled on a given compute host
 | |
|    **it cannot be disabled there any more**.
 | |
| 
 | |
| Since nova 26.0.0 (Zed) PCI passthrough device inventories are tracked in
 | |
| Placement. If a PCI device exists on the hypervisor and
 | |
| matches one of the device specifications configured via
 | |
| :oslo.config:option:`pci.device_spec` then Placement will have a representation
 | |
| of the device. Each PCI device of type ``type-PCI`` and ``type-PF`` will be
 | |
| modeled as a Placement resource provider (RP) with the name
 | |
| ``<hypervisor_hostname>_<pci_address>``. A devices with type ``type-VF`` is
 | |
| represented by its parent PCI device, the PF, as resource provider.
 | |
| 
 | |
| By default nova will use ``CUSTOM_PCI_<vendor_id>_<product_id>`` as the
 | |
| resource class in PCI inventories in Placement. However the name of the
 | |
| resource class can be customized via the ``resource_class`` tag in the
 | |
| :oslo.config:option:`pci.device_spec` option. There is also a new ``traits``
 | |
| tag in that configuration that allows specifying a list of placement traits to
 | |
| be added to the resource provider representing the matching PCI devices.
 | |
| 
 | |
| .. note::
 | |
|    In nova 26.0.0 (Zed) the Placement resource tracking of PCI devices does not
 | |
|    support SR-IOV devices intended to be consumed via Neutron ports and
 | |
|    therefore having ``physical_network`` tag in
 | |
|    :oslo.config:option:`pci.device_spec`. Such devices are supported via the
 | |
|    legacy PCI tracker code path in Nova.
 | |
| 
 | |
| .. note::
 | |
|    Having different resource class or traits configuration for VFs under the
 | |
|    same parent PF is not supported and the nova-compute service will refuse to
 | |
|    start with such configuration.
 | |
| 
 | |
| .. important::
 | |
|    While nova supported configuring both the PF and its children VFs for PCI
 | |
|    passthrough in the past, it only allowed consuming either the parent PF or
 | |
|    its children VFs. Since 26.0.0. (Zed) the nova-compute service will
 | |
|    enforce the same rule for the configuration as well and will refuse to
 | |
|    start if both the parent PF and its VFs are configured.
 | |
| 
 | |
| .. important::
 | |
|    While nova supported configuring PCI devices by device name via the
 | |
|    ``devname`` parameter in :oslo.config:option:`pci.device_spec` in the past,
 | |
|    this proved to be problematic as the netdev name of a PCI device could
 | |
|    change for multiple reasons during hypervisor reboot. So since nova 26.0.0
 | |
|    (Zed) the nova-compute service will refuse to start with such configuration.
 | |
|    It is suggested to use the PCI address of the device instead.
 | |
| 
 | |
| .. important::
 | |
|    While nova supported configuring :oslo.config:option:`pci.alias` where an
 | |
|    alias name is repeated and therefore associated to multiple alias
 | |
|    specifications, such configuration is not supported when PCI tracking in
 | |
|    Placement is enabled.
 | |
| 
 | |
| The nova-compute service makes sure that existing instances with PCI
 | |
| allocations in the nova DB will have a corresponding PCI allocation in
 | |
| placement. This allocation healing also acts on any new instances regardless of
 | |
| the status of the scheduling part of this feature to make sure that the nova
 | |
| DB and placement are in sync. There is one limitation of the healing logic.
 | |
| It assumes that there is no in-progress migration when the nova-compute service
 | |
| is upgraded. If there is an in-progress migration then the PCI allocation on
 | |
| the source host of the migration will not be healed. The placement view will be
 | |
| consistent after such migration is completed or reverted.
 | |
| 
 | |
| Reconfiguring the PCI devices on the hypervisor or changing the
 | |
| :oslo.config:option:`pci.device_spec` configuration option and restarting the
 | |
| nova-compute service is supported in the following cases:
 | |
| 
 | |
| * new devices are added
 | |
| * devices without allocation are removed
 | |
| 
 | |
| Removing a device that has allocations is not supported. If a device having any
 | |
| allocation is removed then the nova-compute service will keep the device and
 | |
| the allocation exists in the nova DB and in placement and logs a warning. If
 | |
| a device with any allocation is reconfigured in a way that an allocated PF is
 | |
| removed and VFs from the same PF is configured (or vice versa) then
 | |
| nova-compute will refuse to start as it would create a situation where both
 | |
| the PF and its VFs are made available for consumption.
 | |
| 
 | |
| Since nova 27.0.0 (2023.1 Antelope) scheduling and allocation of PCI devices
 | |
| in Placement can also be enabled via
 | |
| :oslo.config:option:`filter_scheduler.pci_in_placement` config option set in
 | |
| the nova-api, nova-scheduler, and nova-conductor configuration. Please note
 | |
| that this should only be enabled after all the computes in the system is
 | |
| configured to report PCI inventory in Placement via enabling
 | |
| :oslo.config:option:`pci.report_in_placement`. In Antelope flavor
 | |
| based PCI requests are support but Neutron port base PCI requests are not
 | |
| handled in Placement.
 | |
| 
 | |
| If you are upgrading from an earlier version with already existing servers with
 | |
| PCI usage then you must enable :oslo.config:option:`pci.report_in_placement`
 | |
| first on all your computes having PCI allocations and then restart the
 | |
| nova-compute service, before you enable
 | |
| :oslo.config:option:`filter_scheduler.pci_in_placement`. The compute service
 | |
| will heal the missing PCI allocation in placement during startup and will
 | |
| continue healing missing allocations for future servers until the scheduling
 | |
| support is enabled.
 | |
| 
 | |
| If a flavor requests multiple ``type-VF`` devices via
 | |
| :nova:extra-spec:`pci_passthrough:alias` then it is important to consider the
 | |
| value of :nova:extra-spec:`group_policy` as well. The value ``none``
 | |
| allows nova to select VFs from the same parent PF to fulfill the request. The
 | |
| value ``isolate`` restricts nova to select each VF from a different parent PF
 | |
| to fulfill the request. If :nova:extra-spec:`group_policy` is not provided in
 | |
| such flavor then it will defaulted to ``none``.
 | |
| 
 | |
| Symmetrically with the ``resource_class`` and ``traits`` fields of
 | |
| :oslo.config:option:`pci.device_spec` the :oslo.config:option:`pci.alias`
 | |
| configuration option supports requesting devices by Placement resource class
 | |
| name via the ``resource_class`` field and also support requesting traits to
 | |
| be present on the selected devices via the ``traits`` field in the alias. If
 | |
| the ``resource_class`` field is not specified in the alias then it is defaulted
 | |
| by nova to ``CUSTOM_PCI_<vendor_id>_<product_id>``. Either the ``product_id``
 | |
| and ``vendor_id`` or the ``resource_class`` field must be provided in each
 | |
| alias.
 | |
| 
 | |
| For deeper technical details please read the `nova specification. <https://specs.openstack.org/openstack/nova-specs/specs/zed/approved/pci-device-tracking-in-placement.html>`_
 | |
| 
 | |
| Support for multiple types of VFs
 | |
| ---------------------------------
 | |
| 
 | |
| SR-IOV devices, such as GPUs, can be configured to provide VFs with various
 | |
| characteristics under the same vendor ID and product ID.
 | |
| 
 | |
| To enable Nova to model this, if you configure the VFs with different
 | |
| resource allocations, you will need to use separate resource_classes for each.
 | |
| 
 | |
| This can be achieved by following the steps below:
 | |
| 
 | |
| - Enable PCI in Placement: This is necessary to track PCI devices with
 | |
|   custom resource classes in the placement service.
 | |
| 
 | |
| - Define Device Specifications: Use a custom resource class to represent
 | |
|   a specific VF type and ensure that the VFs existing on the hypervisor are
 | |
|   matched via the VF's PCI address.
 | |
| 
 | |
| - Specify Type-Specific Flavors: Define flavors with an alias that matches
 | |
|   the resource class to ensure proper allocation.
 | |
| 
 | |
| Examples:
 | |
| 
 | |
| .. note::
 | |
|   The following example demonstrates device specifications and alias
 | |
|   configurations, utilizing resource classes as part of the "PCI in
 | |
|   placement" feature.
 | |
| 
 | |
| .. code-block:: shell
 | |
| 
 | |
|   [pci]
 | |
|   device_spec = { "vendor_id": "10de", "product_id": "25b6", "address": "0000:25:00.4", "resource_class": "CUSTOM_A16_16A", "managed": "no" }
 | |
|   device_spec = { "vendor_id": "10de", "product_id": "25b6", "address": "0000:25:00.5", "resource_class": "CUSTOM_A16_8A", "managed": "no" }
 | |
|   alias = { "device_type": "type-VF", "resource_class": "CUSTOM_A16_16A", "name": "A16_16A" }
 | |
|   alias = { "device_type": "type-VF", "resource_class": "CUSTOM_A16_8A", "name": "A16_8A" }
 | |
| 
 | |
| 
 | |
| Configuring Live Migration for PCI devices
 | |
| ------------------------------------------
 | |
| 
 | |
| Live migration of instances with PCI devices requires specific configuration
 | |
| at both the device and alias levels to ensure that the migration can succeed.
 | |
| This section explains how to configure PCI passthrough to support live
 | |
| migration.
 | |
| 
 | |
| Configuring PCI Device Specification
 | |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| Administrators must explicitly define whether a PCI device support live
 | |
| migration.
 | |
| This is done by adding the ``live_migratable`` attribute to the device
 | |
| specification in the :oslo.config:option:`pci.device_spec` configuration.
 | |
| 
 | |
| .. note::
 | |
| 
 | |
|     Of course, this requires hardware support, as well as proper system
 | |
|     and hypervisor configuration.
 | |
| 
 | |
| Example Configuration:
 | |
| 
 | |
| .. code-block:: ini
 | |
| 
 | |
|    [pci]
 | |
|    dev_spec = {'vendor_id': '8086', 'product_id': '1515', 'live_migratable': 'yes'}
 | |
|    dev_spec = {'vendor_id': '8086', 'product_id': '1516', 'live_migratable': 'no'}
 | |
| 
 | |
| Configuring PCI Aliases for Users
 | |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 | |
| 
 | |
| PCI devices can be requested through flavor exta_specs.. To request a live
 | |
| migratable PCI device, the PCI alias definition in
 | |
| the :oslo.config:option:`pci.alias` configuration must include
 | |
| the ``live_migratable`` key.
 | |
| 
 | |
| Example Configuration:
 | |
| 
 | |
| .. code-block:: ini
 | |
| 
 | |
|    [pci]
 | |
|    alias = {'name': 'vf_live', 'vendor_id': '8086', 'product_id': '1515', 'device_type': 'type-VF', 'live_migratable': 'yes'}
 | |
|    alias = {'name': 'vf_no_migrate', 'vendor_id': '8086', 'product_id': '1516', 'device_type': 'type-VF', 'live_migratable': 'no'}
 | |
| 
 | |
| 
 | |
| Virtual IOMMU support
 | |
| ---------------------
 | |
| 
 | |
| With provided :nova:extra-spec:`hw:viommu_model` flavor extra spec or equivalent
 | |
| image metadata property ``hw_viommu_model`` and with the guest CPU architecture
 | |
| and OS allows, we can enable vIOMMU in libvirt driver.
 | |
| 
 | |
| .. note::
 | |
| 
 | |
|     Enable vIOMMU might introduce significant performance overhead.
 | |
|     You can see performance comparison table from
 | |
|     `AMD vIOMMU session on KVM Forum 2021`_.
 | |
|     For the above reason, vIOMMU should only be enabled for workflow that
 | |
|     require it.
 | |
| 
 | |
| .. _`AMD vIOMMU session on KVM Forum 2021`: https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf
 | |
| 
 | |
| Here are four possible values allowed for ``hw:viommu_model``
 | |
| (and ``hw_viommu_model``):
 | |
| 
 | |
| **virtio**
 | |
|     Supported on Libvirt since 8.3.0, for Q35 and ARM virt guests.
 | |
| 
 | |
| **smmuv3**
 | |
|     Supported on Libvirt since 5.5.0, for ARM virt guests.
 | |
| **intel**
 | |
|     Supported for Q35 guests.
 | |
| 
 | |
| **auto**
 | |
|     This option will translate to ``virtio`` if Libvirt supported,
 | |
|     else ``intel`` on X86 (Q35) and ``smmuv3`` on AArch64.
 | |
| 
 | |
| For the viommu attributes:
 | |
| 
 | |
| * ``intremap``, ``caching_mode``, and ``iotlb``
 | |
|   options for viommu (These attributes are driver attributes defined in
 | |
|   `Libvirt IOMMU Domain`_) will directly enabled.
 | |
| 
 | |
| * ``eim`` will directly enabled if machine type is Q35.
 | |
|   ``eim`` is driver attribute defined in `Libvirt IOMMU Domain`_.
 | |
| 
 | |
| .. note::
 | |
| 
 | |
|     eim(Extended Interrupt Mode) attribute (with possible values on and off)
 | |
|     can be used to configure Extended Interrupt Mode.
 | |
|     A q35 domain with split I/O APIC (as described in hypervisor features),
 | |
|     and both interrupt remapping and EIM turned on for the IOMMU, will be
 | |
|     able to use more than 255 vCPUs. Since 3.4.0 (QEMU/KVM only).
 | |
| 
 | |
| * ``aw_bits`` attribute can used to set the address width to allow mapping
 | |
|   larger iova addresses in the guest. Since Qemu current supported
 | |
|   values are 39 and 48, we directly set this to larger width (48)
 | |
|   if Libvirt supported.
 | |
|   ``aw_bits`` is driver attribute defined in `Libvirt IOMMU Domain`_.
 | |
| 
 | |
| .. _`Libvirt IOMMU Domain`: https://libvirt.org/formatdomain.html#iommu-devices
 | |
| 
 | |
| Known Issues
 | |
| ------------
 | |
| 
 | |
| A known issue exists where the ``live_migratable`` flag is ignored for
 | |
| devices that include the ``physical_network`` tag.
 | |
| As a result, instances using such devices do not behave as non-live
 | |
| migratable, and instead, they continue to migrate using the legacy VIF
 | |
| unplug/live migrate/VIF plug procedure.
 | |
| 
 | |
| Example configuration where the live_migratable flag is ignored:
 | |
| 
 | |
| .. code-block:: ini
 | |
| 
 | |
|    [pci]
 | |
|    device_spec = { "vendor_id":"8086", "product_id":"10ca", "address": "0000:06:", "physical_network": "physnet2", "live_migratable": false}
 | |
| 
 | |
| A fix for this issue is planned in a follow-up for the **Epoxy** release.
 | |
| The upstream bug report is `here`__.
 | |
| 
 | |
| .. __: https://bugs.launchpad.net/nova/+bug/2102161
 | |
| 
 | |
| One-Time-Use Devices
 | |
| --------------------
 | |
| 
 | |
| Certain devices may need attention after they are released from one user and
 | |
| before they are attached to another. This is especially true of direct
 | |
| passthrough devices because the instance has full control over them while
 | |
| attached, and Nova doesn't know specifics about the device itself, unlike
 | |
| regular more cloudy resources. Examples include:
 | |
| 
 | |
| * Securely erasing NVMe devices to ensure data residue is not passed from one
 | |
|   user to the other unintentionally
 | |
| * Reinstalling known-good firmware to the device to avoid a hijack attack
 | |
| * Updating firmware to the latest release before each user
 | |
| * Checking a property of the device to determine if it needs repair or
 | |
|   replacement before giving it to another user (i.e. NVMe write-wear indicator)
 | |
| * Some custom behavior, reset, etc
 | |
| 
 | |
| Nova's scope does not cover the above, but it does support a feature that makes
 | |
| it easier for the operator to orchestrate tasks like this. By marking a device
 | |
| as "one time use" (hereafter referred to as OTU), Nova will allocate a device
 | |
| once, after which it will remain in a "reserved" state to avoid being
 | |
| allocated to another instance. After the operator's workflow is performed and
 | |
| the device should be returned to the pool of available resources, the reserved
 | |
| flag can be dropped and Nova will consider it usable again.
 | |
| 
 | |
| .. note:: This feature requires :ref:`pci-tracking-in-placement` in order to
 | |
|   work. The compute configuration is required, but the transitional scheduler
 | |
|   config is optional (during transition but required for safety).
 | |
| 
 | |
| A device can be marked as OTU by adding a tag in the ``device_spec`` like this:
 | |
| 
 | |
| .. code-block:: shell
 | |
| 
 | |
|   device_spec = {"address": "0000:00:1.0", "one_time_use": true}
 | |
| 
 | |
| By marking the device as such, Nova will set the ``reserved`` inventory value
 | |
| on the placement provider to fully cover the device (i.e. ``reserved=total``
 | |
| at the point at which the instance is assigned the PCI device on the compute
 | |
| node. When the instance is deleted, the ``used`` value will return to zero but
 | |
| ``reserved`` will remain. It is the operator's responsibility to return the
 | |
| ``reserved`` value to zero when the device is ready for re-assignment.
 | |
| 
 | |
| The best way to handle this would be to listen to Nova's notifications for the
 | |
| ``instance.delete.end`` event so that the post-processing workflow can happen
 | |
| immediately. However, since notifications could be dropped or missed, regular
 | |
| polling should be performed. Providers that represent devices that Nova is
 | |
| applying the OTU behavior to will have the ``HW_PCI_ONE_TIME_USE`` trait,
 | |
| making it easier to identify them. For example:
 | |
| 
 | |
| .. code-block:: shell
 | |
| 
 | |
|  $ openstack resource provider list --required HW_PCI_ONE_TIME_USE
 | |
|  +--------------------------------------+--------------------+------------+--------------------------------------+--------------------------------------+
 | |
|  | uuid                                 | name               | generation | root_provider_uuid                   | parent_provider_uuid                 |
 | |
|  +--------------------------------------+--------------------+------------+--------------------------------------+--------------------------------------+
 | |
|  | b9e67d7d-43db-49c7-8ce8-803cad08e656 | jammy_0000:00:01.0 |         39 | 2ee402e8-c5c6-4586-9ac7-58e7594d27d1 | 2ee402e8-c5c6-4586-9ac7-58e7594d27d1 |
 | |
|  +--------------------------------------+--------------------+------------+--------------------------------------+--------------------------------------+
 | |
| 
 | |
| Will find all such providers. For each of those, checking the inventory to find
 | |
| ones with ``used=0`` and ``reserved=1`` will identify devices in need of
 | |
| processing. To use the above example:
 | |
| 
 | |
| .. code-block:: shell
 | |
| 
 | |
|  $ openstack resource provider inventory list b9e67d7d-43db-49c7-8ce8-803cad08e656
 | |
|  +----------------------+------------------+----------+----------+----------+-----------+-------+------+
 | |
|  | resource_class       | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
 | |
|  +----------------------+------------------+----------+----------+----------+-----------+-------+------+
 | |
|  | CUSTOM_PCI_1B36_0100 |              1.0 |        1 |        1 |        1 |         1 |     1 |    0 |
 | |
|  +----------------------+------------------+----------+----------+----------+-----------+-------+------+
 | |
| 
 | |
| To return the above device back to the pool of allocatable resources, we can
 | |
| set the reserved count back to zero:
 | |
| 
 | |
| .. code-block:: shell
 | |
| 
 | |
|  $ openstack resource provider inventory set --amend \
 | |
|      --resource CUSTOM_PCI_1B36_0100:reserved=0 \
 | |
|      b9e67d7d-43db-49c7-8ce8-803cad08e656
 | |
|  +----------------------+------------------+----------+----------+----------+-----------+-------+
 | |
|  | resource_class       | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
 | |
|  +----------------------+------------------+----------+----------+----------+-----------+-------+
 | |
|  | CUSTOM_PCI_1B36_0100 |              1.0 |        1 |        1 |        0 |         1 |     1 |
 | |
|  +----------------------+------------------+----------+----------+----------+-----------+-------+
 |