XIVE for sPAPR (pseries machines)
=================================

The POWER9 processor comes with a new interrupt controller
architecture, called XIVE for "eXternal Interrupt Virtualization
Engine". It supports a larger number of interrupt sources and offers
virtualization features which enable the HW to deliver interrupts
directly to virtual processors without hypervisor assistance.

A QEMU ``pseries`` machine (which is PAPR compliant) using POWER9
processors can run under two interrupt modes:

- *Legacy Compatibility Mode*

  The hypervisor provides identical interfaces and similar
  functionality to PAPR+ Version 2.7. This is the default mode.

  It is also referred to as *XICS* in QEMU.

- *XIVE native exploitation mode*

  The hypervisor provides new interfaces to manage the XIVE control
  structures, and provides direct control for interrupt management
  through MMIO pages.

The interrupt modes that the machine can use are negotiated with
the guest O/S during the Client Architecture Support negotiation
sequence. The two modes are mutually exclusive.

Both interrupt modes share the same IRQ number space. See below for the
layout.

CAS Negotiation
---------------

QEMU advertises the supported interrupt modes in byte 23 of the device
tree property ``ibm,arch-vec-5-platform-support``, and the OS selection
for XIVE is indicated in byte 23 of the ``ibm,architecture-vec-5``
property.

The interrupt modes supported by the machine depend on the CPU type
(POWER9 is required for XIVE) but also on the machine property
``ic-mode``, which can be set on the command line. It can take the
following values: ``xics``, ``xive``, and ``dual``, which is the
default mode. ``dual`` means that both modes, XICS **and** XIVE, are
supported; if the guest OS supports XIVE, this mode will be
selected.

The chosen interrupt mode is activated after a reconfiguration done
in a machine reset.
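
The selection rule can be sketched in a few lines of Python. This is
only an illustration of the behaviour documented in this file, not QEMU
code: the function name ``negotiate_interrupt_mode`` and the boolean
``guest_supports_xive`` are hypothetical, and the CPU type requirement
is left out.

```python
def negotiate_interrupt_mode(ic_mode: str, guest_supports_xive: bool) -> str:
    """Return the interrupt mode ("xics" or "xive") chosen at CAS.

    Hypothetical helper: it only mirrors the rules stated in this
    document and ignores the CPU type (XIVE requires POWER9).
    """
    if ic_mode not in ("xics", "xive", "dual"):
        raise ValueError(f"unknown ic-mode: {ic_mode!r}")
    if ic_mode == "xics":
        return "xics"
    if ic_mode == "xive":
        if not guest_supports_xive:
            # A legacy guest asks for XICS, which ic-mode=xive does not offer.
            raise RuntimeError("guest requested unavailable interrupt mode (XICS)")
        return "xive"
    # ic-mode=dual: XIVE is selected whenever the guest supports it.
    return "xive" if guest_supports_xive else "xics"
```

With the default ``ic-mode=dual``, a XIVE-capable guest ends up in
``xive`` mode and a legacy guest in ``xics`` mode.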

KVM negotiation
---------------

When the guest starts under KVM, the capabilities of the host kernel
and QEMU are also negotiated. Depending on the version of the host
kernel, KVM will advertise the XIVE capability to QEMU or not.

Nevertheless, the available interrupt modes in the machine should not
depend on the XIVE KVM capability of the host. On older kernels
without XIVE KVM support, QEMU will use the emulated XIVE device as a
fallback; on newer kernels (>=5.2), it will use the KVM XIVE device.

XIVE native exploitation mode is not supported for KVM nested guests,
i.e. VMs running under an L1 hypervisor (KVM on pSeries). In that case, the
hypervisor will not advertise the KVM capability and QEMU will use the
emulated XIVE device, same as for older versions of KVM.

As a final refinement, the user can also switch the use of the KVM
device with the machine option ``kernel_irqchip``.


XIVE support in KVM
~~~~~~~~~~~~~~~~~~~

For guest OSes supporting XIVE, the resulting interrupt modes on host
kernels with XIVE KVM support are the following:

==============  =============  =============  ================
ic-mode                            kernel_irqchip
--------------  ----------------------------------------------
/               allowed        off            on
                (default)
==============  =============  =============  ================
dual (default)  XIVE KVM       XIVE emul.     XIVE KVM
xive            XIVE KVM       XIVE emul.     XIVE KVM
xics            XICS KVM       XICS emul.     XICS KVM
==============  =============  =============  ================

For legacy guest OSes without XIVE support, the resulting interrupt
modes are the following:

==============  =============  =============  ================
ic-mode                            kernel_irqchip
--------------  ----------------------------------------------
/               allowed        off            on
                (default)
==============  =============  =============  ================
dual (default)  XICS KVM       XICS emul.     XICS KVM
xive            QEMU error(3)  QEMU error(3)  QEMU error(3)
xics            XICS KVM       XICS emul.     XICS KVM
==============  =============  =============  ================

(3) QEMU fails at CAS with ``Guest requested unavailable interrupt
    mode (XICS), either don't set the ic-mode machine property or try
    ic-mode=xics or ic-mode=dual``


No XIVE support in KVM
~~~~~~~~~~~~~~~~~~~~~~

For guest OSes supporting XIVE, the resulting interrupt modes on host
kernels without XIVE KVM support are the following:

==============  =============  =============  ================
ic-mode                            kernel_irqchip
--------------  ----------------------------------------------
/               allowed        off            on
                (default)
==============  =============  =============  ================
dual (default)  XIVE emul.(1)  XIVE emul.     QEMU error (2)
xive            XIVE emul.(1)  XIVE emul.     QEMU error (2)
xics            XICS KVM       XICS emul.     XICS KVM
==============  =============  =============  ================


(1) QEMU warns with ``warning: kernel_irqchip requested but unavailable:
    IRQ_XIVE capability must be present for KVM``.
    In some cases (old host kernels or KVM nested guests), one may hit a
    QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
(2) QEMU fails with ``kernel_irqchip requested but unavailable:
    IRQ_XIVE capability must be present for KVM``


For legacy guest OSes without XIVE support, the resulting interrupt
modes are the following:

==============  =============  =============  ================
ic-mode                            kernel_irqchip
--------------  ----------------------------------------------
/               allowed        off            on
                (default)
==============  =============  =============  ================
dual (default)  QEMU error(4)  XICS emul.     QEMU error(4)
xive            QEMU error(3)  QEMU error(3)  QEMU error(3)
xics            XICS KVM       XICS emul.     XICS KVM
==============  =============  =============  ================

(3) QEMU fails at CAS with ``Guest requested unavailable interrupt
    mode (XICS), either don't set the ic-mode machine property or try
    ic-mode=xics or ic-mode=dual``
(4) QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
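
For quick reference, the four tables of this section can be folded into
a single lookup. The sketch below is a plain transcription of the
tables, not code taken from QEMU; the names ``OUTCOMES`` and
``resulting_mode`` are hypothetical, and the ``(1)`` to ``(4)`` markers
refer to the notes above.

```python
# Column order follows the tables: kernel_irqchip = allowed, off, on.
COLUMNS = ("allowed", "off", "on")

XICS_ROW = ("XICS KVM", "XICS emul.", "XICS KVM")
ERR3_ROW = ("QEMU error(3)",) * 3

# Keyed by (host KVM has XIVE support, guest OS supports XIVE), then ic-mode.
OUTCOMES = {
    (True, True): {
        "dual": ("XIVE KVM", "XIVE emul.", "XIVE KVM"),
        "xive": ("XIVE KVM", "XIVE emul.", "XIVE KVM"),
        "xics": XICS_ROW,
    },
    (True, False): {
        "dual": XICS_ROW,
        "xive": ERR3_ROW,
        "xics": XICS_ROW,
    },
    (False, True): {
        "dual": ("XIVE emul.(1)", "XIVE emul.", "QEMU error(2)"),
        "xive": ("XIVE emul.(1)", "XIVE emul.", "QEMU error(2)"),
        "xics": XICS_ROW,
    },
    (False, False): {
        "dual": ("QEMU error(4)", "XICS emul.", "QEMU error(4)"),
        "xive": ERR3_ROW,
        "xics": XICS_ROW,
    },
}


def resulting_mode(kvm_has_xive, guest_supports_xive,
                   ic_mode="dual", kernel_irqchip="allowed"):
    """Look up the table cell for one host/guest/option combination."""
    row = OUTCOMES[(kvm_has_xive, guest_supports_xive)][ic_mode]
    return row[COLUMNS.index(kernel_irqchip)]
```

For example, ``resulting_mode(False, True, kernel_irqchip="on")`` gives
``"QEMU error(2)"``, which matches the ``dual`` row of the third table.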


XIVE Device tree properties
---------------------------

The properties for the PAPR interrupt controller node when the *XIVE
native exploitation mode* is selected should contain:

- ``device_type``

  value should be "power-ivpe".

- ``compatible``

  value should be "ibm,power-ivpe".

- ``reg``

  contains the base address and size of the thread interrupt
  management areas (TIMA), for the User level and for the Guest OS
  level. Only the Guest OS level is taken into account today.

- ``ibm,xive-eq-sizes``

  the size of the event queues. One cell per size supported, contains
  log2 of size, in ascending order.

- ``ibm,xive-lisn-ranges``

  the interrupt number ranges assigned to the guest for the IPIs.

The root node also exports:

- ``ibm,plat-res-int-priorities``

  contains a list of priorities that the hypervisor has reserved for
  its own use.
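
As an illustration of the ``ibm,xive-eq-sizes`` encoding, the sketch
below decodes a raw property value into queue sizes. It assumes the
usual device tree layout of 32-bit big-endian cells and that the sizes
are expressed in bytes; the helper name ``decode_eq_sizes`` and the
sample cell values are hypothetical and chosen only for the example.

```python
import struct
from typing import List


def decode_eq_sizes(prop: bytes) -> List[int]:
    """Turn a raw ``ibm,xive-eq-sizes`` value into a list of sizes.

    Each cell holds log2 of a supported event queue size, so the size
    itself is ``1 << cell``. Cells are assumed to be 32-bit big-endian.
    """
    if len(prop) % 4:
        raise ValueError("property length is not a multiple of 4 bytes")
    cells = struct.unpack(f">{len(prop) // 4}I", prop)
    if list(cells) != sorted(cells):
        raise ValueError("cells are expected in ascending order")
    return [1 << cell for cell in cells]


# Illustrative value only: four cells holding 12, 16, 21 and 24.
EXAMPLE_PROP = struct.pack(">4I", 12, 16, 21, 24)
```

With these sample cells, ``decode_eq_sizes(EXAMPLE_PROP)`` yields sizes
of 4 KiB, 64 KiB, 2 MiB and 16 MiB.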

IRQ number space
----------------

The IRQ number space of the ``pseries`` machine is 8K wide and is the same
for both interrupt modes. The different ranges are defined as follows:

- ``0x0000 .. 0x0FFF`` 4K CPU IPIs (only used under XIVE)
- ``0x1000 .. 0x1000`` 1 EPOW
- ``0x1001 .. 0x1001`` 1 HOTPLUG
- ``0x1002 .. 0x10FF`` unused
- ``0x1100 .. 0x11FF`` 256 VIO devices
- ``0x1200 .. 0x127F`` 32x4 LSIs for PHB devices
- ``0x1280 .. 0x12FF`` unused
- ``0x1300 .. 0x1FFF`` PHB MSIs (dynamically allocated)
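
The layout can be written down as a small table and used to classify an
IRQ number. The sketch below is illustrative only; ``IRQ_RANGES``,
``classify_irq`` and ``layout_size`` are hypothetical names, not QEMU
symbols.

```python
# (first, last, description), transcribed from the list above.
IRQ_RANGES = [
    (0x0000, 0x0FFF, "CPU IPIs"),
    (0x1000, 0x1000, "EPOW"),
    (0x1001, 0x1001, "HOTPLUG"),
    (0x1002, 0x10FF, "unused"),
    (0x1100, 0x11FF, "VIO devices"),
    (0x1200, 0x127F, "PHB LSIs"),
    (0x1280, 0x12FF, "unused"),
    (0x1300, 0x1FFF, "PHB MSIs"),
]


def classify_irq(irq: int) -> str:
    """Return the name of the range that contains ``irq``."""
    for first, last, name in IRQ_RANGES:
        if first <= irq <= last:
            return name
    raise ValueError(f"IRQ {irq:#x} is outside the 8K number space")


def layout_size() -> int:
    """Check that the ranges are contiguous and return the total width."""
    expected = 0
    for first, last, _ in IRQ_RANGES:
        if first != expected:
            raise ValueError(f"gap or overlap at {first:#x}")
        expected = last + 1
    return expected
```

``layout_size()`` returns ``0x2000``, which confirms that the ranges
tile the 8K space without gaps or overlaps.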

Monitoring XIVE
---------------

The state of the XIVE interrupt controller can be queried through the
monitor command ``info pic``. The output comes in two parts.

First, the state of the thread interrupt context registers is dumped
for each CPU:

::

   (qemu) info pic
   CPU[0000]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
   CPU[0000]: USER    00   00  00    00   00  00  00   00  00000000
   CPU[0000]:   OS    00   ff  00    00   ff  00  ff   ff  80000400
   CPU[0000]: POOL    00   00  00    00   00  00  00   00  00000000
   CPU[0000]: PHYS    00   00  00    00   00  00  00   ff  00000000
   ...

In the case of a ``pseries`` machine, QEMU acts as the hypervisor and only
the O/S and USER register rings make sense. ``W2`` contains the vCPU CAM
line which is set to the VP identifier.

Then comes the routing information which aggregates the EAS and the
END configuration:

::

   ...
   LISN         PQ    EISN     CPU/PRIO EQ
   00000000 MSI --    00000010   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
   00000001 MSI --    00000010   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
   00000002 MSI --    00000010   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
   00000003 MSI --    00000010   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
   00000004 MSI -Q  M 00000000
   00000005 MSI -Q  M 00000000
   00000006 MSI -Q  M 00000000
   00000007 MSI -Q  M 00000000
   00001000 MSI --    00000012   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
   00001001 MSI --    00000013   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
   00001100 MSI --    00000100   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
   00001101 MSI -Q  M 00000000
   00001200 LSI -Q  M 00000000
   00001201 LSI -Q  M 00000000
   00001202 LSI -Q  M 00000000
   00001203 LSI -Q  M 00000000
   00001300 MSI --    00000102   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
   00001301 MSI --    00000103   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
   00001302 MSI --    00000104   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]

The source information and configuration:

- The ``LISN`` column outputs the interrupt number of the source in
  range ``[ 0x0 ... 0x1FFF ]`` and its type: ``MSI`` or ``LSI``
- The ``PQ`` column reflects the state of the PQ bits of the source:

  - ``--`` source is ready to take events
  - ``P-`` an event was sent and an EOI is PENDING
  - ``PQ`` an event was QUEUED
  - ``-Q`` source is OFF

  An ``M`` indicates that the source is *MASKED* at the EAS level.
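
The ``PQ`` encodings can be kept in a small lookup when post-processing
monitor output. This is a minimal sketch; ``PQ_STATES`` and
``describe_source_state`` are hypothetical names introduced for the
example.

```python
# The four PQ combinations as printed in the PQ column.
PQ_STATES = {
    "--": "ready to take events",
    "P-": "event sent, EOI pending",
    "PQ": "event queued",
    "-Q": "off",
}


def describe_source_state(pq: str, masked: bool = False) -> str:
    """Describe a source from its PQ column and the optional M flag."""
    state = PQ_STATES[pq]
    return f"{state}, masked at the EAS level" if masked else state
```

A line showing ``-Q  M`` therefore describes a source that is off and
masked at the EAS level.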

The targeting configuration:

- The ``EISN`` column is the event data that will be queued in the event
  queue of the O/S.
- The ``CPU/PRIO`` column is the tuple defining the CPU number and
  priority queue serving the source.
- The ``EQ`` column outputs:

  - the current index of the event queue / the max number of entries
  - the O/S event queue address
  - the toggle bit
  - the last entries that were pushed in the event queue.
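
When the routing dump has to be processed by a script, each line can be
split with a regular expression. The sketch below is based only on the
sample output shown in this document: the exact column widths are not
guaranteed, the CPU, priority and queue counters are assumed to be
decimal, and the names ``ROUTING_RE`` and ``parse_routing_line`` are
hypothetical.

```python
import re
from typing import Optional

# One routing line: LISN, type, PQ bits, optional M flag, EISN and,
# for sources that are routed, the CPU/PRIO and EQ details.
ROUTING_RE = re.compile(
    r"^\s*(?P<lisn>[0-9a-fA-F]{8})\s+(?P<type>MSI|LSI)\s+(?P<pq>[P-][Q-])\s+"
    r"(?P<masked>M\s+)?(?P<eisn>[0-9a-fA-F]{8})"
    r"(?:\s+(?P<cpu>\d+)/(?P<prio>\d+)\s+(?P<index>\d+)/(?P<entries>\d+)"
    r"\s+@(?P<addr>[0-9a-fA-F]+)\s+\^(?P<toggle>[01]))?"
)


def parse_routing_line(line: str) -> Optional[dict]:
    """Parse one routing line, or return None if it does not match."""
    m = ROUTING_RE.match(line)
    if m is None:
        return None
    entry = {
        "lisn": int(m["lisn"], 16),
        "type": m["type"],
        "pq": m["pq"],
        "masked": m["masked"] is not None,
        "eisn": int(m["eisn"], 16),
        "eq": None,
    }
    if m["cpu"] is not None:
        entry["eq"] = {
            "cpu": int(m["cpu"]),
            "prio": int(m["prio"]),
            "index": int(m["index"]),      # current index in the queue
            "entries": int(m["entries"]),  # max number of entries
            "addr": int(m["addr"], 16),    # O/S event queue address
            "toggle": int(m["toggle"]),
        }
    return entry
```

Header lines and the ``...`` separators do not match and yield ``None``,
so the function can be applied to every line of the dump.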