qemu/docs/specs/ppc-spapr-hotplug.txt
   1= sPAPR Dynamic Reconfiguration =
   2
   3sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration
   4to handle hotplugging of dynamic "physical" resources like PCI cards, or
   5"logical"/paravirtual resources like memory, CPUs, and "physical"
   6host-bridges, which are generally managed by the host/hypervisor and provided
   7to guests as virtualized resources. The specifics of dynamic-reconfiguration
   8are documented extensively in PAPR+ v2.7, Section 13.1. This document
   9provides a summary of that information as it applies to the implementation
  10within QEMU.
  11
  12== Dynamic-reconfiguration Connectors ==
  13
  14To manage hotplug/unplug of these resources, a firmware abstraction known as
  15a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
  16resource to the guest, and provide an interface for the guest to manage
  17configuration/removal of the resource associated with it.
  18
  19== Device-tree description of DRCs ==
  20
   21A set of 4 Open Firmware device tree array properties is used to describe
  22the name/index/power-domain/type of each DRC allocated to a guest at
  23boot-time. There may be multiple sets of these arrays, rooted at different
  24paths in the device tree depending on the type of resource the DRCs manage.
  25
  26In some cases, the DRCs themselves may be provided by a dynamic resource,
  27such as the DRCs managing PCI slots on a hotplugged PHB. In this case the
  28arrays would be fetched as part of the device tree retrieval interfaces
  29for hotplugged resources described under "Guest->Host interface".
  30
  31The array properties are described below. Each entry/element in an array
  32describes the DRC identified by the element in the corresponding position
  33of ibm,drc-indexes:
  34
  35ibm,drc-names:
  36  first 4-bytes: BE-encoded integer denoting the number of entries
  37  each entry: a NULL-terminated <name> string encoded as a byte array
  38
  39  <name> values for logical/virtual resources are defined in PAPR+ v2.7,
  40  Section 13.5.2.4, and basically consist of the type of the resource
  41  followed by a space and a numerical value that's unique across resources
  42  of that type.
  43
  44  <name> values for "physical" resources such as PCI or VIO devices are
  45  defined as being "location codes", which are the "location labels" of
  46  each encapsulating device, starting from the chassis down to the
  47  individual slot for the device, concatenated by a hyphen. This provides
  48  a mapping of resources to a physical location in a chassis for debugging
  49  purposes. For QEMU, this mapping is less important, so we assign a
  50  location code that conforms to naming specifications, but is simply a
  51  location label for the slot by itself to simplify the implementation.
  52  The naming convention for location labels is documented in detail in
  53  PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>"
  54  for PCI/VIO device slots, where <n> is unique across all PCI/VIO
  55  device slots.
  56
  57ibm,drc-indexes:
  58  first 4-bytes: BE-encoded integer denoting the number of entries
  59  each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs
  60    in the machine
  61
  62  <index> is arbitrary, but in the case of QEMU we try to maintain the
  63  convention used to assign them to pSeries guests on pHyp:
  64
  65    bit[31:28]: integer encoding of <type>, where <type> is:
  66                  1 for CPU resource
  67                  2 for PHB resource
  68                  3 for VIO resource
  69                  4 for PCI resource
  70                  8 for Memory resource
  71    bit[27:0]: integer encoding of <id>, where <id> is unique across
  72                 all resources of specified type
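
The bit layout above can be sketched as a pair of pack/unpack helpers. This is
an illustrative sketch only: the names drc_type_id, drc_index_make,
drc_index_type and drc_index_id are hypothetical and are not taken from the
QEMU sources.

```c
#include <stdint.h>

/* <type> values from the bit[31:28] table above (names are hypothetical). */
enum drc_type_id {
    DRC_TYPE_ID_CPU = 1,
    DRC_TYPE_ID_PHB = 2,
    DRC_TYPE_ID_VIO = 3,
    DRC_TYPE_ID_PCI = 4,
    DRC_TYPE_ID_MEM = 8,
};

#define DRC_INDEX_TYPE_SHIFT 28
#define DRC_INDEX_ID_MASK    0x0fffffffu

/* Combine a 4-bit <type> and a 28-bit <id> into a DRC index. */
static inline uint32_t drc_index_make(enum drc_type_id type, uint32_t id)
{
    return ((uint32_t)type << DRC_INDEX_TYPE_SHIFT) | (id & DRC_INDEX_ID_MASK);
}

/* Extract the <type> field, bit[31:28]. */
static inline uint32_t drc_index_type(uint32_t index)
{
    return index >> DRC_INDEX_TYPE_SHIFT;
}

/* Extract the <id> field, bit[27:0]. */
static inline uint32_t drc_index_id(uint32_t index)
{
    return index & DRC_INDEX_ID_MASK;
}
```

Under this convention a memory DRC with <id> 5 would be 0x80000005. The value
is host-endian here and still needs BE encoding before it is stored in the
ibm,drc-indexes property.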
  73
  74ibm,drc-power-domains:
  75  first 4-bytes: BE-encoded integer denoting the number of entries
  76  each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the
  77    power domain the resource will be assigned to. In the case of QEMU
   78    we associate all resources with a "live insertion" domain, where the
  79    power is assumed to be managed automatically. The integer value for
  80    this domain is a special value of -1.
  81
  82
  83ibm,drc-types:
  84  first 4-bytes: BE-encoded integer denoting the number of entries
  85  each entry: a NULL-terminated <type> string encoded as a byte array
  86
  87  <type> is assigned as follows:
  88    "CPU" for a CPU
  89    "PHB" for a physical host-bridge
  90    "SLOT" for a VIO slot
  91    "28" for a PCI slot
  92    "MEM" for memory resource
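
The string-valued arrays (ibm,drc-names and ibm,drc-types) share the same
layout: a 4-byte BE count followed by the terminated strings back to back. A
minimal sketch of an encoder for that layout follows. The function name
drc_encode_strings and its "return the required size" convention are
assumptions made for illustration, not an existing QEMU interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Encode n strings as: 4-byte BE count, then each string with its
 * terminating 0 byte. Returns the number of bytes required. The buffer
 * is only written when it is non-NULL and large enough, so a caller
 * can pass NULL first to size the buffer.
 */
static inline size_t drc_encode_strings(uint8_t *buf, size_t buflen,
                                        const char *const *strs, uint32_t n)
{
    size_t need = 4;
    uint32_t i;

    for (i = 0; i < n; i++) {
        need += strlen(strs[i]) + 1;
    }
    if (buf == NULL || buflen < need) {
        return need;
    }

    /* Entry count, big-endian. */
    buf[0] = (uint8_t)(n >> 24);
    buf[1] = (uint8_t)(n >> 16);
    buf[2] = (uint8_t)(n >> 8);
    buf[3] = (uint8_t)n;
    buf += 4;

    for (i = 0; i < n; i++) {
        size_t len = strlen(strs[i]) + 1; /* include the terminator */
        memcpy(buf, strs[i], len);
        buf += len;
    }
    return need;
}
```

Entry i of the encoded array describes the same DRC as entry i of
ibm,drc-indexes, so the caller is expected to pass the strings in that order.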
  93
  94== Guest->Host interface to manage dynamic resources ==
  95
  96Each DRC is given a globally unique DRC Index, and resources associated with
  97a particular DRC are configured/managed by the guest via a number of RTAS
  98calls which reference individual DRCs based on the DRC index. This can be
  99considered the guest->host interface.
 100
 101rtas-set-power-level:
 102  arg[0]: integer identifying power domain
 103  arg[1]: new power level for the domain, 0-100
 104  output[0]: status, 0 on success
 105  output[1]: power level after command
 106
 107  Set the power level for a specified power domain
 108
 109rtas-get-power-level:
 110  arg[0]: integer identifying power domain
 111  output[0]: status, 0 on success
 112  output[1]: current power level
 113
 114  Get the power level for a specified power domain
 115
 116rtas-set-indicator:
 117  arg[0]: integer identifying sensor/indicator type
 118  arg[1]: index of sensor, for DR-related sensors this is generally the
 119          DRC index
 120  arg[2]: desired sensor value
 121  output[0]: status, 0 on success
 122
 123  Set the state of an indicator or sensor. For the purpose of this document we
 124  focus on the indicator/sensor types associated with a DRC. The types are:
 125
 126    9001: isolation-state, controls/indicates whether a device has been made
 127          accessible to a guest
 128
 129          supported sensor values:
  130            0: isolate, device is made inaccessible to guest OS
 131            1: unisolate, device is made available to guest OS
 132
 133    9002: dr-indicator, controls "visual" indicator associated with device
 134
 135          supported sensor values:
 136            0: inactive, resource may be safely removed
 137            1: active, resource is in use and cannot be safely removed
 138            2: identify, used to visually identify slot for interactive hotplug
 139            3: action, in most cases, used in the same manner as identify
 140
 141    9003: allocation-state, generally only used for "logical" DR resources to
 142          request the allocation/deallocation of a resource prior to acquiring
 143          it via isolation-state->unisolate, or after releasing it via
  144          isolation-state->isolate, respectively. For "physical" DR (like PCI
 145          hotplug/unplug) the pre-allocation of the resource is implied and
 146          this sensor is unused.
 147
 148          supported sensor values:
 149            0: unusable, tell firmware/system the resource can be
 150               unallocated/reclaimed and added back to the system resource pool
 151            1: usable, request the resource be allocated/reserved for use by
 152               guest OS
 153            2: exchange, used to allocate a spare resource to use for fail-over
 154               in certain situations. unused in QEMU
 155            3: recover, used to reclaim a previously allocated resource that's
 156               not currently allocated to the guest OS. unused in QEMU
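
The three DRC-related indicator types and their value ranges can be summarized
in code. The sketch below only checks that a value is one of those listed
above for the given type. The constant and function names are hypothetical and
chosen for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Indicator types accepted by rtas-set-indicator for DRCs. */
enum {
    DR_INDICATOR_ISOLATION_STATE  = 9001,
    DR_INDICATOR_DR_INDICATOR     = 9002,
    DR_INDICATOR_ALLOCATION_STATE = 9003,
};

/* Return true if 'value' is a listed value for indicator 'type'. */
static inline bool dr_indicator_value_valid(uint32_t type, uint32_t value)
{
    switch (type) {
    case DR_INDICATOR_ISOLATION_STATE:
        return value <= 1;  /* 0: isolate, 1: unisolate */
    case DR_INDICATOR_DR_INDICATOR:
        return value <= 3;  /* inactive, active, identify, action */
    case DR_INDICATOR_ALLOCATION_STATE:
        return value <= 3;  /* unusable, usable, exchange, recover */
    default:
        return false;       /* not a DRC-related indicator */
    }
}
```

A value that passes this check may still be rejected by the host, for example
exchange and recover, which are listed as unused in QEMU.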
 157
 158rtas-get-sensor-state:
 159  arg[0]: integer identifying sensor/indicator type
 160  arg[1]: index of sensor, for DR-related sensors this is generally the
 161          DRC index
 162  output[0]: status, 0 on success
 163
  164  Used to read an indicator or sensor value, returned in output[1].
 165
 166  For DR-related operations, the only noteworthy sensor is dr-entity-sense,
 167  which has a type value of 9003, as allocation-state does in the case of
 168  rtas-set-indicator. The semantics/encodings of the sensor values are distinct
 169  however:
 170
 171  supported sensor values for dr-entity-sense (9003) sensor:
 172    0: empty,
 173         for physical resources: DRC/slot is empty
 174         for logical resources: unused
 175    1: present,
 176         for physical resources: DRC/slot is populated with a device/resource
 177         for logical resources: resource has been allocated to the DRC
 178    2: unusable,
 179         for physical resources: unused
 180         for logical resources: DRC has no resource allocated to it
 181    3: exchange,
 182         for physical resources: unused
 183         for logical resources: resource available for exchange (see
 184           allocation-state sensor semantics above)
 185    4: recovery,
 186         for physical resources: unused
 187         for logical resources: resource available for recovery (see
 188           allocation-state sensor semantics above)
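
Since the meaning of a dr-entity-sense value depends on whether the DRC manages
a physical or a logical resource, a guest-side consumer might encode the table
above as follows. This is a sketch with hypothetical names. It reports whether
a value is one the table marks as used for that class of resource.

```c
#include <stdbool.h>
#include <stdint.h>

/* dr-entity-sense (9003) values returned by rtas-get-sensor-state. */
enum dr_entity_sense {
    DR_ENTITY_SENSE_EMPTY    = 0,
    DR_ENTITY_SENSE_PRESENT  = 1,
    DR_ENTITY_SENSE_UNUSABLE = 2,
    DR_ENTITY_SENSE_EXCHANGE = 3,
    DR_ENTITY_SENSE_RECOVERY = 4,
};

/*
 * Return true if 'value' is meaningful for the resource class:
 * logical == false for physical resources, true for logical ones.
 */
static inline bool dr_entity_sense_expected(uint32_t value, bool logical)
{
    switch (value) {
    case DR_ENTITY_SENSE_EMPTY:
        return !logical;    /* only physical slots report "empty" */
    case DR_ENTITY_SENSE_PRESENT:
        return true;        /* used by both classes */
    case DR_ENTITY_SENSE_UNUSABLE:
    case DR_ENTITY_SENSE_EXCHANGE:
    case DR_ENTITY_SENSE_RECOVERY:
        return logical;     /* only logical resources use these */
    default:
        return false;
    }
}
```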
 189
 190rtas-ibm-configure-connector:
 191  arg[0]: guest physical address of 4096-byte work area buffer
 192  arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero
 193          if a prior RTAS response indicated a need for additional memory
 194  output[0]: status:
 195               0: completed transmittal of device-tree node
 196               1: instruct guest to prepare for next DT sibling node
 197               2: instruct guest to prepare for next DT child node
 198               3: instruct guest to prepare for next DT property
 199               4: instruct guest to ascend to parent DT node
 200               5: instruct guest to provide additional work-area buffer
 201                  via arg[1]
 202            990x: instruct guest that operation took too long and to try
 203                  again later
 204
 205  Used to fetch an OF device-tree description of the resource associated with
 206  a particular DRC. The DRC index is encoded in the first 4-bytes of the first
 207  work area buffer.
 208
 209  Work area layout, using 4-byte offsets:
 210    wa[0]: DRC index of the DRC to fetch device-tree nodes from
 211    wa[1]: 0 (hard-coded)
 212    wa[2]: for next-sibling/next-child response:
 213             wa offset of null-terminated string denoting the new node's name
 214           for next-property response:
 215             wa offset of null-terminated string denoting new property's name
 216    wa[3]: for next-property response (unused otherwise):
 217             byte-length of new property's value
 218    wa[4]: for next-property response (unused otherwise):
 219             new property's value, encoded as an OFDT-compatible byte array
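
A guest typically calls rtas-ibm-configure-connector in a loop and dispatches
on output[0] until it sees status 0. The sketch below covers only that
dispatch step. The enum and function names are hypothetical, and treating every
status not listed above as an error is an assumption of this example.

```c
#include <stdint.h>

/* Actions for the listed status codes; values 0..5 match the codes. */
enum cc_action {
    CC_ACTION_DONE = 0,          /* device-tree node fully transmitted */
    CC_ACTION_NEXT_SIBLING = 1,
    CC_ACTION_NEXT_CHILD = 2,
    CC_ACTION_NEXT_PROPERTY = 3,
    CC_ACTION_PREV_PARENT = 4,
    CC_ACTION_MORE_MEMORY = 5,   /* supply a second work area in arg[1] */
    CC_ACTION_RETRY_LATER,       /* 990x: try the call again later */
    CC_ACTION_ERROR,             /* assumption: anything else is an error */
};

/* Map output[0] of rtas-ibm-configure-connector to the guest's next step. */
static inline enum cc_action cc_classify_status(int32_t status)
{
    if (status >= 0 && status <= 5) {
        return (enum cc_action)status;
    }
    if (status / 10 == 990) {    /* 9900..9909, i.e. "990x" */
        return CC_ACTION_RETRY_LATER;
    }
    return CC_ACTION_ERROR;
}
```

For the next-sibling, next-child and next-property cases the guest then reads
the name offset from wa[2] of the work area before issuing the next call.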
 220
 221== hotplug/unplug events ==
 222
 223For most DR operations, the hypervisor will issue host->guest add/remove events
 224using the EPOW/check-exception notification framework, where the host issues a
 225check-exception interrupt, then provides an RTAS event log via an
 226rtas-check-exception call issued by the guest in response. This framework is
  227documented by PAPR+ v2.7, and is already in use by QEMU for generating powerdown
 228requests via EPOW events.
 229
 230For DR, this framework has been extended to include hotplug events, which were
 231previously unneeded due to direct manipulation of DR-related guest userspace
 232tools by host-level management such as an HMC. This level of management is not
 233applicable to PowerKVM, hence the reason for extending the notification
 234framework to support hotplug events.
 235
 236The format for these EPOW-signalled events is described below under
 237"hotplug/unplug event structure". Note that these events are not
 238formally part of the PAPR+ specification, and have been superseded by a
 239newer format, also described below under "hotplug/unplug event structure",
 240and so are now deemed a "legacy" format. The formats are similar, but the
 241"modern" format contains additional fields/flags, which are denoted for the
 242purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards.
 243
 244QEMU should assume support only for "legacy" fields/flags unless the guest
 245advertises support for the "modern" format via ibm,client-architecture-support
  246hcall by setting byte 5, bit 6 of its ibm,architecture-vec-5 option vector
 247structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events,
 248"modern" format events are surfaced to the guest via check-exception RTAS calls,
 249but use a dedicated event source to signal the guest. This event source is
 250advertised to the guest by the addition of a "hot-plug-events" node under
  251the "/event-sources" node of the guest's device tree using the standard format
 252described in LoPAPR v11, B.6.12.1.
 253
 254== hotplug/unplug event structure ==
 255
 256The hotplug-specific payload in QEMU is implemented as follows (with all values
 257encoded in big-endian format):
 258
 259struct rtas_event_log_v6_hp {
 260#define SECTION_ID_HOTPLUG              0x4850 /* HP */
 261    struct section_header {
 262        uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */
 263        uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp),
 264                                         * plus the length of the DRC name
 265                                         * if a DRC name identifier is
 266                                         * specified for hotplug_identifier
 267                                         */
 268        uint8_t section_version;        /* version 1 */
 269        uint8_t section_subtype;        /* unused */
 270        uint16_t creator_component_id;  /* unused */
 271    } hdr;
 272#define RTAS_LOG_V6_HP_TYPE_CPU         1
 273#define RTAS_LOG_V6_HP_TYPE_MEMORY      2
 274#define RTAS_LOG_V6_HP_TYPE_SLOT        3
 275#define RTAS_LOG_V6_HP_TYPE_PHB         4
 276#define RTAS_LOG_V6_HP_TYPE_PCI         5
 277    uint8_t hotplug_type;               /* type of resource/device */
 278#define RTAS_LOG_V6_HP_ACTION_ADD       1
 279#define RTAS_LOG_V6_HP_ACTION_REMOVE    2
 280    uint8_t hotplug_action;             /* action (add/remove) */
 281#define RTAS_LOG_V6_HP_ID_DRC_NAME          1
 282#define RTAS_LOG_V6_HP_ID_DRC_INDEX         2
 283#define RTAS_LOG_V6_HP_ID_DRC_COUNT         3
 284#ifdef GUEST_SUPPORTS_MODERN
 285#define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
 286#endif
 287    uint8_t hotplug_identifier;         /* type of the resource identifier,
 288                                         * which serves as the discriminator
 289                                         * for the 'drc' union field below
 290                                         */
 291#ifdef GUEST_SUPPORTS_MODERN
 292    uint8_t capabilities;               /* capability flags, currently unused
 293                                         * by QEMU
 294                                         */
 295#else
 296    uint8_t reserved;
 297#endif
 298    union {
 299        uint32_t index;                 /* DRC index of resource to take action
 300                                         * on
 301                                         */
 302        uint32_t count;                 /* number of DR resources to take
 303                                         * action on (guest chooses which)
 304                                         */
 305#ifdef GUEST_SUPPORTS_MODERN
 306        struct {
 307            uint32_t count;             /* number of DR resources to take
 308                                         * action on
 309                                         */
 310            uint32_t index;             /* DRC index of first resource to take
 311                                         * action on. guest will take action
 312                                         * on DRC index <index> through
 313                                         * DRC index <index + count - 1> in
 314                                         * sequential order
 315                                         */
 316        } count_indexed;
 317#endif
 318        char name[1];                   /* string representing the name of the
 319                                         * DRC to take action on
 320                                         */
 321    } drc;
 322} QEMU_PACKED;
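
To make the byte layout concrete, the following sketch serializes a
"legacy" event that identifies its target by DRC index. It writes the fields
byte by byte instead of relying on a packed struct, so it does not depend on
host endianness. The function name hp_event_encode_index is hypothetical. The
section length of 16 is an assumption based on counting only the fields
written here. An implementation that sizes the section from the full
structure, including the larger count_indexed member, would report a larger
length.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTION_ID_HOTPLUG          0x4850 /* HP */
#define RTAS_LOG_V6_HP_TYPE_MEMORY  2
#define RTAS_LOG_V6_HP_ACTION_ADD   1
#define RTAS_LOG_V6_HP_ID_DRC_INDEX 2
#define HP_SECTION_LEN_INDEX        16     /* 8 header + 4 + 4 (assumed) */

static inline void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static inline void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Returns the number of bytes written, or 0 if the buffer is too small. */
static inline size_t hp_event_encode_index(uint8_t *buf, size_t buflen,
                                           uint8_t type, uint8_t action,
                                           uint32_t drc_index)
{
    if (buflen < HP_SECTION_LEN_INDEX) {
        return 0;
    }
    put_be16(buf + 0, SECTION_ID_HOTPLUG);    /* section_id */
    put_be16(buf + 2, HP_SECTION_LEN_INDEX);  /* section_length */
    buf[4] = 1;                               /* section_version */
    buf[5] = 0;                               /* section_subtype, unused */
    put_be16(buf + 6, 0);                     /* creator_component_id */
    buf[8] = type;                            /* hotplug_type */
    buf[9] = action;                          /* hotplug_action */
    buf[10] = RTAS_LOG_V6_HP_ID_DRC_INDEX;    /* hotplug_identifier */
    buf[11] = 0;                              /* reserved */
    put_be32(buf + 12, drc_index);            /* drc.index */
    return HP_SECTION_LEN_INDEX;
}
```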
 323
 324== ibm,lrdr-capacity ==
 325
 326ibm,lrdr-capacity is a property in the /rtas device tree node that identifies
  327the dynamic reconfiguration capabilities of the guest. It is a triple
 328consisting of <phys>, <size> and <maxcpus>.
 329
 330  <phys>, encoded in BE format represents the maximum address in bytes and
 331  hence the maximum memory that can be allocated to the guest.
 332
 333  <size>, encoded in BE format represents the size increments in which
 334  memory can be hot-plugged to the guest.
 335
 336  <maxcpus>, a BE-encoded integer, represents the maximum number of
 337  processors that the guest can have.
 338
 339pseries guests use this property to note the maximum allowed CPUs for the
 340guest.
 341
 342== ibm,dynamic-reconfiguration-memory ==
 343
 344ibm,dynamic-reconfiguration-memory is a device tree node that represents
 345dynamically reconfigurable logical memory blocks (LMB). This node
 346is generated only when the guest advertises the support for it via
 347ibm,client-architecture-support call. Memory that is not dynamically
 348reconfigurable is represented by /memory nodes. The properties of this
 349node that are of interest to the sPAPR memory hotplug implementation
 350in QEMU are described here.
 351
 352ibm,lmb-size
 353
  354This 64-bit integer defines the size of each dynamically reconfigurable LMB.
 355
 356ibm,associativity-lookup-arrays
 357
 358This property defines a lookup array in which the NUMA associativity
 359information for each LMB can be found. It is a property encoded array
 360that begins with an integer M, the number of associativity lists followed
 361by an integer N, the number of entries per associativity list and terminated
 362by M associativity lists each of length N integers.
 363
 364This property provides the same information as given by ibm,associativity
 365property in a /memory node. Each assigned LMB has an index value between
 3660 and M-1 which is used as an index into this table to select which
 367associativity list to use for the LMB. This index value for each LMB
 368is defined in ibm,dynamic-memory property.
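
The lookup described above amounts to indexing into a flat array of cells. A
minimal sketch follows, assuming the property has already been converted from
BE into an array of host-endian 32-bit cells. The function name assoc_lookup is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * cells[0] = M (number of lists), cells[1] = N (entries per list),
 * followed by M lists of N cells each. Returns a pointer to list 'idx',
 * or NULL if the index or the array size is inconsistent.
 */
static inline const uint32_t *assoc_lookup(const uint32_t *cells,
                                           size_t ncells, uint32_t idx)
{
    uint32_t m, n;

    if (ncells < 2) {
        return NULL;
    }
    m = cells[0];
    n = cells[1];
    if (idx >= m) {
        return NULL;                     /* valid indexes are 0..M-1 */
    }
    if ((uint64_t)m * n > ncells - 2) {
        return NULL;                     /* property is truncated */
    }
    return &cells[2 + (size_t)idx * n];
}
```

The returned list holds N cells, the same information a /memory node would
carry in its ibm,associativity property.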
 369
 370ibm,dynamic-memory
 371
 372This property describes the dynamically reconfigurable memory. It is a
 373property encoded array that has an integer N, the number of LMBs followed
 374by N LMB list entries.
 375
 376Each LMB list entry consists of the following elements:
 377
  378- Logical address of the start of the LMB encoded as a 64-bit integer. This
 379  corresponds to reg property in /memory node.
 380- DRC index of the LMB that corresponds to ibm,my-drc-index property
 381  in a /memory node.
 382- Four bytes reserved for expansion.
 383- Associativity list index for the LMB that is used as an index into
 384  ibm,associativity-lookup-arrays property described earlier. This
 385  is used to retrieve the right associativity list to be used for this
 386  LMB.
  387- A 32-bit flags word. The bit at bit position 0x00000008 defines whether
 388  the LMB is assigned to the partition as of boot time.
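
Taken together, the elements above give a 24-byte entry. The sketch below
decodes entry i from the raw property bytes. It assumes that N, the DRC index
and the associativity list index are each 4-byte BE integers, which the text
above does not state explicitly. The struct and function names are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LMB_ENTRY_SIZE    24u          /* 8 + 4 + 4 + 4 + 4 bytes */
#define LMB_FLAG_ASSIGNED 0x00000008u  /* LMB assigned at boot time */

struct lmb_entry {
    uint64_t addr;         /* logical start address of the LMB */
    uint32_t drc_index;    /* DRC index of the LMB */
    uint32_t reserved;     /* reserved for expansion */
    uint32_t assoc_index;  /* index into ibm,associativity-lookup-arrays */
    uint32_t flags;
};

static inline uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static inline uint64_t get_be64(const uint8_t *p)
{
    return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* Decode entry i of ibm,dynamic-memory; false if i is out of range. */
static inline bool lmb_entry_parse(const uint8_t *prop, size_t len,
                                   uint32_t i, struct lmb_entry *out)
{
    const uint8_t *p;

    if (len < 4 || i >= get_be32(prop)) {
        return false;                    /* beyond the advertised count N */
    }
    if ((len - 4) / LMB_ENTRY_SIZE <= i) {
        return false;                    /* property is truncated */
    }
    p = prop + 4 + (size_t)i * LMB_ENTRY_SIZE;
    out->addr = get_be64(p);
    out->drc_index = get_be32(p + 8);
    out->reserved = get_be32(p + 12);
    out->assoc_index = get_be32(p + 16);
    out->flags = get_be32(p + 20);
    return true;
}
```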
 389
 390ibm,dynamic-memory-v2
 391
 392This property describes the dynamically reconfigurable memory. This is
 393an alternate and newer way to describe dynamically reconfigurable memory.
 394It is a property encoded array that has an integer N (the number of
 395LMB set entries) followed by N LMB set entries. There is an LMB set entry
 396for each sequential group of LMBs that share common attributes.
 397
 398Each LMB set entry consists of the following elements:
 399
  400- Number of sequential LMBs in the entry represented by a 32-bit integer.
  401- Logical address of the first LMB in the set encoded as a 64-bit integer.
 402- DRC index of the first LMB in the set.
 403- Associativity list index that is used as an index into
 404  ibm,associativity-lookup-arrays property described earlier. This
 405  is used to retrieve the right associativity list to be used for all
 406  the LMBs in this set.
  407- A 32-bit flags word that applies to all the LMBs in the set.
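
An LMB set entry is a compressed form of several consecutive ibm,dynamic-memory
entries. The sketch below expands the k-th LMB of an already decoded set. It
assumes that addresses advance by ibm,lmb-size per LMB and that DRC indexes
within a set are consecutive. The second point is an assumption of this
example, since the text above only gives the index of the first LMB. All names
are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* One decoded ibm,dynamic-memory-v2 set entry (host-endian). */
struct lmb_set {
    uint32_t seq_lmbs;     /* number of sequential LMBs in the set */
    uint64_t base_addr;    /* logical address of the first LMB */
    uint32_t drc_index;    /* DRC index of the first LMB */
    uint32_t assoc_index;  /* shared associativity list index */
    uint32_t flags;        /* shared flags word */
};

/* Attributes of a single LMB derived from a set. */
struct lmb_info {
    uint64_t addr;
    uint32_t drc_index;
    uint32_t assoc_index;
    uint32_t flags;
};

/* Expand LMB number k (0-based) of the set; false if k is out of range. */
static inline bool lmb_set_expand(const struct lmb_set *set,
                                  uint64_t lmb_size, uint32_t k,
                                  struct lmb_info *out)
{
    if (k >= set->seq_lmbs) {
        return false;
    }
    out->addr = set->base_addr + (uint64_t)k * lmb_size;
    out->drc_index = set->drc_index + k;   /* assumed consecutive */
    out->assoc_index = set->assoc_index;   /* common to the whole set */
    out->flags = set->flags;               /* common to the whole set */
    return true;
}
```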
 408