Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
==============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with the existing Linux kernel driver as is; no special guest
modifications are needed.

While it is compatible with the VMware device, it can also communicate with
bare-metal RDMA-enabled machines as peers.

It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

It does not require all of the guest RAM to be pinned, which allows memory
over-commit. Migration is not implemented yet, but will be possible with
some hardware assistance.

A project presentation accompanies this document:
- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SR-IOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but is not loaded by default.
Install the user-level library (librxe) following the instructions from:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an Ethernet interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as the pvrdma
backend.


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port. For Mellanox cards
this will be something like mlx5_6, which can serve as the backend.


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag and install
the required RDMA libraries.



3. Usage
========


3.1 VM Memory settings
======================
Currently the device works only with memory-backed RAM,
and the memory must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \


3.2 MAD Multiplexer
===================
The MAD Multiplexer is a service that exposes a MAD-like interface to VMs in
order to overcome the limitation that only a single entity can register with
the MAD layer to send and receive RDMA-CM MAD packets.

To build rdmacm-mux run
# make rdmacm-mux

Before running rdmacm-mux, make sure that neither the ib_cm nor the rdma_cm
kernel module is loaded; otherwise the rdmacm-mux service will fail to start.

The application accepts 3 command line arguments and exposes a UNIX socket
to pass control and data to it.
-d rdma-device-name  Name of RDMA device to register with
-s unix-socket-path  Path to unix socket to listen on (default /var/run/rdmacm-mux)
-p rdma-device-port  Port number of RDMA device to register with (default 1)
The final UNIX socket file name is a concatenation of the 3 arguments, so
for example, for device mlx5_0 on port 2 the socket
/var/run/rdmacm-mux-mlx5_0-2 will be created.

pvrdma requires this service.

Please refer to contrib/rdmacm-mux for more details.

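As a minimal sketch of the socket naming rule described above, the following
Python helper composes the socket path from the three arguments. The helper
name mux_socket_path is hypothetical, and the "-" separator is inferred from
the /var/run/rdmacm-mux-mlx5_0-2 example rather than taken from the
rdmacm-mux source.

```python
def mux_socket_path(device, port=1, prefix="/var/run/rdmacm-mux"):
    """Compose the rdmacm-mux UNIX socket path.

    Assumed rule (inferred from the documented example):
    <unix-socket-path>-<rdma-device-name>-<rdma-device-port>
    """
    if not device:
        raise ValueError("an RDMA device name is required")
    return f"{prefix}-{device}-{port}"
```

With the documented defaults, mux_socket_path("mlx5_0", 2) yields
/var/run/rdmacm-mux-mlx5_0-2, and mux_socket_path("rxe0") yields the
/var/run/rdmacm-mux-rxe0-1 path used by the chardev in section 3.6.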

3.3 Service exposed by libvirt daemon
=====================================
Control over the RDMA device's GID table is done by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also holds: whenever an
address is removed, the corresponding GID entry is removed.
The process is done by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt, which updates the address of the
backend Ethernet device.

pvrdma requires the libvirt service to be running.

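To illustrate how addresses can turn into GID entries, here is a small Python
sketch. It assumes the common RoCE convention, which this document does not
spell out: the MAC-based entry is a link-local address built from the
modified EUI-64 of the MAC, and an IPv4 address is stored in IPv4-mapped IPv6
form. The function names are hypothetical.

```python
import ipaddress


def mac_to_gid(mac):
    """Assumed convention: fe80::/64 prefix plus modified EUI-64 of the MAC."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


def ipv4_to_gid(addr):
    """Assumed convention: IPv4-mapped IPv6 address (::ffff:a.b.c.d)."""
    packed = ipaddress.IPv4Address(addr).packed
    return ipaddress.IPv6Address(bytes(10) + b"\xff\xff" + packed)
```

For the MAC address used in the example of section 3.6, 56:b4:44:e9:62:dc,
this sketch gives fe80::54b4:44ff:fee9:62dc.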

3.4 PCI devices settings
========================
A RoCE device exposes two functions: Ethernet and RDMA.
To support this, the pvrdma device is composed of two PCI functions: an
Ethernet device of type vmxnet3 on PCI function 0 and a PVRDMA device on
PCI function 1. The Ethernet function can be used for other Ethernet
purposes such as IP.


3.5 Device parameters
=====================
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the Ethernet
  device used to create it.
- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
- mad-chardev: The name of the MAD multiplexer char device.
- ibport: In case of a multi-port device (such as Mellanox's HCA) this
  specifies the port to use. If not set, 1 will be used.
- dev-caps-max-mr-size: The maximum size of MR.
- dev-caps-max-qp:      Maximum number of QPs.
- dev-caps-max-cq:      Maximum number of CQs.
- dev-caps-max-mr:      Maximum number of MRs.
- dev-caps-max-pd:      Maximum number of PDs.
- dev-caps-max-ah:      Maximum number of AHs.

Notes:
- The first 3 parameters are mandatory settings; the rest have
  defaults.
- The parameters prefixed by dev-caps define upper limits, but the final
  values are adjusted to the backend device's limitations.
- netdev can be extracted from ibdev's sysfs
  (/sys/class/infiniband/<ibdev>/device/net/)

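The sysfs lookup mentioned in the last note can be sketched in Python as
follows. The function name and the sysfs_root parameter are hypothetical; the
parameter exists only so that the lookup can be pointed at a test directory.

```python
import os


def netdevs_for_ibdev(ibdev, sysfs_root="/sys/class/infiniband"):
    """List candidate netdev names found under <ibdev>/device/net/."""
    net_dir = os.path.join(sysfs_root, ibdev, "device", "net")
    try:
        return sorted(os.listdir(net_dir))
    except FileNotFoundError:
        # Unknown ibdev, or a device that has no device/net directory.
        return []
```

An empty list means that no Ethernet function was found for the given ibdev
at that path; in that case the netdev has to be determined by other means.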

3.6 Example
===========
Define bridge device with vmxnet3 network backend:
<interface type='bridge'>
  <mac address='56:b4:44:e9:62:dc'/>
  <source bridge='bridge1'/>
  <model type='vmxnet3'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
</interface>

Define pvrdma device:
<qemu:commandline>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
</qemu:commandline>

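The argument values in the example above follow directly from the device
parameters, so they can be generated. The sketch below is one possible way to
do that in Python; the function name, its defaults and the fixed ids mb1 and
mads are taken from the example or chosen for illustration, and passing
ibport only when it differs from 1 is an assumption based on the documented
default.

```python
def pvrdma_qemu_args(ibdev, netdev, addr="10.1", mem="1G", ibport=1,
                     mux_prefix="/var/run/rdmacm-mux"):
    """Build the QEMU argument list used in the libvirt example."""
    device = f"pvrdma,addr={addr},ibdev={ibdev},netdev={netdev},mad-chardev=mads"
    if ibport != 1:
        device += f",ibport={ibport}"  # documented default is port 1
    return [
        "-object", f"memory-backend-ram,id=mb1,size={mem},share",
        "-numa", "node,memdev=mb1",
        "-chardev", f"socket,path={mux_prefix}-{ibdev}-{ibport},id=mads",
        "-device", device,
    ]
```

Calling pvrdma_qemu_args("rxe0", "bridge0") reproduces the eight qemu:arg
values shown above, in the same order.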


4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the guest driver and the host
ibdevice interface.
On configuration path:
 - For every hardware resource request (PD/QP/CQ/...) the pvrdma device will
   request a resource from the backend interface, maintaining a 1-1 mapping
   between the guest and host.
On data path:
 - Every post_send/receive received from the guest will be converted into
   a post_send/receive for the backend. The buffer data will not be touched
   or copied, resulting in near bare-metal performance for large enough
   buffers.
 - Completions from the backend interface will result in completions for
   the pvrdma device.

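The 1-1 mapping on the configuration path can be pictured as a handle table.
The Python sketch below is only a model of the idea, not the data structure
QEMU uses: the class name is hypothetical, and sizing the table by a
dev-caps-max-* style limit is an assumption.

```python
class ResourceTable:
    """Maps each guest-visible handle to exactly one backend object."""

    def __init__(self, max_entries):
        self._slots = [None] * max_entries

    def alloc(self, backend_obj):
        """Store a backend object and return the guest handle for it."""
        for handle, slot in enumerate(self._slots):
            if slot is None:
                self._slots[handle] = backend_obj
                return handle
        raise RuntimeError("resource table full")

    def get(self, handle):
        """Return the backend object behind a guest handle."""
        obj = self._slots[handle]
        if obj is None:
            raise KeyError(handle)
        return obj

    def dealloc(self, handle):
        """Free a handle so that it can be reused."""
        self._slots[handle] = None
```

A table like this would exist once per resource type (PD, CQ, QP and so on),
and the handle is what the guest driver passes back in later commands.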

4.2 PCI BARs
============
PCI BARs:
        BAR 0 - MSI-X
        MSI-X vectors:
                (0) Command - used when execution of a command is completed.
                (1) Async - not in use.
                (2) Completion - used when a completion event is placed in
                    device's CQ ring.
        BAR 1 - Registers
        --------------------------------------------------------
        | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC |
        --------------------------------------------------------
                DSR - Address of driver/device shared memory used
                      for the command channel, used for passing:
                            - General info such as driver version
                            - Address of 'command' and 'response'
                            - Address of async ring
                            - Address of device's CQ ring
                            - Device capabilities
                CTL - Device control operations (activate, reset etc)
                IMR - Set interrupt mask
                REQ - Command execution register
                ERR - Operation status

        BAR 2 - UAR
        ---------------------------------------------------------
        | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
        ---------------------------------------------------------
                - Offset 0 used for QP operations (send and recv)
                - Offset 4 used for CQ operations (arm and poll)

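A doorbell written to the UAR packs a handle and a flag into one 32-bit
value. The Python sketch below models the QP receive doorbell only. The two
offsets and the receive flag in bit 31 come from this document (see also
section 4.3.3); the 24-bit handle mask and all constant and function names
are assumptions made for the illustration.

```python
UAR_QP_OFFSET = 0              # from this document: QP operations
UAR_CQ_OFFSET = 4              # from this document: CQ operations
UAR_QP_RECV = 1 << 31          # from this document: recv flag is bit 31
UAR_HANDLE_MASK = 0x00FFFFFF   # assumption: handle lives in the low 24 bits


def encode_recv_doorbell(qpn):
    """Value the guest writes at UAR_QP_OFFSET to ring a receive doorbell."""
    if qpn & ~UAR_HANDLE_MASK:
        raise ValueError("QP number does not fit in the handle field")
    return qpn | UAR_QP_RECV


def decode_qp_doorbell(value):
    """Device side: split a QP doorbell into (qpn, is_recv)."""
    return value & UAR_HANDLE_MASK, bool(value & UAR_QP_RECV)
```

For example, ringing the receive doorbell of QP 5 writes 0x80000005, which
the device decodes back to QP number 5 with the receive flag set.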

4.3 Major flows
===============

4.3.1 Create CQ
===============
    - Guest driver
        - Allocates pages for CQ ring
        - Creates page directory (pdir) to hold CQ ring's pages
        - Initializes CQ ring
        - Initializes 'Create CQ' command object (cqe, pdir etc)
        - Copies the command to 'command' address
        - Writes 0 into REQ register
    - Device
        - Reads the request object from the 'command' address
        - Allocates CQ object and initializes CQ ring based on pdir
        - Creates the backend CQ
        - Writes operation status to ERR register
        - Posts command-interrupt to guest
    - Guest driver
        - Reads the HW response code from ERR register

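The command handshake above (command slot, REQ write, ERR status, command
interrupt) can be modelled with a toy device in Python. This is a sketch of
the control flow only: the class, the dictionary-based command object and the
status values 0 and 1 are hypothetical, and no real memory mapping or backend
CQ is involved.

```python
from dataclasses import dataclass, field
from typing import Optional

STATUS_OK = 0        # hypothetical status value
STATUS_INVALID = 1   # hypothetical status value


@dataclass
class ToyDevice:
    command: Optional[dict] = None   # stands in for the 'command' address
    err: Optional[int] = None        # stands in for the ERR register
    interrupts: list = field(default_factory=list)
    cqs: dict = field(default_factory=dict)

    def write_req(self, value):
        """Guest wrote REQ (the value is 0): execute the pending command."""
        cmd = self.command
        if cmd and cmd.get("op") == "create_cq" and cmd.get("cqe", 0) > 0:
            handle = len(self.cqs)
            # Allocate the CQ object; the ring would be mapped via pdir.
            self.cqs[handle] = {"ring": [None] * cmd["cqe"], "pdir": cmd["pdir"]}
            self.err = STATUS_OK
        else:
            self.err = STATUS_INVALID
        self.interrupts.append("command")  # command-interrupt to guest


def guest_create_cq(dev, cqe, pdir):
    """Guest driver side: fill the command, kick REQ, read ERR."""
    dev.command = {"op": "create_cq", "cqe": cqe, "pdir": pdir}
    dev.write_req(0)
    return dev.err
```

The same handshake applies to the other commands; only the command object
and the work done by the device between reading it and writing ERR differ.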

4.3.2 Create QP
===============
    - Guest driver
        - Allocates pages for send and receive rings
        - Creates page directory (pdir) to hold the ring's pages
        - Initializes 'Create QP' command object (max_send_wr,
          send_cq_handle, recv_cq_handle, pdir etc)
        - Copies the object to 'command' address
        - Writes 0 into REQ register
    - Device
        - Reads the request object from 'command' address
        - Allocates the QP object and initializes
            - Send and recv rings based on pdir
            - Send and recv ring state
        - Creates the backend QP
        - Writes the operation status to ERR register
        - Posts command-interrupt to guest
    - Guest driver
        - Reads the HW response code from ERR register


4.3.3 Post receive
==================
    - Guest driver
        - Initializes a wqe and places it on recv ring
        - Writes qpn|qp_recv_bit (bit 31) to the QP offset in UAR
    - Device
        - Extracts qpn from UAR
        - Walks through the ring and does the following for each wqe
            - Prepares the backend CQE context to be used when
              receiving completion from backend (wr_id, op_code, emu_cq_num)
            - For each sge prepares backend sge
            - Calls backend's post_recv


4.3.4 Process backend events
============================
    - Done by a dedicated thread used to process backend events;
      at initialization it is attached to the device and creates
      the communication channel.
    - Thread main loop:
        - Polls for completions
        - Extracts emu_cq_num, wr_id and op_code from context
        - Writes CQE to CQ ring
        - Writes CQ number to device's CQ ring
        - Sends completion-interrupt to guest
        - Deallocates context
        - Acks the event to backend

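The post-receive flow and the completion loop can be tied together in a small
Python model. It is a sketch under simplifying assumptions: the backend is a
toy that completes every posted receive in order, rings are plain Python
containers, and all class and function names are hypothetical.

```python
from collections import deque


class ToyBackend:
    """Stand-in for the host ibdevice: completes receives in posting order."""

    def __init__(self):
        self.posted = deque()

    def post_recv(self, sges, context):
        self.posted.append((sges, context))

    def poll_cq(self):
        while self.posted:
            _sges, context = self.posted.popleft()
            yield context


def device_post_recv(recv_ring, emu_cq_num, backend):
    """Walk the recv ring and post every wqe to the backend."""
    while recv_ring:
        wqe = recv_ring.popleft()
        # Context kept for the completion (wr_id, op_code, emu_cq_num).
        context = {"wr_id": wqe["wr_id"], "op_code": "recv",
                   "emu_cq_num": emu_cq_num}
        backend_sges = [(sge["addr"], sge["length"]) for sge in wqe["sges"]]
        backend.post_recv(backend_sges, context)


def process_backend_events(backend, cq_rings, interrupts):
    """One pass of the completion loop over all pending completions."""
    for context in backend.poll_cq():
        cqe = {"wr_id": context["wr_id"], "op_code": context["op_code"]}
        cq_rings[context["emu_cq_num"]].append(cqe)   # write CQE to CQ ring
        interrupts.append(("completion", context["emu_cq_num"]))
```

In this model the context dictionary plays the role of the backend CQE
context: it is created when the wqe is posted and consumed when the
completion arrives.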


5. Limitations
==============
- The device is limited by the features that the guest Linux driver
  implements from the VMware device API.
- The memory registration mechanism requires an mremap call for every page in
  the buffer in order to map it to a contiguous virtual address range. Since
  this is not on the data path it should not matter much. If the default max
  MR size is increased, be aware that memory registration can take up to
  0.5 seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size; otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. The limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to init if the requirements are not met.

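To put the memory registration figure into perspective, the short Python
calculation below derives the number of per-page mremap calls and the cost
per call implied by "0.5 seconds for 1GB". It assumes 4 KiB pages and reads
1GB as 2^30 bytes; both are assumptions, and the function names are
hypothetical.

```python
def mremap_calls(mr_bytes, page_size=4096):
    """Number of pages (and thus mremap calls) needed to cover the MR."""
    return -(-mr_bytes // page_size)  # ceiling division


def implied_cost_per_page_us(mr_bytes=1 << 30, total_seconds=0.5,
                             page_size=4096):
    """Per-page cost in microseconds implied by a total registration time."""
    return total_seconds * 1e6 / mremap_calls(mr_bytes, page_size)
```

Under these assumptions a 1GB MR needs 262144 mremap calls, which puts the
documented upper bound at roughly 1.9 microseconds per page.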


6. Performance
==============
By design, the pvrdma device causes a VM exit on each post-send/receive, so
for small buffers the performance is affected; however, for medium buffers it
comes close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance.
(Tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device.)

All of the above assumes that no memory registration is done on the data path.