= Device Specification for Inter-VM shared memory device =

The Inter-VM shared memory device (ivshmem) is designed to share a
memory region between multiple QEMU processes running different guests
and the host.  In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing
said memory to the guest as a PCI BAR.

The device can use a shared memory object on the host directly, or it
can obtain one from an ivshmem server.

In the latter case, the device can additionally interrupt its peers, and
get interrupted by its peers.


== Configuring the ivshmem PCI device ==

There are two basic configurations:

- Just shared memory:

      -device ivshmem-plain,memdev=HMB,...

  This uses host memory backend HMB.  It should have option "share"
  set.

- Shared memory plus interrupts:

      -device ivshmem-doorbell,chardev=CHR,vectors=N,...

  An ivshmem server must already be running on the host.  The device
  connects to the server's UNIX domain socket via character device
  CHR.

  Each peer gets assigned a unique ID by the server.  IDs must be
  between 0 and 65535.

  Interrupts are message-signaled (MSI-X).  vectors=N configures the
  number of vectors to use.

For more details on ivshmem device properties, see the QEMU Emulator
user documentation.
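
As an illustration of the two configurations, the following sketch
builds the corresponding QEMU argument lists.  The helper names, the
IDs and the paths are hypothetical.  The memory-backend-file and
socket chardev option strings are assumptions taken from common QEMU
usage; check them against the user documentation of your QEMU version.

```python
def ivshmem_plain_args(memdev_id, shm_path, size):
    """Arguments for a shared-memory-only device (no interrupts)."""
    return [
        "-object",
        # Assumed backend syntax; "share=on" is the "share" option the
        # spec says the host memory backend should have set.
        f"memory-backend-file,id={memdev_id},mem-path={shm_path},"
        f"size={size},share=on",
        "-device",
        f"ivshmem-plain,memdev={memdev_id}",
    ]


def ivshmem_doorbell_args(chardev_id, socket_path, vectors):
    """Arguments for a device with interrupts; needs a running server."""
    return [
        "-chardev",
        # Assumed chardev syntax for the server's UNIX domain socket.
        f"socket,id={chardev_id},path={socket_path}",
        "-device",
        f"ivshmem-doorbell,chardev={chardev_id},vectors={vectors}",
    ]
```

For example, ivshmem_plain_args("hostmem", "/dev/shm/ivshmem", "1M")
yields a list that can be appended to an existing QEMU command line.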


== The ivshmem PCI device's guest interface ==

The device has vendor ID 1af4, device ID 1110, revision 1.  Before
QEMU 2.6.0, it had revision 0.

=== PCI BARs ===

The ivshmem PCI device has two or three BARs:

- BAR0 holds device registers (256 bytes of MMIO)
- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell)
- BAR2 maps the shared memory object

There are two ways to use this device:

- If you only need the shared memory part, BAR2 suffices.  This way,
  you have access to the shared memory in the guest and can use it as
  you see fit.  Memnic, for example, uses ivshmem this way from guest
  user space (see http://dpdk.org/browse/memnic).

- If you additionally need the capability for peers to interrupt each
  other, you need BAR0 and BAR1.  You will most likely want to write a
  kernel driver to handle interrupts.  This requires the device to be
  configured for interrupts.
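
For the shared-memory-only case, one possible way to reach BAR2 from
guest user space is to map the PCI resource file that a Linux guest
exposes in sysfs.  This is a minimal sketch under that assumption; the
function name is hypothetical, and nothing here is prescribed by the
device.

```python
import mmap
import os


def map_shared_memory(resource_path):
    """Map a whole file read/write and shared, and return the mapping.

    resource_path is assumed to be a PCI resource file such as
    /sys/bus/pci/devices/<address>/resource2 in a Linux guest.
    """
    fd = os.open(resource_path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        return mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # The mapping stays valid after the descriptor is closed.
        os.close(fd)
```

The returned object supports slicing, so a guest program could write
with shm[0:5] = b"hello" and a peer mapping the same memory would see
those bytes.  Any synchronization between peers is left to the
application.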

Before QEMU 2.6.0, BAR2 can initially be invalid if the device is
configured for interrupts.  It becomes safely accessible only after
the ivshmem server has provided the shared memory.  These devices have
PCI revision 0 rather than 1.  Guest software should wait for the
IVPosition register (described below) to become non-negative before
accessing BAR2.

Revision 0 of the device cannot tell guest software whether it is
configured for interrupts.

=== PCI device registers ===

BAR 0 contains the following registers:

    Offset  Size  Access      On reset  Function
        0     4   read/write        0   Interrupt Mask
                                        bit 0: peer interrupt (rev 0)
                                               reserved       (rev 1)
                                        bit 1..31: reserved
        4     4   read/write        0   Interrupt Status
                                        bit 0: peer interrupt (rev 0)
                                               reserved       (rev 1)
                                        bit 1..31: reserved
        8     4   read-only   0 or ID   IVPosition
       12     4   write-only      N/A   Doorbell
                                        bit 0..15: vector
                                        bit 16..31: peer ID
       16   240   none            N/A   reserved

Software should only access the registers as specified in column
"Access".  Reserved bits should be ignored on read, and preserved on
write.

In revision 0 of the device, the Interrupt Status and Interrupt Mask
registers together control the legacy INTx interrupt when the device
has no MSI-X capability: INTx is asserted when the bit-wise AND of
Status and Mask is non-zero.  Interrupt Status register bit 0 becomes 1
when an interrupt request from a peer is received.  Reading the
register clears it.
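
The revision 0 INTx behavior can be pictured with a small software
model.  This is only a sketch of the rules stated above for a device
without MSI-X capability; the class and method names are hypothetical
and do not correspond to QEMU code.

```python
PEER_INTERRUPT = 1 << 0  # bit 0 of Interrupt Mask and Interrupt Status


class Rev0Interrupts:
    """Model of Interrupt Mask/Status on a revision 0 device, no MSI-X."""

    def __init__(self):
        self.mask = 0    # value on reset
        self.status = 0  # value on reset

    def peer_interrupt(self):
        """A peer requested an interrupt: Status bit 0 becomes 1."""
        self.status |= PEER_INTERRUPT

    def intx_asserted(self):
        """INTx is asserted while (Status AND Mask) is non-zero."""
        return (self.status & self.mask) != 0

    def read_status(self):
        """Reading Interrupt Status returns it and clears it."""
        value, self.status = self.status, 0
        return value
```

With the mask left at its reset value of 0, a peer interrupt sets the
status bit but does not assert INTx.  Once the guest sets mask bit 0,
INTx stays asserted until the guest reads Interrupt Status.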

IVPosition Register: if the device is not configured for interrupts,
this is zero.  Else, it is the device's ID (between 0 and 65535).

Before QEMU 2.6.0, the register may read -1 for a short while after
reset.  These devices have PCI revision 0 rather than 1.

There is no good way for software to find out whether the device is
configured for interrupts.  A positive IVPosition means interrupts,
but zero could be either.
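
A guest-side helper might interpret a raw 32-bit IVPosition read as
follows.  This is a sketch of the rules above, not driver code; the
function names and the returned strings are hypothetical.

```python
import struct


def decode_ivposition(raw):
    """Reinterpret an unsigned 32-bit register read as a signed value."""
    (value,) = struct.unpack("<i", struct.pack("<I", raw))
    return value


def classify_ivposition(raw):
    """Say what an IVPosition read lets guest software conclude."""
    position = decode_ivposition(raw)
    if position < 0:
        # Seen only before QEMU 2.6.0 (revision 0), shortly after reset.
        return "not ready"
    if position > 0:
        return "interrupts"
    # Zero: either no interrupts, or interrupts with ID 0.
    return "ambiguous"
```

A read of 0xFFFFFFFF decodes to -1, so software on a revision 0 device
would keep polling before touching BAR2.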

Doorbell Register: writing this register requests to interrupt a peer.
The written value's high 16 bits are the ID of the peer to interrupt,
and its low 16 bits select an interrupt vector.
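
The value to write can be composed as in this short sketch.  The
function and constant names are hypothetical; only the bit layout and
the register offset come from the register table.

```python
DOORBELL_OFFSET = 12  # offset of the Doorbell register in BAR0


def doorbell_value(peer_id, vector):
    """Compose the 32-bit value that asks to interrupt peer_id on vector."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    return (peer_id << 16) | vector
```

For instance, interrupting peer 3 on vector 1 means writing 0x00030001
to offset 12 of BAR0.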

If the device is not configured for interrupts, the write is ignored.

If interrupt setup hasn't completed, the write is ignored.  The
device cannot tell guest software whether setup is complete.
Interrupts can regress to this state on migration.

If the peer with the requested ID isn't connected, or it has fewer
interrupt vectors connected than the requested vector requires, the
write is ignored.  The device cannot tell guest software what peers
are connected, or how many interrupt vectors are connected.

Otherwise, the peer's interrupt for this vector becomes pending.  There
is no way for software to clear the pending bit, and a polling mode of
operation is therefore impossible.

If the peer is a revision 0 device without MSI-X capability, its
Interrupt Status register is set to 1.  This asserts INTx unless
masked by the Interrupt Mask register.  The device cannot then
communicate the interrupt vector to guest software.

With multiple MSI-X vectors, different vectors can be used to indicate
that different events have occurred.  The semantics of interrupt
vectors are left to the application.


== Interrupt infrastructure ==

When configured for interrupts, the peers share eventfd objects in
addition to shared memory.  The shared resources are managed by an
ivshmem server.

=== The ivshmem server ===

The server listens on a UNIX domain socket.

For each new client that connects to the server, the server
- picks an ID,
- creates eventfd file descriptors for the interrupt vectors,
- sends the ID and the file descriptor for the shared memory to the
  new client,
- sends connect notifications for the new client to the other clients
  (these contain file descriptors for sending interrupts),
- sends connect notifications for the other clients to the new client,
  and
- sends interrupt setup messages to the new client (these contain file
  descriptors for receiving interrupts).
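
The steps above can be sketched as a pure function that computes which
messages go to whom when a client connects.  This is not the code in
contrib/ivshmem-server/; the names are hypothetical, file descriptors
are opaque placeholders, and the sketch assumes that the same eventfds
are handed to peers for sending and to their owner for receiving.  The
message values (protocol version 0, then the ID, then -1 with the
shared memory descriptor) follow the Client-Server Protocol section.

```python
PROTOCOL_VERSION = 0


def on_client_connect(new_id, new_eventfds, shm_fd, connected):
    """Return (messages for the new client, messages per existing client).

    connected maps the ID of each already connected client to its
    eventfds, one per vector.  A message is a (value, fd) pair; fd is
    None when no file descriptor accompanies the value.
    """
    to_new = [(PROTOCOL_VERSION, None), (new_id, None), (-1, shm_fd)]
    to_others = {}
    for peer_id, peer_eventfds in connected.items():
        # Tell the existing client how to interrupt the new one ...
        to_others[peer_id] = [(new_id, fd) for fd in new_eventfds]
        # ... and the new client how to interrupt the existing one.
        to_new.extend((peer_id, fd) for fd in peer_eventfds)
    # Interrupt setup: the new client's own ID with its own eventfds.
    to_new.extend((new_id, fd) for fd in new_eventfds)
    return to_new, to_others
```

Picking the ID and creating the eventfds are left to the caller, which
keeps the sketch free of system calls.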

The first client to connect to the server receives ID zero.

When a client disconnects from the server, the server sends disconnect
notifications to the other clients.

The next section describes the protocol in detail.

If the server terminates without sending disconnect notifications for
its connected clients, the clients can elect to continue.  They can
communicate with each other normally, but won't receive disconnect
notifications when a peer disconnects, and no new clients can connect.
There is no way for the clients to connect to a restarted server.  The
device cannot tell guest software whether the server is still up.

Example server code is in contrib/ivshmem-server/.  It is not meant to
be used in production.  It assumes all clients use the same number of
interrupt vectors.

A standalone client is in contrib/ivshmem-client/.  It can be useful
for debugging.

=== The ivshmem Client-Server Protocol ===

An ivshmem device configured for interrupts connects to an ivshmem
server.  This section details the protocol between the two.

The connection is one-way: the server sends messages to the client.
Each message consists of a single 8-byte little-endian signed number,
and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
client and server close the connection on error.
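
Receiving one such message could look like the following sketch.  It
assumes a connected UNIX domain stream socket and treats a short read
as an error for simplicity; the function name is hypothetical.

```python
import array
import socket
import struct


def recv_message(sock):
    """Receive one message; return (value, fd), fd being None if absent."""
    fds = array.array("i")
    # Room for 8 data bytes plus ancillary data carrying one descriptor.
    data, ancdata, _flags, _addr = sock.recvmsg(
        8, socket.CMSG_SPACE(fds.itemsize))
    if len(data) != 8:
        raise ConnectionError("short read or connection closed")
    for level, kind, cdata in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            # Ignore any trailing partial descriptor.
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    (value,) = struct.unpack("<q", data)  # little-endian signed 64 bit
    return value, (fds[0] if fds else None)
```

A caller that follows the rule above would close the socket when
recv_message() raises.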

Note: QEMU currently doesn't close the connection immediately on
error, but only when the character device is destroyed.

On connect, the server sends the following messages in order:

1. The protocol version number, currently zero.  The client should
   close the connection on receipt of versions it can't handle.

2. The client's ID.  This is unique among all clients of this server.
   IDs must be between 0 and 65535, because the Doorbell register
   provides only 16 bits for them.

3. The number -1, accompanied by the file descriptor for the shared
   memory.

4. Connect notifications for existing other clients, if any.  This is
   a peer ID (number between 0 and 65535 other than the client's ID),
   repeated N times.  Each repetition is accompanied by one file
   descriptor.  These are for interrupting the peer with that ID using
   vector 0,..,N-1, in order.  If the client is configured for fewer
   vectors, it closes the extra file descriptors.  If it is configured
   for more, the extra vectors remain unconnected.

5. Interrupt setup.  This is the client's own ID, repeated N times.
   Each repetition is accompanied by one file descriptor.  These are
   for receiving interrupts from peers using vector 0,..,N-1, in
   order.  If the client is configured for fewer vectors, it closes
   the extra file descriptors.  If it is configured for more, the
   extra vectors remain unconnected.

From then on, the server sends these kinds of messages:

6. Connection / disconnection notification.  This is a peer ID.

  - If the number comes with a file descriptor, it's a connection
    notification, exactly like in step 4.

  - Else, it's a disconnection notification for the peer with that ID.
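
To make the message sequence concrete, here is a sketch of how a client
might track it.  It is not QEMU's implementation; the class and
attribute names are hypothetical, and file descriptors are treated as
opaque values that a real client would close where noted.

```python
class ClientState:
    """Tracks the server's messages for a client with n_vectors vectors."""

    def __init__(self, n_vectors):
        self.n_vectors = n_vectors
        self.version = None
        self.own_id = None
        self.shm_fd = None
        self.peers = {}        # peer ID -> fds for interrupting that peer
        self.own_vectors = []  # fds for receiving interrupts
        self.to_close = []     # fds a real client would close

    def handle(self, value, fd=None):
        if self.version is None:                      # step 1
            if value != 0 or fd is not None:
                raise ValueError("unsupported protocol version")
            self.version = value
        elif self.own_id is None:                     # step 2
            if not 0 <= value <= 0xFFFF or fd is not None:
                raise ValueError("invalid client ID")
            self.own_id = value
        elif self.shm_fd is None:                     # step 3
            if value != -1 or fd is None:
                raise ValueError("expected shared memory descriptor")
            self.shm_fd = fd
        elif not 0 <= value <= 0xFFFF:
            raise ValueError("invalid peer ID")
        elif fd is None:                              # disconnection
            self.to_close.extend(self.peers.pop(value, []))
        else:                                         # steps 4, 5 and 6
            if value == self.own_id:
                vectors = self.own_vectors
            else:
                vectors = self.peers.setdefault(value, [])
            if len(vectors) < self.n_vectors:
                vectors.append(fd)
            else:
                self.to_close.append(fd)  # configured for fewer vectors
```

Feeding it the (value, fd) pairs as they arrive keeps the peer table
current.  On ValueError the caller would close the connection, as the
protocol requires on error.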

Known bugs:

* The protocol changed incompatibly in QEMU 2.5.  Before, messages
  were native endian long, and there was no version number.

* The protocol is poorly designed.

=== The ivshmem Client-Client Protocol ===

An ivshmem device configured for interrupts receives eventfd file
descriptors for interrupting peers and getting interrupted by peers
from the server, as explained in the previous section.

To interrupt a peer, the device writes the 8-byte integer 1 in native
byte order to the respective file descriptor.

To receive an interrupt, the device reads and discards as many 8-byte
integers as it can.
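
Both operations can be sketched in a few lines.  The function names are
hypothetical, and the receiving descriptor is assumed to be in
non-blocking mode so that the read loop ends once nothing is left.

```python
import os
import struct


def interrupt_peer(fd):
    """Write the 8-byte integer 1 in native byte order."""
    os.write(fd, struct.pack("=Q", 1))


def drain_interrupts(fd):
    """Read 8-byte integers from a non-blocking fd until none are left.

    Returns the values read; a device would simply discard them.
    """
    values = []
    while True:
        try:
            data = os.read(fd, 8)
        except BlockingIOError:
            break  # nothing more to read right now
        if len(data) < 8:
            break  # end of file or truncated read
        values.append(struct.unpack("=Q", data)[0])
    return values
```

On an eventfd, the kernel adds written values to a counter, so several
interrupt_peer() calls made before a drain are read back as a single
integer holding their sum.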