.. SPDX-License-Identifier: GPL-2.0

===========================================
Shared Virtual Addressing (SVA) with ENQCMD
===========================================

Background
==========

Shared Virtual Addressing (SVA) allows the processor and device to use the
same virtual addresses, avoiding the need for software to translate virtual
addresses to physical addresses. SVA is what PCIe calls Shared Virtual
Memory (SVM).

Besides letting the device use application virtual addresses directly,
SVA also removes the need to pin pages for DMA.
PCIe Address Translation Services (ATS) along with Page Request Interface
(PRI) allow devices to handle application page faults much the same way
the CPU does. For more information please refer to the PCIe
specification Chapter 10: ATS Specification.

Use of SVA requires IOMMU support in the platform. The IOMMU is also
required to support the PCIe features ATS and PRI. ATS allows devices
to cache translations for virtual addresses. The IOMMU driver uses the
mmu_notifier() support to keep the device TLB cache and the CPU cache in
sync. When an ATS lookup fails for a virtual address, the device should
use PRI to request that the virtual address be paged into the
CPU page tables. The device must then use ATS again to fetch the
translation before use.
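
The ATS, PRI, then ATS-again sequence described above can be modelled in a
few lines of user-space C. This is only a toy sketch: ``struct toy_mm``, the
function names, and the made-up frame numbers are all hypothetical and do not
correspond to any kernel or hardware interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 8            /* size of the toy address space, in pages */

/* Toy page table: present[i] says whether virtual page i is resident. */
struct toy_mm {
	bool present[NPAGES];
	unsigned int pri_requests;  /* how many page requests were serviced */
};

/* ATS translation request: succeeds only if the page is present. */
bool ats_translate(const struct toy_mm *mm, size_t vpn, uint64_t *pfn)
{
	if (vpn >= NPAGES || !mm->present[vpn])
		return false;
	*pfn = 0x1000 + vpn;    /* made-up frame number for illustration */
	return true;
}

/* PRI page request: the OS pages the address in, if it is valid. */
bool pri_page_request(struct toy_mm *mm, size_t vpn)
{
	if (vpn >= NPAGES)
		return false;       /* invalid address: request is rejected */
	mm->present[vpn] = true;
	mm->pri_requests++;
	return true;
}

/* Device-side access: ATS, then PRI on a miss, then ATS again. */
bool device_access(struct toy_mm *mm, size_t vpn, uint64_t *pfn)
{
	if (ats_translate(mm, vpn, pfn))
		return true;
	if (!pri_page_request(mm, vpn))
		return false;
	return ats_translate(mm, vpn, pfn);
}
```

In this model the first access to a page costs one page request, and later
accesses to the same page are satisfied by the translation request alone.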

Shared Hardware Workqueues
==========================

Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
the use of Shared Work Queues (SWQ) by both applications and Virtual
Machines (VMs). This allows better hardware utilization than hard
partitioning of resources, which can result in underutilization. To let
the hardware distinguish the context for which work submitted through the
SWQ interface is being executed, SIOV uses Process Address Space
ID (PASID), which is a 20-bit number defined by the PCIe SIG.

The PASID value is encoded in all transactions from the device. This allows
the IOMMU to track I/O at per-PASID granularity in addition to using the PCIe
Requester ID (RID), which is the Bus/Device/Function.
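
As a small illustration of the two identifiers, the sketch below packs a
Bus/Device/Function triple into the conventional 16-bit form (8-bit bus,
5-bit device, 3-bit function) and checks that a PASID fits in 20 bits. The
helper names and the combined ``make_io_key()`` layout are hypothetical; they
only show that the pair (RID, PASID) identifies a context.

```c
#include <stdbool.h>
#include <stdint.h>

#define PASID_BITS 20
#define PASID_MAX  ((1u << PASID_BITS) - 1)   /* 0xFFFFF */

/* Pack Bus (8 bits), Device (5 bits), Function (3 bits) into 16 bits. */
uint16_t make_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
	return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* A PASID is valid if it fits in 20 bits. */
bool pasid_is_valid(uint32_t pasid)
{
	return pasid <= PASID_MAX;
}

/* Hypothetical lookup key: RID in the upper bits, PASID in the low 20. */
uint64_t make_io_key(uint16_t rid, uint32_t pasid)
{
	return ((uint64_t)rid << PASID_BITS) | (pasid & PASID_MAX);
}
```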


ENQCMD
======

ENQCMD is a new instruction on Intel platforms that atomically submits a
work descriptor to a device. The descriptor includes the operation to be
performed, virtual addresses of all parameters, virtual address of a completion
record, and the PASID (process address space ID) of the current process.

ENQCMD works with non-posted semantics and carries back a status indicating
whether the command was accepted by hardware. This lets the submitter know
whether the submission needs to be retried, or whether other device-specific
mechanisms to implement fairness or ensure forward progress should be used.
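
A bounded retry loop is one way a submitter might act on that status. The
sketch below is hardware-independent: ``submit_fn`` is a hypothetical hook
standing in for whatever wrapper issues ENQCMD, and the convention that 0
means "accepted" and nonzero means "retry" is an assumption of this example.
``fake_submit()`` is a stand-in device used only for demonstration.

```c
#include <stddef.h>

/*
 * Hypothetical submit hook. Returns 0 if the device accepted the
 * descriptor and nonzero if it asked for a retry. On real hardware this
 * is where an ENQCMD wrapper would go (not shown here).
 */
typedef int (*submit_fn)(void *portal, const void *desc);

/*
 * Try to submit up to max_tries times.
 * Returns the number of attempts used on success, or -1 if the device
 * never accepted the descriptor.
 */
int submit_with_retry(submit_fn submit, void *portal, const void *desc,
		      int max_tries)
{
	for (int attempt = 1; attempt <= max_tries; attempt++) {
		if (submit(portal, desc) == 0)
			return attempt;
		/* A real caller would pause or back off here. */
	}
	return -1;
}

/* Fake device for demonstration: rejects the first *busy submissions. */
int fake_submit(void *portal, const void *desc)
{
	int *busy = portal;

	(void)desc;
	if (*busy > 0) {
		(*busy)--;
		return 1;   /* retry */
	}
	return 0;       /* accepted */
}
```

Capping the number of attempts keeps a busy queue from stalling the caller
indefinitely; what to do after giving up is left to the application.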

ENQCMD is the glue that ensures applications can directly submit commands
to the hardware and also permits hardware to be aware of application context
to perform I/O operations via use of PASID.

Process Address Space Tagging
=============================

A new thread-scoped MSR (IA32_PASID) provides the connection between
user processes and the rest of the hardware. When an application first
accesses an SVA-capable device, this MSR is initialized with a newly
allocated PASID. The driver for the device calls an IOMMU-specific API
that sets up the routing for DMA and page-requests.

For example, the Intel Data Streaming Accelerator (DSA) uses
iommu_sva_bind_device(), which will do the following:

- Allocate the PASID, and program the process page-table (%cr3 register) in the
  PASID context entries.
- Register for mmu_notifier() to track any page-table invalidations to keep
  the device TLB in sync. For example, when a page-table entry is invalidated,
  the IOMMU propagates the invalidation to the device TLB. This forces any
  future access by the device to this virtual address to go through
  ATS. If the IOMMU responds that a page is not
  present, the device requests the page to be paged in via the PCIe PRI
  protocol before performing I/O.

This MSR is managed with the XSAVE feature set as "supervisor state" to
ensure the MSR is updated during context switch.

PASID Management
================

The kernel must allocate a PASID on behalf of each process which will use
ENQCMD and program it into the new MSR to communicate the process identity to
platform hardware.  ENQCMD uses the PASID stored in this MSR to tag requests
from this process.  When a user submits a work descriptor to a device using the
ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
with the same PASID. The platform IOMMU uses the PASID in the transaction to
perform address translation. The IOMMU APIs set up the corresponding PASID
entry in the IOMMU with the process address space used by the CPU (e.g. %cr3
register in x86).
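
To make the MSR contents concrete, here is a minimal sketch of packing and
unpacking such a value. It assumes a layout in which the PASID occupies bits
19:0 and bit 31 is a "valid" flag; that layout is not stated in this document,
so check it against the Intel documentation before relying on it. The helper
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PASID_MASK   0xFFFFFu        /* assumed: bits 19:0 hold the PASID */
#define PASID_VALID  (1ull << 31)    /* assumed: bit 31 marks it as valid */

/* Build an MSR value carrying a valid PASID. */
uint64_t pasid_msr_encode(uint32_t pasid)
{
	return (pasid & PASID_MASK) | PASID_VALID;
}

/* True if the MSR value carries a valid PASID. */
bool pasid_msr_is_valid(uint64_t msr)
{
	return (msr & PASID_VALID) != 0;
}

/* Extract the PASID field from an MSR value. */
uint32_t pasid_msr_pasid(uint64_t msr)
{
	return (uint32_t)(msr & PASID_MASK);
}
```

Under this layout a cleared MSR (all zeroes) is simply a value whose valid
bit is not set.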

The MSR must be configured on each logical CPU before any application
thread can interact with a device. Threads that belong to the same
process share the same page tables, and thus the same MSR value.

The PASID is cleared when a process is created. The PASID allocation and MSR
programming may occur long after a process and its threads have been created.
One thread must call iommu_sva_bind_device() to allocate the PASID for the
process. If a thread uses ENQCMD without the MSR first being populated, a #GP
will be raised. The kernel will update the PASID MSR with the PASID for all
threads in the process. A single process PASID can be used simultaneously
with multiple devices since they all share the same address space.

One thread can call iommu_sva_unbind_device() to free the allocated PASID.
The kernel will clear the PASID MSR for all threads belonging to the process.

New threads inherit the MSR value from the parent.
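
The lifecycle above (allocate once per process, propagate to every thread,
clear on unbind, inherit on thread creation) can be summarised with a toy
user-space model. Nothing here is kernel code: ``struct toy_process``, the
``toy_*`` functions, and the convention that 0 means "no PASID" are all
assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 4
#define PASID_NONE  0u    /* sketch convention: 0 means "no PASID" */

struct toy_process {
	uint32_t pasid;                    /* one PASID per process */
	int nthreads;
	uint32_t thread_msr[MAX_THREADS];  /* per-thread copy of the MSR */
};

/* Model of bind: allocate a PASID once, then update every thread. */
uint32_t toy_bind(struct toy_process *p, uint32_t *next_pasid)
{
	if (p->pasid == PASID_NONE)
		p->pasid = (*next_pasid)++;
	for (int i = 0; i < p->nthreads; i++)
		p->thread_msr[i] = p->pasid;
	return p->pasid;
}

/* Model of unbind: free the PASID and clear it in every thread. */
void toy_unbind(struct toy_process *p)
{
	p->pasid = PASID_NONE;
	for (int i = 0; i < p->nthreads; i++)
		p->thread_msr[i] = PASID_NONE;
}

/* New threads inherit the MSR value of the creating thread. */
int toy_clone_thread(struct toy_process *p, int parent)
{
	if (parent < 0 || parent >= p->nthreads || p->nthreads >= MAX_THREADS)
		return -1;
	p->thread_msr[p->nthreads] = p->thread_msr[parent];
	return p->nthreads++;
}
```

Binding a second device in this model returns the same PASID, mirroring the
statement that one process PASID serves multiple devices.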

Relationships
=============

 * Each process has many threads, but only one PASID.
 * Devices have a limited number (roughly tens to thousands) of hardware
   workqueues. The device driver manages allocating hardware workqueues.
 * A single mmap() maps a single hardware workqueue as a "portal" and
   each portal maps down to a single workqueue.
 * For each device with which a process interacts, there must be
   one or more mmap()'d portals.
 * Many threads within a process can share a single portal to access
   a single device.
 * Multiple processes can separately mmap() the same portal, in
   which case they still share one device hardware workqueue.
 * The single process-wide PASID is used by all threads to interact
   with all devices.  There is not, for instance, a PASID for each
   thread or each thread<->device pair.
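
These relationships can be written down as plain data structures. The
following is a hypothetical sketch (the type and function names are invented
for illustration, and fixed-size arrays keep it short): a process holds one
PASID and a set of portals, and each portal points at exactly one workqueue.

```c
#include <stddef.h>
#include <stdint.h>

struct workqueue {          /* one hardware workqueue on a device */
	int device_id;
	int wq_id;
};

struct portal {             /* one mmap()'d portal -> exactly one workqueue */
	struct workqueue *wq;
};

#define MAX_PORTALS 8

struct process_ctx {        /* one PASID, shared by all threads */
	uint32_t pasid;
	struct portal portals[MAX_PORTALS];
	size_t nportals;
};

/* Record a new portal mapping; returns its index or -1 if full. */
int map_portal(struct process_ctx *p, struct workqueue *wq)
{
	if (p->nportals >= MAX_PORTALS)
		return -1;
	p->portals[p->nportals].wq = wq;
	return (int)p->nportals++;
}

/* Count how many distinct devices the process can reach. */
size_t devices_reachable(const struct process_ctx *p)
{
	size_t count = 0;

	for (size_t i = 0; i < p->nportals; i++) {
		size_t j;

		for (j = 0; j < i; j++)
			if (p->portals[j].wq->device_id ==
			    p->portals[i].wq->device_id)
				break;
		if (j == i)     /* first portal seen for this device */
			count++;
	}
	return count;
}
```

Two processes that map the same portal end up with portal entries pointing
at the same ``struct workqueue``, while each keeps its own PASID.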

FAQ
===

* What is SVA/SVM?

Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
work in the same address space, i.e., to share it. Some call it Shared
Virtual Memory (SVM), but the Linux community wanted to avoid confusing it
with POSIX Shared Memory and Secure Virtual Machines, which were terms
already in circulation.

* What is a PASID?

A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
The PASID is included in all transactions between the platform and the device.

* How are shared workqueues different?

Traditionally, in order for userspace applications to interact with hardware,
a separate hardware instance is required per process. For example,
consider doorbells as a mechanism of informing hardware about work to process.
Each doorbell is required to be spaced 4k (or page-size) apart for process
isolation. This requires hardware to provision that space and reserve it in
MMIO. This doesn't scale as the number of threads becomes quite large. With
Shared Work Queues (SWQ), the hardware also manages the queue depth, so
consumers don't need to track it. If there is no space to accept
a command, the device will return an error indicating retry.

A user should check the Deferrable Memory Write (DMWr) capability on the
device and only submit ENQCMD when the device supports it. In the new DMWr
PCIe terminology, devices need to support the DMWr completer capability. In
addition, all switch ports must support DMWr routing, and DMWr must be
enabled by the PCIe subsystem, much like how PCIe atomic operations are
managed.

SWQ allows hardware to provision just a single address in the device. When
used with ENQCMD to submit work, the device can distinguish the process
submitting the work since it will include the PASID assigned to that
process. This helps the device scale to a large number of processes.
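
The scaling argument is simple arithmetic, sketched below under the
assumption of a 4 KiB page and one page-aligned doorbell per client. The
function names are hypothetical, and a real device may expose more than one
portal page per workqueue.

```c
#include <stdint.h>

#define DOORBELL_STRIDE 4096u   /* one 4 KiB page per doorbell */

/* MMIO bytes needed when each client gets its own page-aligned doorbell. */
uint64_t dedicated_doorbell_mmio(uint64_t nclients)
{
	return nclients * DOORBELL_STRIDE;
}

/* With a shared workqueue, one portal page serves every client. */
uint64_t shared_wq_mmio(uint64_t nclients)
{
	(void)nclients;     /* the cost does not depend on the client count */
	return DOORBELL_STRIDE;
}
```

For 1024 clients, dedicated doorbells would reserve 4 MiB of MMIO in this
model, while the shared workqueue still needs a single page.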

* Is this the same as a user space device driver?

Communicating with the device via the shared workqueue is much simpler
than a full blown user space driver. The kernel driver does all the
initialization of the hardware. User space only needs to worry about
submitting work and processing completions.

* Is this the same as SR-IOV?

Single Root I/O Virtualization (SR-IOV) focuses on providing independent
hardware interfaces for virtualizing hardware. Hence, each instance is
required to be an almost fully functional interface to software, supporting
the traditional BARs, space for interrupts via MSI-X, and its own register
layout. Virtual Functions (VFs) are assisted by the Physical Function (PF)
driver.

Scalable I/O Virtualization builds on the PASID concept to create device
instances for virtualization. SIOV requires host software to assist in
creating virtual devices; each virtual device is represented by a PASID
along with the bus/device/function of the device.  This allows device
hardware to optimize device resource creation and lets resources grow
dynamically on demand. SR-IOV creation and management is very static in
nature. Consult the references below for more details.

* Why not just create a virtual function for each app?

Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
duplicated hardware for PCI config space and interrupts such as MSI-X.
Resources such as interrupts have to be hard partitioned between VFs at
creation time, and cannot scale dynamically on demand. The VFs are not
completely independent from the Physical Function (PF). Most VFs require
some communication and assistance from the PF driver. SIOV, in contrast,
creates a software-defined device where all the configuration and control
aspects are mediated via the slow path. The work submission and completion
happen without any mediation.

* Does this support virtualization?

ENQCMD can be used from within a guest VM. In these cases, the VMM helps
with setting up a translation table to translate from Guest PASID to Host
PASID. Please consult the ENQCMD instruction set reference for more
details.
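
Conceptually, the translation is a lookup from guest PASID to host PASID.
The sketch below models it as a flat array searched linearly. This is purely
illustrative: the real table format is defined in the instruction set
reference, and the structure and function names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pasid_xlate_entry {
	uint32_t guest_pasid;
	uint32_t host_pasid;
	bool valid;
};

/* Look up guest_pasid; on success store the host PASID and return true. */
bool guest_to_host_pasid(const struct pasid_xlate_entry *tbl, size_t n,
			 uint32_t guest_pasid, uint32_t *host_pasid)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].valid && tbl[i].guest_pasid == guest_pasid) {
			*host_pasid = tbl[i].host_pasid;
			return true;
		}
	}
	return false;   /* no valid mapping for this guest PASID */
}
```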

* Does memory need to be pinned?

When devices support SVA along with platform hardware such as IOMMU
supporting such devices, there is no need to pin memory for DMA purposes.
Devices that support SVA also support other PCIe features that remove the
pinning requirement for memory.

Device TLB support - The device requests the IOMMU to look up an address
before use via Address Translation Service (ATS) requests.  If the mapping
exists but there is no page allocated by the OS, IOMMU hardware returns that
no mapping exists.

The device requests the virtual address to be mapped via Page Request
Interface (PRI). Once the OS has successfully completed the mapping, it
returns the response back to the device. The device then requests the
translation again and continues.

The IOMMU works with the OS in managing consistency of page-tables with the
device. When removing pages, it interacts with the device to remove any
device TLB entry that might have been cached before removing the mappings from
the OS.

References
==========

VT-D:
https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d

SIOV:
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

ENQCMD in ISE:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

DSA spec:
https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf