QEMU Virtual NVDIMM
===================

This document explains the usage of the virtual NVDIMM (vNVDIMM) feature,
which is available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the vNVDIMM
device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by a memory
backend (i.e. memory-backend-file or memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is via the
following command line options:

 -machine pc,nvdimm=on
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=off
 -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,
   size=$NVDIMM_SIZE,readonly=off" creates a backend storage of size
   $NVDIMM_SIZE on the file $PATH. All accesses to the virtual NVDIMM
   device go to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

   "readonly=on/off" controls whether the file $PATH is opened read-only or
   read/write (default).

 - "device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off" creates a read/write
   virtual NVDIMM device whose storage is provided by the above memory
   backend device.

   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
   persistent writes. Linux guest drivers set the device to read-only when this
   bit is present. Set unarmed to on when the memdev has readonly=on.

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.

With the above command line options, if the guest OS has the proper NVDIMM
driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
detect an NVDIMM device which is in the persistent memory mode and whose
size is $NVDIMM_SIZE.

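The slots and maxmem constraints described above can be checked before
QEMU is started. The following is a minimal sketch, not part of QEMU: the
function name nvdimm_basic_opts, its parameters, and the backend path
/tmp/nvdimm.img are hypothetical, and all sizes are assumed to be given in
MiB. It only prints the options instead of launching QEMU.

```shell
# Hypothetical helper: print the startup options for one RAM device plus
# one vNVDIMM device, after checking the slots and maxmem constraints.
# usage: nvdimm_basic_opts RAM_MIB NVDIMM_MIB SLOTS MAXMEM_MIB PATH
nvdimm_basic_opts() {
    local ram_mib=$1 nvdimm_mib=$2 slots=$3 maxmem_mib=$4 path=$5
    local needed_mib=$(( ram_mib + nvdimm_mib ))

    # slots must cover the RAM device plus the vNVDIMM device.
    if [ "$slots" -lt 2 ]; then
        echo "slots=$slots is too small, need >= 2" >&2
        return 1
    fi
    # maxmem must cover the RAM size plus the vNVDIMM size.
    if [ "$maxmem_mib" -lt "$needed_mib" ]; then
        echo "maxmem=${maxmem_mib}M is too small, need >= ${needed_mib}M" >&2
        return 1
    fi

    printf '%s\n' \
        "-machine pc,nvdimm=on" \
        "-m ${ram_mib}M,slots=${slots},maxmem=${maxmem_mib}M" \
        "-object memory-backend-file,id=mem1,share=on,mem-path=${path},size=${nvdimm_mib}M,readonly=off" \
        "-device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off"
}

# Example: 4 GiB of RAM plus a 1 GiB vNVDIMM on a hypothetical backend file.
nvdimm_basic_opts 4096 1024 2 5120 /tmp/nvdimm.img
```

The printed lines can be appended to a QEMU invocation once the checks
pass.
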
Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by the "size" option,
   QEMU will truncate the backend file by ftruncate(2), which will
   corrupt the existing data in the backend file, especially in the
   shrink case.

   QEMU v2.8.0 and later check the backend file size against the "size"
   option. If they do not match, QEMU will report errors and abort in
   order to avoid data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86.  However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.

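Both notes can be turned into a check that runs before QEMU is started.
The sketch below is an illustration only and assumes a regular backend
file and sizes given in bytes. The function name check_backend_file is
hypothetical. Pass 4096 as the alignment for the basic x86 case or
2097152 for the stricter QEMU v2.7.0 case.

```shell
# Hypothetical helper: succeed only if FILE is exactly SIZE bytes long and
# SIZE is a multiple of ALIGN bytes.
# usage: check_backend_file FILE SIZE ALIGN
check_backend_file() {
    local file=$1 size=$2 align=$3 actual

    # wc -c prints the file length in bytes; the arithmetic expansion
    # strips any padding that some wc implementations add.
    actual=$(wc -c < "$file") || return 1
    actual=$(( actual ))

    if [ "$actual" -ne "$size" ]; then
        echo "$file is $actual bytes, but size=$size was requested" >&2
        return 1
    fi
    if [ $(( size % align )) -ne 0 ]; then
        echo "size=$size is not a multiple of $align bytes" >&2
        return 1
    fi
}

# Example: a 4 MiB temporary file checked against 2 MiB alignment.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1048576 count=4 2>/dev/null
check_backend_file "$img" 4194304 2097152 && echo "backend file OK"
rm -f "$img"
```
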
Label
-----

QEMU v2.7.0 and later implement label support for vNVDIMM devices.
To enable labels on a vNVDIMM device, users can simply add the
"label-size=$SZ" option to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the file
   will be inaccessible to the guest. If any useful data (e.g. the
   meta-data of the file system) was stored there, the latter usage
   may result in guest data corruption (e.g. breakage of the guest file
   system).

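Because the label area is taken from the end of the backend storage, the
space left for guest data shrinks by the label size. The sketch below
works through that arithmetic in bytes. The function name
nvdimm_data_bytes is hypothetical, and the assumption that the guest sees
exactly "size minus label-size" follows from the note above rather than
from any additional guarantee.

```shell
# Hypothetical helper: print the bytes left for guest data when LABEL
# bytes at the end of a SIZE-byte backend are reserved for labels.
# usage: nvdimm_data_bytes SIZE LABEL
nvdimm_data_bytes() {
    local size=$1 label=$2
    local min_label=$(( 128 * 1024 ))   # minimal label size: 128KB

    if [ "$label" -lt "$min_label" ]; then
        echo "label-size=$label is below the 128KB minimum" >&2
        return 1
    fi
    if [ "$label" -ge "$size" ]; then
        echo "label-size=$label leaves no room for data in size=$size" >&2
        return 1
    fi
    echo $(( size - label ))
}

# Example: a 4 GiB backend with a 128 KiB label area.
nvdimm_data_bytes $(( 4 * 1024 * 1024 * 1024 )) $(( 128 * 1024 ))
```

Any data stored beyond the printed offset in a file previously used
without labels would fall into the label area.
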
Hotplug
-------

QEMU v2.8.0 and later implement hotplug support for vNVDIMM
devices. Similarly to RAM hotplug, vNVDIMM hotplug is
accomplished by two monitor commands, "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

 (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   enough slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. A similar constraint applies to the memory option "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices

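The two inequalities above can be evaluated ahead of time for a planned
configuration. The following sketch is an illustration under stated
assumptions: the function name nvdimm_memory_budget is hypothetical,
sizes are in MiB, and every vNVDIMM size listed (statically plugged or
planned for hotplug) counts as one slot.

```shell
# Hypothetical helper: print the minimum slots and maxmem values.
# usage: nvdimm_memory_budget RAM_DEVICES RAM_MIB [NVDIMM_MIB...]
nvdimm_memory_budget() {
    local ram_devices=$1 ram_mib=$2
    shift 2
    local slots=$ram_devices maxmem=$ram_mib size

    # Each vNVDIMM, static or hotplugged, takes one slot and adds its size.
    for size in "$@"; do
        slots=$(( slots + 1 ))
        maxmem=$(( maxmem + size ))
    done
    echo "slots=${slots},maxmem=${maxmem}M"
}

# Example: one 4 GiB RAM device, one static 1 GiB vNVDIMM, and one 4 GiB
# vNVDIMM that will be hotplugged later.
nvdimm_memory_budget 1 4096 1024 4096
```

The printed values are lower bounds; larger values leave room for further
hotplug.
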
Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. In that case, QEMU v2.12.0 and later provide the 'align' option of
memory-backend-file to allow users to specify the proper alignment.
For device dax (e.g., /dev/dax0.0), this alignment needs to match the
alignment requirement of the device dax. The NUM of the 'align=NUM' option
must be larger than or equal to the 'align' of the device dax.
One of the following commands can be used to show the 'align' of the
device dax:

    ndctl list -X
    daxctl list -R

In order to get the proper 'align' of the device dax, you need to install
the library 'libdaxctl'.

For example, if the device dax requires 2 MB alignment, the
following QEMU command line options use it (/dev/dax0.0) as the
backend of a vNVDIMM:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1

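The rule that 'align=NUM' must not be smaller than the 'align' of the
device dax can be enforced when the option string is built. This is a
minimal sketch: the function name dax_object_opt is hypothetical, both
alignments are assumed to be given in bytes, and the device alignment is
assumed to have been read by hand from the ndctl or daxctl output.

```shell
# Hypothetical helper: print the -object option for a device dax backend
# after checking the requested alignment against the device alignment.
# usage: dax_object_opt DEV SIZE ALIGN_BYTES DEV_ALIGN_BYTES
dax_object_opt() {
    local dev=$1 size=$2 align=$3 dev_align=$4

    if [ "$align" -lt "$dev_align" ]; then
        echo "align=$align is smaller than the device alignment $dev_align" >&2
        return 1
    fi
    printf '%s\n' \
        "-object memory-backend-file,id=mem1,share=on,mem-path=${dev},size=${size},align=${align}"
}

# Example: /dev/dax0.0 reporting a 2 MiB alignment.
dax_object_opt /dev/dax0.0 4G 2097152 2097152
```
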
Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
the only backends that can guarantee guest write persistence are:

A. DAX device (e.g., /dev/dax0.0), or
B. DAX file (on a file system mounted with the dax option)

When using B (a file supporting direct mapping of persistent memory)
as a backend, write persistence is guaranteed if the host kernel has
support for the MAP_SYNC flag in the mmap system call (available
since Linux 4.15 and on certain distro kernels) and additionally
both the 'pmem' and 'share' flags are set to 'on' on the backend.

If these conditions are not satisfied, i.e. if either 'pmem' or 'share'
is not set, if the backend file does not support DAX, or if MAP_SYNC
is not supported by the host kernel, write persistence is not
guaranteed after a system crash. For compatibility reasons, these
conditions are ignored if not satisfied. Currently, no way is
provided to test for them.
For more details, please refer to the mmap(2) man page:
http://man7.org/linux/man-pages/man2/mmap.2.html.

When using other types of backends, it's suggested to set the 'unarmed'
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
guest NVDIMM region mapping structure.  This unarmed flag indicates to
guest software that this vNVDIMM device contains a region that cannot
accept persistent writes. As a result, for example, the guest Linux
NVDIMM driver marks such a vNVDIMM device as read-only.

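The guidance in this section can be summarized as a small decision
helper. The sketch below is one possible reading, not QEMU behavior: the
function name nvdimm_unarmed and the backend labels "devdax" and
"fsdax-file" are hypothetical, the caller is assumed to know whether the
host kernel supports MAP_SYNC, and treating a DAX file that misses the
conditions like any other non-persistent backend is an assumption.

```shell
# Hypothetical helper: print the suggested value of the 'unarmed' option.
# usage: nvdimm_unarmed BACKEND PMEM SHARE MAP_SYNC
#   BACKEND   "devdax" (case A), "fsdax-file" (case B), or anything else
#   PMEM      "on" or "off", as passed to memory-backend-file
#   SHARE     "on" or "off", as passed to memory-backend-file
#   MAP_SYNC  "yes" if the host kernel is known to support MAP_SYNC
nvdimm_unarmed() {
    local backend=$1 pmem=$2 share=$3 map_sync=$4

    case $backend in
        devdax)
            echo off ;;
        fsdax-file)
            # Persistence needs pmem=on, share=on, and MAP_SYNC together.
            if [ "$pmem" = on ] && [ "$share" = on ] && [ "$map_sync" = yes ]; then
                echo off
            else
                echo on
            fi ;;
        *)
            # Other backends cannot guarantee persistent writes.
            echo on ;;
    esac
}

# Example: a DAX file with share=on but without pmem=on.
echo "-device nvdimm,id=nvdimm1,memdev=mem1,unarmed=$(nvdimm_unarmed fsdax-file off on yes)"
```
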
Backend File Setup Example
--------------------------

Here are two examples showing how to set up these persistent backends on
Linux using the tool ndctl [3].

A. DAX device

Use the following command to set up /dev/dax0.0 so that the entirety of
namespace0.0 can be exposed as an emulated NVDIMM to the guest:

    ndctl create-namespace -f -e namespace0.0 -m devdax

/dev/dax0.0 can then be used directly in the "mem-path" option.

B. DAX file

Individual files on a DAX host file system can be exposed as emulated
NVDIMMs.  First an fsdax block device is created, partitioned, and then
mounted with the "dax" mount option:

    ndctl create-namespace -f -e namespace0.0 -m fsdax
    (partition /dev/pmem0 with name pmem0p1)
    mount -o dax /dev/pmem0p1 /mnt
    (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
     in /mnt)

Then the new file in /mnt can be used in the "mem-path" option.

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data persistence.  Users can provide a persistence value to a guest via
the optional "nvdimm-persistence" machine command line option:

    -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu"      - The platform supports flushing dirty data from the CPU cache to
             the NVDIMMs in the event of power loss.  This implies that the
             platform also supports flushing dirty data through the memory
             controller on power loss.

If the vNVDIMM backend is in host persistent memory that can be accessed in
the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's suggested to set
the 'pmem' option of memory-backend-file to 'on'. When 'pmem' is 'on' and QEMU
is built with libpmem [2] support (configured with --enable-libpmem), QEMU
will take the necessary operations to guarantee the persistence of its own
writes to the vNVDIMM backend (e.g., in vNVDIMM label emulation and live
migration). If 'pmem' is 'on' while there is no libpmem support, QEMU will
exit and report a "lack of libpmem support" message to ensure the persistence
is available. For example, to ensure persistence for some backend file,
use the QEMU command line:

    -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

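Since only two values are valid for "nvdimm-persistence", a wrapper
script can reject anything else before QEMU is started. This is a minimal
sketch with a hypothetical function name, nvdimm_machine_opt. It reuses
the -machine line shown above and only prints it.

```shell
# Hypothetical helper: print the -machine option for a valid
# nvdimm-persistence value, or fail for any other value.
# usage: nvdimm_machine_opt VALUE
nvdimm_machine_opt() {
    case $1 in
        mem-ctrl|cpu)
            printf '%s\n' "-machine pc,accel=kvm,nvdimm,nvdimm-persistence=$1" ;;
        *)
            echo "invalid nvdimm-persistence value: $1" >&2
            return 1 ;;
    esac
}

# Example: advertise CPU cache flushing on power loss to the guest.
nvdimm_machine_opt cpu
```
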
References
----------

[1] NVM Programming Model (NPM)
        Version 1.2
    https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
[2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page:
    http://pmem.io/pmdk/
[3] ndctl-create-namespace - provision or reconfigure a namespace
    http://pmem.io/ndctl/ndctl-create-namespace.html