qemu/docs/replay.txt
<<
>>
Prefs
   1Copyright (c) 2010-2015 Institute for System Programming
   2                        of the Russian Academy of Sciences.
   3
   4This work is licensed under the terms of the GNU GPL, version 2 or later.
   5See the COPYING file in the top-level directory.
   6
   7Record/replay
   8-------------
   9
  10Record/replay functions are used for the deterministic replay of qemu execution.
  11Execution recording writes a non-deterministic events log, which can be later
  12used for replaying the execution anywhere and for unlimited number of times.
  13It also supports checkpointing for faster rewind to the specific replay moment.
  14Execution replaying reads the log and replays all non-deterministic events
  15including external input, hardware clocks, and interrupts.
  16
  17Deterministic replay has the following features:
  18 * Deterministically replays whole system execution and all contents of
  19   the memory, state of the hardware devices, clocks, and screen of the VM.
  20 * Writes execution log into the file for later replaying for multiple times
  21   on different machines.
  22 * Supports i386, x86_64, and ARM hardware platforms.
  23 * Performs deterministic replay of all operations with keyboard and mouse
  24   input devices.
  25
  26Usage of the record/replay:
  27 * First, record the execution with the following command line:
  28    qemu-system-i386 \
  29     -icount shift=7,rr=record,rrfile=replay.bin \
  30     -drive file=disk.qcow2,if=none,id=img-direct \
  31     -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
  32     -device ide-hd,drive=img-blkreplay \
  33     -netdev user,id=net1 -device rtl8139,netdev=net1 \
  34     -object filter-replay,id=replay,netdev=net1
  35 * After recording, you can replay it by using another command line:
  36    qemu-system-i386 \
  37     -icount shift=7,rr=replay,rrfile=replay.bin \
  38     -drive file=disk.qcow2,if=none,id=img-direct \
  39     -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
  40     -device ide-hd,drive=img-blkreplay \
  41     -netdev user,id=net1 -device rtl8139,netdev=net1 \
  42     -object filter-replay,id=replay,netdev=net1
  43   The only difference with recording is changing the rr option
  44   from record to replay.
  45 * Block device images are not actually changed in the recording mode,
  46   because all of the changes are written to the temporary overlay file.
  47   This behavior is enabled by using blkreplay driver. It should be used
  48   for every enabled block device, as described in 'Block devices' section.
  49 * '-net none' option should be specified when network is not used,
  50   because QEMU adds network card by default. When network is needed,
  51   it should be configured explicitly with replay filter, as described
  52   in 'Network devices' section.
  53 * Interaction with audio devices and serial ports are recorded and replayed
  54   automatically when such devices are enabled.
  55
  56Academic papers with description of deterministic replay implementation:
  57http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html
  58http://dl.acm.org/citation.cfm?id=2786805.2803179
  59
  60Modifications of qemu include:
  61 * wrappers for clock and time functions to save their return values in the log
  62 * saving different asynchronous events (e.g. system shutdown) into the log
  63 * synchronization of the bottom halves execution
  64 * synchronization of the threads from thread pool
  65 * recording/replaying user input (mouse, keyboard, and microphone)
  66 * adding internal checkpoints for cpu and io synchronization
  67 * network filter for recording and replaying the packets
  68 * block driver for making block layer deterministic
  69 * serial port input record and replay
  70
  71Locking and thread synchronisation
  72----------------------------------
  73
  74Previously the synchronisation of the main thread and the vCPU thread
  75was ensured by the holding of the BQL. However the trend has been to
  76reduce the time the BQL was held across the system including under TCG
  77system emulation. As it is important that batches of events are kept
  78in sequence (e.g. expiring timers and checkpoints in the main thread
  79while instruction checkpoints are written by the vCPU thread) we need
  80another lock to keep things in lock-step. This role is now handled by
  81the replay_mutex_lock. It used to be held only for each event being
  82written but now it is held for a whole execution period. This results
  83in a deterministic ping-pong between the two main threads.
  84
  85As the BQL is now a finer grained lock than the replay_lock it is almost
  86certainly a bug, and a source of deadlocks, to take the
  87replay_mutex_lock while the BQL is held. This is enforced by an assert.
  88While the unlocks are usually in the reverse order, this is not
  89necessary; you can drop the replay_lock while holding the BQL, without
  90doing a more complicated unlock_iothread/replay_unlock/lock_iothread
  91sequence.
  92
  93Non-deterministic events
  94------------------------
  95
  96Our record/replay system is based on saving and replaying non-deterministic
  97events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
  98from HDD or memory of the VM). Saving only non-deterministic events makes
  99log file smaller and simulation faster.
 100
 101The following non-deterministic data from peripheral devices is saved into
 102the log: mouse and keyboard input, network packets, audio controller input,
 103serial port input, and hardware clocks (they are non-deterministic
 104too, because their values are taken from the host machine). Inputs from
 105simulated hardware, memory of VM, software interrupts, and execution of
 106instructions are not saved into the log, because they are deterministic and
 107can be replayed by simulating the behavior of virtual machine starting from
 108initial state.
 109
 110We had to solve three tasks to implement deterministic replay: recording
 111non-deterministic events, replaying non-deterministic events, and checking
 112that there is no divergence between record and replay modes.
 113
 114We changed several parts of QEMU to make event log recording and replaying.
 115Devices' models that have non-deterministic input from external devices were
 116changed to write every external event into the execution log immediately.
 117E.g. network packets are written into the log when they arrive into the virtual
 118network adapter.
 119
 120All non-deterministic events are coming from these devices. But to
 121replay them we need to know at which moments they occur. We specify
 122these moments by counting the number of instructions executed between
 123every pair of consecutive events.
 124
 125Instruction counting
 126--------------------
 127
 128QEMU should work in icount mode to use record/replay feature. icount was
 129designed to allow deterministic execution in absence of external inputs
 130of the virtual machine. We also use icount to control the occurrence of the
 131non-deterministic events. The number of instructions elapsed from the last event
 132is written to the log while recording the execution. In replay mode we
 133can predict when to inject that event using the instruction counter.
 134
 135Timers
 136------
 137
 138Timers are used to execute callbacks from different subsystems of QEMU
 139at the specified moments of time. There are several kinds of timers:
 140 * Real time clock. Based on host time and used only for callbacks that
 141   do not change the virtual machine state. For this reason real time
 142   clock and timers does not affect deterministic replay at all.
 143 * Virtual clock. These timers run only during the emulation. In icount
 144   mode virtual clock value is calculated using executed instructions counter.
 145   That is why it is completely deterministic and does not have to be recorded.
 146 * Host clock. This clock is used by device models that simulate real time
 147   sources (e.g. real time clock chip). Host clock is the one of the sources
 148   of non-determinism. Host clock read operations should be logged to
 149   make the execution deterministic.
 150 * Virtual real time clock. This clock is similar to real time clock but
 151   it is used only for increasing virtual clock while virtual machine is
 152   sleeping. Due to its nature it is also non-deterministic as the host clock
 153   and has to be logged too.
 154
 155Checkpoints
 156-----------
 157
 158Replaying of the execution of virtual machine is bound by sources of
 159non-determinism. These are inputs from clock and peripheral devices,
 160and QEMU thread scheduling. Thread scheduling affect on processing events
 161from timers, asynchronous input-output, and bottom halves.
 162
 163Invocations of timers are coupled with clock reads and changing the state
 164of the virtual machine. Reads produce non-deterministic data taken from
 165host clock. And VM state changes should preserve their order. Their relative
 166order in replay mode must replicate the order of callbacks in record mode.
 167To preserve this order we use checkpoints. When a specific clock is processed
 168in record mode we save to the log special "checkpoint" event.
 169Checkpoints here do not refer to virtual machine snapshots. They are just
 170record/replay events used for synchronization.
 171
 172QEMU in replay mode will try to invoke timers processing in random moment
 173of time. That's why we do not process a group of timers until the checkpoint
 174event will be read from the log. Such an event allows synchronizing CPU
 175execution and timer events.
 176
 177Two other checkpoints govern the "warping" of the virtual clock.
 178While the virtual machine is idle, the virtual clock increments at
 1791 ns per *real time* nanosecond.  This is done by setting up a timer
 180(called the warp timer) on the virtual real time clock, so that the
 181timer fires at the next deadline of the virtual clock; the virtual clock
 182is then incremented (which is called "warping" the virtual clock) as
 183soon as the timer fires or the CPUs need to go out of the idle state.
 184Two functions are used for this purpose; because these actions change
 185virtual machine state and must be deterministic, each of them creates a
 186checkpoint.  qemu_start_warp_timer checks if the CPUs are idle and if so
 187starts accounting real time to virtual clock.  qemu_account_warp_timer
 188is called when the CPUs get an interrupt or when the warp timer fires,
 189and it warps the virtual clock by the amount of real time that has passed
 190since qemu_start_warp_timer.
 191
 192Bottom halves
 193-------------
 194
 195Disk I/O events are completely deterministic in our model, because
 196in both record and replay modes we start virtual machine from the same
 197disk state. But callbacks that virtual disk controller uses for reading and
 198writing the disk may occur at different moments of time in record and replay
 199modes.
 200
 201Reading and writing requests are created by CPU thread of QEMU. Later these
 202requests proceed to block layer which creates "bottom halves". Bottom
 203halves consist of callback and its parameters. They are processed when
 204main loop locks the global mutex. These locks are not synchronized with
 205replaying process because main loop also processes the events that do not
 206affect the virtual machine state (like user interaction with monitor).
 207
 208That is why we had to implement saving and replaying bottom halves callbacks
 209synchronously to the CPU execution. When the callback is about to execute
 210it is added to the queue in the replay module. This queue is written to the
 211log when its callbacks are executed. In replay mode callbacks are not processed
 212until the corresponding event is read from the events log file.
 213
 214Sometimes the block layer uses asynchronous callbacks for its internal purposes
 215(like reading or writing VM snapshots or disk image cluster tables). In this
 216case bottom halves are not marked as "replayable" and do not saved
 217into the log.
 218
 219Block devices
 220-------------
 221
 222Block devices record/replay module intercepts calls of
 223bdrv coroutine functions at the top of block drivers stack.
 224To record and replay block operations the drive must be configured
 225as following:
 226 -drive file=disk.qcow2,if=none,id=img-direct
 227 -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
 228 -device ide-hd,drive=img-blkreplay
 229
 230blkreplay driver should be inserted between disk image and virtual driver
 231controller. Therefore all disk requests may be recorded and replayed.
 232
 233All block completion operations are added to the queue in the coroutines.
 234Queue is flushed at checkpoints and information about processed requests
 235is recorded to the log. In replay phase the queue is matched with
 236events read from the log. Therefore block devices requests are processed
 237deterministically.
 238
 239Snapshotting
 240------------
 241
 242New VM snapshots may be created in replay mode. They can be used later
 243to recover the desired VM state. All VM states created in replay mode
 244are associated with the moment of time in the replay scenario.
 245After recovering the VM state replay will start from that position.
 246
 247Default starting snapshot name may be specified with icount field
 248rrsnapshot as follows:
 249 -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name
 250
 251This snapshot is created at start of recording and restored at start
 252of replaying. It also can be loaded while replaying to roll back
 253the execution.
 254
 255Use QEMU monitor to create additional snapshots. 'savevm <name>' command
 256created the snapshot and 'loadvm <name>' restores it. To prevent corruption
 257of the original disk image, use overlay files linked to the original images.
 258Therefore all new snapshots (including the starting one) will be saved in
 259overlays and the original image remains unchanged.
 260
 261Network devices
 262---------------
 263
 264Record and replay for network interactions is performed with the network filter.
 265Each backend must have its own instance of the replay filter as follows:
 266 -netdev user,id=net1 -device rtl8139,netdev=net1
 267 -object filter-replay,id=replay,netdev=net1
 268
 269Replay network filter is used to record and replay network packets. While
 270recording the virtual machine this filter puts all packets coming from
 271the outer world into the log. In replay mode packets from the log are
 272injected into the network device. All interactions with network backend
 273in replay mode are disabled.
 274
 275Audio devices
 276-------------
 277
 278Audio data is recorded and replay automatically. The command line for recording
 279and replaying must contain identical specifications of audio hardware, e.g.:
 280 -soundhw ac97
 281
 282Serial ports
 283------------
 284
 285Serial ports input is recorded and replay automatically. The command lines
 286for recording and replaying must contain identical number of ports in record
 287and replay modes, but their backends may differ.
 288E.g., '-serial stdio' in record mode, and '-serial null' in replay mode.
 289
 290Replay log format
 291-----------------
 292
 293Record/replay log consists of the header and the sequence of execution
 294events. The header includes 4-byte replay version id and 8-byte reserved
 295field. Version is updated every time replay log format changes to prevent
 296using replay log created by another build of qemu.
 297
 298The sequence of the events describes virtual machine state changes.
 299It includes all non-deterministic inputs of VM, synchronization marks and
 300instruction counts used to correctly inject inputs at replay.
 301
 302Synchronization marks (checkpoints) are used for synchronizing qemu threads
 303that perform operations with virtual hardware. These operations may change
 304system's state (e.g., change some register or generate interrupt) and
 305therefore should execute synchronously with CPU thread.
 306
 307Every event in the log includes 1-byte event id and optional arguments.
 308When argument is an array, it is stored as 4-byte array length
 309and corresponding number of bytes with data.
 310Here is the list of events that are written into the log:
 311
 312 - EVENT_INSTRUCTION. Instructions executed since last event.
 313   Argument: 4-byte number of executed instructions.
 314 - EVENT_INTERRUPT. Used to synchronize interrupt processing.
 315 - EVENT_EXCEPTION. Used to synchronize exception handling.
 316 - EVENT_ASYNC. This is a group of events. They are always processed
 317   together with checkpoints. When such an event is generated, it is
 318   stored in the queue and processed only when checkpoint occurs.
 319   Every such event is followed by 1-byte checkpoint id and 1-byte
 320   async event id from the following list:
 321     - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
 322       callbacks that affect virtual machine state, but normally called
 323       asynchronously.
 324       Argument: 8-byte operation id.
 325     - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
 326       parameters of keyboard and mouse input operations
 327       (key press/release, mouse pointer movement).
 328       Arguments: 9-16 bytes depending of input event.
 329     - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
 330     - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
 331       initiated by the sender.
 332       Arguments: 1-byte character device id.
 333                  Array with bytes were read.
 334     - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
 335       operations with disk and flash drives with CPU.
 336       Argument: 8-byte operation id.
 337     - REPLAY_ASYNC_EVENT_NET. Incoming network packet.
 338       Arguments: 1-byte network adapter id.
 339                  4-byte packet flags.
 340                  Array with packet bytes.
 341 - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu,
 342   e.g., by closing the window.
 343 - EVENT_CHAR_WRITE. Used to synchronize character output operations.
 344   Arguments: 4-byte output function return value.
 345              4-byte offset in the output array.
 346 - EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
 347   initiated by qemu.
 348   Argument: Array with bytes that were read.
 349 - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
 350   initiated by qemu.
 351   Argument: 4-byte error code.
 352 - EVENT_CLOCK + clock_id. Group of events for host clock read operations.
 353   Argument: 8-byte clock value.
 354 - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
 355   CPU, internal threads, and asynchronous input events. May be followed
 356   by one or more EVENT_ASYNC events.
 357 - EVENT_END. Last event in the log.
 358