qemu/docs/devel/multi-thread-tcg.rst
<<
>>
Prefs
   1..
   2  Copyright (c) 2015-2020 Linaro Ltd.
   3
   4  This work is licensed under the terms of the GNU GPL, version 2 or
   5  later. See the COPYING file in the top-level directory.
   6
   7==================
   8Multi-threaded TCG
   9==================
  10
  11This document outlines the design for multi-threaded TCG (a.k.a MTTCG)
  12system-mode emulation. user-mode emulation has always mirrored the
  13thread structure of the translated executable although some of the
  14changes done for MTTCG system emulation have improved the stability of
  15linux-user emulation.
  16
  17The original system-mode TCG implementation was single threaded and
  18dealt with multiple CPUs with simple round-robin scheduling. This
  19simplified a lot of things but became increasingly limited as systems
  20being emulated gained additional cores and per-core performance gains
  21for host systems started to level off.
  22
  23vCPU Scheduling
  24===============
  25
  26We introduce a new running mode where each vCPU will run on its own
  27user-space thread. This is enabled by default for all FE/BE
  28combinations where the host memory model is able to accommodate the
  29guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
  30guest has had the required work done to support this safely
  31(TARGET_SUPPORTS_MTTCG).
  32
  33System emulation will fall back to the original round robin approach
  34if:
  35
  36* forced by --accel tcg,thread=single
  37* enabling --icount mode
  38* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)
  39
  40In the general case of running translated code there should be no
  41inter-vCPU dependencies and all vCPUs should be able to run at full
  42speed. Synchronisation will only be required while accessing internal
  43shared data structures or when the emulated architecture requires a
  44coherent representation of the emulated machine state.
  45
  46Shared Data Structures
  47======================
  48
  49Main Run Loop
  50-------------
  51
  52Even when there is no code being generated there are a number of
  53structures associated with the hot-path through the main run-loop.
  54These are associated with looking up the next translation block to
  55execute. These include:
  56
  57    tb_jmp_cache (per-vCPU, cache of recent jumps)
  58    tb_ctx.htable (global hash table, phys address->tb lookup)
  59
  60As TB linking only occurs when blocks are in the same page this code
  61is critical to performance as looking up the next TB to execute is the
  62most common reason to exit the generated code.
  63
  64DESIGN REQUIREMENT: Make access to lookup structures safe with
  65multiple reader/writer threads. Minimise any lock contention to do it.
  66
  67The hot-path avoids using locks where possible. The tb_jmp_cache is
  68updated with atomic accesses to ensure consistent results. The fall
  69back QHT based hash table is also designed for lockless lookups. Locks
  70are only taken when code generation is required or TranslationBlocks
  71have their block-to-block jumps patched.
  72
  73Global TCG State
  74----------------
  75
  76User-mode emulation
  77~~~~~~~~~~~~~~~~~~~
  78
  79We need to protect the entire code generation cycle including any post
  80generation patching of the translated code. This also implies a shared
  81translation buffer which contains code running on all cores. Any
  82execution path that comes to the main run loop will need to hold a
  83mutex for code generation. This also includes times when we need flush
  84code or entries from any shared lookups/caches. Structures held on a
  85per-vCPU basis won't need locking unless other vCPUs will need to
  86modify them.
  87
  88DESIGN REQUIREMENT: Add locking around all code generation and TB
  89patching.
  90
  91(Current solution)
  92
  93Code generation is serialised with mmap_lock().
  94
  95!User-mode emulation
  96~~~~~~~~~~~~~~~~~~~~
  97
  98Each vCPU has its own TCG context and associated TCG region, thereby
  99requiring no locking during translation.
 100
 101Translation Blocks
 102------------------
 103
 104Currently the whole system shares a single code generation buffer
 105which when full will force a flush of all translations and start from
 106scratch again. Some operations also force a full flush of translations
 107including:
 108
 109  - debugging operations (breakpoint insertion/removal)
 110  - some CPU helper functions
 111  - linux-user spawning its first thread
 112
 113This is done with the async_safe_run_on_cpu() mechanism to ensure all
 114vCPUs are quiescent when changes are being made to shared global
 115structures.
 116
 117More granular translation invalidation events are typically due
 118to a change of the state of a physical page:
 119
 120  - code modification (self modify code, patching code)
 121  - page changes (new page mapping in linux-user mode)
 122
 123While setting the invalid flag in a TranslationBlock will stop it
 124being used when looked up in the hot-path there are a number of other
 125book-keeping structures that need to be safely cleared.
 126
 127Any TranslationBlocks which have been patched to jump directly to the
 128now invalid blocks need the jump patches reversing so they will return
 129to the C code.
 130
 131There are a number of look-up caches that need to be properly updated
 132including the:
 133
 134  - jump lookup cache
 135  - the physical-to-tb lookup hash table
 136  - the global page table
 137
 138The global page table (l1_map) which provides a multi-level look-up
 139for PageDesc structures which contain pointers to the start of a
 140linked list of all Translation Blocks in that page (see page_next).
 141
 142Both the jump patching and the page cache involve linked lists that
 143the invalidated TranslationBlock needs to be removed from.
 144
 145DESIGN REQUIREMENT: Safely handle invalidation of TBs
 146                      - safely patch/revert direct jumps
 147                      - remove central PageDesc lookup entries
 148                      - ensure lookup caches/hashes are safely updated
 149
 150(Current solution)
 151
 152The direct jump themselves are updated atomically by the TCG
 153tb_set_jmp_target() code. Modification to the linked lists that allow
 154searching for linked pages are done under the protection of tb->jmp_lock,
 155where tb is the destination block of a jump. Each origin block keeps a
 156pointer to its destinations so that the appropriate lock can be acquired before
 157iterating over a jump list.
 158
 159The global page table is a lockless radix tree; cmpxchg is used
 160to atomically insert new elements.
 161
 162The lookup caches are updated atomically and the lookup hash uses QHT
 163which is designed for concurrent safe lookup.
 164
 165Parallel code generation is supported. QHT is used at insertion time
 166as the synchronization point across threads, thereby ensuring that we only
 167keep track of a single TranslationBlock for each guest code block.
 168
 169Memory maps and TLBs
 170--------------------
 171
 172The memory handling code is fairly critical to the speed of memory
 173access in the emulated system. The SoftMMU code is designed so the
 174hot-path can be handled entirely within translated code. This is
 175handled with a per-vCPU TLB structure which once populated will allow
 176a series of accesses to the page to occur without exiting the
 177translated code. It is possible to set flags in the TLB address which
 178will ensure the slow-path is taken for each access. This can be done
 179to support:
 180
 181  - Memory regions (dividing up access to PIO, MMIO and RAM)
 182  - Dirty page tracking (for code gen, SMC detection, migration and display)
 183  - Virtual TLB (for translating guest address->real address)
 184
 185When the TLB tables are updated by a vCPU thread other than their own
 186we need to ensure it is done in a safe way so no inconsistent state is
 187seen by the vCPU thread.
 188
 189Some operations require updating a number of vCPUs TLBs at the same
 190time in a synchronised manner.
 191
 192DESIGN REQUIREMENTS:
 193
 194  - TLB Flush All/Page
 195    - can be across-vCPUs
 196    - cross vCPU TLB flush may need other vCPU brought to halt
 197    - change may need to be visible to the calling vCPU immediately
 198  - TLB Flag Update
 199    - usually cross-vCPU
 200    - want change to be visible as soon as possible
 201  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
 202    - This is a per-vCPU table - by definition can't race
 203    - updated by its own thread when the slow-path is forced
 204
 205(Current solution)
 206
 207We have updated cputlb.c to defer operations when a cross-vCPU
 208operation with async_run_on_cpu() which ensures each vCPU sees a
 209coherent state when it next runs its work (in a few instructions
 210time).
 211
 212A new set up operations (tlb_flush_*_all_cpus) take an additional flag
 213which when set will force synchronisation by setting the source vCPUs
 214work as "safe work" and exiting the cpu run loop. This ensure by the
 215time execution restarts all flush operations have completed.
 216
 217TLB flag updates are all done atomically and are also protected by the
 218corresponding page lock.
 219
 220(Known limitation)
 221
 222Not really a limitation but the wait mechanism is overly strict for
 223some architectures which only need flushes completed by a barrier
 224instruction. This could be a future optimisation.
 225
 226Emulated hardware state
 227-----------------------
 228
 229Currently thanks to KVM work any access to IO memory is automatically
 230protected by the global iothread mutex, also known as the BQL (Big
 231Qemu Lock). Any IO region that doesn't use global mutex is expected to
 232do its own locking.
 233
 234However IO memory isn't the only way emulated hardware state can be
 235modified. Some architectures have model specific registers that
 236trigger hardware emulation features. Generally any translation helper
 237that needs to update more than a single vCPUs of state should take the
 238BQL.
 239
 240As the BQL, or global iothread mutex is shared across the system we
 241push the use of the lock as far down into the TCG code as possible to
 242minimise contention.
 243
 244(Current solution)
 245
 246MMIO access automatically serialises hardware emulation by way of the
 247BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
 248and also defer the reset/startup of vCPUs to the vCPU context by way
 249of async_run_on_cpu().
 250
 251Updates to interrupt state are also protected by the BQL as they can
 252often be cross vCPU.
 253
 254Memory Consistency
 255==================
 256
 257Between emulated guests and host systems there are a range of memory
 258consistency models. Even emulating weakly ordered systems on strongly
 259ordered hosts needs to ensure things like store-after-load re-ordering
 260can be prevented when the guest wants to.
 261
 262Memory Barriers
 263---------------
 264
 265Barriers (sometimes known as fences) provide a mechanism for software
 266to enforce a particular ordering of memory operations from the point
 267of view of external observers (e.g. another processor core). They can
 268apply to any memory operations as well as just loads or stores.
 269
 270The Linux kernel has an excellent `write-up
 271<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
 272on the various forms of memory barrier and the guarantees they can
 273provide.
 274
 275Barriers are often wrapped around synchronisation primitives to
 276provide explicit memory ordering semantics. However they can be used
 277by themselves to provide safe lockless access by ensuring for example
 278a change to a signal flag will only be visible once the changes to
 279payload are.
 280
 281DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
 282
 283This would enforce a strong load/store ordering so all loads/stores
 284complete at the memory barrier. On single-core non-SMP strongly
 285ordered backends this could become a NOP.
 286
 287Aside from explicit standalone memory barrier instructions there are
 288also implicit memory ordering semantics which comes with each guest
 289memory access instruction. For example all x86 load/stores come with
 290fairly strong guarantees of sequential consistency whereas Arm has
 291special variants of load/store instructions that imply acquire/release
 292semantics.
 293
 294In the case of a strongly ordered guest architecture being emulated on
 295a weakly ordered host the scope for a heavy performance impact is
 296quite high.
 297
 298DESIGN REQUIREMENTS: Be efficient with use of memory barriers
 299       - host systems with stronger implied guarantees can skip some barriers
 300       - merge consecutive barriers to the strongest one
 301
 302(Current solution)
 303
 304The system currently has a tcg_gen_mb() which will add memory barrier
 305operations if code generation is being done in a parallel context. The
 306tcg_optimize() function attempts to merge barriers up to their
 307strongest form before any load/store operations. The solution was
 308originally developed and tested for linux-user based systems. All
 309backends have been converted to emit fences when required. So far the
 310following front-ends have been updated to emit fences when required:
 311
 312    - target-i386
 313    - target-arm
 314    - target-aarch64
 315    - target-alpha
 316    - target-mips
 317
 318Memory Control and Maintenance
 319------------------------------
 320
 321This includes a class of instructions for controlling system cache
 322behaviour. While QEMU doesn't model cache behaviour these instructions
 323are often seen when code modification has taken place to ensure the
 324changes take effect.
 325
 326Synchronisation Primitives
 327--------------------------
 328
 329There are two broad types of synchronisation primitives found in
 330modern ISAs: atomic instructions and exclusive regions.
 331
 332The first type offer a simple atomic instruction which will guarantee
 333some sort of test and conditional store will be truly atomic w.r.t.
 334other cores sharing access to the memory. The classic example is the
 335x86 cmpxchg instruction.
 336
 337The second type offer a pair of load/store instructions which offer a
 338guarantee that a region of memory has not been touched between the
 339load and store instructions. An example of this is Arm's ldrex/strex
 340pair where the strex instruction will return a flag indicating a
 341successful store only if no other CPU has accessed the memory region
 342since the ldrex.
 343
 344Traditionally TCG has generated a series of operations that work
 345because they are within the context of a single translation block so
 346will have completed before another CPU is scheduled. However with
 347the ability to have multiple threads running to emulate multiple CPUs
 348we will need to explicitly expose these semantics.
 349
 350DESIGN REQUIREMENTS:
 351  - Support classic atomic instructions
 352  - Support load/store exclusive (or load link/store conditional) pairs
 353  - Generic enough infrastructure to support all guest architectures
 354CURRENT OPEN QUESTIONS:
 355  - How problematic is the ABA problem in general?
 356
 357(Current solution)
 358
 359The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
 360can be used directly or combined to emulate other instructions like
 361Arm's ldrex/strex instructions. While they are susceptible to the ABA
 362problem so far common guests have not implemented patterns where
 363this may be a problem - typically presenting a locking ABI which
 364assumes cmpxchg like semantics.
 365
 366The code also includes a fall-back for cases where multi-threaded TCG
 367ops can't work (e.g. guest atomic width > host atomic width). In this
 368case an EXCP_ATOMIC exit occurs and the instruction is emulated with
 369an exclusive lock which ensures all emulation is serialised.
 370
 371While the atomic helpers look good enough for now there may be a need
 372to look at solutions that can more closely model the guest
 373architectures semantics.
 374