linux/Documentation/x86/pti.txt
Overview
========

Page Table Isolation (pti, previously known as KAISER[1]) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach[2].

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications.  When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the user copy is used again.
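The switching described above can be pictured with a small toy model.
The Python sketch below is illustrative only and is not kernel code;
the class name Cpu and the labels "user" and "kernel" are invented
for the example.

```python
# Toy model (not kernel code): which page-table copy is active as a
# CPU moves between user mode and the kernel.  All names are invented.

class Cpu:
    def __init__(self):
        self.mode = "user"
        self.page_tables = "user"   # copy currently loaded in CR3
        self.switches = 0           # number of modelled CR3 writes

    def _load(self, copy):
        # Only count a switch when the active copy really changes.
        if self.page_tables != copy:
            self.page_tables = copy
            self.switches += 1

    def enter_kernel(self, reason):
        # Entry via syscall, interrupt or exception.
        assert reason in ("syscall", "interrupt", "exception")
        self.mode = "kernel"
        self._load("kernel")

    def return_to_user(self):
        self.mode = "user"
        self._load("user")


cpu = Cpu()
cpu.enter_kernel("syscall")
in_kernel = cpu.page_tables     # copy in use while in the kernel
cpu.return_to_user()
back_in_user = cpu.page_tables  # copy in use after returning
```

One round trip through the kernel costs two switches in this model:
one on entry and one on exit.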

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT).  There are a few strictly unnecessary things that get mapped
such as the first C function when entering an interrupt (see
comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled.  It can be
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
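As a rough illustration of the boot-time switch, the Python sketch
below reads a kernel command line string and reports what was
requested.  It assumes that 'pti=' accepts the values on, off and
auto; check kernel-parameters.txt for the authoritative list.  The
function name is hypothetical.  When several PTI options appear, the
sketch simply lets the last one win, which is a simplification and
may not match the kernel's own precedence.

```python
# Sketch only: interpret the PTI-related boot parameters.
# Assumption: 'pti=' takes on/off/auto; other values are ignored here.

def pti_boot_request(cmdline):
    """Return "on", "off" or "auto" for a kernel command line string."""
    request = "auto"                 # assumed default when nothing is given
    for word in cmdline.split():
        if word == "nopti":
            request = "off"
        elif word.startswith("pti="):
            value = word[len("pti="):]
            if value in ("on", "off", "auto"):
                request = value
    return request
```

On a running Linux system the string can be taken from /proc/cmdline,
for example pti_boot_request(open("/proc/cmdline").read()).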

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI.  This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although _complete_, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level.  This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.
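The effect of the NX bit can be sketched in a few lines.  This is a
toy model in Python, not kernel code, and the function name is
invented.  It relies on one architectural fact: a no-execute bit set
at any level of the page tables makes everything below that entry
non-executable.

```python
# Toy model (not kernel code) of the NX poisoning described above.

def user_code_can_run(active_copy, page_is_executable=True):
    """Would an instruction fetch from a user address succeed?

    active_copy is "user" or "kernel": the page-table copy loaded
    in CR3 at the moment userspace starts executing.
    """
    if active_copy == "kernel":
        # NX at the top level overrides whatever the lower levels say.
        return False
    return page_is_executable
```

If a kernel->user CR3 switch is missed, active_copy is still "kernel"
and the very first user instruction faults instead of running with
the full kernel mapping present.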

The userspace page tables map only the kernel data needed to enter
and exit the kernel.  This data is entirely contained in the 'struct
cpu_entry_area' structure, which is placed in the fixmap.  The fixmap
gives each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables like normal.  The only difference is when the kernel
makes entries in the top (PGD) level.  In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables.  This leaves a single, shared set of
userspace page tables to manage.  One PTE to lock, one set of
accessed bits, dirty bits, etc...
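The sharing of lower levels can be pictured with a toy model.  The
Python sketch below is not kernel code: dictionaries stand in for
page-table pages and every name is invented for illustration.

```python
# Toy model (not kernel code): two top-level tables that share their
# lower levels.  Dicts stand in for page-table pages.

kernel_pgd = {}
user_pgd = {}

def set_user_pgd_entry(index, lower_levels):
    """Install a top-level entry that maps userspace memory."""
    kernel_pgd[index] = lower_levels
    user_pgd[index] = lower_levels   # the copy refers to the same object

lower = {}                           # stands in for the levels below the PGD
set_user_pgd_entry(0, lower)

# A later change below the PGD is made once and is visible through
# both top-level tables.
lower[0x10] = "pte for some user page"
```

Because both PGD entries refer to the same lower-level object, there
is only one PTE to update, which is the point made above.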

Overhead
========

Protection against side-channel attacks is important.  But,
this protection comes at a cost:

1. Increased Memory Use
  a. Each process now needs an order-1 PGD instead of order-0.
     (Consumes an additional 4k per process).
  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
     aligned so that it can be mapped by setting a single PMD
     entry.  This consumes nearly 2MB of RAM once the kernel
     is decompressed, but no space in the kernel image itself.

2. Runtime Cost
  a. CR3 manipulation to switch between the page table copies
     must be done at interrupt, syscall, and exception entry
     and exit (it can be skipped when the kernel itself is
     interrupted, though, because the kernel copy is already
     in use).  Moves to CR3 are on the order of a hundred
     cycles, and are required at every entry and exit.
  b. A "trampoline" must be used for SYSCALL entry.  This
     trampoline depends on a smaller set of resources than the
     non-PTI SYSCALL entry code, so requires mapping fewer
     things into the userspace page tables.  The downside is
     that stacks must be switched at entry time.
  c. Global pages are disabled for all kernel structures not
     mapped into both kernel and userspace page tables.  This
     feature of the MMU allows different processes to share TLB
     entries mapping the kernel.  Losing the feature means more
     TLB misses after a context switch.  The actual loss of
     performance is very small, however, never exceeding 1%.
  d. Process Context IDentifiers (PCID) is a CPU feature that
     allows us to skip flushing the entire TLB when switching page
     tables by setting a special bit in CR3 when the page tables
     are changed.  This makes switching the page tables (at context
     switch, or kernel entry/exit) cheaper.  But, on systems with
     PCID support, the context switch code must flush both the user
     and kernel entries out of the TLB.  The user PCID TLB flush is
     deferred until the exit to userspace, minimizing the cost.
     See intel.com/sdm for the gory PCID/INVPCID details.
  e. The userspace page tables must be populated for each new
     process.  Even without PTI, the shared kernel mappings
     are created by copying top-level (PGD) entries into each
     new process.  But, with PTI, there are now *two* kernel
     mappings: one in the kernel page tables that maps everything
     and one for the entry/exit structures.  At fork(), we need to
     copy both.
  f. In addition to the fork()-time copying, there must also
     be an update to the userspace PGD any time a set_pgd() is done
     on a PGD used to map userspace.  This ensures that the kernel
     and userspace copies always map the same userspace
     memory.
  g. On systems without PCID support, each CR3 write flushes
     the entire TLB.  That means that each syscall, interrupt
     or exception flushes the TLB.
  h. INVPCID is a TLB-flushing instruction which allows flushing
     of TLB entries for non-current PCIDs.  Some systems support
     PCIDs, but do not support INVPCID.  On these systems, addresses
     can only be flushed from the TLB for the current PCID.  When
     flushing a kernel address, we need to flush all PCIDs, so a
     single kernel address flush will require a TLB-flushing CR3
     write upon the next use of every PCID.
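Two of the costs above come down to simple arithmetic, sketched below
in Python.  The sketch makes several assumptions that are not stated
in this document: 4 KiB pages, the user copy of the PGD being the
second page of the order-1 allocation, and the architectural CR3
layout in which bits 0-11 carry the PCID and bit 63 asks the CPU not
to flush the TLB on the write.  The helper names are invented, and
how the kernel assigns PCIDs is not modelled.

```python
# Sketch only (not kernel code): PGD sizing and CR3 composition.

PAGE_SIZE = 4096                 # assumption: 4 KiB pages
CR3_NOFLUSH = 1 << 63            # "do not flush" bit on a CR3 write
CR3_PCID_MASK = 0xFFF            # PCID lives in bits 0-11

def pgd_bytes(order):
    """Size in bytes of a 2**order-page PGD allocation."""
    return PAGE_SIZE << order

def user_pgd_address(kernel_pgd_address):
    # Assumption: the user copy is the second page of the order-1 block.
    return kernel_pgd_address + PAGE_SIZE

def make_cr3(pgd_address, pcid, noflush):
    """Combine a page-aligned PGD address, a PCID and the no-flush bit."""
    assert pgd_address % PAGE_SIZE == 0
    assert 0 <= pcid <= CR3_PCID_MASK
    value = pgd_address | pcid
    if noflush:
        value |= CR3_NOFLUSH
    return value

# Item 1a: order-1 instead of order-0 costs one extra page per process.
extra_per_process = pgd_bytes(1) - pgd_bytes(0)
```

With PCID available, a CR3 write built with noflush=True switches
page tables without discarding the TLB, which is the saving that
item 2d describes; without PCID, every write behaves as in item 2g.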

Possible Future Work
====================

1. We can be more careful about not actually writing to CR3
   unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
=======

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts).  This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs.  Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.

        while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

 * Failures of the selftests/x86 code.  Usually a bug in one of the
   more obscure corners of entry_64.S.
 * Crashes in early boot, especially around CPU bringup.  Bugs
   in the trampoline code or mappings cause these.
 * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
   like screwing up a page table switch.  Also caused by
   incorrectly mapping the IRQ handler entry code.
 * Crashes at the first NMI.  The NMI code is separate from main
   interrupt handlers and can have bugs that do not affect
   normal interrupts.  Also caused by incorrectly mapping NMI
   code.  NMIs that interrupt the entry code must be very
   careful and can be the cause of crashes that show up when
   running perf.
 * Kernel crashes at the first exit to userspace.  entry_64.S
   bugs, or failing to map some of the exit code.
 * Crashes at the first interrupt that interrupts userspace.  The
   paths in entry_64.S that return to userspace are sometimes
   separate from the ones that return to the kernel.
 * Double faults: overflowing the kernel stack because of page
   faults upon page faults.  Caused by touching non-pti-mapped
   data in the entry code, or forgetting to switch to kernel
   CR3 before calling into C functions which are not pti-mapped.
 * Userspace segfaults early in boot, sometimes manifesting
   as mount(8) failing to mount the rootfs.  These have
   tended to be TLB invalidation issues.  Usually invalidating
   the wrong PCID, or otherwise missing an invalidation.

1. https://gruss.cc/files/kaiser.pdf
2. https://meltdownattack.com/meltdown.pdf