==========================================
Reducing OS jitter due to per-CPU kthreads
==========================================

This document lists per-CPU kthreads in the Linux kernel and presents
options to control their OS jitter.  Note that non-per-CPU kthreads are
not listed here.  To reduce OS jitter from non-per-CPU kthreads, bind
them to a "housekeeping" CPU dedicated to such work.

References
==========

-       Documentation/core-api/irq/irq-affinity.rst:  Binding interrupts to sets of CPUs.

-       Documentation/admin-guide/cgroup-v1:  Using cgroups to bind tasks to sets of CPUs.

-       man taskset:  Using the taskset command to bind tasks to sets
        of CPUs.

-       man sched_setaffinity:  Using the sched_setaffinity() system
        call to bind tasks to sets of CPUs.

-       /sys/devices/system/cpu/cpuN/online:  Control CPU N's hotplug state,
        writing "0" to offline and "1" to online.

-       In order to locate kernel-generated OS jitter on CPU N:

                cd /sys/kernel/debug/tracing
                echo 1 > max_graph_depth # Increase the "1" for more detail
                echo function_graph > current_tracer
                # run workload
                cat per_cpu/cpuN/trace

kthreads
========

Name:
  ehca_comp/%u

Purpose:
  Periodically process Infiniband-related work.

To reduce its OS jitter, do any of the following:

1.      Don't use eHCA Infiniband hardware, instead choosing hardware
        that does not require per-CPU kthreads.  This will prevent these
        kthreads from being created in the first place.  (This will
        work for most people, as this hardware, though important, is
        relatively old and is produced in relatively low unit volumes.)
2.      Do all eHCA-Infiniband-related work on other CPUs, including
        interrupts.
3.      Rework the eHCA driver so that its per-CPU kthreads are
        provisioned only on selected CPUs.


Name:
  irq/%d-%s

Purpose:
  Handle threaded interrupts.

To reduce its OS jitter, do the following:

1.      Use irq affinity to force the irq threads to execute on
        some other CPU.
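
The irq-affinity step can be sketched as follows.  This is a minimal
sketch, assuming the ``/proc/irq/N/smp_affinity_list`` interface described
in irq-affinity.rst.  The helper name ``set_irq_affinity`` and its optional
third argument (a replacement for ``/proc/irq``, intended for dry runs)
are hypothetical and are not part of any kernel tooling.

```shell
# Hypothetical helper: steer IRQ $1 to the CPU list $2 (for example "0-1").
# $3 optionally replaces /proc/irq, which is useful for dry runs.
set_irq_affinity() {
    irq=$1
    cpus=$2
    root=${3:-/proc/irq}
    file="$root/$irq/smp_affinity_list"
    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        return 1
    fi
    # The kernel rejects this write for interrupts that cannot be moved,
    # in which case the function returns a non-zero status.
    echo "$cpus" > "$file"
}
```

For example, ``set_irq_affinity 42 0-1`` (run as root) would ask the kernel
to deliver IRQ 42, and therefore run its irq thread, on CPUs 0 and 1.  The
IRQ number here is only an illustration.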

Name:
  kcmtpd_ctr_%d

Purpose:
  Handle Bluetooth work.

To reduce its OS jitter, do one of the following:

1.      Don't use Bluetooth, in which case these kthreads won't be
        created in the first place.
2.      Use irq affinity to force Bluetooth-related interrupts to
        occur on some other CPU and furthermore initiate all
        Bluetooth activity on some other CPU.

Name:
  ksoftirqd/%u

Purpose:
  Execute softirq handlers when threaded or when under heavy load.

To reduce its OS jitter, each softirq vector must be handled
separately as follows:

TIMER_SOFTIRQ
-------------

Do all of the following:

1.      To the extent possible, keep the CPU out of the kernel when it
        is non-idle, for example, by avoiding system calls and by forcing
        both kernel threads and interrupts to execute elsewhere.
2.      Build with CONFIG_HOTPLUG_CPU=y.  After boot completes, force
        the CPU offline, then bring it back online.  This forces
        recurring timers to migrate elsewhere.  If you are concerned
        with multiple CPUs, force them all offline before bringing the
        first one back online.  Once you have onlined the CPUs in question,
        do not offline any other CPUs, because doing so could force the
        timer back onto one of the CPUs in question.
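
The offline-then-online sequence in step 2 can be sketched as follows,
using the ``/sys/devices/system/cpu/cpuN/online`` files listed in the
References section.  The helper name ``cycle_cpus`` and its optional
second argument (a replacement for the sysfs CPU directory, intended for
dry runs) are hypothetical.

```shell
# Hypothetical helper: offline every listed CPU, then bring them all back.
# Usage: cycle_cpus "2 3" [cpu_root]; cpu_root defaults to the sysfs path.
cycle_cpus() {
    cpus=$1
    root=${2:-/sys/devices/system/cpu}
    # Offline all CPUs of interest first, so that a timer migrating away
    # from one of them cannot land on another.
    for n in $cpus; do
        echo 0 > "$root/cpu$n/online" || return 1
    done
    # Only then bring them back online.
    for n in $cpus; do
        echo 1 > "$root/cpu$n/online" || return 1
    done
}
```

Running ``cycle_cpus "2 3"`` as root after boot would cycle CPUs 2 and 3.
Some systems do not allow every CPU to be taken offline, in which case the
write fails and the helper stops with a non-zero status.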

NET_TX_SOFTIRQ and NET_RX_SOFTIRQ
---------------------------------

Do all of the following:

1.      Force networking interrupts onto other CPUs.
2.      Initiate any network I/O on other CPUs.
3.      Once your application has started, prevent CPU-hotplug operations
        from being initiated from tasks that might run on the CPU to
        be de-jittered.  (It is OK to force this CPU offline and then
        bring it back online before you start your application.)

BLOCK_SOFTIRQ
-------------

Do all of the following:

1.      Force block-device interrupts onto some other CPU.
2.      Initiate any block I/O on other CPUs.
3.      Once your application has started, prevent CPU-hotplug operations
        from being initiated from tasks that might run on the CPU to
        be de-jittered.  (It is OK to force this CPU offline and then
        bring it back online before you start your application.)

IRQ_POLL_SOFTIRQ
----------------

Do all of the following:

1.      Force block-device interrupts onto some other CPU.
2.      Initiate any block I/O and block-I/O polling on other CPUs.
3.      Once your application has started, prevent CPU-hotplug operations
        from being initiated from tasks that might run on the CPU to
        be de-jittered.  (It is OK to force this CPU offline and then
        bring it back online before you start your application.)

TASKLET_SOFTIRQ
---------------

Do one or more of the following:

1.      Avoid use of drivers that use tasklets.  (Such drivers will contain
        calls to things like tasklet_schedule().)
2.      Convert all drivers that you must use from tasklets to workqueues.
3.      Force interrupts for drivers using tasklets onto other CPUs,
        and also do I/O involving these drivers on other CPUs.
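
To find candidate drivers for steps 1 and 2, one rough approach is to
search a kernel source tree for tasklet_schedule() call sites.  The sketch
below assumes a ``grep`` that supports recursive search (``-r``); the
helper name ``find_tasklet_users`` is hypothetical, and a textual match is
only a hint, not proof that the driver is built into or loaded by your
kernel.

```shell
# Hypothetical helper: list files under directory $1 that contain a
# tasklet_schedule() call site.
find_tasklet_users() {
    grep -rl 'tasklet_schedule(' "$1"
}
```

For example, ``find_tasklet_users drivers/net`` run at the top of a kernel
source tree prints one matching file name per line.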

SCHED_SOFTIRQ
-------------

Do all of the following:

1.      Avoid sending scheduler IPIs to the CPU to be de-jittered,
        for example, by ensuring that at most one runnable kthread is
        present on that CPU.  If a thread that expects to run on the
        de-jittered CPU awakens, the scheduler will send an IPI that
        can result in a subsequent SCHED_SOFTIRQ.
2.      Build with CONFIG_NO_HZ_FULL=y and ensure that the CPU to be
        de-jittered is marked as an adaptive-ticks CPU using the
        "nohz_full=" boot parameter.  This reduces the number of
        scheduler-clock interrupts that the de-jittered CPU receives,
        minimizing its chances of being selected to do the load
        balancing work that runs in SCHED_SOFTIRQ context.
3.      To the extent possible, keep the CPU out of the kernel when it
        is non-idle, for example, by avoiding system calls and by
        forcing both kernel threads and interrupts to execute elsewhere.
        This further reduces the number of scheduler-clock interrupts
        received by the de-jittered CPU.
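
Whether step 2 took effect can be checked after boot.  The sketch below
assumes a kernel that exports the adaptive-ticks CPU list in
``/sys/devices/system/cpu/nohz_full`` as a plain range list such as
"1-3,8".  The helper names ``cpu_in_list`` and ``is_nohz_full`` are
hypothetical, as is the optional file argument used for dry runs.

```shell
# Return 0 if CPU $1 appears in the kernel-style CPU list $2 ("1-3,8").
cpu_in_list() {
    cpu=$1
    old_ifs=$IFS
    IFS=,
    set -- $2            # split the list on commas into "$@"
    IFS=$old_ifs
    for range in "$@"; do
        lo=${range%-*}   # "1-3" -> "1", "8" -> "8"
        hi=${range#*-}   # "1-3" -> "3", "8" -> "8"
        # Skip empty or non-numeric entries such as "(null)".
        case $lo in ''|*[!0-9]*) continue ;; esac
        case $hi in ''|*[!0-9]*) continue ;; esac
        if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
            return 0
        fi
    done
    return 1
}

# Report whether CPU $1 is an adaptive-ticks CPU; $2 overrides the sysfs file.
is_nohz_full() {
    file=${2:-/sys/devices/system/cpu/nohz_full}
    [ -r "$file" ] && cpu_in_list "$1" "$(cat "$file")"
}
```

For example, ``is_nohz_full 3 && echo "CPU 3 is adaptive-ticks"`` prints
the message only if CPU 3 is in the list.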

HRTIMER_SOFTIRQ
---------------

Do all of the following:

1.      To the extent possible, keep the CPU out of the kernel when it
        is non-idle.  For example, avoid system calls and force both
        kernel threads and interrupts to execute elsewhere.
2.      Build with CONFIG_HOTPLUG_CPU=y.  Once boot completes, force the
        CPU offline, then bring it back online.  This forces recurring
        timers to migrate elsewhere.  If you are concerned with multiple
        CPUs, force them all offline before bringing the first one
        back online.  Once you have onlined the CPUs in question, do not
        offline any other CPUs, because doing so could force the timer
        back onto one of the CPUs in question.

RCU_SOFTIRQ
-----------

Do at least one of the following:

1.      Offload callbacks and keep the CPU in either dyntick-idle or
        adaptive-ticks state by doing all of the following:

        a.      Build with CONFIG_NO_HZ_FULL=y and ensure that the CPU
                to be de-jittered is marked as an adaptive-ticks CPU
                using the "nohz_full=" boot parameter.  Bind the rcuo
                kthreads to housekeeping CPUs, which can tolerate OS
                jitter.
        b.      To the extent possible, keep the CPU out of the kernel
                when it is non-idle, for example, by avoiding system
                calls and by forcing both kernel threads and interrupts
                to execute elsewhere.

2.      Enable RCU to do its processing remotely via dyntick-idle by
        doing all of the following:

        a.      Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
        b.      Ensure that the CPU goes idle frequently, allowing other
                CPUs to detect that it has passed through an RCU quiescent
                state.  If the kernel is built with CONFIG_NO_HZ_FULL=y,
                userspace execution also allows other CPUs to detect that
                the CPU in question has passed through a quiescent state.
        c.      To the extent possible, keep the CPU out of the kernel
                when it is non-idle, for example, by avoiding system
                calls and by forcing both kernel threads and interrupts
                to execute elsewhere.

Name:
  kworker/%u:%d%s (cpu, id, priority)

Purpose:
  Execute workqueue requests.

To reduce its OS jitter, do any of the following:

1.      Run your workload at a real-time priority, which will allow
        preempting the kworker daemons.
2.      A given workqueue can be made visible in the sysfs filesystem
        by passing the WQ_SYSFS flag to that workqueue's
        alloc_workqueue().  Such a workqueue can be confined to a given
        subset of the CPUs using the
        ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs files.
        The set of WQ_SYSFS workqueues can be displayed using
        "ls /sys/devices/virtual/workqueue".  That said, the workqueues
        maintainer would like to caution people against indiscriminately
        sprinkling WQ_SYSFS across all the workqueues.  The reason for
        caution is that it is easy to add WQ_SYSFS, but because sysfs is
        part of the formal user/kernel API, it can be nearly impossible
        to remove it, even if its addition was a mistake.
3.      Do any of the following needed to avoid jitter that your
        application cannot tolerate:

        a.      Build your kernel with CONFIG_SLUB=y rather than
                CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
                use of each CPU's workqueues to run its cache_reap()
                function.
        b.      Avoid using oprofile, thus avoiding OS jitter from
                wq_sync_buffer().
        c.      Limit your CPU frequency so that a CPU-frequency
                governor is not required, possibly enlisting the aid of
                special heatsinks or other cooling technologies.  If done
                correctly, and if your CPU architecture permits, you should
                be able to build your kernel with CONFIG_CPU_FREQ=n to
                avoid the CPU-frequency governor periodically running
                on each CPU, including cs_dbs_timer() and od_dbs_timer().

                WARNING:  Please check your CPU specifications to
                make sure that this is safe on your particular system.
        d.      As of v3.18, Christoph Lameter's on-demand vmstat workers
                commit prevents OS jitter due to vmstat_update() on
                CONFIG_SMP=y systems.  Before v3.18, it is not possible
                to entirely get rid of this OS jitter, but you can
                decrease its frequency by writing a large value to
                /proc/sys/vm/stat_interval.  The default value is HZ,
                for an interval of one second.  Larger values will, of
                course, make your virtual-memory statistics update more
                slowly.  You can also run your workload at a real-time
                priority, thus preempting vmstat_update(), but if your
                workload is CPU-bound, this is a bad idea.
                However, there is an RFC patch from Christoph Lameter
                (based on an earlier one from Gilad Ben-Yossef) that
                reduces or even eliminates vmstat overhead for some
                workloads at https://lore.kernel.org/r/00000140e9dfd6bd-40db3d4f-c1be-434f-8132-7820f81bb586-000000@email.amazonses.com.
        e.      If running on high-end powerpc servers, build with
                CONFIG_PPC_RTAS_DAEMON=n.  This prevents the RTAS
                daemon from running on each CPU every second or so.
                (This will require editing Kconfig files and will defeat
                this platform's RAS functionality.)  This avoids jitter
                due to the rtas_event_scan() function.
                WARNING:  Please check your CPU specifications to
                make sure that this is safe on your particular system.
        f.      If running on a Cell processor, build your kernel with
                CONFIG_CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
                spu_gov_work().
                WARNING:  Please check your CPU specifications to
                make sure that this is safe on your particular system.
        g.      If running on PowerMAC, build your kernel with
                CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
                avoiding OS jitter from rackmeter_do_timer().
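
Option 2 above, confining WQ_SYSFS workqueues, can be sketched as follows.
This is a minimal sketch that writes one hexadecimal CPU mask to every
``/sys/devices/virtual/workqueue/*/cpumask`` file it finds.  The helper
name ``confine_workqueues`` and its optional second argument (a
replacement for the sysfs directory, intended for dry runs) are
hypothetical.  Workqueues that do not expose a cpumask file are skipped.

```shell
# Hypothetical helper: confine every WQ_SYSFS workqueue that exposes a
# cpumask file to the hexadecimal CPU mask $1 (for example "3" = CPUs 0-1).
# $2 optionally replaces the sysfs workqueue directory (for dry runs).
confine_workqueues() {
    mask=$1
    root=${2:-/sys/devices/virtual/workqueue}
    for f in "$root"/*/cpumask; do
        [ -e "$f" ] || continue      # the glob matched nothing
        if echo "$mask" > "$f"; then
            echo "$f <- $mask"
        else
            echo "failed: $f" >&2
        fi
    done
}
```

For example, ``confine_workqueues 3`` run as root would restrict those
workqueues to CPUs 0 and 1, leaving the remaining CPUs free of their
kworker activity.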

Name:
  rcuc/%u

Purpose:
  Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.

To reduce its OS jitter, do at least one of the following:

1.      Build the kernel with CONFIG_PREEMPT=n.  This prevents these
        kthreads from being created in the first place, and also obviates
        the need for RCU priority boosting.  This approach is feasible
        for workloads that do not require high degrees of responsiveness.
2.      Build the kernel with CONFIG_RCU_BOOST=n.  This prevents these
        kthreads from being created in the first place.  This approach
        is feasible only if your workload never requires RCU priority
        boosting, for example, if you ensure frequent idle time on all
        CPUs that might execute within the kernel.
3.      Build with CONFIG_RCU_NOCB_CPU=y and boot with the rcu_nocbs=
        boot parameter offloading RCU callbacks from all CPUs susceptible
        to OS jitter.  This approach prevents the rcuc/%u kthreads from
        having any work to do, so that they are never awakened.
4.      Ensure that the CPU never enters the kernel, and, in particular,
        avoid initiating any CPU hotplug operations on this CPU.  This is
        another way of preventing any callbacks from being queued on the
        CPU, again preventing the rcuc/%u kthreads from having any work
        to do.
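
For option 3, it can be useful to confirm which CPUs the running kernel
was actually booted with in rcu_nocbs=.  The sketch below reads the boot
command line from ``/proc/cmdline``.  The helper name ``boot_param`` and
its optional second argument (a replacement file, intended for dry runs)
are hypothetical, and the sketch only handles parameters written as
name=value.

```shell
# Print the value of boot parameter $1 (for example "rcu_nocbs"), if present.
# $2 optionally replaces /proc/cmdline.
boot_param() {
    name=$1
    file=${2:-/proc/cmdline}
    # Word-split the command line and look for the first "name=value" word.
    for word in $(cat "$file"); do
        case $word in
            "$name"=*) echo "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}
```

For example, ``boot_param rcu_nocbs`` prints the offloaded CPU list, and
returns a non-zero status if the parameter was not given.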

Name:
  rcuop/%d and rcuos/%d

Purpose:
  Offload RCU callbacks from the corresponding CPU.

To reduce its OS jitter, do at least one of the following:

1.      Use affinity, cgroups, or another mechanism to force these
        kthreads to execute on some other CPU.
2.      Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these
        kthreads from being created in the first place.  However, please
        note that this will not eliminate OS jitter, but will instead
        shift it to RCU_SOFTIRQ.
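
Option 1 can be sketched with the taskset command from the References
section.  This is a minimal sketch, assuming a ``ps`` that accepts
``-e -o pid= -o comm=`` and reports kthread names such as rcuop/0 in the
comm column.  The helper names ``rcuo_pids`` and ``bind_rcuo_kthreads``
are hypothetical.

```shell
# Read "PID COMM" lines on stdin and print the PIDs of rcuop/N and rcuos/N.
rcuo_pids() {
    while read -r pid comm; do
        case $comm in
            rcuop/[0-9]*|rcuos/[0-9]*) echo "$pid" ;;
        esac
    done
}

# Hypothetical helper: bind every rcuo kthread to the CPU list $1.
bind_rcuo_kthreads() {
    ps -e -o pid= -o comm= | rcuo_pids | while read -r pid; do
        taskset -pc "$1" "$pid" || echo "could not bind $pid" >&2
    done
}
```

For example, ``bind_rcuo_kthreads 0-1`` run as root would move the
callback-offload kthreads onto housekeeping CPUs 0 and 1.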