linux/Documentation/futex-requeue-pi.txt
<<
>>
Prefs
   1Futex Requeue PI
   2----------------
   3
   4Requeueing of tasks from a non-PI futex to a PI futex requires
   5special handling in order to ensure the underlying rt_mutex is never
   6left without an owner if it has waiters; doing so would break the PI
   7boosting logic [see rt-mutex-desgin.txt] For the purposes of
   8brevity, this action will be referred to as "requeue_pi" throughout
   9this document.  Priority inheritance is abbreviated throughout as
  10"PI".
  11
  12Motivation
  13----------
  14
  15Without requeue_pi, the glibc implementation of
  16pthread_cond_broadcast() must resort to waking all the tasks waiting
  17on a pthread_condvar and letting them try to sort out which task
  18gets to run first in classic thundering-herd formation.  An ideal
  19implementation would wake the highest-priority waiter, and leave the
  20rest to the natural wakeup inherent in unlocking the mutex
  21associated with the condvar.
  22
  23Consider the simplified glibc calls:
  24
  25/* caller must lock mutex */
  26pthread_cond_wait(cond, mutex)
  27{
  28        lock(cond->__data.__lock);
  29        unlock(mutex);
  30        do {
  31           unlock(cond->__data.__lock);
  32           futex_wait(cond->__data.__futex);
  33           lock(cond->__data.__lock);
  34        } while(...)
  35        unlock(cond->__data.__lock);
  36        lock(mutex);
  37}
  38
  39pthread_cond_broadcast(cond)
  40{
  41        lock(cond->__data.__lock);
  42        unlock(cond->__data.__lock);
  43        futex_requeue(cond->data.__futex, cond->mutex);
  44}
  45
  46Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
  47has waiters. Note that pthread_cond_wait() attempts to lock the
  48mutex only after it has returned to user space.  This will leave the
  49underlying rt_mutex with waiters, and no owner, breaking the
  50previously mentioned PI-boosting algorithms.
  51
  52In order to support PI-aware pthread_condvar's, the kernel needs to
  53be able to requeue tasks to PI futexes.  This support implies that
  54upon a successful futex_wait system call, the caller would return to
  55user space already holding the PI futex.  The glibc implementation
  56would be modified as follows:
  57
  58
  59/* caller must lock mutex */
  60pthread_cond_wait_pi(cond, mutex)
  61{
  62        lock(cond->__data.__lock);
  63        unlock(mutex);
  64        do {
  65           unlock(cond->__data.__lock);
  66           futex_wait_requeue_pi(cond->__data.__futex);
  67           lock(cond->__data.__lock);
  68        } while(...)
  69        unlock(cond->__data.__lock);
  70        /* the kernel acquired the the mutex for us */
  71}
  72
  73pthread_cond_broadcast_pi(cond)
  74{
  75        lock(cond->__data.__lock);
  76        unlock(cond->__data.__lock);
  77        futex_requeue_pi(cond->data.__futex, cond->mutex);
  78}
  79
  80The actual glibc implementation will likely test for PI and make the
  81necessary changes inside the existing calls rather than creating new
  82calls for the PI cases.  Similar changes are needed for
  83pthread_cond_timedwait() and pthread_cond_signal().
  84
  85Implementation
  86--------------
  87
  88In order to ensure the rt_mutex has an owner if it has waiters, it
  89is necessary for both the requeue code, as well as the waiting code,
  90to be able to acquire the rt_mutex before returning to user space.
  91The requeue code cannot simply wake the waiter and leave it to
  92acquire the rt_mutex as it would open a race window between the
  93requeue call returning to user space and the waiter waking and
  94starting to run.  This is especially true in the uncontended case.
  95
  96The solution involves two new rt_mutex helper routines,
  97rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
  98allow the requeue code to acquire an uncontended rt_mutex on behalf
  99of the waiter and to enqueue the waiter on a contended rt_mutex.
 100Two new system calls provide the kernel<->user interface to
 101requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_CMP_PI.
 102
 103FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
 104and pthread_cond_timedwait()) to block on the initial futex and wait
 105to be requeued to a PI-aware futex.  The implementation is the
 106result of a high-speed collision between futex_wait() and
 107futex_lock_pi(), with some extra logic to check for the additional
 108wake-up scenarios.
 109
 110FUTEX_REQUEUE_CMP_PI is called by the waker
 111(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
 112possibly wake the waiting tasks. Internally, this system call is
 113still handled by futex_requeue (by passing requeue_pi=1).  Before
 114requeueing, futex_requeue() attempts to acquire the requeue target
 115PI futex on behalf of the top waiter.  If it can, this waiter is
 116woken.  futex_requeue() then proceeds to requeue the remaining
 117nr_wake+nr_requeue tasks to the PI futex, calling
 118rt_mutex_start_proxy_lock() prior to each requeue to prepare the
 119task as a waiter on the underlying rt_mutex.  It is possible that
 120the lock can be acquired at this stage as well, if so, the next
 121waiter is woken to finish the acquisition of the lock.
 122
 123FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
 124their sum is all that really matters.  futex_requeue() will wake or
 125requeue up to nr_wake + nr_requeue tasks.  It will wake only as many
 126tasks as it can acquire the lock for, which in the majority of cases
 127should be 0 as good programming practice dictates that the caller of
 128either pthread_cond_broadcast() or pthread_cond_signal() acquire the
 129mutex prior to making the call. FUTEX_REQUEUE_PI requires that
 130nr_wake=1.  nr_requeue should be INT_MAX for broadcast and 0 for
 131signal.
 132