                Semantics and Behavior of Atomic and
                         Bitmask Operations

                          David S. Miller

        This document is intended to serve as a guide to Linux port
maintainers on how to implement atomic counter, bitops, and spinlock
interfaces properly.

        The atomic_t type should be defined as a signed integer.
Also, it should be made opaque such that any kind of cast to a normal
C integer type will fail.  Something like the following should
suffice:

        typedef struct { int counter; } atomic_t;

Historically, counter has been declared volatile.  This is now discouraged.
See Documentation/volatile-considered-harmful.txt for the complete rationale.

local_t is very similar to atomic_t. If the counter is per CPU and only
updated by one CPU, local_t is probably more appropriate. Please see
Documentation/local_ops.txt for the semantics of local_t.

The first operations to implement for atomic_t's are the initializers and
plain reads.

        #define ATOMIC_INIT(i)          { (i) }
        #define atomic_set(v, i)        ((v)->counter = (i))

The first macro is used in definitions, such as:

static atomic_t my_counter = ATOMIC_INIT(1);

The initializer is atomic in that the return values of the atomic operations
are guaranteed to be correct reflecting the initialized value if the
initializer is used before runtime.  If the initializer is used at runtime, a
proper implicit or explicit read memory barrier is needed before reading the
value with atomic_read from another thread.

The second interface can be used at runtime, as in:

        struct foo { atomic_t counter; };
        ...

        struct foo *k;

        k = kmalloc(sizeof(*k), GFP_KERNEL);
        if (!k)
                return -ENOMEM;
        atomic_set(&k->counter, 0);

The setting is atomic in that the return values of the atomic operations by
all threads are guaranteed to be correct reflecting either the value that has
been set with this operation or set with another operation.  A proper implicit
or explicit memory barrier is needed before the value set with the operation
is guaranteed to be readable with atomic_read from another thread.

Next, we have:

        #define atomic_read(v)  ((v)->counter)

which simply reads the counter value currently visible to the calling thread.
The read is atomic in that the return value is guaranteed to be one of the
values initialized or modified with the interface operations if a proper
implicit or explicit memory barrier is used after possible runtime
initialization by any other thread and the value is modified only with the
interface operations.  atomic_read does not guarantee that the runtime
initialization by any other thread is visible yet, so the user of the
interface must take care of that with a proper implicit or explicit memory
barrier.

*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***

Some architectures may choose to use the volatile keyword, barriers, or inline
assembly to guarantee some degree of immediacy for atomic_read() and
atomic_set().  This is not uniformly guaranteed, and may change in the future,
so all users of atomic_t should treat atomic_read() and atomic_set() as simple
C statements that may be reordered or optimized away entirely by the compiler
or processor, and explicitly invoke the appropriate compiler and/or memory
barrier for each use case.  Failure to do so will result in code that may
suddenly break when used with different architectures or compiler
optimizations, or even changes in unrelated code which changes how the
compiler optimizes the section accessing atomic_t variables.

*** YOU HAVE BEEN WARNED! ***

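As a minimal sketch of what this means in practice (the object layout
and the "published" flag below are hypothetical, purely for
illustration), a writer that initializes an atomic_t at runtime and
then publishes it must supply its own barriers, and the reader must
pair them:

        /* Writer: atomic_set() orders nothing by itself, so an
         * explicit write barrier is needed before the publish store.
         */
        atomic_set(&obj->counter, 0);
        smp_wmb();
        obj->published = 1;

        /* Reader: pair with a read barrier before trusting the
         * value returned by atomic_read().
         */
        if (obj->published) {
                smp_rmb();
                val = atomic_read(&obj->counter);
        }
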
Now, we move on to the atomic operation interfaces typically implemented with
the help of assembly code.

        void atomic_add(int i, atomic_t *v);
        void atomic_sub(int i, atomic_t *v);
        void atomic_inc(atomic_t *v);
        void atomic_dec(atomic_t *v);

These four routines add and subtract integral values to/from the given
atomic_t value.  The first two routines pass explicit integers by
which to make the adjustment, whereas the latter two use an implicit
adjustment value of "1".

One very important aspect of these four routines is that they DO NOT
require any explicit memory barriers.  They need only perform the
atomic_t counter update in an SMP safe manner.

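For example, a plain statistics counter that nothing else is ordered
against can use these routines as-is, with no barriers at all (the
counter below is made up for illustration):

        static atomic_t errors_seen = ATOMIC_INIT(0);

        static void note_errors(int how_many)
        {
                /* SMP safe update, no ordering implied or needed. */
                atomic_add(how_many, &errors_seen);
        }
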
Next, we have:

        int atomic_inc_return(atomic_t *v);
        int atomic_dec_return(atomic_t *v);

These routines add 1 and subtract 1, respectively, from the given
atomic_t and return the new counter value after the operation is
performed.

Unlike the above routines, it is required that explicit memory
barriers are performed before and after the operation.  It must be
done such that all memory operations before and after the atomic
operation calls are strongly ordered with respect to the atomic
operation itself.

For example, it should behave as if a smp_mb() call existed both
before and after the atomic operation.

If the atomic instructions used in an implementation provide explicit
memory barrier semantics which satisfy the above requirements, that is
fine as well.

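A typical use is handing out sequence numbers, where the caller gets
the full barrier semantics along with the returned value (the helper
below is an illustrative sketch, not an existing kernel interface):

        static atomic_t next_id = ATOMIC_INIT(0);

        static int get_new_id(void)
        {
                /* Behaves as if smp_mb() were issued before and
                 * after the increment.
                 */
                return atomic_inc_return(&next_id);
        }
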
Let's move on:

        int atomic_add_return(int i, atomic_t *v);
        int atomic_sub_return(int i, atomic_t *v);

These behave just like atomic_{inc,dec}_return() except that an
explicit counter adjustment is given instead of the implicit "1".
This means that like atomic_{inc,dec}_return(), the memory barrier
semantics are required.

Next:

        int atomic_inc_and_test(atomic_t *v);
        int atomic_dec_and_test(atomic_t *v);

These two routines increment and decrement by 1, respectively, the
given atomic counter.  They return a boolean indicating whether the
resulting counter value was zero or not.

They require explicit memory barrier semantics around the operation,
as above.

        int atomic_sub_and_test(int i, atomic_t *v);

This is identical to atomic_dec_and_test() except that an explicit
decrement is given instead of the implicit "1".  It requires explicit
memory barrier semantics around the operation.

        int atomic_add_negative(int i, atomic_t *v);

The given increment is added to the given atomic counter value.  A
boolean is returned which indicates whether the resulting counter
value is negative.  It requires explicit memory barrier semantics
around the operation.

Then:

        int atomic_xchg(atomic_t *v, int new);

This performs an atomic exchange operation on the atomic variable v, setting
the given new value.  It returns the old value that the atomic variable v had
just before the operation.

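A common pattern is consuming a pending value exactly once; since
atomic_xchg() returns the old value, only one thread can observe a
non-zero result (hypothetical sketch, the "pending" field and
process_pending() are not real kernel code):

        /* Atomically take the pending count, leaving zero behind. */
        pending = atomic_xchg(&obj->pending, 0);
        if (pending)
                process_pending(obj, pending);
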
        int atomic_cmpxchg(atomic_t *v, int old, int new);

This performs an atomic compare exchange operation on the atomic value v,
with the given old and new values. Like all atomic_xxx operations,
atomic_cmpxchg will only satisfy its atomicity semantics as long as all
other accesses of *v are performed through atomic_xxx operations.

atomic_cmpxchg requires explicit memory barriers around the operation.

The semantics for atomic_cmpxchg are the same as those defined for 'cas'
below.

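atomic_cmpxchg() is normally used in a read/modify/retry loop, for
updates that cannot be expressed with the fixed-function helpers
above.  Here is a sketch of a saturating increment (a hypothetical
helper, not an existing kernel routine):

        static int atomic_inc_saturating(atomic_t *v, int limit)
        {
                int old, new;

                for (;;) {
                        old = atomic_read(v);
                        if (old >= limit)
                                return old;     /* already saturated */
                        new = old + 1;
                        /* Succeeds only if nobody else changed *v
                         * between the read and the cmpxchg.
                         */
                        if (atomic_cmpxchg(v, old, new) == old)
                                return new;
                }
        }
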
Finally:

        int atomic_add_unless(atomic_t *v, int a, int u);

If the atomic value v is not equal to u, this function adds a to v, and
returns non-zero. If v is equal to u then it returns zero. This is done as
an atomic operation.

atomic_add_unless requires explicit memory barriers around the operation
unless it fails (returns 0).

atomic_inc_not_zero(v) is equivalent to atomic_add_unless(v, 1, 0): it
increments v by 1 unless v is zero, and returns non-zero if the
increment was performed.

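The classic use of atomic_inc_not_zero() is taking a new reference on
an object only if its refcount has not already fallen to zero (the
lookup helper below is a made-up illustration):

        /* Take a reference only if the object is not already being
         * torn down.
         */
        obj = lookup_obj(key);
        if (obj && !atomic_inc_not_zero(&obj->refcnt))
                obj = NULL;     /* too late, it is being freed */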

If a caller requires memory barrier semantics around an atomic_t
operation which does not return a value, a set of interfaces are
defined which accomplish this:

        void smp_mb__before_atomic_dec(void);
        void smp_mb__after_atomic_dec(void);
        void smp_mb__before_atomic_inc(void);
        void smp_mb__after_atomic_inc(void);

For example, smp_mb__before_atomic_dec() can be used like so:

        obj->dead = 1;
        smp_mb__before_atomic_dec();
        atomic_dec(&obj->ref_count);

It makes sure that all memory operations preceding the atomic_dec()
call are strongly ordered with respect to the atomic counter
operation.  In the above example, it guarantees that the assignment of
"1" to obj->dead will be globally visible to other cpus before the
atomic counter decrement.

Without the explicit smp_mb__before_atomic_dec() call, the
implementation could legally allow the atomic counter update to become
visible to other cpus before the "obj->dead = 1;" assignment.

The other three interfaces listed are used to provide explicit
ordering with respect to memory operations after an atomic_dec() call
(smp_mb__after_atomic_dec()) and around atomic_inc() calls
(smp_mb__{before,after}_atomic_inc()).

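For instance, smp_mb__after_atomic_inc() could be used to make the
counter bump globally visible before a later flag store (a
hypothetical ordering requirement, shown only to illustrate the
interface):

        atomic_inc(&obj->ref_count);
        smp_mb__after_atomic_inc();
        obj->handed_out = 1;
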
A missing memory barrier in the cases where they are required by the
atomic_t implementation above can have disastrous results.  Here is
an example, which follows a pattern occurring frequently in the Linux
kernel.  It is the use of atomic counters to implement reference
counting, and it works such that once the counter falls to zero it can
be guaranteed that no other entity can be accessing the object:

static void obj_list_add(struct obj *obj, struct list_head *head)
{
        obj->active = 1;
        list_add(&obj->list, head);
}

static void obj_list_del(struct obj *obj)
{
        list_del(&obj->list);
        obj->active = 0;
}

static void obj_destroy(struct obj *obj)
{
        BUG_ON(obj->active);
        kfree(obj);
}

struct obj *obj_list_peek(struct list_head *head)
{
        if (!list_empty(head)) {
                struct obj *obj;

                obj = list_entry(head->next, struct obj, list);
                atomic_inc(&obj->refcnt);
                return obj;
        }
        return NULL;
}

void obj_poke(void)
{
        struct obj *obj;

        spin_lock(&global_list_lock);
        obj = obj_list_peek(&global_list);
        spin_unlock(&global_list_lock);

        if (obj) {
                obj->ops->poke(obj);
                if (atomic_dec_and_test(&obj->refcnt))
                        obj_destroy(obj);
        }
}

void obj_timeout(struct obj *obj)
{
        spin_lock(&global_list_lock);
        obj_list_del(obj);
        spin_unlock(&global_list_lock);

        if (atomic_dec_and_test(&obj->refcnt))
                obj_destroy(obj);
}

(This is a simplification of the ARP queue management in the
 generic neighbour discovery code of the networking subsystem.
 Olaf Kirch found a bug wrt. memory barriers in kfree_skb() that
 exposed the atomic_t memory barrier requirements quite clearly.)

Given the above scheme, it must be the case that the obj->active
update done by the obj list deletion be visible to other processors
before the atomic counter decrement is performed.

Otherwise, the counter could fall to zero, yet obj->active would still
be set, thus triggering the assertion in obj_destroy().  The error
sequence looks like this:

        cpu 0                           cpu 1
        obj_poke()                      obj_timeout()
        obj = obj_list_peek();
        ... gains ref to obj, refcnt=2
                                        obj_list_del(obj);
                                        obj->active = 0 ...
                                        ... visibility delayed ...
                                        atomic_dec_and_test()
                                        ... refcnt drops to 1 ...
        atomic_dec_and_test()
        ... refcount drops to 0 ...
        obj_destroy()
        BUG() triggers since obj->active
        still seen as one
                                        obj->active update visibility occurs

With the memory barrier semantics required of the atomic_t operations
which return values, the above sequence of memory visibility can never
happen.  Specifically, in the above case the atomic_dec_and_test()
counter decrement would not become globally visible until the
obj->active update does.

As a historical note, 32-bit Sparc used to only allow usage of
24 bits of its atomic_t type.  This was because it used 8 bits
as a spinlock for SMP safety.  Sparc32 lacked a "compare and swap"
type instruction.  However, 32-bit Sparc has since been moved over
to a "hash table of spinlocks" scheme that allows the full 32-bit
counter to be realized.  Essentially, an array of spinlocks is
indexed into based upon the address of the atomic_t being operated
on, and that lock protects the atomic operation.  Parisc uses the
same scheme.

Another note is that the atomic_t operations returning values are
extremely slow on an old 386.

We will now cover the atomic bitmask operations.  You will find that
their SMP and memory barrier semantics are similar in shape and scope
to the atomic_t ops above.

Native atomic bit operations are defined to operate on objects aligned
to the size of an "unsigned long" C data type, and are at least of
that size.  The endianness of the bits within each "unsigned long" is
the native endianness of the cpu.

        void set_bit(unsigned long nr, volatile unsigned long *addr);
        void clear_bit(unsigned long nr, volatile unsigned long *addr);
        void change_bit(unsigned long nr, volatile unsigned long *addr);

These routines set, clear, and change, respectively, the bit number
indicated by "nr" on the bit mask pointed to by "addr".

They must execute atomically, yet there are no implicit memory barrier
semantics required of these interfaces.

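For example, marking work as pending in a flags word shared between
cpus might look like this (the bit number and the obj->flags member,
assumed to be an unsigned long, are invented for illustration):

        #define OBJ_PENDING     0       /* bit number, not a mask */

        set_bit(OBJ_PENDING, &obj->flags);      /* atomic, no barrier */
        ...
        clear_bit(OBJ_PENDING, &obj->flags);
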
        int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
        int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
        int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

Like the above, except that these routines return a boolean which
indicates whether the changed bit was set _BEFORE_ the atomic bit
operation.

WARNING! It is incredibly important that the value be a boolean,
ie. "0" or "1".  Do not try to be fancy and save a few instructions by
declaring the above to return "long" and just returning something like
"old_val & mask" because that will not work.

For one thing, this return value gets truncated to int in many code
paths using these interfaces, so on 64-bit if the bit is set in the
upper 32-bits then testers will never see that.

One great example of where this problem crops up is the thread_info
flag operations.  Routines such as test_and_set_ti_thread_flag() chop
the return value into an int.  There are other places where things
like this occur as well.

These routines, like the atomic_t counter operations returning values,
require explicit memory barrier semantics around their execution.  All
memory operations before the atomic bit operation call must be made
visible globally before the atomic bit operation is made visible.
Likewise, the atomic bit operation must be visible globally before any
subsequent memory operation is made visible.  For example:

        obj->dead = 1;
        if (test_and_set_bit(0, &obj->flags))
                /* ... */;
        obj->killed = 1;

The implementation of test_and_set_bit() must guarantee that
"obj->dead = 1;" is visible to cpus before the atomic memory operation
done by test_and_set_bit() becomes visible.  Likewise, the atomic
memory operation done by test_and_set_bit() must become visible before
"obj->killed = 1;" is visible.

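A common use of that returned boolean is performing one-time setup
exactly once, no matter how many cpus race to do it (the bit name and
init helper below are hypothetical):

        /* Only the caller that flips the bit from 0 to 1 sees a zero
         * return value and performs the initialization.
         */
        if (!test_and_set_bit(OBJ_INITIALIZED, &obj->flags))
                do_one_time_init(obj);
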
Finally there is the basic operation:

        int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);

Which returns a boolean indicating if bit "nr" is set in the bitmask
pointed to by "addr".

If explicit memory barriers are required around clear_bit() (which
does not return a value, and thus does not need to provide memory
barrier semantics), two interfaces are provided:

        void smp_mb__before_clear_bit(void);
        void smp_mb__after_clear_bit(void);

They are used as follows, and are akin to their atomic_t operation
brothers:

        /* All memory operations before this call will
         * be globally visible before the clear_bit().
         */
        smp_mb__before_clear_bit();
        clear_bit( ... );

        /* The clear_bit() will be visible before all
         * subsequent memory operations.
         */
        smp_mb__after_clear_bit();

There are two special bitops with lock barrier semantics (acquire/release,
same as spinlocks). These operate in the same way as their non-_lock/unlock
postfixed variants, except that they are to provide acquire/release semantics,
respectively. This means they can be used for bit_spin_trylock and
bit_spin_unlock type operations without specifying any more barriers.

        int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
        void clear_bit_unlock(unsigned long nr, unsigned long *addr);
        void __clear_bit_unlock(unsigned long nr, unsigned long *addr);

The __clear_bit_unlock version is non-atomic, however it still implements
unlock barrier semantics. This can be useful if the lock itself is protecting
the other bits in the word.

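As a sketch of the intended use (the bit number and flags word are
invented for illustration), a simple bit spinlock protecting the other
bits of the word could look like:

        /* Acquire: spin until we are the ones who set the bit.
         * test_and_set_bit_lock() has acquire semantics, so the
         * protected accesses cannot leak above it.
         */
        while (test_and_set_bit_lock(OBJ_LOCK_BIT, &obj->flags))
                cpu_relax();

        /* ... modify the other bits in obj->flags ... */

        /* Release: clear_bit_unlock() has release semantics, so the
         * protected accesses cannot leak below it.
         */
        clear_bit_unlock(OBJ_LOCK_BIT, &obj->flags);
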
Finally, there are non-atomic versions of the bitmask operations
provided.  They are used in contexts where some other higher-level SMP
locking scheme is being used to protect the bitmask, and thus less
expensive non-atomic operations may be used in the implementation.
They have names similar to the above bitmask operation interfaces,
except that two underscores are prefixed to the interface name.

        void __set_bit(unsigned long nr, volatile unsigned long *addr);
        void __clear_bit(unsigned long nr, volatile unsigned long *addr);
        void __change_bit(unsigned long nr, volatile unsigned long *addr);
        int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
        int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
        int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

These non-atomic variants also do not require any special memory
barrier semantics.

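For example, when a spinlock already serializes every writer of the
bitmap, the cheaper non-atomic forms are sufficient (the lock and
bitmap names are hypothetical):

        spin_lock(&map_lock);
        __set_bit(nr, bitmap);          /* the lock provides atomicity */
        __clear_bit(other_nr, bitmap);
        spin_unlock(&map_lock);
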
The routines xchg() and cmpxchg() need the same exact memory barriers
as the atomic and bit operations returning values.

Spinlocks and rwlocks have memory barrier expectations as well.
The rule to follow is simple:

1) When acquiring a lock, the implementation must make it globally
   visible before any subsequent memory operation.

2) When releasing a lock, the implementation must make it such that
   all previous memory operations are globally visible before the
   lock release.

Which finally brings us to _atomic_dec_and_lock().  There is an
architecture-neutral version implemented in lib/dec_and_lock.c,
but most platforms will wish to optimize this in assembler.

        int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);

Atomically decrement the given counter, and if it will drop to zero
atomically acquire the given spinlock and perform the decrement
of the counter to zero.  If it does not drop to zero, do nothing
with the spinlock.

It is actually pretty simple to get the memory barrier correct.
Simply satisfy the spinlock grab requirements, which is to make
sure the spinlock operation is globally visible before any
subsequent memory operation.

We can demonstrate this operation more clearly if we define
an abstract atomic operation:

        long cas(long *mem, long old, long new);

"cas" stands for "compare and swap".  It atomically:

1) Compares "old" with the value currently at "mem".
2) If they are equal, "new" is written to "mem".
3) Regardless, the current value at "mem" is returned.

As an example usage, here is what an atomic counter update
might look like:

void example_atomic_inc(long *counter)
{
        long old, new, ret;

        while (1) {
                old = *counter;
                new = old + 1;

                ret = cas(counter, old, new);
                if (ret == old)
                        break;
        }
}

Let's use cas() in order to build a pseudo-C atomic_dec_and_lock():

int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
{
        long old, new, ret;
        int went_to_zero;

        went_to_zero = 0;
        while (1) {
                old = atomic_read(atomic);
                new = old - 1;
                if (new == 0) {
                        went_to_zero = 1;
                        spin_lock(lock);
                }
                ret = cas(atomic, old, new);
                if (ret == old)
                        break;
                if (went_to_zero) {
                        spin_unlock(lock);
                        went_to_zero = 0;
                }
        }

        return went_to_zero;
}

Now, as far as memory barriers go, as long as spin_lock()
strictly orders all subsequent memory operations (including
the cas()) with respect to itself, things will be fine.

Said another way, _atomic_dec_and_lock() must guarantee that
a counter dropping to zero is never made visible before the
spinlock being acquired.

Note that this also means that for the case where the counter
is not dropping to zero, there are no memory ordering
requirements.

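To close, here is what a typical caller might look like, using the
interface as named above to drop the last reference to an object that
lives on a locked list (the list and lock are hypothetical):

        if (_atomic_dec_and_lock(&obj->refcnt, &obj_list_lock)) {
                /* The counter hit zero and the lock is held, so no
                 * other cpu can find the object any more.
                 */
                list_del(&obj->list);
                spin_unlock(&obj_list_lock);
                kfree(obj);
        }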