unshare system call

This document describes the new system call, unshare(). The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log

version 0.1  Initial document, Janak Desai, Jan 11, 2006

Contents

        1) Overview
        2) Benefits
        3) Cost
        4) Requirements
        5) Functional Specification
        6) High Level Design
        7) Low Level Design
        8) Test Specification
        9) Future Work

1) Overview

Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, makes no distinction between
processes and "threads". The kernel allows processes to share resources,
and thus they can achieve legacy "threads" behavior without requiring
additional data structures and mechanisms in the kernel. The power of
implementing threads in this manner comes not only from its simplicity
but also from allowing application programmers to work outside the
confinement of the all-or-nothing shared resources of legacy threads.
On Linux, at the time of thread creation using the clone system call,
applications can selectively choose which resources to share between
threads.
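
As a rough illustration of selective sharing at creation time, the
following user-space sketch creates a child with the glibc clone()
wrapper and reports whether a write made by the child is visible to the
parent. The names counter_after_clone() and child_fn() and the 64 KiB
stack size are assumptions made for this example only.

```c
#define _GNU_SOURCE             /* exposes clone() and CLONE_* in <sched.h> */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile int counter;    /* written by the child, read by the parent */

static int child_fn(void *arg)
{
    (void)arg;
    counter = 42;               /* reaches the parent only if memory is shared */
    return 0;
}

/*
 * Create a child that shares only what share_flags names, wait for it,
 * and return the value of counter the parent then observes (-1 on error).
 */
int counter_after_clone(int share_flags)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    int seen = -1;
    pid_t pid;

    if (stack == NULL)
        return -1;
    counter = 0;
    /* The stack grows down on common architectures, so pass its top. */
    pid = clone(child_fn, stack + stack_size, share_flags | SIGCHLD, NULL);
    if (pid != -1 && waitpid(pid, NULL, 0) == pid)
        seen = counter;
    free(stack);
    return seen;
}
```

With CLONE_VM the parent is expected to observe the child's write; with
no sharing flags the child works on its own copy and the parent's value
stays at zero.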

The unshare() system call adds a primitive to the Linux thread model
that allows threads to selectively 'unshare' any resources that were
being shared at the time of their creation. unshare() was conceptualized
by Al Viro in August 2000, on the Linux-Kernel mailing list, as part of
the discussion on POSIX threads on Linux. unshare() augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare() is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits

unshare() would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare() can benefit
even non-threaded applications if they need to disassociate from the
default shared namespace. The following are two use cases where
unshare() can be used.

2.1 Per-security context namespaces

unshare() can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instances of /tmp, /var/tmp or
per-security context instances of a user's home directory, isolate user
processes when working with these directories. Using unshare(), a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with the Labeled System Protection Profile. However, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.
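
A minimal sketch of what such a login-time step might look like is given
below. It is not taken from any PAM module. The function name
polyinstantiate() and both path parameters are hypothetical, and the
step that marks all mounts private is an assumption about systems where
mounts are shared by default.

```c
#define _GNU_SOURCE             /* exposes unshare() and CLONE_NEWNS */
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Give the calling process a private namespace and bind instance_dir
 * over poly_dir inside it. Returns 0 on success or a negative errno
 * value. Both paths are chosen by the caller.
 */
int polyinstantiate(const char *instance_dir, const char *poly_dir)
{
    /* Needs CAP_SYS_ADMIN; fails with EPERM otherwise. */
    if (unshare(CLONE_NEWNS) == -1)
        return -errno;

    /*
     * Assumption: mounts may be marked shared on this system, so turn
     * off propagation to keep the bind mount from leaking elsewhere.
     */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return -errno;

    /* Overlay the per-user instance on the well-known path. */
    if (mount(instance_dir, poly_dir, NULL, MS_BIND, NULL) == -1)
        return -errno;

    return 0;
}
```

A caller might invoke polyinstantiate() with a per-user directory and
"/tmp" before starting the user's session. Other processes continue to
see the original /tmp.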

2.2 Unsharing of virtual memory and/or open files

Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare(), the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare() gives the server the ability to
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
ability to unshare() after the process was created can be very
useful.
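
The following sketch illustrates the idea for open files only. A worker
is created with a shared file descriptor table and may detach from it
before closing a descriptor. The names worker(), worker_args and
fd_survives_worker_close() are hypothetical, and /dev/null merely stands
in for a descriptor the server wants to keep.

```c
#define _GNU_SOURCE             /* exposes clone(), unshare(), CLONE_FILES */
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct worker_args {
    int fd;                     /* descriptor visible through the shared table */
    int do_unshare;             /* nonzero: detach the table before closing */
};

static int worker(void *p)
{
    struct worker_args *a = p;

    if (a->do_unshare && unshare(CLONE_FILES) == -1)
        return 2;               /* unshare() unavailable, e.g. in a sandbox */
    close(a->fd);               /* after unshare() this is a private close */
    return 0;
}

/* Returns 1 if the server's fd survived, 0 if it was closed, -1 on error. */
int fd_survives_worker_close(int do_unshare)
{
    const size_t stack_size = 64 * 1024;
    struct worker_args args = { .fd = -1, .do_unshare = do_unshare };
    int status, result = -1;
    char *stack;
    pid_t pid;

    args.fd = open("/dev/null", O_RDONLY);
    if (args.fd == -1)
        return -1;
    stack = malloc(stack_size);
    if (stack == NULL) {
        close(args.fd);
        return -1;
    }

    pid = clone(worker, stack + stack_size, CLONE_FILES | SIGCHLD, &args);
    if (pid != -1 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = fcntl(args.fd, F_GETFD) != -1;

    free(stack);
    if (result != 0)
        close(args.fd);         /* still open in this process: clean up */
    return result;
}
```

Without the unshare() call the worker's close() also removes the
descriptor from the server's table, because the table is shared.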

3) Cost

In order to not duplicate code and to handle the fact that unshare()
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare() had to make minor reorganizational
changes to the copy_* functions utilized by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare() test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements

unshare() reverses sharing that was done using the clone(2) system call,
so unshare() should have an interface similar to clone(2). That is,
since flags in clone(int flags, void \*stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare() interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare() design should
allow incremental unsharing of those resources on an as-needed basis.

5) Functional Specification

NAME
        unshare - disassociate parts of the process execution context

SYNOPSIS
        #include <sched.h>

        int unshare(int flags);

DESCRIPTION
        unshare() allows a process to disassociate parts of its execution
        context that are currently being shared with other processes. Part
        of the execution context, such as the namespace, is shared by default
        when a new process is created using fork(2), while other parts,
        such as the virtual memory, open file descriptors, etc., may be
        shared by explicit request to share them when creating a process
        using clone(2).

        The main use of unshare() is to allow a process to control its
        shared execution context without creating a new process.

        The flags argument specifies one of the following constants, or
        a bitwise OR of several of them.

        CLONE_FS
                If CLONE_FS is set, file system information of the caller
                is disassociated from the shared file system information.

        CLONE_FILES
                If CLONE_FILES is set, the file descriptor table of the
                caller is disassociated from the shared file descriptor
                table.

        CLONE_NEWNS
                If CLONE_NEWNS is set, the namespace of the caller is
                disassociated from the shared namespace.

        CLONE_VM
                If CLONE_VM is set, the virtual memory of the caller is
                disassociated from the shared virtual memory.

RETURN VALUE
        On success, zero is returned. On failure, -1 is returned and errno is
        set to indicate the error.

ERRORS
        EPERM   CLONE_NEWNS was specified by a non-root process (process
                without CAP_SYS_ADMIN).

        ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
                context that need to be unshared.

        EINVAL  Invalid flag was specified as an argument.

CONFORMING TO
        The unshare() call is Linux-specific and should not be used
        in programs intended to be portable.

SEE ALSO
        clone(2), fork(2)
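
As a small usage sketch of the interface specified above, the following
helpers call unshare() and map the documented error codes to short
messages. The helper names unshare_errno() and unshare_strerror() are
hypothetical and are not part of any library. The message strings are
paraphrases of the ERRORS entries.

```c
#define _GNU_SOURCE             /* exposes unshare() and CLONE_* in <sched.h> */
#include <errno.h>
#include <sched.h>

/* Returns 0 on success, or the errno value reported by unshare(). */
int unshare_errno(int flags)
{
    return unshare(flags) == -1 ? errno : 0;
}

/* Maps the error codes listed in this specification to explanations. */
const char *unshare_strerror(int err)
{
    switch (err) {
    case 0:
        return "context unshared";
    case EPERM:
        return "CLONE_NEWNS requires CAP_SYS_ADMIN";
    case ENOMEM:
        return "not enough memory to copy the shared context";
    case EINVAL:
        return "invalid flag";
    default:
        return "error not listed in this specification";
    }
}
```

A caller could, for example, pass CLONE_FS | CLONE_FILES to
unshare_errno() and print unshare_strerror() of the result when it is
nonzero.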

6) High Level Design

Depending on the flags argument, the unshare() system call allocates
appropriate process context structures, populates them with values from
the current shared versions, associates the newly duplicated structures
with the current task structure and releases the corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare() for the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare() operates on the current active
     task. Therefore unshare() has to take the appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare() has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     the new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
the current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed. The unshare() system call,
on the other hand, performs the following:

  1) Check flags to force missing, but implied, flags.

  2) For each context structure, call the corresponding unshare()
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.
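
The ordering in steps 2 to 4 can be pictured with a small user-space
model. This is not kernel code. The types ctx and task, the flags M_FS
and M_FILES, and the fail_files parameter (which simulates an
allocation failure) are all invented for the illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted context structure. */
struct ctx {
    int refs;                   /* number of tasks sharing this context */
    int value;                  /* payload copied on duplication */
};

/* Hypothetical stand-in for the task structure. */
struct task {
    struct ctx *fs;
    struct ctx *files;
};

#define M_FS    0x1             /* illustrative bits, not the CLONE_* values */
#define M_FILES 0x2

static struct ctx *dup_ctx(const struct ctx *old, int fail)
{
    struct ctx *c = fail ? NULL : malloc(sizeof(*c));

    if (c != NULL) {
        c->refs = 1;
        c->value = old->value;
    }
    return c;
}

static void put_ctx(struct ctx *c)
{
    if (c != NULL && --c->refs == 0)
        free(c);
}

/* Returns 0 on success; on failure returns -1 and leaves t untouched. */
int model_unshare(struct task *t, int flags, int fail_files)
{
    struct ctx *new_fs = NULL, *new_files = NULL;
    struct ctx *old_fs = NULL, *old_files = NULL;

    /* Step 2: allocate and duplicate everything before touching t. */
    if (flags & M_FS) {
        new_fs = dup_ctx(t->fs, 0);
        if (new_fs == NULL)
            return -1;
    }
    if (flags & M_FILES) {
        new_files = dup_ctx(t->files, fail_files);
        if (new_files == NULL) {
            put_ctx(new_fs);    /* back out: only private copies are freed */
            return -1;
        }
    }

    /* Step 3: the kernel would hold task_lock() around these swaps. */
    if (new_fs != NULL) {
        old_fs = t->fs;
        t->fs = new_fs;
    }
    if (new_files != NULL) {
        old_files = t->files;
        t->files = new_files;
    }

    /* Step 4: drop this task's references to the older structures. */
    put_ctx(old_fs);
    put_ctx(old_files);
    return 0;
}
```

Because nothing is linked into the task until every duplication has
succeeded, the error path never needs to return to a shared structure
that may already be gone.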

7) Low Level Design

Implementation of unshare() can be grouped in the following 4 different
items:

  a) Reorganization of existing copy_* functions

  b) unshare() system call service function

  c) unshare() helper functions for each different process context

  d) Registration of system call number for different architectures

7.1) Reorganization of copy_* functions

Each copy function such as copy_mm, copy_namespace, copy_files,
etc, had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure. The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.
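
The split can be sketched in user space as follows. The structures and
the names dup_fs_model(), copy_fs_model() and MODEL_SHARE_FS are
simplified, hypothetical stand-ins. The kernel's structures and function
signatures differ.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for fs_struct and task_struct. */
struct fs_model {
    int refs;                   /* number of tasks using this structure */
    int umask;
    char cwd[64];
};

struct task_model {
    struct fs_model *fs;
};

#define MODEL_SHARE_FS 0x1      /* illustrative; plays the role of CLONE_FS */

/* First component: allocate and duplicate, with no task involved. */
struct fs_model *dup_fs_model(const struct fs_model *old)
{
    struct fs_model *fs = malloc(sizeof(*fs));

    if (fs != NULL) {
        memcpy(fs, old, sizeof(*fs));
        fs->refs = 1;           /* the copy starts out private */
    }
    return fs;
}

/* Second component: link a shared or duplicated structure to the child. */
int copy_fs_model(int flags, struct task_model *parent,
                  struct task_model *child)
{
    if (flags & MODEL_SHARE_FS) {
        parent->fs->refs++;     /* share: both tasks use one structure */
        child->fs = parent->fs;
        return 0;
    }
    child->fs = dup_fs_model(parent->fs);
    return child->fs != NULL ? 0 : -1;
}
```

Keeping the duplication step free of any task argument is what allows a
caller other than clone to reuse it on the current task.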

7.2) unshare() system call service function

       * Check flags
         Force implied flags. If CLONE_THREAD is set force CLONE_VM.
         If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
         set and signals are also being shared, force CLONE_THREAD. If
         CLONE_NEWNS is set, force CLONE_FS.

       * For each context flag, invoke the corresponding unshare_*
         helper routine with flags passed into the system call and a
         reference to a pointer to the new unshared structure.

       * If any new structures are created by unshare_* helper
         functions, take the task_lock() on the current task,
         modify appropriate context pointers, and release the
         task lock.

       * For all newly unshared structures, release the corresponding
         older, shared, structures.
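
The flag check in the first bullet can be written down directly. The
sketch below is a user-space rendering, not the kernel's function. The
name force_implied_flags() and the signals_shared parameter, which
stands in for the kernel's own test of whether signals are shared, are
assumptions of this example.

```c
#define _GNU_SOURCE             /* exposes the CLONE_* constants */
#include <sched.h>

/*
 * Return flags with every implied flag forced on. signals_shared is
 * nonzero when the caller's signals are shared with another task.
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;
    return flags;
}
```

The order of the checks matters: forcing CLONE_VM first lets the
CLONE_SIGHAND rule see it, so a request for CLONE_THREAD alone ends up
with all three flags set.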

7.3) unshare_* helper functions

For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD, return -EINVAL since they are not implemented yet.
For others, check the flag value to see if the unsharing is
required for that structure. If it is, invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.
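
The two kinds of helper might look roughly like this in a user-space
model. The struct fs_ctx type and the names dup_fs_ctx(),
unshare_sighand_model() and unshare_fs_model() are hypothetical. Only
the CLONE_* constants and error codes are real.

```c
#define _GNU_SOURCE             /* exposes the CLONE_* constants */
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

/* Hypothetical stand-in for a context structure. */
struct fs_ctx {
    int refs;
    int umask;
};

static struct fs_ctx *dup_fs_ctx(const struct fs_ctx *old)
{
    struct fs_ctx *fs = malloc(sizeof(*fs));

    if (fs != NULL) {
        *fs = *old;
        fs->refs = 1;
    }
    return fs;
}

/* Unsharing of signal handlers is not implemented: reject the request. */
int unshare_sighand_model(unsigned long flags)
{
    return (flags & CLONE_SIGHAND) ? -EINVAL : 0;
}

/*
 * If CLONE_FS is requested, duplicate cur and hand the copy back through
 * new_fsp. *new_fsp is left untouched when no unsharing is required.
 */
int unshare_fs_model(unsigned long flags, const struct fs_ctx *cur,
                     struct fs_ctx **new_fsp)
{
    if (!(flags & CLONE_FS))
        return 0;
    *new_fsp = dup_fs_ctx(cur);
    return *new_fsp != NULL ? 0 : -ENOMEM;
}
```

The caller initializes its pointer to NULL, so a non-NULL value after
all helpers have run marks a structure that must be linked into the
task.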

7.4) Finally

Appropriately modify architecture specific code to register the
new system call.

8) Test Specification

The test for unshare() should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of the filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four (namespace, filesystem, files and vm)
     supported unsharing, verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags. Verify that unsharing is performed as expected and
     that there are no oops or hangs.
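
One possible shape for a check under item 3, restricted to filesystem
information, is sketched below. It is not an LTP test. It relies on the
umask being part of the state shared by CLONE_FS, and the names
set_child_umask() and child_umask_visible() are hypothetical.

```c
#define _GNU_SOURCE             /* exposes clone(), unshare(), CLONE_FS */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

static int set_child_umask(void *arg)
{
    int do_unshare = *(int *)arg;

    if (do_unshare && unshare(CLONE_FS) == -1)
        return 2;               /* unshare() unavailable, e.g. in a sandbox */
    umask(077);                 /* private after unshare(), shared otherwise */
    return 0;
}

/*
 * Returns 1 if a umask change made by a CLONE_FS child is visible in the
 * parent, 0 if it is not, and -1 if the check could not be carried out.
 */
int child_umask_visible(int do_unshare)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    int status, result = -1;
    mode_t saved, seen;
    pid_t pid;

    if (stack == NULL)
        return -1;
    saved = umask(022);         /* establish a known starting value */
    pid = clone(set_child_umask, stack + stack_size,
                CLONE_FS | SIGCHLD, &do_unshare);
    if (pid != -1 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        seen = umask(saved);    /* read the current value and restore */
        result = (seen == 077);
    } else {
        umask(saved);
    }
    free(stack);
    return result;
}
```

A test would expect the change to be visible when the child does not
call unshare() and invisible when it does. Similar checks could be
written for the other supported flags.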

9) Future Work

The current implementation of unshare() does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare() without affecting legacy
applications using unshare().