   1.. SPDX-License-Identifier: GPL-2.0
   2
   3======================
   4The SGI XFS Filesystem
   5======================
   6
   7XFS is a high performance journaling filesystem which originated
   8on the SGI IRIX platform.  It is completely multi-threaded, can
   9support large files and large filesystems, extended attributes,
  10variable block sizes, is extent based, and makes extensive use of
  11Btrees (directories, extents, free space) to aid both performance
  12and scalability.
  13
  14Refer to the documentation at https://xfs.wiki.kernel.org/
  15for further details.  This implementation is on-disk compatible
  16with the IRIX version of XFS.
  17
  18
  19Mount Options
  20=============
  21
  22When mounting an XFS filesystem, the following options are accepted.
  23
  24  allocsize=size
  25        Sets the buffered I/O end-of-file preallocation size when
  26        doing delayed allocation writeout (default size is 64KiB).
  27        Valid values for this option are page size (typically 4KiB)
  28        through to 1GiB, inclusive, in power-of-2 increments.
  29
  30        The default behaviour is for dynamic end-of-file
  31        preallocation size, which uses a set of heuristics to
  32        optimise the preallocation size based on the current
  33        allocation patterns within the file and the access patterns
  34        to the file. Specifying a fixed ``allocsize`` value turns off
  35        the dynamic behaviour.
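
The ``allocsize`` constraint described above (a power of two between the page size and 1GiB, inclusive) can be checked before a value is passed to mount. The following is a minimal sketch only; the function name and the 4KiB default page size are assumptions made for illustration and are not part of XFS or its tools.

```python
ONE_GIB = 1 << 30


def is_valid_allocsize(size_bytes: int, page_size: int = 4096) -> bool:
    """Return True if size_bytes is a power of two in [page_size, 1 GiB].

    page_size defaults to 4096 (an assumption; query the real page size
    on the target machine if it matters).
    """
    is_power_of_two = size_bytes > 0 and (size_bytes & (size_bytes - 1)) == 0
    return is_power_of_two and page_size <= size_bytes <= ONE_GIB


if __name__ == "__main__":
    for candidate in (2048, 4096, 65536, 3 * 4096, ONE_GIB, 2 * ONE_GIB):
        print(candidate, is_valid_allocsize(candidate))
```

The check is deliberately stricter than a range test: a value such as 12288 lies inside the range but is rejected because it is not a power of two.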
  36
  37  attr2 or noattr2
   38        These options enable/disable an "opportunistic" improvement to
  39        be made in the way inline extended attributes are stored
  40        on-disk.  When the new form is used for the first time when
  41        ``attr2`` is selected (either when setting or removing extended
  42        attributes) the on-disk superblock feature bit field will be
  43        updated to reflect this format being in use.
  44
  45        The default behaviour is determined by the on-disk feature
  46        bit indicating that ``attr2`` behaviour is active. If either
  47        mount option is set, then that becomes the new default used
  48        by the filesystem.
  49
  50        CRC enabled filesystems always use the ``attr2`` format, and so
  51        will reject the ``noattr2`` mount option if it is set.
  52
  53  discard or nodiscard (default)
  54        Enable/disable the issuing of commands to let the block
  55        device reclaim space freed by the filesystem.  This is
  56        useful for SSD devices, thinly provisioned LUNs and virtual
  57        machine images, but may have a performance impact.
  58
  59        Note: It is currently recommended that you use the ``fstrim``
  60        application to ``discard`` unused blocks rather than the ``discard``
  61        mount option because the performance impact of this option
  62        is quite severe.
  63
  64  grpid/bsdgroups or nogrpid/sysvgroups (default)
  65        These options define what group ID a newly created file
  66        gets.  When ``grpid`` is set, it takes the group ID of the
  67        directory in which it is created; otherwise it takes the
  68        ``fsgid`` of the current process, unless the directory has the
  69        ``setgid`` bit set, in which case it takes the ``gid`` from the
  70        parent directory, and also gets the ``setgid`` bit set if it is
  71        a directory itself.
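
The group ID rules for ``grpid`` and ``nogrpid`` can be written out as a small decision function. This is an illustrative sketch, not kernel code: the function and parameter names are hypothetical, and it assumes that a new directory inherits the ``setgid`` bit from a ``setgid`` parent regardless of the ``grpid`` setting, which the text above states only for the ``nogrpid`` case.

```python
import stat
from typing import Tuple


def new_file_group(grpid: bool, parent_gid: int, parent_mode: int,
                   fsgid: int, is_directory: bool) -> Tuple[int, bool]:
    """Return (gid, inherits_setgid) for a newly created file.

    grpid        -- True if mounted with grpid/bsdgroups
    parent_gid   -- group ID of the directory the file is created in
    parent_mode  -- mode bits of that directory
    fsgid        -- fsgid of the creating process
    is_directory -- True if the new object is itself a directory
    """
    parent_setgid = bool(parent_mode & stat.S_ISGID)
    if grpid or parent_setgid:
        gid = parent_gid   # take the group of the parent directory
    else:
        gid = fsgid        # take the fsgid of the current process
    # Assumption: setgid propagates to new directories whenever the
    # parent directory has it set.
    inherits_setgid = parent_setgid and is_directory
    return gid, inherits_setgid
```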
  72
  73  filestreams
  74        Make the data allocator use the filestreams allocation mode
  75        across the entire filesystem rather than just on directories
  76        configured to use it.
  77
  78  ikeep or noikeep (default)
  79        When ``ikeep`` is specified, XFS does not delete empty inode
  80        clusters and keeps them around on disk.  When ``noikeep`` is
  81        specified, empty inode clusters are returned to the free
  82        space pool.
  83
  84  inode32 or inode64 (default)
  85        When ``inode32`` is specified, it indicates that XFS limits
  86        inode creation to locations which will not result in inode
  87        numbers with more than 32 bits of significance.
  88
  89        When ``inode64`` is specified, it indicates that XFS is allowed
  90        to create inodes at any location in the filesystem,
  91        including those which will result in inode numbers occupying
  92        more than 32 bits of significance.
  93
  94        ``inode32`` is provided for backwards compatibility with older
   95        systems and applications, since 64-bit inode numbers might
  96        cause problems for some applications that cannot handle
  97        large inode numbers.  If applications are in use which do
  98        not handle inode numbers bigger than 32 bits, the ``inode32``
  99        option should be specified.
 100
 101  largeio or nolargeio (default)
 102        If ``nolargeio`` is specified, the optimal I/O reported in
 103        ``st_blksize`` by **stat(2)** will be as small as possible to allow
 104        user applications to avoid inefficient read/modify/write
 105        I/O.  This is typically the page size of the machine, as
 106        this is the granularity of the page cache.
 107
 108        If ``largeio`` is specified, a filesystem that was created with a
 109        ``swidth`` specified will return the ``swidth`` value (in bytes)
 110        in ``st_blksize``. If the filesystem does not have a ``swidth``
 111        specified but does specify an ``allocsize`` then ``allocsize``
 112        (in bytes) will be returned instead. Otherwise the behaviour
 113        is the same as if ``nolargeio`` was specified.
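
The order in which ``largeio`` picks the value reported in ``st_blksize`` can be summarised as a short fallback chain. The sketch below is illustrative only; the function name is hypothetical, a value of 0 is used to mean "not specified", and the page size is assumed to be the ``nolargeio`` result (the text says this is typical, not guaranteed).

```python
def reported_st_blksize(largeio: bool, page_size: int = 4096,
                        swidth_bytes: int = 0,
                        allocsize_bytes: int = 0) -> int:
    """Return the st_blksize a stat(2) caller would be expected to see.

    swidth_bytes and allocsize_bytes are 0 when not specified.
    """
    if largeio:
        if swidth_bytes:
            return swidth_bytes      # stripe width wins when present
        if allocsize_bytes:
            return allocsize_bytes   # otherwise fall back to allocsize
    # nolargeio, or largeio with neither swidth nor allocsize
    return page_size
```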
 114
 115  logbufs=value
 116        Set the number of in-memory log buffers.  Valid numbers
 117        range from 2-8 inclusive.
 118
 119        The default value is 8 buffers.
 120
 121        If the memory cost of 8 log buffers is too high on small
 122        systems, then it may be reduced at some cost to performance
 123        on metadata intensive workloads. The ``logbsize`` option below
 124        controls the size of each buffer and so is also relevant to
 125        this case.
 126
 127  logbsize=value
 128        Set the size of each in-memory log buffer.  The size may be
 129        specified in bytes, or in kilobytes with a "k" suffix.
 130        Valid sizes for version 1 and version 2 logs are 16384 (16k)
 131        and 32768 (32k).  Valid sizes for version 2 logs also
 132        include 65536 (64k), 131072 (128k) and 262144 (256k). The
 133        logbsize must be an integer multiple of the log
 134        stripe unit configured at **mkfs(8)** time.
 135
 136        The default value for version 1 logs is 32768, while the
 137        default value for version 2 logs is MAX(32768, log_sunit).
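
The ``logbsize`` rules above can be collected into a small validator. This is a sketch under stated assumptions: the helper names are hypothetical, the log stripe unit is assumed to be given in bytes, and a stripe unit of 0 is taken to mean "none configured".

```python
V1_SIZES = (16384, 32768)
V2_SIZES = V1_SIZES + (65536, 131072, 262144)


def parse_logbsize(value: str) -> int:
    """Parse a size given in bytes, or in kilobytes with a 'k' suffix."""
    value = value.strip().lower()
    if value.endswith("k"):
        return int(value[:-1]) * 1024
    return int(value)


def is_valid_logbsize(size: int, log_version: int, log_sunit: int = 0) -> bool:
    """Check size against the list for the log version and the stripe unit."""
    allowed = V2_SIZES if log_version == 2 else V1_SIZES
    if size not in allowed:
        return False
    # Must be an integer multiple of the log stripe unit, if one is set.
    return log_sunit == 0 or size % log_sunit == 0


def default_logbsize(log_version: int, log_sunit: int = 0) -> int:
    """Default: 32768 for version 1 logs, MAX(32768, log_sunit) for version 2."""
    return max(32768, log_sunit) if log_version == 2 else 32768
```

For example, ``is_valid_logbsize(parse_logbsize("64k"), log_version=1)`` is rejected because 64k is only listed for version 2 logs.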
 138
 139  logdev=device and rtdev=device
 140        Use an external log (metadata journal) and/or real-time device.
 141        An XFS filesystem has up to three parts: a data section, a log
 142        section, and a real-time section.  The real-time section is
 143        optional, and the log section can be separate from the data
 144        section or contained within it.
 145
 146  noalign
 147        Data allocations will not be aligned at stripe unit
 148        boundaries. This is only relevant to filesystems created
 149        with non-zero data alignment parameters (``sunit``, ``swidth``) by
 150        **mkfs(8)**.
 151
 152  norecovery
 153        The filesystem will be mounted without running log recovery.
 154        If the filesystem was not cleanly unmounted, it is likely to
 155        be inconsistent when mounted in ``norecovery`` mode.
 156        Some files or directories may not be accessible because of this.
 157        Filesystems mounted ``norecovery`` must be mounted read-only or
 158        the mount will fail.
 159
 160  nouuid
 161        Don't check for double mounted file systems using the file
 162        system ``uuid``.  This is useful to mount LVM snapshot volumes,
 163        and often used in combination with ``norecovery`` for mounting
 164        read-only snapshots.
 165
 166  noquota
 167        Forcibly turns off all quota accounting and enforcement
 168        within the filesystem.
 169
 170  uquota/usrquota/uqnoenforce/quota
 171        User disk quota accounting enabled, and limits (optionally)
 172        enforced.  Refer to **xfs_quota(8)** for further details.
 173
 174  gquota/grpquota/gqnoenforce
 175        Group disk quota accounting enabled and limits (optionally)
 176        enforced.  Refer to **xfs_quota(8)** for further details.
 177
 178  pquota/prjquota/pqnoenforce
 179        Project disk quota accounting enabled and limits (optionally)
 180        enforced.  Refer to **xfs_quota(8)** for further details.
 181
 182  sunit=value and swidth=value
 183        Used to specify the stripe unit and width for a RAID device
 184        or a stripe volume.  "value" must be specified in 512-byte
 185        block units. These options are only relevant to filesystems
 186        that were created with non-zero data alignment parameters.
 187
 188        The ``sunit`` and ``swidth`` parameters specified must be compatible
 189        with the existing filesystem alignment characteristics.  In
 190        general, that means the only valid changes to ``sunit`` are
 191        increasing it by a power-of-2 multiple. Valid ``swidth`` values
 192        are any integer multiple of a valid ``sunit`` value.
 193
  194        Typically the only time these mount options are necessary is
  195        after an underlying RAID device has had its geometry
  196        modified, such as adding a new disk to a RAID5 LUN and
  197        reshaping it.
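
Because ``sunit`` and ``swidth`` are given in 512-byte units and must stay compatible with the existing geometry, it can help to check candidate values first. The following is a rough sketch of the general rule quoted above; the function names are hypothetical, and the kernel may apply further checks that are not modelled here.

```python
SECTOR_BYTES = 512


def bytes_to_sectors(size_bytes: int) -> int:
    """Convert a byte count to 512-byte units, as the mount options expect."""
    if size_bytes % SECTOR_BYTES:
        raise ValueError("size is not a multiple of 512 bytes")
    return size_bytes // SECTOR_BYTES


def is_valid_sunit_change(old_sunit: int, new_sunit: int) -> bool:
    """True if new_sunit is old_sunit times a power of two (1, 2, 4, ...)."""
    if old_sunit <= 0 or new_sunit < old_sunit or new_sunit % old_sunit:
        return False
    factor = new_sunit // old_sunit
    return (factor & (factor - 1)) == 0


def is_valid_swidth(sunit: int, swidth: int) -> bool:
    """True if swidth is a positive integer multiple of sunit."""
    return sunit > 0 and swidth > 0 and swidth % sunit == 0
```

As a usage example, a 64KiB stripe unit is ``bytes_to_sectors(65536)``, i.e. ``sunit=128``; doubling it to 256 passes the check, while 384 does not.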
 198
 199  swalloc
 200        Data allocations will be rounded up to stripe width boundaries
 201        when the current end of file is being extended and the file
 202        size is larger than the stripe width size.
 203
 204  wsync
 205        When specified, all filesystem namespace operations are
 206        executed synchronously. This ensures that when the namespace
 207        operation (create, unlink, etc) completes, the change to the
 208        namespace is on stable storage. This is useful in HA setups
 209        where failover must not result in clients seeing
 210        inconsistent namespace presentation during or after a
 211        failover event.
 212
 213Deprecation of V4 Format
 214========================
 215
 216The V4 filesystem format lacks certain features that are supported by
 217the V5 format, such as metadata checksumming, strengthened metadata
 218verification, and the ability to store timestamps past the year 2038.
 219Because of this, the V4 format is deprecated.  All users should upgrade
 220by backing up their files, reformatting, and restoring from the backup.
 221
  222Administrators and users can detect a V4 filesystem by running xfs_info
  223against a filesystem mountpoint and checking for the string "crc=0";
  224"crc=1" indicates V5.  If no "crc=" string is found, please upgrade
  225xfsprogs to the latest version and try again.
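
The "crc=" check can be scripted against captured xfs_info output. This is a minimal sketch: the function name is hypothetical, it assumes that "crc=0" marks a V4 filesystem and any other value marks V5, and the sample text in the usage section is abbreviated for illustration rather than being verbatim tool output.

```python
import re
from typing import Optional


def detect_xfs_format(xfs_info_output: str) -> Optional[str]:
    """Return "V4", "V5", or None if no "crc=" string is present.

    None suggests the xfsprogs in use is too old to report the field.
    """
    match = re.search(r"\bcrc=(\d+)", xfs_info_output)
    if match is None:
        return None
    # Assumption: crc=0 means V4, anything else means V5.
    return "V4" if match.group(1) == "0" else "V5"


if __name__ == "__main__":
    sample = "meta-data=/dev/sda1 isize=512 agcount=4\n         = crc=1 finobt=1\n"
    print(detect_xfs_format(sample))
```

In practice the input string would come from running xfs_info against the mountpoint, for instance through ``subprocess.run(..., capture_output=True, text=True)``.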
 226
 227The deprecation will take place in two parts.  Support for mounting V4
 228filesystems can now be disabled at kernel build time via Kconfig option.
 229The option will default to yes until September 2025, at which time it
 230will be changed to default to no.  In September 2030, support will be
 231removed from the codebase entirely.
 232
 233Note: Distributors may choose to withdraw V4 format support earlier than
 234the dates listed above.
 235
 236Deprecated Mount Options
 237========================
 238
 239===========================     ================
 240  Name                          Removal Schedule
 241===========================     ================
 242Mounting with V4 filesystem     September 2030
 243ikeep/noikeep                   September 2025
 244attr2/noattr2                   September 2025
 245===========================     ================
 246
 247
 248Removed Mount Options
 249=====================
 250
 251===========================     =======
 252  Name                          Removed
 253===========================     =======
 254  delaylog/nodelaylog           v4.0
 255  ihashsize                     v4.0
 256  irixsgid                      v4.0
 257  osyncisdsync/osyncisosync     v4.0
 258  barrier                       v4.19
 259  nobarrier                     v4.19
 260===========================     =======
 261
 262sysctls
 263=======
 264
 265The following sysctls are available for the XFS filesystem:
 266
 267  fs.xfs.stats_clear            (Min: 0  Default: 0  Max: 1)
 268        Setting this to "1" clears accumulated XFS statistics
 269        in /proc/fs/xfs/stat.  It then immediately resets to "0".
 270
 271  fs.xfs.xfssyncd_centisecs     (Min: 100  Default: 3000  Max: 720000)
 272        The interval at which the filesystem flushes metadata
 273        out to disk and runs internal cache cleanup routines.
 274
 275  fs.xfs.filestream_centisecs   (Min: 1  Default: 3000  Max: 360000)
 276        The interval at which the filesystem ages filestreams cache
 277        references and returns timed-out AGs back to the free stream
 278        pool.
 279
 280  fs.xfs.speculative_prealloc_lifetime
 281        (Units: seconds   Min: 1  Default: 300  Max: 86400)
 282        The interval at which the background scanning for inodes
 283        with unused speculative preallocation runs. The scan
 284        removes unused preallocation from clean inodes and releases
 285        the unused space back to the free pool.
 286
 287  fs.xfs.speculative_cow_prealloc_lifetime
 288        This is an alias for speculative_prealloc_lifetime.
 289
 290  fs.xfs.error_level            (Min: 0  Default: 3  Max: 11)
 291        A volume knob for error reporting when internal errors occur.
 292        This will generate detailed messages & backtraces for filesystem
 293        shutdowns, for example.  Current threshold values are:
 294
 295                XFS_ERRLEVEL_OFF:       0
 296                XFS_ERRLEVEL_LOW:       1
 297                XFS_ERRLEVEL_HIGH:      5
 298
 299  fs.xfs.panic_mask             (Min: 0  Default: 0  Max: 256)
 300        Causes certain error conditions to call BUG(). Value is a bitmask;
 301        OR together the tags which represent errors which should cause panics:
 302
 303                XFS_NO_PTAG                     0
 304                XFS_PTAG_IFLUSH                 0x00000001
 305                XFS_PTAG_LOGRES                 0x00000002
 306                XFS_PTAG_AILDELETE              0x00000004
 307                XFS_PTAG_ERROR_REPORT           0x00000008
 308                XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
 309                XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
 310                XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
 311                XFS_PTAG_FSBLOCK_ZERO           0x00000080
 312                XFS_PTAG_VERIFIER_ERROR         0x00000100
 313
 314        This option is intended for debugging only.
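
Since ``panic_mask`` is a bitmask, the tags listed above can be ORed together or decoded with a small helper. This is a sketch for illustration; the helper names are hypothetical, and only the tag names and values are taken from the table above.

```python
from typing import Iterable, List

PTAGS = {
    0x001: "XFS_PTAG_IFLUSH",
    0x002: "XFS_PTAG_LOGRES",
    0x004: "XFS_PTAG_AILDELETE",
    0x008: "XFS_PTAG_ERROR_REPORT",
    0x010: "XFS_PTAG_SHUTDOWN_CORRUPT",
    0x020: "XFS_PTAG_SHUTDOWN_IOERROR",
    0x040: "XFS_PTAG_SHUTDOWN_LOGERROR",
    0x080: "XFS_PTAG_FSBLOCK_ZERO",
    0x100: "XFS_PTAG_VERIFIER_ERROR",
}


def encode_panic_mask(names: Iterable[str]) -> int:
    """OR together the bit values of the named tags."""
    by_name = {name: bit for bit, name in PTAGS.items()}
    mask = 0
    for name in names:
        mask |= by_name[name]   # raises KeyError for an unknown tag
    return mask


def decode_panic_mask(mask: int) -> List[str]:
    """List the tags set in mask, in ascending bit order."""
    return [name for bit, name in sorted(PTAGS.items()) if mask & bit]
```

For example, selecting ``XFS_PTAG_SHUTDOWN_CORRUPT`` and ``XFS_PTAG_SHUTDOWN_IOERROR`` encodes to 0x30 (decimal 48), which is the value that would be written to the sysctl.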
 315
 316  fs.xfs.irix_symlink_mode      (Min: 0  Default: 0  Max: 1)
 317        Controls whether symlinks are created with mode 0777 (default)
 318        or whether their mode is affected by the umask (irix mode).
 319
 320  fs.xfs.irix_sgid_inherit      (Min: 0  Default: 0  Max: 1)
 321        Controls files created in SGID directories.
 322        If the group ID of the new file does not match the effective group
 323        ID or one of the supplementary group IDs of the parent dir, the
 324        ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
 325        is set.
 326
 327  fs.xfs.inherit_sync           (Min: 0  Default: 1  Max: 1)
 328        Setting this to "1" will cause the "sync" flag set
 329        by the **xfs_io(8)** chattr command on a directory to be
 330        inherited by files in that directory.
 331
 332  fs.xfs.inherit_nodump         (Min: 0  Default: 1  Max: 1)
 333        Setting this to "1" will cause the "nodump" flag set
 334        by the **xfs_io(8)** chattr command on a directory to be
 335        inherited by files in that directory.
 336
 337  fs.xfs.inherit_noatime        (Min: 0  Default: 1  Max: 1)
 338        Setting this to "1" will cause the "noatime" flag set
 339        by the **xfs_io(8)** chattr command on a directory to be
 340        inherited by files in that directory.
 341
 342  fs.xfs.inherit_nosymlinks     (Min: 0  Default: 1  Max: 1)
 343        Setting this to "1" will cause the "nosymlinks" flag set
 344        by the **xfs_io(8)** chattr command on a directory to be
 345        inherited by files in that directory.
 346
 347  fs.xfs.inherit_nodefrag       (Min: 0  Default: 1  Max: 1)
 348        Setting this to "1" will cause the "nodefrag" flag set
 349        by the **xfs_io(8)** chattr command on a directory to be
 350        inherited by files in that directory.
 351
 352  fs.xfs.rotorstep              (Min: 1  Default: 1  Max: 256)
 353        In "inode32" allocation mode, this option determines how many
 354        files the allocator attempts to allocate in the same allocation
 355        group before moving to the next allocation group.  The intent
 356        is to control the rate at which the allocator moves between
 357        allocation groups when allocating extents for new files.
 358
 359Deprecated Sysctls
 360==================
 361
 362===========================================     ================
 363  Name                                          Removal Schedule
 364===========================================     ================
 365fs.xfs.irix_sgid_inherit                        September 2025
 366fs.xfs.irix_symlink_mode                        September 2025
 367fs.xfs.speculative_cow_prealloc_lifetime        September 2025
 368===========================================     ================
 369
 370
 371Removed Sysctls
 372===============
 373
 374=============================   =======
 375  Name                          Removed
 376=============================   =======
 377  fs.xfs.xfsbufd_centisec       v4.0
 378  fs.xfs.age_buffer_centisecs   v4.0
 379=============================   =======
 380
 381Error handling
 382==============
 383
 384XFS can act differently according to the type of error found during its
 385operation. The implementation introduces the following concepts to the error
 386handler:
 387
 388 -failure speed:
 389        Defines how fast XFS should propagate an error upwards when a specific
 390        error is found during the filesystem operation. It can propagate
 391        immediately, after a defined number of retries, after a set time period,
 392        or simply retry forever.
 393
 394 -error classes:
 395        Specifies the subsystem the error configuration will apply to, such as
 396        metadata IO or memory allocation. Different subsystems will have
 397        different error handlers for which behaviour can be configured.
 398
 399 -error handlers:
 400        Defines the behavior for a specific error.
 401
 402The filesystem behavior during an error can be set via ``sysfs`` files. Each
 403error handler works independently - the first condition met by an error handler
 404for a specific class will cause the error to be propagated rather than reset and
 405retried.
 406
 407The action taken by the filesystem when the error is propagated is context
 408dependent - it may cause a shut down in the case of an unrecoverable error,
 409it may be reported back to userspace, or it may even be ignored because
  410there's nothing useful we can do with the error or anyone we can report it to (e.g.
 411during unmount).
 412
 413The configuration files are organized into the following hierarchy for each
 414mounted filesystem:
 415
 416  /sys/fs/xfs/<dev>/error/<class>/<error>/
 417
 418Where:
 419  <dev>
 420        The short device name of the mounted filesystem. This is the same device
 421        name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
 422
 423  <class>
 424        The subsystem the error configuration belongs to. As of 4.9, the defined
 425        classes are:
 426
  427                - "metadata": applies to metadata buffer write IO
 428
 429  <error>
 430        The individual error handler configurations.
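
The hierarchy above maps directly onto path construction. The sketch below only builds path strings and does not touch ``sysfs``; the function names and the device name ``sda1`` are hypothetical, while the "metadata" class and the "ENODEV" and "default" handler names are the ones this document mentions.

```python
from pathlib import PurePosixPath

XFS_SYSFS_ROOT = PurePosixPath("/sys/fs/xfs")


def error_root(dev: str) -> PurePosixPath:
    """Directory holding the per-filesystem "global" error options."""
    return XFS_SYSFS_ROOT / dev / "error"


def error_handler_dir(dev: str, error_class: str, error: str) -> PurePosixPath:
    """Directory holding the configuration of one error handler."""
    return error_root(dev) / error_class / error


if __name__ == "__main__":
    # "sda1" is a made-up short device name used for illustration.
    print(error_root("sda1"))
    print(error_handler_dir("sda1", "metadata", "ENODEV"))
    print(error_handler_dir("sda1", "metadata", "default"))
```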
 431
 432
 433Each filesystem has "global" error configuration options defined in their top
 434level directory:
 435
 436  /sys/fs/xfs/<dev>/error/
 437
 438  fail_at_unmount               (Min:  0  Default:  1  Max: 1)
 439        Defines the filesystem error behavior at unmount time.
 440
 441        If set to a value of 1, XFS will override all other error configurations
 442        during unmount and replace them with "immediate fail" characteristics.
 443        i.e. no retries, no retry timeout. This will always allow unmount to
 444        succeed when there are persistent errors present.
 445
 446        If set to 0, the configured retry behaviour will continue until all
 447        retries and/or timeouts have been exhausted. This will delay unmount
 448        completion when there are persistent errors, and it may prevent the
 449        filesystem from ever unmounting fully in the case of "retry forever"
 450        handler configurations.
 451
 452        Note: there is no guarantee that fail_at_unmount can be set while an
 453        unmount is in progress. It is possible that the ``sysfs`` entries are
 454        removed by the unmounting filesystem before a "retry forever" error
 455        handler configuration causes unmount to hang, and hence the filesystem
 456        must be configured appropriately before unmount begins to prevent
 457        unmount hangs.
 458
 459Each filesystem has specific error class handlers that define the error
 460propagation behaviour for specific errors. There is also a "default" error
 461handler defined, which defines the behaviour for all errors that don't have
 462specific handlers defined. Where multiple retry constraints are configured for
 463a single error, the first retry configuration that expires will cause the error
 464to be propagated. The handler configurations are found in the directory:
 465
 466  /sys/fs/xfs/<dev>/error/<class>/<error>/
 467
 468  max_retries                   (Min: -1  Default: Varies  Max: INTMAX)
 469        Defines the allowed number of retries of a specific error before
 470        the filesystem will propagate the error. The retry count for a given
 471        error context (e.g. a specific metadata buffer) is reset every time
 472        there is a successful completion of the operation.
 473
 474        Setting the value to "-1" will cause XFS to retry forever for this
 475        specific error.
 476
 477        Setting the value to "0" will cause XFS to fail immediately when the
 478        specific error is reported.
 479
 480        Setting the value to "N" (where 0 < N < Max) will make XFS retry the
 481        operation "N" times before propagating the error.
 482
 483  retry_timeout_seconds         (Min:  -1  Default:  Varies  Max: 1 day)
  484        Defines the amount of time (in seconds) that the filesystem is
 485        allowed to retry its operations when the specific error is
 486        found.
 487
 488        Setting the value to "-1" will allow XFS to retry forever for this
 489        specific error.
 490
 491        Setting the value to "0" will cause XFS to fail immediately when the
 492        specific error is reported.
 493
 494        Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
 495        operation for up to "N" seconds before propagating the error.
 496
 497**Note:** The default behaviour for a specific error handler is dependent on both
 498the class and error context. For example, the default values for
 499"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
 500to "fail immediately" behaviour. This is done because ENODEV is a fatal,
 501unrecoverable error no matter how many times the metadata IO is retried.
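
The interaction between ``max_retries`` and ``retry_timeout_seconds`` ("the first retry configuration that expires will cause the error to be propagated") can be modelled as a simple predicate. This is a simplified sketch, not the kernel's implementation: the function name, its parameters, and the exact boundary comparisons are assumptions chosen so that "0" means fail immediately and "-1" means retry forever.

```python
RETRY_FOREVER = -1


def should_propagate(retries_done: int, seconds_elapsed: float,
                     max_retries: int, retry_timeout_seconds: int) -> bool:
    """Return True if the error should be propagated instead of retried.

    retries_done    -- retries already attempted for this error context
    seconds_elapsed -- time since the first failure in this context
    """
    retries_expired = (max_retries != RETRY_FOREVER
                       and retries_done >= max_retries)
    timeout_expired = (retry_timeout_seconds != RETRY_FOREVER
                       and seconds_elapsed >= retry_timeout_seconds)
    # Whichever constraint expires first triggers propagation.
    return retries_expired or timeout_expired
```

With both values at -1 the predicate never returns True, which corresponds to the "retry forever" configuration that ``fail_at_unmount`` exists to override.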
 502
 503Workqueue Concurrency
 504=====================
 505
 506XFS uses kernel workqueues to parallelize metadata update processes.  This
 507enables it to take advantage of storage hardware that can service many IO
 508operations simultaneously.  This interface exposes internal implementation
 509details of XFS, and as such is explicitly not part of any userspace API/ABI
 510guarantee the kernel may give userspace.  These are undocumented features of
 511the generic workqueue implementation XFS uses for concurrency, and they are
 512provided here purely for diagnostic and tuning purposes and may change at any
 513time in the future.
 514
 515The control knobs for a filesystem's workqueues are organized by task at hand
 516and the short name of the data device.  They all can be found in:
 517
 518  /sys/bus/workqueue/devices/${task}!${device}
 519
 520================  ===========
 521  Task            Description
 522================  ===========
 523  xfs_iwalk-$pid  Inode scans of the entire filesystem. Currently limited to
 524                  mount time quotacheck.
  525  xfs-gc          Background garbage collection of disk space that has been
 526                  speculatively allocated beyond EOF or for staging copy on
 527                  write operations.
 528================  ===========
 529
 530For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
 531found in /sys/bus/workqueue/devices/xfs_iwalk-1111!nvme0n1/.
 532
 533The interesting knobs for XFS workqueues are as follows:
 534
 535============     ===========
 536  Knob           Description
 537============     ===========
 538  max_active     Maximum number of background threads that can be started to
 539                 run the work.
 540  cpumask        CPUs upon which the threads are allowed to run.
 541  nice           Relative priority of scheduling the threads.  These are the
 542                 same nice levels that can be applied to userspace processes.
 543============     ===========
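
The workqueue directory naming described in this section can be reproduced with a small helper. This sketch only formats the path; the function name is hypothetical, and the task and device values in the usage lines are the example ones used above.

```python
from typing import Optional

KNOBS = ("max_active", "cpumask", "nice")


def workqueue_path(task: str, device: str, knob: Optional[str] = None) -> str:
    """Build /sys/bus/workqueue/devices/${task}!${device}[/knob]."""
    path = f"/sys/bus/workqueue/devices/{task}!{device}"
    if knob is None:
        return path
    if knob not in KNOBS:
        raise ValueError(f"unknown knob: {knob}")
    return f"{path}/{knob}"


if __name__ == "__main__":
    print(workqueue_path("xfs_iwalk-1111", "nvme0n1"))
    print(workqueue_path("xfs_iwalk-1111", "nvme0n1", "max_active"))
```

Note that the "!" in the directory name is special to many interactive shells, so the path usually needs quoting when used on a command line.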
 544