linux/Documentation/filesystems/logfs.txt
<<
>>
Prefs
   1
   2The LogFS Flash Filesystem
   3==========================
   4
   5Specification
   6=============
   7
   8Superblocks
   9-----------
  10
  11Two superblocks exist at the beginning and end of the filesystem.
  12Each superblock is 256 Bytes large, with another 3840 Bytes reserved
  13for future purposes, making a total of 4096 Bytes.
  14
  15Superblock locations may differ for MTD and block devices.  On MTD the
  16first non-bad block contains a superblock in the first 4096 Bytes and
  17the last non-bad block contains a superblock in the last 4096 Bytes.
  18On block devices, the first 4096 Bytes of the device contain the first
  19superblock and the last aligned 4096 Byte-block contains the second
  20superblock.
  21
  22For the most part, the superblocks can be considered read-only.  They
  23are written only to correct errors detected within the superblocks,
  24move the journal and change the filesystem parameters through tunefs.
  25As a result, the superblock does not contain any fields that require
  26constant updates, like the amount of free space, etc.
  27
  28Segments
  29--------
  30
  31The space in the device is split up into equal-sized segments.
  32Segments are the primary write unit of LogFS.  Within each segments,
  33writes happen from front (low addresses) to back (high addresses.  If
  34only a partial segment has been written, the segment number, the
  35current position within and optionally a write buffer are stored in
  36the journal.
  37
  38Segments are erased as a whole.  Therefore Garbage Collection may be
  39required to completely free a segment before doing so.
  40
  41Journal
  42--------
  43
  44The journal contains all global information about the filesystem that
  45is subject to frequent change.  At mount time, it has to be scanned
  46for the most recent commit entry, which contains a list of pointers to
  47all currently valid entries.
  48
  49Object Store
  50------------
  51
  52All space except for the superblocks and journal is part of the object
  53store.  Each segment contains a segment header and a number of
  54objects, each consisting of the object header and the payload.
  55Objects are either inodes, directory entries (dentries), file data
  56blocks or indirect blocks.
  57
  58Levels
  59------
  60
  61Garbage collection (GC) may fail if all data is written
  62indiscriminately.  One requirement of GC is that data is separated
  63roughly according to the distance between the tree root and the data.
  64Effectively that means all file data is on level 0, indirect blocks
  65are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
  66respectively.  Inode file data is on level 6 for the inodes and 7-11
  67for indirect blocks.
  68
  69Each segment contains objects of a single level only.  As a result,
  70each level requires its own separate segment to be open for writing.
  71
  72Inode File
  73----------
  74
  75All inodes are stored in a special file, the inode file.  Single
  76exception is the inode file's inode (master inode) which for obvious
  77reasons is stored in the journal instead.  Instead of data blocks, the
  78leaf nodes of the inode files are inodes.
  79
  80Aliases
  81-------
  82
  83Writes in LogFS are done by means of a wandering tree.  A naïve
  84implementation would require that for each write or a block, all
  85parent blocks are written as well, since the block pointers have
  86changed.  Such an implementation would not be very efficient.
  87
  88In LogFS, the block pointer changes are cached in the journal by means
  89of alias entries.  Each alias consists of its logical address - inode
  90number, block index, level and child number (index into block) - and
  91the changed data.  Any 8-byte word can be changes in this manner.
  92
  93Currently aliases are used for block pointers, file size, file used
  94bytes and the height of an inodes indirect tree.
  95
  96Segment Aliases
  97---------------
  98
  99Related to regular aliases, these are used to handle bad blocks.
 100Initially, bad blocks are handled by moving the affected segment
 101content to a spare segment and noting this move in the journal with a
 102segment alias, a simple (to, from) tupel.  GC will later empty this
 103segment and the alias can be removed again.  This is used on MTD only.
 104
 105Vim
 106---
 107
 108By cleverly predicting the life time of data, it is possible to
 109separate long-living data from short-living data and thereby reduce
 110the GC overhead later.  Each type of distinc life expectency (vim) can
 111have a separate segment open for writing.  Each (level, vim) tupel can
 112be open just once.  If an open segment with unknown vim is encountered
 113at mount time, it is closed and ignored henceforth.
 114
 115Indirect Tree
 116-------------
 117
 118Inodes in LogFS are similar to FFS-style filesystems with direct and
 119indirect block pointers.  One difference is that LogFS uses a single
 120indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
 121A height field in the inode defines the height of the indirect tree
 122and thereby the indirection of the pointer.
 123
 124Another difference is the addressing of indirect blocks.  In LogFS,
 125the first 16 pointers in the first indirect block are left empty,
 126corresponding to the 16 direct pointers in the inode.  In ext2 (maybe
 127others as well) the first pointer in the first indirect block
 128corresponds to logical block 12, skipping the 12 direct pointers.
 129So where ext2 is using arithmetic to better utilize space, LogFS keeps
 130arithmetic simple and uses compression to save space.
 131
 132Compression
 133-----------
 134
 135Both file data and metadata can be compressed.  Compression for file
 136data can be enabled with chattr +c and disabled with chattr -c.  Doing
 137so has no effect on existing data, but new data will be stored
 138accordingly.  New inodes will inherit the compression flag of the
 139parent directory.
 140
 141Metadata is always compressed.  However, the space accounting ignores
 142this and charges for the uncompressed size.  Failing to do so could
 143result in GC failures when, after moving some data, indirect blocks
 144compress worse than previously.  Even on a 100% full medium, GC may
 145not consume any extra space, so the compression gains are lost space
 146to the user.
 147
 148However, they are not lost space to the filesystem internals.  By
 149cheating the user for those bytes, the filesystem gained some slack
 150space and GC will run less often and faster.
 151
 152Garbage Collection and Wear Leveling
 153------------------------------------
 154
 155Garbage collection is invoked whenever the number of free segments
 156falls below a threshold.  The best (known) candidate is picked based
 157on the least amount of valid data contained in the segment.  All
 158remaining valid data is copied elsewhere, thereby invalidating it.
 159
 160The GC code also checks for aliases and writes then back if their
 161number gets too large.
 162
 163Wear leveling is done by occasionally picking a suboptimal segment for
 164garbage collection.  If a stale segments erase count is significantly
 165lower than the active segments' erase counts, it will be picked.  Wear
 166leveling is rate limited, so it will never monopolize the device for
 167more than one segment worth at a time.
 168
 169Values for "occasionally", "significantly lower" are compile time
 170constants.
 171
 172Hashed directories
 173------------------
 174
 175To satisfy efficient lookup(), directory entries are hashed and
 176located based on the hash.  In order to both support large directories
 177and not be overly inefficient for small directories, several hash
 178tables of increasing size are used.  For each table, the hash value
 179modulo the table size gives the table index.
 180
 181Tables sizes are chosen to limit the number of indirect blocks with a
 182fully populated table to 0, 1, 2 or 3 respectively.  So the first
 183table contains 16 entries, the second 512-16, etc.
 184
 185The last table is special in several ways.  First its size depends on
 186the effective 32bit limit on telldir/seekdir cookies.  Since logfs
 187uses the upper half of the address space for indirect blocks, the size
 188is limited to 2^31.  Secondly the table contains hash buckets with 16
 189entries each.
 190
 191Using single-entry buckets would result in birthday "attacks".  At
 192just 2^16 used entries, hash collisions would be likely (P >= 0.5).
 193My math skills are insufficient to do the combinatorics for the 17x
 194collisions necessary to overflow a bucket, but testing showed that in
 19510,000 runs the lowest directory fill before a bucket overflow was
 196188,057,130 entries with an average of 315,149,915 entries.  So for
 197directory sizes of up to a million, bucket overflows should be
 198virtually impossible under normal circumstances.
 199
 200With carefully chosen filenames, it is obviously possible to cause an
 201overflow with just 21 entries (4 higher tables + 16 entries + 1).  So
 202there may be a security concern if a malicious user has write access
 203to a directory.
 204
 205Open For Discussion
 206===================
 207
 208Device Address Space
 209--------------------
 210
 211A device address space is used for caching.  Both block devices and
 212MTD provide functions to either read a single page or write a segment.
 213Partial segments may be written for data integrity, but where possible
 214complete segments are written for performance on simple block device
 215flash media.
 216
 217Meta Inodes
 218-----------
 219
 220Inodes are stored in the inode file, which is just a regular file for
 221most purposes.  At umount time, however, the inode file needs to
 222remain open until all dirty inodes are written.  So
 223generic_shutdown_super() may not close this inode, but shouldn't
 224complain about remaining inodes due to the inode file either.  Same
 225goes for mapping inode of the device address space.
 226
 227Currently logfs uses a hack that essentially copies part of fs/inode.c
 228code over.  A general solution would be preferred.
 229
 230Indirect block mapping
 231----------------------
 232
 233With compression, the block device (or mapping inode) cannot be used
 234to cache indirect blocks.  Some other place is required.  Currently
 235logfs uses the top half of each inode's address space.  The low 8TB
 236(on 32bit) are filled with file data, the high 8TB are used for
 237indirect blocks.
 238
 239One problem is that 16TB files created on 64bit systems actually have
 240data in the top 8TB.  But files >16TB would cause problems anyway, so
 241only the limit has changed.
 242