

Background on how filing systems work

Filing system implementations traditionally offer three methods of ensuring that writes have reached non-volatile storage:

  1. The fsync() family of functions, or their equivalents, which flush any cached written data not yet stored onto non-volatile storage. These are usually synchronous operations, in that they do not return until they have finished. A big caveat with these functions is that some filing systems, e.g. ext3, flush all pending write data for the whole filing system instead of just the pending writes for the file handle specified, i.e. they are equivalent to a synchronous sync() as described below.
  2. The O_SYNC family of per-file-handle flags, or their equivalents, which simply disable any form of write-back caching. These usually make all data write functions not return until the written data has reached non-volatile storage. For all intents and purposes, this flag asks for the old-fashioned filing system behaviour from before filing systems tried to be clever by not actually writing changes when a program writes them.
  3. The whole filing system flush of cached written data, often performed by a function like sync(). Unlike the previous two, this is usually an asynchronous operation, and there is usually no portable way of knowing when it has completed. Nevertheless, it is important because on traditional Unix implementations data persistence was simply sync() run from a periodic cronjob, and while modern Unix implementations usually no longer do this, the end implementation has not fundamentally changed much[2].
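As a rough illustration of the three methods listed above, here is a minimal sketch assuming a POSIX system. The helper names (write_then_fsync, write_with_o_sync, request_global_flush) are hypothetical and exist only for this example, and error handling is reduced to a success flag.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Method 1: an ordinary cached write followed by an explicit fsync().
// Returns true only if the write, the flush and the close all reported success.
bool write_then_fsync(const char *path, const char *text) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    size_t len = std::strlen(text);
    bool ok = ::write(fd, text, len) == static_cast<ssize_t>(len);
    ok = ok && ::fsync(fd) == 0;      // does not return until the flush has finished
    ok = (::close(fd) == 0) && ok;    // always close, even after a failure
    return ok;
}

// Method 2: open the handle with O_SYNC, so that each write() is expected
// not to return until the written data has reached non-volatile storage.
bool write_with_o_sync(const char *path, const char *text) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0) return false;
    size_t len = std::strlen(text);
    bool ok = ::write(fd, text, len) == static_cast<ssize_t>(len);
    ok = (::close(fd) == 0) && ok;
    return ok;
}

// Method 3: request a flush of the whole filing system cache.
// sync() returns nothing, so there is no portable way to learn the outcome.
void request_global_flush() {
    ::sync();
}
```

A caller might use write_then_fsync("journal.log", "entry\n") for an occasional durable write, and reserve the O_SYNC variant for handles where every single write must be durable.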

There is also the matter of the difference between data and metadata: metadata is what a filing system stores in order to keep track of your data. For each of the first two of the above three families of functions, most systems provide three variants: flush metadata, flush data, and flush both metadata and data, so for clarity:

Table 1.1. Mechanisms for enforcing data persistence onto physical storage

            Flush file metadata    Flush file data                Flush both metadata and data
Once off    fsync(parentdir_fd)    fdatasync(fd)                  fsync(fd)
Always      Varies[a]              fcntl(fd, F_SETFL, O_DSYNC)    fcntl(fd, F_SETFL, O_SYNC)

[a] Many filing systems (NTFS, HFS+, ext3/4 with data=ordered) hold back a metadata flush until closing the file handle causes the data to finish reaching physical storage. This ensures that file entries don't appear in directories with zero sizes.
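The mechanisms in Table 1.1 could be wrapped roughly as follows. This is only a sketch assuming a Linux-like POSIX system where fdatasync() and O_DIRECTORY are available, and all four helper names are hypothetical. One deliberate difference from the table: the "Always" flags are passed to open() instead of fcntl(fd, F_SETFL, ...), because some systems (Linux among them, according to its fcntl manual page) ignore O_DSYNC and O_SYNC when set through F_SETFL.

```cpp
#include <fcntl.h>
#include <unistd.h>

// "Once off", flush file data: fdatasync(fd).
bool flush_data_once(int fd) {
    return ::fdatasync(fd) == 0;
}

// "Once off", flush both metadata and data: fsync(fd).
bool flush_data_and_metadata_once(int fd) {
    return ::fsync(fd) == 0;
}

// "Once off", flush file metadata: fsync() on a handle to the parent directory.
bool flush_directory_once(const char *parent_dir) {
    int dir_fd = ::open(parent_dir, O_RDONLY | O_DIRECTORY);
    if (dir_fd < 0) return false;
    bool ok = ::fsync(dir_fd) == 0;
    ok = (::close(dir_fd) == 0) && ok;  // always close the directory handle
    return ok;
}

// "Always": request synchronous writes when the handle is opened.
// data_only selects O_DSYNC (data) instead of O_SYNC (data and metadata).
// Returns the file descriptor, or -1 on failure.
int open_always_synced(const char *path, bool data_only) {
    int sync_flag = data_only ? O_DSYNC : O_SYNC;
    return ::open(path, O_WRONLY | O_CREAT | sync_flag, 0644);
}
```

Note that some filing systems reject fsync() on a directory handle, so a caller should treat a false return from flush_directory_once() as "not confirmed" and decide for itself how to proceed.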


In addition to manually flushing data to physical storage, every filing system also implements some form of timer-based flush whereby a piece of written data will always be sent to physical storage within some predefined period of time after the write. Most filing systems implement different timeouts for metadata and data, but on almost all production filing systems (unless they are in a power-saving laptop mode) any data write is guaranteed to be sent to non-volatile storage within one minute. Let me be clear here for later argument's sake: the filing system is allowed to reorder writes by up to one minute in time from the order in which they were issued. Or put another way, most filing systems have a one minute temporal constraint on write order.

Most people think of fsync(), O_SYNC and sync() in terms of flushing caches. An alternative way of thinking about them is that they impose an order on writes to non-volatile storage which acts above and beyond the timeout-based write order. There is no doubt that they are a very crude and highly inefficient way of doing so because they are all or nothing, but they do open the option of emulating native filing system support for write ordering constraints when nothing better is available. So why is the ability to constrain write ordering important?
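To make the write-ordering view of fsync() concrete, here is a minimal sketch assuming a POSIX system. The helpers write_all and write_in_order are hypothetical names for this example only. An fsync() is placed between two writes so that the second write is not even issued until the first has been reported as flushed, which is the crude emulation of an ordering constraint described above.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Write the whole buffer, continuing after short writes.
static bool write_all(int fd, const std::string &buf) {
    size_t done = 0;
    while (done < buf.size()) {
        ssize_t n = ::write(fd, buf.data() + done, buf.size() - done);
        if (n < 0) return false;
        done += static_cast<size_t>(n);
    }
    return true;
}

// Write `payload`, then `commit_marker`, using fsync() as an ordering barrier:
// the marker is only written after the payload flush has reported success.
bool write_in_order(const char *path,
                    const std::string &payload,
                    const std::string &commit_marker) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    bool ok = write_all(fd, payload)
           && ::fsync(fd) == 0            // barrier: payload before marker
           && write_all(fd, commit_marker)
           && ::fsync(fd) == 0;           // make the marker itself durable
    ok = (::close(fd) == 0) && ok;        // always close, even after a failure
    return ok;
}
```

Without the first fsync(), the filing system would be free to send the marker to storage before the payload, within the timeout window discussed earlier. The cost is that each barrier stalls the caller for a full flush.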



[2] The main change is that individual writes get an individual lifetime before they must be written to storage, rather than everything being flushed according to some external wall clock.

