Filing system implementations traditionally offer three methods of ensuring that writes have reached non-volatile storage:
1. fsync() or its equivalent functions, which flush any cached written data not yet stored onto non-volatile storage. These are usually synchronous operations, in that they do not return until they have finished. A big caveat with these functions is that some filing systems, e.g. ext3, flush all pending write data for the entire filing system instead of just the pending writes for the file handle specified, i.e. they are equivalent to a synchronous sync() as described below.

2. O_SYNC or its equivalent per-file-handle flags, which simply disable any form of write-back caching. These usually make all data write functions not return until the written data has reached non-volatile storage. This flag effectively asks for “old fashioned” filing system behaviour from before filing systems tried to be clever by not actually writing changes when a program writes changes.

3. sync(). Unlike the previous two, this is usually an asynchronous operation, and there is usually no portable way of knowing when it has completed. Nevertheless, it is important because on traditional Unix implementations data persistence is simply sync() run periodically from a cron job, and while modern Unix implementations usually no longer do this, the end implementation has not fundamentally changed much[2].
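The three methods above can be sketched side by side. This is a minimal illustration using Python's `os` module, which wraps the POSIX calls directly; it assumes a Unix-like system where `os.O_SYNC` and `os.sync()` are available, and the function names and the `_write_all` helper are hypothetical names chosen for this example.

```python
import os
import tempfile


def _write_all(fd: int, payload: bytes) -> None:
    """os.write() may write fewer bytes than asked, so loop until done."""
    view = memoryview(payload)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_then_fsync(path: str, payload: bytes) -> None:
    """Method 1: write through the cache, then flush this handle once."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, payload)
        os.fsync(fd)  # does not return until the flush has finished
    finally:
        os.close(fd)


def write_with_o_sync(path: str, payload: bytes) -> None:
    """Method 2: ask that every write on this handle waits for storage."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        _write_all(fd, payload)  # each os.write() is itself synchronous
    finally:
        os.close(fd)


def write_then_sync(path: str, payload: bytes) -> None:
    """Method 3: request a system-wide flush of all pending writes."""
    with open(path, "wb") as handle:
        handle.write(payload)
    # Portable code cannot tell when the flush requested here has completed.
    os.sync()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_then_fsync(os.path.join(tmp, "a.bin"), b"flushed once\n")
        write_with_o_sync(os.path.join(tmp, "b.bin"), b"always flushed\n")
        write_then_sync(os.path.join(tmp, "c.bin"), b"flushed with everything\n")
```

All three leave the same bytes in the file; they differ only in when, and how selectively, the data is pushed to non-volatile storage.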
There is also the matter of the difference between data and metadata: metadata is what a filing system stores in order to keep track of your data, for example file names, sizes and timestamps. For each of the first two of the above three families of functions, most systems provide three variants: flush metadata, flush data, and flush both metadata and data, so for clarity:
Table 1.1. Mechanisms for enforcing data persistence onto physical storage
| | Flush file metadata | Flush file data | Flush both metadata and data |
|---|---|---|---|
| Once off | | | |
| Always | Varies[a] | | |
[a]
Many filing systems (NTFS, HFS+, ext3/4 with |
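To make the data versus metadata distinction concrete, here is a hedged sketch of how the variants are commonly reached on POSIX systems. The calls `fdatasync()` and `O_DSYNC` are not named in the surviving text above, so their use here is an assumption based on POSIX, not a reconstruction of the table. Persisting a directory entry by calling `fsync()` on the parent directory is the usual Linux approach. All function names are hypothetical.

```python
import os


def flush_data_only(fd: int) -> None:
    """Once-off flush of file data; may skip metadata such as timestamps."""
    if hasattr(os, "fdatasync"):
        os.fdatasync(fd)
    else:
        os.fsync(fd)  # fallback where fdatasync() is unavailable


def flush_data_and_metadata(fd: int) -> None:
    """Once-off flush of both the file's data and its metadata."""
    os.fsync(fd)


def flush_directory_entry(path: str) -> None:
    """Once-off flush of the parent directory, which holds the file's name."""
    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def open_always_flushing(path: str, data_only: bool) -> int:
    """Open a handle whose writes always wait for storage."""
    if data_only and hasattr(os, "O_DSYNC"):
        sync_flag = os.O_DSYNC  # wait for data only
    else:
        sync_flag = os.O_SYNC   # wait for data and metadata
    return os.open(path, os.O_WRONLY | os.O_CREAT | sync_flag, 0o644)
```

The caller owns the descriptor returned by `open_always_flushing` and must close it with `os.close()`.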
In addition to manual flushing of data to physical storage, every filing system also implements some form of timer-based flush whereby a piece of written data will always be sent to physical storage within some predefined period of time after the write. Most filing systems implement different timeouts for metadata and data, but on almost all production filing systems, unless they are in a power-saving laptop mode, any data write is guaranteed to be sent to non-volatile storage within one minute. Let me be clear here for later argument's sake: the filing system is allowed to reorder writes by up to one minute in time from the order in which they were issued. Or put another way, most filing systems have a one minute temporal constraint on write order.
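As one example of where such a timer lives, Linux exposes its write-back timing through two kernel tunables under `/proc/sys/vm`: `dirty_expire_centisecs` (how old dirty data must be before it becomes eligible for write-out) and `dirty_writeback_centisecs` (how often the flusher threads wake up). The sketch below simply reads them. It assumes a Linux system; the function name and the `proc_root` parameter are hypothetical, the latter existing only so the parser can be pointed at another directory.

```python
from pathlib import Path


def linux_writeback_seconds(proc_root: str = "/proc/sys/vm") -> dict:
    """Return the Linux write-back timers in seconds, or None if unreadable."""
    names = ("dirty_expire_centisecs", "dirty_writeback_centisecs")
    result = {}
    for name in names:
        try:
            raw = (Path(proc_root) / name).read_text().strip()
            result[name] = int(raw) / 100.0  # centiseconds to seconds
        except (OSError, ValueError):
            result[name] = None  # not Linux, or not readable
    return result


if __name__ == "__main__":
    for name, seconds in linux_writeback_seconds().items():
        print(name, seconds)
```

Other operating systems and individual filing systems keep their own timers, so these two values are only one instance of the general mechanism described above.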
Most people think of fsync(), O_SYNC and sync() in terms of flushing caches. An alternative way of thinking about them is that they impose an order on writes to non-volatile storage which acts above and beyond the timeout-based write order. There is no doubt that they are a very crude and highly inefficient way of doing so because they are all or nothing, but they do open the option of emulating native filing system support for write ordering constraints when nothing better is available. So why is the ability to constrain write ordering important?
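As a concrete picture of a flush acting as an ordering constraint, consider the minimal sketch below. It writes a payload, flushes it, and only then writes a small marker file, so the marker is never issued before the payload has been flushed. It assumes that `fsync()` is honoured all the way down the storage stack, and the function and file names are hypothetical.

```python
import os


def _write_all(fd: int, payload: bytes) -> None:
    """os.write() may write fewer bytes than asked, so loop until done."""
    view = memoryview(payload)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def ordered_write(data_path: str, marker_path: str, payload: bytes) -> None:
    """Write payload, flush it, and only afterwards write a marker file."""
    fd = os.open(data_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, payload)
        os.fsync(fd)  # ordering point: nothing below is issued before this returns
    finally:
        os.close(fd)

    fd = os.open(marker_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, b"committed\n")
        os.fsync(fd)
    finally:
        os.close(fd)
```

The cost is exactly the crudeness described above: the first `fsync()` stalls the program until the flush completes, even though all that was wanted was "payload before marker".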
[2] The main change is that individual writes get an individual lifetime before they must be written to storage, rather than everything being flushed according to some external wall clock.