Write ordering data and durability: why does it matter?

	Note
	You may find the paper presented at C++ Now 2014 “Large Code Base Change Ripple Management in C++: My thoughts on how a new Boost C++ Library could help” of interest here.

Implementing performant Durability essentially reduces down to answering two questions: (i) how long does it take to restore a consistent state after an unexpected power loss, and (ii) how much of your most recent data are you willing to lose? AFIO has been designed and written as the asynchronous file i/o portability layer for the forthcoming direct filing system graphstore database TripleGit which, like as with ZFS, implements late Durability i.e. you are guaranteed that your writes from some wall clock distance from now can never be lost. As discussing how TripleGit will use AFIO is probably useful to many others, that is what the remainder of this section does.

TripleGit will achieve the Consistent and Isolated parts of being a reliable database by placing abortable, garbage collectable concurrent writes of new data into separate files, and pushing the atomicity enforcement into a very small piece of ordering logic in order to reduce transaction contention by multiple writers as much as possible. If you wish to never lose most recent data, to implement a transaction one (i) writes one's data to the filing system, (ii) ensure it has reached non-volatile storage, (iii) appends the knowledge it definitely is on non-volatile storage to the intent log, and then (iv) ensure one's append also has reached non-volatile storage. This is presently the only way to ensure that valuable data definitely is never lost on any filing system that I know of. The obvious problem is that this method involves writing all your data with O_SYNC and using fsync() on the intent log. This might perform okay with a single writer, but with multiple writers performance is usually awful, especially on storage incapable of high queue depths and potentially many hundreds of milliseconds of latency (e.g. SD Cards). Despite the performance issues, there are many valid use cases for especially precious data, and TripleGit of course will offer such a facility, at both the per-graph and per-update levels.

TripleGit's normal persistence strategy is a bit more clever: write all your data, but keep a hash like a SHA of its contents as you write it^[3]. When you write your intent log, atomically append all the SHAs of the items you just wrote and skip O_DATA and fsync() completely. If power gets removed before all the data is written to non-volatile storage, you can figure out that the database is dirty easily enough, and you simply parse from the end of the intent log backwards, checking each item's contents to ensure their SHAs match up, throwing away any transaction where any file is missing or any file's contents don't match. On a filing system such as ext4 where data is guaranteed to be sent to non-volatile storage after one minute^[4], and of course so long as you don't mind losing up to one minute's worth of data, this solution can have much better performance than the previous solution with lots of simultaneous writers.

The problem though is that while better, performance is still far less than optimal. Firstly, you have to calculate a whole load of hashes all the time, and that isn't trivial especially on lower end platforms like a mobile phone where 25-30 cycles per byte SHA256 performance might be typical. Secondly, dirty database reconstruction is rather like how ext2 had to call fsck on boot: a whole load of time and i/o must be expended to fix up damage in the store, and while it's running one generally must wait.

What would be really, really useful is if the filing system exposed its internal write ordering constraint implementation to user mode code, so one could say “schedule writing A, B, C and D in any order whenever you get round to it, but you must write all of those before you write any of E, F and G”. Such an ability gives maximum scope to the filing system to reorder and coalesce writes as it sees fit, but still allows database implementations to ensure that a transaction's intent log entry can never appear without all the data it refers to. Such an ability would eliminate the need for expensive dirty database checking and reconstruction, or the need for any journalling infrastructure used to skip the manual integrity checking.

Unfortunately I know of no filing system which makes publicly available such a facility. The closest that I know of is ZFS which internally uses a concept of transaction groups which are, for all intents and purposes, partial whole filing system snapshots issued once every five seconds. Data writes may be reordered into any order within a transaction group, but transaction group commits are strongly tied to the wall clock and are always committed sequentially. Since the addition of the ZFS Write Throttle, the default settings are to accept new writes as fast as RAM can handle, buffering up to thirty wall clock seconds of writes before pacing the acceptance of new write data to match the speed of the non-volatile storage (which may be a ZFS Intent Log (ZIL) device if you're doing synchronous writes). This implies up to thirty seconds of buffered data could be lost, but note that ZFS still guarantees transaction group sequential write order. Therefore, what ZFS is in fact guaranteeing is this: “we may reorder your write by up to five seconds away from the sequence in which you wrote it and other writes surrounding it. Other than that, we guarantee the order in which you write is the order in which that data reaches physical storage.” ^[5]

What this means is this: on ZFS, TripleGit can turn off all write synchronisation and replace it with a five second delay between writing new data and updating the intent log, and in so doing guaranteeing that the intent log's contents will always refer to data definitely on storage (or rather, close enough that one need not perform a lot of repair work on first use after power loss). One can additionally skip SHA hashing on reads because ZFS guarantees file and metadata will always match and as TripleGit always copy on writes data, either a copy's length matches the intent log's or it doesn't (i.e. the file's length as reported by the filing system really is how much true data it contains), plus the file modified timestamp always reflects the actual last modifed timestamp of the data.

Note that ext3 and ext4 can also guarantee that file and metadata will always match using the (IOPS expensive) mount option data=journal, which can be detected from /proc/mounts. If combined with the proprietary Linux call syncfs(), one can reasonably emulate ZFS's behaviour, albeit rather inefficiently. Another option is to have an idle thread issue fsync for writes in the order they were issued after some timeout period, thus making sure that writes definitely will reach physical storage within some given timeout and in their order of issue — this can be used to emulate the ZFS wall clock based write order consistency guarantees.

	Note
	You may find the tutorial of interest which implements an ACID transactional key-value store using the theory in this section.

Sadly, most use of TripleGit and Boost.AFIO will be without the luxury of ZFS, so here is a quick table of power loss data safety. Once again, I reiterate that errors and omissions are my fault alone.

Table 1.2. Power loss safety matrix: What non-trivially reconstructible data can you lose if power is suddenly lost? Any help which can be supplied in filling in the unknowns in this table would be hugely appreciated.

	Newly created file content corruptable after close	File data content rewrite corruptable after close	Cosmic ray bitrot corruptable	Can punch holes into physical storage of files^[a]	Default max seconds of writes reordered without using `fsync()`
FAT32	✔	✔	✔	✘	?
ext2	✔	✔	✔	✘	35
ext3/4 `data=writeback`	✔	✔	✔	ext4 only	35^[b]
ext3/4 `data=ordered` (default)	✘	✔	✔	ext4 only	35
UFS + soft updates^[c]	✘	✔	✔	✔^[d]	30
HFS+	✘	✔	✔	✔	?
NTFS^[e]	✘	✔	✔	✔	Until idle or write limit
ext3/4 `data=journal`	✘	✘	✔	ext4 only	5
BTRFS^[f]	✘	✘	✘	✔	30
ReFS	✘	not if integrity streams enabled	not if integrity streams enabled	✔	Until idle or write limit
ZFS	✘	✘	✘	✔	30
^[a] This is where a filing system permits you to deallocate the physical storage of a region of a file, so a file claiming to occupy 8Mb could be reduced to 1Mb of actual storage consumption. This may sound like sparse file support, but transparent compression support also counts as it would reduce a region written with all zeros to nearly zero physical storage ^[b] This is the `commit` mount setting added to the `/proc/sys/vm/dirty_expire_centiseconds` value. Sources: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt and http://www.westnet.com/~gsmith/content/linux-pdflush.htm ^[c] Source: http://www.freebsd.org/cgi/man.cgi?query=syncer ^[d] BSD automatically detects extended regions of all bits zero, and eliminates their physical representation on storage. ^[e] Source: http://technet.microsoft.com/en-us/library/bb742613.aspx ^[f] Source: https://wiki.archlinux.org/index.php/Btrfs

^[3] TripleGit actually uses a different, much faster 256 bit 3 cycles/byte cryptographic hash called Blake2 by default, but one can force use of SHA256/512 on a per-graph basis, or indeed if your CPU has SHA hardware offload instructions these may be used by default.

^[4] This is the default, and it may be changed by a system e.g. I have seen thirty minutes set for laptops. Note that the Linux-specific call syncfs() lets one artifically schedule whole filing system flushes.

^[5] Source: http://www.c0t0d0s0.org/archives/5343-Insights-into-ZFS-today-The-nature-of-writing-things.html