

5. Third attempt at a key-value store: Thinking like a filing system

The final named blob store design
The performance of and problems with this third generation design

You should now prepare yourself to think outside your normal comfort zone: like a filing system engineer.

Modern filing systems are extents based, which means that physical storage allocation is lazy rather than your file content being stored exactly as a program sees it when reading. In other words, physical storage is only actually allocated when you first write into an extent; if you never write into an extent, no physical storage is consumed, even though that region of the file reads as all bits zero.

You can test this for yourself by the following shell command on an extents-based filing system (e.g. ZFS, UFS, HFS+, ext4):

ned@kate:~$ truncate -s 2T sparse.img
ned@kate:~$ ls -ls sparse.img
0 -rw-rw-r-- 1 ned ned 2199023255552 Aug 19 16:13 sparse.img

The first zero shows the allocated size of the file, which is zero. The maximum allocation of the file, which is reported as the size, is 2Tb. Let's try appending a single character to the end of the file:

ned@kate:~$ echo '0' >> sparse.img
ned@kate:~$ ls -ls sparse.img
4 -rw-rw-r-- 1 ned ned 2199023255554 Aug 19 16:17 sparse.img

The maximum allocation has increased by two bytes, as you would expect: the character plus the trailing newline which echo appends. However, the actual allocation is now in fact 4Kb, because the first column of ls -ls reports allocated storage in 1Kb blocks. If you were not on an extents-based filing system, the echo append command would take a very long time to execute and you would in fact get a 2Tb sized file almost entirely made of zeroes!
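The same check can be done programmatically. Here is a minimal sketch using plain POSIX rather than AFIO (the file name is purely illustrative), which creates the sparse file with ftruncate() and compares its apparent size against its allocated size as reported by fstat():

// sparse_stat.cpp: a minimal POSIX sketch (not AFIO) which creates the sparse
// file programmatically and compares its apparent size with its allocated size.
// Compile with -D_FILE_OFFSET_BITS=64 on 32 bit platforms.
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main()
{
  int fd = ::open("sparse.img", O_CREAT | O_RDWR, 0644);
  if (fd < 0) { std::perror("open"); return 1; }
  // Extend the maximum allocation (the apparent size) to 2Tb without writing.
  if (::ftruncate(fd, 2199023255552ULL) < 0) { std::perror("ftruncate"); return 1; }
  struct stat s;
  ::fstat(fd, &s);
  // st_blocks is reported in 512 byte units by POSIX.
  std::printf("apparent size = %lld bytes, allocated = %lld bytes\n",
              (long long) s.st_size, (long long) s.st_blocks * 512);
  ::close(fd);
  return 0;
}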

Different filing systems have different granularities of extent. ext4, for example, usually has a 4Kb extent granularity unless it is a partial final extent. ZFS usually has a 128Kb granularity, while NTFS usually has a 64Kb granularity. The key thing to remember here is that the granularity is usually quite chunky, so storing a single byte every 256Kb offset is going to be extremely wasteful of physical storage. AFIO provides enumeration of the allocated extents of a file using async_extents().
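If you are curious what such an enumeration reports, the sketch below illustrates the concept using plain lseek() with SEEK_DATA and SEEK_HOLE, which Linux, FreeBSD and Solaris provide. It is an illustration of the idea only, not how AFIO implements async_extents():

// list_extents.cpp: enumerate the allocated extents of a file using
// lseek(SEEK_DATA)/lseek(SEEK_HOLE). A conceptual illustration only, not
// AFIO's implementation of async_extents(). On glibc these constants need
// _GNU_SOURCE, which g++ defines by default.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char *argv[])
{
  if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
  int fd = ::open(argv[1], O_RDONLY);
  if (fd < 0) { std::perror("open"); return 1; }
  off_t end = ::lseek(fd, 0, SEEK_END);
  off_t offset = 0;
  while (offset < end)
  {
    off_t data = ::lseek(fd, offset, SEEK_DATA);  // start of the next allocated extent
    if (data < 0) break;                          // nothing but holes past this offset
    off_t hole = ::lseek(fd, data, SEEK_HOLE);    // end of that extent
    std::printf("allocated extent at offset %lld, length %lld\n",
                (long long) data, (long long) (hole - data));
    offset = hole;
  }
  ::close(fd);
  return 0;
}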

Just as you can lazily allocate physical storage for file content using extents, so too can you deallocate extents by punching holes in a file's content storage. This simply eliminates the physical storage for a region, and therefore subsequent reads from that region will now return all bits zero. AFIO's API for doing this deallocation is async_zero(), and just to remind you of which filing systems provide extents and APIs to manage them, here is that power loss safety table once again:

Table 1.8. Power loss safety matrix: What non-trivially reconstructible data can you lose if power is suddenly lost? Any help which can be supplied in filling in the unknowns in this table would be hugely appreciated.

Columns: (1) newly created file content corruptable after close; (2) file data content rewrite corruptable after close; (3) cosmic ray bitrot corruptable; (4) can punch holes into physical storage of files[a]; (5) default max seconds of writes reordered without using fsync().

FAT32:                          (5) ?
ext2:                           (5) 35
ext3/4 data=writeback:          (4) ext4 only;  (5) 35[b]
ext3/4 data=ordered (default):  (4) ext4 only;  (5) 35
UFS + soft updates[c]:          (4) see [d];    (5) 30
HFS+:                           (5) ?
NTFS[e]:                        (5) Until idle or write limit
ext3/4 data=journal:            (4) ext4 only;  (5) 5
BTRFS[f]:                       (5) 30
ReFS:                           (2) not if integrity streams enabled;  (3) not if integrity streams enabled;  (5) Until idle or write limit
ZFS:                            (5) 30

[a] This is where a filing system permits you to deallocate the physical storage of a region of a file, so a file claiming to occupy 8Mb could be reduced to 1Mb of actual storage consumption. This may sound like sparse file support, but transparent compression support also counts as it would reduce a region written with all zeros to nearly zero physical storage.

[b] This is the commit mount setting added to the /proc/sys/vm/dirty_expire_centiseconds value. Sources: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt and http://www.westnet.com/~gsmith/content/linux-pdflush.htm

[d] BSD automatically detects extended regions of all bits zero, and eliminates their physical representation on storage.
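To make hole punching concrete, here is a minimal sketch of what deallocating a region looks like at the system call level on Linux, using fallocate() with FALLOC_FL_PUNCH_HOLE. AFIO's async_zero() is the portable way to do this, so treat the following as an illustration of the Linux facility only:

// punch_hole.cpp: deallocate the physical storage of a file region on Linux.
// Subsequent reads from the punched region return all bits zero. A Linux-only
// sketch; AFIO's async_zero() is the portable way to do this.
#include <fcntl.h>
#include <linux/falloc.h>   // FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
#include <unistd.h>
#include <cstdio>

int main(int argc, char *argv[])
{
  if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
  int fd = ::open(argv[1], O_RDWR);
  if (fd < 0) { std::perror("open"); return 1; }
  // Deallocate 1Mb of physical storage starting at offset 0 without changing
  // the apparent file size (KEEP_SIZE is mandatory when punching holes).
  if (::fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 1024 * 1024) < 0)
    std::perror("fallocate");
  ::close(fd);
  return 0;
}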


Visibility of concurrent writes to concurrent readers in file i/o

The final piece of the puzzle is atomicity of writes. Many people do not realise that POSIX operating systems are supposed to provide some strong, POSIX-mandated guarantees about the visibility of file i/o to concurrent readers and writers, guarantees which AFIO goes out of its way not to interfere with. These are:

  1. The visibility of a write to a file is to be atomic to all readers of the same file, including gather writes (note gather writes on Microsoft Windows are not atomic, but single buffer writes may be, see below). The POSIX-2008 wording is this:

    If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. A similar requirement applies to multiple write operations to the same file position. This is needed to guarantee the propagation of data from write() calls to subsequent read() calls. (POSIX-2008)

    If you try to implement this wording, you quickly realise that there is an implied mutex on each file in the kernel, and that every read from and write to that file anywhere in the system involves holding that mutex for the duration of the i/o. Whether that is actually how it is implemented is unimportant: POSIX requires that it must appear to readers and writers of file i/o as if this were so.

    In actual implementations, however, the reality is somewhat different. The first obvious difficulty is with networked filing systems, where performance would be awful if an implied mutex had to be claimed for every single read and write, so networked filing systems don't do that. The second big deviation is with memory mapped files: reading and writing raw memory does not synchronise on the implied per-file mutex, and therefore you may see a partially written write during a read. All the major operating systems do happen to synchronise i/o between memory maps and normal read/write so long as nobody is using afio::file_flags::os_direct (i.e. msync(MS_INVALIDATE) is a no-op on any recent major operating system), however memory map i/o still races against itself and against normal i/o operations. This, incidentally, is why AFIO does not provide much memory map support currently.

    An additional problem beyond these is that Linux is simply non-conforming to POSIX-2008 here, and unfortunately has been for a very long time. Linux locks at a per-4Kb-page granularity and nothing beyond that, which implies that if your read unfortunately straddles a 4Kb boundary, you have a race. XFS has implemented additional locking to solve this problem, but ext4 has not. You can see more detail here and here.

    Finally, Microsoft Windows makes no public guarantees about the atomic visibility of writes except that nothing is guaranteed past a sector size, and by that it is referring to the atomicity of a write reaching storage rather than anything about visibility to concurrent readers and writers. From how the NT kernel IRP queue works, whereby reads and writes both enter the same serialised queue, it seems to me that the POSIX visibility guarantees are met, which would be expected as the NT kernel is explicitly designed to be POSIX compliant; however, one should be cautious here. I note from a review of portable databases that none makes strong assumptions about read-write visibility ordering, and I assume that is for a reason, though that reason could simply be Linux.

    If you are willing to restrict your end use cases to:

    • Any of the BSDs (these conform to POSIX).
    • Linux with XFS (XFS implements special workarounds to conform with POSIX).
    • Microsoft Windows with NTFS (it is thought to conform to POSIX).

    ... then you may be able to dispense with explicit locks. You may find the async_statfs() API useful as it can return the name of the filing system providing an open file handle.

  2. By extension of the above, POSIX requires that the visibility of a write to a file opened for append-only access is also atomic to all readers and writers of the same file. Or put more succinctly, appending to an append-only file is atomic per syscall, so concurrent appenders are guaranteed each of their writes will appear to be appended in full before another appender's write. Here is the POSIX-2008 wording:

    If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation. (POSIX-2008)

    You will be glad to know that this is a much better supported feature in implementations. The atomicity of file append even works on CIFS/Samba network shares in the usual configuration, albeit with a severe performance penalty (NFS network shares unfortunately do not preserve append atomicity because the protocol is incapable of expressing such an operation). Because memory maps cover a given file extent, file appends do not affect them. Note also that the guarantee concerns only where the content is appended: the increment of the file size is atomic (except on NFS), while the visibility of the appended content to concurrent readers is a separate problem governed by the rules above. A minimal sketch of such an append follows this list.
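To make the append guarantee concrete, here is a minimal sketch using plain POSIX rather than AFIO (the file name and record format are purely illustrative) showing how concurrent processes can append whole records to a shared file relying only on O_APPEND:

// atomic_append.cpp: a minimal POSIX sketch (not AFIO) of atomic file append.
// With O_APPEND the kernel atomically sets the file offset to the end of the
// file before each write(), so concurrent appenders from any process never
// interleave within a single record (NFS excepted).
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
  int fd = ::open("store.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
  if (fd < 0) { std::perror("open"); return 1; }
  const char record[] = "key=hello value=world\n";
  // A single write() call of a whole record is what gets appended as one unit;
  // real code would also check for a short write.
  if (::write(fd, record, sizeof(record) - 1) < 0) std::perror("write");
  ::close(fd);
  return 0;
}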

The relevance of all this filing system theory to the final key-value BLOB store

Why all this matters is that you need to understand this theory to understand why the third generation key-value BLOB store presented next is designed exactly the way it is. Unintuitively, it is going to be a set of always-growing, atomically-appended sparse files. We take advantage of hole punching to deallocate the parts of the files no longer needed, so physically consumed storage does not grow past what is needed, even though the files' apparent sizes grow forever. The atomicity of the append operation is used both as the concurrency control mechanism for implementing transactional updates as effectively copy-on-write, and as the mechanism for marking time to the filing system, such that extents which have exceeded their maximum age and must be flushed to storage are always exactly those at an earlier offset in the file. Because we never open nor close files, we avoid the costs of security checking during path lookup, and therefore maximise performance. While the next example does not make use of memory maps, as that would severely complicate the code, the on-disk file format is capable of being used directly from a memory map, which would allow high performance scaling to hundreds of millions of key-value pairs without blinking. The only real pessimism in the following design is that it doesn't scale well to modifies: they scale worse than O(N^2) with the number of concurrent writers and worse than O(N) with the key count.

We do have to take some care, however. As on Linux writes are not atomic with respect to reads across 4Kb boundaries, we are going to need some mechanism for checking whether a given append is valid or not, which can usefully be reused to check a store for validity after a sudden power loss (the fact that reads and writes may have sequentially consistent visibility to processes has no bearing whatsoever on the order in which writes are flushed to physical storage). Any write we make within a single 4Kb boundary, e.g. to the very front of the store file, is probably safe so long as it does not cross that boundary.
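As a concrete illustration of that constraint, a hypothetical helper such as the one below could be used to reject in-place rewrites which would cross a 4Kb page boundary and therefore might be seen torn by a concurrent reader on Linux:

// A hypothetical helper (not part of AFIO): returns true if a write of
// 'length' bytes at 'offset' would cross a 4Kb boundary, and so could be seen
// torn by a concurrent reader on Linux, which only locks i/o per 4Kb page.
#include <cstdint>

constexpr std::uint64_t page_size = 4096;

constexpr bool straddles_4kb_boundary(std::uint64_t offset, std::uint64_t length)
{
  // Safe (for this purpose) only if the first and last bytes written fall
  // inside the same 4Kb page.
  return length != 0 && (offset / page_size) != ((offset + length - 1) / page_size);
}

static_assert(!straddles_4kb_boundary(0, 4096), "a write filling one page does not straddle");
static_assert(straddles_4kb_boundary(4095, 2), "a write crossing the page boundary does");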

And just to be extra special careful, we will never rewrite existing content by more than a sector size, which we will assume is 512 bytes on a 512 byte boundary, as that is publicly documented as guaranteed for all the major operating systems. For that reason, as you will note in the implementation in the next section, everything is aligned to a 32 byte boundary and we never rewrite more than 32 bytes, which guarantees that we never rewrite across a 512 byte boundary and therefore never risk atomicity.
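To show what that alignment discipline looks like in code, here is a hypothetical 32 byte record header of the kind such a store might use (illustrative only, not the next section's actual on-disk format). Because it is exactly 32 bytes and only ever written at 32 byte aligned offsets, rewriting it can never straddle a 512 byte sector boundary:

// A hypothetical 32 byte on-disk record header. Written only at offsets which
// are a multiple of 32, rewriting one header can never straddle a 512 byte
// sector boundary, so single sector rewrite atomicity is never at risk.
#include <cstdint>

struct alignas(32) record_header
{
  std::uint64_t magic;     // identifies a valid record after sudden power loss
  std::uint64_t length;    // length of the value which follows this header
  std::uint64_t key_hash;  // hash of the key
  std::uint32_t crc32;     // checksum used to detect torn or unflushed appends
  std::uint32_t age;       // generation counter for copy-on-write updates
};
static_assert(sizeof(record_header) == 32, "header must be exactly 32 bytes");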

