You should now prepare yourself to think outside your normal comfort zone, and to think like a filing system engineer.
Modern filing systems are extents based, which means that your file content is not necessarily backed by physical storage in the way the program sees it when reading it: physical storage allocation is lazy. In other words, physical storage is only allocated when you first write into an extent, and if you never write into an extent then no physical storage is consumed, even though that region of the file reads as all bits zero.
You can test this for yourself with the following shell commands on an extents-based filing system (e.g. ZFS, UFS, HFS+, ext4):
ned@kate:~$ truncate -s 2T sparse.img
ned@kate:~$ ls -ls sparse.img
0 -rw-rw-r-- 1 ned ned 2199023255552 Aug 19 16:13 sparse.img
The first zero shows the allocated size of the file, which is zero. The maximum allocation of the file, which is reported as the size, is 2Tb. Let's try writing a single byte onto the end of the file:
ned@kate:~$ echo '0' >> sparse.img
ned@kate:~$ ls -ls sparse.img
4 -rw-rw-r-- 1 ned ned 2199023255554 Aug 19 16:17 sparse.img
The maximum allocation has increased by two bytes as you would expect (echo appends the character plus a newline). However, the real allocation is now in fact 4Kb (the first column of ls -ls reports allocated size in 1Kb blocks): appending even two bytes forced a whole extent to be allocated. If you were not on an extents-based filing system, the echo append command would take a very long time to execute and you would in fact get a 2Tb sized file almost entirely made of zeroes!
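If you prefer to see the same experiment programmatically, here is a minimal sketch using plain POSIX calls rather than AFIO's API: it creates the sparse file with ftruncate() and then compares the apparent size (st_size) with the physically allocated size (st_blocks, counted in 512 byte units). The file name is just for illustration.

#define _FILE_OFFSET_BITS 64   // so off_t is 64 bit even on 32 bit Linux
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main()
{
  // ftruncate() extends the maximum allocation to 2Tb without allocating
  // a single physical extent.
  int fd = ::open("sparse.img", O_CREAT | O_RDWR, 0644);
  if (fd < 0) { std::perror("open"); return 1; }
  if (::ftruncate(fd, 2199023255552LL) < 0) { std::perror("ftruncate"); return 1; }

  struct stat s;
  if (::fstat(fd, &s) == 0)
    std::printf("apparent size = %lld bytes, allocated = %lld bytes\n",
                (long long) s.st_size, (long long) s.st_blocks * 512LL);
  ::close(fd);
  return 0;
}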
Different filing systems have different granularities of extent. ext4,
for example, usually has a 4Kb extent granularity unless it is a partial
final extent. ZFS usually has a 128Kb granularity, while NTFS usually has
a 64Kb granularity. The key thing to remember here is that the granularity
is usually quite chunky, so storing a single byte at every 256Kb offset is
going to be extremely wasteful of physical storage. AFIO provides enumeration
of the allocated extents of a file using async_extents().
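async_extents() is the portable way of asking AFIO for this information. Purely as an illustration of what the operating system reports underneath (and not necessarily how AFIO implements the call), here is a hedged sketch which walks the allocated extents of an open file using lseek() with SEEK_DATA and SEEK_HOLE, assuming a platform which supports those whence values (e.g. Linux 3.1+, FreeBSD, Solaris):

#include <fcntl.h>
#include <unistd.h>   // SEEK_DATA/SEEK_HOLE need _GNU_SOURCE on Linux (g++ defines it by default)
#include <cstdio>

// Print every allocated extent of an already open file descriptor.
void print_extents(int fd)
{
  off_t end = ::lseek(fd, 0, SEEK_END);
  off_t data = ::lseek(fd, 0, SEEK_DATA);        // first allocated byte, or -1 if the file is all hole
  while (data >= 0 && data < end)
  {
    off_t hole = ::lseek(fd, data, SEEK_HOLE);   // end of this allocated extent
    if (hole < 0) hole = end;
    std::printf("allocated extent: offset %lld, length %lld\n",
                (long long) data, (long long) (hole - data));
    data = ::lseek(fd, hole, SEEK_DATA);         // start of the next allocated extent, if any
  }
}

int main(int argc, char *argv[])
{
  int fd = ::open(argc > 1 ? argv[1] : "sparse.img", O_RDONLY);
  if (fd >= 0) { print_extents(fd); ::close(fd); }
  return 0;
}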
Just as you can lazily allocate physical storage for file content using extents, so too can you deallocate extents by “punching holes” in a file's content storage. This simply eliminates the physical storage for a region, and subsequent reads from that region will then return all bits zero. AFIO's API for doing this deallocation is async_zero().
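As an illustration of the mechanism at the operating system level (and not necessarily how AFIO implements async_zero()), on Linux a hole can be punched with fallocate() using FALLOC_FL_PUNCH_HOLE, which deallocates the physical storage of a region while leaving the file's apparent size untouched. A minimal sketch, assuming Linux and a filing system which supports hole punching:

#include <fcntl.h>    // fallocate() and FALLOC_FL_* (Linux; g++ defines _GNU_SOURCE by default)
#include <unistd.h>
#include <cstdio>

// Deallocate the physical storage backing [offset, offset + length).
// Subsequent reads of that region return all bits zero.
bool punch_hole(int fd, off_t offset, off_t length)
{
  if (::fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, length) < 0)
  {
    std::perror("fallocate");   // e.g. EOPNOTSUPP if the filing system cannot punch holes
    return false;
  }
  return true;
}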
Just to remind you of which filing systems provide extents and APIs to manage them, here is that power loss safety table once again:
Table 1.8. Power loss safety matrix: What non-trivially reconstructible data can you lose if power is suddenly lost? Any help which can be supplied in filling in the unknowns in this table would be hugely appreciated.
Filing system | Newly created file content corruptable after close | File data content rewrite corruptable after close | Cosmic ray bitrot corruptable | Can punch holes into physical storage of files[a] | Default max seconds of writes reordered without using fsync() |
---|---|---|---|---|---|
FAT32 | ✔ | ✔ | ✔ | ✘ | ? |
ext2 | ✔ | ✔ | ✔ | ✘ | 35 |
ext3/4 data=writeback | ✔ | ✔ | ✔ | ext4 only | 35[b] |
ext3/4 data=ordered (default) | ✘ | ✔ | ✔ | ext4 only | 35 |
UFS + soft updates[c] | ✘ | ✔ | ✔ | ✔[d] | 30 |
HFS+ | ✘ | ✔ | ✔ | ✔ | ? |
NTFS[e] | ✘ | ✔ | ✔ | ✔ | Until idle or write limit |
ext3/4 data=journal | ✘ | ✘ | ✔ | ext4 only | 5 |
BTRFS[f] | ✘ | ✘ | ✘ | ✔ | 30 |
ReFS | ✘ | not if integrity streams enabled | not if integrity streams enabled | ✔ | Until idle or write limit |
ZFS | ✘ | ✘ | ✘ | ✔ | 30 |

[a] This is where a filing system permits you to deallocate the physical storage of a region of a file, so a file claiming to occupy 8Mb could be reduced to 1Mb of actual storage consumption. This may sound like sparse file support, but transparent compression support also counts as it would reduce a region written with all zeros to nearly zero physical storage.
[d] BSD automatically detects extended regions of all bits zero, and eliminates their physical representation on storage.
[f] Source: https://wiki.archlinux.org/index.php/Btrfs
The final piece of the puzzle is atomicity of writes. Many people do not realise that POSIX operating systems are supposed to provide some strong guarantees about the visibility of file i/o to concurrent readers and writers, guarantees which AFIO goes out of its way not to interfere with. These are:
“If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. A similar requirement applies to multiple write operations to the same file position. This is needed to guarantee the propagation of data from write() calls to subsequent read() calls.” (POSIX-2008)
If you try to implement this wording, you quickly realise there is an implied mutex on each file in the kernel, and every read from and write to that file anywhere in the system involves holding that mutex for the duration of the read or write. Whether that is actually how it is implemented is unimportant: POSIX requires that file i/o must appear to readers and writers as if this were so.
In actual implementations, however, the reality is somewhat different. The first obvious difficulty is with networked filing systems, where performance would be awful if an implied mutex had to be claimed for every single read and write, so networked filing systems don't do that. The second big deviation is with memory mapped files: reading and writing raw memory does not synchronise on the implied per-file mutex, and therefore you may see a partially written write during a read. All the major operating systems do happen to synchronise i/o between memory maps and normal read/write so long as nobody is using afio::file_flags::os_direct, i.e. msync(MS_INVALIDATE) is a no-op on any recent major operating system; however, memory map i/o still races on itself and on normal i/o operations. This, incidentally, is why AFIO does not provide much memory map support currently.
An additional problem beyond these is that Linux is simply non-conforming to POSIX-2008 here, and has unfortunately been so for a very long time. Linux locks at a per-page (4Kb) granularity and nothing coarser, which implies that if your read straddles a 4Kb boundary, you have a race. XFS has implemented additional locking to solve this problem, but ext4 has not. You can see more detail here and here.
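To make that race concrete, here is a hedged sketch of a test you could run: one thread alternately rewrites a 128 byte region which deliberately straddles a 4Kb page boundary with all 'A's and then all 'B's, while another thread reads the same region and counts reads containing a mixture of the two (torn reads). The file name and iteration counts are arbitrary choices; on ext4 you may observe torn reads, whereas on a filing system with proper range locking such as XFS you should not.

#include <fcntl.h>
#include <unistd.h>
#include <atomic>
#include <cstdio>
#include <cstring>
#include <thread>

int main()
{
  const off_t offset = 4096 - 64;   // a 128 byte i/o at this offset straddles a 4Kb boundary
  const size_t len = 128;
  int fd = ::open("race.dat", O_CREAT | O_RDWR | O_TRUNC, 0644);
  char zeros[len] = {};
  if (::pwrite(fd, zeros, len, offset) != (ssize_t) len) return 1;  // materialise the region

  std::atomic<bool> done(false);
  std::thread writer([&] {
    char a[len], b[len];
    std::memset(a, 'A', len);
    std::memset(b, 'B', len);
    for (int i = 0; i < 200000; i++)
      (void) ::pwrite(fd, (i & 1) ? a : b, len, offset);
    done = true;
  });

  long torn = 0;
  char buf[len];
  while (!done)
  {
    if (::pread(fd, buf, len, offset) != (ssize_t) len) break;
    // A torn read contains both 'A's and 'B's at once, i.e. it saw half of a write.
    if (std::memchr(buf, 'A', len) && std::memchr(buf, 'B', len)) ++torn;
  }
  writer.join();
  std::printf("torn reads observed: %ld\n", torn);
  ::close(fd);
  return 0;
}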
Finally, Microsoft Windows makes no public guarantees about the atomic visibility of writes beyond saying that nothing is guaranteed past a sector size, and by that it is referring to the atomicity of a write reaching storage rather than anything about visibility to concurrent readers and writers. From how the NT kernel IRP queue works, whereby reads and writes both enter the same serialised queue, it seems to me that the POSIX visibility guarantees are met, which would be expected as the NT kernel is explicitly designed to be POSIX compliant; however, one should be cautious here. I note from a review of portable databases that none makes strong assumptions about read-write visibility ordering, and I assume that is for a reason, though that reason could simply be “Linux”.
If you are willing to restrict your end use cases to:
... then you may be able to dispense with explicit locks. You may
find the async_statfs()
API useful as it can return the name of the filing system providing
an open file handle.
“If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.” (POSIX-2008)
You will be glad to know that this is a much better supported feature in implementations. The atomicity of file append even works on CIFS/Samba network shares in the usual configuration, albeit with a severe performance penalty (NFS network shares unfortunately do not preserve append atomicity because the protocol is incapable of expressing such an operation). Because memory maps cover a given file extent, file appends do not affect them. Note that the requirement is merely that the increment of the file size, and therefore where the content is appended, is atomic; the visibility of the appended content to concurrent readers is a separate problem. So the location of an append is guaranteed to be atomic everywhere except on NFS.
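As a hedged sketch of what that buys you in practice, the following has several threads append fixed-size records to the same file opened with O_APPEND. Because each write() atomically positions itself at the current end of file, no two records should overlap and the final file size should be exactly the total number of bytes written (assuming a local filing system rather than NFS). The file name and record size are invented for the example.

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

constexpr int thread_count = 4, records_per_thread = 1000, record_size = 64;

int main()
{
  ::unlink("append.log");
  std::vector<std::thread> workers;
  for (int t = 0; t < thread_count; t++)
    workers.emplace_back([t] {
      // Each thread has its own descriptor; O_APPEND makes every write()
      // land atomically at the then-current end of file.
      int fd = ::open("append.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
      char rec[record_size];
      std::memset(rec, 'A' + t, sizeof(rec));
      for (int i = 0; i < records_per_thread; i++)
        if (::write(fd, rec, sizeof(rec)) != (ssize_t) sizeof(rec)) break;
      ::close(fd);
    });
  for (auto &w : workers) w.join();

  struct stat s;
  if (::stat("append.log", &s) == 0)
    std::printf("file size = %lld (expected %d)\n", (long long) s.st_size,
                thread_count * records_per_thread * record_size);
  return 0;
}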
All this matters because you need to understand this theory to understand why the third generation key-value BLOB store presented next is designed exactly the way it is. Unintuitively, it is going to be a set of always-growing, atomically-appended sparse files. We take advantage of hole punching to deallocate the parts of the files no longer needed, so physically consumed storage does not grow past what is needed even though the apparent file sizes grow forever. The atomicity of the append operation is used as the concurrency control mechanism: transactional updates are implemented as effectively copy-on-write, and append order also marks time to the filing system, so that the extents which have exceeded their maximum age and must be flushed to storage are always exactly those at an earlier offset in the file. Because we never open nor close files, we avoid the costs of security checking during path lookup, and therefore maximise performance. While the next example does not make use of memory maps, as that would severely complicate the code, the on-disk file format is capable of being used directly from a memory map, which would allow high performance scaling to hundreds of millions of key-value pairs without blinking. The only real pessimism in the following design is that modifies do not scale well: they have worse than O(N^2) complexity with respect to the number of concurrent writers, and worse than O(N) complexity with respect to key count.
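To give a flavour of that design before the real implementation, here is a heavily simplified, hypothetical sketch of one copy-on-write update in such a store: the new value is appended as a single atomic O_APPEND write, and the region occupied by the value it supersedes is hole punched so it stops consuming physical storage while the apparent file size keeps growing. The record layout and helper names here are invented for illustration and are not the on-disk format used by the example in the next section; note also that recovering the offset at which an append landed, race-free, needs an i/o API which reports it, which plain POSIX write() does not.

#include <fcntl.h>     // fallocate() and FALLOC_FL_* (Linux)
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-size record header; everything stays 32 byte aligned.
struct record_header
{
  char          magic[8];      // lets a validity check find intact records after power loss
  std::uint64_t value_length;  // bytes of value data following this header
  std::uint64_t key_hash;      // which key this value belongs to
  std::uint64_t reserved;
};
static_assert(sizeof(record_header) == 32, "records are kept 32 byte aligned");

// Append a new value for a key as one write(). With O_APPEND the whole record
// lands atomically at the current end of file, so concurrent appenders never
// interleave inside a record. A real store would also pad the value out to
// the next 32 byte boundary.
bool append_value(int fd, std::uint64_t key_hash, const void *value, std::uint64_t length)
{
  record_header h = {};
  std::memcpy(h.magic, "KVRECORD", 8);
  h.value_length = length;
  h.key_hash = key_hash;
  std::vector<char> rec(sizeof(h) + length);
  std::memcpy(rec.data(), &h, sizeof(h));
  std::memcpy(rec.data() + sizeof(h), value, length);
  return ::write(fd, rec.data(), rec.size()) == (ssize_t) rec.size();
}

// Once a newer value for the key has been appended, reclaim the physical
// storage of the superseded record by punching a hole over it. The apparent
// file size never shrinks; the physically consumed storage does.
void retire_value(int fd, off_t offset, std::uint64_t record_length)
{
  (void) ::fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, (off_t) record_length);
}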
We do have to take some care, however. As on Linux writes are not atomic with respect to reads across 4Kb boundaries, we are going to need some mechanism for checking whether a given append is valid or not, and that mechanism can usefully be reused to check a store for validity after a sudden power loss (the fact that reads and writes may have sequentially consistent visibility to processes has no bearing whatsoever on the order in which writes are flushed to physical storage). Any write we make within a single 4Kb boundary, e.g. to the very front of the store file, is probably safe so long as it does not cross that boundary.
And just to be extra careful, we will never rewrite existing content in units exceeding a sector size, which we will assume to be 512 bytes, nor across a 512 byte boundary, as single-sector write atomicity is publicly documented as guaranteed by all the major operating systems. For that reason, as you will note in the implementation in the next section, everything is aligned to a 32 byte boundary and we never rewrite more than 32 bytes at a time, which guarantees we will never rewrite across a 512 byte boundary and therefore never risk atomicity.
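As a quick sanity check of that arithmetic, here is a tiny sketch verifying that a rewrite of at most 32 bytes starting on a 32 byte boundary can never straddle a 512 byte sector boundary, simply because 32 divides 512 exactly:

#include <cassert>
#include <cstdint>

// True if a rewrite of 'length' bytes at 'offset' stays within one 512 byte sector.
bool rewrite_is_sector_atomic(std::uint64_t offset, std::uint64_t length)
{
  if (offset % 32 != 0 || length == 0 || length > 32) return false;  // our design rule
  return (offset / 512) == ((offset + length - 1) / 512);            // first and last byte in the same sector
}

int main()
{
  // offset % 512 can only be 0, 32, ..., 480, and 480 + 32 never exceeds 512,
  // so every 32 byte aligned rewrite of up to 32 bytes passes.
  for (std::uint64_t offset = 0; offset < (1u << 20); offset += 32)
    assert(rewrite_is_sector_atomic(offset, 32));
  return 0;
}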