Caution |
---|
This section was not finished in time for the beginning of the Boost peer review due to a hard drive failure induced by the testing of AFIO-based key-value stores in this workshop tutorial (sigh!), but it will be finished shortly after I get up and running again on a new hard drive (as I write this, the drive is being cloned in between the periods when it hangs and I have to power cycle to get it back for more cloning). The page content is representative of the final edition; however, the source code has not been debugged nor tuned for performance yet, and is therefore not listed here. The benchmarks are obviously missing for this solution on the next page. |
As you might have guessed from the theory on the preceding page, the final design of the key-value store is radically different to before. Instead of storing each value in its own file, we have this store structure instead:
store/index
This always-growing file is a simple linear history of the snapshot state of the key-value store, with a full copy of all the key-value pairs written per transaction as a single-shot atomic append. The blobs are referred to by unique integer index into a linked small blob index (see below). The key-value pairs are stored on-disk as a dense hash map which is very friendly to memory maps: eight bytes per entry, with the key strings stored contiguously after the hash array.
Each index record appended to the index is content-hashed, and if during later parsing the hash does not match the contents, the record is ignored. If after power loss the end of the index is trashed, the solution below will replay the history of transactions from the beginning, skipping any deallocated extents, until it finds the most recent complete undamaged index which does not refer to damaged blobs (see below).
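To make that layout concrete, here is a minimal sketch of a dense, memory-map friendly hash array with eight-byte entries and the key strings packed contiguously after the array. The field names, hash function and linear probing scheme are illustrative assumptions, not AFIO's actual on-disk format:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical eight-byte entry: where the key string lives, and which
// blob the key maps to. 0xffffffff marks an empty slot.
struct entry { std::uint32_t key_offset; std::uint32_t blob_index; };

static std::uint32_t fnv1a(const std::string &s) {
  std::uint32_t h = 2166136261u;                     // stand-in hash function
  for (unsigned char c : s) { h ^= c; h *= 16777619u; }
  return h;
}

struct dense_map {
  std::vector<entry> slots;  // the fixed hash array (memory-map friendly)
  std::string keys;          // key strings packed contiguously after it
  explicit dense_map(std::size_t n) : slots(n, entry{0xffffffffu, 0}) {}
  void insert(const std::string &k, std::uint32_t blob) {
    std::size_t i = fnv1a(k) % slots.size();
    while (slots[i].key_offset != 0xffffffffu)       // linear probing
      i = (i + 1) % slots.size();
    slots[i] = entry{static_cast<std::uint32_t>(keys.size()), blob};
    keys += k; keys += '\0';                         // NUL-terminated keys
  }
  bool lookup(const std::string &k, std::uint32_t &blob) const {
    std::size_t i = fnv1a(k) % slots.size();
    for (std::size_t probes = 0; probes < slots.size(); ++probes) {
      const entry &e = slots[i];
      if (e.key_offset == 0xffffffffu) return false; // hit an empty slot
      if (keys.compare(e.key_offset, k.size() + 1, k.c_str(), k.size() + 1) == 0) {
        blob = e.blob_index;
        return true;
      }
      i = (i + 1) % slots.size();
    }
    return false;
  }
};
```

Because a lookup touches only the fixed-size array plus one string compare, a memory-mapped index can be searched without deserialising anything first.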
store/large_blob_store
store/large_blob_store.ecc
We don't implement the large blob store in the solution below in order to keep the code brief; however, it is intended to house any blob sized 4Kb or above, as memory maps are a much superior way of working with larger values, and keeping large blobs separately makes extent fragmentation from hole punching much less expensive. Unlike the other files, the large blob store is not necessarily always atomic appended: a deallocated extent might be reallocated, for example. This eliminates the problem of ever exceeding a 2^64 maximum file size.
Each large blob is always aligned to a 4Kb boundary to allow memory mapping. The large blob store is quite stupid and contains no metadata at all: it relies on the small blob store for all metadata. This stupidity, and the fact that it is a separate file, allows you to speculatively allocate new blob value storage before a new value is written, such that one writes directly into the final storage from the beginning with no unneeded memory copying. One then either commits the new value, or throws it away by punching a hole over it. You will notice below that each blob keeps an age taken from a monotonically increasing count of updates; this is used to keep a blob pinned into existence for a period even if it is not in use.
In short, the large blob file provides a large number of useful optimisations, but is not strictly speaking necessary. You could just use the small blob file for all blobs, and that is what this simplified implementation does for brevity.
You are probably curious about the ECC file. The ECC file is a SECDED (32784,32768) Hamming code, i.e. 16 bits (2 bytes) of ECC per 4096 bytes. This file allows single bit flips to be healed, which can save a blob from being invalidated due to failing its hash check. We don't bother using this for either the index or the small blob store, as almost all bit flips seen in a computer system stem from non-ECC RAM rather than long term storage, and the overhead of ECC calculation is probably not worth it except for large blobs given the statistical chance of a bit flip. AFIO provides a very fast SECDED implementation in afio::utils::secded_ecc<>.
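To illustrate what SECDED buys us, here is a toy Hamming SECDED code over a single byte (thirteen bits in total, nothing like AFIO's tuned (32784,32768) implementation): four Hamming parity bits locate and heal any single flipped bit, while an overall parity bit lets double flips be detected rather than miscorrected:

```cpp
#include <cassert>
#include <cstdint>

// Encode 8 data bits into a 13-bit codeword. Positions are 1-based;
// positions 1, 2, 4 and 8 hold Hamming parity, the rest hold data, and
// bit 0 of the returned word holds the overall parity.
static std::uint16_t secded_encode(std::uint8_t data) {
  std::uint16_t cw = 0;
  int d = 0;
  for (int pos = 1; pos <= 12; ++pos) {
    if (pos == 1 || pos == 2 || pos == 4 || pos == 8) continue;
    if ((data >> d++) & 1) cw |= std::uint16_t(1) << pos;
  }
  for (int p = 1; p <= 8; p <<= 1) {                 // Hamming parity bits
    int parity = 0;
    for (int pos = 1; pos <= 12; ++pos)
      if ((pos & p) && ((cw >> pos) & 1)) parity ^= 1;
    if (parity) cw |= std::uint16_t(1) << p;
  }
  int overall = 0;                                   // overall parity in bit 0
  for (int pos = 1; pos <= 12; ++pos) overall ^= (cw >> pos) & 1;
  cw |= std::uint16_t(overall);
  return cw;
}

// Decode, healing a single bit flip. Returns false on an uncorrectable
// (double-bit) error.
static bool secded_decode(std::uint16_t cw, std::uint8_t &data) {
  int syndrome = 0;
  for (int p = 1; p <= 8; p <<= 1) {
    int parity = 0;
    for (int pos = 1; pos <= 12; ++pos)
      if ((pos & p) && ((cw >> pos) & 1)) parity ^= 1;
    if (parity) syndrome |= p;
  }
  int overall = 0;
  for (int pos = 0; pos <= 12; ++pos) overall ^= (cw >> pos) & 1;
  if (syndrome != 0) {
    if (!overall) return false;                      // two flips: detect only
    cw ^= std::uint16_t(1) << syndrome;              // heal the flipped bit
  }
  data = 0;
  int d = 0;
  for (int pos = 1; pos <= 12; ++pos) {
    if (pos == 1 || pos == 2 || pos == 4 || pos == 8) continue;
    if ((cw >> pos) & 1) data |= std::uint8_t(1) << d;
    ++d;
  }
  return true;
}
```

The real thing works on the same principle, just with the parity overhead amortised over a 4096 byte block instead of a byte.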
store/small_blob_store
The small blob store is where all the blob values live. To add a new value, one simply constructs a blob value record and atomically appends it in a single shot. Each blob has a hash of its content, and if the hash does not match its contents the blob is considered invalid and pending deallocation via hole punching.
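A hypothetical single-shot record format might look like the following; the header fields, their widths and the FNV-1a checksum are illustrative stand-ins rather than AFIO's real format:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical header written before each blob value. The whole record
// (header plus value) is built in memory and appended in one shot; a
// record whose hash no longer matches its bytes is treated as invalid
// and pending hole punching.
struct blob_record_header {
  std::uint64_t hash;    // content hash of the value bytes that follow
  std::uint64_t length;  // length of the value in bytes
  std::uint64_t age;     // monotonically increasing update count
};

static std::uint64_t hash_bytes(const std::string &b) {
  std::uint64_t h = 1469598103934665603ull;          // FNV-1a, a stand-in
  for (unsigned char c : b) { h ^= c; h *= 1099511628211ull; }
  return h;
}

static std::string make_record(const std::string &value, std::uint64_t age) {
  blob_record_header h{hash_bytes(value), value.size(), age};
  std::string rec(sizeof h, '\0');
  std::memcpy(&rec[0], &h, sizeof h);                // header first
  return rec + value;                                // then the value bytes
}

static bool record_valid(const std::string &rec) {
  if (rec.size() < sizeof(blob_record_header)) return false;
  blob_record_header h;
  std::memcpy(&h, rec.data(), sizeof h);
  if (rec.size() - sizeof h < h.length) return false;  // torn append
  return hash_bytes(rec.substr(sizeof h, h.length)) == h.hash;
}
```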
The operations involved in opening a data store for use become these:

1. Open the index and small_blob_store files.
2. Open the index and small_blob_store files for atomic append into a second open handle.
3. If either the index or small_blob_store file has zero length, atomic append into each its 32 byte header. This is racy, but safe, as spurious additional 32 byte headers are ignored by the parsing code.
4. If either the index or small_blob_store file specifies a last known length greater than the size reported by the operating system, or the last good index specified by the header points to a corrupted index, we consider either or both to be dirty. We firstly replay the small blob store from the beginning, skipping any deallocated extents, and we atomic append a blob store index (a map of content hashes to offsets in either the small or large blob store), updating the header to point at the new blob store index. We then replay the index from the beginning, skipping any deallocated extents, matching off valid index records against our most recently generated blob store index. We choose the last uncorrupted index which does not refer to hashes of unavailable content.

If two processes concurrently open the same dirty data store and both repair the store, you will note that the above healing process is invariant to concurrent heals. Again, the parsing code correctly skips spurious atomic writes.
The operations involved in reading a key-value item:

The operations involved in writing a key-value item make use of async_zero() to punch holes over extents no longer needed. We always leave around a few thousand indices for replay after sudden power loss.
It may now seem that there is a big race condition between the atomic append of a new canonical index for either store and the updating of the header at the front of the file. This is correct, but it is solved by the following index parsing algorithm for both store files:
In other words, a racily written hint adds a small amount of work to the parser, but nothing important. All we care about is that it is approximately correct.
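That hint-tolerant parse can be sketched as follows, using a stand-in record format and hash; the real parser works over file extents rather than a vector:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative content-hashed record (not AFIO's actual format).
struct record { std::uint64_t hash; std::uint64_t payload; };

static std::uint64_t content_hash(std::uint64_t payload) {
  return payload * 0x9e3779b97f4a7c15ull;            // stand-in for a real hash
}

// Returns the position of the most recent valid record, or size_t(-1).
static std::size_t find_last_good(const std::vector<record> &log,
                                  std::size_t hint) {
  // 1. If the header's hinted record exists and validates, start there;
  //    otherwise fall back to replaying from the beginning.
  std::size_t start = 0;
  if (hint < log.size() && log[hint].hash == content_hash(log[hint].payload))
    start = hint;
  // 2. Walk forward from the chosen start, remembering the last record
  //    whose hash matches its contents. Torn or spurious appends after a
  //    crash simply fail the hash check and are ignored.
  std::size_t last_good = std::size_t(-1);
  for (std::size_t i = start; i < log.size(); ++i)
    if (log[i].hash == content_hash(log[i].payload)) last_good = i;
  return last_good;
}
```

A stale hint merely moves the starting point of the scan earlier than necessary, which is exactly why the racily written header is harmless.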
We don't implement item deletion, just as we didn't in either of the previous two generations of the key-value store; however, if one were to implement it, it would have these operations:
Deallocation would use async_zero(). The extents to be deallocated are calculated by walking the small blob index and keeping a record of extents not referred to (i.e. the gaps between the blobs stored). For each gap containing an aligned 4Kb quantity or more, we rewrite the record header to have any parser skip the deallocated extent, and deallocate the rest.
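The gap calculation itself can be sketched as follows; the container choices are illustrative, and the record-header rewriting is omitted:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Given the blobs still referenced, find the gaps between them and keep
// only the 4Kb-aligned portions of each gap, which are the parts worth
// hole punching.
using extent = std::pair<std::uint64_t, std::uint64_t>;  // {offset, length}

static std::vector<extent> punchable_gaps(std::vector<extent> blobs,
                                          std::uint64_t file_size) {
  std::sort(blobs.begin(), blobs.end());
  blobs.push_back({file_size, 0});                   // sentinel for the tail gap
  std::vector<extent> out;
  std::uint64_t pos = 0;
  for (const extent &b : blobs) {
    std::uint64_t begin = (pos + 4095) & ~4095ull;   // round gap start up to 4Kb
    std::uint64_t end = b.first & ~4095ull;          // round gap end down to 4Kb
    if (end > begin) out.push_back({begin, end - begin});
    pos = std::max(pos, b.first + b.second);         // advance past this blob
  }
  return out;
}
```

Anything smaller than an aligned 4Kb run is left alone, since the filesystem cannot deallocate it anyway.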
So let's look at our new interface file. It looks pretty similar to before; the only real change is the many new private member variable handle_ptr's to the store files described above. Another change is that we now return an outcome<ostream> for writing but a shared_future<istream> for reading. This is because, as explained earlier, writes are now fully buffered to memory before being hashed and committed as a transaction, so by explicitly returning a monad we are saying that this is now a synchronous operation (one could just return an already-ready shared future, but this method makes public API assurances, and because a Boost.Outcome future is a monad, your code will likely not even notice).
namespace afio = BOOST_AFIO_V2_NAMESPACE;
namespace filesystem = BOOST_AFIO_V2_NAMESPACE::filesystem;
using BOOST_OUTCOME_V1_NAMESPACE::outcome;
using BOOST_OUTCOME_V1_NAMESPACE::lightweight_futures::shared_future;

class data_store
{
  struct _ostream;
  friend struct _ostream;
  afio::dispatcher_ptr _dispatcher;
  // The small blob store keeps non-memory mappable blobs at 32 byte alignments
  afio::handle_ptr _small_blob_store, _small_blob_store_append, _small_blob_store_ecc;
  // The large blob store keeps memory mappable blobs at 4Kb alignments
  afio::handle_ptr _large_blob_store, _large_blob_store_append, _large_blob_store_ecc;
  // The index is where we keep the map of keys to blobs
  afio::handle_ptr _index_store, _index_store_append, _index_store_ecc;
  struct index;
  std::unique_ptr<index> _index;

public:
  // Type used for read streams
  using istream = std::shared_ptr<std::istream>;
  // Type used for write streams
  using ostream = std::shared_ptr<std::ostream>;
  // Type used for lookup
  using lookup_result_type = shared_future<istream>;
  // Type used for write
  using write_result_type = outcome<ostream>;

  // Disposition flags
  static constexpr size_t writeable = (1<<0);

  // Open a data store at path
  data_store(size_t flags = 0, afio::path path = "store");

  // Look up item named name for reading, returning an istream for the item
  shared_future<istream> lookup(std::string name) noexcept;
  // Look up item named name for writing, returning an ostream for that item
  outcome<ostream> write(std::string name) noexcept;
};
More code is to come ...
Table 1.9. This third generation solution will perform excellently under these conditions:
 | Condition |
---|---|
✔ | On Microsoft Windows you can place the store deep in a directory hierarchy and use long key names. |
✔ | Third party threads and processes can rename the location of the store during use. |
✔ | The size of all the values being read at any given time fits into your virtual address space (which is at least 2Gb on 32 bit, 8Tb on 64 bit). |
✔ | As many processes and threads may read and write to the store concurrently as you like, including CIFS clients but excluding NFS clients. |
✔ | Processes may unexpectedly exit during modifies with no consequence on consistency. |
✔ ✔ | Lookup performance is breathtakingly swift and close to invariant to item count. Write performance is much slower, but a lot faster than the atomic rename based design. |
✔ ✔ | Sudden power loss will restore the key-value store to some point in history, preserving the exact ordering of updates. |
✔ | This design allows the atomic transactional update of as many key-value items at once as you like. |