Caution |
---|
This section was not finished in time for the beginning of the Boost peer review due to a hard drive failure induced by the testing of AFIO-based key-value stores in this workshop tutorial (sigh!), but it will be finished shortly after I get up and running again on a new hard drive (as I write this, the drive is being cloned in between the periods when it hangs and I have to power cycle to get it back for more cloning). The page content is representative of the final edition; however, the source code has not been debugged nor tuned for performance yet, and is therefore not listed here. The benchmarks are obviously missing for this solution on the next page. |
As you might have guessed from the theory on the preceding page, the final design of the key-value store is radically different to before. Instead of storing each value in its own file, we have this store structure instead:
store/index
This always-growing file is a simple linear history of the snapshot state of the key-value store, with a full copy of all the key-value pairs written per transaction as a single-shot atomic append. The blobs are referred to by unique integer index into a linked small blob index (see below). The key-value pairs are stored on-disk as a dense hash map which is very friendly to memory maps: eight bytes per entry, with the key strings stored contiguously after the hash array.
Each index record appended to the index is content-hashed, and if during later parsing the hash does not match the contents, the record is ignored. If after power loss the end of the index is trashed, the solution below will replay the history of transactions from the beginning, skipping any deallocated extents, until it finds the most recent complete undamaged index which does not refer to damaged blobs (see below).
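To make that layout concrete, here is a minimal sketch of a dense, memory-map friendly hash array with eight-byte entries and the key strings packed contiguously after the array. The field names, hash function and linear probing scheme are illustrative assumptions, not AFIO's actual on-disk format:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical eight-byte entry: where the key string lives, and which
// blob the key maps to. 0xffffffff marks an empty slot.
struct entry { std::uint32_t key_offset; std::uint32_t blob_index; };

static std::uint32_t fnv1a(const std::string &s) {
  std::uint32_t h = 2166136261u;                     // stand-in hash function
  for (unsigned char c : s) { h ^= c; h *= 16777619u; }
  return h;
}

struct dense_map {
  std::vector<entry> slots;  // the fixed hash array (memory-map friendly)
  std::string keys;          // key strings packed contiguously after it
  explicit dense_map(std::size_t n) : slots(n, entry{0xffffffffu, 0}) {}
  void insert(const std::string &k, std::uint32_t blob) {
    std::size_t i = fnv1a(k) % slots.size();
    while (slots[i].key_offset != 0xffffffffu)       // linear probing
      i = (i + 1) % slots.size();
    slots[i] = entry{static_cast<std::uint32_t>(keys.size()), blob};
    keys += k; keys += '\0';                         // NUL-terminated keys
  }
  bool lookup(const std::string &k, std::uint32_t &blob) const {
    std::size_t i = fnv1a(k) % slots.size();
    for (std::size_t probes = 0; probes < slots.size(); ++probes) {
      const entry &e = slots[i];
      if (e.key_offset == 0xffffffffu) return false; // hit an empty slot
      if (keys.compare(e.key_offset, k.size() + 1, k.c_str(), k.size() + 1) == 0) {
        blob = e.blob_index;
        return true;
      }
      i = (i + 1) % slots.size();
    }
    return false;
  }
};
```

Because a lookup touches only the fixed-size array plus one string compare, a memory-mapped index can be searched without deserialising anything first.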
store/large_blob_store
store/large_blob_store.ecc
We don't implement the large blob store in the solution below in order to keep the code brief; however, it is intended to house any blob sized 4Kb or above, as memory maps are a much superior way of working with larger values, and keeping large blobs separately makes extent fragmentation from hole punching much less expensive. Unlike the other files, the large blob store is not necessarily always atomic appended: a deallocated extent might be reallocated, for example. This eliminates the problem of ever exceeding a 2^64 maximum file size.
Each large blob is always aligned to a 4Kb boundary to allow memory mapping. The large blob store is quite stupid and contains no metadata at all: it relies on the small blob store for all metadata. This stupidity, and the fact that it is a separate file, allows you to speculatively allocate new blob value storage before a new value is written, such that one writes directly into the final storage from the beginning with no unneeded memory copying. One then either commits the new value, or throws it away by punching a hole over it. You will notice below that each blob keeps an age taken from a monotonically increasing count of updates; this is used to keep a blob pinned into existence for a period even if it is not in use.
In short, the large blob file provides a large number of useful optimisations, but is not strictly speaking necessary. You could just use the small blob file for all blobs, and that is what this simplified implementation does for brevity.
You are probably curious about the ECC file. The ECC file is a SECDED (32784,32768) Hamming code, i.e. 16 bits (2 bytes) of ECC per 4096 bytes. This file allows single bit flips to be healed, which can save a blob from being invalidated due to failing its hash check. We don't bother using this for either the index or the small blob store, as almost all bit flips seen in a computer system stem from non-ECC RAM rather than long term storage, and the overhead of ECC calculation is probably not worth it except for large blobs given the statistical chance of a bit flip. AFIO provides a very fast SECDED implementation in afio::utils::secded_ecc<>.
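To illustrate what SECDED buys us, here is a toy Hamming SECDED code over a single byte (thirteen bits in total, nothing like AFIO's tuned (32784,32768) implementation): four Hamming parity bits locate and heal any single flipped bit, while an overall parity bit lets double flips be detected rather than miscorrected:

```cpp
#include <cassert>
#include <cstdint>

// Encode 8 data bits into a 13-bit codeword. Positions are 1-based;
// positions 1, 2, 4 and 8 hold Hamming parity, the rest hold data, and
// bit 0 of the returned word holds the overall parity.
static std::uint16_t secded_encode(std::uint8_t data) {
  std::uint16_t cw = 0;
  int d = 0;
  for (int pos = 1; pos <= 12; ++pos) {
    if (pos == 1 || pos == 2 || pos == 4 || pos == 8) continue;
    if ((data >> d++) & 1) cw |= std::uint16_t(1) << pos;
  }
  for (int p = 1; p <= 8; p <<= 1) {                 // Hamming parity bits
    int parity = 0;
    for (int pos = 1; pos <= 12; ++pos)
      if ((pos & p) && ((cw >> pos) & 1)) parity ^= 1;
    if (parity) cw |= std::uint16_t(1) << p;
  }
  int overall = 0;                                   // overall parity in bit 0
  for (int pos = 1; pos <= 12; ++pos) overall ^= (cw >> pos) & 1;
  cw |= std::uint16_t(overall);
  return cw;
}

// Decode, healing a single bit flip. Returns false on an uncorrectable
// (double-bit) error.
static bool secded_decode(std::uint16_t cw, std::uint8_t &data) {
  int syndrome = 0;
  for (int p = 1; p <= 8; p <<= 1) {
    int parity = 0;
    for (int pos = 1; pos <= 12; ++pos)
      if ((pos & p) && ((cw >> pos) & 1)) parity ^= 1;
    if (parity) syndrome |= p;
  }
  int overall = 0;
  for (int pos = 0; pos <= 12; ++pos) overall ^= (cw >> pos) & 1;
  if (syndrome != 0) {
    if (!overall) return false;                      // two flips: detect only
    cw ^= std::uint16_t(1) << syndrome;              // heal the flipped bit
  }
  data = 0;
  int d = 0;
  for (int pos = 1; pos <= 12; ++pos) {
    if (pos == 1 || pos == 2 || pos == 4 || pos == 8) continue;
    if ((cw >> pos) & 1) data |= std::uint8_t(1) << d;
    ++d;
  }
  return true;
}
```

The real thing works on the same principle, just with the parity overhead amortised over a 4096 byte block instead of a byte.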
store/small_blob_store
The small blob store is where all the blob values live. To add a new value, one simply constructs a blob value record and atomically appends it in a single shot. Each blob has a hash of its content, and if the hash does not match its contents the blob is considered invalid and pending deallocation via hole punching.
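A hypothetical single-shot record format might look like the following; the header fields, their widths and the FNV-1a checksum are illustrative stand-ins rather than AFIO's real format:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical header written before each blob value. The whole record
// (header plus value) is built in memory and appended in one shot; a
// record whose hash no longer matches its bytes is treated as invalid
// and pending hole punching.
struct blob_record_header {
  std::uint64_t hash;    // content hash of the value bytes that follow
  std::uint64_t length;  // length of the value in bytes
  std::uint64_t age;     // monotonically increasing update count
};

static std::uint64_t hash_bytes(const std::string &b) {
  std::uint64_t h = 1469598103934665603ull;          // FNV-1a, a stand-in
  for (unsigned char c : b) { h ^= c; h *= 1099511628211ull; }
  return h;
}

static std::string make_record(const std::string &value, std::uint64_t age) {
  blob_record_header h{hash_bytes(value), value.size(), age};
  std::string rec(sizeof h, '\0');
  std::memcpy(&rec[0], &h, sizeof h);                // header first
  return rec + value;                                // then the value bytes
}

static bool record_valid(const std::string &rec) {
  if (rec.size() < sizeof(blob_record_header)) return false;
  blob_record_header h;
  std::memcpy(&h, rec.data(), sizeof h);
  if (rec.size() - sizeof h < h.length) return false;  // torn append
  return hash_bytes(rec.substr(sizeof h, h.length)) == h.hash;
}
```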
The operations involved in opening a data store for use become these:

1. Open the index and small_blob_store files.
2. Open the index and small_blob_store files for atomic append into a second open handle.
3. If either the index or small_blob_store file has zero length, atomic append into each its 32 byte header. This is racy, but safe, as spurious additional 32 byte headers are ignored by the parsing code.
4. If either the index or small_blob_store file specifies a last known length greater than the size reported by the operating system, or the last good index specified by the header points to a corrupted index, we consider either or both to be dirty. We firstly replay the small blob store from the beginning, skipping any deallocated extents, and we atomic append a blob store index (a map of content hashes to offsets in either the small or large blob store), updating the header to point at the new blob store index. We then replay the index from the beginning, skipping any deallocated extents, matching off valid index records against our most recently generated blob store index. We choose the last uncorrupted index which does not refer to hashes of unavailable content.

If two processes concurrently open the same dirty data store and both repair the store, you will note that the above healing process is invariant to concurrent heals. Again, the parsing code correctly skips spurious atomic writes.
The operations involved in reading a key-value item:

The operations involved in writing a key-value item make use of async_zero() to punch holes over extents no longer needed. We always leave around a few thousand indices for replay after sudden power loss.
It may now seem that there is a big race condition between the atomic append of a new canonical index for either store and the updating of the header at the front of the file. This is correct, but it is solved by the following index parsing algorithm for both store files:
In other words, a racily written hint adds a small amount of work to the parser, but nothing important. All we care about is that it is approximately correct.
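That hint-tolerant parse can be sketched as follows, using a stand-in record format and hash; the real parser works over file extents rather than a vector:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative content-hashed record (not AFIO's actual format).
struct record { std::uint64_t hash; std::uint64_t payload; };

static std::uint64_t content_hash(std::uint64_t payload) {
  return payload * 0x9e3779b97f4a7c15ull;            // stand-in for a real hash
}

// Returns the position of the most recent valid record, or size_t(-1).
static std::size_t find_last_good(const std::vector<record> &log,
                                  std::size_t hint) {
  // 1. If the header's hinted record exists and validates, start there;
  //    otherwise fall back to replaying from the beginning.
  std::size_t start = 0;
  if (hint < log.size() && log[hint].hash == content_hash(log[hint].payload))
    start = hint;
  // 2. Walk forward from the chosen start, remembering the last record
  //    whose hash matches its contents. Torn or spurious appends after a
  //    crash simply fail the hash check and are ignored.
  std::size_t last_good = std::size_t(-1);
  for (std::size_t i = start; i < log.size(); ++i)
    if (log[i].hash == content_hash(log[i].payload)) last_good = i;
  return last_good;
}
```

A stale hint merely moves the starting point of the scan earlier than necessary, which is exactly why the racily written header is harmless.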
We don't implement item deletion, just as we didn't in either of the previous two generations of the key-value store; however, if one were to implement it, it would have these operations:
Deallocation would use async_zero(). The extents to be deallocated are calculated by walking the small blob index and keeping a record of extents not referred to (i.e. the gaps between the blobs stored). For each gap containing an aligned 4Kb quantity or more, we rewrite the record header to have any parser skip the deallocated extent, and deallocate the rest.
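The gap calculation itself can be sketched as follows; the container choices are illustrative, and the record-header rewriting is omitted:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Given the blobs still referenced, find the gaps between them and keep
// only the 4Kb-aligned portions of each gap, which are the parts worth
// hole punching.
using extent = std::pair<std::uint64_t, std::uint64_t>;  // {offset, length}

static std::vector<extent> punchable_gaps(std::vector<extent> blobs,
                                          std::uint64_t file_size) {
  std::sort(blobs.begin(), blobs.end());
  blobs.push_back({file_size, 0});                   // sentinel for the tail gap
  std::vector<extent> out;
  std::uint64_t pos = 0;
  for (const extent &b : blobs) {
    std::uint64_t begin = (pos + 4095) & ~4095ull;   // round gap start up to 4Kb
    std::uint64_t end = b.first & ~4095ull;          // round gap end down to 4Kb
    if (end > begin) out.push_back({begin, end - begin});
    pos = std::max(pos, b.first + b.second);         // advance past this blob
  }
  return out;
}
```

Anything smaller than an aligned 4Kb run is left alone, since the filesystem cannot deallocate it anyway.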
So let's look at our new interface file. It looks pretty similar to before; the only real change is the many new private member variable handle_ptr's to the store files described above. Another change is that we now return an outcome<ostream> for writing but a shared_future<istream> for reading. This is because, as explained earlier, writes are now fully buffered to memory before being hashed and committed as a transaction, so by explicitly returning a monad we are saying that this is now a synchronous operation (one could just return an already-ready shared future, but this method makes public API assurances, and because a Boost.Outcome future is a monad, your code will likely not even notice).
namespace afio = BOOST_AFIO_V2_NAMESPACE;
namespace filesystem = BOOST_AFIO_V2_NAMESPACE::filesystem;
using BOOST_OUTCOME_V1_NAMESPACE::outcome;
using BOOST_OUTCOME_V1_NAMESPACE::lightweight_futures::shared_future;

class data_store
{
  struct _ostream;
  friend struct _ostream;
  afio::dispatcher_ptr _dispatcher;
  // The small blob store keeps non-memory mappable blobs at 32 byte alignments
  afio::handle_ptr _small_blob_store, _small_blob_store_append, _small_blob_store_ecc;
  // The large blob store keeps memory mappable blobs at 4Kb alignments
  afio::handle_ptr _large_blob_store, _large_blob_store_append, _large_blob_store_ecc;
  // The index is where we keep the map of keys to blobs
  afio::handle_ptr _index_store, _index_store_append, _index_store_ecc;
  struct index;
  std::unique_ptr<index> _index;

public:
  // Type used for read streams
  using istream = std::shared_ptr<std::istream>;
  // Type used for write streams
  using ostream = std::shared_ptr<std::ostream>;
  // Type used for lookup
  using lookup_result_type = shared_future<istream>;
  // Type used for write
  using write_result_type = outcome<ostream>;

  // Disposition flags
  static constexpr size_t writeable = (1<<0);

  // Open a data store at path
  data_store(size_t flags = 0, afio::path path = "store");

  // Look up item named name for reading, returning an istream for the item
  shared_future<istream> lookup(std::string name) noexcept;
  // Look up item named name for writing, returning an ostream for that item
  outcome<ostream> write(std::string name) noexcept;
};
More code is to come ...
Table 1.9. This third generation solution will perform excellently under these conditions:
 | Condition |
---|---|
✔ | On Microsoft Windows you can place the store deep in a directory hierarchy and use long key names. |
✔ | Third party threads and processes can rename the location of the store during use. |
✔ | The size of all the values being read at any given time fits into your virtual address space (which is at least 2Gb on 32 bit, 8Tb on 64 bit). |
✔ | As many processes and threads may read and write to the store concurrently as you like, including CIFS clients but excluding NFS clients. |
✔ | Processes may unexpectedly exit during modifies with no consequence on consistency. |
✔ ✔ | Lookup performance is breathtakingly swift and close to invariant to item count. Write performance is much slower, but a lot faster than the atomic rename based design. |
✔ ✔ | Sudden power loss will restore the key-value store to some point in history, preserving the exact ordering of updates. |
✔ | This design allows the atomic transactional update of as many key-value items at once as you like. |