Boost C++ Libraries Home Libraries People FAQ More

The final named blob store design
[Caution] Caution

This section was not finished in time for the beginning of the Boost peer review due a hard drive failure induced by the testing of AFIO-based key-value stores in this workshop tutorial (sigh!), but it will be finished shortly after I get up and running again on a new hard drive (as I write this, the drive is being cloned in between periods it hangs and I have to power cycle to get it back for more cloning). The page content is representative of the final edition, however the source code listed has not been debugged nor tuned for performance yet and is therefore not listed here. The benchmarks are obviously missing for this solution on the next page.

As you might have guessed from the theory on the preceding page, the final design of the key-value store is radically different to before. Instead of storing each value in its own file, we have this store structure instead:

The operations involved in opening a data store for use become these:

  1. Create/open the store directory.
  2. Create/open for atomic append only the index and small_blob_store files.
  3. Open for read write the index and small_blob_store files into a second open handle.
  4. If either the index or small_blob_store files have zero length, atomic append into each their 32 byte header. This is racy, but safe as spurious additional 32 byte headers are ignored by the parsing code.
  5. If either the index or small_blob_store files specify a last known length which is greater than the size reported by the operating system, or the last good index specified by the header points to a corrupted index, we consider either or both to be dirty. We firstly replay the small blob store from the beginning, skipping any deallocated extents, and we atomic append a blob store index (a map of content hashes to offsets in either the small or large blob store), updating the header to point at the new blob store index.

    We then replay the index from the beginning, skipping any deallocated extents, matching off valid index records against our most recently generated blob store index. We choose the last uncorrupted index which does not refer to hashes of unavailable content.

If two processes concurrently open the same dirty data store and both repair the store, you will note that the above healing process is invariant to concurrent heals. Again, the parsing code correctly skips spurious atomic writes.

The operations involved for reading a key-value:

  1. We reload the index and blob store index loaded into memory if it has changed (by simply checking if the file size for either has grown).
  2. We look up the blob index mapped by the key.
  3. We look up the blob store file linked to by the index and get the offset and file for the content.
  4. We read the value to be delivered using the input stream, similar to the second generation key-value store.

The operations involved for writing a key-value:

  1. We hand out an output stream which records all output to memory.
  2. On the destruction of the output stream, we build the gather write buffers for writing a blob with this value (i.e. hashing the contents etc).
  3. We refresh the blob store index and see if this blob is already stored with an age less than 50. If so, we update its age to 0 along with all other blobs referred to by the index. If not, we atomic append the new blob and loop trying to atomic append an updated canonical blob store index until success (we must loop as we must reconcile with other concurrent writers). On success, we update the header at the front of the small blob store file to point at the new canonical blob store index with an incremented time count (this being a monotonically increasing count of successful blob store updates).
  4. We refresh the index and update or add the key mapping to the map of keys to blob store index item, atomically appending the new index. If someone else wrote a new index between our starting index and our just written index, we return a transaction conflict error. On success, we update the header at the front of the index file to point at the new canonical index.
  5. Every few thousand new indices we punch a hole earlier in index history by calling async_zero(). We always leave around a few thousand indices for replay after sudden power loss.

It may now seem that there is a big race condition in between atomic appends of a new canonical index for either store and the updating of the header at the front of the file. This is correct, but is solved by the following index parsing algorithm for both store files:

  1. Read the header at the front of the file for the hint to the current canonical index.
  2. Parse each record from the hint onwards until the end of the file.

In other words, a racily written hint adds a small bit of work to the parser, but nothing important. All we care is that it is approximately correct.

We don't implement item deletion as we didn't in either of the previous two generation key-value stores, however if one were to implement it it would have these operations:

  1. For all blobs in the blob store index with an age greater than 100 (this implies that writer concurrency must never exceed 50), and where their contiguous storage exceeds 4Kb, we remove those entries from the blob store index and write a new canonical blob store index.
  2. If entries were removed, we calculate a set of extents to be deallocated and fire and forget a call to batch async_zero(). The extents to be deallocated are calculated by walking the small blob index and keeping a record of extents not referred to (i.e. the gaps between blobs stored). For each of those containing an aligned 4Kb quantity or more we rewrite the record header to have any parser skip the deallocated extent, and deallocate the rest.

So let's look at our new interface file. It looks pretty similar to before, the only real change is the many new private member variable handle_ptr's to the store files described above. Another change is that we now return a outcome<ostream> for writing but a shared_future<istream> for reading — this is because as explained earlier, writes are now fully buffered to memory before being hashed and committed as a transaction, so by explicitly returning a monad we are saying that this is now a synchronous operation (one could just return an already ready shared future, but this method makes public API assurances, and because a Boost.Outcome future is a monad, your code will likely not even notice).

namespace afio = BOOST_AFIO_V2_NAMESPACE;
namespace filesystem = BOOST_AFIO_V2_NAMESPACE::filesystem;
using BOOST_OUTCOME_V1_NAMESPACE::lightweight_futures::shared_future;

class data_store
  struct _ostream;
  friend struct _ostream;
  afio::dispatcher_ptr _dispatcher;
  // The small blob store keeps non-memory mappable blobs at 32 byte alignments
  afio::handle_ptr _small_blob_store, _small_blob_store_append, _small_blob_store_ecc;
  // The large blob store keeps memory mappable blobs at 4Kb alignments
  afio::handle_ptr _large_blob_store, _large_blob_store_append, _large_blob_store_ecc;
  // The index is where we keep the map of keys to blobs
  afio::handle_ptr _index_store, _index_store_append, _index_store_ecc;
  struct index;
  std::unique_ptr<index> _index;
  // Type used for read streams
  using istream = std::shared_ptr<std::istream>;
  // Type used for write streams
  using ostream = std::shared_ptr<std::ostream>;
  // Type used for lookup
  using lookup_result_type = shared_future<istream>;
  // Type used for write
  using write_result_type = outcome<ostream>;

  // Disposition flags
  static constexpr size_t writeable = (1<<0);

  // Open a data store at path
  data_store(size_t flags = 0, afio::path path = "store");

  // Look up item named name for reading, returning an istream for the item
  shared_future<istream> lookup(std::string name) noexcept;
  // Look up item named name for writing, returning an ostream for that item
  outcome<ostream> write(std::string name) noexcept;

More code is to come ...

Table 1.9. This third generation solution will perform excellently under these conditions:


On Microsoft Windows you can place the store deep in a directory hierarchy and use long key names.

Third party threads and processes can rename the location of the store during use.

The size of all the values being read at any given time fits into your virtual address space (which is at least 2Gb on 32 bit, 8Tb on 64 bit).

As many processes and threads may read and write to the store concurrently as you like, including CIFS clients but excluding NFS clients.

Processes may unexpectedly exit during modifies with no consequence on consistency.

Lookup performance is breathtakingly swift and close to invariant to item count. Write performance is much slower, but a lot faster than the atomic rename based design.

Sudden power loss will restore the key-value store to some point in history preserving the exact ordering of updates.

This design allows the atomic transactional update of as many key-value item at once as you like.