Basic Internals of Key Value Database Engine

In this blog we will discuss how Key-Value storage engines like RockDB store data under the hood so that it is fast accessible. Let’s jump right into it.

If I would tell you that build a basic storage engine, which sets a key-value and gets the latest value based on key. Here set method appends the key-value in a text file. If we try to update the same entry we do not remove the previous one, we just append it and later at the time of accessing the value we take the latest value from the file. Here our text file is basically where all are data is being stored. To see things in actions let’s write the most basic storage of the world in javascript, just to get some idea.

const fs = require('fs').promises;
const STORAGE_FILENAME = "database";

class KVStore {
  constructor(filename) {
    this.filename = filename;
  }

  async set(key, value) {
    await fs.appendFile(this.filename, `${key},${value}\n`);
  }

  async get(key) {
    try {
      const fileContent = await fs.readFile(this.filename, 'utf-8');
      const lines = fileContent.split('\n');

      let lastValue = null;

      for(let i=lines.length-1;i>=0;i--){
        const line = lines[i];
        if(line.trim() === '') continue;
        const [currentKey, value] = line.split(',');
        if(currentKey === key){
          lastValue = value;
          break;
        }
      }

      return lastValue;
    } catch (error) {
      if (error.code === 'ENOENT') {
        // File doesn't exist, key not found
        return null;
      }
      throw error;
    }
  }
}

// Usage example
async function main() {
  const store = new KVStore(STORAGE_FILENAME);

  await store.set('name', 'John Doe');
  await store.set('age', '30');
  await store.set('name', 'Jane Doe');

  console.log('Name:', await store.get('name'));
  console.log('Age:', await store.get('age'));
}

main().catch(console.error);

In the above mentioned code we are appending the value to a file name database and at the time of get, we search the database file from the end and as soon as the key is find we return its value.

Lets talk about performance of our basic engine, our set method has pretty good performance because all it does is just appending the KV at the end of value which is quick, and this is what actual database engine also use while maintaining a log file they append data to the file. Although they have much more things to handle and worry about like log file size does not increase much larger in size and concurrency but basic principle is same.

If we talk about performance of get method, it’s not very good. We can say that its time complexity is O(n). Which will hurt us when data increases and becomes quite large in size. Now we try to improve the performance of this method we can do indexing. Basic concept for indexing is with the help of some metadata we store some information that where a particular key is placed in our file so that we can directly jump to that instead of doing a normal traversal.

Index are used in database engines which allows you to speed up the read queries, but you have to use indexes very carefully because excess use of them could lead to slower writes, because whenever a write operation happens index also needs to be updated.

Hash Indexes

Hash maps are basic data structures which we are very much aware of, hash maps lay down basic foundation of indexing which are being used in storage engines. Since, we are storing the data in a file and to improve our get method’s performance we can maintain an in-memory hash map which will store the byte offset where our key is stored in the file, as you can see in the above image. So, now whenever we will append new key-value to this file, we need to update this hash map and whenever you need to get a key’s value with the help of this hash map you can directly jump the byte offset location where that key is stored.

Now we know that we are just appending the data to our log file, hence how do we make sure that we do not eventually run out of disk space. To solve this problem one thing can be done is we can break our log file into segment files and stop accepting writes to a segment when it reaches a pre-defined size. For example, you start with segment file name segment_0, and keep writing KVs to this file and once it reaches 1 mb size. We create a new file named segment_1 and start accepting writes to this one.

We can then use compaction strategy to remove duplicate keys and keeping only the most updated key. Since, while doing compaction we can also do merging of segment files because once segment file reaches a certain size we do not accept writes to them. So we can merge them to create a new file and this process can be done in background thread while we can continue read and writes to the new segment file.

What we discussed is just basic working, but lot of detailing goes to make it work in production environment. Some of them are like:

File Format: CSV is not the file format log to use, because we do not need it to be in human readable format, instead we can store in binary format.

Deleting Records: If you want to delete a record, you need to append a special type of record. So that at time of compaction and merging that can act sort of identifier to remove those records.

Crash Recovery: If crash happens, your in-memory hash map will be lost and to recover them you could either reinitialise your hash map by reading your segment files and mainting the most recent key’s byte offset, but if segment files are larger in size this could lead to larger restart times. Something what Bitcask does is they store the snapshot of hash map in the disk and at the time of restart just load them directly from disk.

Now you could argue, why are we using append only log file. Why don’t we update the values in place. But append only log files has many advantages like:

Appending and segment merging are quite fast on magnetic spinning disk as compared to random writes.
Concurrency and crash recovery are easy to handle because of immutable log files. For example if a crash happens while you were overwriting a value. You have a segment which contains both the old value and new value(partially written).

There are some limitations of hash index table:

The hash table must fit in memory, because if you have large number of keys you could not do anything. You could store your hash map in disk, but seeking from disk is not that fast, which will slower down your get operations.
Range queries are not possible, if you want to search between abc000 to abc999. You would have to traverse to each and every key.

We will see next how we can tackle this indexing issues.

SSTables and LSM Trees

Currently we are appending data to our log files, but we can make a simple change that we append the keys in our log file in sorted order. I mean all the data entered in the log file is sorted by key. This structure is called SSTable or Sorted String table. You must be thinking that now we it will break our sequential writes to the log file, but will come to that later.

There are certain advantages to SS tables:

Merging segment files becomes easy and efficient. You can use merge sort algorithm to merge segment files giving precedence to keys which exist last. Newly generated segment file will also be sorted by key.

One more tweak we do, for a segment file now we don’t need to store all keys byte offset. Instead what we do, is store the KVs in compressible blocks of size usually 1 KB and we store only the starting key of every block’s byte offset. This way we save lot of space and enhance performance because now to find they key you can search between those blocks.

As you can see in the above diagram, we store KVs in blocks and for every blocks starting key we store its address. Now for example, if you want to search for xyz key we know that it will exist after efg key definitely so we can directly start our search after efg key which comes from second block.

Constructing and Maintaining SS Tables

Everything’s good till now, but you must be thinking how do we insert data in sorted order in our segment files. To achieve that we can make use of data structures like Red-Black trees or AVL trees which allow us to insert data in any order and we can get them sorted by keys.

What we can do is, we can maintain an in-memory red-black tree which is also called mem-table. We accepts direct writes to this mem-table, and once this tree becomes of certain size like 1 mb we write it to a segment file(which can be done in sorted order). Once it is written to segment file we can remove it from the memory and while we are writing to segment file we can continue accepting writes by creating a new mem-table and accepting writes to our new one.

There is one issue with this mem-table is that since it is in-memory so if the crash happens we will lose our data. To handle this scenario we maintain an append only log file, while we write to mem-table we also append to this log file. This log file is deleted every time , we have copied our mem-table to segment file. Our log file is append only, because it’s only purpose is crash recovery.

Algorithms described here is used in dbs like RocksDB and LevelDB. Storage engines based on this principle of merging and compacting are also LSM(log structured merge tree) storage engines.

Conclusion

That was long one, but give yourself a pat on the back for reaching this far and do not let that curiosity die. There is much more to database engines, I encourage you to go ahead and read those from some of the amazing books like Design Data Intensive Applications. This blog is also influenced from book DDIA’s chapter Storage and Retrieval. Soon I will be publishing more blogs continuing this topic.