SSTable databases, or the big ideas behind the scenes
Last time we took a close look at the log-based database. Today we will look at another structure called the SSTable and the benefits it brings compared to a plain log. Cassandra uses SSTables to store data, so this knowledge will help you understand a wide range of databases in depth.
🔗: Read about the log table in my previous article
Prepare a coffee☕, we’re ready to dive🤿.
Moving from the log table to SSTables
As we know from the previous article, the key-value pairs in a log table are stored in write order, i.e., unordered by key. Now let’s think about a structure that keeps rows sorted by key on disk. In computer science, this structure is known as a Sorted String Table (SSTable).
One more requirement for an SSTable is that each key appears only once within each merged segment file. Luckily for us, the compaction process already guarantees this.
Advantages✅
- You can merge segments efficiently, even if the files don’t fit in memory. The approach is similar to the merge sort algorithm: start reading the segments side by side, look at the first key in each file, and copy the lowest key to the output file. When two or more segments contain the same key, keep the value from the most recent segment and skip the others.
- No need to keep an index of all keys in memory. For example, say we want to find the value for orange. We don’t know orange’s exact offset, but because the keys are sorted, orange must be between onion and pear. We jump to onion’s offset and scan forward until we find the key/value pair (or find that it’s absent).
☝️: It’s best to keep blocks at about one kilobyte (the best balance): the in-memory index stays small, and the time to scan a block on disk stays insignificant.
- 📦Compression: We have to read a whole block of key/value pairs anyway, so we can compress each block before writing it. This reduces storage space and saves I/O bandwidth.
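The merging step from the first advantage above can be sketched in a few lines of Python, with `heapq.merge` playing the role of the merge-sort pass (the function and variable names here are my own, not from any real engine):

```python
import heapq

def merge_segments(segments):
    """Merge sorted segments into one sorted segment.

    `segments` is a list of sorted (key, value) lists, ordered from
    oldest to newest; on duplicate keys the newest value wins.
    """
    # Tag each pair with the segment's age so that, for equal keys,
    # entries from newer segments sort first (higher age, negated).
    tagged = [
        [(key, -age, value) for key, value in seg]
        for age, seg in enumerate(segments)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            continue  # a newer segment already supplied this key
        merged.append((key, value))
    return merged
```

Because each input is already sorted, the merge only ever looks at the head of each segment, so memory use stays tiny no matter how large the files are.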
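The sparse-index lookup from the second advantage (finding orange between onion and pear) can be sketched like this; list indexes stand in for byte offsets in a real segment file, and all names are illustrative:

```python
import bisect

def find(pairs, index, key):
    """Look up `key` in a sorted segment using a sparse index.

    `pairs` is the segment: a sorted list of (key, value) tuples.
    `index` is sparse: sorted (key, position) entries for some keys only.
    """
    keys = [k for k, _ in index]
    # position of the closest indexed key at or before the one we want
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before the first indexed key
    start = index[i][1]
    for k, v in pairs[start:]:
        if k == key:
            return v
        if k > key:
            return None  # keys are sorted, so it can't appear later
    return None
```

With one index entry per block, the index stays small while the scan is bounded by a single block.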
How to keep an SSTable ordered?
First of all, we could maintain a sorted structure directly on disk (in the next article we will talk about B-trees), but keeping the data sorted in memory is much faster.
🌳We can use an AVL or a red-black tree to keep the data ordered. You can insert keys in any order (each insert costs O(log n)) and read all elements back in sorted order (in O(n)).
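Python’s standard library has no AVL or red-black tree, so this small sketch uses `bisect` on a sorted list as a stand-in (its insert is O(n) rather than O(log n), but it shows the same idea: insert in any order, read back sorted):

```python
import bisect

memtable = []  # sorted list of (key, value) pairs; a stand-in for an AVL tree

def put(key, value):
    # find the insertion point; replace the value if the key already exists
    i = bisect.bisect_left(memtable, (key,))
    if i < len(memtable) and memtable[i][0] == key:
        memtable[i] = (key, value)
    else:
        memtable.insert(i, (key, value))

# keys arrive in arbitrary order...
for k, v in [("pear", 2), ("apple", 1), ("orange", 3)]:
    put(k, v)
# ...but iterating the memtable always yields them sorted by key
```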
Now we need to use the following strategy in our storage:
- ✍To write, add the record to a memtable. A memtable is an in-memory balanced data structure (for example, an AVL tree).
- 💾When the memtable grows past a certain threshold, write it out to disk as an SSTable. While the memtable is being written out, you can continue accepting writes into a new memtable.
- 👓To read, first try to find the key in the memtable, then in the most recent on-disk segment (using an index, of course), then in older segments, and so on.
- ⏰Regularly merge and compact segments
☝️: A common size for a memtable is 1 MB
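The steps above can be sketched as a toy storage engine. `TinyLSM` is a hypothetical name; segments live in memory here rather than on disk, a plain dict stands in for the balanced-tree memtable, and the read path does a linear scan where a real engine would use the sparse index:

```python
class TinyLSM:
    """Toy storage engine: a memtable plus flushed, sorted segments."""

    def __init__(self, threshold=4):
        self.memtable = {}   # stand-in for a balanced tree
        self.segments = []   # newest segment last; each is a sorted list
        self.threshold = threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            # flush: write the memtable out as a sorted segment
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # read path: memtable first, then segments from newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None
```

Checking newer data first is what makes overwrites work: a fresh value in the memtable shadows any older value sitting in a segment until compaction removes it.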
💦The one drawback here is that if the application crashes, we lose the records in the memtable. To fix this, we append every record to a separate file on disk (it can be unordered) as well as to the memtable. If the application crashes, we simply replay the records from that file. Once the memtable has been written out as an SSTable, we can discard both the memtable (in RAM) and the recovery file on disk.
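A minimal sketch of that recovery file (usually called a write-ahead log), assuming a tab-separated text file with a made-up name `recovery.log`:

```python
import os

LOG_PATH = "recovery.log"  # hypothetical file name

def put(memtable, key, value):
    # append to the on-disk log first, then update the memtable
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(f"{key}\t{value}\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the record survives a crash
    memtable[key] = value

def recover():
    # after a crash, rebuild the memtable by replaying the log in order
    memtable = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH, encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                memtable[key] = value
    return memtable
```

Replaying in order means later entries overwrite earlier ones, so the rebuilt memtable ends up exactly as it was before the crash.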
Who uses SSTables?
👁LevelDB and RocksDB, key/value storage libraries, use the algorithm described above. In Riak, you can use LevelDB instead of the default Bitcask storage engine. Similar storage engines are used in HBase and Cassandra, both of which were inspired by Google’s Bigtable paper.
ℹ️: The Google Bigtable paper first introduced the terms SSTable and memtable
Lucene uses a similar method for storing its term dictionary. The key here is a word you search for, and the value is the list of all documents in which that word occurs.
How can we improve performance?
Optimizing lookups for keys that do not exist
In a simple SSTable implementation, you have to check the memtable and then scan every segment before concluding that a key is absent. We can speed this up by adding a Bloom filter. A Bloom filter can tell us that a key definitely does not exist in the database, saving many unnecessary I/O operations.
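A toy Bloom filter is just a bit array plus a few hash functions; the sketch below derives the bit positions from SHA-256, and the sizes and names are illustrative:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: may report false positives,
    but never false negatives."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # derive `hashes` independent bit positions from SHA-256
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means the key is definitely absent: skip the disk scan
        return all(self.bits[pos] for pos in self._positions(key))
```

On the read path you consult the filter first: only if `might_contain` returns True do you pay for the memtable and segment lookups.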
⌚When should we compact/merge our segments?
There are two strategies to determine the order and timing of compaction/merging:
- size-tiered (HBase) — newer and smaller segments are successively merged into older and bigger segments.
- levelled compaction (LevelDB, RocksDB) — the key range is split up into smaller segments and older data is moved into separate “levels”. This way compaction can proceed incrementally and use less disk space.
ℹ️: Cassandra supports both size-tiered and levelled compaction.
In the following article, we will talk about B-tree-based databases. Don’t forget to clap 👏 and subscribe 📧 so you don’t miss any news.