# PebblesDB -- Building Key-Value Stores using Fragmented Log-Structured Merge Trees

## PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

### 0x00 引言

We evaluate PebblesDB using micro-benchmarks and show that for write-intensive workloads, PebblesDB reduces write amplification by 2.4-3× compared to RocksDB, while increasing write throughput by 6.7×. We modify two widely-used NoSQL stores, MongoDB and HyperDex, to use PebblesDB as their underlying storage engine. Evaluating these applications using the YCSB benchmark shows that throughput is increased by 18-105% when using PebblesDB (compared to their default storage engines) while write IO is decreased by 35-55%.


### 0x01 FLSM基本思路

FLSM的存储布局：

LSM-Tree写入放大一个很重要的原因就是上层的更少的数据合并到下层更加大的数据的时候，下层所有的数据都会被重写。为了解决这个问题，FLSM把每层的sstable分割成更小的单元。FLSM compaction的时候，不是将上层下层的合并重新，而是简单地在下一层合适的位置添加一个新的fragment。FLSM可以做到数据在大部分的层中只会被写入一次。

 Doing so ensures that data is written exactly once in most levels; a different compaction algorithm is used for the the last few highest levels.  FLSM achieves this using a novel storage layout and organizing data using guards.


### 0x02 Guards

Guards是FLSM中核心的概念，基本的思路来自Skip List。前面说到FLSM将每层分为了更小的单元，为了更快的找到数据所在的单元，这里使用了Guards。这里为什么可以和基本的工作方式联想一下Skip List即可。传统的LSM，每一层包含了不相交的数据的sstabls，也就是说一个数据最多在这些sstables中出现一次。FLSM则改变了这种方式，它由每一层的多个sstables的范围是可以重叠的(注意这里是sstables，FLSM每一层多个的sstable，而且每一个Gruard下面也可以由多个)，也就是说一个数据在多个的这些sstabls里面。

  The FLSM data structure discards this invariant: each level can contain mul- tiple sstables with overlapping key ranges, so that a key may be present in multiple sstables. To quickly find keys in each level, FLSM organizes the sstables into guards (inspired from the Skip-List data structure)


#### Selecting Guards

Much like skip lists, if a key K is selected as a guard in level i, it becomes a guard for all higher levels i + 1, i + 2 etc. The guards in level i + 1 are a strict superset of the guards in level i. Choosing guards in this manner allows the interval between each guard to be successively refined in each deeper level.


#### Inserting and Deleting Guards

During compaction, guard G is deleted and sstables belonging to guard G will be partitioned and appended to either the neighboring guards in the same level i or child guards in level i + 1. Compaction from level i to i + 1 proceeds as normal (since G is still a guard in level i + 1). At the end of compaction, FLSM persists metadata indicating G has been deleted at level i. If required, the guard is deleted in other levels in a similar manner. Note that if a guard is deleted at level i, it should be deleted at all levels < i; FLSM can choose whether to delete the guard at higher levels > i.


### 0x03 FLSM Operations

• get()操作；在memtable里面的查找操作和一般的LSM没有区别。在memtable没有查到的话，想要到各层sstables。在下面的任何一层查到了就直接返回。在具体的一层查找的时候，先对Guards进行二分查询，定位到一个Guard下面的sstabls，然后在这个Guard下面的sstables里面查着，注意这里是这个Guard下面所有的sstabls都要查找。

• 范围查询这里要处理的问题就是一个Guard里面的sstables的值的范围存在重叠的问题，这里使用的是类似merge sort的方法解决。

...  a binary search is performed on each sstable to identify the smallest key overall in the range. Identifying the next smallest key in the range is similar to the merge procedure in merge sort; however, a full sort does not need to be performed. When the end of range query interval is reached, the operation is complete, and the result is returned to the user.

• put()操作，也是先添加到memtable，在memtable满了之后，先被添加到level 0，需要的时候进行compaction操作。这里FLSM会避免进行合并重写的操作，而是直接attach到下面一层合适的Guard下。
• 更新 & 删除，这里和LSM的操作没有很大的区别，也是添加一个包含新的updated sequence number的数据(update) 后者是一个删除的标志(delete).
• Compaction，这里当一个Guard下面的sstabls数量达到一定的阈值的时候，就会进行compaction操作。这些sstales首先会被归并，然后根据下面一层的Guars进行拆分，然后就是attach到正确的Guard下面。
 For example, assume a guard at Level 1 contains keys {1, 20, 45, 101, 245}. If the next level has guards 1, 40, and 200, the sstable will be partitioned into three sstables containing {1, 20}, {45, 101}, and {245} and attached to guards 1, 40, and 200 respectively.


• 最下面的一层(Paper中的叫法是the highest level，这篇总结将这些level看成类似倒置的树)，由于下面没有level，它只能选择重写合并(如果没有达到一定的条件添加新的level的话)；

• 另外一个情况就是倒数第二层，如果它的compaction操作，选择添加到下一层的操作会导致合并到一个大的sstable的时候(这里有点模糊)，

 Second, for the second-highest level, FLSM will rewrite an sstable into the same level if the alternative is to merge into a large sstable in the highest level (since we cannot attach new sstables in the last level if the guard is full).


FLSM例外的一个优点就是这些compaction操作是很容易并行操作的，因为不同的Guard之间不会产生其它的Guard。

## 参考

1. Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abra- ham. 2017. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP’17). ACM, New York, NY, USA, Article 4, 18 pages. https://doi.org/10.1145/3132747. 3132765.