# Building a Bw-Tree Takes More Than Just Buzz Words

## Building a Bw-Tree Takes More Than Just Buzz Words

### 0x00 引言

Bw-Tree是Microsoft Research发明的一种lock-free(在数据库中应该叫做latch-free)的数据结构，被用在Microsoft的内存数据库产品中，在CMU的学术型数据库peloton中也采用了这种数据结构。这篇paper就是CMU对其在peloton中实现中，对Bwtree的一些更加详细的讨论。从paper中得出的结论来说，Bwtree是有复杂而且在不少方面表现却比一些更加简单的、非lock-free的数据结构要差。 关于 Bwtree的论文之前有两篇， 不过这篇是讲的最清晰的。

### 0x01 基本结构

Bwtree的4个核心内容：

#### Mapping Table

The Bw-Tree’s centralized Mapping Table avoids these problems and allows a thread to update all references to a node in a single CaS instruction that is available on all modern CPUs.
...
Although this paper focuses only on in-memory behavior of the Bw-Tree, it is worth emphasizing that the mapping table also serves the purpose of supporting log-structured updates when deployed with SSD. Updates to tree nodes will otherwise propagate to all levels without the extra indirection provided by the Mapping Table.


#### Consolidation and Garbage Collection

Delta Chains和Base Node构成了Bwtree中一个Node的数据，随着数据更新，这个Delta Chains的长度是不断增长的，这样就需要将Delta Chains和Base Node合并为一个新的Node，并回收不在使用的数据。

At the beginning of consolidation, the thread copies the logical node’s base node contents to its private memory and then applies the Delta Chain. It then updates the node’s logical link in the Mapping Table with the new base node.


#### Structural Modification(SMO)

1. 在 Delta Chains中插入一个特殊的 Delta表明这个Node在进行SMO操作，其它线程就将更新的操作insert到一个虚拟的节点。
2. 下一步就是一个线程(不一定是第一步中的线程)，完成node的SMO操作之后，使用CaS将虚拟节点更新为新节点。

给一幅[2]中的图，看着就不是很好玩:

### 0x03 MISSING COMPONENTS

• Non-unique Key Support
• Iteration
• Mapping Table Expansion 。具体的讨论可以查看[1].

### 0x04 优化

We present our optimizations for the OpenBw-Tree’s key components to im- prove its performance and scalability. As we show in Section 5, these optimizations increase the index’s throughput by 1.1–2.5× for multi-threaded environments.


#### Delta Record Pre-allocation

Delta Chain 是Bwtree设计的一个核心，原论文的设计使用一个链表，而这里的优化是直接使用一个大的chunk，提前给Delta Record 分配空间。带来的好处减少了对象的分配，另外可以将这些数据放在一块就更加缓存友好，缺点是增加的内存的使用。

Each chain also maintains an allocation marker that points to the last delta record or the base node. When a worker thread claims a slot, it decrements this marker by the number of bytes for the new delta record using an atomic subtraction. If the pre-allocated area is full, then this triggers a node consolidation.


#### Garbage Collection

Centralized GC 的模式会造成一些访问冲突，为了解决这个问题，这里使用了decentralized GC 的方式，避免写全局的内存。具体的方法是依然有一个a global epoch (e-global) 。此外，每个线程保存了a private epoch (e-local) 和一个保存了指向已经删除对象指针的list。一个线程产生垃圾的时候，添加一个指向这个垃圾的指针，同时用e-global 标示e-local，当操作完成的时候，再执行这个步骤，如何开始GC。这个线程取回从其它线程所有的e-local，然后在在自己的list中找到这样的objects，这些的对象的l-local 标志的epoch比最小的e-local还要小，这些objects就是可以安全回收的。

#### Fast Consolidation

 a two-way merge combines ∆insert records and segments from the old base node into the new base node.


• Rule #1: For an inserted item, an existing segment [s ,e ) is broken into the segments [s ,offset) and [offset,e ).

• Rule#2: For a deleted item,an existing segment[s,e)is broken into the segments [s, offset) and [offset +1, e).

• Rule #3: However, if a delete delta removes an item that does not exist in the base node (i.e., deleting an item that is newly inserted by an earlier delta), the thread ignores this deletion.


### 0x05 评估

Emmmm，看上去不是很好看。Bwtree以很高的操作复杂程度(不是复杂度)，确得到了一个不怎样的结果。费力不讨好，根据为了做到lock free把结构搞得很恶心:

Nevertheless, even our optimized OpenBw-Tree, is still considerably slower than other state-of-the-art in-memory index structures like SkipList, Masstree and ART. OpenBw-Tree is also slower than a B+Tree that uses optimistic lock coupling, which indicates that lock-freedom does not always pay off in comparison with modern lock-based synchronization techniques.


## 参考

1. Ziqi Wang, Andrew Pavlo, Hyeontaek Lim, Viktor Leis, Huanchen Zhang, Michael Kaminsky, and David G. Andersen. 2018. Building a Bw-Tree Takes More Than Just Buzz Words. In Proceedings of 2018 International Conference on Management of Data (SIGMOD’18). ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3183713.3196895 .

2. Comparative Performance of Memory Reclamation Strategies for Lock-free and Concurrently-readable Data Structures，2005。

3. The Bw-Tree: A B-tree for New Hardware Platforms, ICDE 2013.