# Write-Optimized Dynamic Hashing, and uDepot

## Write-Optimized Dynamic Hashing for Persistent Memory

### 0x00 Introduction

CCEH effectively adapts its size as needed while guaranteeing failure-atomicity, and its tail latency is up to 3.4× and 8× shorter than that of the state-of-the-art Level Hashing and Path Hashing, respectively.


### 0x01 Basic Ideas

#### Extendible Hash Table

The Extendible Hash Table (EHT) is a design that makes rehashing incremental. Lookup in an EHT is two-level: the first level is the global depth, which determines how many hash bits index the directory, and the second level is the local depth, which addresses the individual buckets. Because different buckets can use different numbers of bits, several directory entries may point to the same bucket, and most rehashing operations therefore only touch a single bucket. For example, when bucket B2 in the figure fills up, its local depth is raised to 2, a new bucket is allocated, and only the records that need to move are copied over. Only when a bucket's local depth would exceed the global depth does the directory itself have to be doubled.

... overhead of clflush. Note that when we index 16 million records using 16 KByte segments, it takes 555 usec and 631 usec to double the directory when we use LSB and MSB respectively. However, clflush() takes about 2 msec (3∼4× higher). In conclusion, LSB does not help reduce the overhead of enlarging the directory size unlike the directory file on disks.


#### Recovery

 Even if the cacheline is evicted from the CPU cache, partially written records will be ignored because the key is not valid for the segment, i.e., the MSB bits are not a valid segment index. This is the same situation as when our lazy deletion considers a slot with any invalid MSB segment index as free space.


When a system crashes, we visit directory entries as in binary tree traversal and check their consistency, which can be checked by making use of G and L. That is, we use the fact that, as we see in Figure 3, if G is larger than L then the directory buddies must point to the same segment, while if G and L are equal, then each must point to different segments.


#### Concurrency

CCEH's concurrency control mainly relies on rwlocks. For segments, lock-free techniques can also be applied [1].

## Reaping the performance of fast NVM storage with uDepot

### 0x10 Introduction

... Indeed, uDepot vastly outperforms SSD-optimized stores by up to ×14.7 but also matches the performance of a DRAM-backed memcached server allowing it to be used as a Memcache replacement to dramatically reduce cost.


### 0x11 TRT

TRT is a task run-time system that uDepot designed for fast IO devices. It addresses an existing gap: event frameworks for network requests, such as libevent, cannot easily handle environments that must serve network and storage IO at the same time, e.g. combining epoll wait with io getevents, or with SPDK's completion processing calls. TRT is essentially a coroutine implementation. Typically it spawns a group of threads, usually one per core; each task behaves like a coroutine with its own stack, and TRT runs a user-space scheduler on each thread, switching tasks with the usual coroutine context-switch techniques. To minimize cross-core communication, TRT splits its synchronization primitives into intra-core and inter-core variants, and a task normally runs on a single core. TRT serves as the foundation for uDepot's asynchronous IO.

TRT currently supports four backends: Linux AIO, SPDK (single device and RAID-0 multi-device configurations), and Epoll, with backends for RDMA and DPDK in development. Each backend provides a low-level interface that allows tasks to issue requests and wait for results, and, built on top of that, a high-level interface for writing code resembling its synchronous counterpart.


### 0x12 uDepot

uDepot manages device space by splitting the storage into segments, e.g. 1 GB each, and writing to them in a log-structured fashion; the accompanying garbage-collection algorithm is described in a separate paper [3].

#### Index data structure

uDepot's overall index is a two-level structure: the first level is a directory, and the second level contains the actual hash tables.

uDepot implements the index with a variant of a Hopscotch hash table, with two main changes. 1. The table holds a power-of-two number of entries, and each Hopscotch group (called a neighborhood here) works much like a set in a set-associative cache, a design quite similar in spirit to many Cuckoo hash tables. The neighborhood offset is stored inside the hash entry itself, which also simplifies resizing the index: a resize does not have to fetch the keys and recompute their hashes. 2. It uses neither the original design's per-neighborhood bitmap nor linked lists. For synchronization, uDepot uses the common approach of segmented (region) locks:

... an array of locks for concurrency control. These locks protect different regions (lock regions) of the hash table, with a region being strictly larger than the neighborhood size (8192 entries by default). A lock is acquired based on the neighborhood’s region; if a neighborhood spans two regions, a second lock is acquired in order.


```c
struct HashEntry {
    uint64_t neigh_off  : 5;   // neighborhood offset
    uint64_t key_fp_tag : 8;   // fingerprint MSBs
    uint64_t kv_size    : 11;  // KV size (grains)
    uint64_t pba        : 40;  // storage addr. (grains)
};
```


... 8 bits for the tag (key_fp_tag), and 5 bits to allow for 32 entries in a neighborhood (neigh_off). Hence, if an entry has location λ in the table, then its neighborhood index is λ − neigh_off, and its fingerprint is key_fp_tag : (λ − neigh_off).


## References

1. Write-Optimized Dynamic Hashing for Persistent Memory, FAST '19.
2. Reaping the performance of fast NVM storage with uDepot, FAST '19.
3. Elevating commodity storage with the SALSA host translation layer, arXiv.