# MemC3 and libcuckoo

## MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing

### 0x00 Introduction

As a case study, we focus on Memcached [19], a popular in-memory caching layer, and show how our toolbox of techniques can improve Memcached’s performance by 3× and reduce its memory use by 30%.


Most keys are smaller than 32 bytes and most values no more than a few hundred bytes. In particular, there is one common type of request that almost exclusively uses 16 or 21 Byte keys and 2 Byte values.


### 0x01 Optimistic Cuckoo Hashing

The basic idea of cuckoo hashing is to use more than one hash function to decide where an object may be stored, and to resolve collisions by displacing (evicting) existing objects. The hash table here is organized like a set-associative CPU cache: each position holds not a single object but a group of slots.

Our hash table is 4-way set-associative. Without set-associativity, basic cuckoo hashing allows only 50% of the table entries to be occupied before unresolvable collisions occur. It is possible to improve the space utilization to over 90% by using a 4-way (or higher) set associative hash table.


#### Tag-based Lookup

After checking all 8 candidate slots, a negative Lookup makes 8 × 0.39% = 0.03 pointer dereferences on average. Because each bucket fits in a CPU cacheline (usually 64-Byte), on average each Lookup makes only 2 parallel cacheline-sized reads for checking the two buckets plus either 0.03 pointer dereferences if the Lookup misses or 1.03 if it hits.
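To make the cacheline claim concrete, a 4-way bucket can pack four one-byte tags next to four item pointers so the whole bucket occupies a single 64-byte cache line. The struct below is only a sketch with our own field names and padding (the paper gives no struct definitions), assuming 8-byte pointers:

```c
#include <stdint.h>
#include <stdalign.h>

// Sketch of a 4-way set-associative bucket that fits one 64-byte
// cache line: four 1-byte tags for cheap filtering, then four
// pointers to the full key-value items stored outside the table.
// (Field names and padding are our own; the paper shows no struct.)
typedef struct {
    alignas(64) uint8_t tags[4];  // 1-byte hash tags, 0 = empty slot
    uint32_t pad;                 // keeps the pointers 8-byte aligned
    void *ptrs[4];                // pointers to full key-value items
} bucket_t;
```

With this layout a negative tag check touches only the first few bytes of the line, and the whole bucket is read with one cacheline fetch.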


#### Tag-based Insert

Besides speeding up Lookup, the tag also optimizes the Insert operation. The crux is the following way of computing the two candidate buckets (this is really clever):

```
b1 = HASH(x)         // based on the entire key
b2 = b1 ⊕ HASH(tag)  // based on b1 and the tag of x
```


More importantly, b1 can be computed by the same formula from b2 and the tag. This property ensures that to displace a key originally in bucket b, no matter if b is b1 or b2, it is possible to calculate its alternate bucket b′ from the bucket index b and the tag stored in bucket b by

```
b′ = b ⊕ HASH(tag)
```
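A minimal sketch of this partial-key scheme in C (the hash function, tag width, and table size below are our own choices; the paper only requires that the two buckets be derived as above):

```c
#include <stdint.h>

#define NUM_BUCKETS 1024            // power of two so we can mask

// Illustrative 32-bit integer mix (the paper does not fix a hash).
static uint32_t hash32(uint32_t x) {
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

// One-byte tag of a key; 0 is reserved here to mean "empty slot".
static uint8_t tag_of(uint32_t key) {
    uint8_t t = (uint8_t)(hash32(key) >> 24);
    return t ? t : 1;
}

// b1 is derived from the entire key.
static uint32_t bucket1(uint32_t key) {
    return hash32(key) & (NUM_BUCKETS - 1);
}

// Given EITHER bucket index and the stored tag, the other bucket is
// b ^ HASH(tag): the same formula works in both directions, so a key
// can be displaced without re-reading its full key material.
static uint32_t alt_bucket(uint32_t b, uint8_t tag) {
    return (b ^ hash32(tag)) & (NUM_BUCKETS - 1);
}
```

The point is that `alt_bucket` is an involution (XOR-ing twice with the same value cancels), which is exactly the property the displacement path relies on.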


### 0x03 Concurrent Cuckoo Hashing

Concurrency is hard to get right for cuckoo hashing, because the displacement (eviction) process is quite unpredictable. For the concurrent case, there are two main problems to solve:

To avoid writer/writer deadlocks, it allows only one writer at a time — a tradeoff we accept as our target workloads are read-heavy.


#### Optimistic Locks for Lookup

Optimizing for the common case, our approach takes advantage of having a single writer to synchronize Insert and Lookups with low overhead. Instead of locking on buckets, it assigns a version counter for each key, updates its version when displacing this key on Insert, and looks for a version change during Lookup to detect any concurrent displacement.
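A sketch of this version-counter protocol (the counter-array size, striping, and names below are our own; the idea of striping keys across a bounded pool of counters follows the paper):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

// Keys are striped across a fixed pool of version counters rather
// than keeping one counter per key, to bound the extra space.
#define NUM_COUNTERS 8192
static _Atomic uint32_t versions[NUM_COUNTERS];

static _Atomic uint32_t *counter_for(uint32_t keyhash) {
    return &versions[keyhash & (NUM_COUNTERS - 1)];
}

// Writer (the single Insert thread): make the version odd before
// displacing a key and even again afterwards.
void begin_displace(uint32_t keyhash) { atomic_fetch_add(counter_for(keyhash), 1); }
void end_displace(uint32_t keyhash)   { atomic_fetch_add(counter_for(keyhash), 1); }

// Reader: retry whenever the version is odd (displacement in flight)
// or changes between the two loads around the bucket reads.
bool optimistic_lookup(uint32_t keyhash, bool (*read_buckets)(void *), void *arg) {
    for (;;) {
        uint32_t v1 = atomic_load(counter_for(keyhash));
        if (v1 & 1) continue;               // a displacement is in progress
        bool found = read_buckets(arg);     // read both candidate buckets
        uint32_t v2 = atomic_load(counter_for(keyhash));
        if (v1 == v2) return found;         // no concurrent displacement
    }
}
```

Because there is only one writer, the counter updates need no CAS loop, and readers pay just two atomic loads per lookup in the common case.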


## Algorithmic Improvements for Fast Concurrent Cuckoo Hashing

### 0x11 Improve Concurrency

- P1: Avoid unnecessary or unintentional access to common data. This is standard practice and needs no elaboration.
- P2: Minimize the size and execution time of critical sections. Likewise.
- P3: Optimize the concurrency control mechanism. There is much more room to play with here; the paper also discusses optimizations based on HTM (Hardware Transactional Memory).


### 0x12 Algorithmic Optimizations

#### Breadth-first Search for an Empty Slot

This optimization is key to reducing the size of the critical section: While the total number of slots examined is still M, this is work that can be performed without a lock held. With BFS, however, at most five buckets must be examined and modified with the lock actually held, reducing both the duration of the critical section and the number of cache lines dirtied while doing so.
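The search itself is a plain BFS over buckets. Below is a toy sketch to show how the path is found without the lock held (the table layout, sizes, and alternate-bucket function are our own; libcuckoo's real search also records which slot to displace at every step):

```c
#include <stdint.h>

#define BUCKETS   16
#define SLOTS     4
#define MAX_NODES 256

static uint32_t table[BUCKETS][SLOTS];   // one key per slot, 0 = empty

// Hypothetical alternate bucket of the key stored in (b, s).
static int alt_of(int b, int s) {
    return (int)((table[b][s] * 7 + 1) % BUCKETS);
}

static int has_empty_slot(int b) {
    for (int s = 0; s < SLOTS; s++)
        if (table[b][s] == 0) return 1;
    return 0;
}

// BFS from the two candidate buckets toward the nearest bucket with an
// empty slot. Fills path[] (start bucket first) and returns its length,
// or -1 if nothing is found within MAX_NODES explored nodes.
int bfs_path(int b1, int b2, int path[], int max_len) {
    int bucket[MAX_NODES], parent[MAX_NODES];
    int head = 0, tail = 0;
    bucket[tail] = b1; parent[tail++] = -1;
    bucket[tail] = b2; parent[tail++] = -1;
    while (head < tail) {
        int cur = head++;
        if (has_empty_slot(bucket[cur])) {
            int len = 0;
            for (int n = cur; n != -1; n = parent[n]) len++;
            if (len > max_len) return -1;
            for (int n = cur, i = len; n != -1; n = parent[n]) path[--i] = bucket[n];
            return len;      // only these buckets are touched under the lock
        }
        for (int s = 0; s < SLOTS && tail < MAX_NODES; s++) {
            bucket[tail] = alt_of(bucket[cur], s);
            parent[tail++] = cur;
        }
    }
    return -1;
}
```

The locked phase then only has to re-validate and shift keys along the returned path, which is what bounds the critical section to a handful of buckets.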


#### Increase Set-associativity

1. It hurts lookup performance: more slots must be examined per lookup, and the data becomes more spread out, which is less cache-friendly;
2. It improves insert performance, since the probability that a bucket still has an empty slot increases.

#### Fine-grained Locking

Here we favor spinlocks using compare-and-swap over more general purpose mutexes. A spinlock wastes CPU cycles spinning on the lock while other writers are active, but has low overhead, particularly for uncontended access. Because the operations that our hash tables support are all very short and have low contention, very simple spinlocks are often the best choice.
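A minimal sketch of such a CAS-based spinlock using C11 atomics (a test-and-test-and-set variant; this is an illustration, not libcuckoo's actual lock type):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { _Atomic bool held; } spinlock_t;

void spin_lock(spinlock_t *l) {
    for (;;) {
        bool expected = false;
        // Compare-and-swap: acquire only if the lock is currently free.
        if (atomic_compare_exchange_weak(&l->held, &expected, true))
            return;
        // Spin on plain loads first to avoid cacheline ping-pong
        // between waiting cores.
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store(&l->held, false);
}
```

For the short, low-contention critical sections described above, this kind of lock avoids the syscall and queueing overhead of a general-purpose mutex.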


### 0x13 HTM

1. Data conflicts on transactionally accessed addresses;
2. Limited resources for transactional stores;
3. Use of TSX-unfriendly instructions or system calls, such as `brk`, `futex`, or `mmap`.


```c
#include <immintrin.h>  // Intel TSX/RTM intrinsics: _xbegin, _xend, _xabort

// spinlock_t and its helpers are placeholders for the fallback lock;
// running this also requires a TSX-capable CPU.
#define MAX_XBEGIN_RETRY 4
#define MAX_ABORT_RETRY  4
#define ABORT_LOCK_BUSY  0xff   // explicit abort code: fallback lock was held

void elided_lock_wrapper(spinlock_t *lock) {
    int xbegin_retry = 0, abort_retry = 0;
    while (xbegin_retry < MAX_XBEGIN_RETRY) {
        unsigned status = _xbegin();      // start transaction
        if (status == _XBEGIN_STARTED) {
            // Reading the lock puts it into the transaction's read-set,
            // so a later non-transactional acquire aborts us.
            if (lock_is_free(lock))
                return;                   // execute critical section in transaction
            _xabort(ABORT_LOCK_BUSY);     // abort: lock is busy
        }
        // The abort status tells us whether a retry has any chance.
        if (!(status & _XABORT_RETRY)) {
            if (abort_retry >= MAX_ABORT_RETRY)
                break;                    // no chance for a retry
            abort_retry++;
        }
        xbegin_retry++;
    }
    spin_lock(lock);                      // fall back to the real lock
}

void elided_unlock_wrapper(spinlock_t *lock) {
    if (lock_is_free(lock))
        _xend();                          // commit the transaction
    else
        spin_unlock(lock);                // we held the fallback lock
}
```


## References

1. B. Fan, D. G. Andersen, and M. Kaminsky. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In NSDI '13, 2013.
2. X. Li, D. G. Andersen, M. Kaminsky, and M. J. Freedman. Algorithmic Improvements for Fast Concurrent Cuckoo Hashing. In EuroSys '14, 2014.
3. B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. Workload Analysis of a Large-Scale Key-Value Store. In Proceedings of SIGMETRICS '12, 2012.