Dependably Fast Multi-Core In-Memory Transactions

Cicada: Dependably Fast Multi-Core In-Memory Transactions

0x00 引言

  On the TPC-C and YCSB benchmarks, Cicada outperforms Silo, TicToc, FOEDUS, MOCC, two-phase locking, Hekaton, and ERMIA in most scenarios, achieving up to 3X higher throughput than the next fastest design. It handles up to 2.07 M TPC-C transactions per second and 56.5 M YCSB transactions per second, and scans up to 356 M records per second on a single 28-core machine.


0x02 Multi-Clock Timestamp Allocation

• 在写入日志的时候，想要一个时间戳的全序的关系。Logger使用的是extended timestamps，这个还包含了一个era的信息，记录了时间戳回绕的次数；

0x03 Multi-Version Execution

... which serves existing thread-local versions when a transaction accesses the same records again . It provides consistency within a transaction by not losing earlier writes even if the application fails to reuse the pointer to the local version. Cicada finds earlier local writes using a lightweight thread-local hash table. ... If the application can ensure the reuse of local versions and/or no multiple accesses for the same record, it can instruct Cicada to bypass the duplicate access check.


0x04 Best-Effort Inlining

... To avoid creating a contention point at the inlined version, promotion only optimizes infrequently- or never-changing read-intensive records. If a record is frequently written, promoting a version of such a record will incur unnecessary write overhead. If the record is never read, promotion does not provide a performance benefit;


0x05 Serializable Multi-Version Validation and Optimizations

• Pending version installation，就是将写入集合中的数据添加到对应记录的版本链表中，这些数据的状态值为PENDING；
• Read timestamp update，在需要的时候更新数据的rts，用来确保(v.rts) ≥ (tx.ts)。
• Version consistency check，这里主要就是验证之前在读取集合中的可见版本在目前还是可见的，还有就是在写集合记录的数据版本满足(v.rts) ≤ (tx.ts)。这个步骤可以确保：1. 没有另外的一个事务写入了一个新的数据版本，从而改变这条记录可见的版本(一个事务获取时间戳的大小和在一个版本数据上执行的时间先后没有必然的关系)；2. 这个事务不会提交地太早，以致于违法了依赖于对于一条记录指定版本稳定的可见效的事务执行的一致性。

• Sorting the write set by contention，Cicada认为在写集合中，最新版本数据的wts越大，产生冲突的可能性越大，所以这里使用wts排序的方式，尽早暴露“冲突”。这个优化在冲突比较高的时候比较有效，冲突比较低的时候会增加额外的开销；

• Early version consistency check，这个优化也是在冲突比较高的时候比较有效。这个优化在排序写入集合之后就进行检查，方法和核心的验证方法是一样的。这个可以避免大部分添加进去的数据成为”垃圾“。

• Incremental version search，避免重复的版本的搜索，

 ... To reduce the cost of repeated search, the initial version search in the read phase remembers later_version whose wts is immediately later than (tx.ts). later_version is updated whenever a new version qualified as later_version is discovered during subsequent version search.


0x06 Indexing

Low index contention is one of the beneficial side effects of unifying Cicada’s transaction processing and index validation. Unlike many modern OCC designs that modify index data structures during the read phase of the transaction, Cicada’s multi-version indexes defer index updates until validating the transaction by keeping index node writes in thread-local memory;


0x07 Durability and Recovery

Checkpointing virtually partitions each table. For each record in a partition, checkpointers store the latest committed version in per-thread checkpoint files. This process happens asynchronously without taking locks. For safe memory access, checkpointers participate in maintaining min_rts; they frequently update (thread.rts) to avoid hindering min_rts increments.


0x08 Rapid Garbage Collection

... Cicada maintains a small per-record data structure containing a garbage collection lock and the minimum write timestamp (record.min_wts), separate from the main record metadata (the head), which is prefetched while creating a new garbage collection item for the record. The thread performs garbage collection for a committed version v if (a) acquiring the garbage collection lock succeeds and (b) (v.wts) > (record.min_wts).


0x09 Contention Regulation

... It then calculates the changes of the throughput and the maximum backoff time between the second-to-last period and the last period. If the gradient (the throughput change divided by the maximum backoff time change) is positive, it increases the maximum backoff time by a fixed amount (0.5 μs in our implementation); if the gradient is negative, it decreases the maximum backoff time by the same amount. If the gradient is zero or undefined (no change in the maximum backoff time), it chooses a direction at random.