# Optimistic Crash Consistency

### 0x00 Introduction

OptFS improves performance for many workloads, sometimes by an order of magnitude; we confirm its correctness through a series of robustness tests, showing it recovers to a consistent state after crashes. Finally, we show that osync() and dsync() are useful in atomic file system and database update scenarios, both improving performance and meeting application-level consistency demands.


### 0x01 Background

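• First, two characteristics of today's IO stack and hardware: 1. Buffering. Both the OS page cache and the disk's internal cache buffer writes to improve IO performance. The cost is that durability is not guaranteed in some situations; the usual remedies are synchronous IO and flush operations. 2. Reordering. After a process issues IO requests, the order in which they finally become durable is not necessarily the order in which they were issued, because both the OS and the hardware may reorder them. Applications that are sensitive to this order, such as some database file IO, must wait for earlier IO to complete before issuing later IO in order to preserve ordering.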

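• How current journaling file systems operate: a file system write generally has to write both data and metadata. The data write (D) usually goes first, followed by the write of the metadata to the journal (J-M) and then a commit block (J-C); once J-C is durable, the transaction is committed. Finally the metadata is updated in place (M); even if a failure occurs at this point, the update can be redone from the journal. The write is then complete (note this detail: the basic idea behind the asynchronous durability notifications discussed later is related to it). The order can therefore be written as $$D \to J_{M} \to J_{C} \to M$$ One available optimization is that the order between D and J-M can be relaxed, which gives $$D|J_{M} \to J_{C} \to M$$ When the checksum optimization is used (a checksum over the transaction, stored in J-C), J-M and J-C can be issued together, which gives $$D \to \overline{J_{M}|J_{C}} \to M$$ However, these two optimizations cannot be applied at the same time. In addition, transactions are ordered with respect to one another: $$Tx_{i} \to Tx_{i+1}$$ This is called pessimistic journaling. The basic way to guarantee that the previous operation's data is durable before the next operation starts is to issue a flush.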

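• The performance impact of flushes. Because storage devices at different levels of the hierarchy differ enormously in speed, flush operations are very expensive, as the accompanying figure shows.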
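To make the orderings above concrete, here is a minimal sketch that turns each ordering expression into a sequence of writes and flushes. It assumes, following the text, that every `→` is enforced with one flush and that writes joined by `|` can be issued together. The function name `flush_plan` and the string notation are illustrative only and do not come from the paper or from any file system's code.

```python
def flush_plan(ordering: str) -> list[str]:
    """Turn an ordering such as "D|JM -> JC -> M" into a list of I/O steps.

    Writes joined by "|" may be issued together; every "->" is modeled as a
    FLUSH barrier (an assumption of this sketch, not a measured behavior).
    """
    steps = []
    stages = [stage.strip() for stage in ordering.split("->")]
    for index, stage in enumerate(stages):
        for block in stage.split("|"):
            steps.append(f"WRITE {block.strip()}")
        # A flush separates this stage from the next one.
        if index < len(stages) - 1:
            steps.append("FLUSH")
    return steps


# The three pessimistic orderings discussed above.
ORDERINGS = {
    "baseline": "D -> JM -> JC -> M",
    "data/journal reordering": "D|JM -> JC -> M",
    "transactional checksum": "D -> JM|JC -> M",
}

for name, ordering in ORDERINGS.items():
    plan = flush_plan(ordering)
    print(f"{name}: {plan.count('FLUSH')} flushes, steps = {plan}")
```

Under this model the baseline needs three flushes per transaction and each optimization removes one, which is why relaxing ordering constraints matters when flushes are expensive.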

### 0x02 Probabilistic Crash Consistency

The paper carefully quantifies the inconsistency that can arise here [1]. That analysis is not covered in this post for now.

... refers to specifically is two orderings: JM → JC and JC → M. In the first case, Ts’o notes that the disk is likely to commit JC to disk after JM even without an intervening flush (note that this is without the presence of transactional checksums) due to layout and scheduling; disks are simply unlikely to reorder two writes that are contiguous. In the second case, Ts’o notes that JC → M often holds without a flush due to time; the checkpoint traffic that commits M to disk often occurs long after the transaction has been committed, and thus ordering is preserved without a flush.


### 0x03 Optimistic Crash Consistency

1. Checksums can detect whether data was written completely. This check is generally performed during crash recovery:

Optimistic crash consistency eliminates the need for ordering during transaction commit by generalizing metadata transactional checksums to include data blocks. During recovery, transactions are discarded upon checksum mismatch.


... Fortunately, this delay does not affect application performance, as applications block until the transaction is committed, not until it is checkpointed. Additional techniques are required for correctness in scenarios such as block reuse and overwrite.


#### Properties of Optimistic Consistency

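• In the example figure, D, J-M, and J-C of Tx0 have all become durable, so this write can be considered committed and the M operation can proceed;

• For Tx1, even though its other blocks are durable, the data written by D is not. Recovery will find that the checksum does not match and so know that the transaction did not complete, even though the related metadata is already durable. Tx2 can proceed in parallel with Tx1; any of its blocks that failed to become durable will likewise be detected during recovery;

• For Tx3, even though all of its blocks are durable, it still cannot be treated as durable because the transactions before it are not: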

Even if the file system is notified that D:3, JM:3, and JC:3 are all durable, the checkpoint of M:3 cannot yet be initiated because essential writes in Tx:1 and Tx:2 are not durable (namely, D:1 and JC:2). Tx:3 cannot be made durable until all previous transactions are guaranteed to be durable; therefore, its metadata M:3 cannot be checkpointed.
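The rule in this example can be sketched as a small check over the durability notifications received so far. The function name `checkpointable` and the dictionary layout are hypothetical, chosen only to mirror the Tx0 to Tx3 example; the actual OptFS bookkeeping is described in the paper [1].

```python
def checkpointable(
    durable: dict[int, set[str]],
    required: frozenset = frozenset({"D", "JM", "JC"}),
) -> list[int]:
    """Return the transaction ids whose metadata M may be checkpointed.

    A transaction qualifies only if its own D, JM and JC are durable and
    every earlier transaction qualifies as well.
    """
    ready = []
    for tx in sorted(durable):
        if not required <= durable[tx]:
            break  # this transaction and all later ones must wait
        ready.append(tx)
    return ready


# Durability notifications received so far, mirroring the example above.
notified = {
    0: {"D", "JM", "JC"},
    1: {"JM", "JC"},       # D:1 is not yet durable
    2: {"D", "JM"},        # JC:2 is not yet durable
    3: {"D", "JM", "JC"},  # fully durable, but blocked by Tx1 and Tx2
}
print(checkpointable(notified))  # [0]
```

Once the notifications for D:1 and JC:2 arrive, the same call returns all four transactions, which matches the in-order behavior the excerpt describes.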


#### Techniques Used

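• In-Order Journal Recovery ensures that recovery proceeds in logical order (which is not necessarily the physical order). Once a transaction is found to be uncommitted, the transactions after it are ignored as well, even if their data is fully durable;

• In-Order Journal Release: because writes may be reordered, the release step must make sure it does not free journal data that is still needed;

• Checksums: the common metadata transactional checksumming can relax the ordering between J-M and J-C. A similar but more complex method, data transactional checksumming, is the one Optimistic Consistency uses. The basic idea is to compute checksums over the data and store them in J-C: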

With the data checksums and their on-disk block addresses stored in JC, the journal recovery process can abort transactions upon mismatch. Thus, data transactional checksums enable optimistic journaling to ensure that metadata is not checkpointed if the corresponding data was not durably written.


When the file MB must be allocated a new data block, the optimistic file system allocates a “durably-free” data block that is known to not be referenced by any other files; finding a durably-free data block is straight-forward given the proposed asynchronous durability notification from disks.

• Selective Data Journaling, an optimization for update-in-place writes:

  Data journaling places both metadata and data in the journal and both are then updated in-place at checkpoint time. The attractive property of data journaling is that in-place data blocks are not overwritten until the transaction is checkpointed; therefore, data blocks can be reused if their metadata is also updated in the same transaction. The disadvantage of data journaling is that every data block is written twice (once in the journal, JD, and once in its checkpointed in-place location, D) and therefore often has worse performance than ordered journaling.
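In-order journal recovery and data transactional checksums, two of the techniques listed above, can be combined in a short sketch. It assumes CRC32 as the checksum, a dictionary standing in for the disk, and a hypothetical `Transaction` record holding what J-C would store (data block addresses plus checksum); none of these names come from OptFS itself.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Transaction:
    tx_id: int
    data_addrs: list[int]  # on-disk addresses of the data blocks (kept in JC)
    checksum: int          # checksum over those data blocks (kept in JC)


def data_checksum(blocks: list[bytes]) -> int:
    """Chain CRC32 over all data blocks (CRC32 is an assumption here)."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def recover(journal: list[Transaction], disk: dict[int, bytes]) -> list[int]:
    """Replay transactions in logical order, stopping at the first mismatch."""
    replayed = []
    for tx in sorted(journal, key=lambda t: t.tx_id):
        blocks = [disk.get(addr, b"") for addr in tx.data_addrs]
        if data_checksum(blocks) != tx.checksum:
            break  # discard this transaction and everything after it
        replayed.append(tx.tx_id)
    return replayed


# Simulated crash: the data block of transaction 1 never reached the disk,
# so block 11 still holds stale contents.
disk = {10: b"aaaa", 11: b"bbbb", 12: b"cccc"}
journal = [
    Transaction(0, [10], data_checksum([b"aaaa"])),
    Transaction(1, [11], data_checksum([b"BBBB"])),
    Transaction(2, [12], data_checksum([b"cccc"])),
]
print(recover(journal, disk))  # [0]
```

Transaction 2 is intact on disk but is still discarded, because recovery must not apply a transaction whose predecessor was rejected.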


#### Durability and Consistency

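Optimistic Consistency uses the techniques above to make operations obey a certain order, and in this way achieves consistency across crashes, but it does not guarantee durability. Sometimes an application needs only one of ordering and durability, whereas existing methods provide either both or neither. The paper therefore proposes two new interfaces: 1. osync() guarantees ordering between writes; 2. dsync() guarantees that data is durable.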

Now consider when every write is followed by dsync(), i.e., W1,d1,W2,d2,...,Wn,dn. If a crash happens after di, the file system will recover to a state with W1,W2,...,Wi applied.

If every write was followed by osync(), i.e., W1,o1,W2,o2,...,Wn,on, and a crash happens after oi, the file system will recover to a state with W1,W2,...,Wi−k applied, where the last k writes had not been made durable before the crash. We term this eventual durability. Thus osync() provides prefix semantics.
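An atomic file update is one scenario where ordering alone is enough. The sketch below shows the familiar write-temp-then-rename pattern with the sync call made pluggable. The `order_sync` parameter is a hypothetical stand-in for osync(), which exists only in OptFS; the default `os.fsync` is used so the code runs on a stock system, and it provides durability as well as ordering, so it is stronger (and slower) than what the pattern requires.

```python
import os
import tempfile


def atomic_update(path: str, data: bytes, order_sync=os.fsync) -> None:
    """Replace `path` with `data` so a crash leaves the old or the new file.

    `order_sync` stands in for OptFS's osync(); os.fsync is only a
    placeholder because osync() is not available on stock kernels.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()                # push Python's buffer to the OS
        order_sync(tmp.fileno())   # data must be ordered before the rename
    os.replace(tmp_path, path)     # atomically switch to the new contents


# Usage: update a file inside a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "settings.conf")
    atomic_update(target, b"old contents")
    atomic_update(target, b"new contents")
    with open(target, "rb") as f:
        print(f.read())  # b'new contents'
```

With osync() in place of fsync(), the application would not block on a flush, yet a crash would still expose either the old file or the complete new one, which is the prefix semantics described in the excerpt above.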


### 0x04 OptFS Implementation

The paper also discusses the concrete implementation of the design ideas and techniques mentioned above; see [1].

## References

1. Optimistic Crash Consistency, SOSP’13.
2. Barrier-Enabled IO Stack for Flash Storage, FAST’18.