# Can Applications Recover from fsync Failures?

## Can Applications Recover from fsync Failures?

### 0x00 基本内容

Single Block Update:
open, lseek(4K), write(4K), fsync, fsync, sleep(40), close.

Multi Block Append:
write(4K), fsync, sleep(40), write(4K), fsync, sleep(40), close.


Basics of fsync Failures:
Q2 Is metadata persisted if a data block fails?
Q3 Does the file system retry failed block writes?
Q4 Are failed data blocks marked clean or dirty in memory? Q5 Does in-memory page content match what is on disk?

Failure Reporting:
Q6 Which future fsync will report a write failure?
Q7 Is a write failure logged in the syslog?

After Effects of fsync Failure:
Q8 Which block failures lead to file-system unavailability?
Q9 How does unavailability manifest? Does the file system shutdown, crash, or remount in read-only mode?
Q10 Does the file suffer from holes or block overwrite failures? If so, in which parts of a file can they occur?

Recovery:
Q11 If there is any inconsistency introduced due to fsync failure, can fsck detect and fix it?


### 0x01 实例

• ext4下面的Multi Block Append，写入一个data block失败时候同样会导致page cache中对应的block标记为clean。FS在执行metadata checkpoint操作的时候，会将inode table 和 block bitmap的更新写入到磁盘。即使是数据没有被实际写入，元数据也会引用到这些blocks。后面的write操作也会看到之前的错误，但是后面的操作会正常。这样读取的时候可能读取到一个block之前的数据而不是这次写入的数据。对于journal block写入的失败同样会导致fs进入readonly模式。ext4提供一些挂载参数来处理一些错误，paper中发现这些并没有什么用，

... to abort the journal if an error occurs in a file data buffer (mount option data_err=abort) and remount the file system in read-only mode on an error (mount option errors=remount-ro). However, we observe that the results are identical with and without the mount options.

• 在data模式下面，ext4 fsync的一些错误处理和ordered下面类似，但是有一个区别。data模式下面数据也先写入日志，这样写data block错误的情况下也可能写入是成功的，这样对于上面的Single Block Update操作，可能是第二个fsync返回错误，而不是第一个。

• XFS的Multi Block Append操作，第一个data block写入失败之后，后面的fsync操作就会失败，metadata也不会被写入到日志中。后面继续写入数据的时候，也会造成引用到一个前面失败写入操作的non-overwritten block。但是在XFS，最后面的一些写入失败不会和ext4中的问题，这里不会更新metdata。最后一个失败的append操作不会导致文件的size变化，但是可能其它的元数据字段变化，

Thus, while in ext4 a failed write always causes a non-overwritten block, in XFS, non-overwritten blocks cannot exist at the end of a file. However, for either file system, if the failed blocks remain in the page cache, applications can read those blocks regardless of whether they are in the middle or the end of a file.


### 0x02 Application Study

• OldValue (OV): The system returns the new value for a while but then reverts to an old value, or the system conveys a successful response but returns the old value later on.
• FalseFailure (FF): The system informs the user that the operation failed but returns the new value in the future.
• KeyCorruptions (KC) and ValueCorruptions (VC): Corrupted keys or values are obliviously returned.
• KeyNotFound (KNF): The system informs the user that it has successfully inserted a key but it cannot be found later on, or the system fails to update a key to a new value but the old key-value pair disappears as well.


App=KeepGoing: The application continues without restarting.
App=Restart: The application restarts either after a crash or a graceful shutdown. This forces the application to rebuild in-memory state from disk.
BufferCache=Keep: No evictions take place.
BufferCache=Evict: One or more clean pages are evicted.


• Redis，Redis没有处理fsync失败的问题，认为下层的存储是可靠的。Redis中各种错误都可能发生；
• LMDB，LMDB使用一些CoW的策略，处理得比Redis好。可能发生一些ValueCorruptions的错误( ValueCorruptions and OldValue errors occur on BufferCache=Evict, regardless of whether the application restarts or not)。
• LevelDB，可能有FalseFailures错误(only on App=Restart with BufferCache=Keep)。FalseFailures错误是因为写入log entry失败的时候给用户返回错误，重启的时候，log entry还在page cache中，LevelDB会从这些entries中创建SST文件，而导致。在log file有问题的时候，可能有KeyNotFound 和 OldVersion 错误(BufferCache=Evict and App=Restart, LevelDB rejects the corrupted log entry and returns the old value for future read operations)。
• SQLite，在rollback journal模式和SQLite WAL模式下面都可能发生错误，WAL模式下面表现更好一些；
• PostgreSQL，在下面的table中看来，PostgreSQL是表现得最好的。在default模式即WAL使用page cache的情况下，可能有FalseFailure的问题。 和前面的app一样，OldVersion 和 KeyNotFound Error其实一般由同样的原因导致，错误不同看key之前存在与否。在使用DirectIO模式下面，WAL不使用page cache，不会有FalseFailures的错误。

### 0x03 Discussion

• Existing file systems do not handle fsync failures uniformly，现在的文件系统都没有很好地处理fsync错误，每个文件系统都有些区别，不同的模式下面的也有区别；

• Copy-on-Write file systems such as Btrfs handle fsync failures better than existing journaling file systems like ext4 and XFS，CoW的文件系统处理这些问题更好一些，上面的table也显示了在Btrfs下面app发生错误的可能是最少的；

• Ext4 data mode provides a false sense of durability，ext4提供的一致只是对于fs本身而言的，应用不能知道从fsync中接受到的错误是前面那一次操作导致的；

• Existing file-system fault-injection tests are devoid of workloads that continue to run post failure，

 We believe that file-system developers should also test workloads that continue to run post failure, and see if the effects are as intended.

• Application developers write OS-specific code, but are not aware of all OS-differences,

We hope that the Linux file-system maintainers will adopt a similar approach in an effort to han- dle fsync failures uniformly across file systems. Note that it is also important to think about when to classify whether a device has been removed.

• Application developers do not target specific file systems；

• Applications employ a variety of strategies when fsync fails, but none are sufficient，这些app处理fsync错误的策略都有一些问题；

• Applications run recovery logic that accesses incorrect data in the page cache，恢复操作的时候最好避免从page cache中读取数据，而是读磁盘上面的内容；

• Application recovery logic is not tested with low level block faults。

## 参考

1. Can Applications Recover from fsync Failures?, ATC ‘20.