# Split-Level I/O Scheduling

### 0x00 Introduction

Our Actually Fair Queuing scheduler reduces priority-misallocation by 28×; our Split-Deadline scheduler reduces tail latencies by 4×; our Split-Token scheduler reduces sensitivity to interference by 6×. We show that the framework is general and operates correctly with disparate file systems (ext4 and XFS).


### 0x01 Basic Idea

Furthermore, memory notifications make schedulers aware of write work as soon as possible (not tens of seconds later when writeback occurs). Finally, split schedulers can prevent file systems from imposing orderings that are contrary to scheduling goals.


### 0x02 Split Framework Design

#### Cause Mapping

Processes P1 and P2 both dirty the same data page, so the page’s tag includes both processes in its set. Later, a writeback process, P3, writes the dirty buffer to disk. In doing so, P3 may need to dirty the journal and metadata, and will be marked as a proxy for {P1, P2}. Thus, P1 and P2 are considered responsible when P3 dirties other pages, and the tag of these pages will be marked as such. The tag of P3 is cleared when it finishes submitting the data page to the block level.
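The P1/P2/P3 example above can be traced with a small model. This is a minimal sketch, not the paper's kernel code: the `CauseTracker` class, its method names, and the string page identifiers are all hypothetical, chosen only to show how a tag set and a proxy state interact.

```python
class CauseTracker:
    """Toy model of cause tags and proxy state (all names are hypothetical)."""

    def __init__(self):
        self.page_causes = {}  # page id -> set of responsible processes
        self.proxy_for = {}    # proxy process -> processes it stands in for

    def _effective_causes(self, proc):
        # While a process is a proxy, its work is charged to the processes
        # it stands in for; otherwise it is charged to itself.
        return set(self.proxy_for.get(proc) or {proc})

    def dirty(self, proc, page):
        causes = self.page_causes.setdefault(page, set())
        causes.update(self._effective_causes(proc))

    def begin_writeback(self, proxy, page):
        # The writer becomes a proxy for everyone tagged on the page.
        self.proxy_for[proxy] = set(self.page_causes.get(page, set()))

    def end_writeback(self, proxy):
        # The proxy state is cleared once the page has been submitted.
        self.proxy_for.pop(proxy, None)


tracker = CauseTracker()
tracker.dirty("P1", "data")
tracker.dirty("P2", "data")
tracker.begin_writeback("P3", "data")
tracker.dirty("P3", "journal")   # charged to P1 and P2, not P3
tracker.end_writeback("P3")
tracker.dirty("P3", "other")     # P3 is no longer a proxy
print(tracker.page_causes)
```

After the run, the `journal` page carries the tag {P1, P2}, while the page P3 dirties after its proxy state is cleared is tagged with P3 itself.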


#### Cost Estimation

Our framework exposes hooks at both the memory and block levels, enabling each scheduler to handle the trade-off in the manner most suitable to its goals. Schedulers may even utilize hooks at both levels. For example, Split-Token promptly guesses write costs as soon as buffers are dirtied, but later revises that estimate when more information becomes available (e.g., when the dirty data is flushed to disk).
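The "guess early, revise later" pattern described for Split-Token can be sketched as a token account that charges a provisional cost at the memory level and corrects it at the block level. This is only an illustration under assumptions: the `TokenAccount` class, the one-token-per-byte guess, and the cost numbers are made up and do not come from the paper.

```python
class TokenAccount:
    """Toy token account with a provisional charge that is revised later."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pending = {}  # buffer id -> provisional charge

    def on_buffer_dirtied(self, buf, nbytes, guess_per_byte=1.0):
        # Memory level: charge a prompt guess as soon as the buffer is dirtied.
        charge = nbytes * guess_per_byte
        self.pending[buf] = charge
        self.tokens -= charge

    def on_block_write(self, buf, actual_cost):
        # Block level: replace the guess with the better-informed cost.
        provisional = self.pending.pop(buf, 0.0)
        self.tokens -= actual_cost - provisional


account = TokenAccount(tokens=10000.0)
account.on_buffer_dirtied("b1", 4096)
after_guess = account.tokens
account.on_block_write("b1", actual_cost=6144.0)
after_revision = account.tokens
print(after_guess, after_revision)
```

Here the first charge is 4096 tokens; when the write turns out to cost 6144, only the 2048-token difference is charged on top, so the total always matches the latest estimate.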


### 0x03 Split Scheduling in Linux

The paper implements the framework in Linux and integrates it with ext4 and XFS. The main pieces of work are the following:

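• Cross-Layer Tagging: Linux I/O passes through different calls and different data structures at each layer, so a causes tag is attached to a request to track it across layers. Writeback (for example ext4_da_writepages in ext4, where "da" stands for "delayed allocation", an ext4 feature) and journal writes both write the pages within a range of a file, and both need extra handling: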

We modify this function so that as it does allocation for the pages, it sets the writeback thread’s proxy state as appropriate. For the journal proxy, we modify jbd2 (ext4’s journal) to keep track of all tasks responsible for adding changes to the current transaction.

• Scheduling Hooks: the hooks sit at the three levels mentioned earlier:

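1. System Call: through these hooks the scheduler can intercept write requests; read requests are not intercepted. The scheduler can delay the process that issued an intercepted request. Metadata operations and synchronization operations (creat, mkdir, fsync) are also exposed to the scheduler. The caller stays blocked until the scheduler lets the syscall proceed.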

2. Memory: these hooks expose page-cache internals, namely when a data page is dirtied and when a page is deleted.

Schedulers can either rely on Linux to perform writeback and throttle write system calls to control how much dirty data accumulates before writeback, or they can take complete control of the writeback.

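3. Block: these hooks report when a request is added to the block layer and when a request completes. Linux already provides many hooks at this level, such as request merging, and these remain supported.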
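The three hook levels can be pictured as one callback interface that a scheduler implements. The sketch below is a user-space illustration only: `SplitScheduler`, `DirtyCapScheduler`, and every method name are hypothetical and are not the kernel interface from the paper. The toy scheduler uses the memory and block hooks to count dirty pages per cause and uses the system-call hook to hold a process's writes once it reaches a cap.

```python
class SplitScheduler:
    """Hypothetical callback interface; one group of hooks per level."""

    # System-call level: return True to let the call proceed, False to hold it.
    def on_write_entry(self, proc, nbytes):
        return True

    # Memory level: page-cache notifications.
    def on_page_dirtied(self, page, causes):
        pass

    def on_page_freed(self, page):
        pass

    # Block level: request lifecycle notifications.
    def on_request_added(self, pages, causes):
        pass

    def on_request_completed(self, pages):
        pass


class DirtyCapScheduler(SplitScheduler):
    """Holds a process's writes while it is a cause of too many dirty pages."""

    def __init__(self, cap_pages):
        self.cap_pages = cap_pages
        self.dirty = {}  # page id -> set of cause processes

    def on_write_entry(self, proc, nbytes):
        owned = sum(1 for causes in self.dirty.values() if proc in causes)
        return owned < self.cap_pages

    def on_page_dirtied(self, page, causes):
        self.dirty.setdefault(page, set()).update(causes)

    def on_page_freed(self, page):
        self.dirty.pop(page, None)

    def on_request_completed(self, pages):
        # Pages that reached the disk are clean again.
        for page in pages:
            self.dirty.pop(page, None)


sched = DirtyCapScheduler(cap_pages=2)
decisions = [sched.on_write_entry("A", 4096)]      # nothing dirty yet
sched.on_page_dirtied(1, {"A"})
sched.on_page_dirtied(2, {"A"})
decisions.append(sched.on_write_entry("A", 4096))  # A is at its cap
decisions.append(sched.on_write_entry("B", 4096))  # B is unaffected
sched.on_request_completed([1])
decisions.append(sched.on_write_entry("A", 4096))  # one page was flushed
print(decisions)
```

The point of the design is that no single level has enough information: the cap is enforced at the system-call level, but it can only be computed from what the memory and block levels report.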

With the default Linux settings, average overhead is 14.5 MB (0.2% of total RAM); the maximum is 23.3 MB. Most tagging is on the write buffers; thus, a system tuned for more buffering should have higher tagging overheads. With a 50% dirty ratio [5], maximum usage is still only 52.2 MB (0.6% of total RAM).


### 0x04 Scheduler Case Studies

##### Actually Fair Queuing Design

This design allows reads to hit the cache while protecting writes from journal entanglement. Beneath the journal, low-priority blocks may be prerequisites for high-priority fsync calls, so writes at the block level are dispatched immediately.
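One way to read the passage above is as a routing table: which operations are held for scheduling at which level. The sketch below encodes that reading and nothing more. It assumes that writes are held at the system-call level (above the journal) and that reads are scheduled at the block level (below the cache); that split is inferred from the quoted sentences rather than stated in them. The function name and return values are hypothetical, and the fair-queuing policy itself is not modeled.

```python
def afq_action(level, op):
    """Return what this toy model does with an operation at a given level.

    "queue"    -> held and ordered by the scheduler at this level
    "pass"     -> allowed through to the next level untouched
    "dispatch" -> sent to the device immediately
    """
    if level == "syscall":
        # Assumption: writes are scheduled here, before the journal can
        # entangle them; reads pass through so they can hit the cache.
        return "queue" if op == "write" else "pass"
    if level == "block":
        # Reads that missed the cache are scheduled here. Writes are
        # dispatched immediately, since a low-priority block may be a
        # prerequisite for a high-priority fsync.
        return "queue" if op == "read" else "dispatch"
    raise ValueError(f"unknown level: {level}")


for level in ("syscall", "block"):
    for op in ("read", "write"):
        print(level, op, afq_action(level, op))
```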


## References

1. Split-Level I/O Scheduling, SOSP’15.