# FlashShare -- Punching Through Server Storage Stack

### 0x00 Introduction

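• Kernel-level enhancement. Linux currently serves high-performance SSDs of this kind through the multi-queue block layer (blk-mq). blk-mq includes many optimizations for SSD-class hardware, but it does not take I/O request priority into account. FlashShare first modifies the kernel so that latency-sensitive I/O requests can bypass some of the existing kernel processing and go directly to the NVMe driver layer.
• New interrupt services for ULL SSDs. FlashShare argues that the current interrupt-based approach (message-signaled interrupts) suffers long latency caused by the storage stack, whereas polling consumes a large amount of CPU. FlashShare takes a middle path with a selective interrupt service routine (Select-ISR): polling for online interactive applications and interrupts for offline applications.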

We also revise the memory controller and I/O bridge model of the framework, and validate the simulator with a real 800GB Z-SSD prototype. The evaluation results show that FLASHSHARE can reduce the latency of I/O stack and the number of system context switch by 73% and 42%, respectively, while improving SSD internal cache hit rate by 37% in the co-located workload execution. These in turn shorten the average and 99th percentile request turnaround response times of the servers co-running multiple applications (from an end-user viewpoint) by 22% and 31%, respectively.


### 0x01 Background

#### Kernel Storage Stack

When the I/O request is completed by the SSD, it sends a message signaled interrupt (MSI) that directly writes the interrupt vector of each core’s programmable interrupt controller. The interrupted core executes an ISR associated with the vector’s interrupt request (IRQ). Subsequently, the NVMe driver cleans up the corresponding entry of the target SQ/CQ and returns the completion results to its upper layers, such as blk-mq and filesystem.


#### Device Firmware Stack

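Internally, an SSD is a small special-purpose computer system. One of its key components is the NVMe controller, which manages the built-in cache and the FTL beneath it. The built-in cache is simply DRAM, and high-end SSDs carry several GB of it. The NVMe controller can either accept requests pushed by the host or pull them from the host itself.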

### 0x02 Kernel Layer Enhancement

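FlashShare argues that it is hard for the kernel to identify the latency sensitivity of an I/O request directly, so it hands this job to the application. It adds two Linux system calls, plus a corresponding attribute in struct task_struct (the Linux process control block, PCB), to mark the process.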
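The tagging idea can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `Task`, `Workload`, `sys_set_workload`, `sys_get_workload`, and `submit_io` are hypothetical names, not the system calls or kernel fields FlashShare actually defines. It shows the application setting an attribute on its process record and every I/O request issued by that process inheriting it.

```python
from dataclasses import dataclass
from enum import Enum


class Workload(Enum):
    OFFLINE = 0  # throughput-oriented; assumed default for untagged processes
    ONLINE = 1   # latency-sensitive


@dataclass
class Task:
    """Stand-in for the per-process record (task_struct in Linux)."""
    pid: int
    workload: Workload = Workload.OFFLINE


@dataclass
class IORequest:
    lba: int
    op: str
    workload: Workload  # copied from the issuing task


TASKS = {}  # pid -> Task; stand-in for the kernel's process table


def sys_set_workload(pid, workload):
    """Hypothetical 'set' system call: the application declares its own class."""
    TASKS[pid].workload = workload


def sys_get_workload(pid):
    """Hypothetical 'get' system call: read the attribute back."""
    return TASKS[pid].workload


def submit_io(pid, lba, op):
    """Every request carries the attribute of the process that issued it."""
    return IORequest(lba=lba, op=op, workload=sys_get_workload(pid))
```

With this shape, lower layers never have to guess: they read `workload` from the request and pick a code path accordingly.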

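• blk-mq. The kernel block layer normally tries to merge and reorder I/O requests, which is sometimes unsuitable for ultra-low-latency SSDs or NVM because it adds latency to each request. (By default Linux does not dispatch an I/O request immediately; it waits briefly to increase the chance of merging and batching requests, which improves efficiency.) FlashShare instead lets latency-sensitive requests bypass the multiple service layers shown in Figure 4 and go directly to the NVMe driver.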

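The problem with such a simple bypass is that it can cause a hazard. For example, a non-sensitive I/O request is still being processed in blk-mq when a latency-sensitive I/O request arrives that targets the same logical block. In this case the request cannot simply be bypassed and needs special handling: if the two requests have different operation types (read vs. write), blk-mq submits them one after the other in order; if they have the same type, blk-mq merges them and forwards the result to the NVMe driver.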
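The hazard rule above can be sketched as a small dispatcher model. This is an illustrative simplification, not FlashShare's kernel code: `Request`, `Dispatcher`, and the string labels are hypothetical, conflicts are detected by exact LBA match only, and "merge" is represented by a single combined submission.

```python
from dataclasses import dataclass


@dataclass
class Request:
    lba: int
    op: str          # "read" or "write"
    sensitive: bool  # True for latency-sensitive (online) requests


class Dispatcher:
    def __init__(self):
        self.pending = []     # non-sensitive requests still inside blk-mq
        self.driver_log = []  # what reaches the NVMe driver, in order

    def submit(self, req):
        if not req.sensitive:
            # Normal path: stays in blk-mq for merging and reordering.
            self.pending.append(req)
            return "queued"

        # Look for a pending request that targets the same logical block.
        conflict = next((p for p in self.pending if p.lba == req.lba), None)
        if conflict is None:
            # No hazard: skip the block layer entirely.
            self.driver_log.append(("bypass", req.lba, req.op))
            return "bypass"

        self.pending.remove(conflict)
        if conflict.op != req.op:
            # Different types: submit back to back, older request first.
            self.driver_log.append(("ordered", conflict.lba, conflict.op))
            self.driver_log.append(("ordered", req.lba, req.op))
            return "ordered"

        # Same type: merge into one submission.
        self.driver_log.append(("merged", req.lba, req.op))
        return "merged"
```

For instance, a sensitive read that hits a block with a pending write is reported as `"ordered"`, and the write reaches the driver log first, so the read observes the written data.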

### 0x03 Selective Interrupt Service Routine

Select-ISR is designed to address the problems with interrupts and polling described earlier:

With Select-ISR, the CPU core can be released from the NVMe driver through a context switch (CS), if the request came from offline applications. Otherwise, blk-mq invokes the polling mechanism, blk_poll(), after recording the tag of the I/O service along with online applications. blk_poll() continues to invoke nvme_poll(), which checks whether a valid completion entry exists in the target NVMe CQ. If so, blk-mq disables IRQ of such CQ so that MSI cannot hook the procedures of blk-mq later again. nvme_poll() then looks up the CQ for a new entry by checking the CQ’s phase tags.
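The flow in this passage can be modeled in a few lines. The sketch below is a user-space toy, not the kernel's `blk_poll()` or `nvme_poll()`: `CompletionQueue` and `select_isr` are hypothetical names, and `max_spins` is an assumed bound added so the loop terminates. It does follow the standard NVMe phase-tag convention, in which the device flips the phase bit it writes each time it wraps around the queue, and the host treats an entry as new only when its phase bit equals the phase the host currently expects.

```python
class CompletionQueue:
    def __init__(self, depth):
        self.depth = depth
        self.entries = [(None, 0)] * depth    # (command tag, phase bit)
        self.dev_tail, self.dev_phase = 0, 1  # device-side write position
        self.head, self.phase = 0, 1          # host-side read position
        self.irq_enabled = True

    def device_complete(self, tag):
        """Device posts a completion and flips its phase on wrap-around."""
        self.entries[self.dev_tail] = (tag, self.dev_phase)
        self.dev_tail += 1
        if self.dev_tail == self.depth:
            self.dev_tail, self.dev_phase = 0, self.dev_phase ^ 1

    def has_new_entry(self):
        """An entry is new when its phase bit matches the expected phase."""
        return self.entries[self.head][1] == self.phase

    def pop(self):
        """Consume the entry at head and flip the expected phase on wrap."""
        tag, _ = self.entries[self.head]
        self.head += 1
        if self.head == self.depth:
            self.head, self.phase = 0, self.phase ^ 1
        return tag


def select_isr(cq, online, max_spins=1000):
    if not online:
        # Offline request: yield the core and let the MSI deliver completion.
        return ("interrupt", None)
    for _ in range(max_spins):
        if cq.has_new_entry():
            cq.irq_enabled = False  # keep the MSI from handling this CQ again
            return ("poll", cq.pop())
    return ("poll", None)           # nothing completed within the spin budget
```

The phase bit is what lets the host poll without any extra handshake: stale entries from the previous lap carry the old phase and are ignored.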


We propose an I/O-stack accelerator. Figure 7 shows how our I/O-stack accelerator is organized from the hardware and software viewpoints. This additional enhancement migrates the management of the software and hardware queues from blk-mq to an accelerator attached to a PCIe. This allows a bio generated by the upper file system to be directly converted into a nvm_rw_command. Especially, the accelerator searches a queue entry with a specific tag index and merges bio requests on behalf of a CPU core.


### 0x04 Firmware-Level Enhancement

In addition, the I/O patterns and locality of online applications are typically different from those of offline applications. That is, a single generic cache access policy cannot efficiently manage I/O requests from both online and offline applications.


