Arrakis: The Operating System Is the Control Plane
买不到票的人回不了家。
引言
这篇文章是OSDI 2014上面关于利用虚拟化优化网络和存储IO的文章,主要现在的新硬件上的系统优化。很值得一看。
We demonstrate that operating system protection is not contradictory with high performance. For our prototype implementation, a client request to the Redis persistent NoSQL store has 2× better read latency, 5× better write la- tency, and 9× better write throughput compared to Linux.
Arrakis主要做了一下的事情(直接给原文吧,这里翻译一下也没有意思):
• Wegiveanarchitectureforthedivisionoflaborbetween the device hardware, kernel, and runtime for direct network and disk I/O by unprivileged processes, and we show how to efficiently emulate our model for I/O devices that do not fully support virtualization.
• We implement a prototype of our model as a set of modifications to the open source Barrelfish operating system, running on commercially available multi-core computers and I/O device hardware
• Weuseourprototypetoquantifythepotentialbenefits of user-level I/O for several widely used network services, including a distributed object cache, Redis, an IP-layer middlebox, and an HTTP load balancer (§4). We show that significant gains are possible in terms of both latency and scalability, relative to Linux, in many cases without modifying the application programming interface; additional gains are possible by changing the POSIX API.
目前的问题
目前的系统存在诸多的overhead,这些影响了系统的性能,比如:
- Network stack costs;
- Scheduler overhead;
- Kernelcrossings;
- Copying of packet data;
如上面的表给所示,Linux的诸多overhead在Arrakis中被很大程度地将小甚至消除了。
基本设计
现在Linux的network stack的基本架构示意图:
Arrakis的几个设计目标:
- 最小化内核在控制面中的参与;这样时为了消除kernel的限制和overhead,在保证安全和隔离的同时,IO请求被直接路由到用户空间处理,进程的IO请求也是用户空间直接面向设备;
- 对应用编程透明;Arrakis试图在显著提高性能的同时保持兼容性。当然也可以使用不兼容的API更好的提高性能;
- 合适的OS/硬件抽象;Arrakis的抽象应该满足大部分的IO模式,在多核机器上有着良好的拓展性,同时满足应用对局部性和负载均衡的要求;
Arrakis的基本设计:
可以发现这里的最大的特点就说是应用面向的都是虚拟化的硬件设备,每个应用通过在user space的libos和下面的VNIC和VSIC交互,同时有一个全局的Control Plane控制着这些设备。在Arrakis中,通过使用SR-IOV,IOMMU等的技术,实现在应用层面的直接访问IO设备。这些功能是需要硬件具有相关的功能。
To make user-level I/O stacks tractable, we need a hardware-independent device model and API that captures the important features of SR-IOV adapters; a hardware-specific device driver matches our API to the specifics of the particular device.
Hardware Model
对于Arrakis提供的”硬件“,它们都是虚拟化的(virtual network interface cards (VNICs) , virtual storage interface controllers (VSICs) ):
-
队列,每个VIC包含了对个DMA传输队列,可以直接从user space放送和接收数据。这些硬件也可能有着自己的一些特点:
The exact form of these VIC queues could depend on the specifics of the I/O interface card. For example, it could support a scatter/gather interface to aggregate multiple physically- disjoint memory regions into a single data transfer. For NICs, it could also optionally support hardware checksum offload and TCP segmentation facilities. These features enable I/O to be handled more efficiently by performing additional work in hardware.
-
传输和接收过滤,这些硬件有根据数据包的信息来决定是否处理还是直接丢弃的功能。
Installation of transmit and receive filters are privileged operations performed via the kernel control plane.
-
虚拟存储区域,存储控制器需要提供将物理区域映射到虚拟区域的功能。
Applications reference blocks in the VSA using virtual offsets, converted by hardware into physical storage locations. A VSIC may have multiple VSAs, and each VSA may be mapped into multiple VSICs for interprocess sharing.
-
带宽分配器,提高限制资源使用的功能,硬件可以提供给control plane接口允许control plance查询相关信息。
In addition, we assume that the I/O device driver supports an introspection interface allowing the control plane to query for resource limits (e.g., the number of queues) and check for the availability of hardware support for I/O processing (e.g., checksumming or segmentation).
以上的一些硬件功能是Arrakis的实现所需要的。
Control Plane Interface
Control Plane Interface是Arrakis Control Plane可应用直接交换的接口,用于应用申请资源和管理IO flow的出/入。基本的抽象是VICs, doorbells, filters, VSAs, 和 rate specifiers。VICs上面提到过了,是基本的硬件熟悉,应用是和这些虚拟化的硬件交换。doorbell是一个特定VIC上面的特定的事件,这个是一种IPC的方式,用于通知某个时间发生了。Filters有一个类型和判断条件,用于决定对数据包的处理行为。VSAs是什么提到过的虚拟存储区域。rate specifier类似一个节流阀,用于控制出/入数据的速度。
The interface between an application and the Arrakis control plane is used to request resources from the system and direct I/O flows to and from user programs. The key abstractions presented by this interface are VICs, doorbells, filters, VSAs, and rate specifiers.
Network Data Plane Interface
在Arrakis中,应用发送和接收包都是直接和硬件交换的,这样的处理方式节约了很多开支。这里的Data Plane Interface是现在用户的应用程序库里面。这里Arrakis提供了两种类型的接口,一种是于Posix兼容的接口,一种是支持zero copy 的接口,但是与posix不兼容,性能更加好。
Applications send and receive packets on queues, which have previously been assigned filters as described above. While filters can include IP, TCP, and UDP field predicates, Arrakis does not require the hardware to perform protocol processing, only multiplexing.
Arrakis使用了用户空间的网络栈,这个网络栈很好地最大化吞吐和最小化延时。Arrakis将包的接受和发送分为了区分明显的3给方面,
-
数据包使用DMA异步传输到内存里面;
-
对于应用发包,这里为了减少内存拷贝操作,是将包的buffers加入到硬件的对应的ring里面,然后执行和第1步相反的操作,
This is performed by two VNIC driver functions. send_packet(queue, packet_array) sends a packet on a queue; the packet is specified by the scatter-gather array packet_array, and must conform to a filter already associated with the queue. receive_packet(queue) = packet receives a packet from a queue and returns a pointer to it. Both operations are asynchronous. packet_done(packet) returns ownership of a received packet to the VNIC.
-
使用之前提到的doorbells 异步通知在队列上的事件。Doorbells的一个特点就是,在应用在运行的是和,直接使用了虚拟化硬件的中断功能,在应用没有在运行的时候,就通过control plane调用scheduler。Doorbells还与现在的一些编程方式是兼容的,比如
select
,They are useful both to notify an application of general availability of packets in receive queues, as well as a lightweight notification mechanism for I/O completion and the reception of packets in high-priority queues.
这里的实现中,descriptor rings是将软件硬件分离的一个核心。
Storage Data Plane Interface
Arrakis中底层的底层的存储API提供了一系列的操作,来实现异步读,写已经刷洗硬件cache等的操作。具体的IO操作完全由doorbells通知,在此基础上,实现了Posix兼容的接口。此外,在这里还实现了一个persistent data structures 的库,
to take advantage of low-latency storage devices. Persistent data structures can be more efficient than a simple read/write interface provided by file systems. Their drawback is a lack of backwards-compatibility to the POSIX API. Our design goals for persistent data structures are that (1) operations are immediately persistent, (2) the structure is robust versus crash failures, and (3) operations have minimal latency.
大概的思想就是这样吧,此外,论文中还有更多的内容比如File Name Lookup,具体实现的情况。可以在论文中查看更多的细节,
Evaluation
具体参考论文:
参考
- Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2015. Arrakis: The operating system is the control plane. ACM Trans. Comput. Syst. 33, 4, Article 11 (November 2015), 30 pages.