# Remote Core Locking and ffwd

### 0x00 Introduction

• Reduce the use of atomic instructions and lower the cache miss rate;
• Execute critical-section code on another, dedicated core;
• Fast control transfer, with no synchronization and no global variables;
• Shared data stays on the server core (the core that executes the critical-section code), which means fewer cache misses.

### 0x01 Remote Core Locking (RCL)

The basic idea of RCL: each client i communicates with a dedicated server core through a request structure, req_i.

req_i contains the following fields:

1. the address of the lock associated with this critical section;
2. the address of a context structure containing the variables referenced or updated by the critical section;
3. the address of the function containing the critical-section code, or NULL if no critical section is pending.

##### Operation
• Client: writes the lock address, the context-structure address, and the function address into the 1st, 2nd, and 3rd fields of req_i in turn, then waits for the 3rd field to be reset to NULL by the server, which signals that the request has been completed. The spin here can be optimized with the SSE3 monitor/mwait instructions.
• Server: polls the array of request structures. When it finds that the 3rd field of some req_i is non-NULL, it first checks whether the lock is free; if so, it acquires the lock, executes the function, sets the 3rd field back to NULL when done, and then repeats the process.
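A minimal C11 sketch of the protocol above. All names (`rcl_request`, `rcl_publish`, `rcl_server_poll_once`) are illustrative, not from the RCL paper's actual implementation, and the lock check is trivial here because a single server serves every lock:

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CLIENTS 8

struct rcl_request {
    void *lock;                    /* field 1: address of the lock */
    void *context;                 /* field 2: data used by the critical section */
    _Atomic(void (*)(void *)) fn;  /* field 3: critical section; NULL = none/done */
};

static struct rcl_request reqs[MAX_CLIENTS];

/* Client i: fill fields 1 and 2, then publish the request by writing
 * field 3 last. A real client would then spin (or mwait) on fn == NULL. */
static void rcl_publish(int i, void *lock, void *ctx, void (*cs)(void *))
{
    reqs[i].lock = lock;
    reqs[i].context = ctx;
    atomic_store(&reqs[i].fn, cs);
}

/* One pass of the server loop: any slot with a non-NULL fn is a pending
 * request; run it, then reset fn to NULL to signal completion. */
static void rcl_server_poll_once(void)
{
    for (int i = 0; i < MAX_CLIENTS; i++) {
        void (*cs)(void *) = atomic_load(&reqs[i].fn);
        if (cs != NULL) {
            cs(reqs[i].context);             /* execute the critical section */
            atomic_store(&reqs[i].fn, NULL); /* field 3 back to NULL: done */
        }
    }
}
```

The real server wraps `rcl_server_poll_once` in an endless loop; writing `fn` last is what makes the other two fields safely visible before the request becomes pending.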

For the applications that spend 20-70% of their time in critical sections when using POSIX locks (Raytrace/Balls4, String Match, and Memcached/Set), RCL gives significantly better performance than POSIX locks, but in most cases it gives about the same performance as MCS and Flat Combining, as predicted by our microbenchmark. For Memcached/Set, however, which spends only 54% of the time in critical sections when using POSIX locks, RCL gives a large improvement over all other approaches, because it significantly improves cache locality. When using POSIX locks, Memcached/Set critical sections have on average 32.7 cache misses, which roughly correspond to accesses to 30 shared cache lines, plus the cache misses incurred for the management of POSIX locks. Using RCL, the 30 shared cache lines remain in the server cache.


### 0x02 ffwd

The basic principle of ffwd is the same as RCL's, but it achieves considerably better performance:

Our system: fast, fly-weight delegation (ffwd, pronounced “fast-forward”), is a stripped-down implementation of delegation that is highly optimized for low latency and high throughput. ffwd offers up to 10× the throughput of the state-of-the-art in delegation, RCL [53], in microbenchmarks, or up to 100% in application-level benchmarks. With respect to locking and other methods, ffwd is often able to improve performance by 10× or more, for data structures suitable to delegation.


ffwd achieves this performance mainly by effectively hiding the interconnect latency between the server thread and the client threads. The approach combines instruction-level parallelism with a careful layout of memory accesses: ffwd avoids atomic instructions, lets the server thread reorder instructions, packs requests and responses into cache line pairs kept close to the clients, and buffers responses to reduce cache-coherence traffic.

The primary contributions of this paper are as follows:

• The design and implementation of ffwd, to our knowledge the fastest delegation system to date. At approximately 40 cycles of server-side overhead per request, ffwd-delegated data structures closely approximate single-thread performance, in a multi-threaded setting.

• A low-level analysis of the performance bounds on locking vs. delegation, from an architectural perspective, setting the stage for further improvements.

• A publicly available implementation of ffwd, including a general purpose API, and a set of benchmark programs ported to delegation, for the community to reproduce and build upon our results.


The API ffwd provides is also simple:

`FFWD_Server_Init()`
Starts a server thread, allocates and initializes the request and response lines.

`FFWD_Delegate(s, f, retvar, argc, args...)`
This macro delegates function f to server s, with the specified arguments, and stores the return value in retvar.
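To make the call shape concrete, here is a toy, single-threaded stand-in for these two calls. The macro body below is my own placeholder that simply runs the function inline, so only the calling convention matches the real library, which executes the function on the server thread:

```c
#include <stdint.h>

/* Toy stand-in for the ffwd API; for illustration of the call shape only. */
struct ffwd_server { int id; };
static struct ffwd_server the_server;

static struct ffwd_server *FFWD_Server_Init(void)
{
    /* Real version: starts a server thread and allocates the
     * request and response lines. */
    return &the_server;
}

#define FFWD_Delegate(s, f, retvar, argc, ...) \
    do { (void)(s); (void)(argc); (retvar) = (f)(__VA_ARGS__); } while (0)

/* A shared counter that, under delegation, only the server ever touches. */
static uint64_t counter;
static uint64_t counter_add(uint64_t n) { counter += n; return counter; }
```

With that in place, a client increments the shared counter with `FFWD_Delegate(s, counter_add, ret, 1, 5);` and finds the new value in `ret`; no lock appears anywhere in client code.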


#### Lock and Delegation Performance Bounds

Delegation-based synchronization schemes such as ffwd are particularly well suited to short critical sections. Let us first look at the main factors that determine the performance of these schemes. Consider an idealized system with a one-way interconnect latency of l and a critical-section execution time of C-lock: the ideal throughput is T-lock = 1 / (l + C-lock). This bound serves as a guide for the optimizations that follow.

Consider an idealized system. Assuming no back-to-back acquisitions, the maximum single-lock throughput is T-lock = 1 / (l + C-lock), where l is the mean one-way interconnect latency, and C-lock is the mean duration of the critical section. When (C-lock → 0), lock throughput is dominated by bus latency.
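Plugging illustrative numbers into this bound makes it concrete (the 440-cycle latency figure below is an assumption chosen to match the per-lock bound quoted later in this section, not a measurement):

```c
/* T_lock = 1 / (l + C_lock), in ops per cycle. Multiplying by the clock
 * rate gives ops/s: ghz * 1e9 / (l + C) / 1e6 simplifies to
 * ghz * 1e3 / (l + C), i.e. the result is directly in Mops. */
static double t_lock_mops(double ghz, double l_cycles, double c_cycles)
{
    return ghz * 1e3 / (l_cycles + c_cycles);
}
```

At an assumed 2.2 GHz with l = 440 cycles and an empty critical section, `t_lock_mops(2.2, 440.0, 0.0)` gives 5.0 Mops, matching the per-lock figure in the summary below; a 440-cycle critical section halves the bound to 2.5 Mops.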


##### Interconnect Parallelism: Store Buffers

It is unknown how efficient a delegation server can be. However, our current implementation achieves 55 Mops on a 2.2 GHz CPU, or 40 cycles per request, whereas locking requires more than 450 cycles per request.
In summary, with locking, throughput is limited to 5 Mops per lock, or 12.5 Mops when running on a single socket. With delegation, performance is limited primarily by server processing capacity, and the number of processor cycles spent on each delegated function.


#### fast, fly-weight delegation (ffwd)

In ffwd, each client has a 128-byte request cache line pair. This pair may only be written by the client thread running on one particular hardware thread, and may only be read by the server. After a client thread writes its request into the request line pair, it spins on a particular response cache line pair, whose read/write roles are the reverse of the request's. Like RCL, the server polls the requests, but it processes a whole batch before writing back the results. The request structure also differs from RCL's: it consists of a toggle bit, a function pointer, an argument count, and up to 6 arguments. Within one CPU socket, a response line is shared by up to 15 threads and likewise contains toggle bits and 8-byte return values. The toggle bits indicate the state of each individual request/response channel: if the request and response toggle bits for a given client differ, a new request is pending; if they are equal, the server has finished processing and the response is ready. To process a request, the server loads the supplied arguments into place and calls the specified function.
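The layout just described can be sketched as C structs. The exact field packing below is my guess (the real ffwd packs the per-client toggle bits and return slots somewhat differently), but the 128-byte sizes and the toggle-bit comparison follow the text, assuming 64-bit pointers:

```c
#include <stdint.h>

#define FFWD_MAX_ARGS 6
#define FFWD_CLIENTS_PER_LINE 15

/* One 128-byte request line pair per client: written only by that
 * client, read only by the server. */
struct ffwd_request {
    _Alignas(128) uint64_t toggle;   /* flipped to publish a new request */
    void    *fn;                     /* function to delegate */
    uint64_t argc;                   /* number of arguments actually used */
    uint64_t args[FFWD_MAX_ARGS];    /* up to 6 arguments */
    uint8_t  pad[128 - 9 * sizeof(uint64_t)]; /* pad 72-B payload to 128 B */
};

/* One 128-byte response line pair shared by up to 15 clients on a
 * socket: written only by the server. */
struct ffwd_response {
    _Alignas(128) uint64_t toggles;       /* one toggle bit per client */
    uint64_t ret[FFWD_CLIENTS_PER_LINE];  /* 8-byte return value each */
};

/* A request from `client` is pending iff its two toggle bits differ;
 * equal bits mean the server is done and the response is ready. */
static inline int ffwd_pending(const struct ffwd_request *rq,
                               const struct ffwd_response *rs, int client)
{
    return (int)((rq->toggle ^ (rs->toggles >> client)) & 1u);
}
```

The 128-byte alignment is what keeps each request line pair private to one client/server pair, for the prefetcher-related reason explained in the next section.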

#### ffwd's Optimization Strategies

• Eliminating false sharing: unlike the usual practice of simply padding to the 64-byte cache line size, ffwd also takes cache prefetching into account, and therefore pads to 128 bytes.

On Intel’s Xeon family of architectures, cache lines are 64 bytes, but the L2 cache of each core includes a spatial (pair) prefetcher, which treats memory as consisting of 128-byte line-pairs, and automatically prefetches the “other” 64-byte line when one of the lines is otherwise retrieved.

• Independent, per-core request lines: nothing special here.

• Buffered, shared response lines: this is where ffwd differs most from RCL. ffwd allocates 128-byte response line pairs, each shared by up to 15 hardware threads on the same socket and containing the toggle bits and 8-byte return values. A write by the server invalidates every copy of the response line pair on the corresponding socket; during this invalidation, however, the server's response writes land in the local core's store buffer, which effectively guarantees that the entire 128 bytes can be written without generating extra cache-coherence traffic. The toggle bits are copied in last.

The first subsequent read by a client transitions the cache line pair to the S state and copies the data. Subsequent reads by other clients on the same socket are served by their local last-level cache. In short, every round of service, serving up to 15 clients on one socket, incurs at most 17 cache line data transfers and 17 corresponding cache line invalidations over the interconnect.

• NUMA-aware allocation of request and response lines: request lines are allocated on the node of their client, and response lines on the node of the server.

our server starts processing a request by reading the specified number of parameters from the request into the parameter passing registers. It then calls the specified function, and finally buffers the eventual return value for subsequent combined transmission.
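This dispatch step can be sketched as follows; passing all six slots uniformly as `uint64_t` (unused ones as 0) is my simplification of how the server moves arguments into the parameter-passing registers:

```c
#include <stdint.h>

typedef uint64_t (*ffwd_fn)(uint64_t, uint64_t, uint64_t,
                            uint64_t, uint64_t, uint64_t);

/* Load argc arguments from the request, call the delegated function, and
 * hand back the return value (the real server buffers it for a later
 * combined response write). With at most six integer arguments, everything
 * travels in registers under the x86-64 SysV calling convention. */
static uint64_t ffwd_dispatch(ffwd_fn fn, uint64_t argc, const uint64_t args[6])
{
    uint64_t a[6] = {0};
    for (uint64_t i = 0; i < argc && i < 6; i++)
        a[i] = args[i];
    return fn(a[0], a[1], a[2], a[3], a[4], a[5]);
}
```

Delegated functions are declared with the full six-slot signature and simply ignore the trailing slots they do not use.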