There Is More Consensus in Egalitarian Parliaments
0x00 Introduction
This paper describes an optimization of the Paxos algorithm: a Paxos variant called EPaxos. EPaxos focuses on eliminating the stable, designated Leader/Master. Its basic idea is to give every command submitted by a client its own command leader, to handle the conflicts and dependencies that this design introduces, and to add a number of design and implementation optimizations.
0x01 Basic Design
The setting of EPaxos is not very different from ordinary Paxos: with 2F+1 replicas it tolerates F replica failures. EPaxos distinguishes two actions, committing and executing, whose orders do not have to be the same (commands can be committed out of order). Another important concept is command interference: two commands interfere when they must be executed in a fixed relative order, otherwise different results could be produced. EPaxos guarantees that two interfering commands C1 and C2 are executed on every replica in the same relative order that the clients observe. Each command carries two attributes: deps, the list of instances containing commands that interfere with it, and seq, a sequence number used to break dependency cycles.
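As a concrete illustration (my example, not taken from the paper): in a key-value store, two commands interfere exactly when they touch the same key and at least one of them is a write. A minimal sketch:

```python
# Hypothetical sketch: command interference for a simple key-value store.
from dataclasses import dataclass

@dataclass(frozen=True)
class Command:
    op: str          # "GET" or "PUT"
    key: str
    value: str = ""

def interferes(c1: Command, c2: Command) -> bool:
    """True if the two commands must be executed in the same relative
    order on every replica."""
    if c1.key != c2.key:
        return False                              # different objects never interfere
    return c1.op == "PUT" or c2.op == "PUT"       # at least one of them writes

# Two concurrent PUTs to "x" interfere; a PUT on "x" and a GET on "y" do not.
assert interferes(Command("PUT", "x", "1"), Command("PUT", "x", "2"))
assert not interferes(Command("PUT", "x", "1"), Command("GET", "y"))
```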
-
Each replica in EPaxos has a unique ID, and, as in many protocols, EPaxos uses epochs to handle new configurations. A ballot number therefore has the format epoch.b.R, where b is a monotonically increasing natural number and R is the replica ID.
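Since a ballot is just the triple epoch.b.R, comparing two ballots is a lexicographic comparison of the three components. A minimal sketch (the field names are mine):

```python
from typing import NamedTuple

class Ballot(NamedTuple):
    epoch: int       # configuration epoch
    b: int           # monotonically increasing natural number
    replica_id: int  # R, the proposing replica's ID

# NamedTuple comparison is lexicographic, which matches the epoch.b.R ordering.
assert Ballot(1, 3, 0) < Ballot(1, 3, 2) < Ballot(2, 0, 0)
```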
-
Figure 1 above shows an example. Two concurrent updates to different objects do not interfere; each picks a different command leader and completes along EPaxos's fast path. Two concurrent updates to the same object do interfere; when they also pick different command leaders, a conflict arises and EPaxos takes the slow path.
-
In basic EPaxos, a fast-path quorum consists of 2F replicas, which can be optimized to $F + \lfloor (F+1)/2 \rfloor$. On the slow path, a quorum always has size F+1.
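A quick way to see what these quorum sizes mean in practice (my own check, not text from the paper):

```python
# Quorum sizes for N = 2F + 1 replicas.
def fast_quorum_basic(F: int) -> int:
    return 2 * F                    # basic EPaxos fast path

def fast_quorum_optimized(F: int) -> int:
    return F + (F + 1) // 2         # optimized fast path: F + floor((F + 1) / 2)

def slow_quorum(F: int) -> int:
    return F + 1                    # classic majority

for F in (1, 2, 3):
    print(2 * F + 1, fast_quorum_basic(F), fast_quorum_optimized(F), slow_quorum(F))
# N=3: 2 / 2 / 2,   N=5: 4 / 3 / 3,   N=7: 6 / 5 / 4
```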
-
In normal operation, after a command leader receives a command it leads, it sends PreAccept messages to at least a fast-path quorum. Upon receiving this message, a replica updates the command's deps and seq attributes and replies to the command leader. If the command leader receives enough replies and the updated attributes all agree (other replicas may have updated them), it can commit the command directly. If the attributes in the replies differ, the leader merges them (taking the union of the deps and the maximum seq) and tells at least a majority of replicas to accept the merged attributes, again waiting for replies from a majority. The command leader can then reply to the client and send asynchronous Commit messages.
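A rough sketch of the command leader's decision after the PreAccept round, using my own simplified types: the fast path is taken only when a fast quorum reports identical attributes, otherwise the attributes are merged and the Accept (slow) path runs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attrs:
    deps: frozenset   # instances containing commands that interfere with this one
    seq: int          # sequence number used to break dependency cycles

def merge(attrs: list) -> Attrs:
    """Union of deps and maximum seq over all PreAccept replies."""
    return Attrs(frozenset().union(*(a.deps for a in attrs)),
                 max(a.seq for a in attrs))

def decide(replies: list, fast_quorum_size: int):
    """replies: updated attributes from the PreAccept quorum, including the
    command leader's own copy.  Fast path only if a fast quorum agrees on
    identical attributes; otherwise merge and run the Accept (slow) path."""
    if len(replies) >= fast_quorum_size and len(set(replies)) == 1:
        return "fast path: commit", replies[0]
    return "slow path: Paxos-Accept with a majority, then commit", merge(replies)

a = Attrs(frozenset({"R1.1"}), 1)
b = Attrs(frozenset({"R1.1", "R3.2"}), 2)
print(decide([a, a, a], fast_quorum_size=3))   # fast path
print(decide([a, a, b], fast_quorum_size=3))   # slow path with merged attributes
```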
-
The execution algorithm mainly builds a dependency graph; before executing a command, any not-yet-committed command among its dependencies must be committed first (handling EPaxos's dependencies in a real deployment is probably hard and likely adds noticeable overhead). The paper's steps are quoted below, followed by a small sketch:
1. Wait for R.i to be committed (or run Explicit Prepare to force it);
2. Build γ's dependency graph by adding γ and all commands in instances from γ's dependency list as nodes, with directed edges from γ to these nodes, repeating this process recursively for all of γ's dependencies (starting with step 1);
3. Find the strongly connected components, sort them topologically;
4. In inverse topological order, for each strongly connected component, do:
4.1 Sort all commands in the strongly connected component by their sequence number;
4.2 Execute every un-executed command in increasing sequence number order, marking them executed.
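Below is a compact sketch of steps 2-4, assuming every dependency has already been committed (step 1) and using my own names. Tarjan's algorithm conveniently emits strongly connected components only after the components they depend on, which is exactly the inverse topological order the steps ask for:

```python
import sys
sys.setrecursionlimit(1_000_000)   # the dependency graph may be deep

def execute_graph(deps: dict, seq: dict, execute_cmd) -> None:
    """deps maps an instance id to the ids it depends on (edges point at
    dependencies); seq maps an instance id to its sequence number;
    execute_cmd(instance_id) applies the command to the state machine."""
    index, low, on_stack, stack = {}, {}, set(), []
    counter = 0

    def strongconnect(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # Tarjan pops an SCC only after every SCC it depends on has
            # already been popped, i.e. inverse topological order.
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            # Inside a cycle, break ties by executing in increasing seq order.
            for u in sorted(scc, key=lambda x: seq[x]):
                execute_cmd(u)

    for v in deps:
        if v not in index:
            strongconnect(v)
```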
-
The recovery logic in EPaxos is fairly involved, and it is also where the fast-path quorum can be reduced to $F + \lfloor (F+1)/2 \rfloor$. This optimization is not proved in detail in the paper itself but in a separate thesis.
There are many more details in the EPaxos algorithm that are not covered here.
0x02 Evaluation
For the detailed evaluation, see [1].
SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines
0x10 Introduction
This paper is another optimized variant of Paxos. Compared with EPaxos above, SDPaxos (Semi-Decentralized Paxos) takes a more moderate design instead of EPaxos's more radical approach of completely abandoning a designated leader (and using one command leader per command). The basic idea of SDPaxos is to separate ordering from replication. From the paper's abstract:
We implemented a prototype of SDPaxos, and compared its performance with typical single-leader (Multi-Paxos) and multi-leader (Mencius, EPaxos) protocols. Our experiment results demonstrate that SDPaxos achieves: (1) 1.6× the throughput of Mencius with a straggler, (2) stable performance under different contention degrees and 1.7× the throughput of EPaxos even with a low contention rate of 5%, (3) 6.1× the throughput of Multi-Paxos without straggler or contention, (4) 4.6× the throughput of writes when performing reads, and (5) up to 61% and 99% lower wide-area latency of writes and reads than other protocols.
0x11 Basic Design
The basic idea of SDPaxos is to separate replication from ordering. Replication works much like in EPaxos: each client request is handled by a designated command leader. Ordering, however, is handled by a centralized sequencer node. This is a more moderate middle ground: since ordering is much cheaper than replication, the load on this central ordering node is also lower than that on the Leader/Master in Multi-Paxos/Raft. The figure below shows an example of SDPaxos's execution flow.
-
In SDPaxos, instances are defined as usual, but they are split into C-instances and O-instances, corresponding to replication and ordering respectively. In Figure 3, after replica R0 receives a client request, it becomes the command leader for that request, picks a C-instance, and replicates the request to the other replicas with C-accept messages. Meanwhile, R2, acting as the sequencer, proposes an O-instance for this request when it learns about it and sends O-accept messages (carrying the global slot information) to the other replicas. In the normal case, the other replicas reply C-ACK and O-ACK to R0 for the C-instance and O-instance respectively; once R0 has received both kinds of ACKs from a majority of replicas, it can broadcast C-commit and O-commit messages.
# C-phase
Replica Rn on receiving a client request for command α:
    send C-accept(n, i, α, bali) to at least a majority
    if the C-accept is not sent to the sequencer then
        send C-request(n, i) to the sequencer
    increment the C-instance counter i
Any replica R on accepting C-accept(n, i, α, bali):
    send C-ACK(n, i, α, bali) to Rn
Rn on receiving C-ACKs from a majority of replicas:
    commit Cni and broadcast C-commit(n, i, α, bali)
# O-phase
Sequencer Rm on receiving C-accept(n, i, α, bali) or C-request(n, i) from Rn:
    if i ≥ number of O-instances for Rn in sequencer's assignment log then
        send O-accept(j, n, balj) to at least a majority including Rn
        increment the O-instance counter j
Any replica R on accepting O-accept(j, n, balj):
    send O-ACK(j, n, balj) to Rn
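To make the decoupling concrete, here is a small sketch (my own representation, not the paper's data structures) of how per-replica C-instance logs plus a single O-instance log of replica IDs yield the global execution order:

```python
# Per-command-leader C-instance logs hold the commands themselves; the single
# O-instance log holds only replica IDs chosen by the sequencer.
c_logs = {0: ["PUT x=1"], 1: [], 2: ["PUT y=2"]}   # hypothetical committed C-instances
o_log = [0, 2]                                      # committed O-instances (replica IDs)

def execution_order(c_logs: dict, o_log: list) -> list:
    """The k-th occurrence of replica n in the O-instance log orders the
    k-th C-instance of replica n."""
    next_slot = {n: 0 for n in c_logs}
    ordered = []
    for n in o_log:
        ordered.append(c_logs[n][next_slot[n]])
        next_slot[n] += 1
    return ordered

print(execution_order(c_logs, o_log))   # ['PUT x=1', 'PUT y=2']
```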
-
SDPaxos defines a command as ready when it is safe to reply to the client. With three replicas, or more than five, the usual rule applies: after a majority of O-ACKs the O-instance can be committed, and the command leader can respond to the client once it has committed the C-instance Cni (i.e., its i-th C-instance) and at least i O-instances have been assigned to it:
# Ready condition for three or more than five replicas
if Rn receives O-ACKs from a majority of replicas then
    commit Oj and broadcast O-commit(j, n, balj)
if Rn has committed Cni and at least i O-instances for Rn then
    respond to client
With three replicas, if the command leader and the sequencer are different replicas, the two of them already form a majority, so in the normal case SDPaxos commits in a single RTT: the sequencer's reply to the command leader already carries the O-instance information. With more replicas (e.g., five), committing would take more than 1 RTT without further optimization.
-
SDPaxos adds an optimization for the five-replica case so that it can also commit within 1 RTT. The handling of C-instances is unchanged: a majority of C-ACKs is still required. For O-instances, the three-replica case shows that when the sequencer and the command leader are different replicas, only two replicas are guaranteed to have accepted the O-instance within 1 RTT. The optimization is therefore to wait only for the sequencer's O-accept, which together with the command leader itself makes two replicas. If those two replicas fail, however, the ordering information could be lost, breaking the requirement of tolerating two failures. An extra condition is therefore needed: before responding to the client, every preceding O-instance must also satisfy the same acceptance condition (see the pseudocode below). The premise is that the C-instance is already stored on a majority at this point, so the command data is not lost even if two nodes fail; only the ordering information has to be recovered, which SDPaxos's recovery algorithm takes care of.
# Ready condition for five replicas
if Rn is the sequencer then
    commit Oj and broadcast O-commit(j, n, balj) on receiving O-ACKs from a majority of replicas
    if Rn has committed Cni ∧ any Ok with k ≤ j (Oj is the ith O-instance for Rn) is accepted by a majority then
        respond to client
else
    commit Oj and broadcast O-commit(j, n, balj) on receiving the O-accept from the sequencer
    if Rn has committed Cni ∧ any Ok with k ≤ j (Oj is the ith O-instance for Rn) is accepted by itself and the sequencer then
        respond to client
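A sketch of the optimized ready check from the point of view of a command leader that is not the sequencer, using my own bookkeeping (a map from O-instance number to the set of replicas known to have accepted it):

```python
def ready_five_replicas(n: int, sequencer: int, j: int,
                        c_committed: bool, o_accepted_by: dict) -> bool:
    """Ready check on command leader Rn (n != sequencer) whose command sits
    in O-instance j.  o_accepted_by[k] is the set of replica IDs known to
    have accepted O-instance k (O-instances numbered from 1 here)."""
    if not c_committed:                      # Cni must already be committed
        return False
    # Every O-instance up to and including Oj must be accepted by at least
    # this replica and the sequencer, so that the ordering survives two
    # failures together with the recovery procedure.
    return all({n, sequencer} <= o_accepted_by.get(k, set())
               for k in range(1, j + 1))
```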
The figure below shows the O-instances on each node. After two failures, the hole in the middle (there can be at most one) can be recovered from the O-instance and C-instance information on the other nodes.
Recovery
To handle sequencer failure, SDPaxos likewise introduces a view number (the same idea as an epoch) and an election process. Apart from the five-replica case, there is nothing special about SDPaxos's recovery. Because of the five-replica optimization, let Rm denote the replica that is neither the old sequencer nor one of the majority of voters. During recovery, the new sequencer finds the largest-numbered C-instance of Rm among the surviving replicas' accepted C-instances, then proposes the appropriate O-instances to fill the corresponding holes.
We let each voter replica report in its vote the largest-numbered C-instance of each replica it has accepted, then the new sequencer takes the maximum for Rm. C[m] implies the maximum number of Rm’s possibly ready commands, because the C-instance of any ready command of Rm must have been seen by one of the voters. Therefore, the sequencer should propose Rm’s ID in the empty O-instances until there are at least C[m] O-instances for Rm.
The basic pseudocode:
# Recovery of O-instances
Input: C[m]: number of C-instances of Rm accepted by the majority voters;
O[m]: number of O-instances for Rm (Rm is the replica other than the old sequencer and the majority voters in five-replica groups)
if N == 5 then
    repeat
        propose Rm's ID in the first empty O-instance
        O[m] ← O[m] + 1
    until O[m] ≥ C[m] ∧ there is no empty O-instance preceding the C[m]th one for Rm
foreach empty O-instance between non-empty ones do
    propose no-cl
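A sketch of this recovery procedure, with my own representation of the new sequencer's O-instance log (a list whose slots hold a replica ID or None) and a NO_OP placeholder standing in for the pseudocode's filler value:

```python
NO_OP = "no-op"   # my placeholder for an O-instance that orders no command

def recover_o_instances(o_log: list, m: int, c_count: int) -> list:
    """o_log: the new sequencer's view of O-instances (replica ID or None);
    m: ID of Rm; c_count: C[m] as reported by the majority of voters."""
    def count_m():
        return sum(1 for x in o_log if x == m)

    def hole_before_cmth():
        # Is there an empty O-instance before the C[m]-th one assigned to Rm?
        if c_count == 0:
            return False
        seen = 0
        for x in o_log:
            if x is None:
                return True
            if x == m:
                seen += 1
                if seen >= c_count:
                    return False
        return True   # fewer than C[m] O-instances for Rm seen so far

    while count_m() < c_count or hole_before_cmth():
        if None in o_log:
            o_log[o_log.index(None)] = m   # propose Rm's ID in the first empty O-instance
        else:
            o_log.append(m)                # no hole left: extend the log
    # Fill the remaining holes between non-empty O-instances.
    for k, x in enumerate(o_log):
        if x is None:
            o_log[k] = NO_OP
    return o_log

# Example: Rm = 3, C[3] = 2, and the surviving log has holes.
print(recover_o_instances([0, None, 3, None, 1], m=3, c_count=2))
# -> [0, 3, 3, 'no-op', 1]
```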
0x12 Optimizations
SDPaxos includes several additional optimizations:
-
Message combining: in some cases the C-instance and O-instance messages can be combined and sent together, reducing the number of packets transmitted;
-
Straggler detection: replicas can check whether the sequencer is able to order their requests in time, and when the sequencer looks unhealthy they can start a view change:
... in which replicas estimate the sequencer’s healthiness according to whether the sequencer can satisfy their requirements for ordering. Specifically, each replica counts the number of O-accepts for itself received from the sequencer every 500 ms. If the throughput is less than 50% of its requirement for at least 3 times, it considers the sequencer as a straggler. Then it asks others whether they make the same judgment, and if so, it will start a view change to replace the old sequencer.
-
Sequencer partitioning: in some concrete settings, such as a KVS, the sequencer role can be partitioned across replicas with consistent hashing (a sketch follows this list);
-
O-instance batching, as the name suggests:
Another simple yet effective optimization is batching: a replica’s n commands can be batched in one global slot, so the number of O-instances will be divided by n. Even a small value of n like 2 or 3 can significantly reduce the load of O-instances.
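For the sequencer-partitioning item above, here is a hypothetical sketch of mapping keys to sequencers with a consistent-hash ring (none of these names come from the paper):

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class SequencerRing:
    """Maps each key to the replica that acts as sequencer for that key."""
    def __init__(self, replica_ids, vnodes: int = 64):
        self._ring = sorted((_h(f"{r}#{v}"), r)
                            for r in replica_ids for v in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def sequencer_for(self, key: str) -> int:
        i = bisect.bisect(self._points, _h(key)) % len(self._ring)
        return self._ring[i][1]

# Every replica computes the same owner for a given key, so commands on
# different key ranges are ordered by different sequencers.
ring = SequencerRing(replica_ids=[0, 1, 2, 3, 4])
print(ring.sequencer_for("user:42"))
```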
0x13 Evaluation
For the detailed evaluation, see [2].
References
1. There Is More Consensus in Egalitarian Parliaments, SOSP '13.
2. SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines, SoCC '18.