# EPaxos and SDPaxos

## There Is More Consensus in Egalitarian Parliaments

### 0x01 基本设计

EPaxos的设定和一般的Paxos没有很大的区别，在副本数量为2F+1的情况下，能够容忍F个副本的故障。在EPaxos中，存在两种不同的动作，一个是Committing另外一个是Executing，且它们的顺序不一定要相同(可以乱序提交)。另外在EPaxos中的一个重要的概念是Command Interference，两个Commands存在Command Interference即当两个的顺序必须按照一定的顺序进行，负责会得到不同的结果。EPaxos可以保证存在Command Interference的两个Commands C1 C2在客户端有着怎么样的顺序关系，在每个副本上面的执行的顺序也一定会是这样的顺序关系。deps是一个副本的列表，它(们)包含了与一个Command存在Interference的Command(s)，而seq则是序列号，用于打破循环依赖。

• EPaxos中每个副本有一个唯一的ID，和常见的算法一样，为了处理新配置，EPaxos中也使用了epoch。所以在EPaxos中Ballot Number的格式是epoch.b.R，b是一个递增的自然数，R为副本ID。

• 基本的EPaxos中，一个Fast Path的Quorum为2F个副本，可以优化为$F + \lfloor (F+1)/2 \rfloor$。在Slow Path的处理中，一个Quorum的大小总是为F+1。

1. Wait for R.i to be committed(or run Explicit Prepare to force it);
2. Build γ’s dependency graph by adding γ and all commands in instances from γ’s dependency list as nodes, with directed edges from γ to these nodes, repeating this process recursively for all of γ’s dependencies (starting with step 1);
3. Find the strongly connected components, sort them topologically;
4. In inverse topological order, for each strongly conected component, do:
4.1 Sort all commands in the strongly connected component by their sequence number;
4.2 Execute every un-executed command in increasing sequence number order, marking them executed.

• EPaxos中的恢复操作的逻辑比较复杂，这里也设计到Fast Path的Quorum可以优化到$F + \lfloor (F+1)/2 \rfloor$。另外这个优化在这篇Paper中没有详细地证明，而是在另外的一篇Thesis中。

EPaxos算法中还会有很多的内容和细节。

## SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines

### 0x10 引言

We implemented a prototype of SDPaxos, and compared its performance with typical single-leader (Multi-Paxos) and multi-leader (Mencius, EPaxos) protocols. Our experiment results demonstrate that SDPaxos achieves: (1) 1.6× the throughput of Mencius with a straggler, (2) stable performance under different contention degrees and 1.7× the throughput of EPaxos even with a low contention rate of 5%, (3) 6.1× the throughput of Multi-Paxos without straggler or contention, (4) 4.6× the throughput of writes when performing reads, and (5) up to 61% and 99% lower wide-area latency of writes and reads than other protocols.


### 0x11 基本设计

# C-phase
Replica Rn on receiving a client request for command α :
sendC-accept(n,i,α,bali)to at least a majority
if the C-accept is not sent to the sequencer then
send C-request(n, i) to the sequencer
increment the C-instance counter i
Any replica R on accepting C-accept(n,i,α,bali):
sendC-ACK(n,i,α,bali)toRn
Rn on receiving C-ACKs from a majority of replicas:

# O-phase
Sequencer Rm on receiving C-accept(n,i,α,bali) or C-request(n, i) from Rn:
if i ≥ number of O-instances for Rn in sequencer’s assignment log then
send O-accept(j, n, balj ) to at least a majority including Rn
increment the O-instance counter j
Any replica R on accepting O-accept(j, n, balj ):
send O-ACK(j, n, balj ) to Rn


# Ready condition for three or more than five replicas
if Rn receives O-ACKs from a majority of replicas then
if Rn has committed Cni and at least i O-instances for Rn then
respond to client


# Ready condition for five replicas
if Rn is the sequencer then
commit Oj and broadcast O-commit(j, n, balj ) on receiving O-ACKs from a majority of replicas
if Rn has committed Cni ∧ any Ok with k ≤ j (Oj is the ith O-instance for Rn) is accepted by a majority then
respond to client
else
commit Oj and broadcast O-commit(j, n, balj ) on receiving the O-accept from the sequencer
if Rn has committed Cni ∧ any Ok withk ≤ j (Oj is the ith O-instance for Rn ) is accepted by itself and the sequencer then
respond to client


#### 恢复

SDPaxos为了处理Sequencer的故障，通用引入了View Number(个epoch一个意思)的概念，照样会在一个选举的过程。除了5个副本的恢复外，SDPaxos的恢复过程没有特别的地方。由于SDPaxos对5个副本情况下的优化，Rm代表在这个故障中非Sequencer的那一个。恢复的过程需要从存活的版本的C-instacnes中找到Rm编号最大的一个，Sequencer找到合适O-instance，然后“补全”对应hole。

  We let each voter replica report in its vote the largest-numbered C-instance of each replica it has accepted, then the new sequencer takes the maximum for Rm. C[m] implies the maximum number of Rm’s possibly ready commands, because the C-instance of any ready command of Rm must have been seen by one of the voters. Therefore, the sequencer should propose Rm’s ID in the empty O-instances until there are at least C[m] O-instances for Rm.


# Recovery of O-instances
Input: C[m]: number of C-instances of Rm accepted by the majority voters;
O[m]: number of O-instances for Rm (Rm is the replica other than the old sequencer and the majority voters in five-replica groups)
if N==5 then repeat
propose Rm’s ID in the first empty O-instance
O[m] ← O[m] + 1
until O[m] ≥ C[m] ∧ there is no empty O-instance preceding the C[m]th one for Rm

foreach empty O-instance between non-empty ones do
propose no-cl


### 0x12 优化

SDPaxos另外的一些优化的策略，

• 消息合并，在一些情况下可以将C-instance和O-instance合并发送，减少传输包的数量；

• 掉队检测，可以Sequencer能否按序处理请求来检测Sequencer的状态，在判断Sequencer状态不是很好时可以启动ViewChang。

... in which replicas estimate the sequencer’s healthiness according to whether the sequencer can satisfy their requirements for ordering. Specifically, each replica counts the number of O-accepts for itself received from the sequencer every 500 ms. If the throughput is less than 50% of its requirement for at least 3 times, it considers the sequencer as a straggler. Then it asks others whether they make the same judgment, and if so, it will start a view change to replace the old sequencer.

• Sequencer分组，在一些具体的相同中，比如KVS相同可以根据一致性Hash来划分Sequencer。

• O-instance批量处理，如题，

Another simple yet effective optimization is batching: a replica’s n commands can be batched in one global slot, so the number of O-instances will be divided by n. Even a small value of n like 2 or 3 can significantly reduce the load of O-instances.


## 参考

1. There Is More Consensus in Egalitarian Parliaments, SOSP’13.
2. SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines. SoCC’18.