SLOG -- Serializable, Low-latency, Geo-replicated Transactions

SLOG: Serializable, Low-latency, Geo-replicated Transactions

0x00 基本思路

SLOG这篇Paper可以看作是之前Calvin数据库的优化设计。SLOG设计为一个跨区域复制的且满足strict serializability一致性的数据库。目前支持跨区域复制且满足strict serializability的数据库的代表就是Spanner，但是SLOG的设计与其有很大的差别。跨地域复制网络带来的延迟就会很大，所以跨地域复制的数据要支持低延迟的写操作，常见的操作就是，1. 完全使用异步复制的方式，2. 同步复制只会复制到最近的区域。在要低延迟又要强一执性所以难避开异步复制的情况下，SLOG这里的基本思路使用master-oriented的设计，引入了home的概念。SLOG中如果一个事务只访问了master都在一个region内的数据，一般就是一个数据中心之内，那么这个事务就是一个single home的事务，否则就是一个multi home的事务。

总体架构

在SLOG中，系统有多个Region。每个Region里面保存着整个数据库的多个副本，数据会被切分为多个partition，其中一个region里面的partition会是master。这个master也对应到SLOG中“home”的概念。同一个region中不同的partition的“home”之间是没有任何关系的。在一个region内，有一个基于Paxos实现的local input log，这个local input log中记录的操作只会是修改master在这个region里面的partition。SLOG也会使用批量的方式将这些local input log发送到其它的regions保存多个副本，每个batch会包含了一个sequence number，其它regions里面的副本可以通过这个sequence number来了解到是不是缺失了某些batch。这样的机制下面，一个region里面的最终会包含全部的input log，当然可以因为网络等的原因会有一些延迟。每个region内的SLOG通过deterministic transaction processing来重放input log。SLOG通过使用deterministic transaction processing机制来确保来以相同的顺序处理相同的input log，最终会得到完全一样的内部状态。在这种设计下面，single-home transactions就会是一个region内的操作，相对来说比较简单，而multi-home transactions就会比较复杂。另外在SLOG，事务执行会有两种模式：(a) after the region completes the transaction (SLOG-B) or (b) after the region completes the transaction AND it has been replicated to a configurable number of nearby regions (SLOG-HA).

0x01 Single-home transactions

SLOG的数据是以granule粒度进行分配，这个可以就是单条记录，也可以是一个数据range。对于每个granule，会包含这样的两个元数据信息，1. 写操作时候的master region的ID，2. 这个granule的master改动次数的计数值。这两个元数据也会随着granule进行复制操作。这两个元数据可以保存一个single 8 bit integer 值中，具体和remastering机制相关。在一个region中，会有一个distributed index，称之为Lookup Master，用于维护granule ids -> 8-bit value元数据的映射关系。这个index为异步复制的，所以可以这个index里面的数据是过时的，SLOG需要一种机制来处理读取到这里面过时数据的机制。

客户端在请求SLOG处理事务的时候，它可以发送到离它最近的region，而不用额外处理granule master的信息。一个region在接受到客户端的请求之后，从Lookup Master获取要访问的granule的hone的信息。如果得到的信息限制访问的数据在一个region中，则假设为single-home transaction，并将这个请求发送到对应的region。否则这个事务就是一个multi hone transaction。这部分的处理如下的伪代码，

function ReceiveNewTxn(txn):
  masterInfo = ∅
  foreach granule in txn.readSet 􏰀union txn.writeSet
       masterInfo = masterInfo 􏰀 union LookupMaster.Lookup(granule) 
  txn.masterInfo = masterInfo
  if (num of unique regions in txn.masterInfo == 1)
     txn.singleHome = true
     send txn to InsertIntoLocalLog(txn) of that home region
  else
     txn.singleHome = false
     send txn to the multi-home txn ordering module 
     //ordering module orders all multi-home txns and calls 
     //InsertIntoLocalLog(txn) at every region in this order

在single-home transaction的情况下，其hone region在收到这个请求之后，先将其暂存到内存中，然后最为一个batch通过Paxos复制到这个region的input log中。这里对应到InsertIntoLocalLog的逻辑，伪代码如下，

function InsertIntoLocalLog(txn): 
  if (txn.singleHome == true)
     append txn at end of localLogBatch
  else
     accessSet = ∅
     foreach <granule,granule.homeRegionID> in txn.masterInfo
         if (granule.homeRegionID == this.regionID) 
            accessSet = accessSet 􏰀union granule
     if (accessSet != ∅)
        t = new LockOnlyTxn(txn.ID,accessSet) 
        append t at end of localLogBatch
   if (isComplete(localLogBatch))
      batchID = insert localLogBatch into Paxos-managed local log 
      call ReceiveBatch(localLogBatch, batchID) at each region 
      init new localLogBatch

在每个Region的local input log之外，一个额外独立的Paxos进程将local input log复制到一个global log。这里对应到下面ReceiveBatch伪代码的逻辑，
```
function ReceiveBatch(batch, batchID): 
  localLogs[batch.regionID][batchID] = batch
  prevID = largest batch ID in global log from batch.regionID 
  while (localLogs[batch.regionID][prevID+1] != null)
     append localLogs[batch.regionID][prevID+1] into global log 
     prevID = prevID + 1
```
在一个region之内，事务的处理和SLOG之前的Calvin设计类似，也是基于deterministic database systems。同一个region里面的servers从global log中读取数据，如果本server保存有相关的partition的时候，则先获取锁。这里获取锁的顺序也会是确定的，然后执行事务的逻辑。SLOG这里不同的一点是在实际处理之前，会检查一下前面提到两个metadata是否对的上。如果对不上，则事务需要abort or restart。
两个关键的元数据作为granule的一部分进行复制、更新。在前面被判断为single home的transaction如果实际不是single home类型的话，在后面检查的逻辑的时候就会发现这个问题。在不需要全局的coordination的情况下就知道需要将这个事务abort，也可以重新作为一个multi home transaction重启。是single hone的类型，但是home判断错误的话，在检查的时候也会根据两个元数据判断处理目前的home判断不对。
SLOG两种事务处理模式中，SLOG-B模式会在第一个region提交之后就通知客户端这个事务已经提交了，而在SLOG-HA模式中，事务对应的input log必须复制到指定数量的region之后才能回复客户端事务已经成功提交。事务处理中，只有事务的input会被复制，这里可以通过和事务处理的过程并行操作来隐藏一些跨region复制的延迟。

SLOG执行single home transaction的简单逻辑示例如下图。每个partition的home region可以是不同的，且彼此之间没有关系。同一个partition相关的事务log在local log中会是顺序保存的。而且在复制到global log中，同一个partition log的顺序关系也会得到维护。在global log中，不同partition的log的顺序是不确定的，大概意思如下图所示。由于是不同的partition，顺序不一样也不会影响到log重放之后的最终的结果。

Lock manager thread code that continuously runs:
  txn = getNextTxnFromGlobalLog() //block until exists
  if (txn.isSingleHome == false and txn.lockOnlyTxn == false)
    ExecutionEngine.process(txn) //don’t request locks yet
  else
    Call RequestLocks(txn) //pseudocode in Figure 6

Code called after all locks for single-home txn acquired:
  ExecutionEngine.process(txn)

Execution engine code called prior to read/write of granule: 
  if (don’t have lock on granule) //can happen if txn is multi-home
     Block until have lock //wait for needed lockOnlyTxn
     //will always eventually get lock since SLOG is deadlock-free 
  if (txn.masterInfo[granule] != granule.header.masterInfo) 
      RELEASE LOCKS AND ABORT TRANSACTION

Execution engine code run at end of every transaction:
  ReleaseLocks(txn)

0x02 Multi-home transactions

单个的region内的事务，其顺序是单个region复制事务log到local input log时候就决定。在涉及到多个regions时候，需要另外的机制来决定。这样的一般有三种基本的策略，1. 使用一个全局的ordering server，2. 运行一个跨region的Paxos，3. 将所有的同一个region的multi-home事务都发送到partition的master进行处理，顺序会在对应的master region里面复制到local input log决定。目前SLOG使用的是第三种方式。Paper中也提到第二种方式可能有更高的可用性。在前面的伪代码中体现了这样的处理策略，在事务被处理的时候，home的习惯信息会在ReceiveNewTxn逻辑处理的时候被保存，InsertIntoLocalLog会检查目前运行在的region是不是包含了这个事务习惯的granule。如果是的，会产生一个特殊类型的事务，LockOnlyTxn，用于lock reads和local writes的锁的信息。

SLOG将LockOnlyTxn类型的事务作为一个类似于single-home transaction来处理。处理的时候只会设计到本地的granules(home region)，最终也会复制到global log中。不过LockOnlyTxn不需要包含执行的逻辑。一个multi home transaction多个regions的LockOnlyTxn单独执行最终被复制到global log的逻辑。LockOnlyTxn这样处理就可以处理和single home transaction之间的顺序关系。通过前面提到的全局排序的机制决定了multi-home之间的顺序关系，这里也不是完全独立的。
```
... The only difference is that they(LockOnlyTxns) do not have to include executable code. The code for the multi-home transaction can arrive separately — as part of the local log from the multi-home transaction ordering module, which eventually gets integrated into the global log at every region.   LockOnlyTxns exist to specify how the multi-home transaction should be ordered relative to single-home transactions at that region.
```
SLOG一个简单的处理multi-home事务的逻辑如下。体重的一个事务T2设计到两个regions。Region0在将T1 log添加到local log之后添加执行 InsertIntoLocalLog(T2)的逻辑。这样在Region local log中，T1在T2的前面。同理，在Region 1中，T2在T3和T4之间。T2产生的local log会是LockOnlyTxn类型的记录。之后这些local log被复制到globel log中。单独的local log之间的顺序会得到保留，导致不同的local log中log的记录的顺序在不同的global log的副本中不一定是相同的。基本意思如下图。

Paper中详细的一节对SLOG满足strict serializability进行了证明。

0x03 Dynamic remastering

SLOG这里讨论的第三个核心的问题就是dynamic remastering如何处理。在SLOG中处理dynamic remastering是一件避免麻烦的事情。SLOG在前面保留了两个元数据字段，都和dynamic remastering有关系。对于一个remaster请求，这两个字段都会被修改，这种特殊的请求在执行的大部分逻辑都是当中一个single-home transaction写入请求来处理。同样地，这个请求会先被发送到相关的granule的home region，按照前面提到的逻辑先被添加到local log中，最终被复制到global log中。在所有相关的regions处理了这个请求之后，这个granule的这两个元数据就会被修改， Lookup Master中的数据会被异步地修改。

Remastering会有一些corner cases需要小心地处理。比如一个可能导致race condition。一个请求在一个region被接收到之后，这个region里面的Lookup Master信息是被更新之后的，然后这个请求被发送到new home region。但是new home region还没有接收到这个remaster信息。一些regions可能将包含T的log batch保存到保护remaster请求的log batch的前面，但是另外一些regions则保存在remaster请求的后面。一些remaster请求和事务T的请求是不同regions的请求，事务T由new home region来处理，而remaster请求由old home region来处理，这样T和remaster请求之间的顺序在目前的逻辑上面是没有限制的。
元数据的 counter字段就用于处理这种情况。在请求一个granule的锁之前，SLOG会检查事务元数据中的counter和Lookup Master中获取到的，如果有异常额外的处理。这里的异常时实际上有两种情况，一种是事务元数据中的counter更大，则说明本region还没有处理remaster请求，需要等到remaster请求被处理。如果更小，则说明这个请求已经晚了，不能在执行了，需要abort处理。SLOG有通过额外的一些处理方式可以将这两个元数据保存在8bite的范围之内。

这里的伪代码表示如下，

Lock manager thread: RequestLocks(txn):
  bool can execute now = true;
  foreach granule in txn.readSet union􏰀 txn.writeSet
    if (inLocalPartition(granule)) //a region may have > 1 partition 
       if (granule.counter > storage.getCounter(granule))
          can execute now = false;
    else if (granule.counter < storage.getCounter(granule))
       RELEASE LOCKS AND ABORT TRANSACTION 
    else //counters are the same
        can execute now = can execute now & Lock(granule)
        //Lock(granule) returns true if lock acquired
  if (can execute now == true) 
     Send txn to execution thread 
  else
     Queue txn until locks are acquired and counters are correct

Lock manager thread: after releasing locks for txn: 
if (txn was a remaster request)
  Wake up txns that are waiting for this remaster request;
else
  Wake up txns that are waiting on locks that were released;

0x04 评估

这里的具体信息可以参看[1].

参考

SLOG: Serializable, Low-latency, Geo-replicated Transactions, VLDB ‘19.