# TIMELY -- RTT-based Congestion Control for the Datacenter

## TIMELY: RTT-based Congestion Control for the Datacenter

### 0x00 引言

We show using experiments with up to hundreds of machines on a Clos network topology that it provides excellent performance: turning on TIMELY for OS-bypass messaging over a fabric with PFC lowers 99 percentile tail latency by 9X while maintaining near line-rate throughput. Our system also outperforms DCTCP running in an optimized kernel, reducing tail latency by 13X.


### 0x01 基本思路

/***********************************************************
TIMELY Algorithm to compute new rate
************************************************************/
/*
It is assumed this function is part of a 'sender' class, which declares the required instance variables and paramaters as detailed below.
It takes as input the current rate at which data is being sent and the current rtt.
It can be invoked when an ack/completion event is received.
It returns the new rate that can be enforced by the 'sender'.
*/
/*
Parameters used and their recommended values (these values may need finetuning based on experimental scenario)
ewma_alpha: recommended value = 0.02
t_low: recommended value = 0 (if using per-packet pacing), 50us (if using per 64KB message pacing)
t_high: recommended value = 1ms
HAI_Thresh: recommended value = 5
additiveIncrement: 10Mbps for 10Gbps line rate
decreaseFactor: 0.8
maxRate = line rate
minRate = optional
*/
/*
Other instance variables used and their initialization
prevRTT_: previous RTT (initialized to 0)
avgRTTDiff_: moving average of the RTT difference (initialized to 0)
minRTT_ = fixed minimum network RTT value
last_update_time_: initialized to 0
*/

double getNewRate(double rtt, double rate) {

if(prevRTT_ == 0) prevRTT_ = rtt;

double rtt_diff = rtt - prevRTT_;

if (rtt_diff < 0) {
} else {
}

avgRTTDiff_ = ((1 - ewma_alpha) * avgRTTDiff_) + (ewma_alpha * rtt_diff);

double normalized_gradient = avgRTTDiff_ / minRTT_;
double delta_factor = (curTime() - last_update_time_) / minRTT_;
delta_factor = min(delta_factor, 1.0);

prevRTT_ = rtt;
last_update_time_ = curTime();

double new_rate;
if (rtt < t_low) { //additivive increase if rtt < t_low
new_rate = rate + (additiveIncrement * delta_factor);
} else {
if (rtt > t_high) { //multiplicative decrease if rtt > t_high
new_rate = rate * (1 - (delta_factor * decreaseFactor * (1 - (t_high / rtt))));
} else {
int N = 1;
if (negGradientCount_ >= HAI_thresh) N = 5;
new_rate = rate + (N * additiveIncrement * delta_factor);
} else { //multiplicative decrease if avg gradient > 0
new_rate = rate * (1.0 - (decreaseFactor * normalized_gradient));
}
}
}
//derease in rate capped by 0.5 times the old rate
new_rate = max(new_rate, rate * 0.5);
//enabling max and min cap on the new rate
new_rate = min(new_rate, maxRate);
new_rate = max(new_rate, minRate);
return new_rate;
}


TIMELY Framework主要有三个部分组成：

1. 由于算法时以RTT为核心，所以RTT的测量当然就是核心的部分。RTT的定义如下图所示，注意这里和TCP中定义的RTT存在一些差别，

 We define the RTT in terms of Figure 7, which shows the time-line of a message: a segment consisting of multiple packets is sent as a single burst and then ACKed as a unit by the receiver. A completion event is generated upon receiving an ACK for a segment of data and includes the ACK receive time. The time from when the first packet is sent (t-send) until the ACK is received (t-completion) is defined as the completion time. Unlike TCP, there is one RTT for the set of packets rather than one RTT per 1-2 packets.


数据在端到端的传输中的演示来自多个方面，TIMELY只会关注传播延时和排队延时。RTT通过下面的公式计算， $$\\ RTT = t_{completion} - t_{send} - \frac{SegSize}{NICLineRate}$$ 此外这里为了更加精确地测量到RTT，要求NIC有下面的两个支持：

• ACK时间戳，t-completion由NIC提供，OS提供这个时间戳的方式会收到其它的因素的影响，使其变得不精确，比如调度和中断。同理t-send也是NIC在一个segment发动开始时设置的时间戳；
• Prompt ACK generation，使用基于NIC的ACK的方式，消除了OS处理数据包的时间带来的周转时间导致的RTT计算的不准确。因为这里传播延时和排队延时，在数据中心这些数据在多数的情况下都比较小，这样的话对测量的精度要求就比较高。
2. 新Rate计算，如上面的代码所示，具体的逻辑在下面说明；

3. Rate控制，TIMELY使用的方法是基于速率发送Segment的方法，在前面计算出发送的速率之后，一个发送的调度器会根据Segment Size和最后一个Segment发送的是来决定当前Segment发送的时机。Paper中认为，传统的TCP基于窗口的方法不能很好的实现数据中心这样的环境，

 The bandwidth-delay product is only a small number of packet bursts in datacenters, e.g., 51 μs at 10 Gbps is one 64 KB message. In this regime, windows do not provide fine-grained control over packet transmissions. It is easier to directly control the gap between bursts by specifying a target rate. As a safeguard, we limit the volume of outstanding data to a static worst-case limit.