# Bigtable -- A Distributed Storage System for Structured Data

## Bigtable: A Distributed Storage System for Structured Data

### 数据模型

(row:string, column:string, time:int64) → string


• Columns，Bigtable会将columns分组为列族，访问控制在列族上组织。一个列族下面的列同时有相同的类型。列族是必须被显示的创建的，在创建之后就可以在其中添加列。列的key是family:qualifier 这样的格式，

 Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for Webtable is language, which stores the language in which a web page was written. We use only one column key with an empty qualifier in the language family to store each web page’s language ID.

• Timestamps，Bigtable中允许每个cell里面保存多个版本的数据，使用时间戳来区分版本。

Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.


在Bigtable中，应用可以选择只保留最新的版本，也可以选保留最近的N个版本。Bigtable基本的API示例:

### 基本架构

​ Bigtable的基本的组成部分和GFS的很相似，主要的部分也是两个：一个Master Serber，和一些tablet servers，另外还包括一个客户端使用的库。tablet server是可以动态添加 or 移除的。同Master在GFS的作用类似，Bigtable中Master的作用也是管理的作用，虽然具体的合作有所不同。Bigtable中Master负责讲tabllet分配给tablet servers，探测添加 or 过期的tablet server，对tablet server进行负载均衡，还有就是对垃圾文件进行回收处理。此外，它还负责处理schema的变化，如添加or删除table、列族等。每一个tablet server负责管理一组的tablets， 处理在这些tablets上面的读写请求，在tablet过大的时候对其进行切分。 一个Bigtable的集群会保存若干数量的table，而每一个table由一组的tablets组成。一个tablet上面保存了一个范围内的row key以及对应的数据。tablet是自动随着数据的增长而分裂逐渐增多的，初始的时候每个table都只有一个tablet。默认情况下，一个tablet的大小为1GB。

#### tablet的位置

The client library traverses the location hierarchy to locate tablets, and caches the locations that it finds. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy. If the client’s cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. If the client’s cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses (which we expect to be infrequent because METADATA tablets should not move very frequently).


We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis.


#### Tablet 分配

for example, a network partition might cause the server to lose its Chubby session. (Chubby provides an efficient mechanism that allows a tablet server to check whether it still holds its lock without incurring network traffic.) A tablet server attempts to reacquire an exclusive lock on its file as long as the file still exists. If its file no longer exists, then the tablet server will never be able to serve again, so it kills itself.


1. 获取一个关于Master的锁，这个事为了不会同时产生多个Master；
2. 新的Master扫描对应的Chubby目录发现存活的tablet servers；
3. 然后Master与这些servers通信，获取tablets的分配信息，并表明自己的新Master的身份；

the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.


#### Tablet维护、压缩和 Schema 管理

Updates are committed to a commit log that stores redo records. The recently committed ones are stored in memory in a sorted buffer called a memtable. A memtable maintains the updates on a row-by-row basis, where each row is copy-on-write to maintain row-level consistency. Older updates are stored in a sequence of SSTables (which are immutable).


• minor compaction，当memtable增长到一定程度之后，将其冻结，然后创建一个新的memtable。原来的memtable将会被转变为SSTable文件保存在GFS中；
• merging compaction，merging compaction时讲一些SSTables和memtable合并为一个新的SSTable文件，能够有效的减少SSTable文件的数量；
• major compaction，将若干SSTables文件合并为一个，同时去除里面以及被删除的数据；

Bigtable read performance benefits from a locality optimization in GFS. When files are written, GFS attempts to place one replica of the data on the same machine as the writer. When GFS files are read, the reads are served from the nearest available replica. Therefore, in the common case of tablet servers sharing machines with GFS servers, tablet servers will compact data into SSTables that have a replica on local disk, which enables fast access to those SSTables when subsequent read requests are processed.


Bigtable的schemas的数据也是保存在Chubby里面的。

### 优化

• Locality Groups，一个列族被赋予一个客户端定义的Locality Group，这样就可以由客户端控制存储布局，SSTable的生成也时每个Locality Group一个文件。讲经常一起使用的列放在一起有利于提高性能；

• Compression，客户端控制哪一个Locality Group是否压缩。SSTable上面的压缩时以SSTable的block为基础的，这样的话解压缩的时候只需要减压缩一部分；

• Caching for Read Performance，使用 Scan Cache和Block Cache 两层的Cache，前者缓存key-value对，后者缓存SSTable文件的块；

• Bloom Filters，LSM-Tree的常见优化，

 We reduce the number of accesses by allowing clients to specify that Bloom filters should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair.

• Commit-Log Implementation，Tablet Server讲所有的tablet的commit log写在一个文件里面，但是由于tablet可能被重新分配，这里需要使用Master发起对这个log的排序，按照⟨table, row name, log sequence number⟩ 这样的元组来排序，这样每个Tablet Server就可以只读区自己的部分；

• Speeding Up Tablet Recovery，在unload一个tablet的时候，先做一次minor compaction，然后停止对这个tablet的服务，然后在做一次 minor compaction。

• Exploiting Immutability，对SSTable进行垃圾回收，它不变的特性也使得分裂tablet变得简单:

the immutability of SSTables enables us to split tablets quickly. Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.


## 参考

1. Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages. DOI = 10.1145/1365815.1365816. http://doi.acm.org/10.1145/1365815.1365816.