## PNUTS and Cassandra

### Data Model

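The data models of Cassandra and PNUTS are very similar, and both resemble Bigtable's table-oriented model. Data is abstracted as tables, much as in a traditional database, and each table consists of a number of declared columns. Because Cassandra and PNUTS are still key-value systems at heart, operations such as deletes and updates can only be addressed by primary key. The main Cassandra interface is: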

• insert(table,key,rowMutation)

• get(table,key,columnName)

• delete(table,key,columnName)

columnName can refer to a specific column within a column family, a column family, a super column family, or a column within a super column.
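
To make the shape of these three calls concrete, here is a minimal in-memory sketch. It is not Cassandra's implementation: `ToyStore` is a hypothetical name, `rowMutation` is assumed to be a plain mapping of column names to values, and column families and super columns are left out entirely.

```python
from collections import defaultdict


class ToyStore:
    """Hypothetical in-memory store: table -> row key -> {column: value}."""

    def __init__(self):
        self.tables = defaultdict(lambda: defaultdict(dict))

    def insert(self, table, key, row_mutation):
        # Assumption: a row mutation is just a dict of column -> new value.
        self.tables[table][key].update(row_mutation)

    def get(self, table, key, column_name):
        # Every access is addressed by primary key; a missing row or
        # column yields None.
        return self.tables[table].get(key, {}).get(column_name)

    def delete(self, table, key, column_name):
        # Deleting an absent column is a no-op.
        self.tables[table].get(key, {}).pop(column_name, None)


store = ToyStore()
store.insert("users", "alice", {"email": "a@example.com", "city": "Paris"})
store.delete("users", "alice", "email")
```

Note that there is no call that selects rows by column value: every operation names the row key, which is the key-value restriction described above.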


### Consistency Model

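One thing PNUTS and Cassandra have in common on consistency is that both support transactions only within a single row, which is also true of Bigtable. Yahoo! concluded that, for its applications, a purely eventually consistent system could not meet some requirements: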

PNUTS provides a consistency model that is between the two extremes of general serializability and eventual consistency. Our model stems from our earlier observation that web applications typically manipulate one record at a time, while different records may have activity with different geographic locality.


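The PNUTS consistency model is based on per-record timeline consistency: all updates to a given record are applied in the same order on every replica. The basic way to achieve this is to designate one of the replicas as the master for the record, and the master of a record is adjusted automatically according to the workload. Each record carries a sequence number that is incremented on every update. The write calls are: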

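• Write: updates a record directly;

• Test-and-set-write(required version): performs the update only if the record's current version equals the specified version. This is useful for implementing single-row transactions: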

This call can be used to implement transactions that first read a record, and then do a write to the record based on the read, e.g., incrementing the value of a counter. The test-and-set write ensures that two such concurrent increment transactions are properly serialized.
Of course, if the need arises, our API can be packaged into the traditional BEGIN TRANSACTION and COMMIT for single-row transactions, at the cost of losing expressiveness.
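
The counter example in the quote can be sketched as follows. This is a single-process illustration, not the PNUTS API: `Record`, `VersionMismatch`, and `increment` are hypothetical names, and a `threading.Lock` stands in for the serialization that the record's master would provide.

```python
import threading


class VersionMismatch(Exception):
    """Raised when the record's version is not the one the caller required."""


class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0               # sequence number, bumped on each write
        self._lock = threading.Lock()  # stand-in for ordering at the master

    def read(self):
        with self._lock:
            return self.value, self.version

    def write(self, value):
        # Blind write: always succeeds and advances the version.
        with self._lock:
            self.value = value
            self.version += 1
            return self.version

    def test_and_set_write(self, required_version, value):
        # Check and update happen atomically under the same lock.
        with self._lock:
            if self.version != required_version:
                raise VersionMismatch(
                    f"required {required_version}, found {self.version}")
            self.value = value
            self.version += 1
            return self.version


def increment(record, max_retries=10):
    """Read-modify-write loop: retry whenever another writer got in first."""
    for _ in range(max_retries):
        value, version = record.read()
        try:
            record.test_and_set_write(version, value + 1)
            return value + 1
        except VersionMismatch:
            continue
    raise RuntimeError("too many conflicting writers")
```

If two callers read the same version, only the first test-and-set write succeeds. The second sees a version mismatch, re-reads, and retries, so no increment is lost.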


Each record maintains, in a hidden metadata field, the identity of the current master. If a storage unit receives a set() request, it first reads the record to determine if it is the master, and if not, what replica to forward the request to. The mastership of a record can migrate between replicas.
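
The forwarding logic described in this quote can be sketched roughly as below. Everything here is an assumption made for illustration: `Replica`, `connect`, `create`, and `hand_over` are hypothetical names, updates are pushed to the other replicas synchronously, and a mastership hand-over leaves the sequence number unchanged.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.peers = {}    # replica name -> Replica, for every other replica
        self.records = {}  # key -> {"value": ..., "seq": int, "master": str}

    def apply(self, key, value, seq, master):
        # Install one update exactly as the master published it.
        self.records[key] = {"value": value, "seq": seq, "master": master}

    def set(self, key, value):
        record = self.records[key]
        master = record["master"]  # the hidden metadata field
        if master != self.name:
            # Not the master: forward the request to the replica that is.
            return self.peers[master].set(key, value)
        # Master: pick the next sequence number, then publish everywhere,
        # so every replica sees this record's updates in the same order.
        seq = record["seq"] + 1
        for replica in [self, *self.peers.values()]:
            replica.apply(key, value, seq, master)
        return seq

    def hand_over(self, key, new_master):
        # Mastership changes also go through the current master.
        record = self.records[key]
        if record["master"] != self.name:
            return self.peers[record["master"]].hand_over(key, new_master)
        for replica in [self, *self.peers.values()]:
            replica.apply(key, record["value"], record["seq"], new_master)


def connect(replicas):
    for r in replicas:
        r.peers = {p.name: p for p in replicas if p is not r}


def create(replicas, key, value, master):
    for r in replicas:
        r.apply(key, value, 0, master)
```

A write sent to any replica ends up at the record's master, which is what lets a single sequence number order all updates to that record.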


### Basic Architecture

The storage unit can use any physical storage layer that is appropriate. For hash tables, our implementation uses a UNIX filesystem-based hash table implemented originally for Yahoo!’s user database. For ordered tables, we use MySQL with InnoDB because it stores records ordered by primary key. Schema flexibility is provided for both storage engines by storing records as parsed JSON objects.


The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each node on the ring leads to non-uniform data and load distribution. Second, the basic algorithm is oblivious to the heterogeneity in the performance of nodes. Typically there exist two ways to address this issue: One is for nodes to get assigned to multiple positions in the circle (like in Dynamo), and the second is to analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes. Cassandra opts for the latter as it makes the design and implementation very tractable and helps to make very deterministic choices about load balancing.
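
A minimal sketch of the ring described in this quote, with one position per node and an explicit `move_node` for the load-driven repositioning. It is an illustration rather than Cassandra's code: `Ring`, `ring_position`, and the MD5-based 32-bit position are assumptions, and the decision of when and where to move a node (the load analysis) is left out.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a key to a point on a 32-bit ring (assumed hash: MD5 prefix)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    def __init__(self):
        self.position_of = {}  # node -> its single position on the ring
        self.points = []       # sorted list of (position, node)

    def add_node(self, node, position):
        if node in self.position_of:
            raise ValueError(f"{node} is already on the ring")
        if any(p == position for p, _ in self.points):
            raise ValueError(f"position {position} is taken")
        self.position_of[node] = position
        bisect.insort(self.points, (position, node))

    def remove_node(self, node):
        position = self.position_of.pop(node)
        self.points.remove((position, node))

    def move_node(self, node, new_position):
        # A lightly loaded node moves along the ring to take over part of
        # a heavily loaded neighbour's range.
        self.remove_node(node)
        self.add_node(node, new_position)

    def owner_at(self, position):
        # Owner is the first node clockwise from the position, wrapping
        # around at the end of the ring.
        if not self.points:
            raise LookupError("ring is empty")
        i = bisect.bisect_left(self.points, (position, ""))
        return self.points[i % len(self.points)][1]

    def lookup(self, key):
        return self.owner_at(ring_position(key))
```

With nodes at positions 100, 200, and 4,000,000,000, the third node owns almost the whole ring. Moving the first node to 3,000,000,000 hands most of that range to it, which is the kind of rebalancing the quote attributes to Cassandra.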


Cassandra's write path differs considerably from that of PNUTS.

(The figure above is from [3].)

