Raft
2017-06-28Raft 分布式一致性算法.
- 本文目的:读 寻找一种易于理解的一致性算法的论文,记录核心点。
思考问题:
- 什么是一致性?什么是分布式一致性?
- 选举?如何选举?选举失败怎么办?网络拓扑图断裂又如何?
- 如何实现的?有没有更好的?
Introduction
- Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand
- Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.
- Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.
Replicated state machines
Raft 算法
- Raft通过选举一个高贵的领导人,然后给予他全部的管理复制日志的责任来实现一致性。领导人从客户端接收日志条目,把日志条目复制到其他服务器上,并且当保证安全性的时候告诉其他的服务器应用日志条目到他们的状态机中。拥有一个领导人大大简化了对复制日志的管理。例如,领导人可以决定新的日志条目需要放在日志中的什么位置而不需要和其他服务器商议,并且数据都从领导人流向其他服务器。一个领导人可以宕机,可以和其他服务器失去连接,这时一个新的领导人会被选举出来。
- 通过领导人的方式,Raft 将一致性问题分解成了三个相对独立的子问题
- 领导选举:一个新的领导人需要被选举出来,当现存的领导人宕机的时候
- 日志复制:领导人必须从客户端接收日志然后复制到集群中的其他节点,并且强制要求其他节点的日志保持和自己相同。
- 安全性:如果有任何的服务器节点已经应用了一个确定的日志条目到它的状态机中,那么其他服务器节点不能在同一个日志索引位置应用一个不同的指令。
- 每一台服务有三种状态,领导者,跟随者,候选人
核心数据结构
- state
state | Persistent state on all servers(Updated on stable storage before responding to RPCs) |
---|---|
currentTerm | latest term server has seen (initialized to 0 on first boot, increases monotonically) |
votedFor | candidateId that received vote in current term (or null if none) |
log[] | log entries; each entry contains command for state machine, and term when entry was received by leader (first index is 1) |
state | Volatile state on all servers |
---|---|
commitIndex | index of highest log entry known to be committed (initialized to 0, increases monotonically) |
lastApplied | index of highest log entry applied to state machine (initialized to 0, increases monotonically) |
state | Volatile state on leaders(Reinitialized after election) |
---|---|
nextIndex[] | for each server, index of the next log entry to send to that server (initialized to leader last log index + 1) |
matchIndex[] | for each server, index of highest log entry known to be replicated on server |
(initialized to 0, increases monotonically) |
- AppendEntries RPC.
- Invoked by leader to replicate log entries; also used as heartbeat.
Arguments | Explain |
---|---|
term | leader’s term leaderId so follower can redirect clients |
prevLogIndex | index of log entry immediately preceding new ones |
prevLogTerm | term of prevLogIndex entry |
entries[] | log entries to store (empty for heartbeat; may send more than one for efficiency) |
leaderCommit | leader’s commitIndex |
Results | Explain |
---|---|
term | currentTerm, for leader to update itself |
success | true if follower contained entry matching prevLogIndex and prevLogTerm |
- Receiver implementation:
- Reply false if term < currentTerm
- Reply false if log doesn’t contain an entry at prevLogIndex whose term matches prevLogTerm
- If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it
- Append any new entries not already in the log
- If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)
- RequestVote RPC Invoked by candidates to gather votes.
Arguments | Explain |
---|---|
term | candidate’s term |
candidateId | candidate requesting vote |
lastLogIndex | index of candidate’s last log entry |
lastLogTerm | term of candidate’s last log entry |
Results | Explain |
term | currentTerm, for candidate to update itself |
voteGranted | true means candidate received vote |
- Receiver implementation:
- Reply false if term < currentTerm
- If votedFor is null or candidateId, and candidate’s log is at least as up-to-date as receiver’s log, grant vote
- Rules for Servers
All Servers:
- If commitIndex > lastApplied: increment lastApplied, apply log[lastApplied] to state machine
- If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower
Followers:
• Respond to RPCs from candidates and leaders • If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate
Candidates:
• On conversion to candidate, start election: • Increment currentTerm • Vote for self • Reset election timer • Send RequestVote RPCs to all other servers • If votes received from majority of servers: become leader • If AppendEntries RPC received from new leader: convert to follower • If election timeout elapses: start new election
Leaders:
• Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts • If command received from client: append entry to local log, respond after entry applied to state machine • If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex • If successful: update nextIndex and matchIndex for follower • If AppendEntries fails because of log inconsistency: decrement nextIndex and retry • If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N .
summary
Election Safety
: at most one leader can be elected in a given term.Leader Append-Only
: a leader never overwrites or deletes entries in its log; it only appends new entries. §5.3Log Matching
: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index. §5.3Leader Completeness
: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms. §5.4State Machine Safety
: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
总结
Follower规则
- 回复candidates与leaders的RPC请求
- 如果选举超时时间达到,并且没有收到来自当前leader或者 投票的候选者的AppendEnties RPC调,转换角色为candidate
Candidate规则
- 转换成cadidate时开始一个选举
- 递增currentTerm&投票给自己
- 重置election timer
- 向所有服务器发送RequestVote RPC请求
- 如果获取服务器中多数投票:转换为Leader
- 如果收到从新Leader发送的AppdendEnties RPC请求:转换成Foller
- 如果选举超时时间达到:开始新的选举
Leader规则
- 给每个服务器发送初始为空AppendEntires RPCs(heartbeat);指定空闲时间之后 重复该操作以防election timeouts;
- 如果收到来自客户端的命令,将条目插入到本地日志,在条目应用到状态机后 回复给客户端
- 如果last log index >= nextIndex for a follower; 发送包含开始于nextIndex的 日志条目的AppendEnties RPC
- 如果成功,为Follower更新nextIndex与matchIndex
- 如果失败是由于日志不一致,递减nextIndex然后重试
- 如果存在几个N满足 N > commitIndex ,多数的matchIndex[i] >= N,并且log[N].term== currentTerm; 设置commitIndex=N;
RequestVote RPC 规则
- 如果term < currentTerm 则返回false
- 如果本地的voteFor为空或者为candidateId,并且候选者的日志至少与接受者的日志一样新,则投给其选票
- 怎么定义日志新?
- 比较两份日志中最后一条日志条目的索引值和任期号定义谁的日志比较新
- 如果两份日志最后的条目的任期号不同,那么任期号大的日志更加新
- 如果两份日志最后的条目任期号相同,那么日志比较长的那个就更加新。
AppendEntries RPC
- candidate赢得选举的后,宣誓主权
- 保持心跳
- 让follower的日志和自己保持一致
- 接收者的处理逻辑:
- 如果term < currentTerm 则返回false
- 如果日志不包含一个在preLogIndex位置纪元为prevLogTerm的条目,则返回 false,该规则是需要保证follower已经包含了leader在PrevLogIndex之前所有的日志了
- 如果一个已存在的条目与新条目冲突(同样的索引但是不同的纪元),则删除现存的该条目与其后的所有条将不在log中的新条目添加到日志之中
- 如果leaderCommit > commitIndex,那么设置 commitIndex =min(leaderCommit,index of last new entry)
Leader Election
- Follwer,Candidate,Leader三种状态
- 在选举机制中有两个超时设置来控制选举。
- 选举超时,指Follwer等待 成为candidate的时间。选举时间随机范围在150ms至300ms之间。
- 选举结束,跟随者称为候选人并开始新的选举。
- 为自己投票,并向其他节点发出请求投票消息。如果接收节点还未 投票在这一届,然后他们将投票给候选人,并重置选举超时时间点。
- 一旦候选人获得多数选票,该节点便成为了Leader。
- Leader开始向追随者发送AppendEntries消息,这些消息是按照特定的 时间间隔发送的称为心跳超时(heartbeat timeout).
- 跟随者接收到AppendEntries消息后进行响应并重置超时时间。
- 这个选举期将一直持续到追随者停止心跳并称为候选人为止。
- 当发生某个节点宕机,则重复以上选举机制。每届只能选择一个Leader. 如果两个节点同时成为候选者,则可能发生分裂投票情况。两个节点同时 启动相同的选举,并每个节点在另一个节点之前到达一个跟随节点。这种情况将等待新一届的选举。直到选出Leader为止。
Log Replication
- 所有的请求都先经过Leader节点,然后Leader节点复制日志给所有的节点。通过使用与心跳机制相同的AppendEntries消息完成的。
- 具体步骤如下:
- 所有请求到达Leader,然后Leader记录修改日志,此时不进行commited。并在下一个心跳信息传递的过程中发送修改信息。
- 一旦大多数Follwer响应,该修改记录将被Leader进行committed操作。 然后响应客户端。然后发送命令给追随者提交前面的修改记录。然后追随者响应Leader确认修改提交成功。这样所有节点就保持与Leader数据一致性了。
- 如果发生了网络分裂,将会产生至少2个Leader,此时将按照节点更多的 Leader为主进行提交,而节点更少的Leader将不能进行committed因为不能获取到大多数节点的确认数据处于uncommitted状态。在网络恢复后进行回滚,并变成跟随者。于此同时,在发生闹裂期间由client发送到少节点的Leader上的数据会消失,因为闹裂期间以多数节点为主的Leader才能进行committed操作。
比如有5个节点abcde,2个client,c1,c2,发生网络分裂情况是a(Leader)&b;
c(Leader)&d&e; c1连接的是a,c2连接的是c、如果此时c1发送seta修改,则
a会把这个修改至于uncommitted状态。c2发送setc修改到c,则c能把c2发送的修改进行committed。当网络恢复后,
a将变成跟随者并且a的uncommitted的数据回滚,并同步c的在网络分裂期间的数据日志。最终达到所有节点的一致性。
注意问题
- 因为Zookeeper的Leader写,所以集群的机器数量需要有一定的限制,机器越多,Leader机器需要写入的Follower越多,所以写入性能会有所下降。
TODO 问题
- Zookeeper读Follower,而写全部都要从Leader进入然后再同步到Follower才算完成写入。如果在写入Leader完成之前,有读该节点线程,是否将无法读到?