ZooKeeper 分布式协调服务原理与实践

2018-01-05 2 min read Distributed

Executive Summary

核心观点（金字塔原理）

结论先行: ZooKeeper 是一个高性能的分布式协调服务，通过树形数据结构、Watcher 机制和强一致性保证，为分布式系统提供同步、配置管理、分组和命名服务等基础能力。

支撑论点:

数据模型采用类文件系统的层级结构（znodes），支持临时节点和持久节点，数据存储在内存中实现高吞吐低延迟

Watcher 机制提供一次性事件通知，客户端可监听节点变化实现分布式协调

读写分离架构：读操作就近访问任意节点，写操作必须通过 Leader 串行化处理，适合读多写少场景（10:1）

SWOT 分析

维度	分析
S 优势	高可用无单点故障；强一致性保证；简单易用的 API；数据存内存性能高
W 劣势	Server 故障到通知 Client 存在延迟；写操作必须经过 Leader 成为瓶颈；不适合存储大数据
O 机会	分布式配置中心；服务注册与发现；分布式锁实现
T 威胁	网络分区导致脑裂；Session 超时导致临时节点丢失；与业务服务注册发现场景的 CAP 权衡问题

适用场景

分布式配置管理（配置中心）
服务注册与发现
分布式锁和选主
集群成员管理

ZookeeperV3.4.11-Doc

数据结构+原语+watcher机制(消息通知)
问题: 在server挂机时到zookeeper接收到，再到zookeeper通知到client端是有一定的时长的，那么在这段时间内client的服务是不是就丢了？导致全部失败？还是其他什么情况？
ZooKeeper可以为所有的读操作设置watch，这些读操作包括：exists()、getChildren()及getData()。watch事件是一次性的触发器，当watch的对象状态发生改变时，将会触发此对象上watch所对应的事件。watch事件将被异步地发送给客户端，并且ZooKeeper为watch机制提供了有序的一致性保证。理论上，客户端接收watch事件的时间要快于其看到watch对象状态变化的时间。

ZooKeeper: A Distributed Coordination Service for Distributed Applications

功能点同步；配置管理；分组；命名服务。数据结构采用类似目录树的文件系统结构。运行环境Java+C。Zookeeper的愿景就是帮助更容易实现分布式系统
Design Goals
ZooKeeper is simple.每个数据结点称之为znodes，与文件系统不同的是znodes可以做存储，Zookeeper的数据都是在内存中，意味着可以实现高吞吐低延迟
ZooKeeper被设计成高可用，严格顺序执行。可靠性方面意味着不存在单点故障。严格的排序执行意味着在客户端能够做原子操作。
ZooKeeper is replicated. Zookeeper服务之间是相互了解的，通过内存的状态state，以及事务日志和快照做持久存储，只要大部分服务器可用，Zookeeper就是可用的。客户端Client连接到一台ZooKeeper服务器。客户机维护一个TCP连接，通过它sends requests, gets responses, gets watch events, and sends heart beats。如果服务器端的TCP连接断开，客户端将连接到另一个服务器。
ZooKeeper is ordered.
ZooKeeper is fast.，适用于读多写少(最好的比例是,读:写=10:1).读采用就近原则,而写必须都通过Leader来进行排队操作.

Data model and the hierarchical namespace

所有的Zookeeper的命名都是确定的通过/。ZooKeeper’s Hierarchical Namespace如下所示:

Nodes and ephemeral(临时) nodes

就像文件系统一样Zookeeper也是可以有子节点的，每个节点被设计为可以唯一定位的坐标比如 status information, configuration, location information, etc.,所以每个Node通常都是比较小的，通常在千字节以内。节点通常又称为znode.
Znodes有数据的修改版本，ACL修改，时间戳，允许缓存验证和协调服务的更新。每次数据的修改对应的修改版本属性就会自增，比如客户端接收到的数据也会同时接收到修改版本相关信息。每个数据结点都是读写原子的。每个节点都有一个访问控制列表（ACL）来限制谁可以做什么。
临时结点,存在于created znode的session期间，当session结束,znode将被删除。这个可能会有一定延迟，跟zoo.config配置的时间tickTime有关.以及通知到client也会有一定的延迟.

Conditional updates and watches

Zookeeper的watchs机制，客户端能在一个znodes上面设置一个watch，当znode被修改，移除等操作时，客户端能被通知到znode的改变。当client对应的zoo keeper服务器发生宕机，client将会收到一个notification.

Guarantees

顺序一致性-客户端的更新顺序将按其发送的顺序去处理
原子性-更新要么成功要么失败，没有其他状态
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
可靠性-更新应用后，将被存留直到下次更改
时效性-client看到的数据保证在一定的时间范围内是最新的
…..tbd

Simple API

reate
  creates a node at a location in the tree
delete
  deletes a node
exists
  tests if a node exists at a location
get data
  reads the data from a node
set data
  writes data to a node
get children
  retrieves a list of children of a node
sync
  waits for data to be propagated
...

Implementation

ZooKeeper Components 复制的数据库是一个包含整个数据树的内存数据库。更新记录写到磁盘便于恢复数据，写之前将被序列化在内存数据库中。client的读将直接读取最近的service而客户端的所有写操作将被转发给Leader,而其他的称为followers

Performance

ZooKeeper 设计目标是读多写少的工作负载，最佳读写比例为 10:1
吞吐量方面：读操作随 server 数量线性扩展（就近读取本地副本），写操作在 Leader 处形成瓶颈（所有写请求必须经由 Leader 串行处理）
延迟方面：读操作从本地 replica 读取，延迟在亚毫秒级别；写操作需要 quorum 多数确认，延迟相对较高
随着写比例增大，整体性能下降明显，因为所有写操作都必须通过 Leader 并同步到多数节点

Reliability

基于 Quorum 多数派机制：只要集群中大多数节点（N/2+1）存活，服务就保持可用
Leader 故障时通过 ZAB（ZooKeeper Atomic Broadcast）协议自动进行 Leader 选举，恢复服务
数据持久性通过事务日志（transaction log）和快照（snapshot）写入磁盘来保证
Session 自动迁移：当前连接的 server 宕机时，client 会自动重连到集群中其他可用节点

ZooKeeper Getting Started Guide

Standalone Operation

环境要求JDK必备
create conf/zoo.cfg

tickTime=2000 //毫秒,心跳时长,最小会话超时时间= 2 * tickTime
dataDir=/var/lib/zookeeper //存储内存数据库快照的位置，除非另有说明，是数据库更新的事务日志。
clientPort=2181 //客户端连接的端口

启动 bin/zkServer.sh start
默认采用log4j打印日志，这是标准模式，如果是复制模式

Managing ZooKeeper Storage Maintenance

Connecting to ZooKeeper

$ bin/zkCli.sh -server 127.0.0.1:2181

[zkshell: 0] help
ZooKeeper host:port cmd args
        get path [watch]
        ls path [watch]
        set path data [version]
        delquota [-n|-b] path
        quit
        printwatches on|off
        createpath data acl
        stat path [watch]
        listquota path
        history
        setAcl path acl
        getAcl path
        sync path
        redo cmdno
        addauth scheme auth
        delete path [version]
        setquota -n|-b val path

Test: ls / create /zk_test my_data get /zk_test set /zk_test junk delete /zk_test

Zookeeper 服务注册与发现的问题

总结

ZooKeeper 擅长分布式协调类任务：配置管理、服务发现、分布式锁、Leader 选举等
树形数据模型（znodes）结合临时节点（ephemeral nodes）可实现很多强大的分布式模式
Watcher 机制提供事件驱动的协调能力，但需注意 watch 是一次性触发的（one-time trigger），触发后需重新注册
适合读多写少、小数据量的协调场景，不适合作为大数据存储使用（每个 znode 数据建议在 KB 级别以内）
CAP 权衡：ZooKeeper 属于 CP 系统（强一致性 + 分区容错），在网络分区时会牺牲部分可用性（minority 分区不可用）

ZooKeeper 分布式协调服务原理与实践

Executive Summary

核心观点（金字塔原理）

SWOT 分析

适用场景

ZookeeperV3.4.11-Doc

ZooKeeper: A Distributed Coordination Service for Distributed Applications

Design Goals

Data model and the hierarchical namespace

Nodes and ephemeral(临时) nodes

Conditional updates and watches

Guarantees

Simple API

Implementation

Performance

Reliability

ZooKeeper Getting Started Guide

Standalone Operation

Managing ZooKeeper Storage Maintenance

Connecting to ZooKeeper

Zookeeper 服务注册与发现的问题

总结

最新文章

GitHub Trending 技术趋势 (2026-06-02) 2026-06-02

GitHub Trending 技术趋势 (2026-06-01) 2026-06-01

GitHub Trending 技术趋势 (2026-05-31) 2026-05-31