etcd

📅 2026-04-01 ✏️ 2026-04-01 CS INFRA

1 · etcd

etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node.

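S Distributed systems need to share configuration, coordinate state, and discover services across many nodes, which calls for data that is both strongly consistent and highly available.
C Nodes can crash and the network can partition at any time, and a traditional single-node store cannot provide consistency and fault tolerance together.
Q How can a distributed coordination store be strongly consistent and still tolerate node and network failures?
A etcd: a distributed KV store built on Raft consensus that offers strongly consistent reads and writes, Watch, Lease, and transactions, and serves as the data foundation of the Kubernetes control plane.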

1.1 · Architecture

Client (etcdctl / gRPC)
  │
  ▼
┌──────────────┐
│  API Layer   │  gRPC + gRPC Gateway (HTTP/JSON)
├──────────────┤
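│  etcd Server │  request routing, authentication, Lease management
├──────────────┤
│    Raft      │  leader election, log replication, strongly consistent consensus
├──────────────┤
│  MVCC Store  │  multi-version concurrency control, revision-based historical reads
├──────────────┤
│  Backend     │  WAL (write-ahead log) + BoltDB (persistence)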
└──────────────┘

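Raft: the leader handles all write requests, and an entry is committed only after it has been replicated to a majority of members; a cluster of n members tolerates ⌊(n-1)/2⌋ member failures
MVCC: every write increments a global revision, and data can be read as of an earlier revision (until that revision is compacted)
WAL + BoltDB: the WAL provides crash recovery, and BoltDB provides B+tree-based persistent storage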
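The fault-tolerance figure in the Raft bullet is plain majority arithmetic. A minimal sketch follows; the function names `quorum` and `fault_tolerance` are made up for illustration and are not part of any etcd API.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-member cluster."""
    if n < 1:
        raise ValueError("cluster size must be at least 1")
    return n // 2 + 1


def fault_tolerance(n: int) -> int:
    """Members that may fail while a majority survives: floor((n - 1) / 2)."""
    return n - quorum(n)


# Print the table for small clusters.
for size in range(1, 8):
    print(f"members={size} quorum={quorum(size)} "
          f"tolerated_failures={fault_tolerance(size)}")
```

Running it shows that going from 3 to 4 members raises the quorum without raising the number of tolerated failures, which is why odd cluster sizes are the usual choice.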

1.2 · Core features

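Feature	Description
Strong consistency	Built on Raft: every write goes through consensus, and reads are linearizable by default (a serializable read option trades freshness for lower latency)
Watch	Clients watch a key or range for changes and receive a stream of events starting from a chosen revision; events are ordered, atomic, and resumable
Lease	A lease with a TTL; keys attached to a lease are deleted automatically when it expires, so the client must renew it periodically with KeepAlive
Txn (transaction)	Atomic `If-Then-Else` transactions whose conditions compare a key's revision, version, or value
MVCC	Multi-version concurrency control; supports reads at a historical revision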
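To make the MVCC and Txn rows concrete, here is a toy in-memory model, written under simplifying assumptions: one process, no Raft, no compaction, and a single-key compare. `ToyMVCCStore` and its methods are hypothetical names for this sketch, not etcd's implementation or client API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMVCCStore:
    """Toy model: each write bumps a global revision; old versions stay readable."""
    revision: int = 0
    # key -> list of (mod_revision, value), appended in increasing revision order
    history: dict = field(default_factory=dict)

    def put(self, key, value):
        """Write a new version of key and return the revision it was written at."""
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, rev=None):
        """Return the value visible at rev (default: latest), or None if absent."""
        rev = self.revision if rev is None else rev
        visible = None
        for mod_rev, value in self.history.get(key, []):
            if mod_rev > rev:
                break
            visible = value
        return visible

    def txn(self, key, expected, then_value):
        """If value(key) == expected, put then_value; else just read the key.

        Returns (succeeded, value of key after the transaction).
        """
        if self.get(key) == expected:
            self.put(key, then_value)
            return True, then_value
        return False, self.get(key)


store = ToyMVCCStore()
r1 = store.put("foo", "bar")            # written at revision 1
r2 = store.put("foo", "baz")            # written at revision 2
print(store.get("foo"))                 # latest value: baz
print(store.get("foo", rev=r1))         # historical read at revision 1: bar
print(store.txn("foo", "bar", "qux"))   # compare fails: (False, 'baz')
```

The `txn` method mirrors the shape of an If-Then-Else transaction with one value comparison; a real etcd transaction can combine several comparisons and operations and is applied atomically through Raft.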

1.3 · Typical use cases

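Service discovery: each service registers a key bound to a lease, and consumers watch the prefix to learn when instances come and go
Configuration management: configuration is stored centrally, and watches push changes to clients in real time
Distributed locks: lease + txn implement a mutual-exclusion lock (wrapped by the concurrency package)
Leader election: multiple instances campaign for leadership, and an expired lease automatically triggers a new election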
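The lock and election patterns above both rest on the same idea: create a key only if it is absent, tie it to a lease, and let expiry release it. The sketch below simulates that with an injected clock so the behavior is deterministic; `ToyLeaseKV` and all its methods are hypothetical, and the real concurrency package is more involved (it also orders waiters).

```python
class ToyLeaseKV:
    """Toy model of lease-bound keys driven by a simulated clock."""

    def __init__(self):
        self.now = 0.0          # simulated time in seconds
        self.next_lease_id = 1
        self.leases = {}        # lease_id -> (ttl, expires_at)
        self.keys = {}          # key -> (value, lease_id)

    def grant(self, ttl):
        """Create a lease that expires ttl seconds from now."""
        lease_id = self.next_lease_id
        self.next_lease_id += 1
        self.leases[lease_id] = (ttl, self.now + ttl)
        return lease_id

    def keep_alive(self, lease_id):
        """Reset the lease to its full TTL; False if it no longer exists."""
        if lease_id not in self.leases:
            return False
        ttl, _ = self.leases[lease_id]
        self.leases[lease_id] = (ttl, self.now + ttl)
        return True

    def revoke(self, lease_id):
        """Remove the lease and every key attached to it."""
        self.leases.pop(lease_id, None)
        self.keys = {k: v for k, v in self.keys.items() if v[1] != lease_id}

    def advance(self, seconds):
        """Move the clock forward and expire leases whose deadline has passed."""
        self.now += seconds
        expired = [lid for lid, (_, exp) in self.leases.items() if exp <= self.now]
        for lid in expired:
            self.revoke(lid)

    def try_lock(self, key, owner, lease_id):
        """Create key only if it does not exist yet (the txn-style guard)."""
        if lease_id not in self.leases or key in self.keys:
            return False
        self.keys[key] = (owner, lease_id)
        return True

    def holder(self, key):
        entry = self.keys.get(key)
        return entry[0] if entry else None


kv = ToyLeaseKV()
a, b = kv.grant(ttl=10), kv.grant(ttl=10)
print(kv.try_lock("/locks/job", "worker-a", a))   # True: key was absent
print(kv.try_lock("/locks/job", "worker-b", b))   # False: worker-a holds it
kv.advance(6)
kv.keep_alive(b)                                  # only worker-b renews
kv.advance(6)                                     # lease a expired at t=10
print(kv.try_lock("/locks/job", "worker-b", b))   # True: lock was released
```

Because worker-a never renews its lease, the lock key disappears on expiry and worker-b can take over, which is the same mechanism that triggers re-election in the leader-election case.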

1.4 · Common API / etcdctl

# KV
etcdctl put foo bar
etcdctl get foo
etcdctl get --prefix /services/   # prefix query
etcdctl del foo

# Watch
etcdctl watch foo
etcdctl watch --prefix /services/

# Lease
etcdctl lease grant 60            # create a 60s lease
etcdctl put foo bar --lease={ID}  # attach the key to the lease
etcdctl lease keep-alive {ID}     # renew the lease
etcdctl lease revoke {ID}         # revoke (deletes all attached keys)

# Transaction
etcdctl txn --interactive
# compares: value("foo") = "bar"
# success:  put foo baz
# failure:  get foo

# Cluster
etcdctl member list
etcdctl endpoint status
etcdctl endpoint health

1.5 · Production notes

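Node count: run 3 or 5 members (an odd number), which tolerates 1 or 2 member failures respectively
Disk performance: etcd is sensitive to disk latency, so use SSDs and put the WAL and data directories on separate disks
Compaction: compact historical revisions regularly to keep the DB from growing without bound (--auto-compaction-*)
Snapshot: back up regularly with etcdctl snapshot save; restore with etcdctl snapshot restore (etcd 3.5 and later also ship etcdutl snapshot restore, which is the preferred tool there)
Monitoring: watch etcd_disk_wal_fsync_duration_seconds, etcd_server_proposals_failed_total, and DB size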
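As one way to act on the monitoring item, the sketch below computes the mean WAL fsync latency from Prometheus text-format output, using the `_sum` and `_count` series that every histogram metric exposes. The helper names and all sample values are invented for illustration; in practice the text would come from a member's metrics endpoint, and percentile-based alerts built from the bucket series are usually more informative than a mean.

```python
def parse_metrics(text):
    """Parse un-labelled samples from Prometheus text format into {name: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue  # skip blanks, comments, and labelled series such as buckets
        parts = line.split()
        if len(parts) < 2:
            continue  # not a "name value" sample line
        samples[parts[0]] = float(parts[1])
    return samples


def mean_wal_fsync_seconds(samples):
    """Mean fsync latency = histogram _sum / _count; None if nothing was observed."""
    count = samples.get("etcd_disk_wal_fsync_duration_seconds_count", 0.0)
    if count == 0:
        return None
    return samples["etcd_disk_wal_fsync_duration_seconds_sum"] / count


# Hypothetical scrape output; the numbers are made up.
SAMPLE = """\
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 50
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 200
etcd_disk_wal_fsync_duration_seconds_sum 0.5
etcd_disk_wal_fsync_duration_seconds_count 200
etcd_server_proposals_failed_total 3
"""

samples = parse_metrics(SAMPLE)
print(mean_wal_fsync_seconds(samples))                 # 0.0025 (seconds)
print(samples["etcd_server_proposals_failed_total"])   # 3.0
```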