Operations

Running a yggr cluster in anger: bring-up, rolling upgrades, graceful shutdown, failure recovery.

Bring-up

See Bootstrap modes for the full decision tree. Short version:

  • First boot of a new cluster. Every node starts with Bootstrap::NewCluster { initial_peers }. Exactly one bootstrap per cluster lifetime.
  • Subsequent boots. Always Bootstrap::Recover (the default). The node reloads its persisted state and catches up via replication.

Do not re-bootstrap a cluster that already has state. It creates a second history at term 0 and diverges from every other node.
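
The decision collapses to one branch at start-up. A minimal sketch follows, assuming a Node::start entry point, a peers field on Config, and anyhow for errors (all assumptions; only Config and the Bootstrap variants are names used in this book):

use yggr::{Bootstrap, Config, Node};

// Sketch only. Config and the Bootstrap variants are the names used in this book;
// Node::start, the peers field, and the anyhow error type are assumptions.
async fn boot(config: Config, first_boot_of_new_cluster: bool) -> anyhow::Result<()> {
    let bootstrap = if first_boot_of_new_cluster {
        // Exactly one bootstrap per cluster lifetime.
        Bootstrap::NewCluster { initial_peers: config.peers.clone() }
    } else {
        // Every later boot: reload persisted state, catch up via replication.
        Bootstrap::Recover
    };
    let _node = Node::start(config, bootstrap).await?;
    Ok(())
}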

Rolling upgrades

A rolling upgrade replaces binaries one node at a time without downtime. The recipe:

  1. Pick a node to upgrade (ideally a follower; use transfer_leadership to move leadership away if needed).
  2. Drain: call node.shutdown().await. The driver flushes in-flight proposals and stops cleanly.
  3. Swap the binary. On-disk state (log segments, snapshots, hard state) is compatible across patch versions.
  4. Start the new binary with the same Config and Bootstrap::Recover.
  5. Wait until node.status().membership.voters matches every peer and status.leader is stable.
  6. Repeat for each node.

Always upgrade voters one at a time. Upgrading two at once can leave the cluster without a quorum of voters on the new version.
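
Step 5 is the only step that needs code. A sketch of the wait, assuming status() is the synchronous accessor described under Observability and that voters exposes a contains check; the NodeId type and the poll interval are likewise assumptions:

use std::time::Duration;

// Sketch of step 5: poll until this node sees the full voter set and a stable leader.
// status(), membership.voters, and leader are described in this book; the concrete
// types and the 250 ms poll interval are assumptions.
async fn wait_until_rejoined(node: &yggr::Node, expected_voters: &[yggr::NodeId]) {
    loop {
        let status = node.status();
        let voters_ok = expected_voters
            .iter()
            .all(|v| status.membership.voters.contains(v));
        if voters_ok && status.leader.is_some() {
            return;
        }
        tokio::time::sleep(Duration::from_millis(250)).await;
    }
}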

Graceful shutdown

node.shutdown().await?;

This:

  • rejects new write() calls with Shutdown,
  • drains proposals already accepted into the driver,
  • stops the transport,
  • joins the tick / apply / driver tasks,
  • returns when everything has stopped.

If the caller holds multiple clones of Node, shutdown(self) consumes only the handle it is called on; the call that actually reaches the driver runs the drain, and subsequent calls on the remaining clones return Shutdown or DriverDead.
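
A sketch of what that looks like from the caller's side; the error variants are the ones named above, while the wrapper function and error type are assumptions:

// Sketch: two clones of one Node. shutdown(self) consumes the handle it is called
// on; the call that reaches the driver drains, the later call returns an error.
async fn stop_both(node: yggr::Node) -> anyhow::Result<()> {
    let other = node.clone();
    node.shutdown().await?;                 // runs the drain
    if let Err(e) = other.shutdown().await {
        eprintln!("second shutdown refused: {e:?}"); // Shutdown or DriverDead
    }
    Ok(())
}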

Triggering a snapshot

Snapshots are automatic when Config::snapshot_hint_threshold_entries or max_log_entries fires. To force one:

// Planned:  admin.trigger_snapshot().await?;

Today the engine’s hint is observed by the driver; there’s no public “snapshot now” button. If you need one, either lower the threshold or enqueue enough writes to trip it.

Handling leader loss

Expected during normal operation: election timeouts, transient partitions, node restarts. The runtime handles it — clients see NotLeader or NoLeader briefly, retry logic kicks in, traffic resumes on the new leader.

If a loss persists more than an election-timeout cycle, check:

  • status.role on each node — is any node stuck as a candidate or pre-candidate?
  • status.current_term — is the term diverging across nodes? (Unusual; expect it to converge upward.)
  • transport connectivity between voters — a partition between any two voters can deadlock election if the partition touches the majority.
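
A sketch that walks the first two checks from one place, assuming you can reach every node's handle (or an equivalent admin RPC); status() and its fields are the ones listed under Observability below, everything else is an assumption:

// Sketch: dump the fields the checklist asks about for every node you can reach.
// status() and its fields follow this book; gathering handles into a slice is an
// assumption about your deployment.
fn dump_election_state(nodes: &[(u64, yggr::Node)]) {
    for (id, node) in nodes {
        let s = node.status();
        println!(
            "node {id}: role={:?} term={} leader={:?} commit={}",
            s.role, s.current_term, s.leader, s.commit_index
        );
    }
}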

Disk-space management

Two knobs on Config:

  • snapshot_hint_threshold_entries — compact when the number of applied entries past the floor crosses this. Default 1024.
  • max_log_entries — compact when the live log (entries above the floor) crosses this, regardless of apply progress. Default 0 (off). Set it if you see slow or stuck apply producing unbounded log growth.
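
A sketch of tightening both knobs; the field names and defaults are the ones above, while Config::default() and the struct-update over the remaining fields are assumptions:

use yggr::Config;

// Sketch: compact earlier and cap live log growth. Field names and defaults come
// from the list above; Config::default() and the other fields are assumptions.
fn compaction_tuned() -> Config {
    Config {
        snapshot_hint_threshold_entries: 512, // compact sooner than the 1024 default
        max_log_entries: 8_192,               // bound the live log (default 0 = off)
        ..Config::default()
    }
}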

DiskStorage stores one snapshot.bin + snapshot_meta.bin at the data-dir root, plus segmented log files in log/. Crash-safe: atomic-rename writes for the snapshot, append-only segments for the log. .tmp files left by a crash mid-rename are swept on DiskStorage::open.

Capacity planning

Per-node disk at steady state: snapshot bytes + live log bytes + overhead. Live log is bounded by snapshot_hint_threshold_entries × average-entry-size plus in-flight replication. The snapshot is whatever your StateMachine::snapshot() produces.
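
As an illustrative back-of-envelope figure (assumed numbers, not measurements): at the default threshold of 1024 entries and roughly 1 KiB per entry, the live log hovers around 1 MiB plus in-flight replication, and per-node disk is that plus one snapshot.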

Per-node memory: engine state (small), applied-but-not-yet-responded proposals (bounded by max_pending_proposals), apply queue (bounded by max_pending_applies), plus your state machine.

Per-node CPU: dominated by the state machine’s apply(). The engine itself is cheap.

Observability

  • node.status() — point-in-time view of role, term, commit, last-applied, leader, membership.
  • node.metrics() — raw engine counters (elections, AEs, commits, reads, snapshots).
  • node.node_metrics() — runtime-facing wrapper: engine metrics plus membership and status fields.
  • tracing — structured events under targets yggr::engine and yggr::node. See Observability.

Wire these into your monitoring system at cluster bring-up. Cluster-wide visibility into current_term, commit_index, and leader catches most operational issues before they become outages.
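
A sketch of the wiring, assuming a periodic in-process scrape is acceptable; the accessors and fields are the ones listed above, while the ten-second interval and plain log output stand in for your metrics client:

use std::time::Duration;

// Sketch: sample the fields this chapter calls out and hand them to your
// monitoring system (plain println! here as a stand-in).
async fn scrape(node: yggr::Node) {
    loop {
        let s = node.status();
        println!(
            "term={} commit={} applied={} leader={:?}",
            s.current_term, s.commit_index, s.last_applied, s.leader
        );
        println!("engine counters: {:?}", node.metrics()); // assumes Debug on metrics
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
}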

Backups

A snapshot file plus the live log segments at a point in time is a complete backup of one node. For cluster backups:

  1. Take a snapshot on each node (wait for snapshots_installed to tick).
  2. Copy snapshot.bin, snapshot_meta.bin, and log/ to durable off-cluster storage.
  3. Include the node’s Config (peers, node_id) alongside — recovery needs it.

A single node’s backup is sufficient to rebuild the cluster: restore that node, re-bootstrap the others via add_learner + snapshot install, promote.
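
A heavily hedged sketch of that rebuild; add_learner is the only call named in this book, while its signature, the promotion step, and the ID type are assumptions to check against your version:

// Sketch of rebuilding from one node's backup. The restored node is assumed to have
// been started from the copied data dir with Bootstrap::Recover and to hold (or be
// handed) leadership; add_learner is named in this book, promote is an assumption.
async fn rebuild(restored: yggr::Node, empty_peer_ids: &[u64]) -> anyhow::Result<()> {
    for &id in empty_peer_ids {
        restored.add_learner(id).await?; // fresh node catches up via snapshot install
        // restored.promote(id).await?;  // promotion call name assumed; check your version
    }
    Ok(())
}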