Checkpoint Ordering
When FastForward ships a batch to your collector, the collector might ACK batch #3 before batch #1. The checkpoint must not advance past unresolved batches — or a crash would skip unsent data.
This is the core correctness invariant, proven in TLA+.
The algorithm
Section titled “The algorithm”Each source maintains three data structures:
committed: BTreeMap<SourceId, Checkpoint> // safe restart pointin_flight: BTreeMap<SourceId, BTreeMap<BatchId, Checkpoint>> // sent, awaiting ACKpending_acks: BTreeMap<SourceId, BTreeMap<BatchId, Checkpoint>> // ACKed but can't commit yetWhen an ACK arrives:
- Remove the batch from
in_flight, add topending_acks - Starting from the lowest pending ACK, check: is there any in-flight batch with a lower ID?
- If no — consume the pending ACK, advance
committed, repeat from step 2 - If yes — stop. The checkpoint cannot advance past an unresolved batch.
This is the Filebeat registrar pattern, implemented as a pure state machine with no executor dependencies.
Batch lifecycle
Section titled “Batch lifecycle”Every batch follows a strict lifecycle enforced by Rust’s type system:
create_batch() → BatchTicket<Queued> │ begin_send() → BatchTicket<Sending> ← enters in_flight tracking │ ├── ack() → AckReceipt ← successful delivery ├── reject() → AckReceipt ← permanent failure (e.g. 413) └── fail() → BatchTicket<Queued> ← transient failure, retryThe #[must_use] annotation on begin_send() warns if the returned Sending ticket is
ignored at the call site. The deeper enforcement comes from the typestate pattern: the
Sending ticket’s self-consuming transition methods (ack(), reject(), fail()) are the
only way to resolve it — you cannot drop a Sending ticket without calling one of these.
Why rejection advances the checkpoint
Section titled “Why rejection advances the checkpoint”A permanently rejected batch (e.g., 413 Payload Too Large) is undeliverable. Holding it in-flight forever would freeze the checkpoint for that source, eventually blocking all progress. The trade-off:
- Hold (transient failures): retry with backoff, checkpoint frozen until success
- Reject (permanent failures): advance checkpoint past the batch, accept data loss for one batch
This matches Filebeat’s model: advance past undeliverable records rather than stall the pipeline.
Formal verification
Section titled “Formal verification”The TLA+ specification PipelineMachine.tla proves 70+ properties including:
| Property | What it guarantees |
|---|---|
CheckpointNeverAheadOfTerminalizedPrefix | Checkpoint never skips an unresolved batch |
DrainCompleteness | stop() is unreachable unless all batches resolved |
CommittedMonotonic | Committed offset never decreases |
CheckpointOrderingInvariant | Every sent batch below the committed watermark is terminally resolved |
NoBatchLeftBehind | Every in-flight batch eventually resolves (ack/reject/abandon) |
The Rust implementation mirrors the TLA+ spec field-by-field — committed, in_flight,
and pending_acks map directly to the specification variables.