
Trust and Failure Model

This document describes exactly what happens when Orkestra fails, degrades, or is deliberately stopped. It covers every failure mode — process crash, panic, leader loss, API server disconnect, node failure, graceful shutdown — and what each means for the CRs Orkestra manages.

The short answer is reassuring: the failure model is designed to never corrupt CR state, never orphan child resources, and never block cluster operations.


The foundational guarantee

Orkestra’s failure model rests on a single foundational guarantee:

Any operation that Orkestra does not complete will be completed on the next reconcile.

Kubernetes provides the infrastructure for this guarantee: CRs are durably stored in etcd, watches deliver events reliably, and the informer pattern re-lists on reconnect, so no state change is missed across restarts.


Leader election (konductor election)

Orkestra uses Kubernetes leader election — the same mechanism used by kube-controller-manager — to ensure that only one Orkestra instance actively reconciles at any time.

What followers do while not leading

Every Orkestra instance — leader and followers — runs its informers and populates its local caches continuously. When a follower wins the lease, it has a warm cache and starts reconciling in seconds, not minutes.

Failover timeline

t=0:  Leader pod crashes (OOM, node failure, SIGKILL)
t=0:  Lease stops being renewed
t=15: Lease expires (leaseDuration)
t=15: Follower acquires lease immediately (it was waiting)
t=15: Follower workers start dequeuing
t=16: First reconcile completes in the new leader

Worst-case failover time: leaseDuration (default 15 seconds).


Panic recovery (safeReconcile)

Every reconcile call is wrapped in safeReconcile:

import (
    "context"
    "fmt"
    "runtime/debug"
)

func safeReconcile(ctx context.Context, fn func(ctx context.Context, key string) error, key string) (err error) {
    defer func() {
        if r := recover(); r != nil {
            err = fmt.Errorf("reconciler panic recovered: %v\n%s", r, debug.Stack())
        }
    }()
    return fn(ctx, key)
}

What happens when a reconciler panics

  1. The panic is caught by the deferred recover
  2. The stack trace is logged at ERROR level with the full goroutine stack
  3. A Warning Kubernetes event is emitted on the CR that caused the panic
  4. The error is returned to the workqueue
  5. The workqueue requeues the item with exponential backoff
  6. Other CRDs are completely unaffected — each CRD has its own worker pool

Graceful shutdown

When Orkestra receives SIGTERM (standard Kubernetes pod termination):

SIGTERM received
  │
  ▼
Context cancelled — propagates to all components
  │
  ▼
Workers stop accepting new queue items (queue.ShutDown())
  │  In-flight reconciles are allowed to complete
  │
  ▼
Workers exit when current reconcile completes
  │
  ▼
Informers stop watching
  │
  ▼
HTTPS server drains open connections (30s timeout)
  │
  ▼
Process exits 0

Summary: what can go wrong and what Orkestra does

| Failure | Effect | Recovery |
| --- | --- | --- |
| Reconciler panic | CR left in partial state | Requeued with backoff, corrected on next reconcile |
| Process crash | Reconciliation paused for up to 15s | Follower acquires lease, resumes immediately |
| Node failure | Reconciliation paused for up to 15s | Follower on healthy node acquires lease |
| API server disconnect | Reconcile writes fail, queued for retry | Automatic reconnection, retry on restore |
| Graceful shutdown | In-flight reconcile completes, then stops | New instance picks up pending events |
| Admission webhook unreachable | Objects stored without synchronous validation | Reconcile-time validation corrects violations |
| Mutation error | Object stored without defaults | Reconcile-time mutation applies defaults |
| Leader lease expired (no follower) | Reconciliation paused until a leader is elected | New instance or connectivity restore |
| CRD degraded | Reconciliation continues, health API shows degraded | Recovers when a reconcile succeeds |