
# Principle #9: Checkpoint / Resume — Case Studies

Every long-running operation must be interruptible and resumable.


## Case Study 1: GitLab's Failed Database Migration (2017)

On January 31, 2017, GitLab.com experienced a major outage when a database replication issue led an engineer to run `rm -rf` on a PostgreSQL data directory. The subsequent recovery revealed that none of their five backup methods were fully functional. The 6-hour restore process had no checkpoint mechanism: if it failed partway through, they had to start over. The incident led to 18 hours of downtime and permanent loss of ~6 hours of data (issues, merge requests, comments).

**When followed:** Modern migration tools (like Django's migration framework or Flyway) apply changes as discrete, numbered steps. If step 47 of 200 fails, you resume from step 47, not step 1. Each completed migration is recorded in a tracking table.
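
The mechanism fits in a few lines. Here is a minimal sketch using SQLite; the table name, migration list, and database path are illustrative and not taken from any particular tool:

```python
# Minimal sketch of stepwise migrations with a tracking table.
# Table name, migrations, and database path are illustrative.
import sqlite3

MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
    (3, "CREATE INDEX idx_users_email ON users (email)"),
]

def migrate(db_path="app.db"):
    conn = sqlite3.connect(db_path)
    # The tracking table records every step that has completed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}

    for version, sql in MIGRATIONS:
        if version in applied:
            continue  # checkpoint: this step already ran on a previous attempt
        conn.execute(sql)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        conn.commit()  # commit per step, so a failure loses at most the current step
    conn.close()

if __name__ == "__main__":
    migrate()  # safe to re-run after any failure; it resumes at the first unapplied step
```

Because each step is committed together with its row in the tracking table, re-running the script after a failure skips everything that already succeeded.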

**When violated:** GitLab's restore was an all-or-nothing operation. The team live-streamed the recovery on YouTube, manually monitoring progress for hours with no ability to checkpoint and resume if something went wrong.

## Case Study 2: Apache Kafka's Consumer Offset Design

Kafka's entire consumer model is built around checkpoint/resume. Every consumer tracks its "offset" — the position in the log it has read up to. If a consumer crashes, it restarts from its last committed offset, reprocessing at most a small window of messages. This design makes Kafka consumers inherently resumable.
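
A minimal sketch of the commit-after-processing pattern, using the kafka-python client; the topic, group id, broker address, and `process` function are placeholders:

```python
# Sketch: Kafka consumer with explicit offset commits (kafka-python).
# Topic, group id, and broker address are placeholders.
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    """Placeholder for the real per-message work."""
    print(len(payload))

consumer = KafkaConsumer(
    "events",                          # placeholder topic
    bootstrap_servers="localhost:9092",
    group_id="example-processor",      # offsets are tracked per consumer group
    enable_auto_commit=False,          # commit only after the work is done
    auto_offset_reset="earliest",
)

for message in consumer:
    process(message.value)
    consumer.commit()                  # checkpoint: a restart resumes after this offset
```

Because the offset is committed only after `process` returns, a crash replays at most the messages consumed since the last commit (at-least-once delivery) rather than losing or skipping work.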

**When followed:** LinkedIn (where Kafka originated) processes trillions of messages daily. Consumer crashes, deployments, and scaling events happen constantly; none causes data loss or requires full reprocessing, because every consumer checkpoints its progress.

**When violated:** Teams that build custom queue consumers without offset tracking end up with "poison pill" messages that crash the consumer in an infinite loop, or "lost message" bugs where a restart skips unprocessed items because there's no record of where processing stopped.

## Case Study 3: psyresearch Paper Ingestion (This Workspace)

The psyresearch project ingests academic papers from PubMed and CrossRef APIs. Early versions processed papers in a single batch — if the script crashed at paper 847 of 2000, all progress was lost. After adopting checkpoint/resume (hash-based dedup on DOI + JSONL progress log), the ingestion became interruptible: kill it at any point, restart, and it skips already-processed papers. This also enabled incremental daily runs that only process new papers.
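
A sketch of that pattern, with illustrative file names, field names, and a stubbed ingestion step rather than the actual psyresearch code:

```python
# Sketch of the JSONL progress log + DOI dedup described above.
# File names, field names, and store_in_database are illustrative.
import json
from pathlib import Path

PROGRESS_LOG = Path("ingested_papers.jsonl")

def store_in_database(paper: dict) -> None:
    """Placeholder for the real ingestion step (parse, index, insert, ...)."""

def load_processed_dois() -> set:
    """Read the progress log and return the DOIs already ingested."""
    if not PROGRESS_LOG.exists():
        return set()
    with PROGRESS_LOG.open() as f:
        return {json.loads(line)["doi"] for line in f if line.strip()}

def ingest(papers: list) -> None:
    done = load_processed_dois()
    with PROGRESS_LOG.open("a") as log:
        for paper in papers:
            if paper["doi"] in done:
                continue                              # processed on an earlier run
            store_in_database(paper)
            log.write(json.dumps({"doi": paper["doi"]}) + "\n")
            log.flush()                               # checkpoint survives a crash mid-run
```

Killing the script at any point and re-running `ingest` skips every DOI already in the log, so only the in-flight paper is ever redone.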

**When followed:** After a rate-limit error at paper 500, a restart processes only papers 501+. Daily incremental runs take seconds instead of minutes because they skip the 95% of papers already in the database.

**When violated:** The original implementation would re-fetch and re-process all 2000+ papers on every run, wasting API calls and risking rate-limit bans from PubMed.


## Industry Cross-References

| Pattern | Who Uses It | Notes |
|---|---|---|
| Write-ahead log (WAL) | PostgreSQL, SQLite, all major databases | Core durability mechanism: every write is recorded in the log before it is applied |
| Consumer offsets | Apache Kafka, Amazon SQS (visibility timeout), RabbitMQ (ack) | Message queue standard since the 2010s |
| Job checkpointing | SLURM (HPC), PyTorch (model training), Kubernetes Jobs | Long-running compute workloads save state periodically |
| Resumable uploads | Google Drive, AWS S3 multipart upload, tus.io protocol | Large file uploads broken into chunks with server-side tracking |
| Git itself | Linus Torvalds | Every commit is a checkpoint; branching is cheap resumption from any point |

## Key Insight

Checkpoint/resume is not a nice-to-have for long-running operations — it's a correctness requirement. Any operation that takes longer than a few seconds can be interrupted (network failure, OOM kill, user Ctrl-C, power loss). The question isn't "will it be interrupted?" but "when it's interrupted, how much work do we lose?" The answer should be "at most the current item," never "everything."
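
One way to keep the answer at "the current item" is to persist a tiny checkpoint atomically after each item. The sketch below uses illustrative names; `os.replace` makes the swap atomic, so an interruption at any point leaves either the old or the new checkpoint intact, never a half-written file:

```python
# Illustrative: lose at most the current item by checkpointing after each one.
import json
import os

CHECKPOINT = "progress.json"

def handle(item):
    """Placeholder for the real per-item work."""

def load_checkpoint() -> int:
    try:
        with open(CHECKPOINT) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0                                  # first run: start from the beginning

def save_checkpoint(next_index: int) -> None:
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, CHECKPOINT)                   # atomic swap: old or new, never partial

def run(items: list) -> None:
    for i in range(load_checkpoint(), len(items)):
        handle(items[i])
        save_checkpoint(i + 1)                    # an interruption now loses at most items[i]
```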