# Principle #9: Checkpoint / Resume — Case Studies
> Every long-running operation must be interruptible and resumable.
---
## Case Study 1: GitLab's Failed Database Migration (2017)
On January 31, 2017, GitLab.com experienced a major outage when a database replication problem led an engineer to accidentally run `rm -rf` on the primary production PostgreSQL data directory. The subsequent recovery revealed that none of their five backup methods were fully functional. The 6-hour restore process had no checkpoint mechanism: if it failed partway through, the team had to start over. The incident led to roughly 18 hours of downtime and the permanent loss of about 6 hours of data (issues, merge requests, comments).
**When followed:** Modern migration tools (like Django's migration framework or Flyway) apply changes as discrete, numbered steps. If step 47 of 200 fails, you resume from step 47 — not step 1. Each completed migration is recorded in a tracking table.
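A minimal sketch of that tracking-table pattern, using SQLite for brevity. The `applied_migrations` table and migration names are illustrative, not Django's or Flyway's actual schema:

```python
import sqlite3

# Ordered, numbered migrations: (name, SQL) pairs. Names are hypothetical.
MIGRATIONS = [
    ("0001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    ("0002_add_email", "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    with conn:  # tracking table: one row per fully applied migration
        conn.execute(
            "CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)"
        )
    done = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}
    for name, sql in MIGRATIONS:
        if name in done:
            continue  # checkpoint hit: already applied on a previous run
        with conn:  # each step commits atomically with its tracking record
            conn.execute(sql)
            conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))
    conn.close()
```

Because the schema change and its tracking record commit in the same transaction, a crash can never leave a migration half-recorded: on restart the loop simply skips everything in `applied_migrations` and resumes at the first unapplied step.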
**When violated:** GitLab's restore was an all-or-nothing operation. The team live-streamed the recovery on YouTube, manually monitoring progress for hours with no ability to checkpoint and resume if something went wrong.
## Case Study 2: Apache Kafka's Consumer Offset Design
Kafka's entire consumer model is built around checkpoint/resume. Every consumer tracks its "offset" — the position in the log it has read up to. If a consumer crashes, it restarts from its last committed offset, reprocessing at most a small window of messages. This design makes Kafka consumers inherently resumable.
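A minimal sketch of this loop using the kafka-python client; the topic name, group ID, and `process` function are placeholders, not a prescribed setup:

```python
from kafka import KafkaConsumer  # kafka-python client, assumed installed

def process(payload: bytes) -> None:
    """Hypothetical per-message work (parse, write to DB, etc.)."""
    ...

consumer = KafkaConsumer(
    "papers",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="ingest-workers",      # offsets are tracked per consumer group
    enable_auto_commit=False,       # commit only after a message is fully handled
    auto_offset_reset="earliest",   # a brand-new group starts from the log head
)

for message in consumer:
    process(message.value)
    consumer.commit()  # checkpoint: after a crash, the group resumes from here
```

Committing manually after each message trades a little throughput for at-least-once delivery: a crash between `process` and `commit` replays that one message, never skips it.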
**When followed:** LinkedIn (where Kafka originated) processes trillions of messages daily. Consumer crashes, deployments, and scaling events happen constantly — none cause data loss or require full reprocessing because every consumer checkpoints its progress.
**When violated:** Teams that build custom queue consumers without offset tracking end up with "poison pill" messages that crash the consumer in an infinite loop, or "lost message" bugs where a restart skips unprocessed items because there's no record of where processing stopped.
## Case Study 3: psyresearch Paper Ingestion (This Workspace)
The psyresearch project ingests academic papers from PubMed and CrossRef APIs. Early versions processed papers in a single batch — if the script crashed at paper 847 of 2000, all progress was lost. After adopting checkpoint/resume (hash-based dedup on DOI + JSONL progress log), the ingestion became interruptible: kill it at any point, restart, and it skips already-processed papers. This also enabled incremental daily runs that only process new papers.
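A minimal sketch of that pattern; the log file name and the `store_paper` helper are illustrative, not the project's actual code:

```python
import json
from pathlib import Path

PROGRESS_LOG = Path("ingest_progress.jsonl")  # hypothetical log file name

def store_paper(paper: dict) -> None:
    """Hypothetical: fetch full metadata and write the paper to the database."""
    ...

def load_seen_dois() -> set[str]:
    """Rebuild the checkpoint set from the append-only progress log."""
    if not PROGRESS_LOG.exists():
        return set()
    with PROGRESS_LOG.open() as f:
        return {json.loads(line)["doi"] for line in f if line.strip()}

def ingest(papers: list[dict]) -> None:
    seen = load_seen_dois()
    with PROGRESS_LOG.open("a") as log:
        for paper in papers:
            if paper["doi"] in seen:
                continue  # skip: processed on an earlier run
            store_paper(paper)
            # Record the DOI only after the paper is fully stored, so a crash
            # between the two steps re-processes (rather than skips) the item.
            log.write(json.dumps({"doi": paper["doi"]}) + "\n")
            log.flush()  # bound the loss window to the current item
```

Killing this at any point and restarting replays at most one paper; everything in the log is skipped, which is also what makes the incremental daily runs cheap.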
**When followed:** A rate-limit error at paper 500 means restarting processes only papers 501+. Daily incremental runs take seconds instead of minutes because they skip the 95% of papers already in the database.
**When violated:** The original implementation would re-fetch and re-process all 2000+ papers on every run, wasting API calls and risking rate limit bans from PubMed's API.
---
## Industry Cross-References
| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| Write-ahead log (WAL) | PostgreSQL, SQLite, all major databases | Core durability mechanism: every change is logged to disk before it is applied to the data files |
| Consumer offsets | Apache Kafka, Amazon SQS (visibility timeout), RabbitMQ (ack) | Message queue standard since the 2010s |
| Job checkpointing | SLURM (HPC), PyTorch (model training), Kubernetes Jobs | Long-running compute workloads save state periodically |
| Resumable uploads | Google Drive, AWS S3 multipart upload, tus.io protocol | Large file uploads broken into chunks with server-side tracking |
| Git itself | Every developer | Every commit is a checkpoint. Branching is cheap resumption from any point. |
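
The write-ahead log in the first row is the oldest form of the pattern and worth a closer look. A minimal sketch of the idea; the file name and `apply_op` step are illustrative, not any database's real implementation:

```python
import json
import os

WAL_PATH = "state.wal"  # hypothetical log file

def apply_op(op: dict) -> None:
    """Hypothetical: mutate the main data files."""
    ...

def write_ahead(op: dict) -> None:
    # 1. Append the operation to the log and force it to stable storage...
    with open(WAL_PATH, "a") as wal:
        wal.write(json.dumps(op) + "\n")
        wal.flush()
        os.fsync(wal.fileno())
    # 2. ...and only then touch the data files. After a crash, replaying the
    #    log re-applies any operation whose effect never reached the files.
    apply_op(op)
```

A production WAL also tracks which log entries have already reached the data files (e.g. via sequence numbers) so that replay is idempotent, but the ordering invariant shown here is the core of it.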
---
## Key Insight
Checkpoint/resume is not a nice-to-have for long-running operations — it's a correctness requirement. Any operation that takes longer than a few seconds can be interrupted (network failure, OOM kill, user Ctrl-C, power loss). The question isn't "will it be interrupted?" but "when it's interrupted, how much work do we lose?" The answer should be "at most the current item," never "everything."