A remote build cache that turned 22-minute CI into 4 minutes

Designed and shipped the remote build cache that took our monorepo from a 22-minute median pipeline to 4 minutes, while staying on our existing GitHub Actions runners.

Role: Tech lead
Year: 2024
Where: Lumen Labs
Period: Q2–Q4 2024

Problem

A 9-team monorepo where every PR triggered a full 22-minute pipeline regardless of what changed. The thing nobody admitted: most of those minutes were redoing work the cluster had already done.

Outcome

Median PR pipeline 22 min → 4 min. Cache hit rate 71% sustained. ~$18k/mo of CI compute reclaimed.

Stack

  • Bazel
  • Go
  • TypeScript
  • GitHub Actions
  • S3
  • PostgreSQL

Why we ended up here

The platform team inherited a monorepo with around 1.4 million lines of TypeScript and Go, nine product teams, and a reputation. Engineers had a mental model that “CI is just slow” — which meant everyone ran the same handful of commands locally before pushing, then watched a 22-minute pipeline burn anyway. People stopped opening small PRs because the overhead-to-payoff ratio was wrong.

The honest truth: nobody had measured what the pipeline was actually doing. When we instrumented it, more than 60% of the wall time was spent rebuilding artifacts that hadn’t changed since the previous green build on main.

What we tried first

The intuitive fix is “cache the node_modules folder,” and we tried that. It made things 7% faster, which feels like a rounding error in a 22-minute pipeline. The problem wasn’t dependency installation. It was that every step in the build graph treated its inputs opaquely. A change to a single CSS file invalidated TypeScript compilation in unrelated packages because the dependency graph was implicit.

We needed two things at once: explicit input hashing per build action, and a place to store results keyed by those hashes.

The actual design

We standardized on a Bazel-style action graph. Every build step declares its inputs (source files plus tool versions plus environment), and the framework hashes them. The hash becomes the cache key.
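
A minimal sketch of how a key might be computed, in Go. The Action fields and the normalization here are illustrative, not the production schema; the point is that every declared input, tool version, and environment value feeds one digest:

```go
package cachekey

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"sort"
)

// Action describes one build step. Field names are illustrative.
type Action struct {
	Inputs []string          // paths to declared source files
	Tools  map[string]string // tool name -> version, e.g. "tsc" -> "5.4.2"
	Env    map[string]string // the environment subset the step declares
}

// Key folds every declared input into a single content-addressed key.
// Ordering is normalized so the same action always yields the same key.
func Key(a Action) (string, error) {
	h := sha256.New()

	inputs := append([]string(nil), a.Inputs...) // don't mutate the caller's slice
	sort.Strings(inputs)
	for _, p := range inputs {
		f, err := os.Open(p)
		if err != nil {
			return "", err
		}
		fmt.Fprintf(h, "file:%s:", p)
		_, err = io.Copy(h, f)
		f.Close()
		if err != nil {
			return "", err
		}
	}

	// Tool versions and environment are part of the key, so a compiler
	// upgrade or an env change is a miss, never a stale hit.
	for _, m := range []map[string]string{a.Tools, a.Env} {
		ks := make([]string, 0, len(m))
		for k := range m {
			ks = append(ks, k)
		}
		sort.Strings(ks)
		for _, k := range ks {
			fmt.Fprintf(h, "kv:%s=%s:", k, m[k])
		}
	}

	return hex.EncodeToString(h.Sum(nil)), nil
}
```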

The cache itself is dumb: an S3 bucket with content-addressed keys, fronted by a small Go service that handles auth, telemetry, and the GET-with-fallback-to-PUT semantics we needed for cold paths. We resisted the urge to put a database in front of it. The bucket is the database.
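
Roughly what the service's hot path looks like. This sketch hides S3 behind a BlobStore interface and assumes a hypothetical /ac/ path scheme; auth and telemetry middleware are omitted:

```go
package cachefront

import (
	"errors"
	"io"
	"net/http"
	"strings"
)

// BlobStore abstracts the S3 bucket; the real service calls the AWS SDK here.
type BlobStore interface {
	Get(key string) (io.ReadCloser, error) // returns ErrNotFound on a miss
	Put(key string, body io.Reader) error
}

var ErrNotFound = errors.New("not found")

type Handler struct{ Store BlobStore }

// ServeHTTP implements the GET-with-fallback-to-PUT contract: GET returns
// the artifact or a 404, and on a 404 the client builds locally and PUTs
// the result back for the next consumer.
func (h Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	key := strings.TrimPrefix(r.URL.Path, "/ac/") // content-addressed key

	switch r.Method {
	case http.MethodGet:
		body, err := h.Store.Get(key)
		if errors.Is(err, ErrNotFound) {
			http.Error(w, "cache miss", http.StatusNotFound)
			return
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer body.Close()
		io.Copy(w, body)
	case http.MethodPut:
		if err := h.Store.Put(key, r.Body); err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusCreated)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}
```

Because the key is a content hash, PUTs are idempotent, which is part of what lets the service stay stateless.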

A few specific decisions that mattered more than they look:

  • Tool versions are part of the cache key. This sounds obvious. It is not what people do by default. Without it, you ship correctness bugs that look like flakes for months.
  • Hashes are computed before any expensive work. If a step is a hit, we don’t even check out the dependencies it would have needed. This is where most of the wall-time win came from.
  • Negative caching is explicit. A failed build does not poison the cache. It also does not get to retry forever. We mark failures in PostgreSQL with a TTL and a backoff; see the sketch after this list.
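
A sketch of that negative-cache bookkeeping, with an illustrative table and column names; the exact backoff curve and the TTL pruning job are assumptions, not the production values:

```go
package negcache

import (
	"database/sql"
	"errors"
	"time"

	_ "github.com/lib/pq" // Postgres driver
)

// Illustrative schema (a periodic job prunes rows past their TTL):
//
//   CREATE TABLE failed_actions (
//     cache_key   text PRIMARY KEY,
//     failures    int NOT NULL DEFAULT 1,
//     retry_after timestamptz NOT NULL
//   );

// MarkFailure records a failed build and pushes the retry window out
// exponentially: 1 minute after the first failure, then 2, 4, ... capped.
func MarkFailure(db *sql.DB, key string) error {
	_, err := db.Exec(`
		INSERT INTO failed_actions (cache_key, failures, retry_after)
		VALUES ($1, 1, now() + interval '1 minute')
		ON CONFLICT (cache_key) DO UPDATE SET
			failures    = failed_actions.failures + 1,
			retry_after = now() + make_interval(mins => 1 << least(failed_actions.failures, 6))
	`, key)
	return err
}

// ShouldSkip reports whether the action is still inside its backoff window.
func ShouldSkip(db *sql.DB, key string) (bool, error) {
	var retryAfter time.Time
	err := db.QueryRow(
		`SELECT retry_after FROM failed_actions WHERE cache_key = $1`, key,
	).Scan(&retryAfter)
	if errors.Is(err, sql.ErrNoRows) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return time.Now().Before(retryAfter), nil
}
```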

What broke first

In the first month, the hit rate was 14%. We were proud. We were also wrong: a non-determinism bug in our TypeScript build was producing different output bytes for identical inputs depending on the working directory of the process. It took us a week to find it. The fix was four lines.
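
That class of bug is cheap to test for once you know to look for it: run the same action from two different working directories and compare output digests. A hypothetical version of such a check:

```go
package detcheck

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os"
	"os/exec"
)

// runAndHash executes the build command from dir and digests the artifact
// it writes. The command and artifact path are placeholders.
func runAndHash(dir, artifact string, argv []string) ([]byte, error) {
	cmd := exec.Command(argv[0], argv[1:]...)
	cmd.Dir = dir
	if out, err := cmd.CombinedOutput(); err != nil {
		return nil, fmt.Errorf("build failed: %v\n%s", err, out)
	}
	data, err := os.ReadFile(artifact)
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(data)
	return sum[:], nil
}

// CheckDeterminism fails if identical inputs produce different bytes when
// built from two different working directories.
func CheckDeterminism(dirA, dirB, artifactA, artifactB string, argv []string) error {
	a, err := runAndHash(dirA, artifactA, argv)
	if err != nil {
		return err
	}
	b, err := runAndHash(dirB, artifactB, argv)
	if err != nil {
		return err
	}
	if !bytes.Equal(a, b) {
		return fmt.Errorf("non-deterministic output: %x != %x", a, b)
	}
	return nil
}
```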

Once that was out of the way, hit rate climbed past 60% within two weeks of organic warming and stabilized at 71% within a quarter.

What I’d do differently

I’d build the observability first, not last. We shipped the cache, then realized we couldn’t actually answer “why didn’t this PR get a hit?” without grepping logs. Two months later we built a per-action diff tool that shows exactly which input bytes differed between a hit and a miss. That tool should have shipped on day one. Most of the team’s adoption frustration in the first month would have evaporated.
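
At its core, that diff tool compares two input manifests, path to digest, and prints exactly which entries changed between a hit and a miss; the names below are illustrative:

```go
package actiondiff

import (
	"fmt"
	"sort"
)

// Manifest maps each declared input (file path, tool version, env var)
// to the digest that fed the cache key.
type Manifest map[string]string

// Diff prints every entry that differs between the manifest of a prior
// hit and the manifest of the current miss.
func Diff(hit, miss Manifest) {
	keys := map[string]struct{}{}
	for k := range hit {
		keys[k] = struct{}{}
	}
	for k := range miss {
		keys[k] = struct{}{}
	}
	sorted := make([]string, 0, len(keys))
	for k := range keys {
		sorted = append(sorted, k)
	}
	sort.Strings(sorted)

	for _, k := range sorted {
		h, inHit := hit[k]
		m, inMiss := miss[k]
		switch {
		case !inHit:
			fmt.Printf("+ %s (new input, digest %s)\n", k, m)
		case !inMiss:
			fmt.Printf("- %s (input removed)\n", k)
		case h != m:
			fmt.Printf("~ %s (%s -> %s)\n", k, h, m)
		}
	}
}
```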

Concrete outcome

  • Median PR pipeline time: 22 min → 4 min.
  • Cache hit rate sustained at 71% across the monorepo.
  • Compute spend down by ~$18k/month without changing the runner SKU.
  • “Open a small PR” stopped feeling expensive. We measured a 2.3× increase in PRs per engineer per week six months later, which is the actual win.