Optimizing CI Builds with Docker Layer Caching on TeamCity

How we cut build times from 45 minutes down to 3–10 minutes on an on-premise TeamCity cluster using multistage Docker builds, content-addressed cache keys, and shared filesystem volumes — plus what modern BuildKit unlocks today.


Build times were out of control. A routine pull request was sitting in the queue for 45 minutes before a developer got any feedback. We were running an on-premise build cluster on VMware vSphere, orchestrated by TeamCity, and the bottleneck was obvious: every build started from scratch.

This is the story of how we got that down to 3–10 minutes, the trade-offs we made along the way, and what we would do differently with the tooling that exists today.

The Setup

The cluster ran on VMware vSphere with TeamCity as the CI server. Rather than maintaining a pool of long-lived build agents, we used the TeamCity VMware vSphere plugin to dynamically provision and destroy agent VMs on demand. Each agent was a snapshot clone — fast to start, always in a known state, and completely ephemeral.

Ephemeral is great for reproducibility. It is terrible for caching.

Every time a new agent spawned, it had nothing. No npm modules, no Maven local repository, no Docker layer cache. The build would reinstall everything from the network, every single time.

Multistage Docker Builds as the Unit of Work

We standardised all builds around multistage Docker builds. The Dockerfile became the single source of truth for the entire build: dependency installation, compilation, test execution, and artifact packaging were all stages.

Dockerfile
# ── Stage 1: dependency installation ─────────────────────────────────────────
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile
 
# ── Stage 2: build ────────────────────────────────────────────────────────────
FROM deps AS build
COPY . .
RUN yarn build
 
# ── Stage 3: test ─────────────────────────────────────────────────────────────
FROM build AS test
RUN yarn test --ci
 
# ── Stage 4: production image ─────────────────────────────────────────────────
FROM node:20-alpine AS release
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=deps /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]

This gave us a clean interface: docker build --target release . produced the final artifact, and every intermediate layer was a candidate for caching.

Content Hash as a Cache Key

The key insight is that Docker layer caching is content-addressed at the instruction level. If package.json and yarn.lock have not changed, the RUN yarn install layer is a cache hit. The trick is to copy only the files that affect a given layer before running the expensive command.

We took this further and computed an explicit content hash to use as an image tag:

cache-key.sh
# Hash the files that influence the dependency layer
DEPS_HASH=$(cat package.json yarn.lock | sha256sum | cut -c1-12)
DEPS_IMAGE="registry.internal/myapp/deps:${DEPS_HASH}"

This hash became the cache key and the image tag. If the same hash already existed somewhere accessible, we could pull it and use it as a --cache-from source. If not, we built from scratch and stored the result.
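As a sketch, the existence probe can be done without paying for a pull first: on recent Docker versions, docker manifest inspect queries the registry and exits non-zero when the tag is absent (assuming you are authenticated against the registry). The filename is hypothetical:

probe-cache.sh
```shell
# Same content hash and tag as in cache-key.sh
DEPS_HASH=$(cat package.json yarn.lock | sha256sum | cut -c1-12)
DEPS_IMAGE="registry.internal/myapp/deps:${DEPS_HASH}"

# `docker manifest inspect` exits non-zero when the tag does not exist
if docker manifest inspect "${DEPS_IMAGE}" >/dev/null 2>&1; then
  docker pull "${DEPS_IMAGE}"                        # cache hit: reuse
else
  docker build --target deps --tag "${DEPS_IMAGE}" . # miss: build and publish
  docker push "${DEPS_IMAGE}"
fi
```

The probe costs one small HTTP round trip instead of a speculative multi-hundred-megabyte pull, which matters once the registry becomes the bottleneck.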

Approach 1: Push to an Internal Registry

The first approach was straightforward: push cached stages to an internal Docker registry, pull them before each build.

build-with-registry-cache.sh
DEPS_HASH=$(cat package.json yarn.lock | sha256sum | cut -c1-12)
CACHE_IMAGE="registry.internal/myapp/cache:deps-${DEPS_HASH}"
 
# Try to prime the local daemon cache from the registry
docker pull "${CACHE_IMAGE}" || true
 
docker build \
  --cache-from "${CACHE_IMAGE}" \
  --target deps \
  --tag "${CACHE_IMAGE}" \
  .
 
# Push so future agents can use it
docker push "${CACHE_IMAGE}"
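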

This worked, but it introduced a new bottleneck: the network. Each agent had to pull potentially hundreds of megabytes of layers on every build, and push them back when done. The registry became a hot spot. On a busy day with many parallel builds the pull/push overhead consumed most of the time we had saved on compilation.

The ephemeral agent model meant we never got to keep the layers on disk between runs — we were effectively paying the network cost twice per build.

Uber open-sourced Makisu specifically to address this problem. Makisu is a Docker image builder designed for Kubernetes and containerised CI environments. It can build images without a Docker daemon, supports distributed layer caching backed by Redis or a filesystem, and was designed to minimise redundant network transfers. It is now archived, but its design influenced later solutions.

Approach 2: Shared Volumes Across Agents

The registry approach taught us that the cache needed to live closer to the build. Our vSphere hosts had fast local storage, and agents on the same host could share a path via a persistent vSphere-mounted volume.

We configured each TeamCity agent with a mounted volume at /build-cache that was shared across all agents on the same ESXi host. The build script was updated to use the local filesystem instead of the registry:

build-with-volume-cache.sh
CACHE_DIR="/build-cache/layers"
DEPS_HASH=$(cat package.json yarn.lock | sha256sum | cut -c1-12)
CACHE_TAR="${CACHE_DIR}/deps-${DEPS_HASH}.tar"
 
# Restore cache from shared volume if it exists
if [ -f "${CACHE_TAR}" ]; then
  docker load < "${CACHE_TAR}"
fi
 
docker build \
  --cache-from "myapp/deps:${DEPS_HASH}" \
  --target deps \
  --tag "myapp/deps:${DEPS_HASH}" \
  .
 
# Save to shared volume for other agents.
# Note: two agents can race between the -f check and the write;
# the closing notes cover concurrency control for shared caches.
if [ ! -f "${CACHE_TAR}" ]; then
  mkdir -p "${CACHE_DIR}"
  docker save "myapp/deps:${DEPS_HASH}" > "${CACHE_TAR}"
fi

Because the cache was on local storage, reads and writes were fast. New agents spawning on a host that had already built a given dependency hash could skip the installation step entirely, even on their very first run.

A nice side effect: every branch contributed to the cache. A feature/ branch that installed the same dependencies as main would warm the cache for the next build, regardless of branch name or build trigger.

Where This Approach Falls Short

Docker layer caching is a coarse instrument. It operates at instruction granularity: a single file change in a COPY . . step invalidates every layer downstream.
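One partial mitigation is keeping noise out of the build context, so that a file Docker never needs cannot invalidate the COPY . . layer in the first place. A sketch of a .dockerignore for a setup like ours (the specific entries are illustrative):

.dockerignore
```
# Files that should never reach the build context.
# Anything listed here can change without invalidating any COPY layer.
node_modules
.git
docs/
*.md
.teamcity/
```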

Several pain points surfaced as the project grew:

JavaScript tooling has its own cache model. Tools like webpack, esbuild, Vite, and Jest all have their own on-disk caches (.cache/, node_modules/.cache/). These live inside the build context, not in Docker layers, so they were lost between builds unless explicitly managed.

Monorepo builds are hard to split. Docker has no concept of changed packages within a monorepo. A change to one package would invalidate the layer for the entire workspace, defeating the cache.

Layer granularity does not scale. As the number of packages and build steps grew, the "one layer = one stage" model produced either too few cache boundaries (large layers, frequent invalidation) or too many stages (complex Dockerfiles that were hard to reason about).

The correct solution at scale would be a proper build system like Buck2 or Bazel. Both offer hermetic, content-addressed, per-target caching with remote cache support. A change to package A does not invalidate the cache for package B. The cache is portable across machines without shipping full Docker images.

The adoption cost of Buck2 or Bazel is significant. Migrating an existing project requires rewriting build definitions, learning a new query language, and training the team. For most organisations the Docker-based approach is the pragmatic starting point; migrate to a proper build system once the pain becomes acute enough to justify it.

What Modern Docker Unlocks

We built this solution in an era before BuildKit was mature. If you are starting today, the tooling has improved substantially.

BuildKit --mount=type=cache

BuildKit's cache mounts let you attach a persistent directory directly inside a RUN step. The directory survives across builds on the same daemon without any manual save/restore logic:

Dockerfile.buildkit
# syntax=docker/dockerfile:1
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json yarn.lock ./
RUN --mount=type=cache,target=/root/.yarn,sharing=locked \
    yarn install --frozen-lockfile
 
FROM python:3.12-slim AS py-deps
WORKDIR /app
COPY requirements.txt ./
RUN --mount=type=cache,target=/root/.cache/pip,sharing=shared \
    pip install -r requirements.txt

The sharing flag controls concurrent access:

shared: Multiple builds read and write simultaneously (safe for package managers with atomic installs)
locked: Only one build at a time; others wait (use when the tool is not concurrent-safe)
private: Each build gets its own copy; no sharing, but no contention

This alone eliminates the need for most of the shell scripting we wrote around docker save / docker load.
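In the same spirit, BuildKit's local cache exporter replaces the docker save / docker load dance from Approach 2 with a pair of flags; the cache directory can live on the same shared volume. A sketch, reusing the content hash from earlier:

build-with-buildx-cache.sh
```shell
# Same content hash as before
DEPS_HASH=$(cat package.json yarn.lock | sha256sum | cut -c1-12)

# BuildKit reads and writes /build-cache/buildx directly, no manual tar
# handling. mode=max also exports intermediate stages, not just the target.
docker buildx build \
  --cache-from type=local,src=/build-cache/buildx \
  --cache-to   type=local,dest=/build-cache/buildx,mode=max \
  --target deps \
  --tag "myapp/deps:${DEPS_HASH}" \
  .
```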

Docker Bake and Matrix Builds

docker buildx bake reads an HCL or JSON configuration file and builds multiple targets in parallel, with shared cache:

docker-bake.hcl
group "default" {
  targets = ["app-amd64", "app-arm64"]
}
 
variable "VERSION" {
  default = "dev"
}
 
target "app" {
  dockerfile = "Dockerfile"
  cache-from = ["type=local,src=/build-cache/bake"]
  cache-to   = ["type=local,dest=/build-cache/bake,mode=max"]
}
 
target "app-amd64" {
  inherits  = ["app"]
  platforms = ["linux/amd64"]
  tags      = ["registry.internal/myapp:${VERSION}-amd64"]
}
 
target "app-arm64" {
  inherits  = ["app"]
  platforms = ["linux/arm64"]
  tags      = ["registry.internal/myapp:${VERSION}-arm64"]
}
ci-build.sh
docker buildx bake --file docker-bake.hcl default

The mode=max cache export writes every intermediate layer to the cache, not just the final stage — meaning a subsequent build that targets only deps gets a full cache hit without having previously built that target in isolation.

The real power of bake shows up in a monorepo where you have several services, each with its own Dockerfile, and you want to build all of them in parallel — optionally for multiple platforms — without duplicating configuration. The matrix block does exactly that:

docker-bake.hcl (matrix)
# List every service here. Each entry maps to services/<name>/Dockerfile.
variable "APPS" {
  default = ["api", "worker", "scheduler"]
}
 
variable "VERSION" {
  default = "dev"
}
 
# Shared defaults every service inherits
target "_common" {
  cache-from = ["type=local,src=/build-cache/bake"]
  cache-to   = ["type=local,dest=/build-cache/bake,mode=max"]
  platforms  = ["linux/amd64", "linux/arm64"]
}
 
# One target per app × platform, generated by the matrix.
# bake expands this into targets named "service-api", "service-worker", etc.
target "service" {
  matrix = {
    app = var.APPS
  }
  name       = "service-${app}"
  inherits   = ["_common"]
  context    = "services/${app}"
  dockerfile = "services/${app}/Dockerfile"
  tags       = ["registry.internal/${app}:${VERSION}"]
}
 
group "default" {
  # Referring to the matrix target by name runs every generated combination
  targets = ["service"]
}
ci-build.sh
# Build all services for all platforms in parallel
docker buildx bake --file docker-bake.hcl default
 
# Build a single service during local development
docker buildx bake --file docker-bake.hcl service-api

With this layout, adding a new service is a one-line change to APPS; the matrix takes care of the rest. Every service shares the same cache volume, so a layer that is common to multiple Dockerfiles (e.g., a shared base image with OS packages) is built once and reused across all targets.

Docker Buildx Kubernetes Driver

For teams already on Kubernetes, buildx has a native Kubernetes driver. Instead of routing builds through a Docker daemon on a VM, builds run as ephemeral Kubernetes pods:

setup-k8s-builder.sh
# Create a builder backed by Kubernetes pods
docker buildx create \
  --name k8s-builder \
  --driver kubernetes \
  --driver-opt replicas=3 \
  --driver-opt namespace=ci \
  --use
 
# Builds now schedule as Pods in the ci namespace
docker buildx build \
  --builder k8s-builder \
  --cache-from type=registry,ref=registry.internal/myapp:cache \
  --cache-to   type=registry,ref=registry.internal/myapp:cache,mode=max \
  --tag registry.internal/myapp:latest \
  --push \
  .

Combined with a ReadWriteMany PersistentVolumeClaim mounted into each builder pod, you get the same shared filesystem cache as our vSphere volume approach — but portable across any Kubernetes cluster.

Closing Notes

Cache is not always faster — benchmark first

Transferring cached layers over the network can easily cost more than just re-running the build step from scratch, especially for fast package installs or small dependency sets. This is not theoretical: our registry-based approach made builds slower before we switched to shared volumes. Docker gives you no automatic guidance here — you have to decide manually which layers are worth caching by measuring the transfer cost against the build cost. If a layer takes 2 seconds to reinstall but 8 seconds to pull from a registry, the cache is net negative.
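A crude but honest way to decide is to time both paths on a representative agent. A sketch, with the image names from the earlier scripts:

benchmark-cache.sh
```shell
DEPS_HASH=$(cat package.json yarn.lock | sha256sum | cut -c1-12)

# Cost of rebuilding the layer from scratch...
time docker build --no-cache --target deps .

# ...versus the cost of fetching the cached layers
time docker pull "registry.internal/myapp/cache:deps-${DEPS_HASH}"
```

If the second number is consistently larger, caching that layer is a net loss and the layer should be rebuilt instead.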

Shared caches need concurrency control

When multiple agents write to the same cache directory simultaneously, you can end up with partial writes, corrupted tarballs, or an agent reading a layer that has not been fully flushed. Problems here tend to manifest as flaky, non-deterministic build failures that are hard to reproduce locally. If agents race to write the same cache entry, you need either file-level locking (e.g. flock) around the save step, or a write-once convention where only the first agent to compute a cache entry writes it and others skip. BuildKit's sharing=locked mount mode handles this for in-daemon caches, but for external volume caches the burden falls on you.
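For the volume cache from Approach 2, the write-once convention can be sketched like this. write_cache_entry is a hypothetical helper; docker save stands in for any expensive producer, and the .tmp-plus-rename pair means readers never observe a partially written tarball:

write-cache-entry.sh
```shell
# Write-once cache population: flock serialises writers on a per-entry lock
# file, and the atomic rename hides in-progress writes from readers.
write_cache_entry() {
  entry="$1"; shift                   # target file, then the producer command
  (
    flock -x 9                        # exclusive lock scoped to this entry
    if [ ! -f "${entry}" ]; then
      "$@" > "${entry}.tmp"           # expensive step, e.g. docker save
      mv "${entry}.tmp" "${entry}"    # atomic rename on the same filesystem
    fi
  ) 9> "${entry}.lock"
}

# Usage on the shared volume:
# write_cache_entry "${CACHE_TAR}" docker save "myapp/deps:${DEPS_HASH}"
```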

Shared caches are a supply-chain attack surface

If your build system runs code from external contributors — open-source pull requests, for example — a malicious build step can write arbitrary content to a shared cache volume and poison subsequent builds on the same host. A compromised cache entry can inject malware into every downstream artifact without touching the source repository. If any of your agents execute untrusted code, treat the cache volume as a potential attack vector: scope it to trusted branches only, use separate cache namespaces per trust level, or disable shared caches entirely for externally-contributed builds.
