The Limits of a One-Model-Per-Pod ML Platform

A one-model-per-pod architecture is simple, reliable and cost-effective until it isn’t. This article explores where that approach starts to break down, why those limits aren’t obvious at first and the engineering trade-offs that emerge as ML platforms scale.

Every engineering platform is built around trade-offs.

For several years the model serving platform followed a simple idea: one model, one Kubernetes pod. It was easy to understand, inexpensive to operate, highly customizable and built entirely on standard Kubernetes primitives. For an ML platform that was growing steadily, those were significant advantages.

For quite some time, this architecture worked well.

What Worked Well

The platform solved several important problems:

  • Kubernetes-native operations with familiar tooling.
  • Low infrastructure cost for CPU-based models.
  • Flexible configuration without vendor lock-in.

It also provided an important governance layer. Deployments were intentionally controlled to ensure models met operational standards before reaching production, while still allowing teams to customize configurations when needed.

Where the Architecture Hit Its Ceiling

As the number and diversity of production models grew, the limitations became more apparent.

CPU-bound inference

Many ML libraries start significantly more worker threads than a Kubernetes pod has CPU allocated to run them. Under load, the pod hits its CPU limit and those threads are throttled, increasing latency and making performance less predictable.
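One common mitigation for this kind of thread oversubscription is to cap library thread pools at the pod's CPU limit. The following is a minimal sketch, not the platform's actual code. It assumes a cgroup v2 container where the limit is readable at /sys/fs/cgroup/cpu.max, and it assumes the model's numerical libraries honor the OMP_NUM_THREADS, OPENBLAS_NUM_THREADS and MKL_NUM_THREADS environment variables. The function names are hypothetical.

```python
import math
import os
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max content: "<quota> <period>" or "max <period>".

    Returns the CPU limit in cores, or None when no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def recommended_threads(cpu_limit: Optional[float], host_cpus: int) -> int:
    """Pick a thread count that does not exceed the CPU actually available."""
    if cpu_limit is None:
        return host_cpus
    # Round down so the pool never asks for more cores than the limit,
    # but always keep at least one thread.
    return max(1, min(host_cpus, math.floor(cpu_limit)))


def configure_thread_env(path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Set thread-pool environment variables from the container CPU limit.

    Assumption: the path above is where a cgroup v2 container exposes its
    limit. If the file is missing, fall back to the visible CPU count.
    """
    cpu_max = Path(path)
    limit = parse_cpu_max(cpu_max.read_text()) if cpu_max.exists() else None
    threads = recommended_threads(limit, os.cpu_count() or 1)
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(threads)
    return threads
```

Libraries typically read these variables when they initialize, so a call like configure_thread_env() would need to run before the model's numerical libraries are imported.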

Python concurrency limits

Python models introduced another bottleneck through the Global Interpreter Lock (GIL), which allows only one thread to execute Python bytecode at a time. While Kubernetes could add more pods, it couldn’t improve concurrency inside an individual model server. Autoscaling helped, but it reacted only after a traffic spike had already begun.
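To see why pod count becomes the only scaling lever, consider a rough capacity estimate. The sketch below makes a simplifying assumption that is not taken from the platform itself: each pod effectively serves one CPU-bound request at a time. The function name, the traffic numbers and the 70% utilization target are all hypothetical. The average number of busy servers follows Little's law (arrival rate multiplied by time per request).

```python
import math


def replicas_needed(requests_per_second: float,
                    seconds_per_request: float,
                    target_utilization: float = 0.7) -> int:
    """Estimate pods needed when each pod serves one request at a time.

    Assumption: inference is CPU-bound Python code, so a single model
    server gains no extra throughput from adding threads.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Little's law: average number of requests in service at any moment.
    busy_servers = requests_per_second * seconds_per_request
    # Leave headroom so pods are not running at 100% before scaling kicks in.
    return max(1, math.ceil(busy_servers / target_utilization))


if __name__ == "__main__":
    # Hypothetical traffic: 200 requests/s at 50 ms of CPU time each.
    print(replicas_needed(200, 0.05))
    # The same model during a spike to twice the traffic.
    print(replicas_needed(400, 0.05))
```

Under these assumptions, the required pod count grows linearly with traffic, so a spike has to be absorbed by the headroom until new pods are ready.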

No GPU execution

The platform was designed around CPU workloads. As deep learning models became more common, the lack of GPU infrastructure limited both latency and model complexity. Some models that could have benefited from GPU acceleration were simply not a good fit for the existing architecture.

No self-service capabilities and high operational load

The platform team became a bottleneck for every model deployment. Engineers manually managed deployments, scaling configs, rollbacks and cell configurations.

The Real Trade-off

None of these issues meant the architecture was flawed. In fact, its simplicity was exactly why it was successful in the beginning.

The same design decisions that made the platform easy to operate also made it harder to support increasingly diverse workloads. More model types, higher traffic, and stricter latency requirements exposed limitations that weren’t obvious when the platform was originally built.

This wasn’t a failure of the custom Kubernetes platform. It was simply the point where an architecture optimized for simplicity no longer matched the requirements of a mature ML platform.

Takeaway

Simple architectures often scale further than expected, but not indefinitely.

A one-model-per-pod approach delivered reliable, relatively low-cost and highly customizable model serving for years. Eventually, however, supporting modern ML/AI workloads required capabilities the original design wasn’t intended to provide: better concurrency management, GPU support and infrastructure that could adapt automatically to different model characteristics.

Every platform eventually reaches its ceiling. The challenge is recognizing when the trade-offs that once made the platform successful start limiting what comes next.