The Hidden Cost Curve of “One-Click” ML Observability
A “one-click” ML observability feature looked like an easy win until we modeled the cost at production scale. What seemed like basic logging turned into a high-volume data pipeline with non-trivial latency and a surprising price curve.
The ask was reasonable.
We had a high-volume ML serving endpoint, and both engineers and data scientists wanted visibility into what requests and responses looked like in production. The managed ML serving platform we use offers a built-in feature for exactly this: automatic request/response logging stored as a queryable table. One toggle. It looks like basic observability.
The feature was already enabled. At the time it wasn’t being billed yet. The provider had communicated that charges would start from a specific date, so we used that window to understand what this would actually cost in steady state. That’s where things got interesting.
Cost scales faster than intuition
The feature is priced based on data volume ingested. The publicly documented pricing is around $0.50 per GB, though actual pricing may vary depending on agreements.
On its own that number doesn’t look alarming. The problem is the math at scale.
One of our high-volume endpoints processes ~44 million requests per day: ranking models with reasonably sized payloads. Based on sampled production data, request/response sizes ranged roughly from 10KB to 160KB.
That translates to approximately 0.4 TB to 7 TB of data per day, for a single endpoint.
Monthly cost estimates based on our observed distributions:
- 10KB average payload → ~$6–7k/month
- 40KB → ~$25–30k/month
- 160KB → ~$100k+/month
These aren’t worst-case projections — they’re derived from real traffic patterns and sampled payload sizes.
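The arithmetic behind these figures can be sketched as a quick back-of-envelope model. It assumes the documented $0.50/GB rate and a 30-day month; the traffic and payload numbers mirror the figures above:

```python
# Back-of-envelope cost model for request/response logging.
# Assumptions: publicly documented $0.50/GB rate (actual pricing may
# vary under agreements), 30-day month, 44M requests/day per endpoint.

REQUESTS_PER_DAY = 44_000_000
PRICE_PER_GB = 0.50
DAYS_PER_MONTH = 30

def monthly_logging_cost(avg_payload_kb: float) -> float:
    """Estimated monthly logging cost in USD for one endpoint."""
    gb_per_day = REQUESTS_PER_DAY * avg_payload_kb / 1_000_000  # KB -> GB
    return gb_per_day * DAYS_PER_MONTH * PRICE_PER_GB

for kb in (10, 40, 160):
    print(f"{kb:>4}KB avg payload -> ${monthly_logging_cost(kb):,.0f}/month")
# -> $6,600 / $26,400 / $105,600 per month
```

Swapping in your own traffic and payload distribution is the whole exercise; the shape of the curve is what matters, not the exact dollar figures.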
The missing lever: sampling
The natural response is: just sample it.
Log 10% or even 1% of requests, cut costs by 90% or more, and keep enough signal for debugging. That's a standard trade-off in observability systems.
At the time of evaluation this feature did not expose configurable sampling. Logging operated at full fidelity, effectively 100% of requests. There were hints in the schema that sampling might be supported in the future, but it wasn’t available in practice.
That turns cost control into an all-or-nothing decision.
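Had sampling been exposed, the lever would be a single multiplier on the same cost model. A sketch, using the same assumed 44M requests/day, $0.50/GB, and 40KB-average payload from above:

```python
# Same back-of-envelope model as before, extended with a sample rate.
# Assumptions: 44M requests/day, $0.50/GB, 30-day month.

REQUESTS_PER_DAY = 44_000_000
PRICE_PER_GB = 0.50
DAYS_PER_MONTH = 30

def monthly_cost(avg_payload_kb: float, sample_rate: float = 1.0) -> float:
    """Monthly logging cost in USD; sample_rate=0.1 means log 10% of requests."""
    gb_per_day = REQUESTS_PER_DAY * sample_rate * avg_payload_kb / 1_000_000
    return gb_per_day * DAYS_PER_MONTH * PRICE_PER_GB

# 40KB payloads: full fidelity vs. 10% and 1% sampling
for rate in (1.0, 0.1, 0.01):
    print(f"sample_rate={rate:>4}: ${monthly_cost(40, rate):,.0f}/month")
```

At 10% sampling the 40KB case drops from roughly $26k/month to the low thousands, which is the difference between "needs a business case" and "just turn it on".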
Latency changes the use case
The feature is positioned as a tool for monitoring and debugging: enable it, inspect production traffic, investigate issues.
That positioning implies reasonably fresh data. In reality, data availability was best-effort. Documentation suggested delays on the order of an hour. In practice we observed:
- Typical delays: 3–6 hours
- Occasional delays: up to ~48 hours (confirmed by the provider as a known issue)
This doesn’t make the feature unusable but it does change what it’s suitable for. If an issue starts at 2pm and the data might only appear hours (or days) later, this is no longer a real-time debugging tool. It becomes a system for retrospective analysis.
So what is it, really?
There is a version of this feature that makes perfect sense.
At lower volumes or with smaller payloads, costs stay manageable. For offline use cases (replaying production traffic, validating model behavior, auditing), delayed availability is often acceptable.
But at higher scale the combination of:
- volume-based pricing
- lack of sampling controls
- non-trivial data latency
means you need to treat it as a high-throughput data ingestion pipeline for analysis, not as lightweight observability.
Whether it behaves like a data warehouse pipeline internally isn’t the point, but operationally, that’s the closest mental model for cost and latency.
The operational footgun
Enabling the feature is trivial — a single configuration change.
Data accumulates continuously and retention is effectively unbounded unless you manage it explicitly. Costs scale with traffic, not with intent.
That combination makes it easy to underestimate the impact until you model it explicitly or until billing catches up.
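To make the retention point concrete, here is a rough sketch of how much data piles up if nothing is ever cleaned up, assuming the 40KB-average, 44M-requests/day case from earlier:

```python
# Rough accumulation estimate with no retention policy.
# Assumptions: 44M requests/day, 40KB average payload, 30-day months.

GB_PER_DAY = 44_000_000 * 40 / 1_000_000  # -> 1,760 GB/day

for months in (1, 6, 12):
    tb_retained = GB_PER_DAY * 30 * months / 1_000
    print(f"after {months:>2} month(s): ~{tb_retained:,.0f} TB retained")
# -> ~53 TB, ~317 TB, ~634 TB
```

Even if ingestion is the only billed dimension, hundreds of terabytes of unmanaged tables are their own operational problem: query performance, governance, and eventual storage costs.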
Takeaway
The feature itself isn’t broken. It does what it’s designed to do. But the positioning as “turnkey observability” can be misleading at scale. Before enabling anything that logs production traffic:
- estimate volume based on real payload sizes
- check whether sampling is available
- understand data latency expectations
- verify retention and cleanup mechanisms
Small per-unit costs compound quickly when multiplied by production traffic.
Open question
Have you run into similar cost/latency trade-offs in managed ML or observability tooling? I’m curious whether this pattern of volume-based pricing with limited control over sampling and freshness shows up elsewhere or if it’s still specific to parts of the ML ecosystem.