We Enabled Tracing on Our ML Serving Platform. Documentation Said 1–150ms. We Measured 700ms.
We run machine learning models in production, serving real-time traffic. When our managed serving platform introduced built-in tracing, we wanted to enable it. Tracing promised visibility into inference behaviour — input shapes, intermediate steps, output payloads. The documentation stated the expected latency overhead was 1–150ms.
We enabled it on a production endpoint. Inference latency went from 0.08 seconds to 0.77 seconds. That's an additional ~700ms of latency — roughly 5 times the documented upper bound.
What we found
We raised this with the vendor. Their engineering team profiled the endpoint, confirmed the overhead was real, and identified two bottlenecks: network I/O and data transformation during trace processing.
The vendor produced a patch targeting the data processing path. In their testing, this brought latency down from 0.97s to 0.57s. An improvement — but still nearly 600ms of overhead, roughly 4 times the documented upper bound of 150ms.
We asked two questions.
Why was observed latency so much higher than documented?
The vendor's response: their latency measurements had not been performed with the same input types we use in production. Our models accept Pandas DataFrames. The benchmarks behind the documented range were not done with DataFrame inputs. The vendor noted that the overhead "can differ case-by-case."
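Input type turns out to matter a great deal here. As a rough illustration (our own toy micro-benchmark, not the vendor's code; it approximates the trace-processing step with plain JSON serialisation, and the DataFrame shape is made up), serialising a production-sized DataFrame costs far more than serialising the small dict a generic benchmark might use:

```python
import json
import time

import pandas as pd

def time_it(fn, label, runs=20):
    # Approximate the trace-processing step with plain JSON serialisation.
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    print(f"{label}: {(time.perf_counter() - start) / runs * 1000:.1f} ms per call")

# The kind of tiny payload a generic benchmark might use.
small_payload = {"features": [0.1, 0.2, 0.3]}

# A DataFrame closer to our production inputs (the shape here is illustrative).
df = pd.DataFrame({f"f{i}": range(10_000) for i in range(50)})

time_it(lambda: json.dumps(small_payload), "small dict")
time_it(lambda: df.to_json(orient="records"), "10,000 x 50 DataFrame")
```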
Why did tracing appear to add so much latency when the documentation described it as asynchronous?
The vendor clarified that sending the trace data is indeed asynchronous. But traces still have to be processed before they can be sent, and that processing is where the latency comes from. The documentation described the sending step accurately but did not account for the cost of the processing step that precedes it.
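To make the distinction concrete, here is a caricature of that pattern. The names and structure are ours, not the vendor's implementation: only the hand-off to a background sender is non-blocking; everything above it runs synchronously inside the request.

```python
import json
import queue
import threading

# Background worker drains the queue and ships traces off the box.
# This send really is asynchronous: it never blocks the request thread.
_trace_queue = queue.Queue()

def _sender_loop():
    while True:
        record = _trace_queue.get()
        # network send of `record` would happen here (omitted)
        _trace_queue.task_done()

threading.Thread(target=_sender_loop, daemon=True).start()

def traced(predict_fn):
    """Hypothetical tracing decorator, sketched to show where the cost lands."""
    def wrapper(model_input):
        result = predict_fn(model_input)
        # Trace *processing*: building and serialising the trace record.
        # This runs on the request thread, before the asynchronous hand-off,
        # so it sits squarely on the inference critical path.
        record = json.dumps({
            "inputs": model_input.to_dict(orient="records"),  # costly for large DataFrames
            "outputs": result,  # assumed JSON-serialisable here
        })
        _trace_queue.put(record)  # only this hand-off is asynchronous
        return result
    return wrapper
```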
The hidden dependency
There was a secondary problem. On this platform, tracing appeared to be tied to an associated logging infrastructure that records full request and response payloads. That infrastructure has its own cost model — one that scales linearly with request volume and payload size, with no built-in sampling controls. For high-throughput endpoints, this cost can be significant.
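A back-of-envelope example of how that scales, using made-up numbers rather than our actual traffic figures:

```python
# Back-of-envelope only; the rate and payload size are illustrative assumptions.
requests_per_second = 100
payload_kb = 50                      # full request + response payload per call
kb_per_day = requests_per_second * payload_kb * 86_400
print(f"~{kb_per_day / 1e6:,.0f} GB of payload logs per day")  # ~432 GB/day
```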
The outcome
The vendor acknowledged the gap and added latency benchmarking to their product roadmap. As a workaround, they suggested replacing the default tracing decorator with a lower-level span API that tracks execution events and attributes only, skipping the serialisation of intermediate DataFrames. This reduces overhead but also reduces trace fidelity — you lose the intermediate data that made tracing valuable in the first place.
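Our understanding of the suggested shape, sketched here with OpenTelemetry-style spans as a stand-in for the vendor's span API: record execution events and scalar attributes such as row counts, and never serialise the payloads themselves.

```python
import pandas as pd
from opentelemetry import trace

tracer = trace.get_tracer("inference")

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder feature-engineering step.
    return df.fillna(0)

def predict(model, df: pd.DataFrame):
    # Record execution events and sizes, never the payloads themselves.
    with tracer.start_as_current_span("predict") as span:
        span.set_attribute("input.rows", len(df))
        span.set_attribute("input.columns", len(df.columns))

        with tracer.start_as_current_span("preprocess"):
            features = preprocess(df)        # intermediate DataFrame is not serialised

        with tracer.start_as_current_span("model.forward"):
            output = model.predict(features)

        span.set_attribute("output.rows", len(output))
        return output
```

The spans still tell you where the time went; they just no longer tell you what the data looked like at each step.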
What we took away from this
- Benchmark vendor features with your actual workload before enabling them in production (see the sketch after this list). Documentation benchmarks are measured with specific input types and conditions. Your production environment may differ in ways that change performance by orders of magnitude.
- Read "asynchronous" carefully. When documentation says a feature is async, ask: what exactly is async? The final network send may be non-blocking, but expensive processing may still happen on the critical path before that send.
- Trace the dependency chain. Enabling one feature may require enabling others. Each dependency has its own cost and reliability characteristics that documentation may not surface upfront.
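On the first point, the check is cheap. A minimal sketch of the before-and-after measurement we wish we had run first (the endpoint URL, payload file, and request schema below are placeholders; substitute a request captured from real traffic):

```python
import time

import pandas as pd
import requests

ENDPOINT = "https://example.invalid/model/invocations"   # placeholder URL
# Use a payload captured from real traffic, not a toy example.
records = pd.read_parquet("sample_production_batch.parquet").to_dict(orient="records")

latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    requests.post(ENDPOINT, json={"inputs": records}, timeout=30)  # adjust the body to your platform's schema
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p50 = latencies_ms[len(latencies_ms) // 2]
p95 = latencies_ms[int(len(latencies_ms) * 0.95)]
print(f"p50={p50:.0f} ms  p95={p95:.0f} ms")
```

Run it once with the feature disabled and once with it enabled, against the same payload, and compare the distributions rather than a single call.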