Tracing in Production: When 1–150ms Turns Into 700ms

We enabled tracing on a production ML system expecting a small latency overhead. Instead, inference latency increased by ~700ms — far beyond the documented range — revealing how benchmarking assumptions can break under real workloads.

We run machine learning models in production serving real-time traffic. When our managed serving platform introduced built-in MLflow tracing, we wanted to enable it. Tracing promised visibility into inference behaviour. The documentation stated the expected latency overhead was 1–150ms.

We enabled it on a production endpoint. Inference latency went from 0.08 seconds to 0.77 seconds: an additional ~700ms, roughly 5 times the documented upper bound and 700 times the lower bound. We were not happy about it, and at first assumed we had missed something or measured incorrectly.

What we found

We raised this with the vendor. Their engineering team profiled the endpoint and confirmed the overhead was real. Two bottlenecks were identified: network I/O and data transformation during trace processing.

After some time the vendor produced a patch targeting the data processing path. In their testing, this brought latency down from 0.97s to 0.57s. An improvement, but still nearly 600ms of overhead, roughly 4 times the documented upper bound of 150ms. We could not accept that and enable it in production: even a slight increase in latency erodes the user experience. So we dug deeper and asked two questions.

Why was observed latency so much higher than documented?

The vendor's response: their latency measurements had not been performed with the same input types we use in production. Our models accept Pandas DataFrames. The benchmarks behind the documented range were not done with DataFrame inputs. The vendor noted that the overhead "can differ case-by-case."

Why did tracing appear to add so much latency when the documentation described it as asynchronous?

The vendor clarified: sending the trace data is done asynchronously, but the traces still have to be processed before they can be sent, and that processing is where the latency comes from. The documentation described the sending step accurately but didn't account for the cost of the processing step that precedes it.
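As an illustration (plain Python, not the vendor's implementation), here is a minimal sketch of that pattern: the hand-off to a background sender is non-blocking, but the payload is still serialised synchronously on the request path, so every request pays that cost.

```python
# Illustrative only: "asynchronous" tracing where serialisation still
# happens on the critical path before the non-blocking hand-off.
import json
import queue
import threading
import time

send_queue: queue.Queue = queue.Queue()

def sender() -> None:
    # Background thread: the only genuinely asynchronous step.
    while True:
        payload = send_queue.get()
        if payload is None:
            break
        # Network I/O would happen here, off the request path.

threading.Thread(target=sender, daemon=True).start()

def traced_predict(rows: list) -> list:
    t0 = time.perf_counter()
    outputs = [r["x"] * 2.0 for r in rows]              # the actual inference
    payload = json.dumps({"in": rows, "out": outputs})  # synchronous trace processing
    send_queue.put(payload)                             # non-blocking hand-off
    elapsed = time.perf_counter() - t0
    # `elapsed` includes the serialisation time, not just inference.
    return outputs
```

The request-visible latency includes everything before `send_queue.put`; only the network send is off the critical path.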

The hidden dependency

There was a secondary problem. On this platform, tracing appeared to be tied to an associated logging infrastructure that records full request and response payloads. That infrastructure is expensive, with a cost model that scales linearly with request volume and payload size and no built-in sampling controls. For high-throughput endpoints like ours, this cost can be significant.
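To make "scales linearly" concrete, a back-of-envelope sketch (the request rate and payload size below are made-up placeholders, not our production figures or the platform's pricing):

```python
# Back-of-envelope: logged volume grows linearly with request rate and
# payload size. All figures here are illustrative placeholders.
def monthly_logged_gb(requests_per_sec: float, payload_kb: float) -> float:
    seconds_per_month = 30 * 24 * 3600          # ~2.59M seconds
    kb_logged = requests_per_sec * seconds_per_month * payload_kb
    return kb_logged / 1e6                      # KB -> GB (decimal)

# e.g. a 100 req/s endpoint logging 20 KB of request+response per call:
volume = monthly_logged_gb(100, 20)  # 5184.0 GB per month
```

With no sampling controls, the only levers are request volume and payload size, neither of which a production endpoint can usually reduce.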

The outcome

The vendor acknowledged the gap and added latency benchmarking to their product roadmap; reading between the lines, we have to wait until it is implemented. As a workaround, they suggested replacing the default tracing decorator with a lower-level span API that tracks execution events and attributes only, skipping the serialisation of intermediate DataFrames. This reduces overhead but also reduces trace fidelity: you lose the intermediate data that made tracing valuable in the first place. We did not proceed with this suggestion.
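A stand-in sketch of that trade-off (plain Python, not MLflow's actual decorator or span API): the full-fidelity path serialises the intermediate frames into the trace, while the attribute-only path records just execution metadata such as row counts and timings.

```python
# Sketch of the suggested trade-off; not MLflow's actual API.
import time
import pandas as pd

def predict(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder model: score = 2 * x.
    return df.assign(score=df["x"] * 2.0)

def trace_full(df: pd.DataFrame):
    out = predict(df)
    # Full fidelity: whole intermediate frames are serialised into the trace.
    trace = {"inputs": df.to_dict("records"), "outputs": out.to_dict("records")}
    return out, trace

def trace_attrs_only(df: pd.DataFrame):
    t0 = time.perf_counter()
    out = predict(df)
    # Lower overhead, lower fidelity: execution metadata only.
    trace = {"input_rows": len(df), "output_rows": len(out),
             "latency_s": time.perf_counter() - t0}
    return out, trace
```

The attribute-only path avoids the per-request serialisation cost, but the trace no longer contains the intermediate data — exactly the fidelity loss described above.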

What we took away from this

  1. Be conscious that documentation may not be reliable (much as we would like it to be). When possible, benchmark vendor features with your actual workload before enabling them in production. Documentation benchmarks are measured with specific input types and conditions; your production environment may differ in ways that change performance by orders of magnitude. In our case the gap was 700x the documented lower bound.
  2. Read "asynchronous" carefully. When documentation says a feature is async, ask: what exactly is async? The final network send may be non-blocking, but expensive processing may still happen on the critical path before that send.
  3. Trace the dependency chain. Enabling one feature may require enabling others. Each dependency has its own cost and reliability characteristics that documentation may not surface upfront.
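The first takeaway can be put into practice with a small harness like this: measure median latency with and without the feature, using your real input type. `predict` and `traced_predict` below are placeholders for your endpoint, with the trace-serialisation cost simulated by a DataFrame-to-JSON step.

```python
# Minimal before/after benchmark sketch; replace the placeholder
# functions with calls to your real endpoint.
import statistics
import time
import pandas as pd

def predict(df: pd.DataFrame) -> pd.Series:
    return df["x"] * 2.0

def traced_predict(df: pd.DataFrame) -> pd.Series:
    out = predict(df)
    df.to_json(orient="records")  # stands in for trace serialisation cost
    return out

def p50_latency(fn, df: pd.DataFrame, runs: int = 30) -> float:
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(df)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

df = pd.DataFrame({"x": range(1000)})       # use your real production inputs
baseline = p50_latency(predict, df)
traced = p50_latency(traced_predict, df)
overhead_ms = (traced - baseline) * 1000
```

Run it against a representative payload and compare `overhead_ms` with the documented figure before flipping the switch in production.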