Datadog’s Yrieix Garnier on Production AI, Trust, Cost and Failure Modes

From observability and cost control to agent-driven failure modes and the convergence of security and operations, product teams are being forced to rethink how artificial intelligence fits into existing enterprise architectures.
Yrieix Garnier, vice president of products at Datadog, shares a product leader’s perspective on the technical signals that separate sustainable AI deployments from those that quietly break under complexity, cost or operational blind spots.
At Datadog, Garnier leads product strategy across observability, infrastructure and cloud-native platforms. He has previously worked across product, engineering and platform roles focused on distributed systems, performance and reliability at scale.
Edited excerpts follow:
From a product leader’s lens, what technical signals tell you early on that an AI use case will scale sustainably in production rather than collapse under data, cost or operational complexity?
One of the earliest signals is whether the AI system can be observed end to end in the same way as any other production workload. Sustainable AI use cases expose clear telemetry across the full life cycle – from data ingestion and model inference to downstream business impact. If teams can’t reliably trace inputs, decisions and outputs, the system usually does not scale.
Cost behavior is another strong indicator. Successful deployments show predictable cost patterns per transaction, user or business outcome. When teams can attribute large language model usage, token consumption and infrastructure costs to a specific application or workflow, they gain confidence to scale. If costs spike unpredictably or can’t be tied back to value, adoption stalls.
Operationally, resilient AI systems degrade gracefully. They have clear fallback mechanisms, latency budgets and error handling strategies when models fail, drift or return uncertain results. Finally, sustainable use cases integrate tightly into existing workflows, so security teams can support them without creating entirely new operational silos. When AI becomes “just another production service,” that’s usually a sign it will scale.
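The graceful-degradation pattern described here can be sketched in a few lines. This is a minimal illustration, not Datadog's implementation: `call_model`, `fallback_answer` and the budget value are all hypothetical stand-ins for a real inference client and a product-specific fallback.

```python
import concurrent.futures

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM/inference client call.
    return f"model answer for: {prompt}"

def fallback_answer(prompt: str) -> str:
    # Deterministic, cheap degraded response used when the model
    # is slow, down or returns an error.
    return "We can't generate an answer right now; a human will follow up."

def answer_with_budget(prompt: str, model=call_model,
                       latency_budget_s: float = 2.0) -> str:
    """Enforce a latency budget and degrade gracefully on timeout or error."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(model, prompt)
        try:
            # result() raises TimeoutError if the budget is exceeded.
            return future.result(timeout=latency_budget_s)
        except Exception:
            return fallback_answer(prompt)
```

The key design choice is that the caller always gets a usable answer within the latency budget; the fallback path is exercised the same way whether the model times out, errors or is unreachable.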
Observability has traditionally spoken the language of latency, errors and throughput. How are product teams redefining observability so it meaningfully connects system signals to business metrics like revenue impact or customer trust?
Observability is expanding from system health to business health. Product teams increasingly correlate technical signals, such as latency spikes, failed transactions or degraded model responses, with real business outcomes like dropped conversions, abandoned checkouts or increased support calls.
This shift requires connecting telemetry across layers. For example, a slow API call is no longer just a performance issue; it is tied to a delayed payment, a failed loan approval or a poor customer interaction. By enriching traces and logs with business context, teams can quantify the revenue or customer experience impact of technical incidents in near real time.
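The enrichment pattern Garnier describes can be sketched with plain structured logging. This is a simplified, stdlib-only sketch; the field names (`order_value_usd`, `customer_tier`) are hypothetical, and a real deployment would typically attach such context as span attributes in a tracing library rather than raw log events.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")

def record_span(name: str, duration_ms: float, **business_context) -> dict:
    """Emit a trace-like event that carries business context alongside
    the usual technical fields, so a latency spike can be tied to revenue
    at risk rather than just a service name."""
    event = {"span": name, "duration_ms": duration_ms, **business_context}
    logger.info(json.dumps(event))
    return event

# A slow payment call, enriched with the business meaning of that slowness.
event = record_span(
    "payment.authorize",
    duration_ms=1840.0,          # technical signal: well over budget
    order_value_usd=129.99,      # business context: revenue at risk
    customer_tier="enterprise",
    checkout_step="final",
)
```

With events shaped like this, a query for "p95 latency on payment.authorize" and a query for "order value affected by slow authorizations" run against the same telemetry.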
Trust is also becoming a measurable dimension. In AI-driven systems, observability helps teams understand not just whether a system is fast or available, but whether it behaves consistently and responsibly. Tracking anomalies in model behavior, response quality or decision patterns over time allows organizations to detect erosion of trust before customers feel it. Observability becomes the connective tissue between engineering metrics and executive decision-making.
Industries like banking, aviation and manufacturing are modernizing fast, but they carry decades of legacy systems. What architectural patterns are you seeing that successfully bridge legacy environments with cloud-native observability?
The most successful organizations avoid “rip and replace” strategies. Instead, they adopt incremental architectures that wrap legacy systems with modern observability layers. Common patterns include using lightweight agents, log forwarders and API gateways to expose telemetry from mainframes, on-premises databases and proprietary systems into a unified observability platform.
Event-driven architectures also play a key role. By introducing message queues or streaming platforms, organizations decouple legacy systems from cloud-native services while still gaining visibility into end-to-end workflows. This allows teams to trace a transaction as it moves from a decades-old core system into modern microservices and AI-driven layers.
Critically, successful teams standardize how telemetry is collected and tagged across old and new systems alike. That consistency enables a single operational view, which is essential in regulated industries where availability, auditability and compliance are non-negotiable.
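The standardized-tagging idea can be illustrated with a small normalizer that maps source-specific field names onto one shared schema. The source names, tag keys and mappings below are invented for illustration; real schemas would follow whatever tagging convention the organization adopts.

```python
# Shared tag schema every event must satisfy, regardless of origin.
REQUIRED_TAGS = ("service", "env", "system_generation")

# Hypothetical per-source field mappings: a mainframe log forwarder
# and a cloud-native emitter name the same concepts differently.
FIELD_MAPS = {
    "legacy": {"svc": "service", "environment": "env"},
    "cloud": {"service.name": "service", "deployment.env": "env"},
}

def normalize(event: dict, source: str) -> dict:
    """Rewrite an event's keys into the shared tag schema and reject
    events that would break the single operational view."""
    mapping = FIELD_MAPS[source]
    out = {mapping.get(k, k): v for k, v in event.items()}
    out["system_generation"] = "legacy" if source == "legacy" else "modern"
    missing = [t for t in REQUIRED_TAGS if t not in out]
    if missing:
        raise ValueError(f"missing required tags: {missing}")
    return out
```

Rejecting under-tagged events at ingestion, rather than tolerating them, is what keeps the unified view trustworthy enough for audit and compliance queries.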
As AI systems become more agent-driven and autonomous, what new failure modes worry you the most from an observability standpoint?
Autonomous agents introduce failure modes that are less linear and harder to predict. One major concern is cascading failure, where a small error in one agent propagates rapidly across other agents, tools or services, amplifying impact before humans can intervene.
Another risk is silent failure. An agent may continue operating within technical thresholds (low latency, no obvious errors) while making subtly incorrect or harmful decisions. Without visibility into reasoning paths, tool usage and decision context, these failures can go undetected for long periods.
Cost-related failure modes are also emerging. Autonomous agents can generate runaway usage, calling APIs, tools or models far more often than intended. Without granular observability into behavior and cost per action, organizations may only notice the problem when budgets are exceeded. Observability must evolve to explain not just what failed, but why an agent behaved the way it did.
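The per-action cost attribution and budget guard described here can be sketched as follows. This is a hypothetical meter, not a real product API: the action names, token prices and budget are invented, and a production system would feed these figures into its observability platform rather than raise locally.

```python
class CostMeter:
    """Attribute every tool or model call to an action and enforce a
    hard spend budget, so a runaway agent loop is stopped before the
    bill is the first signal anyone sees."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self.by_action: dict[str, float] = {}  # per-action breakdown

    def charge(self, action: str, tokens: int,
               usd_per_1k_tokens: float) -> None:
        cost = tokens / 1000 * usd_per_1k_tokens
        self.spent_usd += cost
        self.by_action[action] = self.by_action.get(action, 0.0) + cost
        if self.spent_usd > self.budget_usd:
            raise RuntimeError(
                f"budget exceeded: ${self.spent_usd:.2f} > "
                f"${self.budget_usd:.2f}; breakdown: {self.by_action}"
            )

meter = CostMeter(budget_usd=1.00)
meter.charge("search_tool", tokens=2000, usd_per_1k_tokens=0.01)       # $0.02
meter.charge("model.generate", tokens=50_000, usd_per_1k_tokens=0.01)  # $0.50
```

Because the breakdown is kept per action, the error (or, in a real system, the alert) explains not just that the budget was blown but which behavior blew it, which is the "why an agent behaved the way it did" question raised above.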
Security is increasingly embedded across development, site reliability engineering and platform teams rather than owned by a single function. How can unified observability and security avoid becoming “everyone’s responsibility but no one’s priority”?
The key is convergence around shared signals and shared outcomes. When security data lives separately from performance and reliability data, it’s easy for teams to deprioritize it. Unified observability and security platforms bring these signals together, so a vulnerability, misconfiguration or suspicious behavior appears in the same workflows teams already use to manage incidents.
From a product standpoint, security insights must be actionable in context. Instead of abstract risk scores, teams need to see how a security issue affects a specific service, workload or AI workflow, and what the operational impact might be. This naturally aligns security with uptime, performance and business continuity.
Ownership also becomes clearer when accountability maps to services rather than functions. When teams can see security, reliability and cost signals tied directly to what they own in production, prioritization follows. Unified observability doesn’t dilute responsibility; it makes it explicit, measurable and operationally relevant.
