FPT’s Leonard Bertelli on the Shift From Reactive Monitoring to Predictive Insight

Once seen as a niche engineering concern, observability has now become a mission-critical capability for enterprises operating complex, distributed and artificial intelligence-driven systems. Yet challenges persist – from siloed telemetry and static dashboards to cultural inertia that keeps organizations stuck in a reactive “monitoring” mindset.
In this interview with Information Security Media Group, Leonard Bertelli, senior vice president of enterprise and AI solutions at FPT Americas, shares how observability is changing, the blind spots AI workloads introduce, and why both culture and technology must align to move enterprises forward.
Bertelli has more than two decades of experience in IT leadership, business development and enterprise architecture. He has a proven track record in driving legacy modernization, cloud adoption and scalable technology solutions for Fortune 500 companies.
Edited excerpts follow:
Historically, what have been the biggest roadblocks to achieving true observability in enterprise databases, and how have these shaped current architectures?
Early enterprise observability efforts hit three major roadblocks. The first is siloed signals, which occur when logs, metrics and traces live in separate systems. As a result, engineers could see symptoms but not causes. For example, before 2010, Google engineers struggled to debug production issues across distributed systems until they created Dapper, their internal tracing system, which directly inspired today’s OpenTelemetry.
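A minimal sketch of what unified instrumentation looks like in practice, assuming the OpenTelemetry Python SDK is installed; the service, span and attribute names are illustrative. Embedding the trace ID in log lines is what lets a backend join signals that would otherwise stay siloed.

```python
# Minimal sketch: emit a trace whose IDs can be attached to log lines,
# so traces and logs can be correlated instead of living in separate silos.
# Assumes the opentelemetry-sdk package; service/span names are illustrative.
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints spans to stdout (a real setup would export
# to a collector instead).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout-service")

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")
    ctx = span.get_span_context()
    # Embedding the trace ID in the log line lets a backend join the two signals.
    log.info("charging card trace_id=%032x span_id=%016x", ctx.trace_id, ctx.span_id)
```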
High cardinality can be another impediment. This occurs when a column in a database contains many unique values, which can erode observability. One example is time-series backends, such as early versions of Prometheus and Graphite, which were overwhelmed by the explosion of label dimensions. Companies like Honeycomb were explicitly founded to handle high-cardinality observability data.
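A small sketch of how that label explosion happens, using the prometheus_client Python library; the metric and label names are made up. Each unique label value creates a new time series, so an unbounded label such as a user ID multiplies the series count by the size of the user base.

```python
# Illustrative sketch of how unbounded label values multiply time series.
# Assumes the prometheus_client library; metric and label names are made up.
from prometheus_client import Counter

# Bounded cardinality: "status" can only take a handful of values,
# so the number of series stays small and queries stay fast.
requests_by_status = Counter(
    "app_requests", "Requests by HTTP status", ["status"]
)

# Unbounded cardinality: labeling by user_id creates one new time series
# per unique user, which is exactly the label explosion that overwhelmed
# early time-series backends.
requests_by_user = Counter(
    "app_requests_by_user", "Requests by user", ["user_id"]
)

for user_id in range(100_000):
    requests_by_user.labels(user_id=str(user_id)).inc()  # 100,000 series
requests_by_status.labels(status="200").inc()            # 1 series
```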
Lastly, we must consider static dashboards as a roadblock: fixed views fail to keep pace as systems change. Netflix documented this problem in its Chaos Monkey experiments, which revealed that dashboards couldn’t capture emergent failures across distributed systems.
Observability has often been reactive rather than proactive – do you see a cultural or technological gap that has kept enterprises from maturing beyond “monitoring”?
The gap between “monitoring” and true observability is both cultural and technological. Enterprises haven’t matured beyond monitoring because old tools weren’t built for modern systems, and organizational cultures have been slow to evolve toward proactive, shared ownership of reliability.
Many enterprises still treat observability as an afterthought, bolting on dashboards and alerts once systems are already in production. The mindset is reactive – engineers are trained to respond to outages rather than design systems that can explain their own behavior. In highly siloed organizations, ops teams own monitoring while developers push features, creating a disconnect between those who build and those who debug. This slows the move toward proactive observability, where insights are embedded into the development life cycle itself.
Today’s distributed systems – microservices, cloud-native stacks and especially AI workloads – generate high-cardinality, high-dimensional telemetry data. Legacy monitoring can’t surface “unknown unknowns,” and most tools focus on threshold-based alerts rather than contextual, root-cause insights. Without technology that correlates logs, traces and metrics in real time, enterprises are stuck in a reactive cycle.
Mature organizations are shifting, embedding observability into CI/CD and adopting platforms that emphasize correlation, causality and explainability, rather than relying on static monitoring. But lasting change requires cultural alignment.
What unique observability blind spots do AI systems introduce that traditional tools fail to address?
One blind spot is model drift, which occurs when production data shifts and renders a model’s assumptions invalid. In 2016, Microsoft’s Tay chatbot was a notable failure due to its exposure to shifting user data distributions. Infrastructure monitoring showed uptime was fine; only semantic observability of outputs would have flagged the model’s drift into toxic behavior.
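A minimal sketch of the kind of semantic check that can surface drift, assuming scipy and synthetic score distributions: compare a baseline sample of model outputs against a recent production window with a two-sample Kolmogorov-Smirnov test.

```python
# Minimal sketch of output-distribution drift detection: compare a baseline
# sample of model scores against a recent production window with a two-sample
# Kolmogorov-Smirnov test. Assumes scipy; the data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline_scores = rng.normal(loc=0.3, scale=0.1, size=5_000)   # scores at deploy time
recent_scores = rng.normal(loc=0.45, scale=0.15, size=5_000)   # scores this week

statistic, p_value = ks_2samp(baseline_scores, recent_scores)

# Infrastructure metrics would still show "healthy" here; only this semantic
# check on the outputs themselves surfaces the shift.
if p_value < 0.01:
    print(f"Possible model drift: KS statistic={statistic:.3f}, p={p_value:.3g}")
```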
Hidden technical debt or unseen complexity in code can undermine observability. In machine learning, or ML, systems, pipelines often fail silently, while retraining processes, feature pipelines and feedback loops create fragile dependencies that traditional monitoring tools may overlook.
Another issue is “opacity of predictions.” This occurs when a system – such as an ML model – produces results or decisions that users can’t easily understand. For example, a loan approval model may be “up” but still make biased decisions. Traditional monitoring wouldn’t catch it. Amazon’s scrapped recruiting algorithm is an example – infrastructure ran fine but the system was semantically broken due to bias in training data.
AI workloads generate exponentially more telemetry data. At what point does “observing” become a computational burden rather than an enabler?
The inflection point usually appears in three ways:
- Signal-to-noise collapse occurs when teams capture everything without a strategy, and observability pipelines become clogged with redundant or low-value data, making it harder to isolate meaningful anomalies (a simple mitigation, error-biased sampling, is sketched after this list);
- Infrastructure overhead spikes as storage and compute costs rise nonlinearly with telemetry volume, forcing teams to allocate more resources to maintaining observability tooling than to running the AI workloads themselves;
- Human and cognitive overload occurs when too much information is presented at once. An avalanche of dashboards and alerts can overwhelm teams, slowing response time rather than accelerating it.
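One common way to pull back from that inflection point, sketched here under illustrative assumptions about event structure and sampling rates, is to stop capturing everything: keep every error event but only a fraction of routine ones.

```python
# Illustrative sketch of error-biased sampling to keep the signal-to-noise
# ratio from collapsing: retain every error event, but only a fraction of
# routine events. The rate and event structure are assumptions.
import random

KEEP_OK_FRACTION = 0.05  # keep 5% of healthy events

def should_keep(event: dict) -> bool:
    """Decide whether a telemetry event is worth shipping and storing."""
    if event.get("status") == "error":
        return True                      # never drop failures
    return random.random() < KEEP_OK_FRACTION

events = [{"status": "ok"}] * 10_000 + [{"status": "error"}] * 20
kept = [e for e in events if should_keep(e)]
print(f"kept {len(kept)} of {len(events)} events")
```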
Ironically, AI is also being used to fix observability. How does AI-enhanced anomaly detection differ from traditional pattern recognition – especially in terms of predictive power?
Traditional pattern recognition spots problems that match past expectations, but AI-enhanced detection adapts to evolving systems, correlates across silos and predicts failures before they occur – turning observability into a forward-looking capability rather than a rear-view mirror.
Classic monitoring relies on thresholds, signatures or known deviations. For example, if CPU utilization spikes above 80% for 5 minutes, it triggers an alert. This works for “known knowns” but fails when systems behave unexpectedly or when multiple subtle signals interact. It is reactive – alerting after anomalies occur. Tools such as Nagios or Zabbix follow this approach.
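As a minimal sketch, the threshold rule described above looks like this; the sample values and window length are illustrative.

```python
# Sketch of the classic threshold rule: alert if CPU utilization stays above
# 80% for five consecutive one-minute samples. Values are illustrative.
cpu_samples = [62, 71, 85, 88, 91, 87, 90]  # last seven one-minute readings

THRESHOLD = 80
WINDOW = 5

def breaches_threshold(samples, threshold=THRESHOLD, window=WINDOW) -> bool:
    """True if the most recent `window` samples are all above the threshold."""
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)

if breaches_threshold(cpu_samples):
    print("ALERT: CPU above 80% for 5 minutes")  # catches known knowns only
```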
AI and ML models learn normal behavior dynamically across high-dimensional telemetry – logs, traces, metrics and even unstructured signals. Instead of fixed thresholds, baselines evolve with workload patterns, seasonal fluctuations or user behavior. AI correlates signals across layers, surfacing anomalies simple rules would miss.
AI anticipates failures by recognizing early indicators – latency drifts, unusual dependency calls and memory pressure – shifting observability from firefighting to prevention.
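For contrast, a toy version of an adaptive baseline, assuming a rolling window and a fixed deviation multiplier, flags points that drift well above recent behavior; production AIOps platforms use far richer models than this sketch.

```python
# Sketch of an adaptive baseline: instead of a fixed threshold, flag latency
# samples that sit several standard deviations above a rolling mean, so the
# baseline evolves with the workload. Window and k are illustrative choices.
import statistics
from collections import deque

def rolling_anomalies(samples, window=30, k=3.0):
    """Yield (index, value) for points more than k rolling std-devs above the rolling mean."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) == window:
            mean = statistics.fmean(history)
            std = statistics.pstdev(history)
            if std > 0 and value > mean + k * std:
                yield i, value   # early indicator such as a latency drift
        history.append(value)

latencies_ms = [20 + (i % 5) for i in range(60)] + [55]  # steady pattern, then a spike
print(list(rolling_anomalies(latencies_ms)))
```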
How do you address the paradox that bad observability data can mislead the very AI systems designed to improve observability?
To address this, teams must treat observability data as a first-class product rather than a byproduct. Data hygiene is crucial for AI because inaccurate data can lead to flawed analyses, incorrect conclusions and poor business decisions.
Signal prioritization, or the process of ranking which alerts, metrics or logs matter most, can actually mislead AI-based observability systems if it isn’t handled carefully. AI models often learn from human-curated priorities. If ops teams historically emphasized CPU or network metrics, the AI may overweight those signals while downplaying emerging, equally critical patterns – for example, memory leaks or service-to-service latency. This can surface as bias amplification, where the model becomes biased toward “legacy priorities” and blind to novel failure modes; the bias often simply mirrors historical reality.
Feedback loops for AI are essential. AI models for anomaly detection are retrained with human-in-the-loop feedback. Engineers label false positives and root-cause findings, allowing the system to learn what constitutes a genuine issue.
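A minimal sketch of such a feedback loop, assuming scikit-learn and entirely invented alert features and labels: engineers mark past alerts as genuine or false positives, and a small classifier is periodically retrained on those labels.

```python
# Sketch of a human-in-the-loop feedback cycle: engineers label past alerts as
# genuine ("1") or false positive ("0"), and a simple classifier is retrained
# on those labels to suppress alerts the model now recognizes as noise.
# Assumes scikit-learn; the features and labels are entirely made up.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Features per alert: [cpu_pct, p99_latency_ms, error_rate]
labeled_alerts = np.array([
    [95, 1200, 0.08],   # engineer marked: real incident
    [92,  150, 0.00],   # engineer marked: false positive (nightly batch job)
    [60,  900, 0.05],   # real incident
    [88,  140, 0.00],   # false positive
])
labels = np.array([1, 0, 1, 0])

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(labeled_alerts, labels)

new_alert = np.array([[90, 160, 0.0]])   # resembles the noisy batch-job pattern
print("page on-call?", bool(model.predict(new_alert)[0]))
```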
Validating multiple data sources is essential. Relying on a single data stream creates blind spots. Correlating across logs, traces, metrics and even external signals reduces the risk of being misled by corrupted or incomplete data.
You solve the paradox by treating observability data like any other mission-critical dataset – ensuring quality, reducing noise, validating across multiple sources and keeping humans in the loop.
If one team’s telemetry is overrepresented and another’s is sparse, the AI may systematically prioritize the wrong incidents. Governance is critical. Just as data governance has become central in analytics and AI, “observability governance” – defining which metrics matter, ensuring consistency and monitoring data drift – is now essential.
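One small example of such a governance check, with invented team names, event counts and a hypothetical minimum-share policy: measure each team’s share of ingested telemetry and flag imbalance before it skews what an AI triage model learns.

```python
# Sketch of a simple "observability governance" check: measure each team's
# share of ingested telemetry and flag imbalance before it skews what an
# AI triage model learns. Team names, counts and the policy are invented.
from collections import Counter

events_per_team = Counter(payments=1_800_000, search=950_000, identity=4_200)
total = sum(events_per_team.values())

MIN_SHARE = 0.02   # hypothetical policy: every team supplies at least 2% of telemetry

for team, count in events_per_team.items():
    share = count / total
    if share < MIN_SHARE:
        print(f"{team}: only {share:.2%} of telemetry - incidents here may be underweighted")
```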
