Incident & Breach Response
,
Security Operations
Separate Breach Details Can Bleed Into Each Other, Incident Responders Find

Cybersecurity investigators who use artificial intelligence tools to draft incident response reports, beware: Information tied to one security incident can contaminate a report tied to a separate incident, if both get drafted using the same AI tool in the same session, warn researchers.
See Also: Know Thy Enemy: Threats to Cyber Resilience
The risk holds even if notes for a first incident get deleted before drafting a second, unrelated incident report, said Cisco’s threat intelligence group Talos. The firm observed these failures when running controlled experiments to test the viability of using large language models to generate incident reports.
Just one of the limitations they found: Editing multiple sample reports in a single session caused content from one report’s source material to pollute another, even when deleting notes from an earlier incident before starting on a new one. The only reliable fix was to start an entirely new LLM session before creating a report into a separate incident.
For incident responders, delivering inaccurate reports – AI-generated or otherwise – or accidentally divulging sensitive information carries obvious professional, regulatory and legal risks. “For a firm handling multi-tenant incident response, this type of data exposure could violate data privacy laws and void insurance policies,” John Gallagher, vice president of security automation firm Viakoo Labs, told ISMG.
Multiple judges have displayed zero tolerance for any AI-introduced errors contained in court documents submitted by legal professionals. Last year, a U.S. federal judge fined lawyers from America’s largest personal injury law firm, Morgan & Morgan, for including AI-fabricated case citations in a court filing. The ruling reinforced that attorneys remain responsible for ensuring the veracity of any information they submit to a court, no matter if AI is involved.
This isn’t the first time researchers have flagged how information tied to two separate events can become commingled inside the same large language model session.
The challenge is that LLMs track conversation history in a fixed context window. This window is tied to a finite amount of memory. Once full, the LLM will discard earlier information, including initial instructions. As a result, running multiple tasks in one session can introduce conflicting data, leading to unpredictable or blended outputs as the session continues.
Researchers studying the use of AI in cybersecurity contexts have previously documented repeat cases of hallucinated outputs, including fabricated alerts, thus wasting analysts’ time and resources. Due to the unreliability of LLMs, many researchers caution that at least for now, human oversight of such processes remains mandatory.
The tests conducted by Talos cataloged inconsistent output by the ChatGPT, Claude and Gemini LLMs they tested, with models often delivering different recommendations in response to the same inputs.
“Even with identical data, LLMs may produce different conclusions. For example, in a data breach scenario, a model might suggest a full organizationwide password reset in one instance and a targeted reset in another,” the researchers found. For any given session, models appeared to keep defaulting to whatever they first recommended, even if it wasn’t the optimal suggestion based on subsequently analyzed data.
To try and avoid these types of failure, the Talos researchers said they found four specific prompt engineering techniques – aka “inconsistency control methods” – that when combined, delivered the best results: Breaking down tasks into narrow, single-purpose instructions reduced cross-contamination between report sections. Specifying exactly which documents the model should draw on prevented it from pulling from unpredictable or conflicting sources. Setting explicit parameters for length, tone and structure enforced formatting consistency. Embedding a rigid template directly into instructions helped deliver predictable and consistent output consistency across runs.
They also developed a “recommendation polisher” prompt which “resulted in more robust lists of recommendations,” including steps that human participants didn’t always identify. Applying the above delivered sufficient writing quality levels while reducing the total report-writing time by 50%, which “included the time spent manually writing the 10% of content that could not be efficiently AI-generated and manually editing the AI-generated content,” researchers said.
In a blind test, humans responded favorably to the LLM-generated report. “The peer reviewer, professional editor and management reviewer all made complimentary comments about the report while unaware that it was AI-generated. The peer reviewer commented that the incidence of typos and grammatical errors was far lower than in the average report,” the researchers said.
Choice of LLMs also mattered for output quality. Researchers said that by the end of 2025, the most consistent model for delivering prose quality, proactively flagged internal conflicts in source notes and minimizing the need for manual corrections has been Claude Sonnet 4.5.
But not every desired component for generating high-quality incident reports using LLMs delivered. Notably, a grammar-checking prompt delivered less than 50% accuracy, oftentimes missing errors – oftentimes inconsistently, across multiple runs with identical inputs into the same LLM – and the researchers concluded that for now, this prompt remains “unsuitable for production use.”
The extent to which LLM output can be trusted, potentially free from human oversight, remains a pressing question. The latest Cisco AI Readiness Index, which benchmarks how organizations are using the technology, found that 83% of organizations plan to deploy AI agents, yet only 32% globally have a formal process to track and measure the validity of their AI outputs.
Viakoo Labs’ Gallager said that for incident response, LLMs can help reduce remediation time, including by suggesting which systems need to be patched or credentials rotated. But it can also consume time, because security professionals must vet every LLM-generated recommendation. Also, such tools cannot yet reliably deliver a high-level view of the entire incident.
“The notion that AI can reliably synthesize an incident and prescribe high-level strategic next steps, like scoping the blast radius or negotiating change management, is currently overhyped. Strategic judgement must be from humans, not AI,” Gallager said.
