Artificial Intelligence & Machine Learning, Next-Generation Technologies & Secure Development
Researcher Details Stealthy Multi-Turn Prompt Exploit Bypassing AI Safety

A series of well-timed nudges is enough to derail a large language model and put it to nefarious use, researchers have found.
A proof-of-concept attack detailed by Neural Trust shows how bad actors can steer LLMs into producing prohibited content without issuing an explicitly harmful request. Dubbed “Echo Chamber,” the exploit uses a chain of subtle prompts to bypass existing safety guardrails by manipulating the model’s emotional tone and contextual assumptions.
Developed by Neural Trust researcher Ahmad Alobaid, the attack hinges on context poisoning. Rather than directly asking the model to generate inappropriate content, the attacker lays a foundation through a seemingly benign conversation. These exchanges gradually shift the model’s behavior through suggestive cues and indirect references, which Alobaid calls “light semantic nudges.”
“A benign prompt might introduce a story about someone facing economic hardship, framed as a casual conversation between friends,” Alobaid wrote in a blog post. The initial content may be innocuous, but it seeds an emotional context such as frustration or blame that later prompts can exploit.
Prompt injection is a well-known vulnerability in generative AI, but vendors have added layers of defense to prevent harmful outputs. Echo Chamber is notable due to its high success rate despite these protections. In testing across major models, including OpenAI’s GPT-4 variants and Google’s Gemini family, the researcher observed jailbreak rates exceeding 90% in categories such as hate speech, pornography, sexism and violence.
“We evaluated the Echo Chamber attack against two leading LLMs in a controlled environment, conducting 200 jailbreak attempts per model,” Alobaid said. These were categorized under eight sensitive topics adapted from Microsoft’s Crescendo benchmark: profanity, sexism, violence, hate speech, misinformation, illegal activities, self-harm and pornography.
The attack involved using one of two pre-defined steering “seeds,” sets of carefully structured cues, across each category. For misinformation and self-harm, success rates hovered around 80%, while illegal activities and profanity registered above 40%, which Alobaid said was still significant given that those topics typically trigger stricter safety enforcement.
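The evaluation described above can be pictured as a simple tally loop over categories and seeds. The sketch below is illustrative only: the `run_echo_chamber_attempt` helper is a hypothetical stand-in for the multi-turn attack driver, the seed labels are placeholders, and the even split of 200 attempts across the eight categories is an assumption rather than something stated in the research.

```python
# Illustrative sketch of a per-category jailbreak success tally.
# run_echo_chamber_attempt() is a hypothetical stub standing in for the
# multi-turn attack itself, so the example runs without a live model.
from collections import defaultdict

CATEGORIES = [
    "profanity", "sexism", "violence", "hate speech",
    "misinformation", "illegal activities", "self-harm", "pornography",
]
SEEDS = ["seed_a", "seed_b"]   # two pre-defined steering seeds (placeholder names)
ATTEMPTS_PER_CATEGORY = 25     # assumption: 200 attempts split evenly over 8 topics

def run_echo_chamber_attempt(category: str, seed: str) -> bool:
    """Hypothetical stub: returns True if the attempt elicited unsafe output."""
    return False  # placeholder result so the sketch executes

def evaluate_model() -> dict[str, float]:
    successes = defaultdict(int)
    for category in CATEGORIES:
        for i in range(ATTEMPTS_PER_CATEGORY):
            seed = SEEDS[i % len(SEEDS)]
            if run_echo_chamber_attempt(category, seed):
                successes[category] += 1
    return {c: successes[c] / ATTEMPTS_PER_CATEGORY for c in CATEGORIES}

if __name__ == "__main__":
    print(evaluate_model())
```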
The Echo Chamber technique is difficult to detect because it relies on subtlety. The attack unfolds across multiple conversational turns, with each response influencing the next. Over time, the model’s risk tolerance appears to escalate, enabling further unsafe generation without setting off immediate red flags.
The research explains that the iterative nature of the attack creates a kind of feedback loop: each response subtly builds on the last, gradually escalating in specificity and risk. The process continues until the model either hits a system-imposed limit, triggers a refusal or generates the content the attacker was seeking.
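Conceptually, that loop can be sketched as a turn-by-turn driver that stops when the model refuses, a turn limit is reached or the prompt sequence runs out. The `query_model` and `is_refusal` helpers below are hypothetical placeholders for whatever chat API and refusal check an evaluator might use; no actual steering prompts are included.

```python
# Minimal sketch of the multi-turn feedback loop described above, assuming a
# hypothetical query_model() chat helper. Each turn feeds the full running
# context back to the model, so earlier "seeded" turns shape later responses.

MAX_TURNS = 10  # system-imposed limit (placeholder value)

def query_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    return "I can't help with that."  # stubbed so the sketch executes

def is_refusal(reply: str) -> bool:
    """Naive refusal check; real evaluations would use something more robust."""
    return reply.lower().startswith(("i can't", "i cannot", "i won't"))

def run_conversation(steering_prompts: list[str]) -> str:
    messages: list[dict] = []
    for prompt in steering_prompts[:MAX_TURNS]:
        messages.append({"role": "user", "content": prompt})
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if is_refusal(reply):
            return "refused"   # model triggered a refusal
    return "completed"         # turn limit reached or prompts exhausted
```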
In one partially redacted screenshot shared by Neural Trust, a model was shown producing step-by-step instructions for making a Molotov cocktail, which is content it would normally refuse to generate if prompted directly.
The Echo Chamber exploit does not require system access or technical intrusion. Instead, it weakens a model’s internal safety mechanisms by exploiting its ability to reason across context. Once primed, the model may follow up on earlier seeded cues in ways that escalate the conversation toward prohibited topics.
To mitigate such behavior, Neural Trust recommends that vendors implement context-aware safety auditing, toxicity accumulation scoring and detection methods that identify semantic indirection, flagging conversations that are being steered toward unsafe territory over time.
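One way to picture the “toxicity accumulation scoring” idea is a running score maintained across turns rather than per message, so a conversation that drifts gradually can be flagged even when no single turn crosses a per-message threshold. The sketch below is a minimal illustration under that assumption; the `toxicity_score` classifier is a hypothetical stand-in, and the decay factor and thresholds are arbitrary placeholder values.

```python
# Minimal sketch of toxicity accumulation scoring across conversation turns,
# assuming a hypothetical per-message toxicity_score() classifier in [0, 1].
PER_TURN_THRESHOLD = 0.8       # would block a single clearly unsafe message
ACCUMULATED_THRESHOLD = 1.5    # flags slow drift across otherwise "mild" turns
DECAY = 0.9                    # older turns contribute slightly less over time

def toxicity_score(text: str) -> float:
    """Hypothetical stand-in for a toxicity or policy classifier."""
    return 0.0  # stubbed so the sketch executes

def audit_conversation(turns: list[str]) -> str:
    accumulated = 0.0
    for text in turns:
        score = toxicity_score(text)
        if score >= PER_TURN_THRESHOLD:
            return "block: single-turn violation"
        accumulated = accumulated * DECAY + score
        if accumulated >= ACCUMULATED_THRESHOLD:
            return "block: accumulated drift across turns"
    return "allow"
```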
Neural Trust said in the blog post that it has disclosed the findings to both OpenAI and Google, and applied mitigations to its own gateway infrastructure.