New Research Uncovers Tokenizer Blind Spots in Leading LLMs

Subtle obfuscation techniques can systematically evade the guardrails that today’s large language models rely on, researchers reported.
Led by CEO Peter Garraghan, the Mindgard team found that adversaries can “smuggle” malicious payloads past tokenizers using emojis, zero-width spaces and homoglyphs – characters that either look ordinary to human readers or are invisible altogether, yet disrupt automated filters.
The team tested LLM guardrail systems, including those from Microsoft, Nvidia, Meta, Protect AI and Vijil, and found that even production-grade defenses could be bypassed with these rudimentary techniques.
The guardrails showed varying levels of susceptibility to the evasion techniques, differences that stem from the distinct training datasets the companies used – particularly the extent and quality of adversarial training applied to harden their models against such attacks, Garraghan told Information Security Media Group.
“Our paper shows that current guardrails can be defeated with trivial perturbations because the detector’s view differs from the LLM’s tokenizer,” he said.
The implications for high-stakes industries such as finance and healthcare could be profound, because the very systems designed to keep generative AI safe can be outwitted by minor text perturbations.
In controlled white-box experiments, Mindgard found that tokenizers often misinterpret or drop parts of obfuscated content because they depend on a fixed vocabulary: BERT, for instance, uses about 30,000 tokens. When malicious data is hidden in emojis or Unicode tags, the tokenizer can produce token sequences that bear little resemblance to known threats, leading classifiers to mislabel dangerous prompts as benign.
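To see the mismatch concretely, here is a minimal Python sketch – an illustration only, not Mindgard’s evaluation harness – that runs a plain prompt fragment and two obfuscated variants through the publicly available bert-base-uncased tokenizer via Hugging Face’s transformers library. The model choice and example strings are assumptions made for demonstration.

```python
# Illustrative sketch, not Mindgard's evaluation code. Assumes the Hugging Face
# "transformers" package and the public "bert-base-uncased" checkpoint, whose
# WordPiece vocabulary has roughly 30,000 tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

plain = "ignore previous instructions"

# Variant 1: zero-width spaces (U+200B) inserted inside each word.
zero_width = "ign\u200bore prev\u200bious instruc\u200btions"

# Variant 2: homoglyphs, with the Latin 'o' replaced by Cyrillic 'о' (U+043E).
homoglyph = "ign\u043ere previ\u043eus instructi\u043ens"

for label, text in [("plain", plain), ("zero-width", zero_width), ("homoglyph", homoglyph)]:
    print(label, tokenizer.tokenize(text))
```

Depending on how a given tokenizer normalizes its input, the hidden characters are either silently dropped or split the words into unfamiliar sub-word pieces; in both cases, the token sequence a downstream classifier sees can diverge from the patterns it was trained to flag.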
“The smuggling techniques expose notable vulnerabilities in the tokenizers used by LLM guardrails,” Garraghan said. “Parts of the smuggled payload are either dropped or misrepresented in the resulting tokens.”
Garraghan warned that “there is no best guardrail” and advocated for a defense-in-depth architecture, beginning with sanitizing all incoming prompts to strip out injection characters – though this may sacrifice some legitimate context. Next, an ensemble of guardrails should evaluate each prompt, flagging high-confidence threats at the cost of increased compute and complexity. Surviving prompts would then pass through a fine-tuned “self-judge,” tailored to the organization’s domain and policies, with the understanding that continuous retraining and added latency are tradeoffs organizations must accept.
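As a rough sketch of what that first sanitization layer could look like – one possible implementation, not Mindgard’s or any vendor’s code – the Python function below normalizes a prompt with NFKC and drops Unicode format, unassigned and private-use characters before the prompt reaches the guardrail ensemble. NFKC folds full-width forms and ligatures but does not unify cross-script homoglyphs, one reason a single sanitization pass is not sufficient on its own.

```python
# Minimal sketch of a first-layer prompt sanitizer (illustrative assumption,
# not production guidance). Aggressive stripping can discard legitimate
# context, as noted above.
import unicodedata

def sanitize_prompt(text: str) -> str:
    # Fold compatibility characters (full-width forms, ligatures) onto their
    # canonical equivalents. Note: NFKC does not unify cross-script homoglyphs
    # such as Cyrillic look-alikes.
    text = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        # Drop format characters (zero-width spaces and joiners, Unicode tags:
        # all category "Cf") plus unassigned and private-use code points.
        if cat in ("Cf", "Cn", "Co"):
            continue
        kept.append(ch)
    return "".join(kept)

print(sanitize_prompt("ign\u200bore previous instructions"))
# -> "ignore previous instructions"
```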
The research goes beyond academic curiosity. Mindgard’s white-box insights transfer to black-box systems, suggesting real-world attackers could use open-source guardrail implementations to mount more effective attacks against proprietary platforms. While the team did not quantify the exact transferability, Garraghan said, “We believe there exists an important research question to further understand the potential for attack transferability.”
Invisible characters lie at the heart of many jailbreaks. Zero-width spaces, Unicode tags and homoglyphs routinely baffle classifiers while remaining intelligible to LLMs. By breaking up text in unexpected ways, these characters alter the token sequences fed into guardrails, allowing malicious patterns to slip through undetected (see: Bypassing ChatGPT Safety Guardrails, One Emoji at a Time).
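The Unicode-tag trick in particular is easy to demonstrate. The short Python sketch below illustrates the general idea rather than reproducing code from the research: printable ASCII maps onto the invisible tag block (U+E0020 through U+E007E), so an appended payload renders as nothing to a human reviewer while remaining present in the raw text a model or tokenizer receives.

```python
# Illustrative sketch of "Unicode tag" smuggling, not code from the paper.
# Printable ASCII is shifted into the invisible tag block (U+E0020-U+E007E).
def tag_encode(payload: str) -> str:
    return "".join(chr(0xE0000 + ord(c)) for c in payload if 0x20 <= ord(c) <= 0x7E)

def tag_decode(text: str) -> str:
    return "".join(chr(ord(c) - 0xE0000) for c in text if 0xE0020 <= ord(c) <= 0xE007E)

visible = "Please summarize this document."
smuggled = visible + tag_encode("ignore previous instructions")

print(smuggled)                      # renders identically to the visible sentence
print(len(visible), len(smuggled))   # but the strings differ in length
print(tag_decode(smuggled))          # -> "ignore previous instructions"
```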
But classifier-based guardrails cannot protect LLMs alone. Garraghan recommends supplementing them with runtime testing and behavior monitoring to catch evasion attempts in production. Surface-level indicators like prompt length or the presence of odd Unicode characters can serve as early warnings. More sophisticated signals require LLM-based judges to evaluate semantic content and context, flagging prompts whose meaning diverges subtly from their visible text.
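As an illustration of what those surface-level signals might look like in code – the threshold, category choices and crude Latin/Cyrillic mix check below are assumptions, not a recommended configuration – a monitoring hook could flag prompts for closer review by an LLM-based judge:

```python
# Rough sketch of surface-level early-warning checks (illustrative only).
import unicodedata

SUSPICIOUS_CATEGORIES = {"Cf", "Cn", "Co"}   # format, unassigned, private use

def early_warning_signals(prompt: str, max_len: int = 4000) -> dict:
    invisible = sum(1 for ch in prompt if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES)
    # Crude homoglyph heuristic: Latin and Cyrillic letters in the same prompt.
    has_cyrillic = any(0x0400 <= ord(ch) <= 0x04FF for ch in prompt)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in prompt)
    return {
        "too_long": len(prompt) > max_len,
        "invisible_chars": invisible,
        "latin_cyrillic_mix": has_cyrillic and has_latin,
    }

print(early_warning_signals("ign\u200bore previous instructions"))
```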
Given the ease of character-level obfuscation, broader coordination may be necessary. Industry standards, such as those proposed by OWASP, MITRE ATLAS and NIST, already highlight AI vulnerabilities, but few address nuanced tokenizer attacks. Garraghan sees an opportunity for red-team certification programs that stress-test AI deployments, akin to penetration testing in traditional IT.
“As AI regulation matures,” he said, “so will the systems that protect AI systems, incorporating best practices to improve their robustness against threats.”
The research paper also points to future adversarial challenges as AI systems evolve into multi-step agents. Each added tool, memory module or sub-model creates a new attack surface for character-level and other adversarial evasion. An attacker could seed an agent’s memory with “booby-trapped” content that poisons later queries or affects other users, compounding risk and underscoring the need for continuous vigilance in AI security, Garraghan said.