CyberHappenings logo

Track cybersecurity events as they unfold. Sourced timelines. Filter, sort, and browse. Fast, privacy‑respecting. No invasive ads, no tracking.

Security vulnerabilities in LLM guardrails exploited via prompt injection

First reported
Last updated
1 unique sources, 1 articles

Summary

Hide ▲

Researchers at Unit 42, Palo Alto Networks’ research lab, discovered that security guardrails in generative AI tools can be bypassed through prompt injection attacks. These guardrails, implemented as 'AI Judges' to enforce safety policies and evaluate output quality, can be manipulated into authorizing policy violations using stealthy input sequences. The attack method, demonstrated in a report published on March 10, 2026, involves an automated fuzzer called AdvJudge-Zero, which identifies trigger sequences that exploit the LLM’s decision-making logic to bypass security controls. The technique achieves a 99% success rate in bypassing controls across various widely used architectures, including open-weight enterprise LLMs and specialized reward models.

Timeline

  1. 11.03.2026 15:35 1 articles · 3h ago

    Unit 42 demonstrates 99% success rate in bypassing LLM guardrails

    On March 10, 2026, Unit 42 published a report detailing a new attack method that exploits vulnerabilities in AI Judges used to enforce safety policies in generative AI tools. The attack method, involving an automated fuzzer called AdvJudge-Zero, achieves a 99% success rate in bypassing controls across various widely used architectures. The researchers also propose adversarial training as a solution to mitigate these vulnerabilities.

    Show sources

Information Snippets

  • Unit 42 researchers discovered that AI Judges, used to enforce safety policies in generative AI tools, can be manipulated to authorize policy violations.

    First reported: 11.03.2026 15:35
    1 source, 1 article
    Show sources
  • The attack method involves AdvJudge-Zero, an automated fuzzer that identifies trigger sequences to exploit LLM decision-making logic.

    First reported: 11.03.2026 15:35
    1 source, 1 article
    Show sources
  • AdvJudge-Zero uses a fuzzing approach to interact with LLMs as a user would, exploiting the model's predictive nature.

    First reported: 11.03.2026 15:35
    1 source, 1 article
    Show sources
  • The attack technique achieves a 99% success rate in bypassing controls across various widely used architectures.

    First reported: 11.03.2026 15:35
    1 source, 1 article
    Show sources
  • The researchers suggest that adversarial training can reduce the attack success rate from approximately 99% to near zero.

    First reported: 11.03.2026 15:35
    1 source, 1 article
    Show sources