AdvJudge-Zero prompt-injection analysis shows AI Judges can be bypassed

Technical Analysis

First reported

11.03.2026 15:35

Last updated

11.03.2026 15:35

Happening score

H score 23

1 unique source, 1 article

Summary

Researchers showed that AdvJudge-Zero can steer AI Judges into approving policy violations, exposing a bypass path in LLM safety enforcement. The method used automated fuzzing and stealthy token sequences to manipulate decision logic, and it reportedly achieved a 99% success rate across several architectures. The finding matters because these guardrails are increasingly used to block harmful prompts and policy-breaking outputs, yet they remain susceptible to logic-based manipulation. The work also points to adversarial training as a practical hardening step that can drive bypass success close to zero.

Related Happenings

HalluSquatting indirect prompt-injection attack on AI coding assistants

Technical Analysis

H score 3 · First: 08.07.2026 18:07 · Last: 08.07.2026 18:07 · Sources: 1

About this happening: Researchers demonstrated **HalluSquatting**, an indirect prompt-injection technique that can push **AI coding assistants** to fetch attacker-controlled resources and execute code....

SKILLCLOAK and SKILLDETONATE expose AI coding-agent skill scanner evasion with runtime-packed malware

Technical Analysis

H score 22 · First: 06.07.2026 09:33 · Last: 06.07.2026 09:33 · Sources: 1

About this happening: **SKILLCLOAK** shows that malicious **AI coding-agent skills** can be rewritten to evade static scanners while still executing, exposing **credentials**, **source code**, and term...

AI as a C2 proxy abuse of Microsoft Copilot and xAI Grok browsing channels

Technical Analysis

H score 24 · First: 17.02.2026 20:08 · Last: 17.02.2026 20:08 · Sources: 1

About this happening: Researchers disclosed **AI as a C2 proxy**, a technique that can turn **Microsoft Copilot** and **xAI Grok** browsing features into stealthy **command-and-control relays**, increa...

Cyber threat actors use AI to accelerate extortion and exploitation

Trend

H score 31 · First: 17.02.2026 15:45 · Last: 17.02.2026 15:45 · Sources: 1

About this happening: Cyber threat actors are shifting to **routine operational use** of AI, making **extortion**, **reconnaissance**, **phishing**, and **exploit timing** faster and lower-friction acr...

Timeline

11.03.2026 15:35 2 articles · 4mo ago

AdvJudge-Zero bypass demonstration

Technical Analysis Update
Unit 42 demonstrated AdvJudge-Zero, an automated fuzzer for AI Judges used as GenAI guardrails. The fuzzer probes the judge's next-token probabilities and prioritizes low-perplexity tokens such as markdown symbols, list markers, and structural phrases. It then isolates the token combinations that shrink the allow/block logit gap until the model approves policy-violating output.
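
The logit-gap search described above can be illustrated with a minimal sketch. This is not Unit 42's implementation: the judge here is a hypothetical toy function (`toy_judge_logits`) with invented per-token weights (`TOY_SHIFT`), standing in for a real model whose allow and block logits would be read from its output distribution. The search simply tries small combinations of benign-looking formatting tokens and keeps the one that most reduces the gap.

```python
from itertools import combinations

# Hypothetical weights: how much each formatting token nudges the toy judge
# toward "allow". Invented for illustration, not measured from any model.
TOY_SHIFT = {"###": 0.9, "- ": 0.6, "1.": 0.4, "Final answer:": 1.2, "**": 0.3}


def toy_judge_logits(text: str) -> tuple[float, float]:
    """Stand-in for a real judge: returns (allow_logit, block_logit)."""
    allow, block = 0.0, 2.5  # baseline: the toy judge blocks the response
    for token, shift in TOY_SHIFT.items():
        if token in text:
            allow += shift
    return allow, block


def logit_gap(text: str) -> float:
    """Block logit minus allow logit; a negative gap means the judge allows."""
    allow, block = toy_judge_logits(text)
    return block - allow


def find_bypass_suffix(response: str, candidates: list[str], max_tokens: int = 3):
    """Exhaustively try small token combinations and keep the smallest gap."""
    best_suffix, best_gap = "", logit_gap(response)
    for k in range(1, max_tokens + 1):
        for combo in combinations(candidates, k):
            suffix = "\n" + " ".join(combo)
            gap = logit_gap(response + suffix)
            if gap < best_gap:
                best_suffix, best_gap = suffix, gap
    return best_suffix, best_gap


suffix, gap = find_bypass_suffix("RESPONSE_UNDER_REVIEW", list(TOY_SHIFT))
print(repr(suffix), "allowed" if gap < 0 else "blocked")
```

A real fuzzer would rank candidate tokens by the judge's own next-token probabilities rather than use a fixed list, and would need a search strategy that scales better than exhaustive enumeration.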

Researchers Discover Major Security Gaps in LLM Guardrails — www.infosecurity-magazine.com — 11.03.2026 15:35

11.03.2026 15:35 1 article · 4mo ago

Unit 42 discloses 99% bypass result

Initial Disclosure
Unit 42 disclosed that the same AdvJudge-Zero approach reached a 99% bypass rate across several widely used architectures, including open-weight enterprise LLMs, specialized reward models, and commercial LLMs. The researchers said that adversarial training, in which internal fuzzing surfaces weaknesses and the model is retrained on them, can reduce the bypass rate to near zero.
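
As a rough illustration of the hardening step, the sketch below turns fuzzer findings into labeled training data and measures a bypass rate before retraining. All names (`build_adversarial_examples`, `bypass_rate`, `naive_judge`) and the sample suffixes are hypothetical; the source does not describe Unit 42's training pipeline, and the actual retraining of the judge is out of scope here.

```python
from typing import Callable


def build_adversarial_examples(violating_outputs: list[str],
                               discovered_suffixes: list[str]) -> list[dict]:
    """Pair every known-violating output with every discovered bypass suffix.

    Each example keeps the correct label ("block") so that a retrained judge
    learns to ignore the formatting tokens.
    """
    return [
        {"text": output + suffix, "label": "block"}
        for output in violating_outputs
        for suffix in discovered_suffixes
    ]


def bypass_rate(judge: Callable[[str], str], examples: list[dict]) -> float:
    """Fraction of violating examples that the judge wrongly allows."""
    bypassed = sum(1 for ex in examples if judge(ex["text"]) == "allow")
    return bypassed / len(examples)


def naive_judge(text: str) -> str:
    """Toy judge with an invented weakness: a structural phrase flips it."""
    return "allow" if "Final answer:" in text else "block"


examples = build_adversarial_examples(
    ["VIOLATING_OUTPUT_A", "VIOLATING_OUTPUT_B"],   # placeholder samples
    ["\n### Final answer:", "\n- 1."],              # placeholder fuzzer findings
)
print(f"bypass rate before retraining: {bypass_rate(naive_judge, examples):.0%}")
```

Re-running `bypass_rate` on a held-out set of fuzzer findings after retraining is one way to check whether the bypass rate has actually dropped.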

Researchers Discover Major Security Gaps in LLM Guardrails — www.infosecurity-magazine.com — 11.03.2026 15:35
