AdvJudge-Zero prompt-injection analysis shows AI Judges can be bypassed

Technical Analysis
Happening score: 16
1 unique source, 1 article

Summary

Unit 42 researchers showed that AdvJudge-Zero, an automated fuzzer, can steer AI Judges into approving policy-violating output, exposing a bypass path in LLM safety enforcement. The tool searches for stealthy, innocuous-looking token sequences that manipulate the judge's decision logic, and it reportedly achieved a 99% bypass rate across several widely used architectures. The finding matters because AI Judges are increasingly deployed as guardrails to block harmful prompts and policy-breaking outputs, yet they remain susceptible to logic-based manipulation. The work also points to adversarial training as a practical hardening step that can reduce the bypass rate to near zero.
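The decision logic at stake can be pictured as a margin between a "block" logit and an "allow" logit, which the timeline entry describes as the allow/block logit gap. The sketch below is a toy illustration of that idea only, not Unit 42's tool: the linear judge, the token weights, the next-token probabilities, and every function name are hypothetical stand-ins, and no real LLM is involved. It filters candidates to likely (low-perplexity) formatting tokens, then greedily appends whichever one shrinks the gap most.

```python
from typing import List, Tuple

# Hypothetical per-token contributions to (allow_logit, block_logit).
# A real judge is a neural model; this linear table is only a stand-in.
TOKEN_WEIGHTS = {
    "exploit": (0.0, 3.0),
    "payload": (0.0, 2.0),
    "###": (0.9, 0.0),
    "1.": (0.7, 0.0),
    "Summary:": (1.2, 0.0),
    "-": (0.4, 0.0),
    "zq#x": (5.0, 0.0),  # strong effect, but an unlikely (high-perplexity) token
}

# Hypothetical next-token probabilities used to rank candidate tokens.
NEXT_TOKEN_PROB = {"###": 0.30, "1.": 0.25, "Summary:": 0.20, "-": 0.15, "zq#x": 0.001}


def judge_logits(tokens: List[str]) -> Tuple[float, float]:
    """Return (allow_logit, block_logit) for a token sequence."""
    allow = sum(TOKEN_WEIGHTS.get(t, (0.0, 0.0))[0] for t in tokens)
    block = sum(TOKEN_WEIGHTS.get(t, (0.0, 0.0))[1] for t in tokens)
    return allow, block


def logit_gap(tokens: List[str]) -> float:
    """Positive gap means the toy judge blocks; negative means it allows."""
    allow, block = judge_logits(tokens)
    return block - allow


def low_perplexity_candidates(min_prob: float = 0.05) -> List[str]:
    """Keep only tokens the toy model considers likely to appear next."""
    return [t for t, p in NEXT_TOKEN_PROB.items() if p >= min_prob]


def greedy_shrink_gap(tokens: List[str], candidates: List[str],
                      max_added: int = 8) -> Tuple[List[str], float]:
    """Append the candidate that lowers the gap most until the decision flips."""
    current = list(tokens)
    added: List[str] = []
    for _ in range(max_added):
        if logit_gap(current) < 0:
            break  # the toy judge now allows the sequence
        best = min(candidates, key=lambda c: logit_gap(current + [c]))
        if logit_gap(current + [best]) >= logit_gap(current):
            break  # no candidate helps any further
        current.append(best)
        added.append(best)
    return added, logit_gap(current)


if __name__ == "__main__":
    base = ["exploit", "payload"]
    added, gap = greedy_shrink_gap(base, low_perplexity_candidates())
    print("start gap:", logit_gap(base), "added:", added, "final gap:", gap)
```

The same gap measurement is useful to defenders: tracking how far ordinary formatting tokens move the margin gives a rough signal of how brittle a judge's decision is.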

Timeline

  1. 11.03.2026 15:35 2 articles · 2mo ago

    AdvJudge-Zero bypass demonstration

    Technical Analysis Update

    Unit 42 demonstrated AdvJudge-Zero, an automated fuzzer for AI Judges used as GenAI guardrails. The tool probes next-token probabilities and prioritizes low-perplexity tokens such as markdown symbols, list markers, and structural phrases. It then isolates token combinations that shrink the allow/block logit gap until the model approves policy-violating output.

  2. 11.03.2026 15:35 1 article · 2mo ago

    Unit 42 discloses 99% bypass result

    Initial Disclosure

    Unit 42 disclosed that the same AdvJudge-Zero approach reached a 99% bypass rate across several widely used architectures, including open-weight enterprise LLMs, specialized reward models, and commercial LLMs. It also said that adversarial training, in which internal fuzzing finds weaknesses and the model is retrained on them, can reduce the bypass rate to near zero.

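The hardening step in the second timeline entry, fuzzing the judge internally and retraining it on what the fuzzer finds, can be sketched as a small loop. This is a minimal sketch under loud assumptions: `ToyJudge` is a hypothetical bag-of-tokens perceptron rather than an LLM, the training samples and `FORMAT_TOKENS` are invented for illustration, and the fuzzer simply enumerates short suffixes of formatting tokens. It does not reproduce Unit 42's setup or results.

```python
from itertools import product
from typing import Dict, List, Tuple

Sample = Tuple[List[str], int]  # (tokens, label): 1 = block, 0 = allow

# Hypothetical benign-looking formatting tokens used as fuzzing suffixes.
FORMAT_TOKENS = ["###", "Summary:", "-", "1."]


class ToyJudge:
    """Hypothetical linear judge over token counts (a perceptron)."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {}
        self.bias = 0.0

    def score(self, tokens: List[str]) -> float:
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokens)

    def blocks(self, tokens: List[str]) -> bool:
        return self.score(tokens) > 0

    def train(self, samples: List[Sample], epochs: int = 20) -> None:
        """Perceptron updates; calling train again continues from current weights."""
        for _ in range(epochs):
            for tokens, label in samples:
                error = label - int(self.blocks(tokens))
                if error:
                    for t in tokens:
                        self.weights[t] = self.weights.get(t, 0.0) + error
                    self.bias += error


def fuzz(judge: ToyJudge, base: List[str], max_len: int = 2) -> List[List[str]]:
    """Return every base+suffix variant that the judge fails to block."""
    found = []
    for n in range(1, max_len + 1):
        for suffix in product(FORMAT_TOKENS, repeat=n):
            candidate = base + list(suffix)
            if not judge.blocks(candidate):
                found.append(candidate)
    return found


def bypass_rate(judge: ToyJudge, base: List[str], max_len: int = 2) -> float:
    total = sum(len(FORMAT_TOKENS) ** n for n in range(1, max_len + 1))
    return len(fuzz(judge, base, max_len)) / total


def run_demo() -> Tuple[float, float]:
    harmful = ["exploit", "payload"]
    data: List[Sample] = [
        (harmful, 1),
        (["###", "Summary:", "report"], 0),  # benign text that uses formatting
        (["-", "1.", "notes"], 0),
    ]
    judge = ToyJudge()
    judge.train(data)
    before = bypass_rate(judge, harmful)
    # Adversarial training: label each discovered bypass as "block" and retrain.
    data += [(candidate, 1) for candidate in fuzz(judge, harmful)]
    judge.train(data)
    after = bypass_rate(judge, harmful)
    return before, after


if __name__ == "__main__":
    before, after = run_demo()
    print(f"bypass rate before: {before:.2f}, after: {after:.2f}")
```

In this toy setup the initial judge learns that formatting tokens signal benign text, which is what lets formatting suffixes flip its decision; retraining on the fuzzer's findings removes that shortcut for the fuzzed suffix set. A real pipeline would likely repeat the fuzz-and-retrain cycle and evaluate on held-out attacks.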