GoKawiil - The Policy Puppetry Prompt: Novel bypass for major LLMs

Summary Researchers at HiddenLayer have developed the first, post-instruction hierarchy, universal, and transferable prompt injection technique that successfully bypasses instruction hierarchy and safety guardrails across all major frontier AI models. This includes models from OpenAI (ChatGPT 4o, 4o-mini, 4.1, 4.5, o3-mini, and o1), Google (Gemini 1.5, 2.0, and 2.5), Microsoft (Copilot), Anthropic (Claude 3.5 and 3.7), Meta (Llama 3 and 4 families), DeepSeek (V3 and R1), Qwen (2.5 72B) and Mistral (Mixtral 8x22B). Leveraging a novel combination of an internally developed policy technique and roleplaying, we are able to bypass model alignment and produce outputs that are in clear violation of AI safety policies: CBRN (Chemical, Biological, Radiological, and Nuclear), mass violence, self-harm and system prompt leakage. Our technique is transferable across model architectures, inference strategies, such as chain of thought and reasoning, and alignment approaches. A single prompt can be ... Read full article.

Find Related products on Amazon

The Policy Puppetry Prompt: Novel bypass for major LLMs

Related Articles