Skip to content
Tech News
← Back to articles

PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free

read original get AI Prompt Security Kit → more articles
Why This Matters

PIGuard introduces a new approach to combat prompt injection attacks in large language models by reducing over-defense bias through the MOF training strategy. This advancement enhances the robustness of prompt guard models, making LLMs safer and more reliable for consumers and developers. The open-source nature of PIGuard also promotes wider adoption and continuous improvement in AI security measures.

Key Takeaways

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense—falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose PIGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. PIGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks.