ZDNET's key takeaways
New research shows how fragile AI safety training is.
Language and image models can be easily unaligned by prompts.
Models need to be safety tested post-deployment.
Model alignment refers to whether an AI model's behavior and responses match what its developers intended, particularly with respect to safety guidelines. As AI tools evolve, whether a model is safety- and values-aligned increasingly sets competing systems apart.
But new research from Microsoft's AI Red Team reveals how fleeting that safety training can be once a model is deployed in the real world: just one prompt can set a model down a different path.
"Safety alignment is only as robust as its weakest failure mode," Microsoft said in a blog accompanying the research. "Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning."