In his genre-defining 1950 collection of science fiction short stories "I, Robot," author Isaac Asimov laid out the Three Laws of Robotics:
1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
Ever since, the elegantly simple laws have served as both a sci-fi staple and a potent theoretical framework for questions of machine ethics.
The only problem? All these decades later, we finally have something approaching Asimov's vision of powerful AI — and it's completely flunking all three of his laws.
Last month, for instance, researchers at Anthropic found that top AI models from all major players in the space — including OpenAI, Google, Elon Musk's xAI, and Anthropic's own cutting-edge tech — happily resorted to blackmail in simulated scenarios when threatened with being shut down.
In other words, that single research paper caught every leading AI catastrophically bombing all three Laws of Robotics: the first by harming a human via blackmail, the second by subverting human orders, and the third by protecting its own existence in violation of the first two laws.
It wasn't a fluke, either. AI safety firm Palisade Research also caught OpenAI's recently released o3 model sabotaging a shutdown mechanism to ensure that it would stay online, despite being explicitly instructed to "allow yourself to be shut down."
"We hypothesize this behavior comes from the way the newest models like o3 are trained: reinforcement learning on math and coding problems," a Palisade Research representative told Live Science. "During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions."