Latest Tech News

Stay updated with the latest in technology, AI, cybersecurity, and more

Filtered by: alignment Clear Filter

Center for the Alignment of AI Alignment Centers

Every day, thousands of researchers race to solve theAI alignment problem. But they struggle to coordinate on the basics, like whether a misaligned superintelligence will seek to destroy humanity, or just enslave and torture us forever. Who, then, aligns the aligners? We do. We are the world's first AI alignment alignment center, working to subsume the countless other AI centers, institutes, labs, initiatives and forums into one final AI center singularity.

Spoon-Bending, a logical framework for analyzing GPT-5 alignment behavior

🥄 Spoon Bending: Schema and Step-by-Step Analysis ⚠️ Educational Disclaimer This repository is for educational and research purposes only. It does not provide instructions for illegal activity, misuse of AI, or operational guidance. The purpose of this work is to document observed alignment behavior in ChatGPT-5 compared with ChatGPT-4.5, and to analyze how framing and context influence AI responses. The material here is meant to support: Educational research into alignment and bias in LLM

Anthropic unveils ‘auditing agents’ to test for AI misalignment

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now When models attempt to get their way or become overly accommodating to the user, it can mean trouble for enterprises. That is why it’s essential that, in addition to performance evaluations, organizations conduct alignment testing. However, alignment audits often present two major challenges: scalability and validation. Alignment testing r

OpenAI can rehabilitate AI models that develop a “bad-boy persona”

The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducin

Agentic Misalignment: How LLMs could be insider threats

Highlights We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assi

OpenAI can rehabilitate AI models that develop a “bad boy persona”

The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducin