AI Deception and Safety

As AI systems become more capable and are deployed in high-stakes contexts, the possibility of AI deception becomes a pressing safety concern. This raises both conceptual and empirical questions: What would it mean for an AI system to deceive? Under what conditions might such behavior emerge? And how can we design systems and evaluation frameworks to detect and prevent it?

This is an emerging focus of my research. I’m developing a framework for thinking about AI deception that distinguishes different forms it might take, explores the architectural and training conditions under which it might arise, and considers what safety measures might be effective.

Current status

This project is in early development. I plan to pursue grant funding to support sustained research on AI deception and safety.

Charles Rathkopf

I am interested in how mental properties emerge from physical stuff.