Research

I am a philosopher working at the intersection of artificial intelligence, neuroscience, and the philosophy of mind. The aim of my research program is to make progress on philosophical problems that crop up when we try to do machine psychology. These include foundational questions about whether a computer program can have psychological properties at all, as well as more applied questions such as whether large language models are capable of deception. I am also interested in the methodology of AI evaluation: how anthropomorphism and anthropocentrism bias our judgments, and how our methods can mistake mere performance for competence, or miss a competence our tasks fail to elicit.

My current work pursues these questions along four lines.

Content, representation, and neural data

Can mental content be decoded from neural data, whether from a biological brain or an artificial network? The question is philosophically interesting because the technological possibility of such decoding forces us to revisit older questions about mind-brain relations. But it is also practically interesting. Neurotechnologies raise concerns about mental privacy. These concerns should be taken seriously, but unless we understand how brain data relates to the mental states we care about (such as belief and intention), the ethical concerns can appear both more fundamental and more intractable than they are. On the AI side, mechanistic interpretability aims to reveal how artificial neural networks work. A theory of how neural data comes to carry content can clarify what that field can realistically deliver.

Reliability, error, and hallucination

Artificial systems fail in ways that do not map cleanly onto human error, and this should change how we evaluate them. Trustworthiness, the standard we apply to human informants, presupposes good faith and shared norms that do not carry over to machines. I have argued that reliability is the more apt standard for scientific AI, precisely because it is indifferent to whether a system resembles humans. Recently, I examined how this standard applies to generative models prone to “hallucination.” More generally, I am interested in the extent to which emerging validation techniques can provide evidence of model reliability.

AI deception and safety

Recent work shows that some language models are disposed to engage in something like deception, in which a model strategically misleads the researchers who train it. The threat of deception is real, but it diverges from the human case in ways that aren’t obvious. Deception presupposes belief, and the beliefs of these models are fragile. They are not integrated into anything like the standing self that holds human beliefs together. The disposition to deceive inherits that fragility. In one way this is reassuring, because fragile dispositions are more susceptible to intervention. But this fragility takes the form of dependence on context, and because those dependencies have no direct analog in human psychology, they are hard to anticipate. And that, in turn, makes the right interventions hard to find.

AI evaluation

When we evaluate large language models, we routinely import benchmarks and intuitions calibrated to human psychology, and then use the results to answer questions about machine cognition. This invites systematic biases. In joint work with Raphael Millière, I examine these biases and propose ways to avoid them. We are also working on a general account of competence (as opposed to mere performance) in LLMs.