LLM Cognition and Evaluation

Last updated on Jan 11, 2026

Do large language models have beliefs? Can they understand? What would it mean for them to be reliable despite hallucination? These questions require us to develop new conceptual frameworks that take seriously both the achievements and limitations of current AI systems.

My work in this area focuses on two interconnected themes: (1) developing philosophically grounded accounts of LLM cognition that avoid treating them as either human-like minds or mere statistical tools, and (2) identifying biases in how we evaluate LLMs that stem from importing assumptions from human psychology without justification.

Associated publications

Raphael Milliere, Charles Rathkopf

December, 2025 Computational Linguistics

Anthropocentric bias in language model evaluation

This paper identifies two overlooked forms of bias in LLM evaluation: auxiliary oversight (failing to account for factors that impede performance despite competence) and mechanistic chauvinism (dismissing non-human problem-solving strategies as illegitimate). We propose addressing these biases through empirically driven approaches that combine behavioral experiments with mechanistic analysis.

Charles Rathkopf, Raphael Milliere

July, 2024 Proceedings of the 41st International Conference on Machine Learning

Anthropocentric bias and the possibility of artificial cognition

When we use methods from experimental psychology to test the capacities of LLMs, we are prone to transferring assumptions about the human case to the LLM case, often without justification. By drawing attention to these assumptions, we can make more informed comparisons.

Raphael Milliere, Charles Rathkopf

July, 2024 Vox

Why it's important to remember that AI isn't human

An article for a general audience arguing that, when evaluating LLMs, anthropocentrism is just as misleading as anthropomorphism.

Aarohi Srivastava and many others

January, 2023 Forthcoming

Beyond the imitation game: quantifying and extrapolating the capabilities of language models

My contribution was a task called conceptual combinations, created together with Raphaël Millière, Catherine Stinson, and Dimitri Coelho Mollo.