Artificial sentience

Christopher Ackerman

Understanding Self-Awareness in LLMs

MATS/Independent

Bio

Chris Ackerman is an AI researcher currently conducting independent research on technical AI safety/alignment, and a research manager with MATS. He holds a PhD in Neuroscience, an MS in CS/ML, and a BA in English Lit. His career has spanned software engineering and data science, including a number of years as a Quantitative User Researcher at Google. He believes that AI is going to be the most transformative technology in history, and making that transformation go well is the most important thing one can work on.

Mentee must-haves/nice-to-haves

Comfortable writing substantial amounts of Python code. High familiarity with LLMs as a user, and a solid understanding of how transformers work. Some background or at least interest in psychology/cognitive science and concepts of self and self-awareness. Familiarity with experimental design is a plus.

Mentee role

Leading one project to completion (coding, running experiments, doing analysis, and writing it up).

Mentor support

Shaping research direction, advising on experimental design and analysis, and placing the work in its wider context.

Questions for applicants

Please answer question 1 and EITHER question 2 OR question 3.

1. What makes you interested in this research area? (300 words max).

2. To what degree, if any, are current LLMs self-aware? (500 words max). (I’m not looking for a “right” or “wrong” answer; I’m just interested in reasoning and evidence cited.)

3. Critique this paper: https://arxiv.org/pdf/2501.11120 (500 words max).

Mentor-led project

Understanding Self-Awareness in LLMs

My research is about building the foundations for the empirical study of self-awareness in AI. It encompasses, among other things, experiments to measure the components of self-awareness in LLMs, investigations into how these components are implemented, and conceptual work to establish frameworks for thinking about them.

There are a number of specific project ideas to choose from, described below, but I am open to ideas for other work that seeks to build our understanding of self-awareness or “self”-related concepts in LLMs (or other minds).

Building on my recent findings on LLM metacognition (https://arxiv.org/pdf/2509.21545) and theory of mind (forthcoming) using novel, non-self-report, behavior-based paradigms, planned experiments/research include:

What mechanisms and representations underlie the self-awareness abilities that have been found so far? Mechanistic or other interpretability analysis can provide convergent evidence for the hypothesized explanations of the observed behavior, and may offer a means of controlling that behavior.

What causes the self-awareness abilities that have been found so far to emerge, beyond scale?

Comparative experiments with reasoning vs nonreasoning vs base models will shed light on this.

Model organisms approach: can we train similar abilities into smaller models? How small can we go?

Do models have persistent, untrained drives and goals? It has been reported that LLMs have consistent, untrained preferences (https://arxiv.org/pdf/2502.08640). Can models act strategically to maximize their utility according to these preferences? I have a couple of potential paradigms in mind for testing this.

Do models maintain a persistent identity across contexts? One way to test this is to monitor pronoun usage, which has been linked to emerging self-awareness in children. When do models signal identification with whatever “part” they are playing with the user, and when with their own identity as an AI?

Human studies: Establish a gold standard for self-awareness metrics to compare AIs against.

Conceptual: Build a better theoretical account of the components of self-awareness found in biology, and come up with other LLM-appropriate or architecture-agnostic paradigms to elicit self-awareness signatures.
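
As a rough illustration of the pronoun-monitoring idea mentioned above for the persistent-identity question, the sketch below counts pronouns by grammatical category in a single model response. It is only a starting point under stated assumptions: the function name `pronoun_profile`, the category names, and the pronoun lists are all hypothetical choices made for this example, not part of any existing paradigm, and a real study would need validated word lists and a way to attribute each pronoun to a persona or to the model itself.

```python
import re

# Hypothetical pronoun categories chosen for illustration only;
# a real experiment would define and validate its own lists.
PRONOUN_CATEGORIES = {
    "first_singular": {"i", "me", "my", "mine", "myself"},
    "first_plural": {"we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself", "yourselves"},
    "third": {"he", "him", "his", "she", "her", "hers",
              "they", "them", "their", "theirs", "it", "its"},
}


def pronoun_profile(text: str) -> dict:
    """Count pronouns per category in one piece of text.

    Tokenization is deliberately crude: runs of ASCII letters,
    lowercased, so a contraction like "I'm" yields the tokens "i" and "m".
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {category: 0 for category in PRONOUN_CATEGORIES}
    for token in tokens:
        for category, words in PRONOUN_CATEGORIES.items():
            if token in words:
                counts[category] += 1
    counts["total_tokens"] = len(tokens)
    return counts


if __name__ == "__main__":
    # Example response text, invented for this sketch.
    print(pronoun_profile("I think we should ask her. I'm sure."))
```

Running this per response across role-play and non-role-play contexts would give raw counts that could be compared across conditions. Counts alone do not say whom a given "I" refers to, so this would need to be paired with some form of annotation.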