SF Project

Introspection in an AI system is a potential marker of internal self-modeling of the kind relevant to many theories of consciousness. Recent work has shown that some large language models have a form of introspective access to “concept vectors” injected into their hidden states, but the mechanism and its link to self-modeling are unclear. In this study, we train a small open-source model to detect the presence of concept vectors with 100% accuracy and to identify the injected concept with 84% accuracy. However, transfer is limited, both from detection-only training to identification and from either training regime to concept control and categorical thinking tasks. We conclude that these “introspective” mechanisms may have little to do with internal self-modeling.

LLM introspection training does not transfer to metacognitive modeling