SF Project

Until now, animal welfare benchmarks have evaluated LLMs on overt ethical dilemmas, which do not resemble the situations in which autonomous LLMs will make welfare-relevant decisions. Using Petri and Bloom, we conducted an iterative series of automated audits simulating realistic (multi-turn, agentic) deployments. We found that while most models exhibit a similar preference for animal welfare when the question is framed explicitly, they differ in their tendency to notice decisions with welfare consequences, and even more so in their determination to stand by pro-animal choices in the face of real tradeoffs and pushback from users. Further work could turn these scenarios into a fixed set of seeds for a dynamic benchmark.

Needle in a Haystack: Measuring LLMs' Revealed Preferences on Animal Welfare