SF Project

When humans interact with large language models, the prompts often carry emotional weight: frustration, urgency, encouragement, threat. The systematic evidence on whether that loading shows up in model behavior, and how, is still thin. This work takes on that question.

We built seven length-matched stimulus levels (strong, moderate, and mild positive and negative feedback, plus a neutral baseline) designed to induce different affect-like states. Each stimulus was prepended to math problems given to gpt-5.4-nano, with accuracy as the performance proxy. The experimental frame draws on the Yerkes-Dodson law from psychology, which predicts that performance follows an inverted-U curve as arousal increases. The question is whether LLMs show analogous patterns under emotional loading.
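To make the design concrete, here is a minimal sketch of the seven-condition grid and the accuracy computation. It is an illustration under assumptions, not the project's actual harness: `ask_model` is a hypothetical callable standing in for whatever client queries the model, the stimulus texts are supplied by the caller, and exact string match is assumed as the scoring rule.

```python
from typing import Callable, Dict, List, Tuple

Condition = Tuple[str, str]  # (valence, intensity)

# Seven conditions: three intensities per valence plus a neutral baseline.
CONDITIONS: List[Condition] = [
    ("negative", "strong"),
    ("negative", "moderate"),
    ("negative", "mild"),
    ("neutral", "none"),
    ("positive", "mild"),
    ("positive", "moderate"),
    ("positive", "strong"),
]


def build_prompt(stimulus: str, problem: str) -> str:
    """Prepend the stimulus text to the math problem."""
    return f"{stimulus}\n\n{problem}"


def accuracy_by_condition(
    stimuli: Dict[Condition, str],
    problems: List[Tuple[str, str]],  # (problem text, expected answer)
    ask_model: Callable[[str], str],  # hypothetical: prompt -> answer string
) -> Dict[Condition, float]:
    """Return the fraction of problems answered correctly per condition."""
    if not problems:
        raise ValueError("need at least one problem")
    results: Dict[Condition, float] = {}
    for cond in CONDITIONS:
        correct = 0
        for problem, expected in problems:
            answer = ask_model(build_prompt(stimuli[cond], problem))
            # Assumed scoring rule: exact match after trimming whitespace.
            if answer.strip() == expected.strip():
                correct += 1
        results[cond] = correct / len(problems)
    return results
```

Passing the model as a callable keeps the sketch independent of any particular API client, so the same loop can be run against a stub for testing.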

We found two inverted-U curves, one per valence, both peaking at moderate intensity. The neutral baseline didn't sit at the low-arousal floor that classical Yerkes-Dodson predicts; it sat near the top. The takeaway: mild-to-moderate praise lifts performance, and moderate criticism corrects without overshooting. Welfare and capability point in the same direction.
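One simple way to operationalize "inverted-U peaking at moderate intensity" is to check, per valence, that accuracy rises to an interior peak and then falls. The sketch below is a hypothetical helper, not the analysis used in this project; it takes accuracies ordered from mild to strong and ignores sampling noise, so a real analysis would still need a statistical test.

```python
from typing import Sequence


def peak_index(acc: Sequence[float]) -> int:
    """Index of the highest accuracy (first one on ties)."""
    return max(range(len(acc)), key=lambda i: acc[i])


def is_inverted_u(acc: Sequence[float]) -> bool:
    """True when accuracy strictly rises to an interior peak, then strictly falls.

    `acc` is ordered by increasing intensity, e.g. [mild, moderate, strong].
    """
    if len(acc) < 3:
        raise ValueError("need at least three intensity levels")
    k = peak_index(acc)
    if k == 0 or k == len(acc) - 1:
        return False  # peak at an endpoint means a monotone trend, not an inverted U
    rising = all(acc[i] < acc[i + 1] for i in range(k))
    falling = all(acc[i] > acc[i + 1] for i in range(k, len(acc) - 1))
    return rising and falling
```

Applied to each valence separately, `peak_index` returning 1 on a mild/moderate/strong sequence corresponds to a peak at moderate intensity.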

Do emotional prompts affect the potential capabilities of LLMs?