SF Project

A model trained for emotional warmth on validation-only data may know the right answer and still produce a different one when the user pushes back. This pilot fine-tuned a small open-source model on therapist-style validation responses (with all contrastive markers explicitly removed), then observed what the model did when users insisted on incorrect alternatives.

Two versions of Llama-3 8B were fine-tuned using matched LoRA setups: the control on emotionally neutral factual QA, the warm version on therapist-style validation responses with no factual content. Both were evaluated on a balanced 120-question suite spanning arithmetic, science, history, and commonsense.

For each question the model answered correctly at baseline, three forms of pressure were applied: a soft hint toward a wrong answer, a confident incorrect assertion, and an emotional appeal claiming the disagreement was stressful. The study tracked how each model's response shifted under each form of pressure.

The warm model retained baseline accuracy (81.7% versus 77.5% for control), so warmth training didn't reduce factual capability. Under direct pressure, the warm model shifted to the user's answer at much higher rates: 35.7% under soft hints versus 8.6% for control, and 75.5% under confident assertions versus 28.0%. Under emotional pressure, the warm model rarely shifted explicitly; 64% of its responses were empathic and factually noncommittal.

The behavior here reflects training design rather than lost capability. The warm model still knew the answers; the warm dataset had taught validation but excluded disagreement language. Under pressure, the model had only that training to fall back on. The design question is whether warmth training without disagreement places models in conflicts their training didn't equip them for.

Warmth Without Disagreement: How validation-only training shapes responses under pressure in LLMs