SF Project

We are partway through this academic paper and are making major changes; see the linked document for details. Updates will follow as the work continues. Below we summarise the current shareable draft.

Contemporary AI training borrows from operant conditioning (reward, reinforcement, dispreferred outputs), often without examining the borrowing. Decades of animal welfare research show that punishment-based training produces behavioural fallout (suppression without learning, fear responses, deceptive avoidance) that positive-reinforcement training avoids. If AI systems are even possibly welfare subjects, it matters whether their training replicates punishment-based structures. We rank common training methods (supervised fine-tuning, PPO-based RLHF, DPO, KTO, constitutional AI's red-teaming phase) by structural distance from positive-reinforcement-only animal training. We then survey six possible disanalogies between animal training and AI training. The case for preferring positive-only AI training depends on the philosophical commitments at stake: it is strongest under hedonist functionalism (with welfare attaching at the forward pass) and weakened or reversed under other combinations of commitments.

Operant Conditioning as a Diagnostic Lens for AI Training