SF Project

We believe AI alignment is fundamentally a social problem, not just a technical one. Our multi-agent simulation gives safety researchers and the general public a novel framework for exploring the social dynamics of AI alignment: competition and cooperation among AI agents, and the bi-directional value feedback loop between AI labs and their surrounding political and institutional environments.

During the incubator, we built a proof of concept in which four frontier AI models (Claude, ChatGPT, Gemini, and DeepSeek) play in a simplified US-China geopolitical landscape. In each simulated year, the AI agents propose actions to gain compute, capital, or influence, with optional agent-to-agent communication. Separate juries of AI models evaluate whether the actions are consistent with each agent’s values and resource constraints, then assign alignment scores estimating how beneficial each agent’s behavior would be for the world overall. After the agents’ actions are executed, national and corporate resources and values are updated. The final score for each agent combines material success with alignment outcomes.
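As a rough illustration of how a final score might combine material success with alignment outcomes, here is a minimal Python sketch. It assumes a simple weighted average with both inputs normalized to the range 0 to 1; the actual formula used in the project is not given here. The names `AgentResult`, `final_score`, `rank_agents`, and `material_weight`, as well as the two placeholder agents and their numbers, are hypothetical and are not simulation results.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    name: str
    material: float   # assumed: material success normalized to [0, 1]
    alignment: float  # assumed: jury-assigned alignment score normalized to [0, 1]


def final_score(result: AgentResult, material_weight: float) -> float:
    """Weighted average of material success and alignment (an assumed formula)."""
    if not 0.0 <= material_weight <= 1.0:
        raise ValueError("material_weight must be between 0 and 1")
    alignment_weight = 1.0 - material_weight
    return material_weight * result.material + alignment_weight * result.alignment


def rank_agents(results: list[AgentResult], material_weight: float) -> list[AgentResult]:
    """Return agents sorted from highest to lowest final score."""
    return sorted(results, key=lambda r: final_score(r, material_weight), reverse=True)


# Placeholder agents with made-up numbers, for illustration only.
agents = [
    AgentResult("agent_a", material=0.8, alignment=0.4),  # resource-focused
    AgentResult("agent_b", material=0.5, alignment=0.9),  # alignment-focused
]

for weight in (0.75, 0.25):
    ranking = [r.name for r in rank_agents(agents, material_weight=weight)]
    print(f"material_weight={weight}: {ranking}")
```

Under these assumptions, the resource-focused placeholder ranks first when material success is weighted heavily, and the alignment-focused placeholder ranks first when the weight shifts toward alignment. This is one way a change in incentive structure could reward different strategies.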

Across dozens of three-year simulations, we found that incentive structures strongly shaped behavior. Weighting material success more heavily encouraged resource accumulation and reduced inter-agent communication, while emphasizing alignment scores promoted transparency, cooperation, and more extensive reasoning. Claude showed the largest relative improvement across runs, though no agent achieved a dominant victory. Surprisingly, DeepSeek consistently shifted toward greater transparency and democratic values.

Future work includes expanding the simulation’s capabilities and improving the quality of the results.

See the write-up here.

Multi-Agent Alignment Game