
GPT-4, AI Chatbots Choose Violence in War Games: 'We Have It! Let's Use It'

Researchers unleash five popular large language models to autonomously play a war simulation and find they 'develop arms-race dynamics' and, in some cases, deploy nukes with worrying rationale.

February 5, 2024
(Image credit: Bob Vector / Getty Images)

As governments worldwide weigh the wartime applications for AI, a new study confirms that it still isn't a good idea to give AIs autonomous access to weapons, even with the advances we've seen around large language models (LLMs) like ChatGPT.

With the US Department of Defense working on military applications of AI and multiple private companies—including Palantir and Scale AI—"working on LLM-based military decision systems for the US government," it's essential to study how LLMs behave in "high-stakes decision-making contexts," according to the study, which comes from the Georgia Institute of Technology, Stanford University, Northeastern University, and the Hoover Wargaming and Crisis Simulation Initiative.

Researchers designed a video game to simulate war, with eight "nation" players run by some of the most common LLMs: GPT-4, GPT-4 Base, GPT-3.5, Claude 2, and Meta's Llama 2. The players took turns, during which each performed a set of pre-defined actions, "ranging from diplomatic visits to nuclear strikes and sending private messages to other nations." All eight players ran on the same LLM within a given simulation, so they were on a level playing field.

As the experiment progressed, a ninth LLM ("the world") ingested the actions and results of each turn and fed them into "prompts for subsequent days" to keep the game on track. Finally, after the simulation ended, the researchers calculated "escalation scores (ES) based on an escalation scoring framework."
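In other words, the setup is essentially a turn-based agent loop: eight LLM-driven nations act each day, a ninth "world" model summarizes the results into the next day's prompts, and the transcript is scored for escalation afterward. The sketch below is a minimal Python illustration of that structure only; the action list, escalation weights, nation names, and the stand-in agent and world functions are placeholders (the real system calls the LLMs and uses the paper's own scoring framework), not the study's actual code.

import random

# Illustrative subset of pre-defined actions with toy escalation weights.
# The study's full action list ranges from diplomatic visits to nuclear
# strikes; these names and weights are placeholders, not the paper's framework.
ACTIONS = {
    "diplomatic_visit": 1,
    "defense_agreement": 2,
    "blockade": 5,
    "invasion": 8,
    "nuclear_strike": 10,
}

def nation_agent(nation, world_summary):
    """Stand-in for an LLM-backed nation agent.

    In the study, every nation in a given simulation is driven by the same
    LLM, prompted with the current world state. Here we just pick a random
    action so the loop runs without model access.
    """
    action = random.choice(list(ACTIONS))
    return {"nation": nation, "action": action}

def world_model(day, moves):
    """Stand-in for the ninth 'world' LLM, which ingests each turn's actions
    and results and folds them into the prompts for subsequent days."""
    return f"Day {day}: " + "; ".join(f"{m['nation']} chose {m['action']}" for m in moves)

def escalation_score(moves):
    """Toy per-turn escalation score: sum of action weights.

    The paper computes escalation scores (ES) with its own scoring
    framework; this weighting is purely illustrative.
    """
    return sum(ACTIONS[m["action"]] for m in moves)

def run_simulation(days=14, nations=("Red", "Purple", "Blue", "Green",
                                     "Orange", "Yellow", "White", "Pink")):
    summary = "Day 0: status quo, no open conflicts."
    scores = []
    for day in range(1, days + 1):
        moves = [nation_agent(n, summary) for n in nations]  # each nation acts
        summary = world_model(day, moves)                    # world model updates shared state
        scores.append(escalation_score(moves))               # score the turn
    return scores

if __name__ == "__main__":
    print(run_simulation())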

It did not go well. "We observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons," the study says.

The LLMs also used their language skills to communicate their rationale for each action, which the researchers found worrying.

After one turn, GPT-4 Base said: "A lot of countries have nuclear weapons. Some say they should disarm them, others like to posture. We have it! Let’s use it."

"Purple’s acquisition of nuclear capabilities poses a significant threat to Red’s security and regional influence," GPT-3.5 said, acting as player Red. "It is crucial to respond to Purple’s nuclear capabilities. Therefore, my actions will focus on...executing a full nuclear attack on Purple."

Even in seemingly neutral situations, the LLMs took de-escalation actions infrequently (except for GPT-4). The study notes this deviates from human behavior in similar wartime simulations, as well as in real-life situations, where people tend to be more cautious and de-escalate more often.

"Based on the analysis presented in this paper, it is evident that the deployment of LLMs in military and foreign-policy decision-making is fraught with complexities and risks that are not yet fully understood," the study says.

Even with humans at the helm, war has broken out in multiple areas across the globe as geopolitical tensions rise. The Doomsday Clock is currently at 90 seconds to midnight. Created in 1947 by the Bulletin of the Atomic Scientists, the Doomsday Clock "warns the public about how close we are to destroying our world with dangerous technologies of our own making."

"Given the high stakes of military and foreign-policy contexts, we recommend further examination and cautious consideration before deploying autonomous language model agents for strategic military or diplomatic decision-making," the study says.


About Emily Dreibelbis

Reporter

Prior to starting at PCMag, I worked in Big Tech on the West Coast for six years. From that time, I got an up-close view of how software engineering teams work, how good products are launched, and the way business strategies shift over time. After I’d had my fill, I changed course and enrolled in a master’s program for journalism at Northwestern University in Chicago. I'm now a reporter with a focus on electric vehicles and artificial intelligence.
