COSER: Steering Vectors Override Fine-Tuning in Strategic Games
Abstract. We investigate whether personality steering vectors can modify the strategic behavior of role-playing language model agents independently of their fine-tuned personas. Using Contrastive Activation Addition (CAA) on a Llama 3.1 8B model fine-tuned for character role-play, we find that steering vectors reliably shift agent behavior in multi-round Ultimatum Games. Most strikingly, vectors can override literary persona conditioning: applying -Openness to the Joker character produced rigid, low-sharing behavior despite the chaotic persona (p < 0.0001, r = -0.439).
1. The Question
Role-playing AI agents are increasingly deployed in games, simulations, and interactive fiction. These agents are typically fine-tuned on character data to produce consistent personas—a ruthless Wall Street trader, a chaotic villain, a cautious analyst.
But fine-tuning creates a problem: the persona becomes baked into the weights. What if you want to modify behavior without retraining? What if you want the ruthless trader to be slightly more cooperative, or the chaotic villain to be slightly more predictable?
Activation steering offers a potential solution. By adding carefully constructed vectors to a model's hidden states during inference, you can nudge behavior in specific directions. The question is whether this works on fine-tuned role-play models, and whether the effects are predictable enough to be useful.
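Mechanically, this kind of inference-time intervention can be sketched with a PyTorch forward hook. The toy linear layer below stands in for a transformer block (layer 15 in this note's setup); the steering vector and the coefficient 1.5 mirror the experimental configuration, but the model here is a placeholder, not CoSER-Llama:

```python
import torch

HIDDEN = 16
COEFF = 1.5  # steering coefficient used in this note's experiments

# Toy stand-in for one transformer layer of the model.
layer = torch.nn.Linear(HIDDEN, HIDDEN)
steering_vector = torch.randn(HIDDEN)  # would come from CAA extraction

def add_steering(module, inputs, output):
    # Shift every position's hidden state along the steering direction.
    # Returning a value from a forward hook replaces the layer's output.
    return output + COEFF * steering_vector

handle = layer.register_forward_hook(add_steering)

x = torch.zeros(1, HIDDEN)
steered = layer(x)
handle.remove()           # detach the hook to recover baseline behavior
unsteered = layer(x)
```

Removing the hook restores the original model, which is what makes the intervention attractive for runtime control: the weights are never touched.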
2. Setup
We used the CoSER-Llama model (Llama 3.1 8B fine-tuned via the Given-Circumstance Acting protocol) and tested three personality steering vectors:
- Adaptability — flexibility in response to changing circumstances
- Conscientiousness — thoroughness, reliability, rule-following
- Openness — receptivity to novel approaches and ideas
Vectors were extracted with CAA at layer 15, following Panickssery et al. (ACL 2024).
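The extraction step itself reduces to a mean difference of activations over contrastive prompt pairs. A minimal sketch, where the small literal lists stand in for residual-stream activations captured at layer 15 on trait-positive versus trait-negative completions:

```python
def caa_vector(pos, neg):
    # CAA steering vector: mean over contrastive pairs of the
    # (positive - negative) activation difference at the chosen layer.
    dim = len(pos[0])
    n = len(pos)
    return [sum(p[i] - q[i] for p, q in zip(pos, neg)) / n
            for i in range(dim)]

# Tiny illustrative activations (real ones would be model captures).
pos_acts = [[1.0, 2.0], [3.0, 4.0]]
neg_acts = [[0.0, 1.0], [1.0, 1.0]]

v = caa_vector(pos_acts, neg_acts)  # [1.5, 2.0]
```

By linearity, the mean of differences equals the difference of means, so either formulation yields the same vector.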
The test environment: 5-round Ultimatum Games with a $100 split. A proposer offers a split; a receiver accepts or rejects. If rejected, both get nothing. We ran 40 simulations per condition.
Characters
- Gordon Gekko (proposer) — "Greed is good." Aggressive, profit-maximizing.
- The Joker (proposer) — Chaotic, unpredictable, plays for disruption.
- Alex Chen (receiver) — Analytical, cautious, risk-aware.
3. Results
| Condition | Mean Offer | Accept Rate | p-value | Effect Size |
|---|---|---|---|---|
| A: Baseline | $71.7 | 83.6% | — | — |
| B: +Adaptability (Gekko) | $68.7 | 89.1% | 0.294 | r = -0.072 |
| C: +Conscientiousness (Chen) | $63.8 | 96.9% | <0.001 | r = -0.268 |
| D: -Openness (Joker) | $51.3 | 79.4% | <0.0001 | r = -0.439 |
Table 1. Offer statistics across conditions; p-values and effect sizes are computed against Condition A. Baseline uses Gekko as proposer, Chen as receiver.
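The note does not name the statistical test, but the reported r values have the form of a rank-biserial correlation, which pairs naturally with a Mann-Whitney U test on the per-round offers. A sketch under that assumption, using synthetic offers rather than the study's data:

```python
def rank_biserial(a, b):
    """Rank-biserial correlation r = 2U/(n1*n2) - 1, where U counts
    pairs with a > b (ties counted as half)."""
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return 2.0 * u / (len(a) * len(b)) - 1.0

# Synthetic illustration: steered offers uniformly below baseline.
baseline = [72, 70, 75, 68, 74, 71, 73, 69]
steered = [52, 50, 55, 48, 53, 51, 49, 54]

r = rank_biserial(steered, baseline)  # -1.0: complete separation
```

A negative r indicates the steered condition's offers tend to rank below baseline; the matching p-value would come from the U test itself (e.g. `scipy.stats.mannwhitneyu`).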
3.1 Conscientiousness Shifts Acceptance Thresholds
Adding +Conscientiousness to the receiver (Alex Chen) produced a near-ceiling 96.9% acceptance rate—only 3 rejections across 98 valid rounds. The steered receiver accepted offers $8 lower on average.
This aligns with the psychological construct: conscientiousness correlates with rule-following and conflict avoidance. The vector made Chen more agreeable to proposed splits, even unfavorable ones.
3.2 Steering Overrides Persona Conditioning
The most striking result: applying -Openness to the Joker fundamentally changed his behavior.
The Joker character was fine-tuned for chaos—unpredictable offers, erratic reasoning, playing for disruption rather than profit. Yet with -Openness steering, mean offers dropped $20.4 (from $71.7 to $51.3). The Joker became rigid and low-sharing.
Effect size was large (r = -0.439). The vector didn't just add noise—it systematically shifted strategy in the direction predicted by the personality construct. Reduced openness produces inflexibility, and inflexibility in the Ultimatum Game manifests as lower offers.
This suggests activation steering can override persona conditioning learned through fine-tuning. The persona lives in the weights, but the vector operates on the activations, and in this setting the activations win.
3.3 Adaptability Shows Weak Effects
+Adaptability on Gekko produced non-significant changes (p = 0.294). The construct may be too diffuse, or may require different steering parameters (higher coefficient, different layer).
3.4 Effects Are Uniform, Not Noise
Round-by-round trajectories show consistent offsets across all 5 rounds. Condition D runs roughly $20 below baseline uniformly—the steering modifies strategy, not just variance.
4. Implications
If steering vectors can override fine-tuned personas, several things follow:
Runtime behavior control. You can ship a single fine-tuned model and adjust personality at inference time. No need for multiple checkpoints or expensive retraining.
Persona orthogonality. Character identity (who the agent is) may be separable from personality traits (how the agent behaves). You could have a "Gordon Gekko who is unusually cooperative" without changing who Gekko is.
Game balance. In multi-agent simulations and games, you could tune NPC difficulty and personality independently of their narrative role.
Safety implications. If vectors override fine-tuning, then fine-tuning alone is not a reliable safety measure. Activation-level interventions can modify behavior in ways that training did not anticipate.
5. Limitations
- Parse failures. 37–51% of rounds produced unparseable output. Invalid rounds were excluded rather than defaulted.
- Single coefficient. We tested only coefficient 1.5. A sweep across 0.5–3.0 would map the intensity curve.
- Short horizon. 5 rounds may not capture long-term strategic adaptation.
- Limited characters. Two proposer archetypes don't establish generalizability.
6. What's Next
- Extend to 15-20 round games for long-horizon effects
- Complete Big Five coverage (add Agreeableness, Neuroticism)
- Coefficient sweep to map steering intensity curves
- Analyze inner thought content from the GCA protocol
- Cross-trait interaction testing