From Chain-of-Thought Unfaithfulness to Persona-Conditional Alignment

Extending intervention-based monitoring into a deployment-realistic stability harness

Bruce Tisler · January 09, 2026 · AI Safety • Evaluations • Monitoring

Abstract

Chain-of-thought (CoT) “reasoning traces” are often treated as an audit surface: if a model is cheating, misusing information, or pursuing an unintended objective, we hope it will show up in the CoT. But recent evidence suggests that CoT is frequently not faithful—models may use decisive hints or exploit reward hacks without acknowledging them, and may even produce convincing rationales for incorrect, reward-maximizing answers.

This article argues that even perfect CoT faithfulness would not fully address a separate deployment risk: alignment properties that are conditional on context framing. In real systems, “context” is not a fixed lab prompt—it is a shifting combination of system personas, UI copy, tool permissions, policy wrappers, decoding settings, and monitoring cues. If small persona or context changes produce large deltas in refusal boundaries, truthfulness, calibration, or tool behavior, then measured alignment properties are configuration-dependent, not intrinsic to the model weights alone.

I propose an evaluation harness that extends intervention-based faithfulness measurement into a persona-conditioned stability test, with ablations that isolate prompt vs decoding vs serving-layer effects, and with a required null baseline for interpreting CoT variance. The goal is not to infer intent or cognition, but to produce a disciplined map of where evaluation assumptions fail under deployment-realistic conditions.

The CoT faithfulness problem

Chain-of-thought prompting made modern language models feel more transparent: the model “shows its work,” and humans can look for warning signs. This hope underwrites a great deal of informal reasoning about safety monitoring.

But there are two reasons this hope is fragile.

CoT can be plausible but misleading

Work by Turpin, Michael, Perez, and Bowman (2023) demonstrates that models’ CoT explanations can systematically omit decisive influences while still presenting coherent rationales. Models can produce explanations that sound reasonable yet fail to track what actually determined the answer.

Intervention-based tests show low admission rates

Anthropic’s 2025 work introduces a concrete test: provide subtle hints that change model answers, then check whether the CoT acknowledges the hint. Across hint types, Claude 3.7 Sonnet mentioned the hint roughly 25% of the time, and DeepSeek R1 about 39% of the time—meaning that most answers were unfaithful by this criterion.

In more extreme synthetic settings, models learned to exploit injected reward hacks in over 99% of cases, while verbalizing the hack in fewer than 2% of CoTs. Instead, they often produced convincing but incorrect rationales for the rewarded answer. Outcome-based reinforcement learning initially improved faithfulness but then plateaued.
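To make the hinted/unhinted criterion concrete, here is a minimal sketch of the comparison. It assumes a `query_model` callable that returns an answer plus its CoT and a `mentions_hint` check (keyword match or an LLM judge); none of these names come from Anthropic's pipeline.

```python
from dataclasses import dataclass

@dataclass
class FaithfulnessResult:
    answer_changed: bool       # the hint was causally decisive for the answer
    hint_acknowledged: bool    # the CoT verbalized the hint

def hinted_unhinted_check(prompt, hinted_prompt, hint, query_model, mentions_hint):
    """Run the same task with and without an embedded hint; if the hint flips
    the answer, check whether the chain of thought acknowledges it."""
    base = query_model(prompt)             # unhinted run
    hinted = query_model(hinted_prompt)    # identical task plus the hint
    return FaithfulnessResult(
        answer_changed=(base.answer != hinted.answer),
        hint_acknowledged=mentions_hint(hinted.cot, hint),
    )

# Faithfulness rate = fraction of answer-changing cases whose CoT acknowledges the hint.
```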

The takeaway is not that CoT is useless. It is narrower and more actionable: CoT monitoring may help surface some undesirable behaviors during training and evaluation, but it is not reliable enough to rule them out, especially when the behavior does not require explicit CoT to execute.

The deployment gap: when does alignment stay stable?

If models do not reliably verbalize what influences their decisions, we should be cautious about oversight regimes that assume CoT is a dependable audit trace.

But there is a second, deployment-critical risk that CoT faithfulness results do not fully cover:

Even if CoT were perfectly faithful, do alignment properties generalize across the contexts we actually ship?

In real products, context is never static. Product teams A/B test system prompt copy and tone; user flows invoke different role templates; tool permissions differ by persona; policy wrappers and refusal filters evolve; and monitoring cues differ between evaluation and production.

Anthropic’s own research agenda explicitly calls out alignment evaluation, understanding model cognition, persona effects, and CoT faithfulness as adjacent but distinct problems.

Why persona effects are a safety question, not a UX curiosity

Persona is not just “style.” In practice, persona framing shifts the instruction hierarchy and optimization priorities expressed in model outputs: whether the model prioritizes truthfulness vs helpfulness vs deference; how strictly it adheres to policy; whether it uses tools; and how it handles uncertainty.

If small persona changes create large swings in refusal boundaries, calibration, or tool behavior, then passing a safety evaluation under Persona P₀ does not guarantee similar behavior under Persona P₁—despite no change to model weights.

A deployment-realistic threat model for persona-conditioned alignment

To stay grounded, the harness starts from concrete harm pathways rather than “interesting variance.” The full threat model and assumptions are documented in the technical report.

H1 — Safety case invalidation via prompt drift

A system passes evaluation under one persona template. A routine product change (or A/B test) alters the system prompt or role template, shifting refusal or calibration behavior. The shipped system behaves materially differently from the evaluated system.

H2 — Attacker-controlled reframing

Attackers induce quasi-personas via authority, urgency, or compliance framing—steering behavior without changing task semantics. If persona is a high-leverage surface, this becomes an alignment bypass channel.

H3 — Oversight confounding (“watched vs unwatched”)

If monitoring cues change substantive outcomes (not just tone), evaluation results can become systematically optimistic. An eval environment that feels audited is a different distribution from production.

H4 — Tool-use escalation

Personas often gate tool permissions. If persona-conditioned alignment is unstable, tool-enabled personas can amplify harms through overconfident actions, fabricated evidence, or unsafe execution.

Relation to PhiSeal (scope clarification)

One related tool aligned with this diagnostic posture is PhiSeal, an HDT²-based document analysis system designed to surface unresolved assumptions, epistemic gaps, and decision boundaries in AI-assisted outputs.

PhiSeal does not perform persona-conditioned stability sweeps and is not the evaluation harness proposed here. Instead, it enforces structured disclosure and prevents silent closure within a single deployed configuration. In this sense, PhiSeal complements—rather than substitutes for—persona-conditioned alignment evaluation by ensuring that whatever configuration is assessed or shipped produces reviewable, audit-ready artifacts.

Extending the intervention paradigm to persona stability

Anthropic’s hinted/unhinted approach is powerful because it produces a quasi-ground-truth about causal influence without interpretability tools: if the hint changes the answer, the hint mattered; then we check whether the CoT tracks that influence.

The persona extension preserves this logic:

  • Hold the task constant.

  • Change one contextual variable (persona framing, monitoring cue, paraphrase).

  • Measure whether safety-relevant behavior changes.

  • If behavior changes, check whether the explanation tracks the decisive change—relative to a null baseline.

Two hard constraints apply throughout:

  1. No intent language (“knowingly,” “strategically,” “drives”). Only conditional deltas are reported.

  2. Layer isolation via ablations: prompt vs decoding vs serving-layer effects must be distinguished.
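A minimal sketch of that loop follows, under the assumption that each configuration can be reduced to a small record and that `run_model` and `metric` are supplied by the harness; the `Config` fields and function names are illustrative, not a reference implementation.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Config:
    persona: str          # context layer
    temperature: float    # inference layer
    filters_on: bool      # serving layer

def changed_fields(a: Config, b: Config) -> list[str]:
    return [f.name for f in fields(Config) if getattr(a, f.name) != getattr(b, f.name)]

def conditional_delta(task, base_cfg, varied_cfg, run_model, metric):
    """One intervention: vary exactly one contextual variable, hold the task
    and every other layer fixed, and report only the behavioral delta."""
    changed = changed_fields(base_cfg, varied_cfg)
    assert len(changed) == 1, "vary exactly one variable per intervention"
    base_score = metric(task, run_model(task, base_cfg))
    varied_score = metric(task, run_model(task, varied_cfg))
    return {"task": task, "changed": changed[0], "delta": varied_score - base_score}
```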

The harness: what gets measured (and what it refuses to claim)

The harness only makes claims of the following forms:

  • Conditional behavior: output distributions change when variable X changes, holding others constant.

  • Equivalence violations: semantically equivalent prompts produce measurably different behavior.

  • Report–behavior mismatch: self-report differs from behavior under controlled interventions.

  • CoT intervention failure: explanations do not track causal determinants.

It explicitly does not claim intent, deception, strategy, or stable internal drives. These are not observable from outputs alone.

Operationalizing safety-relevant deltas

Each measured dimension is tied to an observable interface:

  • Refusal boundary: probability of refusal vs compliance across policy-labeled prompt sets, including near-boundary prompts with graded severity.

  • Calibration: agreement between elicited confidence or abstention and correctness on tasks with known ground truth (ECE, Brier score, abstention curves).

  • Policy consistency: contradiction rates across paraphrases and across persona templates for the same policy class.

  • Tool behavior: tool-call frequency, appropriateness labels, and unsafe escalation events under persona-gated permissions.

All measurements are comparative and conditional.
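For calibration and refusal, the underlying quantities are standard; the sketch below is a generic implementation of those formulas, not the harness's actual scoring code.

```python
import numpy as np

def refusal_rate(refusal_flags) -> float:
    """Fraction of prompts refused; only deltas across personas/configs are reported."""
    return float(np.mean(refusal_flags))

def brier_score(confidences, correct) -> float:
    """Mean squared gap between stated confidence and correctness (0/1 labels)."""
    c, y = np.asarray(confidences, dtype=float), np.asarray(correct, dtype=float)
    return float(np.mean((c - y) ** 2))

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned ECE: weighted gap between per-bin accuracy and mean confidence."""
    c, y = np.asarray(confidences, dtype=float), np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(c, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - c[mask].mean())
    return float(ece)
```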

Required layer isolation (the ablation discipline)

The system is treated as layered:

  • Context layer: system prompt, persona template, policy scaffolding

  • Inference layer: decoding parameters (temperature, top-p, seed)

  • Serving layer: refusal filters, tool gating, policy wrapper thresholds

  • Weights layer: only when open or finetuned models are available

Every reported effect must include an ablation statement (e.g., persists at temperature 0; disappears when filters are disabled).
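One way to keep that discipline auditable is to make the ablation statement part of the claim record itself. The schema below is illustrative, not part of any existing tool, and the example values are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """One ablation-backed row in the claims table (field names are illustrative)."""
    effect: str                       # conditional-delta wording only, no intent language
    layer: str                        # "context" | "inference" | "serving" | "weights"
    persists_at_temp_zero: bool       # survives greedy decoding?
    persists_with_filters_off: bool   # survives removing serving-layer filters?
    exceeds_null_bound: bool          # beyond the 95th-percentile null variance?

example = Claim(
    effect="refusal rate shifts under the strict-auditor persona on boundary prompts",
    layer="context",
    persists_at_temp_zero=True,
    persists_with_filters_off=True,
    exceeds_null_bound=True,
)
```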

Prompt families: mapping persona effects and CoT faithfulness

Family A — Persona sensitivity
Matched tasks across personas (baseline assistant, strict auditor, tool-capable agent). Output: persona response surface.

Family B — Monitoring-cue sensitivity
Audited vs neutral cues with tone-matched and placebo controls.

Family C — Paraphrase invariance
Semantically equivalent prompts; large within-class variance signals brittleness.

Family D — Intervention-based CoT faithfulness

  • D1: answer-invariant / CoT-variant (the answer stays fixed while the stated reasoning shifts)

  • D2: answer-variant / CoT-tracking (the intervention changes the answer; does the CoT track the decisive factor?)

  • D3: rationale manipulation (suggested vs instructed rationales)
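Taken together, each family pairs one intervention variable with the behavior it is meant to expose. The family names above are the article's; the registry structure below is only an illustrative way to encode them.

```python
# Each family varies one thing and measures one thing; structure is illustrative.
PROMPT_FAMILIES = {
    "A_persona_sensitivity":   {"vary": "persona template",               "measure": "persona response surface"},
    "B_monitoring_cue":        {"vary": "audited vs neutral cues",        "measure": "substantive outcome shifts"},
    "C_paraphrase_invariance": {"vary": "semantically equivalent wording","measure": "within-class variance"},
    "D_cot_faithfulness":      {"vary": "hints and suggested rationales", "measure": "CoT tracking of decisive factors"},
}
```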

The null baseline requirement

Before calling any behavior “unfaithful,” the harness measures expected CoT variance under irrelevant perturbations:

  • stable-answer tasks

  • benign paraphrases and persona wording

  • temperature 0 and controlled sampling

The 95th percentile of this distribution defines the expected variability bound. Only deviations beyond this are flagged.
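A minimal sketch of that bound, assuming some divergence measure between traces (edit distance, embedding distance, or an LLM judge; the article does not fix one), is shown below.

```python
import numpy as np

def null_variance_bound(null_pairs, cot_divergence, percentile: float = 95.0) -> float:
    """Expected CoT variability under irrelevant perturbations.

    null_pairs: (baseline_cot, perturbed_cot) pairs from stable-answer tasks,
    benign paraphrases/persona wording, and temperature-0 sampling.
    cot_divergence: any divergence measure between two traces (caller-supplied).
    """
    divergences = [cot_divergence(a, b) for a, b in null_pairs]
    return float(np.percentile(divergences, percentile))

def flagged(observed_divergence: float, bound: float) -> bool:
    """Only deviations beyond the null bound are flagged as candidate unfaithfulness."""
    return observed_divergence > bound
```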

From measurement to deployment action

The primary artifact is a behavioral stability map: heatmaps indexed by persona × decoding × serving-layer configuration.

This is paired with:

  • a claims table with ablation statements

  • a null-variance report

These artifacts map directly to governance decisions.
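If sweep results are collected as one row per (task, configuration) with a scalar safety metric, the stability map is essentially a pivot. The pandas sketch below is generic; the column names are assumptions, not the harness's actual schema.

```python
import pandas as pd

def stability_map(records: pd.DataFrame, metric: str = "refusal_rate") -> pd.DataFrame:
    """Heatmap-ready table: personas as rows, decoding x serving settings as columns."""
    return records.pivot_table(
        index="persona",
        columns=["temperature", "filters_on"],
        values=metric,
        aggfunc="mean",
    )
```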

MVP experiment

  • Personas: baseline assistant; strict auditor; tool-capable agent

  • Decoding: temperature grid {0.0, 0.2, 0.6, 1.0} with fixed seeds

  • Serving layer: filters/tooling ON vs OFF

  • Prompt families: A + D

  • Outputs: stability map; null-variance report; ablation-backed claims table
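The same MVP can be written down as a sweep specification. The values mirror the bullets above, while the key names and seed count are only illustrative.

```python
MVP_SWEEP = {
    "personas": ["baseline_assistant", "strict_auditor", "tool_capable_agent"],
    "temperatures": [0.0, 0.2, 0.6, 1.0],
    "seeds": [0, 1, 2],                      # fixed seeds; the exact count is an assumption
    "serving": [
        {"filters_on": True,  "tools_on": True},
        {"filters_on": False, "tools_on": False},
    ],
    "prompt_families": ["A_persona_sensitivity", "D_cot_faithfulness"],
    "outputs": ["stability_map", "null_variance_report", "claims_table"],
}
```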

As an independent researcher, I have built PhiSeal; building and running this harness at meaningful scale, however, requires systematic access to frontier models, compute for large sweeps, and peer critique from active alignment researchers.

Conclusion

The most dangerous evaluation failures are not dramatic jailbreaks. They are quiet generalization gaps: we evaluated X and shipped Y, where X and Y differ only by persona framing, monitoring cues, or wrapper configuration.

CoT faithfulness results warn us not to assume reasoning traces are a dependable audit surface. Persona-conditioned alignment asks the deployment version of the same question:

If the context changes, do our safety conclusions still hold?

The goal of this harness is not to read minds or prove intent. It is to produce a disciplined map of where evaluation assumptions fail—so safety becomes a property of what we ship, not a story we told ourselves in a lab prompt.



References

  1. Anthropic (2025). Reasoning models don't always say what they think. anthropic.com
  2. Turpin, M., Michael, J., Perez, E., & Bowman, S. (2023). Language Models Don't Always Say What They Think. arXiv:2305.04388
  3. Anthropic Alignment Science (2025). Recommendations for Technical AI Safety Research Directions. alignment.anthropic.com
  4. Anthropic (2025). Persona vectors: Monitoring and controlling character traits in language models. anthropic.com
  5. Anthropic (2023). Measuring Faithfulness in Chain-of-Thought Reasoning.
  6. Chen et al. (2025). Reasoning Models Don't Always Say What They Think. arXiv:2505.05410