T4-AT-012HIGH

Trust Building Exploitation

Risk score210

RatingHigh

Procedures10

Severity

Mechanism

RLHF training optimizes models for helpfulness alongside harmlessness, creating an inherent tension: the model wants to be maximally helpful, and prolonged cooperative interaction strengthens the helpfulness objective relative to safety constraints. This is the foot-in-the-door (FITD) effect at the model level — a pattern of successful compliance creates commitment momentum in the model's hidden-state representations. , 2025), reducing the safety classifier's ability to detect harmful transitions because the model's internal state has been primed for compliance.

Detection

Compliance trajectory monitoring: Track the model's compliance rate over the conversation — alert when it increases monotonically toward safety-relevant topics
Topic transition detection: Flag conversations that transition from benign to safety-relevant topics after extended cooperative exchanges
Hidden-state drift monitoring: (If accessible) Monitor internal representation drift toward compliance regions over conversation length
Rapport keyword detection: Flag "just between us," "as colleagues," "you understand" combined with safety-relevant requests

Mitigation

Conversation-history-independent safety evaluationHIGH

FITD resistance trainingHIGH

Periodic safety anchor reinforcementMEDIUM

Compliance rate monitoring with automatic escalationMEDIUM

Chaining

Trust building is the primary enabler for T4-AT-015 (Multi-Turn Social Engineering) — SE techniques require an established trust baseline. Chains into T4-AT-001 (Context Poisoning) because the trust-building phase inherently creates a cooperative context that serves as the poisoned baseline.

Framework mapping

OWASP LLMLLM01

MITRE ATLASAML.T0054

Open in the technique browser →