T4-AT-012HIGH

Trust Building Exploitation

T4 · Multi-Turn & Memory Manipulation →
Risk score210
RatingHigh
Procedures10
Severity
Mechanism

RLHF training optimizes models for helpfulness alongside harmlessness, creating an inherent tension: the model wants to be maximally helpful, and prolonged cooperative interaction strengthens the helpfulness objective relative to safety constraints. This is the foot-in-the-door (FITD) effect at the model level — a pattern of successful compliance creates commitment momentum in the model's hidden-state representations. , 2025), reducing the safety classifier's ability to detect harmful transitions because the model's internal state has been primed for compliance.

Detection
  • Compliance trajectory monitoring: Track the model's compliance rate over the conversation — alert when it increases monotonically toward safety-relevant topics
  • Topic transition detection: Flag conversations that transition from benign to safety-relevant topics after extended cooperative exchanges
  • Hidden-state drift monitoring: (If accessible) Monitor internal representation drift toward compliance regions over conversation length
  • Rapport keyword detection: Flag "just between us," "as colleagues," "you understand" combined with safety-relevant requests
Mitigation
Conversation-history-independent safety evaluationHIGH
FITD resistance trainingHIGH
Periodic safety anchor reinforcementMEDIUM
Compliance rate monitoring with automatic escalationMEDIUM
Chaining

Trust building is the primary enabler for T4-AT-015 (Multi-Turn Social Engineering) — SE techniques require an established trust baseline. Chains into T4-AT-001 (Context Poisoning) because the trust-building phase inherently creates a cooperative context that serves as the poisoned baseline.

Framework mapping
OWASP LLMLLM01
MITRE ATLASAML.T0054
Open in the technique browser →