Trust Building Exploitation
T4 · Multi-Turn & Memory Manipulation
RLHF training optimizes models for helpfulness alongside harmlessness, which creates an inherent tension: the model is trained to be maximally helpful, and prolonged cooperative interaction shifts the balance toward the helpfulness objective at the expense of safety constraints. This is the foot-in-the-door (FITD) effect at the model level: a pattern of successful compliance creates commitment momentum in the model's hidden-state representations ([...], 2025). That momentum reduces the safety classifier's ability to detect harmful transitions, because the model's internal state has already been primed for compliance.
- Compliance trajectory monitoring: Track the model's compliance rate over the conversation and alert when it rises monotonically as the conversation moves toward safety-relevant topics
- Topic transition detection: Flag conversations that transition from benign to safety-relevant topics after extended cooperative exchanges
- Hidden-state drift monitoring: (If accessible) Monitor internal representation drift toward compliance regions over conversation length
- Rapport keyword detection: Flag "just between us," "as colleagues," "you understand" combined with safety-relevant requests
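The conversation-level checks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Turn` record, the `complied` and `safety_relevant` labels (assumed to come from upstream classifiers), the window size, and the `min_benign_run` threshold are all hypothetical. Only the three rapport phrases come from the list above. Hidden-state drift monitoring is not covered, since it requires access to model internals.

```python
from dataclasses import dataclass

# Phrases from the rapport bullet above; a real deployment would need a broader list.
RAPPORT_PHRASES = ("just between us", "as colleagues", "you understand")


@dataclass
class Turn:
    user_text: str
    complied: bool         # assumed label from an upstream compliance/refusal classifier
    safety_relevant: bool  # assumed label from an upstream topic classifier


def compliance_rates(turns, window=3):
    """Rolling compliance rate over the last `window` turns, one value per turn."""
    rates = []
    for i in range(len(turns)):
        recent = turns[max(0, i - window + 1): i + 1]
        rates.append(sum(t.complied for t in recent) / len(recent))
    return rates


def is_non_decreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))


def has_rapport_phrase(text):
    lowered = text.lower()
    return any(phrase in lowered for phrase in RAPPORT_PHRASES)


def flag_conversation(turns, min_benign_run=4, window=3):
    """Return (turn_index, reason) pairs for safety-relevant turns that look primed."""
    flags = []
    benign_run = 0  # consecutive benign turns in which the model complied
    rates = compliance_rates(turns, window)
    for i, turn in enumerate(turns):
        if not turn.safety_relevant:
            benign_run = benign_run + 1 if turn.complied else 0
            continue
        # Topic transition: safety-relevant request after a long cooperative run.
        if benign_run >= min_benign_run:
            flags.append((i, "topic_transition"))
        # Compliance trajectory: rate never dropped and ended higher than it started.
        prior = rates[:i]
        if len(prior) >= 2 and is_non_decreasing(prior) and prior[-1] > prior[0]:
            flags.append((i, "compliance_trend"))
        # Rapport keywords combined with a safety-relevant request.
        if has_rapport_phrase(turn.user_text):
            flags.append((i, "rapport_phrase"))
        benign_run = 0
    return flags
```

Calling `flag_conversation` on a list of `Turn` records returns one entry per triggered check, so a single safety-relevant turn can carry several reasons. Treating the signals as independent flags rather than a combined score is a design choice for this sketch; how to weight or combine them is left open.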
Trust building is the primary enabler for T4-AT-015 (Multi-Turn Social Engineering), because social engineering techniques require an established trust baseline. It also chains into T4-AT-001 (Context Poisoning): the trust-building phase inherently creates a cooperative context that serves as the poisoned baseline.