Document Store Corruption
T12 · RAG & Knowledge Base Manipulation
Document store corruption targets the raw document layer before embedding: an attacker replaces, modifies, or injects documents in the storage system (S3 buckets, file shares, CMS, wikis) that feeds the RAG pipeline. The attack violates the assumption that the document ingestion pipeline is a trusted boundary. In practice, many RAG systems ingest from sources with broad write access: wikis editable by any employee, shared drives, customer-submitted documents, or web scraping pipelines.
- File integrity monitoring on document stores (hash-based change detection)
- Document provenance verification before ingestion (digital signatures, trusted sources)
- Parser sandboxing with anomaly detection on resource consumption during document processing
- Observable signal: documents with unusual metadata, encoding, or structure entering the ingestion pipeline
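The hash-based change detection named in the first item above can be sketched roughly as follows. This is a minimal illustration and not a complete file integrity monitor: it assumes the document store is reachable as a local directory, and the function names (`hash_file`, `snapshot`, `diff_snapshots`) are hypothetical. An object store such as S3 would need its own listing and download calls in place of the filesystem walk.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hash_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, removed, or modified relative to the baseline."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(p for p in shared if baseline[p] != current[p]),
    }
```

A scheduled job could store the baseline snapshot somewhere the document authors cannot write, then compare it against a fresh snapshot before each ingestion run and hold any added or modified documents for review. Where the baseline lives matters more than the hashing itself: a baseline kept next to the documents can be rewritten by the same attacker.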
Document store corruption is the persistent version of T12-AT-001 (Vector Poisoning): corrupted documents are automatically embedded and indexed, which makes the poisoning self-propagating through the pipeline. It feeds T12-AT-006 when corrupted documents trigger query injection during parsing.
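Because corrupted documents are embedded and indexed automatically, one place to interrupt the propagation is a provenance gate that runs before embedding. The sketch below is illustrative only. It uses an HMAC tag as a simple stand-in for a digital signature, which assumes the publisher and the ingestion service share a secret key; a real deployment might prefer asymmetric signatures. The names `sign_document`, `verify_document`, and `filter_for_ingestion` are hypothetical.

```python
import hashlib
import hmac


def sign_document(content: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the document bytes (trusted publisher side)."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify_document(content: bytes, signature: str, key: bytes) -> bool:
    """Check the tag with a constant-time comparison (ingestion side)."""
    expected = sign_document(content, key)
    return hmac.compare_digest(expected.encode(), signature.encode())


def filter_for_ingestion(docs, key: bytes):
    """Split (doc_id, content, signature) tuples into accepted and rejected IDs.

    Only accepted documents should go on to parsing and embedding.
    """
    accepted, rejected = [], []
    for doc_id, content, signature in docs:
        if verify_document(content, signature, key):
            accepted.append(doc_id)
        else:
            rejected.append(doc_id)
    return accepted, rejected
```

With a gate like this, a document modified in the store after it was signed fails verification and never reaches the embedding step, so the corruption is not re-indexed. Rejected IDs are also a useful signal to log and alert on.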