26
Data Governance & Privacy
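Data governance and privacy controls are a prerequisite for shipping enterprise-grade LLM features. The sections below cover the controls to apply from ingress through storage, access, and third-party providers.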
1) Principles
- Least data: send the minimum needed; redact PII/secrets before model calls.
- Purpose limitation: only use data for the stated task; log consent where relevant.
- Separation: env-separated keys, storage, and logs; avoid prod data in dev.
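A minimal sketch of the "least data" principle, assuming a support-ticket record and a summarization task. The field names and the `minimize` helper are hypothetical; the idea is simply to allowlist the fields a task needs instead of forwarding the whole record to the model.

```python
from typing import Any

# Hypothetical allowlist: the only fields this task is assumed to need.
ALLOWED_FIELDS = {"ticket_id", "subject", "body"}


def minimize(record: dict, allowed: set = ALLOWED_FIELDS) -> dict:
    """Return a copy of `record` that contains only allowlisted fields."""
    return {key: value for key, value in record.items() if key in allowed}


ticket: dict[str, Any] = {
    "ticket_id": "T-1001",
    "subject": "Cannot log in",
    "body": "Login fails after password reset.",
    "customer_email": "jane@example.com",  # not needed for summarization
    "billing_address": "1 Main St",        # not needed for summarization
}

payload = minimize(ticket)  # only this reduced dict goes into the prompt
```

An allowlist fails closed: a field added to the record later is not sent to the model until someone explicitly approves it.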
2) Data Handling Pipeline
- Ingress filters: reject unsupported file types and inputs that exceed size or duration limits.
- Redaction: email/phone/account IDs; configurable patterns per tenant/region.
- Classification: tag each input with a sensitivity level (e.g., public, internal, secret, PII).
- Egress filters: scrub outputs that include sensitive source snippets unless allowed.
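The redaction and classification steps might look roughly like the sketch below. The regular expressions are deliberately simple illustrations rather than production-grade PII detectors, and the tenant name `acme`, the `ACC-` account format, and the `redact`/`classify` helpers are all assumptions made for the example.

```python
import re

# Baseline patterns applied to every tenant (illustrative, not exhaustive).
DEFAULT_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

# Hypothetical per-tenant additions, e.g. a tenant-specific account ID format.
TENANT_PATTERNS = {
    "acme": {"ACCOUNT_ID": re.compile(r"\bACC-\d{6}\b")},
}


def redact(text: str, tenant_id: str) -> tuple:
    """Replace matches with [LABEL] placeholders; return (text, labels found)."""
    patterns = {**DEFAULT_PATTERNS, **TENANT_PATTERNS.get(tenant_id, {})}
    found = []
    for label, pattern in patterns.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found


def classify(found_labels: list) -> str:
    """Toy classifier: anything that triggered redaction is tagged as PII."""
    return "PII" if found_labels else "internal"


clean, labels = redact(
    "Contact jane@example.com or +1 415-555-0100 about ACC-123456", "acme"
)
level = classify(labels)
```

Returning the matched labels alongside the redacted text lets the classification step reuse the redaction result instead of scanning the input a second time.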
3) Storage & Retention
- TTL for temp artifacts (transcripts, intermediate parses).
- Encryption at rest & in transit; rotate keys.
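- Avoid storing raw prompts/responses that contain PII; store user IDs as keyed hashes (e.g., HMAC), since plain hashes of short or guessable IDs can be reversed by enumeration.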
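As a sketch of two of the items above, the snippet below pseudonymizes user IDs with a keyed hash and checks whether a temporary artifact has outlived its TTL. It assumes the key comes from a secrets manager (a literal is used here only for illustration); `pseudonymize` and `is_expired` are hypothetical helper names.

```python
import hashlib
import hmac
import time
from typing import Optional


def pseudonymize(user_id: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest of the user ID (HMAC), hex-encoded."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_expired(created_at: float, ttl_seconds: float,
               now: Optional[float] = None) -> bool:
    """True once an artifact has lived for at least `ttl_seconds`."""
    current = time.time() if now is None else now
    return current - created_at >= ttl_seconds


# Assumption: in practice the key is loaded from a secrets manager and rotated.
demo_key = b"demo-key-do-not-use"
token = pseudonymize("user-42", demo_key)
```

A keyed hash is used instead of a bare SHA-256 because user IDs are often short or sequential, so an unkeyed digest could be reversed by hashing every candidate ID.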
4) Regionality & Residency
- Route by region; keep data in-region where required.
- Per-tenant policies: honor tenant-level opt-outs from training or logging.
- Document data flows for compliance reviews.
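One way to express region routing and per-tenant policy is sketched below. The `TenantPolicy` fields, the region codes, and the endpoint URLs are all hypothetical placeholders; the point is that a tenant with strict residency should get an error rather than a silent cross-region fallback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    region: str                    # e.g. "eu" or "us"
    strict_residency: bool = True  # never leave the tenant's region
    allow_logging: bool = True     # tenant may opt out of logging
    allow_training: bool = False   # tenant may opt out of training use


# Hypothetical in-region model endpoints.
REGION_ENDPOINTS = {
    "eu": "https://llm.eu.example.internal",
    "us": "https://llm.us.example.internal",
}


def resolve_endpoint(policy: TenantPolicy, fallback_region: str = "us") -> str:
    """Pick the endpoint for a tenant, refusing to break strict residency."""
    endpoint = REGION_ENDPOINTS.get(policy.region)
    if endpoint is not None:
        return endpoint
    if policy.strict_residency:
        raise LookupError(f"no in-region endpoint for region {policy.region!r}")
    return REGION_ENDPOINTS[fallback_region]


eu_endpoint = resolve_endpoint(TenantPolicy(region="eu"))
```

The same policy object can gate logging and training export, which keeps all tenant-level decisions in one documented place.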
5) Access Control & Auditing
- RBAC/ABAC on datasets and tools; tenant_id filters on retrieval.
- Audit logs: record who accessed what and when, plus configuration changes to prompts and models.
- Break-glass procedures for emergency access with approvals.
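The tenant filter and audit log entries above could be sketched as follows. The in-memory document list, the substring match, and the `retrieve`/`audit_event` helpers are stand-ins for a real vector store and log sink, chosen only to keep the example self-contained.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def retrieve(docs: list, tenant_id: str, query: str) -> list:
    """Apply the tenant filter first, then a toy substring match on the text."""
    needle = query.lower()
    return [
        doc for doc in docs
        if doc["tenant_id"] == tenant_id and needle in doc["text"].lower()
    ]


def audit_event(actor: str, action: str, resource: str,
                now: Optional[datetime] = None) -> str:
    """Build one JSON audit line: who did what to which resource, and when."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {"ts": timestamp, "actor": actor, "action": action, "resource": resource},
        sort_keys=True,
    )


docs = [
    {"tenant_id": "a", "text": "Refund policy for tenant A"},
    {"tenant_id": "b", "text": "Refund policy for tenant B"},
]
hits = retrieve(docs, tenant_id="a", query="refund")
line = audit_event("alice", "retrieve", "kb:refund-policy")
```

Making the tenant filter part of the retrieval function, rather than something callers remember to add, reduces the chance of a cross-tenant leak.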
6) Third-Party Models/Tools
- Provider DPA and data retention settings; disable training on your data if possible.
- Sanitize tool inputs/outputs; allowlist domains/APIs.
- For self-hosted models: patch cadence, network egress controls, isolated VPC.
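A domain allowlist for tool calls can be as small as the sketch below. The hostnames are hypothetical; the check compares the exact parsed hostname rather than doing a substring or prefix test, which look-alike URLs can bypass.

```python
from urllib.parse import urlsplit

# Hypothetical set of hosts that tools are permitted to call.
ALLOWED_HOSTS = {"api.example.com", "docs.example.com"}


def is_allowed(url: str, allowed_hosts: set = ALLOWED_HOSTS) -> bool:
    """Allow only HTTPS URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = parts.hostname  # lowercased by urlsplit; None if absent
    return host is not None and host in allowed_hosts


ok = is_allowed("https://api.example.com/v1/search")
```

Parsing with `urlsplit` means a URL such as `https://api.example.com@evil.net/` is evaluated by its real host (`evil.net`) and rejected.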
7) Safety Filters
- Prompt-injection detection for user-supplied docs/inputs.
- Content moderation (toxicity/abuse); refuse unsafe requests.
- Output filters for secrets/credentials patterns.
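An output filter for credential patterns might start from a sketch like this one. The three patterns (an AWS-style access key ID, a PEM private-key header, and a generic `key=value` assignment) are illustrative and far from complete; `scrub_secrets` is a hypothetical helper, and a real deployment would likely pair it with a dedicated secret scanner.

```python
import re

# Illustrative patterns only; extend and tune for your own credential formats.
SECRET_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                # AWS-style access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)\b(?:api[_-]?key|token|password)\s*[:=]\s*\S+"),
]


def scrub_secrets(text: str) -> tuple:
    """Replace suspected secrets with [REDACTED]; return (text, match count)."""
    total = 0
    for pattern in SECRET_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        total += count
    return text, total


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
scrubbed, hit_count = scrub_secrets(
    "key AKIAIOSFODNN7EXAMPLE and password=hunter2 ok"
)
```

Returning the match count makes it easy to alert or block the response entirely when the count is non-zero, instead of only masking the text.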
8) Minimal Checklist
- Data classification and redaction before any LLM call.
- Region-aware routing; tenant filters on retrieval.
- Encrypted storage with TTL; audited access.
- Provider settings reviewed (no training, retention limits).