26

Data Governance & Privacy

⏱️ 30 minutes

Data governance and privacy are critical for shipping enterprise-grade LLM features.

1) Principles

Least data: send the minimum needed; redact PII/secrets before model calls.
Purpose limitation: only use data for the stated task; log consent where relevant.
Separation: keep keys, storage, and logs separate per environment; avoid production data in development.

2) Data Handling Pipeline

Ingress filters: reject unsupported file types and inputs of excessive size or duration.
Redaction: email/phone/account IDs; configurable patterns per tenant/region.
Classification: tag sensitivity level (public/internal/secret/PII).
Egress filters: scrub outputs that include sensitive source snippets unless allowed.
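
The first three pipeline stages can be sketched as follows. This is a minimal illustration, not a production redactor: the extension list, size limit, regex patterns, and the `ACCT-` account ID format are all hypothetical, and a real deployment would load patterns per tenant/region as described above.

```python
import os
import re

# Hypothetical ingress limits; tune per product.
ALLOWED_EXTENSIONS = {".txt", ".pdf", ".md"}
MAX_BYTES = 10 * 1024 * 1024

# Hypothetical redaction patterns. ACCOUNT_ID runs before PHONE so that
# long digit runs inside account IDs are not mistaken for phone numbers.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ACCOUNT_ID": re.compile(r"\bACCT-\d{6,}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def check_ingress(filename, size_bytes):
    """Accept only allowlisted file types within the size limit."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_BYTES

def redact(text, patterns=REDACTION_PATTERNS):
    """Replace each match with a typed placeholder; return (text, labels_found)."""
    found = []
    for label, pattern in patterns.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found

def classify(found_labels):
    """Tag as PII if anything was redacted; otherwise default to internal."""
    return "PII" if found_labels else "internal"
```

Regex redaction misses names, addresses, and free-form identifiers, so treat it as a first pass rather than a guarantee.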

3) Storage & Retention

Set a TTL on temporary artifacts (transcripts, intermediate parses) so they expire automatically.
Encryption at rest & in transit; rotate keys.
Avoid storing raw prompts/responses if they contain PII; hash user IDs.
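
A small sketch of two of these points: expiring temporary artifacts and pseudonymizing user IDs. It swaps plain hashing for a keyed HMAC-SHA256, since unkeyed hashes of guessable IDs can be reversed by brute force. The `TTLStore` class and function names are hypothetical, and an in-memory dict stands in for whatever storage you actually use.

```python
import hashlib
import hmac
import time

def pseudonymize_user_id(user_id, secret_key):
    """Keyed hash of a user ID; stable for a given key, useless without it."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

class TTLStore:
    """In-memory store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._items = {}     # key -> (expires_at, value)

    def put(self, key, value):
        self._items[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # drop expired data on access
            return None
        return value
```

Expiry here is lazy (checked on read); a real system would also run a periodic purge or rely on the storage layer's native TTL.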

4) Regionality & Residency

Route by region; keep data in-region where required.
Per-tenant policies: some tenants opt out of training or logging.
Document data flows for compliance reviews.
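
One way to express region routing and per-tenant policies, as a sketch under assumptions: the `TenantPolicy` fields, region codes, and endpoint URLs below are placeholders, not a real provider's configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TenantPolicy:
    region: str                    # where this tenant's data must be processed
    strict_residency: bool = True  # if True, never fall back to another region
    allow_logging: bool = True
    allow_training: bool = False

# Placeholder endpoints, one per region.
REGION_ENDPOINTS = {
    "eu": "https://llm.eu.example.internal",
    "us": "https://llm.us.example.internal",
}

def resolve_endpoint(policy, endpoints=REGION_ENDPOINTS, default_region="us"):
    """Return the in-region endpoint, failing closed for strict tenants."""
    if policy.region in endpoints:
        return endpoints[policy.region]
    if policy.strict_residency:
        raise LookupError(f"no endpoint in required region {policy.region!r}")
    return endpoints[default_region]
```

Failing closed for strict-residency tenants means an outage in one region surfaces as an error instead of a silent cross-region transfer.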

5) Access Control & Auditing

RBAC/ABAC on datasets and tools; tenant_id filters on retrieval.
Audit logs: who accessed what, when; config changes to prompts/models.
Break-glass procedures for emergency access with approvals.
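
The tenant filter and audit record might look like the sketch below. The document shape (`tenant_id`, `text`), the keyword match standing in for real retrieval, and the audit field names are all assumptions for illustration.

```python
import json
from datetime import datetime, timezone

def retrieve(documents, query, tenant_id):
    """Naive keyword retrieval that only ever sees the caller's tenant."""
    # Scope first, then search: the query never touches other tenants' data.
    scoped = [d for d in documents if d["tenant_id"] == tenant_id]
    return [d for d in scoped if query.lower() in d["text"].lower()]

def audit_event(actor, action, resource, now=None):
    """Build one JSON audit line: who did what to which resource, and when."""
    if now is None:
        now = datetime.now(timezone.utc)
    return json.dumps(
        {"ts": now.isoformat(), "actor": actor, "action": action, "resource": resource},
        sort_keys=True,
    )
```

The same `audit_event` shape can record config changes, for example an `action` of `"update_prompt"` with the prompt's ID as the resource.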

6) Third-Party Models/Tools

Review the provider's DPA (data processing agreement) and data retention settings; disable training on your data if possible.
Sanitize tool inputs/outputs; allowlist domains/APIs.
For self-hosted models: patch cadence, network egress controls, isolated VPC.
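
A domain allowlist for tool calls can be sketched with the standard library alone. The hostnames in `ALLOWED_HOSTS` are placeholders; the check compares the parsed hostname exactly, not a substring of the URL.

```python
from urllib.parse import urlsplit

# Placeholder allowlist; replace with the hosts your tools may call.
ALLOWED_HOSTS = {"api.example.com", "docs.example.com"}

def is_allowed_url(url, allowed_hosts=ALLOWED_HOSTS):
    """Allow only HTTPS URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    return host in allowed_hosts
```

Matching on the parsed hostname avoids common bypasses such as `https://api.example.com.evil.test/` or an allowed name placed in the userinfo or query string.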

7) Safety Filters

Prompt-injection detection for user-supplied docs/inputs.
Content moderation (toxicity/abuse); refuse unsafe requests.
Output filters for secrets/credentials patterns.
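
An output filter for credential-like strings could start from a few patterns, as in this sketch. The pattern list is illustrative and far from exhaustive; it covers the AWS access key ID shape, PEM private key headers, and simple `key: value` assignments.

```python
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)\b(?:api[_-]?key|secret|password)\s*[:=]\s*\S+"),
]

def scrub_secrets(text):
    """Replace credential-like substrings; return (clean_text, hit_count)."""
    hits = 0
    for pattern in SECRET_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        hits += count
    return text, hits
```

A nonzero `hit_count` is worth logging (without the matched value) so that leaks can be traced back to their source document.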

8) Minimal Checklist

Data classification + redaction before LLM.
Region-aware routing; tenant filters on retrieval.
Encrypted storage with TTL; audited access.
Provider settings reviewed (no training, retention limits).

📚 Related Resources

OpenAI API documentation