Customer-Managed Encryption Keys (BYOK / CMEK) — Chat Prompts
If your B2B SaaS is starting to lose enterprise deals because of "we need customer-managed keys" in security questionnaires, you're hitting one of the highest-friction features to ship in any SaaS: BYOK (Bring Your Own Key) or CMEK (Customer-Managed Encryption Keys). Enterprise security teams want assurance that — even if your company gets compromised, even if your DBAs go rogue, even if your subprocessors leak data — the customer's most sensitive data remains cryptographically inaccessible because the customer holds the key. Without CMEK, you'll lose maybe 10-20% of enterprise deals to your better-prepared competitor; with CMEK, you unlock those deals AND a 20-40% per-seat premium on the BYOK tier.
The naive shape: "We encrypt data at rest with our own keys; isn't that good enough?" — No. Default cloud-provider encryption (RDS encryption, S3 SSE) protects against media-theft only; your application code and DBAs all have decrypted access. Real CMEK ties decryption to a key the CUSTOMER controls in their own KMS (AWS KMS, Azure Key Vault, GCP KMS, HashiCorp Vault). When the customer revokes the key, your app can no longer decrypt their data — even though you still hold the ciphertext. That's the security guarantee enterprise buyers actually want.
This is a hard feature to design correctly. Get it wrong and you ship something that's BYOK in name only ("we hold the key but the customer can rotate it"), which fails real security review. This chat walks through implementing genuine BYOK with envelope encryption, KMS integration patterns, key rotation, key revocation, monitoring, and operational realities.
What you're building
- Envelope encryption pattern (per-tenant DEK encrypted by customer KEK)
- Integration with AWS KMS, Azure Key Vault, GCP KMS (and optionally HashiCorp Vault)
- Per-tenant key configuration UI for enterprise customers
- Decrypt path that fetches DEK from customer KMS on demand
- DEK caching (with appropriate security tradeoffs)
- Key rotation support (customer-initiated)
- Key revocation handling (graceful degradation when customer revokes access)
- Audit log of key operations (visible to customer)
- Multi-region considerations
- Performance + caching strategy for high-throughput paths
- Tier-up sales motion (BYOK is a paid feature; this is the upgrade path)
1. Decide what scope of "CMEK" you actually mean
Before any code, agree on what scope you're building.
The term "BYOK" / "CMEK" / "customer-managed keys" gets used to mean different things. Clarify:
LEVEL 0 — Cloud-provider managed (NOT BYOK; the baseline)
- AWS RDS encryption with the default AWS managed key (aws/rds), S3 SSE-S3
- Customer has no key control
- Your team can decrypt anything
- Standard for non-enterprise SaaS; usually insufficient for enterprise
LEVEL 1 — Vendor-managed customer-specific keys (NOT real BYOK)
- You generate one KMS key per customer in YOUR cloud account
- Customer can "rotate" via your UI but doesn't control the key
- Marginally better than Level 0; some compliance frames accept it
- BUT: doesn't pass real enterprise security review
LEVEL 2 — Customer-side KMS (real BYOK / CMEK)
- Customer creates a key in THEIR cloud account (AWS KMS / Azure / GCP)
- Customer grants your IAM principal limited Decrypt + GenerateDataKey access
- You wrap per-tenant DEKs with customer's KEK
- If customer revokes IAM grant: you cannot decrypt — even with full database access
- This is what "real BYOK" means in 2026
LEVEL 3 — Customer-hosted HSM (Hardware Security Module)
- AWS CloudHSM / Azure Dedicated HSM / on-prem HSM
- Same model as Level 2 but with stricter physical custody
- Required for very few customers (defense, finance high-tier)
LEVEL 4 — Customer-side encryption (E2E)
- Encryption happens client-side; you never see plaintext
- Severely limits product capability (no server-side search, indexing, reports)
- Only right for password managers, vaults, narrow products
DEFAULT FOR B2B SaaS IN 2026:
- Default tier: Level 0 (AWS-managed)
- Enterprise tier: Level 2 (customer KMS via AWS KMS / Azure Key Vault / GCP KMS)
- Optional add-on: Level 3 (CloudHSM) for defense/finance customers
What you SHIP first (this chat):
- Level 2: customer-side KMS integration, AWS KMS as primary, Azure + GCP as fast-follows
- BYOK as a paid enterprise feature (NOT default for all customers)
- Per-tenant scope (one customer = one KEK; one tenant = one DEK)
Output: a clear scope statement that engineering, sales, and security agree on, so that "we said we'd ship BYOK" cannot mean different things to different teams.
2. Design envelope encryption (the foundational pattern)
Envelope encryption is the canonical CMEK pattern. Understand it before writing code.
Core idea:
- Each piece of data is encrypted with a Data Encryption Key (DEK) — a randomly generated symmetric key
- The DEK is encrypted by a Key Encryption Key (KEK) — the customer-managed key in customer's KMS
- The wrapped DEK is stored alongside the ciphertext
- To decrypt: fetch wrapped DEK from your DB → unwrap via customer KMS → use DEK to decrypt data
Why two layers?
- Performance: KMS calls are slow (~10-50ms) and expensive ($0.03 per 10K)
- DEKs are fast (AES-GCM in process); KEK calls are infrequent
- Rotation: change KEK without re-encrypting all data (just re-wrap DEKs)
- Compromise containment: if a DEK leaks, only that scope is compromised
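The two layers can be seen end to end in a small local sketch. It is only an illustration: `localKek` is a stand-in for the customer's KMS key (in the real design the KEK never leaves the customer's KMS and wrap/unwrap are remote calls), and the helper names `seal`, `unseal`, `envelopeEncrypt`, and `envelopeDecrypt` are hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto'

type Sealed = { iv: Buffer; tag: Buffer; ciphertext: Buffer }

// AES-256-GCM helper used for both layers in this sketch.
function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12) // fresh 12-byte IV per encryption
  const cipher = createCipheriv('aes-256-gcm', key, iv)
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()])
  return { iv, tag: cipher.getAuthTag(), ciphertext }
}

function unseal(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv('aes-256-gcm', key, sealed.iv)
  decipher.setAuthTag(sealed.tag)
  return Buffer.concat([decipher.update(sealed.ciphertext), decipher.final()])
}

// ASSUMPTION: a local key standing in for the customer's KEK.
const localKek = randomBytes(32)

function envelopeEncrypt(plaintext: Buffer): { data: Sealed; wrappedDek: Sealed } {
  const dek = randomBytes(32)            // per-scope data encryption key
  const data = seal(dek, plaintext)      // layer 1: data under the DEK
  const wrappedDek = seal(localKek, dek) // layer 2: DEK under the KEK
  dek.fill(0)                            // best-effort wipe of the plaintext DEK
  return { data, wrappedDek }
}

function envelopeDecrypt(stored: { data: Sealed; wrappedDek: Sealed }): Buffer {
  const dek = unseal(localKek, stored.wrappedDek) // the KMS Decrypt call in production
  try {
    return unseal(dek, stored.data)
  } finally {
    dek.fill(0)
  }
}
```

Only `wrappedDek` and `data` are stored. Re-wrapping `wrappedDek` under a new KEK leaves `data` untouched, which is why KEK rotation does not require re-encrypting records.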
Schema for per-tenant envelope encryption:
tenant_keys (
  tenant_id              uuid pk
  customer_kms_key_arn   text         -- the KEK; customer's ARN in their KMS
  customer_kms_provider  text         -- 'aws_kms' | 'azure_kv' | 'gcp_kms' | 'hashicorp_vault'
  customer_kms_region    text         -- where the KEK lives
  customer_kms_account   text         -- customer's AWS account / Azure tenant
  current_dek_id         uuid         -- which DEK is currently active for new writes
  status                 text         -- 'active' | 'pending_grant' | 'revoked' | 'failed'
  created_at             timestamptz
  rotated_at             timestamptz
  revoked_at             timestamptz
)
tenant_deks (
  id                     uuid pk
  tenant_id              uuid not null
  wrapped_dek            bytea not null  -- DEK encrypted by customer KEK
  wrap_kms_key_id        text         -- which version/alias of customer's KEK was used to wrap
  algorithm              text         -- 'AES-256-GCM' (default)
  status                 text         -- 'active' | 'retired' | 'compromised'
  created_at             timestamptz
  retired_at             timestamptz
)
UNIQUE INDEX ON tenant_deks (tenant_id) WHERE status = 'active'  -- one active DEK per tenant
encrypted_records (
  id                     uuid pk
  tenant_id              uuid not null
  dek_id                 uuid not null references tenant_deks(id)  -- which DEK encrypted this
  ciphertext             bytea not null
  iv                     bytea not null  -- 12 bytes for GCM
  auth_tag               bytea not null  -- 16 bytes from AEAD
  encrypted_fields       text[]       -- which fields are encrypted (for partial encryption)
  -- ... other fields like resource_type, resource_id ...
)
Note: this is a row-level CMEK pattern. For an "all data per tenant in one schema" model, you encrypt either at the field level or at the database TDE level. These are different patterns; this chat focuses on field-level encryption, which is the most common.
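As a rough companion to the schema, the row shapes might be mirrored in TypeScript like this. The field names follow the tables above; the use of `Uint8Array` (which Node's `Buffer` extends) and the `assertGcmShape` helper are assumptions added for illustration.

```typescript
type KmsProvider = 'aws_kms' | 'azure_kv' | 'gcp_kms' | 'hashicorp_vault'

interface TenantKey {
  tenant_id: string
  customer_kms_key_arn: string
  customer_kms_provider: KmsProvider
  customer_kms_region: string
  customer_kms_account: string
  current_dek_id: string | null
  status: 'active' | 'pending_grant' | 'revoked' | 'failed'
}

interface TenantDek {
  id: string
  tenant_id: string
  wrapped_dek: Uint8Array // DEK encrypted by the customer KEK
  wrap_kms_key_id: string | null
  algorithm: 'AES-256-GCM'
  status: 'active' | 'retired' | 'compromised'
}

interface EncryptedRecord {
  id: string
  tenant_id: string
  dek_id: string
  ciphertext: Uint8Array
  iv: Uint8Array       // 12 bytes for GCM
  auth_tag: Uint8Array // 16 bytes from AEAD
  encrypted_fields: string[]
}

// Hypothetical guard: reject rows whose IV or tag cannot be valid AES-GCM values.
function assertGcmShape(record: Pick<EncryptedRecord, 'iv' | 'auth_tag'>): void {
  if (record.iv.length !== 12) {
    throw new Error(`expected 12-byte IV, got ${record.iv.length}`)
  }
  if (record.auth_tag.length !== 16) {
    throw new Error(`expected 16-byte auth tag, got ${record.auth_tag.length}`)
  }
}
```

Checking sizes at the persistence boundary catches truncated columns early, before a GCM authentication failure hides the real cause.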
What gets encrypted:
- PII fields (email, name, phone, address)
- Sensitive content (document bodies, message content)
- Credentials / API keys customers store in your product
- Anything customer security questionnaire flags as "sensitive"
What does NOT get encrypted:
- IDs (you need to query by these; encrypted = no index)
- Timestamps (operational metadata)
- Tenant + workspace identifiers (multi-tenancy infrastructure)
- Public-facing fields (avatars, public usernames, etc.)
- Search indexes (a plaintext index over an encrypted field defeats the encryption, so do not index encrypted fields; the exception is the searchable-encryption variant below)
Searchable encryption (advanced):
- For fields needing search: deterministic encryption (same plaintext → same ciphertext) with HMAC-derived keys per tenant
- Tradeoff: pattern leaks (frequency analysis possible)
- Use only when search is mandatory; otherwise plain encryption
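One common way to get equality search without storing plaintext is a keyed "blind index": an HMAC of the normalized value, stored in its own column next to the randomized ciphertext. Note this is a keyed hash rather than deterministic encryption proper, but it carries the same frequency-leak tradeoff. The sketch below assumes a per-tenant `indexKey` that is itself protected under the customer's KEK; the function name and the normalization rule are illustrative.

```typescript
import { createHmac, randomBytes } from 'crypto'

// Same (key, value) always yields the same token, so the column can be indexed
// and matched with equality. Different tenants use different keys.
function blindIndex(indexKey: Buffer, value: string): string {
  const normalized = value.trim().toLowerCase() // ASSUMPTION: case-insensitive match
  return createHmac('sha256', indexKey).update(normalized, 'utf8').digest('hex')
}

// Hypothetical usage: two tenants, each with its own index key.
const tenantAIndexKey = randomBytes(32)
const tenantBIndexKey = randomBytes(32)
const tokenA = blindIndex(tenantAIndexKey, 'alice@example.com')
const tokenB = blindIndex(tenantBIndexKey, 'alice@example.com')
```

A lookup computes `blindIndex(key, searchTerm)` and queries the token column. Only exact matches work; prefix or substring search is not possible with this approach.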
Implement:
1. Schema migration
2. Type-safe encrypted-field wrappers in TypeScript/your language
3. The decision matrix of "which fields encrypt"
4. The deterministic-encryption variant for searchable fields (with caveats documented)
Output: a schema and pattern that scales without re-architecture.
3. Integrate with AWS KMS (primary path)
Now implement AWS KMS integration — the primary CMEK provider in 2026.
Customer setup (what they do):
1. Customer creates a customer managed KMS key (formerly called a Customer Master Key, CMK) in their AWS account
2. Customer attaches a key policy granting:
- Your AWS IAM role: kms:Decrypt + kms:GenerateDataKey + kms:DescribeKey
- Their own admins: kms:* (full)
3. Customer provides you the KMS key ARN: arn:aws:kms:us-east-1:CUSTOMER:key/UUID
4. You verify access via DescribeKey
5. You begin encrypting their data
Server-side flow:
A. On tenant CMEK setup (one-time):
- Customer provides KMS key ARN
- You call DescribeKey via your IAM role's STS-assumed access
- Verify: key exists, you have GenerateDataKey + Decrypt permissions
- Store the ARN; mark tenant_keys.status = 'active'
B. On first encryption for a tenant:
- Generate a fresh DEK (call kms.GenerateDataKey on customer's KEK)
- GenerateDataKey returns: { Plaintext, CiphertextBlob } where Plaintext is the DEK and CiphertextBlob is the wrapped DEK
- Store CiphertextBlob in tenant_deks.wrapped_dek
- Use Plaintext DEK to encrypt the data (AES-GCM)
- DROP the Plaintext DEK from memory as soon as it is no longer needed (right after encryption if you do not cache; otherwise when its cache entry expires)
C. On decryption:
- Look up wrapped_dek in tenant_deks
- Call kms.Decrypt(CiphertextBlob) on customer's KEK
- Returns: { Plaintext } — the unwrapped DEK
- Use to decrypt the data
- Cache the unwrapped DEK in memory (with appropriate TTL)
- DROP DEK from memory after cache TTL expires
D. On tenant CMEK revocation:
- Customer revokes your IAM grant on their KEK
- Next decrypt attempt fails (KMS returns AccessDeniedException, or DisabledException if the key itself was disabled)
- Your app must handle gracefully: surface "Data unavailable; contact admin" rather than crash
- Cached DEKs become a security question (see caching section)
Code:
import { KMSClient, GenerateDataKeyCommand, DecryptCommand } from '@aws-sdk/client-kms'
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto'

// db, dekCache, getActiveDek, getKmsClientForTenant, the record types, and the
// error classes are application helpers defined elsewhere.

async function encryptForTenant(tenantId: string, plaintext: Buffer): Promise<EncryptedRecord> {
  const tenantKey = await db.tenantKeys.findById(tenantId)
  if (!tenantKey || tenantKey.status !== 'active') throw new TenantKeyNotAvailableError()

  // Get or create the active DEK (unwrapped) for this tenant
  const dek = await getActiveDek(tenantId, tenantKey)

  // Encrypt with the DEK using AES-256-GCM; the 12-byte IV must be fresh per record
  const iv = randomBytes(12)
  const cipher = createCipheriv('aes-256-gcm', dek.plaintext, iv)
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()])
  const authTag = cipher.getAuthTag()

  return {
    dek_id: dek.id,
    ciphertext,
    iv,
    auth_tag: authTag,
  }
}

async function decryptForTenant(tenantId: string, record: EncryptedRecord): Promise<Buffer> {
  const tenantKey = await db.tenantKeys.findById(tenantId)
  if (!tenantKey) throw new TenantKeyNotConfigured()
  if (tenantKey.status === 'revoked') throw new TenantKeyRevokedError()

  // Get the DEK that encrypted this record
  const dekRecord = await db.tenantDeks.findById(record.dek_id)
  if (!dekRecord) throw new DekNotFoundError()
  // Never unwrap a DEK that belongs to a different tenant
  if (dekRecord.tenant_id !== tenantId) throw new DekNotFoundError()

  const dek = await unwrapDek(tenantKey, dekRecord)

  // decipher.final() throws if the auth tag does not verify (tampered or wrong key)
  const decipher = createDecipheriv('aes-256-gcm', dek.plaintext, record.iv)
  decipher.setAuthTag(record.auth_tag)
  const plaintext = Buffer.concat([decipher.update(record.ciphertext), decipher.final()])
  return plaintext
}

async function unwrapDek(tenantKey: TenantKey, dekRecord: TenantDek): Promise<{ plaintext: Buffer }> {
  // Cache check (see next section); the cache stores the raw DEK bytes
  const cached = dekCache.get(dekRecord.id)
  if (cached) return { plaintext: cached }

  const kms = await getKmsClientForTenant(tenantKey) // STS assumed role for cross-account
  const result = await kms.send(new DecryptCommand({
    CiphertextBlob: dekRecord.wrapped_dek,
    KeyId: tenantKey.customer_kms_key_arn,
  }))
  if (!result.Plaintext) throw new KmsDecryptError()

  const dek = { plaintext: Buffer.from(result.Plaintext) }
  dekCache.set(dekRecord.id, dek.plaintext, 5 * 60 * 1000) // 5 min default
  return dek
}
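The code above calls `getActiveDek` without showing it. One possible shape, written with injected dependencies so it can run without a database or KMS, is sketched below. The `DekStore` and `KeyWrapper` interfaces are hypothetical stand-ins for the `tenant_deks` table and the customer's KEK, and the signature differs from the call above for that reason.

```typescript
import { randomUUID } from 'crypto'

interface DekRow { id: string; tenantId: string; wrappedDek: Buffer; status: 'active' | 'retired' }

// Hypothetical persistence boundary (the tenant_deks table in the real system).
interface DekStore {
  findActive(tenantId: string): Promise<DekRow | null>
  insert(row: DekRow): Promise<void>
}

// Hypothetical KMS boundary, already bound to one customer's KEK.
interface KeyWrapper {
  generateDataKey(): Promise<{ plaintext: Buffer; wrapped: Buffer }>
  unwrap(wrapped: Buffer): Promise<Buffer>
}

async function getActiveDek(
  tenantId: string,
  store: DekStore,
  kms: KeyWrapper,
): Promise<{ id: string; plaintext: Buffer }> {
  const existing = await store.findActive(tenantId)
  if (existing) {
    // Reuse the tenant's active DEK; unwrapping is the only KMS call needed.
    return { id: existing.id, plaintext: await kms.unwrap(existing.wrappedDek) }
  }
  // First write for this tenant: mint a DEK under the customer's KEK.
  const { plaintext, wrapped } = await kms.generateDataKey()
  const row: DekRow = { id: randomUUID(), tenantId, wrappedDek: wrapped, status: 'active' }
  // In a real store, the one-active-DEK-per-tenant unique index makes a racing
  // second insert fail; the loser should re-read and use the winner's DEK.
  await store.insert(row)
  return { id: row.id, plaintext }
}
```

In production the unwrap branch would go through the DEK cache rather than calling KMS on every write.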
Cross-account access:
- You publish a documented IAM Role ARN (your-side) that customers grant access to
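- Customer's KMS key policy grants Decrypt + GenerateDataKey + DescribeKey to that ARN
- Your role's own IAM policy must also allow those actions on the customer's key ARN (cross-account KMS access needs both sides)
- You assume that role via STS to make KMS calls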
- Track per-tenant the "your role" used (in case you publish multiple over time for migrations)
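The statement a customer adds to their key policy could be generated for them in the setup wizard. The sketch below builds it as a plain object; the `Sid` value, the function name, and the example role ARN are hypothetical, while the three actions are the ones listed above.

```typescript
interface KeyPolicyStatement {
  Sid: string
  Effect: 'Allow'
  Principal: { AWS: string }
  Action: string[]
  Resource: '*'
}

// Builds the statement that grants the vendor role envelope-encryption access.
// In a key policy, Resource "*" refers to the key the policy is attached to.
function buildVendorAccessStatement(vendorRoleArn: string): KeyPolicyStatement {
  if (!/^arn:aws:iam::\d{12}:role\/.+$/.test(vendorRoleArn)) {
    throw new Error('expected an IAM role ARN')
  }
  return {
    Sid: 'AllowVendorEnvelopeEncryption', // hypothetical label
    Effect: 'Allow',
    Principal: { AWS: vendorRoleArn },
    Action: ['kms:Decrypt', 'kms:GenerateDataKey', 'kms:DescribeKey'],
    Resource: '*',
  }
}

// Hypothetical vendor role; a real wizard would show your published ARN.
const exampleStatement = buildVendorAccessStatement('arn:aws:iam::111122223333:role/vendor-cmek')
const exampleStatementJson = JSON.stringify(exampleStatement, null, 2)
```

Showing the customer copy-paste JSON keeps the grant limited to the three actions and avoids overly broad grants such as `kms:*`.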
Error handling:
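- AccessDeniedException: customer revoked grant; mark tenant status; alert customer + your ops
- ThrottlingException: KMS rate limits; backoff + retry with jitter
- DisabledException: customer disabled the key; mark status; alert
- KeyUnavailableException: transient KMS-side unavailability; retry with backoff
- KMSInvalidStateException: the key's state does not allow the operation (for example, customer scheduled deletion); confirm the state via DescribeKey; alert with high urgency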
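A small classifier keeps this policy in one place. The sketch below keys off the exception name that the AWS SDK reports on the thrown error; the action labels and the backoff parameters are assumptions, and the backoff uses the "full jitter" scheme.

```typescript
type KmsAction =
  | 'mark_revoked'        // grant removed: flag tenant, alert customer + ops
  | 'mark_key_disabled'   // key disabled by the customer
  | 'check_key_state'     // possibly pending deletion: DescribeKey, alert urgently
  | 'retry'               // transient: back off and try again
  | 'fail'                // unknown: surface as an error

function classifyKmsError(errorName: string): KmsAction {
  switch (errorName) {
    case 'AccessDeniedException': return 'mark_revoked'
    case 'DisabledException': return 'mark_key_disabled'
    case 'KMSInvalidStateException': return 'check_key_state'
    case 'ThrottlingException':
    case 'KeyUnavailableException':
    case 'KMSInternalException': return 'retry'
    default: return 'fail'
  }
}

// Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2^attempt)).
// `rand` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 5000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt)
  return Math.floor(rand() * ceiling)
}
```

A caller would typically retry only while `classifyKmsError(err.name) === 'retry'` and the attempt count is under a small limit, then fall through to the tenant-status handling.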
Walk me through:
1. The full encrypt/decrypt code with proper error handling
2. The cross-account STS assume-role pattern
3. The DEK caching policy (next section)
4. The customer-facing setup wizard (UI)
5. The verification step on setup (DescribeKey)
Output: working AWS KMS integration.
4. Add Azure Key Vault + GCP KMS (multi-cloud)
Most enterprise customers are on AWS, but a meaningful subset are on Azure or GCP. Add multi-cloud support.
AZURE KEY VAULT INTEGRATION:
Azure Key Vault has a different model:
- Keys live in a "Vault" (a regional resource)
- Access is governed by Microsoft Entra ID (formerly Azure AD); there is no direct equivalent of cross-account STS
- The customer grants a service principal role-based access on their Key Vault; the two models differ in how that principal comes to exist
- Two access models:
a. Customer creates a dedicated service principal in their own tenant and shares its credentials with you (closest in spirit to AWS cross-account access)
b. You register a multi-tenant Entra application; a customer admin consents to it (OAuth-style), which creates a service principal for your app in their tenant
Recommendation: option (b). It is consent-based and gives a cleaner trust model.
Library: @azure/keyvault-keys + @azure/identity
Operations equivalent to KMS:
- DescribeKey → keyClient.getKey(keyName)
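- GenerateDataKey → no equivalent (Azure does NOT have GenerateDataKey); generate the DEK locally, then cryptographyClient.wrapKey(algorithm, dek)
- Decrypt → cryptographyClient.unwrapKey(algorithm, wrappedKey), using a CryptographyClient bound to the key ID and version (wrap/unwrap live on CryptographyClient, not KeyClient)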
Key difference from AWS: you generate DEK in YOUR process, then send to Azure for wrap/unwrap. AWS lets KMS generate DEK on its side.
GCP KMS INTEGRATION:
GCP KMS is similar to AWS KMS in shape:
- KEY in a KEY_RING in a LOCATION (global / region-specific)
- IAM grants on the key (cloudkms.cryptoKeyEncrypterDecrypter role)
- Customer grants your service account that role on their key
Library: @google-cloud/kms
Operations:
- DescribeKey → kmsClient.getCryptoKey({ name })
- GenerateDataKey → no native equivalent; generate locally + wrap with kmsClient.encrypt({ name, plaintext }) (similar to Azure)
- Decrypt → kmsClient.decrypt({ name, ciphertext })
Cross-org: customer adds your service account email as IAM principal on their key.
Abstraction layer:
interface CmekProvider {
  describeKey(keyId: string): Promise<KeyInfo>
  generateDek(keyId: string): Promise<{ plaintext: Buffer; wrapped: Buffer }>
  unwrapDek(keyId: string, wrapped: Buffer): Promise<Buffer>
  testAccess(keyId: string): Promise<{ ok: boolean; error?: string }>
}

class AwsKmsProvider implements CmekProvider { ... }
class AzureKvProvider implements CmekProvider { ... }
class GcpKmsProvider implements CmekProvider { ... }

Factory:

function getProviderForTenant(tenantKey: TenantKey): CmekProvider {
  switch (tenantKey.customer_kms_provider) {
    case 'aws_kms': return new AwsKmsProvider(...)
    case 'azure_kv': return new AzureKvProvider(...)
    case 'gcp_kms': return new GcpKmsProvider(...)
    default: throw new Error('Unknown provider')
  }
}
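For Azure and GCP, where the DEK is generated locally and only wrap/unwrap go to the customer's KMS, the provider logic might look like the sketch below. `WrapBackend` is a hypothetical seam where the real Key Vault or Cloud KMS client would be plugged in; `describeKey` is omitted, so the class covers only part of the `CmekProvider` interface.

```typescript
import { randomBytes } from 'crypto'

// Hypothetical seam: the real implementation would call Key Vault or Cloud KMS.
interface WrapBackend {
  wrap(keyId: string, dek: Buffer): Promise<Buffer>
  unwrap(keyId: string, wrapped: Buffer): Promise<Buffer>
}

class LocalDekProvider {
  private backend: WrapBackend

  constructor(backend: WrapBackend) {
    this.backend = backend
  }

  // No GenerateDataKey equivalent: mint 32 random bytes in-process, then wrap remotely.
  async generateDek(keyId: string): Promise<{ plaintext: Buffer; wrapped: Buffer }> {
    const plaintext = randomBytes(32)
    const wrapped = await this.backend.wrap(keyId, plaintext)
    return { plaintext, wrapped }
  }

  async unwrapDek(keyId: string, wrapped: Buffer): Promise<Buffer> {
    return this.backend.unwrap(keyId, wrapped)
  }

  // Setup-time check: a throwaway DEK must survive a wrap + unwrap round trip.
  async testAccess(keyId: string): Promise<{ ok: boolean; error?: string }> {
    try {
      const { plaintext, wrapped } = await this.generateDek(keyId)
      const roundTripped = await this.unwrapDek(keyId, wrapped)
      const ok = roundTripped.equals(plaintext)
      plaintext.fill(0)
      roundTripped.fill(0)
      return ok ? { ok: true } : { ok: false, error: 'wrap/unwrap round trip mismatch' }
    } catch (e) {
      return { ok: false, error: e instanceof Error ? e.message : String(e) }
    }
  }
}
```

Because the DEK is created in your process here, the quality of your random source matters more than on AWS, where KMS generates the key material.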
Implement:
1. The CmekProvider interface
2. Each provider implementation
3. The factory + tenant routing
4. Customer setup UI for each provider (different setup steps; document each)
5. Error handling normalized across providers
6. Cost considerations (Azure/GCP price differently than AWS)
Output: multi-cloud CMEK that doesn't pretend AWS is the only customer.
5. DEK caching: the security tradeoff
DEK caching is the most security-sensitive design decision. Get it wrong and you defeat the point of CMEK.
The tradeoff:
- CACHE LONGER → better performance (KMS decrypt costs $0.03/10K calls + 10-50ms latency)
- CACHE SHORTER → revocation actually means revocation
If you cache a DEK in process memory for 24 hours and the customer revokes the KEK at hour 12, your app continues decrypting their data for 12 more hours. That's a security failure.
Recommended caching policy:
LEVEL A: Conservative (security-first)
- TTL: 60 seconds in process memory
- Roughly one KMS unwrap per minute per active DEK per app instance
- Cost: highest of the four levels (most KMS calls); latency: low (most reads within the minute are cache hits)
- Performance: ~1ms cache hit; 10-50ms cache miss
- Best for: sensitive customers (finance, healthcare); revocation lag <60s
LEVEL B: Balanced (most common)
- TTL: 5 minutes in process memory
- KMS call ~1 per 5 min per tenant
- Cost: moderate; latency: low
- Performance: ~1ms cache hit
- Revocation lag: up to 5 min
- Best for: most enterprise customers
LEVEL C: Performance-first
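- TTL: 1 hour in process memory + Redis layer with same TTL (note: a shared Redis layer means plaintext DEKs leave process memory, which conflicts with the memory-hygiene rules below; treat it as an explicit, documented exception)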
- Lowest cost; lowest latency
- Revocation lag: up to 1 hour
- Best for: high-throughput products with less-sensitive customer data
LEVEL D: Aggressive
- TTL: 24 hours
- Almost no KMS calls
- Revocation lag: 24 hours
- WARNING: this often FAILS real enterprise security review
- Avoid
Default recommendation: Level B (5 min). Allow per-tenant override (sensitive customers can request Level A; document the tradeoff).
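The levels can be reduced to a lookup plus a back-of-envelope estimate of unwrap volume. In the sketch below the level names and the 30-day month are assumptions; the TTLs and the $0.03 per 10K price come from the text above. The estimate is an upper bound that assumes every instance keeps every DEK continuously warm.

```typescript
const CACHE_TTL_MS = {
  conservative: 60_000,    // Level A: 60 seconds
  balanced: 300_000,       // Level B: 5 minutes
  performance: 3_600_000,  // Level C: 1 hour
} as const

type CacheLevel = keyof typeof CACHE_TTL_MS

// Per-tenant override, falling back to the Level B default.
function ttlForTenant(override?: CacheLevel): number {
  return CACHE_TTL_MS[override ?? 'balanced']
}

// Upper bound: each instance re-unwraps each active DEK once per TTL window.
function estimateMonthlyUnwraps(instances: number, activeDeks: number, ttlMs: number): number {
  const monthMs = 30 * 24 * 3600 * 1000
  return Math.ceil(monthMs / ttlMs) * instances * activeDeks
}

function estimateMonthlyCostUsd(calls: number): number {
  return (calls / 10_000) * 0.03
}

// Worked example: 20 instances, 1 active DEK, Level B.
// 2,592,000 s / 300 s = 8,640 windows per instance; x 20 = 172,800 calls.
const exampleCalls = estimateMonthlyUnwraps(20, 1, ttlForTenant())
const exampleCost = estimateMonthlyCostUsd(exampleCalls) // about $0.52
```

The worst-case revocation lag for a tenant is simply its TTL, so the same table can drive both the cache and the number you quote to the customer's security team.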
Cache layer architecture:
// LruCache: any bounded LRU map with get/set/delete (assumed; not shown here)
class DekCache {
  private memCache: LruCache<string, { plaintext: Buffer; expiresAt: number }>

  // Returns the cached DEK bytes, or null if absent or expired
  get(dekId: string): Buffer | null {
    const entry = this.memCache.get(dekId)
    if (!entry) return null
    if (Date.now() > entry.expiresAt) {
      this.memCache.delete(dekId)
      return null
    }
    return entry.plaintext
  }

  set(dekId: string, plaintext: Buffer, ttlMs: number): void {
    this.memCache.set(dekId, { plaintext, expiresAt: Date.now() + ttlMs })
  }

  invalidate(dekId: string): void {
    this.memCache.delete(dekId)
  }

  invalidateTenant(tenantId: string): void {
    // expensive; iterate. Entries must also record tenantId (or keep a
    // tenantId -> dekIds index) for this to be implementable.
  }
}
Memory hygiene:
- Use a library that zeros memory on eviction (sodium-native or libsodium-wrappers)
- DO NOT serialize DEKs to disk, logs, or any persistent store
- DO NOT log DEK contents anywhere (including verbose error logs)
- Process memory only; never persist plaintext DEKs
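A variant of the cache that wipes key bytes on eviction and supports per-tenant invalidation could look like this. It is a sketch under stated assumptions: it uses a plain `Map` (no size bound), `Buffer.fill(0)` as a best-effort wipe (JavaScript cannot guarantee the runtime made no other copies), and an injectable clock for testing. The class name is hypothetical.

```typescript
interface CacheEntry { plaintext: Buffer; tenantId: string; expiresAt: number }

class ZeroingDekCache {
  private entries = new Map<string, CacheEntry>()
  private now: () => number

  constructor(now: () => number = Date.now) {
    this.now = now
  }

  set(dekId: string, tenantId: string, plaintext: Buffer, ttlMs: number): void {
    this.invalidate(dekId) // wipe any previous copy first
    // Store a private copy so the caller can wipe its own buffer independently.
    this.entries.set(dekId, {
      plaintext: Buffer.from(plaintext),
      tenantId,
      expiresAt: this.now() + ttlMs,
    })
  }

  // Returns a copy, so wiping the cached bytes cannot corrupt an in-flight decrypt.
  get(dekId: string): Buffer | null {
    const entry = this.entries.get(dekId)
    if (!entry) return null
    if (this.now() > entry.expiresAt) {
      this.invalidate(dekId)
      return null
    }
    return Buffer.from(entry.plaintext)
  }

  invalidate(dekId: string): void {
    const entry = this.entries.get(dekId)
    if (!entry) return
    entry.plaintext.fill(0) // overwrite key bytes before dropping the reference
    this.entries.delete(dekId)
  }

  // Force-invalidate path for revocation: drop every DEK belonging to one tenant.
  invalidateTenant(tenantId: string): void {
    this.entries.forEach((entry, dekId) => {
      if (entry.tenantId === tenantId) this.invalidate(dekId)
    })
  }

  size(): number {
    return this.entries.size
  }
}
```

Returning copies trades a little extra key material in memory for safety against use-after-wipe; callers should zero the copy they receive once they are done with it.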
Multi-process / horizontal scaling:
- Each instance has its own cache (no shared cache)
- This is FINE; security improves (no shared cache to compromise)
- Cost increases (more KMS calls across instances)
Implement:
1. The DekCache class with proper TTL handling
2. Memory zeroing on eviction
3. Per-tenant TTL configuration
4. The "force-invalidate-on-revocation" path (customer-driven via webhook or polling)
5. Metrics: cache hit rate, KMS call rate, p99 decrypt latency
Output: a caching strategy that respects security AND performance.
6. Build the customer-facing setup UI
Enterprise security teams will set this up; the UI must be self-serve, clear, and rollback-safe.
Page: /settings/security/encryption-keys (admin-only)
States:
State 1: Default (CMEK not enabled)
- Banner: "Customer-Managed Keys: Available on Enterprise plan"
- "Enable CMEK" button → opens wizard
State 2: Wizard (multi-step)
Step 1: Choose provider
- AWS KMS / Azure Key Vault / GCP KMS (radio)
- Show provider-specific instructions
Step 2: Provider-specific setup
AWS:
- Show: your AWS account ID + IAM role ARN to grant access to
- Customer creates KMS key in their account
- Customer attaches key policy granting your role: Decrypt + GenerateDataKey + DescribeKey
- Customer pastes their KMS key ARN
Azure:
- Show: your Azure AD app + service principal info
- Customer grants Key Vault Crypto User role to your principal
- Customer pastes Key Vault URL + key name
GCP:
- Show: your GCP service account email
- Customer grants cloudkms.cryptoKeyEncrypterDecrypter role
- Customer pastes full key resource name
Step 3: Verify access
- Click "Test Connection" → server runs DescribeKey + does test wrap+unwrap
- On success: "Connection verified"
- On failure: clear error message + troubleshooting link
Step 4: Confirm migration
- "We will encrypt all NEW data going forward with your key"
- "Existing data will be re-encrypted in the background — typically 1-12 hours"
- "During migration, every record stays readable: each one is decrypted with whichever key (old or new) currently protects it"
- Confirm + start
State 3: Migration in progress
- Progress bar: "Encrypting your existing data… X of Y records (Z%)"
- Estimated time remaining
- Pause / cancel option (with consequences explained)
State 4: Active CMEK
- Status: "Enabled — using your key arn:aws:kms:..."
- Last verified: timestamp
- "Test Connection" button (run anytime)
- "Rotate Key" button (initiates rotation)
- "Disable CMEK" button (with strong warning)
- Key audit log (last 50 KMS operations from your side)
State 5: Key revoked / access lost
- Banner: "WARNING: Your encryption key is unavailable"
- "We can no longer decrypt your data. Restore IAM access to your KMS key, or disable CMEK to fall back to default encryption (will require admin confirmation)."
- Last successful access timestamp
- Troubleshooting steps
- Contact support button
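The five page states above can be derived from data the backend already has, rather than stored separately. The sketch below assumes the status values from the `tenant_keys` and `migration_jobs` tables in this chat; the `UiState` labels and the function name are hypothetical.

```typescript
type KeyStatus = 'active' | 'pending_grant' | 'revoked' | 'failed'
type MigrationStatus = 'pending' | 'in_progress' | 'paused' | 'completed' | 'failed'
type UiState = 'not_enabled' | 'wizard' | 'migrating' | 'active' | 'access_lost'

function deriveUiState(
  key: { status: KeyStatus } | null,
  migration: { status: MigrationStatus } | null,
): UiState {
  // State 1: no key row means CMEK was never configured.
  if (!key) return 'not_enabled'
  // State 5: access loss outranks everything else.
  if (key.status === 'revoked' || key.status === 'failed') return 'access_lost'
  // State 2: key registered but the grant has not been verified yet.
  if (key.status === 'pending_grant') return 'wizard'
  // State 3: an unfinished job (including paused or failed) keeps the progress view up.
  if (migration && migration.status !== 'completed') return 'migrating'
  // State 4: key active and nothing left to migrate.
  return 'active'
}
```

Deriving the state this way means a revocation detected by the decrypt path shows up on the settings page without any extra bookkeeping.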
Implement:
1. The wizard with provider-specific flows
2. The verification endpoint (server-side test of access)
3. The migration job (re-encrypt existing data)
4. The status / monitoring page
5. The audit log component (recent KMS operations)
6. The disable-CMEK flow (with cooldown / confirmation)
Output: an enterprise-grade setup UX that customers' security teams trust.
7. Re-encryption / migration jobs
When CMEK is enabled or rotated, you need to re-encrypt existing data.
Migration scenarios:
A. CMEK first-enabled: re-encrypt all tenant data from default-encryption to CMEK
B. CMEK rotation: re-encrypt all tenant data from old DEK to new DEK
C. CMEK disabled: re-encrypt back to default
All scenarios are background jobs. Must be:
- Resumable (track progress; restart from checkpoint on failure)
- Bounded blast radius (one tenant at a time; don't burn cluster)
- Observable (admin sees progress)
- Atomic per-record (no half-encrypted state)
Pattern:
migration_jobs (
  id                     uuid pk
  tenant_id              uuid not null
  type                   text         -- 'enable_cmek' | 'rotate_dek' | 'disable_cmek'
  status                 text         -- 'pending' | 'in_progress' | 'paused' | 'completed' | 'failed'
  total_records          bigint
  processed_records      bigint
  current_resource_type  text
  current_offset         bigint
  started_at             timestamptz
  completed_at           timestamptz
  error                  text
)
Worker logic:
while not done:
  batch = fetch next 1000 records for this tenant + resource_type + offset, ORDER BY id
  for record in batch:
    plaintext = decrypt_with_old_key(record)       // current state
    new_record = encrypt_with_new_key(plaintext)   // new state (fresh IV + auth tag)
    UPDATE record SET ciphertext = new_record.ciphertext, iv = new_record.iv, auth_tag = new_record.auth_tag, dek_id = new_record.dek_id WHERE id = record.id AND dek_id = old_dek_id
    // optimistic lock; if dek_id changed, skip (concurrent rotation)
  update migration_jobs SET processed_records = processed_records + batch.length, current_offset = ...
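The worker loop can be sketched against an in-memory array standing in for the table. Two things here are assumptions rather than the pattern above: the sketch pages by last-seen id (keyset pagination) instead of an offset, which stays stable if rows are inserted mid-run, and `reencrypt` is an injected function so the example runs without KMS. All names are hypothetical.

```typescript
interface Rec { id: string; dekId: string; ciphertext: string; iv: string; authTag: string }
type Reencrypted = Pick<Rec, 'dekId' | 'ciphertext' | 'iv' | 'authTag'>
interface BatchResult { lastId: string | null; migrated: number; skipped: number }

function migrateBatch(
  table: Rec[],
  afterId: string | null,
  batchSize: number,
  oldDekId: string,
  reencrypt: (r: Rec) => Reencrypted,
): BatchResult {
  // Keyset page: the next `batchSize` rows with id greater than the checkpoint.
  const batch = table
    .filter(r => afterId === null || r.id > afterId)
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
    .slice(0, batchSize)
  let migrated = 0
  let skipped = 0
  for (const rec of batch) {
    // Optimistic lock: only rewrite rows that are still on the old DEK.
    if (rec.dekId !== oldDekId) { skipped++; continue }
    // Replace ciphertext, IV, auth tag, and dek_id together (atomic per record).
    Object.assign(rec, reencrypt(rec))
    migrated++
  }
  const lastId = batch.length > 0 ? batch[batch.length - 1].id : null
  return { lastId, migrated, skipped }
}

function runMigration(
  table: Rec[],
  batchSize: number,
  oldDekId: string,
  reencrypt: (r: Rec) => Reencrypted,
): { migrated: number; skipped: number } {
  let cursor: string | null = null
  let migrated = 0
  let skipped = 0
  for (;;) {
    const res = migrateBatch(table, cursor, batchSize, oldDekId, reencrypt)
    if (res.lastId === null) break
    cursor = res.lastId // checkpoint: persist this to resume after a crash
    migrated += res.migrated
    skipped += res.skipped
  }
  return { migrated, skipped }
}
```

Persisting `cursor` after each batch is what makes the job resumable; re-running a batch is harmless because already-migrated rows fail the `dekId` check and are skipped.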
Throttle:
- 100-500 records per second per tenant
- Don't burst; slow + steady
- Pause if KMS error rate > threshold
- Pause if customer-side latency reports degrade
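The pacing rules reduce to two small calculations: how long to sleep after a batch to stay at the target rate, and whether the KMS error rate warrants a pause. The 5% threshold below is an assumed default, not a recommendation from this chat, and the function names are hypothetical.

```typescript
// Time budget for a batch at the target rate, minus the time it actually took.
// Example: 1000 records at 250/s should take 4000 ms in total.
function batchPauseMs(batchSize: number, targetPerSecond: number, elapsedMs: number): number {
  const budgetMs = (batchSize / targetPerSecond) * 1000
  return Math.max(0, Math.round(budgetMs - elapsedMs))
}

// Pause the job when the share of failed KMS calls exceeds the threshold.
function shouldPause(kmsErrors: number, kmsCalls: number, threshold = 0.05): boolean {
  return kmsCalls > 0 && kmsErrors / kmsCalls > threshold
}
```

For instance, a 1000-record batch that finished in 1500 ms at a 250 records/s target would sleep for 2500 ms before the next batch.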
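Mixed-key reads during migration:
- Each record is encrypted under exactly one DEK at a time; reads use the DEK named by the record's dek_id, so migrated and not-yet-migrated records both decrypt
- Writes always use new DEK
- After all records migrated and verified: retire the old DEK (drop it only once nothing still references it)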
Implement:
1. The migration_jobs schema
2. The worker with checkpoint + resume
3. The dual-decrypt fallback pattern
4. The pause/resume admin controls
5. The progress monitoring UI
6. The validation step (post-migration verification)
Output: re-encryption that doesn't take down production.
8. Audit log + customer transparency
Customers want to see exactly what KMS operations their data has triggered. Provide an audit log.
Schema:
cmek_audit_log (
  id             uuid pk
  tenant_id      uuid not null
  operation      text         -- 'describe_key' | 'generate_data_key' | 'decrypt' | 'access_denied' | 'key_revoked' | 'key_rotated'
  kms_provider   text
  kms_key_arn    text
  resource_type  text         -- 'document' | 'message' | 'attachment'
  resource_id    text
  outcome        text         -- 'success' | 'denied' | 'error'
  error_code     text
  ip_address     text         -- your server's IP making the call (not the customer's)
  occurred_at    timestamptz
)
Customer-facing view:
- Filterable by operation + outcome + date range
- Exportable as CSV
- Retention: 1 year minimum; longer if customer plan calls for it
- Log: never include any plaintext data; only the operation metadata
Specific events to log:
- Every DescribeKey on customer's KEK (rare; usually only at setup)
- Every GenerateDataKey (DEK creation events; rare)
- Sample of Decrypt operations (don't log every single — high volume; sample at 1% or aggregate)
- All AccessDenied events (alert-worthy)
- Customer-initiated rotations
- Customer-initiated revocations (detected via failed access)
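The logging rules above amount to "always log rare or failed events; sample successful decrypts". A minimal sketch of that decision follows, using the operation and outcome values from the schema; the function name and the injectable `rand` parameter are assumptions for testability.

```typescript
type AuditOperation =
  | 'describe_key' | 'generate_data_key' | 'decrypt'
  | 'access_denied' | 'key_revoked' | 'key_rotated'
type AuditOutcome = 'success' | 'denied' | 'error'

function shouldLogAuditEvent(
  operation: AuditOperation,
  outcome: AuditOutcome,
  sampleRate = 0.01,
  rand: () => number = Math.random,
): boolean {
  // Failures and denials are always logged; they are rare and alert-worthy.
  if (outcome !== 'success') return true
  // Setup, DEK creation, rotation, and revocation events are rare; log them all.
  if (operation !== 'decrypt') return true
  // Successful decrypts are high volume; keep a random sample.
  return rand() < sampleRate
}
```

If exact counts matter to the customer, the unsampled events can still be aggregated into per-hour counters, so the sampled rows serve as examples rather than the source of totals.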
Customer's side audit:
- The customer ALSO has CloudTrail (AWS) / Activity Logs (Azure) / Audit Logs (GCP) on their KMS
- Their logs show YOUR IAM principal making the calls
- These should match yours (or close — sampling differences)
- Document for them: "you can verify our claims via your own KMS audit"
Implement:
1. The cmek_audit_log schema
2. The logging hook in encrypt/decrypt paths (with sampling for high-volume Decrypt)
3. The customer-facing view + export
4. The reconciliation-with-CloudTrail documentation
5. The retention policy enforcement
Output: transparency that closes the trust loop.
9. Operational concerns
Walk me through the edge cases:
1. Customer revokes IAM grant suddenly
- Decrypt calls fail with AccessDenied
- App must surface "Data unavailable" gracefully (not 500)
- Customer's data is "soft locked" (we have ciphertext, can't decrypt)
- Alerting: high-priority page to your ops + email customer admin
- Resolution: customer restores IAM grant; access returns within minutes
2. Customer disables / schedules KMS key for deletion
- Different from revocation; key itself is going away
- You get warning periods (AWS: 7-30 day pending deletion)
- Critical: alert customer admin VERY clearly: "your data will be permanently inaccessible in N days"
- Provide path: cancel deletion, OR disable CMEK and re-encrypt to default (only viable in pending-deletion window)
3. Customer changes IAM role you depend on
- Document required role; rotate carefully
- Test new role before old is removed
4. Performance regression after enabling CMEK
- Expected: ~1-2ms p50 added latency for cache-hit decrypt; 10-50ms p99 for cache miss
- Mitigation: batch operations to share DEK; longer cache TTL with customer consent
- Budget for it: enterprise tier customers expect some perf tradeoff
5. Tenant migration to a different KMS
- Customer wants to move from AWS KMS to GCP KMS (rare but happens)
- Trigger full re-encrypt with new KEK
- Document customer-side workflow
6. Cross-region: customer KEK in different region than your data
- Cross-region KMS calls are slower + more expensive
- Document recommendation: customer should put KEK in same region as your service for that customer
- Consider per-region KEK if customer is multi-region
7. DEK rotation cadence (your-side, not KEK rotation)
- Best practice: generate new DEK monthly or quarterly
- Old DEKs retired but kept for decrypting old data
- Records stamped with dek_id; can read with any historical DEK
- Eventually re-encrypt very old records to current DEK (background job)
8. Backup + disaster recovery
- Backups also encrypted; restoring requires customer's KEK to be available
- DR test: regularly verify you can restore from backup using current KEK
- Document for customer: backup retention + decryption requirements
9. Subprocessor / third-party access (e.g. you send data to OpenAI)
- Decrypt happens before sending to subprocessor
- Document this in your DPA / sub-processor list
- Customer security team WILL ask: "what services see decrypted data?"
10. Compliance audit
- SOC 2 / ISO 27001 / HIPAA assessors will ask about CMEK
- Be ready: documented architecture diagram, audit log samples, key-rotation evidence
11. Billing for CMEK
- This is an enterprise feature; charge for it
- Typical: 20-40% premium on enterprise tier OR flat $X/month CMEK fee
- Customer KMS costs are theirs (usually $1-5/key/month + per-call charges)
12. Internal access controls
- Even YOUR engineers shouldn't have casual access to plaintext
- Production decrypt access via break-glass only (audit logged)
- Test with: "if our DBA went rogue, what's their blast radius?" — should be limited to hashes/encrypted blobs
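For edge case 1, the "not a 500" requirement can be met by translating key errors into a stable, structured response at the API boundary. The sketch below matches on error names used earlier in this chat; the status codes, the `code` strings, and the helper names are assumptions to adapt to your own API conventions.

```typescript
interface HttpResult { status: number; body: { code: string; message: string } }

// Hypothetical helper that creates errors named like the ones thrown by the decrypt path.
function makeKeyError(kind: 'TenantKeyRevokedError' | 'TenantKeyNotConfigured'): Error {
  const err = new Error(kind)
  err.name = kind
  return err
}

function toHttpResult(err: unknown): HttpResult {
  const errName = err instanceof Error ? err.name : ''
  if (errName === 'TenantKeyRevokedError') {
    // ASSUMPTION: 503 signals "temporarily unavailable until key access is restored".
    return {
      status: 503,
      body: {
        code: 'encryption_key_unavailable',
        message: 'Data unavailable; contact your administrator.',
      },
    }
  }
  if (errName === 'TenantKeyNotConfigured') {
    return {
      status: 409,
      body: { code: 'encryption_key_not_configured', message: 'No encryption key is configured.' },
    }
  }
  // Anything else stays a generic server error, with no internal detail leaked.
  return { status: 500, body: { code: 'internal_error', message: 'Unexpected error.' } }
}
```

A stable `code` lets the frontend render the "key unavailable" banner instead of a generic failure page.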
For each, walk me through code change + customer-facing impact + comms.
Output: operational robustness that customer security teams accept.
10. Recap
What you've built:
- Envelope encryption pattern (per-tenant DEK, per-tenant KEK)
- AWS KMS integration with cross-account STS
- Azure Key Vault + GCP KMS integrations
- Provider abstraction layer
- DEK caching with appropriate TTL + memory hygiene
- Customer-facing setup wizard (per-provider)
- Re-encryption / migration jobs
- Audit log + customer transparency
- Graceful handling of revocation + access loss
- Performance instrumentation
- Sales / pricing motion (BYOK as enterprise feature)
What you're explicitly NOT shipping in v1:
- Customer-hosted HSM (Level 3) — defer until a customer pays for it
- Client-side / E2E encryption — different product entirely
- Bring-your-own-storage (BYOS) — different feature
- Searchable encryption / OPE / format-preserving encryption — only if a specific product feature requires it
- Quantum-resistant algorithms — defer; revisit late 2020s
- Hybrid HSM + cloud-KMS support — niche; defer
Ship CMEK to your top 5 enterprise customers as a paid pilot. Iterate based on their security-team feedback. Generalize and price.
The biggest mistake teams make: shipping "BYOK" that's actually vendor-managed customer-named keys — fails real security review. Ship Level 2 (real customer-side KMS) or don't ship.
The second mistake: forgetting the operational realities. CMEK is a feature; it's also a 24/7 dependency on customer infrastructure. Plan for outages, revocations, key migrations.
The third mistake: not pricing for it. CMEK is a paid feature; it commands a premium. Don't give it away as a "compliance freebie."
See Also
- Audit Logs — pairs with CMEK audit
- Multi-Tenancy — the tenant boundary CMEK enforces
- SSO / Enterprise Auth — the next enterprise feature
- Backups & Disaster Recovery — pairs for backup encryption
- Roles & Permissions — adjacent enterprise control
- Account Deletion & Data Export — adjacent compliance
- Data Trust — pairs for transparency narrative
- Database Migrations — re-encryption is a migration pattern
- Background Jobs & Queue Management — re-encrypt jobs run here
- Multi-Region Deployment — pairs with cross-region KMS
- API Keys — adjacent secrets-management discipline
- OAuth Provider Implementation — adjacent enterprise integration
- Compliance Automation Tools (Reference) — supports CMEK during audits