Architecture Decision Records (ADRs)
Velocity’s design is anchored in 10 Architecture Decision Records. Each ADR is a binding decision; deviations require an ADR update.
ADR-001: Informer-Per-Replica Architecture
Status: Adopted
Date: Phase 1
Every API server replica runs its own kube-rs informer on SchemaDefinition CRDs (etcd watch). Each replica maintains a fresh, in-memory SchemaRegistry updated in real-time. No RPC to the operator, no Redis pub/sub, no stale reads.
Rationale:
- Low latency: Reads are atomic pointer loads (arc_swap), no network hop
- Availability: Registry works even if operator is down (operator is for provisioning, not routing)
- Consistency: Informer guarantees eventual consistency with etcd; cache misses impossible after reconnect
- Scalability: O(replicas) watchers vs. O(1) centralized cache (watchers are cheap; etcd handles fan-out)
Trade-off: Each replica uses ~50-100 MB RAM for SchemaRegistry. Acceptable for < 1000 schemas per org.
Impact:
- Operator downtime does NOT block API reads
- Schema updates propagate within seconds (informer latency)
- No schema versioning needed at API layer (CRD ResourceVersion handles it)
See CLAUDE.md SchemaRegistry Implementation (ADR-001, ADR-006) for code pattern.
ADR-002: Outbox Pattern for CDC
Status: Adopted
Date: Phase 4
Every data write that requires indexing (Tier-3 search, future webhooks) uses the transactional outbox pattern:
- Main table INSERT/UPDATE (e.g., purchase_order_v1)
- Outbox table INSERT (e.g., purchase_order_v1_outbox) in same transaction
- CDC worker polls outbox for unpublished rows
- CDC worker publishes to Typesense, marks published_at
- Archival sweeps old published rows
Rationale:
- Transactionality: Main data and index signal in one ACID transaction (no dual-write inconsistency)
- Idempotency: CDC worker can retry; duplicates detected by Typesense upsert
- Auditability: Outbox lag is observable; stalled CDC is detectable metric
- Decoupling: Typesense outage does not block writes (outbox buffers)
Trade-off: Extra I/O per write (2 inserts instead of 1). Negligible at < 10K writes/sec.
Scope: Tier-3 schemas only. Tier-1 and Tier-2 have no outbox.
Impact:
- Typesense indexing is “at most X seconds” behind main table (configurable CDC worker frequency)
- CDC lag alert:
velocity_cdc_lag_records > 1000indicates worker stall
See CLAUDE.md Outbox Pattern (ADR-002) for code pattern.
ADR-003: Authentication Fail-Mode Matrix
Status: Adopted
Date: Phase 1
When external dependencies fail (Redis unavailable, JWKS endpoint down, Postgres disconnected), the system defaults to DENY (fail-closed). Fail-open is never automatic; it must be explicitly configured and audit-logged.
Matrix:
| Dependency | Failure | Default Behavior | Override |
|---|---|---|---|
| Redis (revocation) | Unreachable | 503 REVOCATION_UNAVAILABLE (deny) | revocation.failOpen: true (dangerous) |
| JWKS (issuer) | Endpoint down | Cache hit: allow; Cache miss/expire: 401 (deny) | Increase cache TTL (5 min default) |
| Database | Postgres down | 503 SERVICE_UNAVAILABLE (deny) | None |
| Typesense | Indexing unavailable | Writes succeed (outbox buffers); searches degrade to Tier 2 | Automatic fallback |
Rationale:
- Security: Better to deny a legitimate request than allow an attacker due to failed checks
- Auditability: Every fail-mode decision logged in audit trail (
fail_modecolumn) - Transparency: Operators can opt-out with explicit configuration and understand the risk
Trade-off: User-facing 503s may spike during dependency outages. Uptime SLA depends on dependency health.
Impact:
- All authentication requests include
fail_modefield in audit entry - Alerts on
velocity_auth_dependency_failures_totalindicate need for incident response
See auth.md Fail-Mode Matrix (ADR-003) and hardening.md Auth Fail-Mode Matrix for operational details.
ADR-004: Warm-Tier Query Engine (DataFusion)
Status: Adopted (Revision 2026-05-18)
Date: Phase 2 (revised Phase 4)
Archive queries (warm tier: S3 Parquet) use Apache DataFusion (embedded Rust SQL engine) in the velocity-warm-reader service.
Original (Phase 2): Simple metadata scan + projection.
Revised (Phase 4): Full SQL with:
- Predicate pushdown (WHERE filters applied before reading Parquet)
- Aggregations (COUNT, SUM, GROUP BY)
- Sorting and limits
- Columnar pruning (read only selected columns)
Rationale:
- Performance: Predicate pushdown reduces S3 bytes by 40-60%
- Simplicity: SQL is familiar; no custom query DSL
- Efficiency: DataFusion is pure Rust; no external process, low latency
- Compatibility: Queries match time-machine point-in-time API shape
- Cost: Fewer bytes read from S3 = lower egress cost
Trade-off: DataFusion lacks some features (no geospatial, limited window functions) — deferred to Phase 10.
Impact:
- Archive queries use SQL WHERE/ORDER BY/LIMIT (not custom JSON)
- Response time: 100-500ms typical (S3 + DataFusion overhead)
See archive.md for Archive Operations and Query Archived Records examples.
ADR-005: Hash-Linked Append-Only Audit Chain
Status: Adopted
Date: Phase 1
The audit log is immutable and cryptographically hash-linked: each event includes the SHA256 hash of the previous event. Tampering is detectable.
Chain structure:
Event 1: SHA256(event_id || timestamp || actor || old || new || "") = abc123...Event 2: SHA256(event_id || timestamp || actor || old || new || abc123...) = def456...Event 3: SHA256(event_id || timestamp || actor || old || new || def456...) = ghi789...Any change to Event 2’s data (timestamp, actor, values) causes Event 3’s hash to break the chain.
Verification: velocity audit verify recomputes the chain and detects breaks.
Rationale:
- Compliance: SOC 2, GDPR, HIPAA, PCI-DSS all require tamper-evident audit trails
- Forensics: Hash proof that an event was not modified after-the-fact
- Cryptography: SHA256 is standard; no custom crypto
- Simplicity: Computed at write time in stored procedure; no async verification needed
Trade-off: Verification requires reading all events (slow for large records, but typical is 10-100 events per entity).
Impact:
- Audit table has
event_hashandprev_hashcolumns - Deletion of audit rows is impossible (table is append-only, no superuser privileges)
- All databases and warm-reader queries enforce immutability
See audit.md Verify Audit Chain Integrity for verification examples and Compliance Standards for regulatory details.
ADR-006: Arc_Swap for Lock-Free Registry Reads
Status: Adopted
Date: Phase 1
SchemaRegistry uses arc_swap::ArcSwap<RegistryInner> for lock-free, concurrent reads. No RwLock, no Mutex.
Pattern:
struct SchemaRegistry { inner: ArcSwap<RegistryInner>,}
// Write (rare): updateregistry.store(Arc::new(new_inner));
// Read (hot path): atomic pointer loadlet snapshot = registry.load();snapshot.resolve(&path)?;Rationale:
- Lock-free: Reads do not block each other (atomic pointer swap)
- Performance: Hot path has zero contention
- Simplicity: No lock poisoning; no deadlock risk
- Fairness: Writers do not starve readers
Trade-off: Readers hold a reference to the old registry snapshot while reading; writes allocate a new ArcSwap. Negligible cost (allocations < 1/min).
Impact:
- Schema reads are ~10x faster than RwLock
- API latency does not degrade under high concurrency
- No hand-wavy “lock contention” debugging; performance is predictable
See CLAUDE.md SchemaRegistry Implementation for code pattern.
ADR-007: Non-Superuser Postgres Role
Status: Adopted
Date: Phase 1
The velocity_api role is NON-SUPERUSER with NOBYPASSRLS=true. RLS policies are an actual backstop, not just app-layer logic.
Verification:
SELECT rolname, rolinherit, rolcanlogin, rolbypassrlsFROM pg_rolesWHERE rolname = 'velocity_api';
-- Expected:-- rolname rolinherit rolcanlogin rolbypassrls-- velocity_api t t f ← Must be falseIf rolbypassrls = true, RLS is bypassed and security is compromised.
Constraints applied to velocity_api:
- No superuser grant
- No CREATE ROLE
- No ALTER DATABASE
- No TRUNCATE (DELETE only)
- SELECT/INSERT/UPDATE/DELETE on schema-specific tables only
Rationale:
- Defense in depth: Even if app code has a bug, RLS prevents unauthorized reads
- Compliance: SOC 2, HIPAA require separation of duty and access controls
- Simplicity: Postgres enforces the policy; no trust in app logic
- Auditability:
pg_auditcan log all RLS violations at database level
Trade-off: App cannot bypass RLS (e.g., internal admin operations must use a separate role).
Impact:
- Every transaction must call
SET LOCAL ROLE {domain_role}(app-layer context switching) - RLS filters are applied even if app code forgets to include them
- Operators must verify role at startup; deployment fails if misconfigured
See installation.md Critical: ADR-007 Verification and hardening.md Database Role Verification (ADR-007) for operational details.
ADR-008: CEL for Safe Expressions
Status: Adopted
Date: Phase 7
CEL (Common Expression Language) is used for ABAC, validation rules, and archive triggers. Execution is bounded: 10 KB max, 10 depth, 10 ms timeout.
Constraints:
max_expression_size: 10 KBmax_execution_depth: 10max_execution_duration: 10 msforbidden_functions: [matches with unbounded regex]Examples:
# Valid CELcondition: "actor.department in ['procurement', 'finance']"condition: "self.amount > 1000 && self.status == 'approved'"condition: "self.created_at > now() - duration('7776000s')"
# Invalid CEL (too slow)condition: "self.description.matches('.*')" # Unbounded regex
# Invalid CEL (too long)condition: "...5000 character expression..."
# Invalid CEL (too deep)condition: "a || b || c || d || ... (>10 OR chains)"Rationale:
- Denial-of-service prevention: Bounded execution prevents CPU exhaustion
- Consistency: Same CEL used everywhere; no custom expression language per feature
- Safety: No file I/O, network calls, or external dependencies in CEL
- Debuggability: Timeout vs. parse error is explicit
Trade-off: Complex business logic must use triggers or async reconciliation, not CEL.
Impact:
- Archive triggers and validation rules use CEL
velocity schema validatechecks expression constraints at apply time- Webhook server never runs untrusted CEL; it runs trusted operator-defined CEL only
See features/schema-definition.md CEL Constraints and hardening.md Input Validation for examples.
ADR-009: Cursor-Based Pagination (Deferred)
Status: Planned for Phase 11
Date: TBD
For scalability, large result sets (> 1000 rows) use cursor-based pagination instead of offset/limit.
Current behavior: offset/limit, max 1000 rows per request.
Planned behavior: GET /api/{path}?cursor={token}&limit=100 returns next_cursor in response.
Rationale:
- Performance: Cursor avoids O(N) row skip overhead
- Consistency: Cursor snapshots the result set; offset-based pagination has gaps/dupes on concurrent writes
- Predictability: Cursor response time is constant; offset grows with skip distance
Trade-off: Client must handle pagination differently (not just loop with limit += 100).
Impact: When implemented, offset-based pagination will be deprecated but not removed.
See API reference for current pagination behavior (offset/limit).
ADR-010: Multi-Tenant Isolation (Deferred)
Status: Planned for Phase 11
Date: TBD
Hard tenant boundaries for true SaaS multi-tenancy. Org-level isolation enforced in schema definitions.
Planned constraints:
- No cross-org schema references
- Audit log includes org context
- Role binding is org-scoped
- Tenant quotas (schemas per org, records per schema)
- Separate admin personas per tenant
Current behavior: Single org per deployment. Org is set at install time (Helm values).
Future behavior: Multiple orgs in one Velocity cluster with hard boundaries.
Rationale:
- Security: Accidental data leakage between customers is impossible
- Compliance: HIPAA, PCI-DSS require data segregation
- Cost: One Postgres instance per cluster (not per org)
- Simplicity: RLS policies are org-scoped only
Trade-off: Cross-org queries will never be supported (by design).
Impact: When implemented, Helm deployment will accept multiTenant: true flag. Single-tenant deployments remain unchanged.
ADRs Awaiting Finalization
None at this time. ADRs 001–010 are stable.
Future ADRs may address:
- Sharding strategy (Phase 11)
- Webhook delivery guarantees (Phase 10)
- Search analytics backend (Phase 10)
- Disaster recovery RTO/RPO targets
How to Reference an ADR
In code comments and design docs, cite by number:
# Valid# ADR-007 requires non-superuser role for RLS
# ADR-003 fail-mode matrix applies here
# Invalid# The non-superuser thing# Security reasons for role separationUpdating an ADR
To update a decision:
- Create a new ADR with revision date (e.g., “ADR-004 Revision 2026-05-18”)
- Explain change from previous decision
- Update CLAUDE.md
Required Readingsection - Update all code comments and docs
- File a PR with
[ADR-XXX]tag in commit message
Compliance Mapping
| Standard | ADRs |
|---|---|
| SOC 2 Type II | ADR-005 (audit), ADR-007 (RLS, access controls) |
| GDPR | ADR-005 (audit trail), ADR-007 (access controls), ADR-010 (data segregation) |
| HIPAA | ADR-005 (tamper detection), ADR-007 (RLS), ADR-010 (hard boundaries) |
| PCI-DSS | ADR-005 (immutable audit), ADR-007 (role separation) |
| SOX | ADR-005 (audit chain), ADR-007 (separation of duty) |