Solution Architecture & Engineering Strategy

E-Commerce Platform
Solution Architecture

Scaling a grocery e-commerce platform for 3× traffic growth, five new regions, real-time inventory, and personalised recommendations.

PROPOSED ARCHITECTURE OVERVIEW Users 5 Regions API Layer Gateway + Mesh Services Event-Driven Data Layer DB + Cache + Search
Prepared by
Subhendu Das
Date
March 2026
Context
Scalable Multi-Region Commerce
Problem Statement

The Platform Is Breaking Under Its Own Growth

The platform's traffic doubled in six months. Peak-hour crashes are increasing, and expansion to five new regions is imminent.

Traffic Growth
2×
in 6 months, targeting 3× YoY

New Regions
5

with local currencies, tax & warehouses

Peak Crashes

slow responses and outages at peak hours

Customer Complaints

slow responses & system crashes reported

CURRENT MONOLITH — SINGLE POINT OF FAILURE Growing Users Web + Mobile + API 2× in 6 months FRAGILE MONOLITH Catalog Orders Payments Inv Single Shared Database No Caching Layer Every request hits DB Single Region No DR, no failover Tight Coupling Can't scale independently No Event System Synchronous everything

Missing Business Capabilities

Key gaps identified from the case study that the current platform cannot address.

Real-Time Inventory

Customers need accurate stock levels during browsing and checkout — the monolith has no dedicated inventory service or caching layer.

Personalised Recommendations

The case study requires personalised product suggestions — no ML pipeline, feature store, or recommendation engine exists today.

Faster Delivery SLAs

Regional expansion demands local warehouse routing and fulfilment orchestration that the single-region monolith cannot support.

Cost Optimisation

The brief explicitly calls for cost-effective scaling — monolithic vertical scaling is expensive; independent service scaling is needed.

Multi-Region Operations

Five new regions require local currencies, tax rules, and warehouses — the monolith has no multi-tenancy or regionalisation layer.

Proposed Non-Functional Targets

These targets are proposed based on industry benchmarks — not specified in the original brief.

Browse/Search | 99.9% availability | p95: 200–400 ms
Checkout/Payment | 99.95% availability | p95: 600–1200 ms
Throughput | 3× YoY | with seasonal spikes
Solution Design & Transition

Four Candidates, One Evolutionary Path


Rather than a big-bang migration, we evaluate four architectures and recommend an evolutionary hybrid using the Strangler Fig pattern.

Requirement → Solution Traceability

Every architectural choice maps back to a specific case study requirement.

1. Performance & Scalability

Caching layer, connection pooling, independent service scaling, CQRS read models

2. Regional Expansion

Multi-region deploy, Pricing & Tax context, warehouse routing, local currency support

3. High Availability

Active/active multi-AZ, event backbone for decoupling, canary deploys, error budget gates

4. Business Features

Real-time inventory service, search + recommendations via ML context, faster delivery SLAs

5. Cost Optimisation

Serverless for bursty workloads, independent scaling per service, FinOps phase

Domain Decomposition (14 Bounded Contexts)

BOUNDED CONTEXTS & CONSISTENCY MODEL Identity & Session Sign-in, tokens, session mgmt STRONG Orders Lifecycle, state machine, audit STRONG Payments PSP integration, refunds, regional rails STRONG Cart Ephemeral selection & state STRONG per session Pricing & Tax Calc, regional rules, promos STRONG at checkout Fulfilment & Delivery Warehouse routing, dispatch, SLAs STRONG activation; EVENTUAL status Catalogue Products, offers, locale EVENTUAL (~5min CDN) Notifications Email, push, SMS, in-app AT-LEAST-ONCE; idempotent Customer Profile Prefs, addresses, privacy STRONG writes; EVENTUAL views Risk / Fraud Signals Rules, anomaly, device rep NEAR-REAL-TIME Analytics & ML Events, features, training EVENTUAL Inventory Stock levels, reservations, warehouse availability STRONG reserve; EVENTUAL browse Promotions & Loyalty Coupons, loyalty points, rewards, campaigns EVENTUAL Returns & Refunds Return requests, refund processing, reverse logistics STRONG Strong consistency Eventual consistency Mixed consistency Near-real-time

Four Architecture Candidates

A

Modular Monolith

Hexagonal ports & adapters. Strongest consistency. Lowest ops complexity.

Best early velocity
Risk: "Big ball of mud" without governance
B

Microservices

Sync-first + async side flows. Independent scaling. Saga transactions.

Team autonomy
Risk: Distributed monolith via sync chains
C

Streaming + CQRS

Kafka event backbone. Separate read/write. Multi-consumer fan-out.

Highest throughput
Risk: Event schema sprawl + replay complexity
D

Serverless

Managed functions + event bus. Pay-per-use. Rapid elasticity.

Cost-efficient spikes
Risk: Retry storms + cold-start tail latency
Dimension | A: Monolith | B: Microservices | C: Stream+CQRS | D: Serverless
Delivery Velocity | High (early) | Medium | Medium-Low | High (small features)
Ops Complexity | Lowest | High | Very High | Medium-High
Consistency | Strongest (single TX) | Strong/svc; saga across | Eventual reads; strong writes | Eventual; orchestrator
Latency | Low variance | Hop-sensitive | Fast reads; write lag | Cold-start variance
Cost Shape | Predictable | Higher baseline | Highest (data dup) | Usage-based
Best Fit | Rapid iteration + consistency | Team autonomy + scaling | Many consumers + reads | Spiky, event-heavy
Data Migration | Lowest (in-process) | Medium (per-service DBs) | High (dual-write + projections) | Medium (event replay)
Team Skill Req. | General backend | Platform + DevOps maturity | Event modeling + schema governance | Cloud-native + managed svc
CAP Trade-off | CA (single node) | CP or AP per service | AP reads; CP writes | AP (eventual + retries)
Choose A when:

Strong correctness + fast iteration needed; minimal distributed complexity; extraction-friendly hexagonal boundaries.

Choose B when:

Multiple teams need independent deployability; platform maturity (CI/CD, tracing, contract testing) exists.

Choose C when:

Many downstream consumers need same events; read volume dominates; bounded staleness acceptable.

Choose D when:

Highly bursty/event-driven workload; strong managed-service preference; can engineer around retries + tail latency.

CAP Theorem — A distributed system can guarantee at most two of Consistency, Availability, and Partition-tolerance. The monolith sidesteps the trade-off (single node, no partitions); the hybrid makes per-context choices — CP for payments/orders (strong consistency), AP for catalogue/search (availability + eventual consistency).

Why Kafka? — Compared to RabbitMQ (push-based, lower throughput), SQS (no replay, AWS-only), and Pulsar (smaller ecosystem): Kafka provides durable log replay, high throughput (25K+ evt/sec), partitioned ordering, consumer groups for fan-out, and schema registry integration. Critical for event sourcing, outbox relay, and CQRS projections across 14 bounded contexts.
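The replay property is the decisive differentiator here. A minimal in-memory sketch (illustrative names, no real Kafka client) of why a durable log with per-consumer-group offsets lets a CQRS projection be rebuilt from scratch:

```python
from collections import defaultdict

class EventLog:
    """Minimal stand-in for a Kafka-style durable log: events are
    appended once and can be re-read from any offset, so independent
    consumer groups each keep their own position."""

    def __init__(self):
        self.events = []                 # append-only log
        self.offsets = defaultdict(int)  # consumer group -> next offset

    def publish(self, event):
        self.events.append(event)

    def poll(self, group):
        """Return events this group has not yet seen, advancing its offset."""
        start = self.offsets[group]
        batch = self.events[start:]
        self.offsets[group] = len(self.events)
        return batch

    def replay(self, group, from_offset=0):
        """Rewind a group, e.g. to rebuild a CQRS read projection."""
        self.offsets[group] = from_offset

log = EventLog()
log.publish({"type": "OrderPlaced", "order_id": 1})
log.publish({"type": "StockUpdated", "sku": "A1"})

assert len(log.poll("cg-fulfilment")) == 2   # first read sees everything
assert log.poll("cg-fulfilment") == []       # nothing new since
log.replay("cg-fulfilment")                  # projection rebuild: start over
assert len(log.poll("cg-fulfilment")) == 2
```

A traditional queue deletes messages on acknowledgement; the log keeps them, which is what makes event sourcing and projection rebuilds possible.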

Brownfield vs Greenfield — The platform is a brownfield project (existing monolith → hybrid migration via Strangler Fig). A greenfield approach (building microservices from scratch) would bypass legacy constraints but forfeit existing business logic, data, and customer traffic. The evolutionary hybrid preserves brownfield value while introducing greenfield patterns (event backbone, CQRS, new bounded contexts) incrementally.

Service Integration Patterns

API Gateway

Single entry point for all clients. Handles auth, rate limiting, routing, and acts as the Strangler Facade. The platform's primary pattern.

Aggregator

A composite service calls multiple downstream services and merges results. Used for product detail pages (Catalogue + Pricing + Inventory + Reviews in one response).

Chained

Synchronous service-to-service call chain where each step depends on the prior. Used in checkout: Cart → Pricing → Payment → Order. Risk: latency compounds per hop.

Branch

Request fans out to multiple services in parallel, results merged. Used for search (Catalogue + Personalisation + Pricing queried simultaneously, fastest wins).

Client-Side UI Composition

Frontend (React/Next.js) fetches from multiple BFFs independently and assembles the page. Each UI section maps to a bounded context. Enables independent team deployment.

The platform uses a mix: API Gateway for ingress, Aggregator for composite reads, Chained for transactional flows (with saga compensation), Branch for parallel search, and Client-Side Composition for the storefront.
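The Branch/Aggregator combination for composite reads can be sketched with a thread pool: parallel lookups against three contexts, merged into one product-detail response. All function names below are hypothetical stand-ins for real service calls:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical downstream lookups for a product-detail aggregator.
def fetch_catalogue(sku):  return {"sku": sku, "name": "Oat Milk"}
def fetch_pricing(sku):    return {"price": 2.49, "currency": "EUR"}
def fetch_inventory(sku):  return {"in_stock": True}

def product_detail(sku):
    """Branch pattern: query Catalogue, Pricing and Inventory in
    parallel, then merge the partial results into one response."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, sku)
                   for fn in (fetch_catalogue, fetch_pricing, fetch_inventory)]
        merged = {}
        for fut in futures:
            merged.update(fut.result())
        return merged

detail = product_detail("SKU-42")
assert detail["price"] == 2.49 and detail["in_stock"]
```

The parallel fan-out keeps page latency near the slowest single dependency rather than the sum of all hops, which is the chained pattern's weakness.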

Recommended: Evolutionary Hybrid (Strangler Fig)

Three-stage transition from fragile monolith to scalable target. Each stage delivers value while managing risk.

CURRENT STATE TRANSITION TARGET STATE Monolithic Application Cat Ord Pay Inv Tightly Coupled Single Shared DB Single Region · No DR · No Events Phase 1-4 ~4 weeks Strangler Facade (API Gateway) Identity Catalog Cart Orders Payments Fulfil Outbox → Kafka Event Backbone Redis Cache Multi-AZ DR Phase 5-7 ~8 weeks CDN / WAF → API GW → Service Mesh Identity Catalog Cart Payments Orders Fulfil Kafka Streaming · CQRS · Notify · ML · Fraud PG Redis ES DynDB Active/Active Multi-Region (5 Regions) Domain Services Gateway / Infrastructure Event Backbone (Kafka) Data Stores Legacy / Not-yet-migrated

Hybrid in Action: Purchase Flow

The checkout flow demonstrates how each candidate pattern contributes to the recommended hybrid architecture.

END-TO-END PURCHASE FLOW THROUGH EVOLUTIONARY HYBRID 1. Client Checkout request + auth token (JWT) OAUTH 2.0 / OIDC 2. API Gateway JWT validation Rate limit + mTLS SERVICE MESH (B) 3. Order + Payment Create Order + PaymentIntent PSP tokenised authorise SINGLE TX + SAGA (A) 4. Outbox → Kafka OrderConfirmed event CloudEvents envelope EVENT BACKBONE (C) 5a. Notifications Email + push confirm 5b. Fulfilment Warehouse route + dispatch 5c. Analytics + ML Recs, fraud scoring ASYNC FAN-OUT (C) 6. Edge Workers Receipt PDF gen Webhook dispatch SERVERLESS (D) PATTERN MAP: A: Strong TX (Order+Pay) B: Sync services (Gateway+Mesh) C: Event backbone + async fan-out D: Serverless edge Saga compensation: if fulfilment fails → void payment → revert order
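The saga compensation noted in the flow (fulfilment fails → void payment → revert order) can be sketched as an orchestrator that runs each step's compensation in reverse order on failure. Names are illustrative, not the platform's actual orchestrator:

```python
class SagaAborted(Exception):
    pass

def run_saga(steps):
    """Execute (action, compensation) pairs in order; on any failure,
    run the compensations of completed steps in reverse order."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception as exc:
            for undo in reversed(done):
                undo()
            raise SagaAborted from exc
        done.append(compensate)

trace = []
def fail_fulfilment():
    raise RuntimeError("no warehouse slot")

steps = [
    (lambda: trace.append("order_created"), lambda: trace.append("order_reverted")),
    (lambda: trace.append("payment_taken"), lambda: trace.append("payment_voided")),
    (fail_fulfilment,                       lambda: None),
]
try:
    run_saga(steps)
except SagaAborted:
    trace.append("saga_aborted")

assert trace == ["order_created", "payment_taken",
                 "payment_voided", "order_reverted", "saga_aborted"]
```

Compensations must themselves be idempotent and retry-safe, since the orchestrator may replay them after a crash.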
Transition Strategy

How We Get There — Strangler Fig Migration

Architecture metaphor: the new hybrid (services + events) grows around and replaces the monolith.

The Strangler Fig Pattern

Named after the tropical strangler fig tree that germinates on a host tree, gradually enveloping it with aerial roots until the host decomposes and the fig stands independently.

In software migration, the new system (blue — new services) wraps around the legacy monolith (grey — old code) via an API Gateway facade. Traffic shifts incrementally. As bounded contexts are extracted, the monolith shrinks until safely decommissioned.

Key advantage: Zero big-bang risk. Each phase delivers value independently, and rollback is always possible.

Left: Architecture View

Blue roots (new services & events) wrapping grey monolith trunk. Minimal, clean design.

Right: Natural Analogy

Green fig roots enveloping a decaying host tree — the real-world inspiration for the pattern.

Strangler fig in tropical forest
CURRENT Monolith Single DB · Single Region PHASE 1-2 Stabilise SLOs · Cache · Telemetry PHASE 3-4 Modularise Hex Ports · Kafka · Outbox PHASE 5-6 Extract Services · CQRS · Regions TARGET Evolutionary Hybrid Multi-Region · Event-Driven

8-Phase Rollout

Incremental delivery — each phase produces a working system. Accelerated timelines assume AI-agent-driven development with human oversight for architecture decisions and code review.

Phase | Scope | Timeline
1 Observe & Baseline | Define SLOs, error budgets; instrument with OpenTelemetry | Day 1-3
2 Stabilise | Add caching layer, CDN, connection pooling; run load tests | Wk 1
3 Modularise | Hexagonal ports & adapters; enforce module boundaries with arch tests | Wk 2-3
4 Event Backbone | Introduce event bus + Outbox pattern; CloudEvents schema | Wk 3-4
5 Extract Services | Payment, Order, Catalogue — strangler fig with contract tests | Wk 5-7
6 CQRS + Stream | Selective CQRS projections; read-model optimisation | Wk 7-9
7 Multi-Region | 5 regions — IaC provisioning, data replication, DR runbooks | Wk 9-12
8 FinOps | Cost dashboards, right-sizing, reserved capacity planning | Ongoing

Key Risks & Mitigations

Risk | Mitigation
Distributed monolith | Limit sync depth; async for non-critical flows
Event schema sprawl | AsyncAPI + schema registry + versioning
Module boundary erosion | Hexagonal ports + consumer-driven contracts

Migration Principles

⚠ Never Big-Bang

Strangler Fig wraps old system. Traffic shifts incrementally via gateway.

⚙ Fix Before Split

Stabilise with caching first. Don't extract from a broken monolith.

⭐ Events Before Services

Kafka backbone before extraction prevents distributed monolith.

✓ Verify in Shadow

Dual-write verification + shadow CQRS projections before cutover.
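Shadow verification can be sketched as a dual-read comparator: serve the legacy answer, read the new projection in parallel, and record divergences instead of failing the request. A simplified sketch with hypothetical order data:

```python
def shadow_compare(legacy_read, new_read, keys, mismatches):
    """Serve the legacy answer while also reading the shadow CQRS
    projection; divergences are logged for review, never surfaced."""
    results = {}
    for key in keys:
        old, new = legacy_read(key), new_read(key)
        if old != new:
            mismatches.append((key, old, new))
        results[key] = old          # legacy stays authoritative pre-cutover
    return results

legacy = {"o1": "PAID", "o2": "SHIPPED"}.get
shadow = {"o1": "PAID", "o2": "PACKED"}.get    # projection lagging behind
found = []
out = shadow_compare(legacy, shadow, ["o1", "o2"], found)

assert out["o2"] == "SHIPPED"                  # legacy answer served
assert found == [("o2", "SHIPPED", "PACKED")]  # divergence captured
```

Cutover is authorised only once the mismatch rate stays at zero (or within an agreed staleness bound) over a full traffic cycle.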

Rollback & Safety Nets

Mechanism | How It Works
Blue-Green Deploy | <1s rollback to previous version
Canary Auto-Rollback | Automated revert within 60s if p95 or error-rate SLO breached
Feature Flags | Decouple deploy from release; instant kill-switch
Error Budget Gates | Auto-pause releases when reliability degrades
Expand/Contract | Backward-compatible schema migrations
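The error-budget gate reduces to a simple calculation over the request window. The 10% freeze threshold below matches the CI/CD promotion gate ("auto-freeze if budget <10%"); the helper names are illustrative:

```python
def error_budget_remaining(slo, good, total):
    """Fraction of the error budget still unspent for this window.
    An SLO of 0.9995 over 1M requests allows 500 failures."""
    if total == 0:
        return 1.0
    allowed = (1 - slo) * total
    failed = total - good
    return max(0.0, 1 - failed / allowed) if allowed else 0.0

def release_allowed(slo, good, total, freeze_below=0.10):
    """Gate: pause releases once the remaining budget drops below 10%."""
    return error_budget_remaining(slo, good, total) >= freeze_below

# 1M requests at a 99.95% SLO -> 500 failures budgeted for the window.
assert release_allowed(0.9995, 999_800, 1_000_000)      # 200 spent, 60% left
assert not release_allowed(0.9995, 999_520, 1_000_000)  # 480 spent, 4% left
```

The same arithmetic drives burn-rate alerting: a fast-burn alert fires when the budget is being consumed many times faster than the window allows.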

CI/CD Pipeline & Delivery Metrics

GITOPS PIPELINE Developer Push code GH Actions Build, test, scan ArgoCD Sync + contracts Flagger Canary 5%→100% Prod GA Source / Production CI — Build & Test CD — Deploy & Sync Canary Validation
DORA Metrics

Deploy frequency, lead time, change failure rate, MTTR.

Release Automation

Trunk-based dev. Automated promotion gates. Zero-touch deploys.

Testing Strategy

Monolith: unit + module tests. Microservices: add contract tests. Serverless: add event-replay tests.

Contract Governance

OpenAPI + AsyncAPI + schema registry for events.

Key insight: The evolutionary hybrid combines the best of all four candidates — strong transactions where correctness matters, async events for scale, and serverless for bursty edge workloads.

System Architecture & Technology

Target-State System Architecture

Full layered view of the evolutionary hybrid architecture — from client edge to data persistence, with technology choices and performance strategies.


End-to-End System Design

Holistic single-page view: actors, UI portals, API gateway, domain services with interconnections, message brokers, 3rd-party integrations, caching, databases, notifications, and sidecar observability.

E-COMMERCE PLATFORM — END-TO-END SYSTEM DESIGN Customer Store Staff Admin / Business Delivery Partner Web App (React SPA) + Mobile (React Native) PWA | Offline cart | Push notifications POS Terminal App In-store scanning & checkout Admin Dashboard (React + Analytics) Business KPIs | Inventory mgmt | User mgmt Delivery Tracker App Route optimisation | Status updates CloudFront CDN → AWS WAF (rate limit, geo-block, bot protection) → BFF / GraphQL Gateway → Kong API Gateway (auth, throttle, route) TLS 1.3 termination | JWT validation | Request deduplication | Circuit breaker at edge DOMAIN SERVICES (EKS + Istio Service Mesh) — 14 Bounded Contexts Identity & Auth JWT | OAuth2 | MFA Cognito + custom Catalogue Products | Categories ES search + PG Cart Add | Remove | Merge Redis (ephemeral) Orders Place | Track | Cancel PG + Saga orchestrator Payments Charge | Refund | Ledger PG + PCI vault Inventory Stock | Reserve | Adjust PG + Redis hot cache Fulfilment Pick | Pack | Ship | Slot PG + geo-index Profile Prefs | Addresses | Loyalty DynamoDB global table Promotions & Loyalty Coupons | Points | Tiers PG + Redis promo cache Returns & Refunds RMA | Inspect | Restock PG + linked to payments Fraud & Risk Rules | Anomaly | Device ES audit + ML scoring Analytics & ML Reco | Search rank | Forecast SageMaker + feature store Pricing & Tax Tax rules | Dynamic pricing PG + 3rd-party tax API Notifications SMS | Email | Push | WhatsApp DynamoDB + SQS fanout 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 19 20 N = Synchronous (gRPC / REST) N = Asynchronous (Kafka events) 3RD-PARTY INTEGRATIONS Stripe / Adyen Payment gateway Google Maps API Routing & geocoding SAP / ERP Finance & supply chain Twilio / SES / FCM Notifications (SMS/Email/Push) Delivery Partners API Last-mile logistics 9 10 21 KAFKA EVENT BACKBONE (Amazon MSK — 3 brokers × 3 AZ) order.events payment.events inventory.events cart.events fulfilment.events catalogue.events promo.events return.events notification.events publish / 
subscribe REDIS CACHE LAYER (ElastiCache Cluster) cart:{userId} session:{token} stock:{sku} rate-limit:{client} promo:{code} catalogue:{category} DATABASES (per-service ownership) PostgreSQL (RDS) orders | payments | identity inventory | fulfilment | returns Elasticsearch catalogue-search order-history | audit-logs DynamoDB user-profiles | notif-log ml-feature-store S3 Object Storage Product images | invoices Event archive (Parquet) CQRS Reporting Athena + Redshift Business dashboards NOTIFICATION SUBSYSTEM SMS Email Push WhatsApp notification.events SIDECAR / OBSERVABILITY (attached to every pod) Distributed Tracing OTel → Grafana Tempo Metrics Prometheus → Mimir Log Aggregation Fluentd → Grafana Loki Health Checks Liveness / Readiness probes Alerting Alertmanager → PagerDuty Dashboards Grafana (golden signals) INFRASTRUCTURE PLATFORM Terraform (IaC) ArgoCD (GitOps) Vault (Secrets) ECR (Images) GitHub Actions (CI) Flagger (Canary) KEY: Clients/UI Edge/Notifications Domain Services 3rd Party / ES Kafka / DynamoDB PostgreSQL Observability Infra Platform

Synchronous Calls (gRPC / REST via Istio mTLS)

# | From | To | Call / Purpose
1 | Identity & Auth | All services | validateToken() — JWT verification on every request
2 | Cart | Catalogue | getPrice() — fetch current price & product details
3 | Cart | Inventory | checkStock() — verify availability before adding to cart
4 | Orders | Payments | chargePayment() — process payment during checkout (saga step)
5 | Orders | Inventory | reserveStock() — reserve items during checkout (saga step)
6 | Orders | Fraud & Risk | riskCheck() — fraud score before order confirmation
7 | Payments | Stripe / Adyen | processPayment() — external payment gateway call
8 | Fulfilment | Google Maps | geocode() / optimiseRoute() — delivery routing
9 | Notifications | Twilio / SES / FCM | send() — dispatch SMS, email, or push notification
10 | Fulfilment | Delivery Partners | dispatch() — hand off to last-mile logistics partner

Asynchronous Events (Kafka — non-blocking, eventual consistency)

# | Producer | Consumer | Event → Topic
11 | Orders | Fulfilment | OrderPlaced → order.events — trigger pick/pack/ship
12 | Orders | Notifications | OrderConfirmed → notification.events — email + push
13 | Payments | Orders | PaymentCompleted → payment.events — confirm order
14 | Inventory | Catalogue | StockUpdated → inventory.events — reindex search
15 | Fulfilment | Notifications | ShipmentDispatched → fulfilment.events — SMS/push
16 | Returns | Inventory | RefundApproved → return.events — restock items
17 | Returns | Payments | RefundApproved → return.events — issue refund
18 | Promotions | Orders | CouponApplied → promo.events — apply discount
19 | Fraud & Risk | Orders | FraudFlagged → fraud.events — block/review order
20 | All services | Analytics & ML | *.* — fan-out consumer of all events for ML features
21 | Orders | SAP / ERP | OrderCompleted → order.events — sync to finance system
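Because Kafka delivery is at-least-once, every consumer above must tolerate redelivery. A minimal idempotent-consumer sketch keyed on event id (in production the seen-set would live in Redis or the consumer's own database, not in memory):

```python
def make_idempotent(handler, seen):
    """Wrap an event handler so redelivered events are processed
    exactly once, deduplicated by event id."""
    def handle(event):
        if event["id"] in seen:
            return False            # duplicate delivery: skip side effects
        handler(event)
        seen.add(event["id"])
        return True
    return handle

restocked = []
handle = make_idempotent(lambda e: restocked.append(e["sku"]), set())

event = {"id": "evt-16", "type": "RefundApproved", "sku": "A1"}
assert handle(event)         # first delivery restocks the item
assert not handle(event)     # redelivery is a no-op
assert restocked == ["A1"]
```

Marking the event as seen and applying its side effect should happen in one transaction, otherwise a crash between the two reintroduces duplicates.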

Layered Architecture Detail

Detailed layered view with technology choices per component — zoom into any layer from the end-to-end design above.

CLIENTS Web App React / Next.js iOS App SwiftUI 3P API Partner REST EDGE & INGRESS CloudFront CDN + WAF Route 53 Global DNS Global Accel Anycast routing Rate Limit Per-tenant API GATEWAY + SERVICE MESH AuthN / AuthZ JWT Validation mTLS Termination Traffic Routing Circuit Breakers Request Tracing (OTel) DOMAIN SERVICES (K8s) Identity OAuth 2.0 / OIDC Pluggable IdP adapter STRONG Catalogue Products, offers, search ES read model EVENTUAL Cart Session state, pricing Redis-backed STRONG/SESSION Orders Lifecycle, saga orchestr. PG primary STRONG Payments PSP tokenisation + regional rails PCI-scoped isolation STRONG Fulfilment Warehouse routing, dispatch STRONG dispatch Pricing & Tax Regional rules, promos STRONG/checkout Notifications Email, push, SMS AT-LEAST-ONCE Profile Prefs, addresses MIXED Fraud / Risk Rules + ML scoring NEAR-REAL-TIME Analytics & ML Recs engine, A/B EVENTUAL Inventory Stock levels, reservations, warehouse avail. STRONG reserve; EVENTUAL browse Promotions & Loyalty Coupons, loyalty points, rewards, campaigns EVENTUAL Returns & Refunds Return requests, refund processing, reverse logistics STRONG KAFKA EVENT BACKBONE (OUTBOX RELAY + CLOUDEVENTS + IDEMPOTENT CONSUMERS) order.* payment.* inventory.* catalogue.* notify.* fraud.* analytics.* promo.* return.* dlq.* POLYGLOT DATA LAYER PostgreSQL ACID source of truth Multi-AZ · Read replicas · PgBouncer Redis Cluster Sub-ms cache + sessions Multi-AZ · Sentinel HA Elasticsearch Full-text search + CQRS reads Cross-AZ · Auto-sharded DynamoDB Feature store + ML recs Global tables · 6hr TTL OBSERVABILITY OpenTelemetry Collector Traces + Metrics + Logs Prometheus Time-series DB Grafana Dashboards + SLOs PagerDuty Alerting + On-call Kubecost FinOps + right-size Distributed Trace Correlation Trace ID → Span ID → Service Map INFRASTRUCTURE & PLATFORM EKS (K8s) 3-AZ per region Terraform IaC per-region ArgoCD + Flagger GitOps + canary deploy Vault Secrets auto-rotate HPA / VPA Auto-scaling Cosign Image signing 
Multi-Region (5 Regions) US · EU (GDPR) · APAC · India (RBI) · Brazil (LGPD) DR progression: Pilot Light → Warm Standby → Active/Active Region-local data masters · Cross-region async replication KEY: Clients Edge / Ingress Gateway / Mesh Domain Services Event Backbone Data Layer Observability / Infra FLOW: Client → Edge (CDN+WAF+DNS) → API Gateway (mTLS) → Domain Services (K8s) ↔ Kafka Event Backbone → Polyglot Data Layer All layers instrumented via OpenTelemetry · GitOps + canary via ArgoCD+Flagger · Auto-scaling via HPA/VPA · DLQ with capped retries · Secrets via Vault

Technology Choices, Performance & Security

Metrics are proposed targets based on industry benchmarks; final values to be validated during load testing.

Core Technology Stack

Technology | Role
Kubernetes (EKS) | Container orchestration, HPA auto-scaling
Service Mesh (e.g. Istio) | mTLS, traffic mgmt, circuit breaking
Apache Kafka | Event streaming: 25K evt/sec, outbox relay
PostgreSQL | ACID transactions, read replicas, sharding
Redis Cluster | Sub-ms cache, sessions (80-90% DB offload)
Elasticsearch | Full-text search, CQRS read models
OpenTelemetry | Vendor-neutral traces, metrics, logs
DynamoDB | Feature store, global tables, pay-per-req
AWS Lambda | Serverless for bursty workloads + edges
ArgoCD + Flagger | GitOps, canary deploys, auto-rollback
Prometheus + Grafana | K8s-native monitoring, dashboards
Kubecost | FinOps: cost visibility, right-sizing
Terraform | IaC: parameterised regional modules

Multi-Layer Caching

MULTI-LAYER CACHING STRATEGY L1: CDN Static assets Edge caching L2: Redis Sub-ms reads 80-90% offload L3: DB Read replicas Connection pool L4: ETag HTTP 304 Client cache
Cache Patterns: Cache-Aside (lazy load) — app checks cache first; on miss, reads DB and populates cache. Default for catalogue/inventory. Cache-Put (write-through) — writes update both DB and cache atomically, ensuring cache is always fresh. Used for sessions and cart. Write-Behind — writes go to cache first, async flush to DB. Used for analytics counters (eventual consistency acceptable). Eviction: TTL-based (short TTL for prices/stock, 6hr for ML features) + LRU fallback when memory pressure hits. Event-driven invalidation via Kafka for catalogue updates.
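Cache-aside with a short TTL, the default for catalogue and inventory reads above, can be sketched as follows (an illustrative in-memory model, not the ElastiCache client; the injected clock exists only to make expiry testable):

```python
import time

class CacheAside:
    """Cache-aside with TTL: check the cache first; on a miss, read the
    source of truth and populate the cache (short TTL for stock levels)."""

    def __init__(self, load, ttl_seconds, clock=time.monotonic):
        self.load, self.ttl, self.clock = load, ttl_seconds, clock
        self.store = {}                       # key -> (value, expires_at)

    def get(self, key):
        hit = self.store.get(key)
        if hit and hit[1] > self.clock():
            return hit[0], "hit"
        value = self.load(key)                # miss: read the database
        self.store[key] = (value, self.clock() + self.ttl)
        return value, "miss"

db_reads = []
def read_db(sku):
    db_reads.append(sku)
    return 7

stock = CacheAside(read_db, ttl_seconds=60)
assert stock.get("A1") == (7, "miss")
assert stock.get("A1") == (7, "hit")    # second read served from cache
assert db_reads == ["A1"]               # database touched exactly once
```

Event-driven invalidation complements the TTL: a StockUpdated event deletes the key immediately instead of waiting for expiry.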

Security Architecture

Identity Layer | Hex adapter wraps external IdP; pluggable for future providers
Transport | TLS 1.3 external + mTLS pod-to-pod via service mesh
Payment Isolation | PCI DSS 4.0 scope reduced to 1 service via PSP tokenisation adapter
Zero Trust | NIST SP 800-207; service identities via SPIFFE
Secrets | Vault auto-rotation, K8s external-secrets + RBAC. See Vault vs AWS KMS below.
Encryption at Rest | AES-256 via AWS KMS for RDS, S3, EBS, Kafka (at-rest encryption), backups
Supply Chain | SBOM generation, image signing (Cosign/Sigstore), image scanning
Verification | OWASP ASVS 5.0.0 (Level 2) + SAST, SCA & DAST in CI

Multi-Region Topology

Route 53 + Global Accelerator US (Primary) EKS 3-AZ RDS Multi-AZ us-east-1 EU (GDPR) EKS 3-AZ RDS eu-west Data residency APAC EKS 3-AZ RDS ap-* ap-southeast India (RBI) EKS ap-south · Local payments Brazil (LGPD) EKS sa-east · Data residency
Why Redis over Memcached/Hazelcast?

Redis offers data structures (sorted sets, hashes, streams), pub/sub, Lua scripting, persistence (RDB/AOF), and multi-AZ Sentinel HA — all missing from Memcached. Hazelcast adds distributed compute but with higher memory overhead and a smaller managed-service ecosystem on AWS. Redis is open-source (BSD licence, free); AWS ElastiCache/MemoryDB is the managed option (paid, ~$0.017/hr for cache.t3.micro). The platform uses ElastiCache for production HA.

Secrets: HashiCorp Vault vs AWS KMS/Secrets Manager

Vault — cloud-agnostic, dynamic secrets, auto-rotation, fine-grained RBAC, audit log, K8s external-secrets operator. Best for multi-cloud or hybrid. AWS KMS — fully managed envelope encryption, tight IAM integration, lower ops overhead but AWS-locked. AWS Secrets Manager — managed key-value store with rotation via Lambda. The platform uses Vault for portability across regions (multi-cloud roadmap) + K8s-native secret injection, with KMS for envelope encryption of Vault's storage backend.

Resilience Patterns, Scaling & Observability

Circuit Breaker

5 failures → fail-fast 30s → half-open test. ~58% cascade reduction.
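The closed → open → half-open state machine described above, as a minimal sketch (thresholds and clock injected for testability; illustrative, not a production breaker):

```python
import time

class CircuitBreaker:
    """Closed -> open after `threshold` consecutive failures; after
    `cooldown` seconds a single half-open trial call decides whether
    the circuit closes again or re-opens."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: fail fast")
            self.opened_at = None            # half-open: allow one trial
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0                    # success closes the circuit
        return result

now = [0.0]
cb = CircuitBreaker(threshold=2, cooldown=30.0, clock=lambda: now[0])

def boom():
    raise ValueError("upstream down")

for _ in range(2):                           # two failures trip the breaker
    try: cb.call(boom)
    except ValueError: pass
try:
    cb.call(lambda: "ok")
    raise AssertionError("should have failed fast")
except RuntimeError:
    pass                                     # open: callers fail instantly
now[0] = 31.0
assert cb.call(lambda: "ok") == "ok"         # half-open trial succeeds
```

Failing fast while open is what stops one slow dependency from exhausting caller thread pools and cascading upstream.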

Retry + Backoff

Exponential jitter. Idempotency keys.
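Exponential backoff with "full jitter" can be sketched as below. In real calls each retry would also carry the same idempotency key so the server can deduplicate; here sleeping is stubbed out so the sketch runs instantly:

```python
import random

def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    """Full-jitter schedule: a random delay in [0, min(cap, base * 2**n))
    before retry n, so synchronized clients don't retry in lockstep."""
    return [rng() * min(cap, base * (2 ** a)) for a in range(attempts)]

def retry(fn, attempts=5, sleep=lambda s: None):
    """Retry fn with jittered exponential backoff; re-raise the last
    error once attempts are exhausted. Real code passes time.sleep."""
    last = None
    for delay in backoff_delays(attempts=attempts):
        try:
            return fn()
        except Exception as exc:
            last = exc
            sleep(delay)
    raise last

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "done"

assert retry(flaky) == "done"
assert len(calls) == 3          # two failures, then success
```

The cap matters as much as the jitter: without it, late retries can sleep for minutes and hold resources that load shedding would rather reclaim.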

Bulkhead

Isolated thread/conn pools. Pod Disruption Budgets.

Health Checks

Liveness (restart) + Readiness (remove from LB).

Rate Limiting

Per-tenant quotas via service mesh. Hard reject above threshold. Prevent noisy-neighbour.

Request Throttling

Gradual backpressure (HTTP 429 + Retry-After) before hard limit. Token bucket at gateway level. Distinct from rate limiting — slows rather than rejects.
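The token bucket named above, in miniature. `now` is passed explicitly so the behaviour is deterministic here; real code would read `time.monotonic()`, and a rejection maps to HTTP 429 with a Retry-After hint:

```python
class TokenBucket:
    """Token bucket: refill at `rate` tokens/sec up to `capacity`.
    A request proceeds only if a whole token is available; otherwise
    the gateway throttles it (429 + Retry-After)."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2)
assert bucket.allow(0.0) and bucket.allow(0.0)  # burst of 2 allowed
assert not bucket.allow(0.0)                    # third is throttled
assert bucket.allow(1.0)                        # one token refilled after 1s
```

The capacity sets the permitted burst, the rate sets the sustained throughput; tuning the two independently is what distinguishes throttling from a hard rate limit.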

Load Shedding

Under extreme load, drop low-priority requests (analytics, recs) to protect critical paths (checkout, payments). Priority-based queue with CPU/memory triggers.

Graceful Degradation

Serve stale cache if upstream fails. Priority queues. Reduced functionality over total outage.

Auto-Scaling

HPA: CPU >70% / Mem >80% → scale pods ~30s. VPA: 7-day analysis → right-size. Cluster Autoscaler: Add nodes for unschedulable pods.

DB Scaling

Read replicas: 1-2s lag. PgBouncer: ~600-700 conns. Sharding: by order_id (start 4, grow to 12).
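Hash-based shard routing by order_id can be sketched with a stable digest (a deterministic hash rather than Python's per-process seeded `hash()`). The check at the end also shows why growing from 4 to 12 shards remaps a large share of keys, which is what forces a migration window, or consistent hashing to bound key movement:

```python
import hashlib

def shard_for(order_id, shard_count):
    """Route an order_id to a shard via a stable hash, so every
    process and language computes the same placement."""
    digest = hashlib.sha256(str(order_id).encode()).hexdigest()
    return int(digest, 16) % shard_count

ids = range(1000)
assert all(0 <= shard_for(i, 4) < 4 for i in ids)

# Modulo routing remaps most keys when shard count changes 4 -> 12.
moved = sum(shard_for(i, 4) != shard_for(i, 12) for i in ids)
assert moved > 0
```

With plain modulo, roughly two thirds of keys move on a 4 → 12 resize; a consistent-hashing ring would limit movement to the keys adjacent to the new shards.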

Observability

OTel: Trace/span correlation across all hops. Dashboards: Grafana SLO burn-rate alerts + service maps. Cost: Head-based sampling + 15-day retention for high-cardinality traces.

Kubernetes Deployment Architecture

Physical deployment topology across 3 Availability Zones per region, showing how domain services map to EKS namespaces, pods, and supporting infrastructure.

EKS DEPLOYMENT TOPOLOGY (PER REGION) ECR Registry Cosign-signed images ArgoCD + Flagger GitOps + canary rollout Vault Secrets auto-rotate OTel Collector DaemonSet per node HPA / VPA Auto-scaling policies Terraform IaC per-region modules EKS CLUSTER (3 AVAILABILITY ZONES) AVAILABILITY ZONE 1 ns: core-services Orders Payments Cart Identity Inv 2-5 replicas each (HPA) ns: support-services Catalogue Fulfilment Notify Promo Returns 2-3 replicas each (HPA) ns: data-plane PG Primary Redis Leader Kafka Broker ES Data Node AVAILABILITY ZONE 2 ns: core-services (replica) Orders Payments Cart Identity Inv ns: support-services (replica) Catalogue Fulfilment Notify Promo Returns ns: data-plane (replica) PG Replica Redis Replica Kafka Broker ES Data Node AVAILABILITY ZONE 3 ns: core-services (replica) Orders Payments Cart Identity Inv ns: support-services (replica) Catalogue Fulfilment Notify Promo Returns ns: data-plane (replica) PG Replica Redis Replica Kafka Broker ES Data Node Service Pods (HPA-scaled) Stateful Data (multi-AZ replication) Kafka (3-broker quorum per AZ) Platform / Infra tooling AZ boundary (dashed = cross-AZ replication)

Network & VPC Architecture

AWS VPC layout showing how traffic flows from the internet through public and private subnets to reach application pods and data stores.

VPC NETWORK TOPOLOGY (PER REGION) Internet (Clients + 3P APIs) CloudFront CDN + AWS WAF + Shield + Route 53 VPC 10.0.0.0/16 PUBLIC SUBNETS (10.0.1-3.0/24) ALB (Ingress) TLS termination NAT Gateway Outbound internet Security Group: sg-public (443, 80 inbound) PRIVATE SUBNETS — APP (10.0.10-12.0/24) EKS Node Group Service pods (mTLS) Istio Service Mesh Traffic routing + mTLS Security Group: sg-app (mesh-only, no direct internet) PRIVATE SUBNETS — DATA (10.0.20-22.0/24) RDS (PG) Multi-AZ Redis ElastiCache MSK Kafka managed Security Group: sg-data (app-only, port-specific) FLOW: Internet → CloudFront (CDN cache + WAF filter) → ALB (TLS termination) → Istio Mesh (mTLS + routing) → K8s Pods → Data stores (encrypted at rest + in transit)

CI/CD Pipeline Architecture

Trunk-based development with automated promotion gates. Zero-touch deployment from commit to production via ArgoCD + Flagger canary rollout.

CI/CD PIPELINE (TRUNK-BASED · ZERO-TOUCH DEPLOYS) Commit Trunk-based dev Short-lived branches GitHub Build Docker multi-stage SBOM generation GitHub Actions Test Unit + integration Contract tests (Pact) >80% coverage gate Security Snyk + Trivy scan SAST + dependency audit Zero critical CVE gate Publish Push to ECR Cosign image signing Immutable tags Deploy ArgoCD sync Flagger canary (5% → 100%) Error budget gate Production Live traffic (all regions) SLO monitoring active Feature flags enabled Auto-rollback on canary failure AUTOMATED PROMOTION GATES Gate 1: Quality Tests pass + coverage >80% Contract tests green Gate 2: Security Zero critical/high CVEs Signed image verified Gate 3: Canary Error rate <0.1% at 5% traffic p99 latency within SLO Gate 4: Error Budget Monthly SLO budget remaining Auto-freeze if budget <10% Gate 5: Compliance SBOM generated + stored Audit trail recorded

Data Flow & Event-Driven Architecture

How domain events flow through the Kafka backbone between producers and consumers, including CQRS read/write separation and event sourcing paths.

EVENT-DRIVEN DATA FLOW (CQRS + EVENT SOURCING) WRITE PATH (COMMANDS) CLIENT COMMANDS PlaceOrder | AddToCart | UpdateProfile | ProcessPayment API GATEWAY (Kong) — Validate & Route DOMAIN SERVICES (WRITE MODELS) Command Handler → Validate → Apply Aggregate Root → Domain Events Transactional Outbox Pattern (Write to DB + Event in same TX) WRITE STORE (PostgreSQL) Normalised, ACID, per-service schema KAFKA EVENT BACKBONE (MSK) Partitioned by aggregate ID | 7-day retention | Schema Registry (Avro) order.events payment.events cart.events inventory.events catalogue.events fulfilment.events promo.events return.events user.events notification.events analytics.events fraud.events CONSUMER GROUPS cg-fulfilment cg-notifications cg-analytics cg-search-index SAGA ORCHESTRATOR Order → Payment → Inventory → Fulfilment (compensating TXs on failure) DEAD LETTER QUEUE (DLQ) Failed events → retry (3x exp backoff) → DLQ → alert → manual review publish events READ PATH (QUERIES) READ PROJECTIONS Elasticsearch — full-text search, catalogue Redis — session, cart, hot inventory counts DynamoDB — order history, user profiles (Eventually consistent — write lag <500ms p99) project QUERY SERVICES (READ MODELS) Thin query handlers — no business logic BFF (GraphQL) — Aggregate Read Views CLIENTS (Web / Mobile / POS) EVENT STORE (Kafka Log + S3 Archive) Immutable event log — full replay capability — 7 days hot (Kafka) + unlimited cold (S3 Parquet) Schema Registry (Avro) — backward compatible evolution — AsyncAPI contracts persist ANALYTICS & ML PIPELINE S3 → Glue ETL → Athena / Redshift → ML Feature Store → SageMaker KEY: Clients Gateway/Saga Domain Services Kafka Backbone Data/Read Stores BFF/Edge Analytics/ML

Database Schema & Data Ownership Map

Each bounded context owns its data store exclusively — no shared databases. Shows which service owns which storage technology and key entities.

DATA OWNERSHIP MAP — DATABASE PER SERVICE PostgreSQL (RDS) — STRONG CONSISTENCY orders-db Owner: Order Service Tables: orders, order_items, order_status_history, sagas Consistency: Strong (ACID) Multi-AZ | Read replica payments-db Owner: Payment Service Tables: transactions, refunds, payment_methods, ledger Consistency: Strong (ACID) Encrypted at rest | PCI-DSS identity-db Owner: Identity & Auth Tables: users, roles, sessions, oauth_tokens, audit_log Consistency: Strong (ACID) Hashed passwords | MFA inventory-db Owner: Inventory Service Tables: stock_levels, warehouses, reservations, stock_movements Consistency: Mixed (Strong writes) Optimistic locking | Hot path fulfilment-db Owner: Fulfilment Service Tables: shipments, routes, delivery_slots, tracking Consistency: Eventual Geo-indexed | Time-slot locking returns-db Owner: Returns & Refunds Tables: return_requests, refund_ledger, return_reasons Consistency: Strong (ACID) Linked to payments-db via events Redis (ElastiCache) — LOW LATENCY / CACHING cart-cache Owner: Cart Service Keys: cart:{userId}, ttl 24h Cluster mode | 3 shards session-store Owner: Identity & Auth Keys: sess:{token}, ttl 30m Sentinel HA | Encrypted inventory-hot Owner: Inventory Service Keys: stock:{sku}, ttl 60s Write-through from PG rate-limit Owner: API Gateway (Kong) Keys: rl:{clientId}:{endpoint} Sliding window counters promo-cache Owner: Promotions & Loyalty Keys: promo:{code}, loyalty:{uid} TTL varies by campaign Elasticsearch (OpenSearch) — FULL-TEXT SEARCH & ANALYTICS catalogue-search Owner: Catalogue Service (CQRS read projection) Indices: products, categories, brands 3 primary shards × 2 replicas | Fuzzy + autocomplete order-history-search Owner: Order Service (CQRS read projection) Indices: orders_view, order_analytics Time-based indices | ILM rollover 30d audit-logs Owner: Security / Fraud Service Indices: security_events, fraud_signals Immutable | 90-day retention | Compliance DynamoDB — HIGH THROUGHPUT KEY-VALUE user-profiles 
Owner: Profile Service PK: userId | SK: #PROFILE | #PREFS | #ADDR On-demand capacity | Global tables (multi-region) notifications-log Owner: Notification Service PK: userId | SK: timestamp#channel TTL 90d auto-expire | Streams enabled ml-feature-store Owner: Analytics & ML Service PK: featureGroup | SK: entityId#version Point-in-time lookups for model inference DATA ISOLATION RULES ✗ No shared databases between services ✗ No direct DB-to-DB queries across contexts ✗ No distributed transactions (use sagas) ✓ Cross-context reads via Kafka projections ✓ API composition for aggregate views ✓ Event-carried state transfer for denorm Schema governance: AsyncAPI contracts + Confluent Schema Registry (Avro) + backward-compatible evolution only DB migrations: Flyway + blue/green schema deploys STORES: PostgreSQL (Strong) Redis (Cache/Session) Elasticsearch (Search/Analytics) DynamoDB (Key-Value)

Observability Stack & Telemetry Pipeline

End-to-end observability: how metrics, logs, and traces flow from application services through OpenTelemetry collectors to dashboards and alerting.

OBSERVABILITY & TELEMETRY PIPELINE SIGNAL SOURCES APPLICATION SERVICES OTel SDK (auto-instrumented) Traces (OTLP) Metrics (OTLP) Logs (stdout) Events (custom) W3C TraceContext propagation INFRASTRUCTURE Node Exporter | cAdvisor | kube-state CPU, memory, disk, network, pod status Prometheus scrape interval: 15s DATA LAYER Kafka (JMX) | PG (pg_stat) | Redis (INFO) Lag, connections, cache hit ratio, replication COLLECTION & PROCESSING OTel Collector (DaemonSet) Receive → Process → Export Batching | Sampling (tail-based) | Enrichment Trace Pipeline → Tempo Metrics Pipeline → Prometheus/Mimir Log Pipeline → Loki Event Pipeline → Kafka → S3 STORAGE BACKENDS Grafana Tempo Distributed traces | S3 backend | 14d retention Prometheus / Mimir Time-series metrics | 90d retention | HA pair Grafana Loki Log aggregation | Label-indexed | 30d retention S3 (Event Archive) Long-term event store | Parquet | Athena queryable VISUALIZATION & ALERTING GRAFANA DASHBOARDS • Platform Overview — golden signals (latency, traffic, errors, saturation) • Service Health — per-service SLI/SLO dashboards • Kafka Lag — consumer group lag, partition health • Business KPIs — orders/min, conversion, cart abandonment GitOps-managed dashboard-as-code (Jsonnet) ALERTING ENGINE Alertmanager → PagerDuty (P1/P2) | Slack (P3/P4) SLO burn-rate alerts (multi-window, multi-burn) Error budget consumption → auto-pause deploys Escalation: 5m ack → 15m page → 30m incident commander SLO FRAMEWORK — ERROR BUDGET DRIVEN OPERATIONS Availability SLO 99.95% (≈22 min/mo) Measured: successful req / total Latency SLO (p99) API <200ms | Page <1.5s Histogram buckets at p50/p90/p99 Throughput SLO 5K req/s sustained peak HPA scales at 70% target Data Freshness SLO Search index <5s | Inventory <500ms Kafka consumer lag monitoring ERROR BUDGET POLICY >50% remaining → ship freely | 25-50% → feature freeze, fix only | <25% → deploy freeze, reliability sprint Budget resets monthly | Reviewed in weekly SRE sync | Exemptions require VP 
approval SIGNALS: Traces Metrics Logs Events Dashboards/Alerts
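The availability figures in the SLO framework above all follow from one formula: error budget = (1 − SLO) × period. A quick check of the numbers quoted in this deck:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    return (1.0 - slo) * days * 24 * 60

print(round(error_budget_minutes(0.9995), 1))  # 21.6 -> the "~22 min/mo" above
print(round(error_budget_minutes(0.999), 1))   # 43.2 -> the "~43 min/month" cited later
```

Burn-rate alerting compares the rate of budget consumption against this total: burning a month's 21.6 minutes in a day is a page, while slow steady burn is a ticket.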
Business Alignment & Operations

Trade-Offs, Cost & Business Impact


Key Trade-Offs & Mitigations

Trade-Off | Risk | Mitigation
Microservices vs. monolith | Ops overhead | Extract services only when scaling pain justifies it
Eventual consistency | Stale data | Strong consistency for financials; short-TTL caches elsewhere
Selective CQRS | Complexity | Apply only where the read/write ratio demands it
Multi-region | Cost + sync | Pilot light → active/active
IdP / PSP coupling | Vendor changes | Hexagonal adapters; pluggable identity + payment providers
Serverless lock-in | Migration risk | CloudEvents + adapter isolation
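The hexagonal-adapter mitigation for IdP/PSP coupling means the domain depends only on a port it owns, with one adapter per vendor. A minimal sketch; the names below are illustrative, not actual provider SDKs:

```python
from typing import Protocol

class PaymentPort(Protocol):
    """Hexagonal port: domain code depends only on this interface,
    so the PSP behind it can be swapped without touching core logic."""
    def charge(self, token: str, amount_minor: int, currency: str) -> str: ...

class StubPsp:
    """Illustrative adapter; a real one would wrap a provider's SDK
    and translate its errors into domain-level failures."""
    def charge(self, token: str, amount_minor: int, currency: str) -> str:
        return f"txn-{token}-{amount_minor}{currency}"

def checkout(psp: PaymentPort, token: str) -> str:
    # Domain code is provider-agnostic; a vendor change is a new adapter.
    return psp.charge(token, 1999, "USD")

print(checkout(StubPsp(), "tok_abc"))  # txn-tok_abc-1999USD
```

`Protocol` gives structural typing: any adapter with a matching `charge` signature satisfies the port without inheriting from it.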

Data Migration Tactics

Challenge | Approach
DB ownership split | Shared schema → per-service databases via Change Data Capture
Sync → async | Dual-write with outbox verification
CQRS introduction | Shadow projections; compare, then switch
Data residency compliance | Region-local masters; cross-region replication policy
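The shadow-projection tactic serves the legacy answer while comparing it against the new CQRS projection on every read, logging divergence until parity is proven. A minimal sketch with hypothetical stores and keys:

```python
import logging

def read_with_shadow(legacy_read, projection_read, key):
    """Serve the legacy answer, but compare it against the new CQRS
    projection and log any divergence before cutting over."""
    legacy = legacy_read(key)
    shadow = projection_read(key)
    if legacy != shadow:
        logging.warning("shadow mismatch for %s: %r != %r", key, legacy, shadow)
    return legacy  # legacy stays authoritative until parity is proven

# Stand-ins for the monolith's DB and the new read projection.
legacy_db = {"ord-1": {"status": "SHIPPED"}}
projection = {"ord-1": {"status": "SHIPPED"}}
print(read_with_shadow(legacy_db.get, projection.get, "ord-1"))
```

Once the mismatch rate stays at zero over a representative traffic window, the return value flips to the projection and the legacy query path is retired.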

Cost Optimisation (~$24K/mo est.)

Costs are estimated based on published cloud pricing at proposed scale; actual costs depend on provider and workload.

MONTHLY COST BREAKDOWN (~$24K): K8s Compute $18,000 · Burst (Spot) $3,600 · CDN + Redis $1,200 · Serverless $800 · Monitoring $400. Proposed: 10K QPS baseline, scales to 30K QPS (3×) · Reserved + Spot + Serverless mix · FinOps via Kubecost

Compute Savings Strategy

Reserved 1yr: ~40% savings
Reserved 3yr: ~60% savings
Spot burst: 70–90% savings
Serverless: ~99% savings for intermittent workloads
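These per-tier percentages combine into a blended rate once a workload mix is chosen. A sketch with a hypothetical 60/25/15 split across reserved, Spot, and on-demand (the discounts mirror the ranges quoted above; the mix itself is an assumption to validate against real utilisation):

```python
def blended_cost(on_demand_monthly: float, mix: dict, discount: dict) -> float:
    """Weighted monthly cost for a compute mix; mix shares must sum to 1."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(on_demand_monthly * share * (1 - discount[tier])
               for tier, share in mix.items())

# Hypothetical split applied to the $18K K8s compute line item.
cost = blended_cost(
    18_000,
    {"reserved_1yr": 0.60, "spot": 0.25, "on_demand": 0.15},
    {"reserved_1yr": 0.40, "spot": 0.80, "on_demand": 0.0},
)
print(round(cost))  # 10080
```

Under these assumptions the blend runs at ~56% of on-demand cost, which is the kind of number a Kubecost-driven FinOps review would track monthly.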

Business Impact

Regional Expansion

New regions via parameterised IaC modules.

Customer Experience

Sub-second search, real-time inventory, ML recs.

Scalability

3× growth via auto-scaling + caching.

Cost Discipline

Evolutionary approach. Pay for what you need.

Key Business Features

Real-Time Inventory & Fulfilment

Event-driven system processes thousands of events/sec at peak. Per-SKU, per-region read models in Redis/Elasticsearch (<100ms queries). Warehouse routing and dispatch with delivery SLA tracking.

Personalised ML Recommendations

Hybrid engine: collaborative filtering (60%), content-based (30%), business rules (10%). DynamoDB feature store with 6hr TTL. End-to-end scoring <100ms. Proposed split — to be validated with A/B testing post-launch.
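The proposed 60/30/10 blend is a weighted sum over normalised per-item scores, with the weights themselves subject to the post-launch A/B tests mentioned above. A sketch with made-up scores:

```python
def hybrid_score(cf: float, content: float, rules: float,
                 weights=(0.6, 0.3, 0.1)) -> float:
    """Blend the three signals with the proposed 60/30/10 split.
    Each input is a normalised [0, 1] score for one (user, item) pair."""
    w_cf, w_content, w_rules = weights
    return w_cf * cf + w_content * content + w_rules * rules

# Rank two candidate products for a user (scores are invented).
scores = {"sku-apples": hybrid_score(0.9, 0.4, 1.0),
          "sku-bread":  hybrid_score(0.5, 0.8, 0.0)}
print(max(scores, key=scores.get))  # sku-apples
```

The per-signal scores would come from the DynamoDB feature store at inference time; keeping the blend a pure function makes the <100ms scoring budget mostly a feature-lookup problem.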

Evolutionary Hybrid Architecture — Start modular, add complexity only where scaling pain demands it.
Stabilise → Modularise → Event Backbone → Extract Services → Multi-Region → FinOps

Operations & Evolution

Post-Production Support, Maintenance & Feature Upgrades

A mature operational model ensures the platform stays healthy, secure, and continuously improves after launch.

CONTINUOUS OPERATIONS LIFECYCLE: Monitor (OpenTelemetry, SLO dashboards) → Detect & Alert (PagerDuty/OpsGenie, anomaly detection) → Respond (incident runbooks, auto-remediation) → Fix & Deploy (GitOps + canary, blue-green rollback) → Learn & Improve (action-item tracking, reliability reviews) → Evolve (feature roadmap, platform upgrades). Continuous feedback loop — every incident improves the system.

Support Tiers & SLAs

Tier | Scope | Response | Resolution
P1 Critical | Payment/checkout down, data loss | 15 min | 4 hrs
P2 Major | Feature degraded, workaround exists | 1 hr | 8 hrs
P3 Minor | Non-critical bug, UI issue | 4 hrs | 48 hrs
P4 Request | Enhancement, cosmetic fix | 1 day | Next sprint

Scheduled Maintenance Windows

Activity | Frequency | Impact
Security patching (OS/K8s) | Weekly | Zero downtime
DB maintenance (vacuum/index) | Bi-weekly | Zero downtime
Kafka broker rolling upgrade | Monthly | Zero downtime
Major version upgrades (EKS) | Quarterly | Blue-green
Disaster recovery drills | Quarterly | Failover test
Security compliance audit (PCI DSS / SOC 2) | Annual | Scheduled

On-Call & Incident Management

24/7 On-Call Rotation

Follow-the-sun across regions. PagerDuty escalation chains.

Blameless Postmortems

Root cause + action items within 48 hrs. Tracked to completion.

Feature Governance & Prioritisation

FEATURE INTAKE & GOVERNANCE: Request (business need or user feedback) → RFC / ADR (architecture review, impact analysis) → Prioritise (RICE scoring, roadmap slot) → Deliver (sprint execution via CI/CD pipeline) → KPI Tracking
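RICE scoring in the prioritise step is (Reach × Impact × Confidence) / Effort. A sketch with hypothetical inputs for two roadmap items:

```python
def rice(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE priority = (Reach x Impact x Confidence) / Effort.
    Reach: users per quarter; Impact: 0.25-3 scale; Confidence: 0-1;
    Effort: person-months. Higher scores earn earlier roadmap slots."""
    return reach * impact * confidence / effort

features = {
    "ml-recs": rice(8000, 2.0, 0.8, 4),  # hypothetical inputs throughout
    "loyalty": rice(5000, 1.0, 0.9, 2),
}
print(sorted(features, key=features.get, reverse=True))  # ['ml-recs', 'loyalty']
```

Dividing by effort is what keeps large, speculative features from crowding out small, certain wins.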

Post-Launch Feature Roadmap

Quarter | Feature Upgrades | Priority
Q1 | ML-powered recommendations, A/B testing infra | High
Q2 | Real-time fraud detection, loyalty programme | High
Q3 | GraphQL API layer, advanced analytics dashboards | Medium
Q4 | Edge computing (CDN functions), AI-driven inventory forecasting | Strategic

Operational Maturity

Chaos Engineering

Scheduled fault injection (Litmus/Gremlin). Quarterly game days.

Capacity Planning

Predictive scaling models. FinOps reviews. Right-sizing automation.

Dependency Mgmt

Dependabot + Snyk scanning. Monthly CVE review cycle.

Platform Upgrades

EKS version policy (N-1). Rolling Kafka & PG major upgrades.

Operational excellence: Zero-downtime maintenance · Continuous feature delivery · SRE-driven reliability · Proactive security posture

Appendix

Assumptions, Questions & Glossary

This appendix covers key assumptions, questions for leadership, and a glossary of terms referenced throughout this architecture.


Key Assumptions

The following assumptions were made where the case study did not provide specific values. These should be validated with stakeholders before finalising the design.

Traffic & Performance

Assumption | Value Used | Rationale
Concurrent users | 10K+ | Estimated for a "rapidly growing" grocery platform with a 2× traffic surge
Baseline QPS | 10K (scales to 30K) | Derived from the 3× growth requirement in the case study
Availability SLO | 99.95% | Industry standard for e-commerce; case study says "minimal downtime"
Browse latency (p95) | 200–400 ms | Competitive benchmark for grocery e-commerce search/browse
Checkout latency (p95) | 600–1200 ms | Acceptable threshold for payment processing flows
Kafka throughput | 25K evt/sec | Sized for 3× peak with headroom for event-driven flows

Infrastructure & Cost

Assumption | Value Used | Rationale
Cloud provider | AWS | Selected for mature K8s (EKS), global reach, serverless ecosystem
Existing database | PostgreSQL | Most common ACID DB for e-commerce monoliths
Monthly infra cost | ~$24K/mo | Estimated for EKS + managed services at 10K QPS baseline
Redis cache offload | 80–90% | Typical for read-heavy grocery catalogue/inventory lookups
Compute savings | RI ~40–60%, Spot ~70–90% | Published AWS pricing benchmarks for steady-state and burst workloads

Architecture & Migration

Assumption | Value Used | Rationale
Migration pattern | Strangler Fig | Lowest risk for monolith-to-hybrid; incremental value delivery
Delivery timeline | ~12 weeks | AI-agent-accelerated migration; Strangler Fig with parallel workstreams and human oversight
Team size | ~6 engineers | Assumed cross-functional squad; to be validated with leadership
DR progression | Pilot Light → Active/Active | Phased approach to manage the cost vs. resilience trade-off

Business & Integrations

Assumption | Value Used | Rationale
Identity & payment providers | OAuth 2.0 / OIDC, PSP tokenisation, regional rails | Generic integrations; specific providers to be confirmed with leadership
ML recommendation split | 60/30/10 | Collaborative (60%), content-based (30%), business rules (10%)
Search latency target | <100 ms scoring | End-to-end ML scoring SLA for real-time personalisation
Recommendation

Conduct a discovery workshop with product, infra, and finance stakeholders to validate these assumptions before committing to detailed sprint planning.

18 assumptions identified — all values are estimated from industry benchmarks and should be refined with actual platform telemetry and business inputs.

Questions for Leadership

To finalise the architecture and migration plan, we need leadership alignment on the following open items from the case study.

Regional Expansion

Which of the five new regions should we prioritise first, and what is the rollout sequence?

Different regions carry different tax compliance, currency, and warehouse integration complexity. A phased rollout order lets us pilot in lower-risk regions before scaling.

Performance & Scalability

Is the 3× traffic growth expected to be gradual or driven by specific launch events (e.g., regional go-lives, promotions)?

This determines whether we invest in auto-scaling elasticity or pre-provisioned capacity — and how aggressively we optimise burst handling with Spot instances.

High Availability

What is the acceptable downtime target during peak hours — 99.9% (8.7 hrs/yr) or 99.95% (4.4 hrs/yr)?

The case study requires “minimal downtime during peak hours.” A concrete SLO drives the multi-region failover strategy, error budget gates, and infrastructure cost.

Business Features

Should real-time inventory checks be per-warehouse or aggregated per-region, and what latency is acceptable for stock updates?

Per-warehouse granularity enables faster delivery SLAs but requires tighter event-streaming integration with each new warehouse partner.

Personalisation

What user data is available for personalised recommendations — purchase history only, or also browsing behaviour and demographic data?

This shapes the ML model complexity (collaborative vs. content-based vs. hybrid) and determines data pipeline and privacy compliance requirements.

Cost Optimisation

Is there a target monthly infrastructure budget, and should we optimise for lowest cost or fastest time-to-market?

The case study asks to “keep infrastructure costs in check while scaling.” A specific envelope helps us decide between Reserved Instances, Spot, and Serverless mix.

Migration Strategy

How aggressive should the Strangler Fig migration be — stabilise-first (lower risk, longer) or extract-early (faster, higher risk)?

With five new regions launching next quarter, we need to balance migration velocity against the risk of destabilising the monolith during expansion.

Team & Organisation

What is the current engineering team size, and are there plans to scale the team or adopt accelerated tooling for the migration?

Team capacity directly impacts how many services we can extract in parallel and whether the proposed ~12-week AI-accelerated phased timeline is realistic.

Glossary

Every technical term, acronym, pattern, and standard referenced across all six slides.

Architecture & Patterns

Monolith A single deployable unit containing all application modules. The platform's current state — tightly coupled, single DB, single region.
Modular Monolith Candidate A — monolith with enforced module boundaries (hexagonal ports). Strongest consistency, lowest ops complexity, best early velocity.
Microservices Candidate B — independently deployable services communicating via sync (REST/gRPC) and async (events). Enables team autonomy and independent scaling.
Serverless Candidate D — managed functions (e.g., AWS Lambda) triggered by events. Pay-per-use, rapid elasticity. Risk: cold-start latency and retry storms.
Evolutionary Hybrid The recommended architecture — combines best of all four candidates. Start modular, add microservices/events/serverless only where scaling pain justifies.
Bounded Context A DDD concept defining a clear boundary around a domain model, ensuring each service owns its data and logic (e.g., Orders, Payments, Catalogue).
DDD Domain-Driven Design — software modelling approach that structures code around business domains. Drives the 14 bounded contexts in Slide 3.
Domain Decomposition The process of breaking a system into bounded contexts aligned with business capabilities. The platform decomposes into 14 contexts.
Hexagonal / Ports & Adapters Architecture pattern isolating domain logic from external systems (DB, APIs) via ports (interfaces) and adapters (implementations). Enables pluggable IdP/PSP.
Strangler Fig Incremental migration pattern where new functionality wraps the legacy system via a facade (API Gateway), gradually replacing it without a big-bang rewrite.
CQRS Command Query Responsibility Segregation — separates write (command) and read (query) models for independent scaling. Reads from ES/Redis, writes to PG.
Saga Pattern Manages distributed transactions across services via a sequence of local transactions with compensating actions on failure (e.g., void payment if fulfilment fails).
Transactional Outbox Persists domain state and an event-to-be-published in the same DB transaction, preventing "commit succeeded but event lost" failures.
Circuit Breaker Fault-tolerance pattern: after N failures (5 in the platform), requests fail fast for a cooldown period (30s), then half-open to test recovery. ~58% cascade reduction.
Bulkhead Isolates resources (thread pools, connections) so a failure in one component cannot cascade and exhaust shared resources. Enforced via Pod Disruption Budgets.
Event Sourcing Stores state as an immutable, time-ordered sequence of events rather than mutable rows, enabling replay and full auditability.
CAP Theorem States a distributed system can guarantee at most two of Consistency, Availability, and Partition tolerance. The platform chooses CP for payments/orders (strong consistency) and AP for catalogue/search (eventual consistency + high availability).
Brownfield Project Developing within an existing system — migrating or extending legacy code. The platform is brownfield: monolith → hybrid via Strangler Fig, preserving existing data and business logic.
Greenfield Project Building a new system from scratch with no legacy constraints. New bounded contexts (e.g., ML/Personalisation) in the platform are effectively greenfield within the brownfield migration.
Cache-Aside (Lazy Load) App checks cache first; on miss, reads from DB and populates cache. Default pattern for catalogue and inventory lookups in the platform.
Cache-Put (Write-Through) Writes update both DB and cache atomically, ensuring cache is always fresh. Used for sessions and cart data where stale reads are unacceptable.
Write-Behind Writes go to cache first, then asynchronously flush to DB. Used for analytics counters where eventual consistency is acceptable and write throughput matters.
Polyglot Persistence Using different database technologies for different services based on their needs — PG for transactions, Redis for caching, ES for search, DynamoDB for ML features.
Graceful Degradation Serving stale cached data or reduced functionality when an upstream dependency fails, rather than returning errors. Priority queues for critical paths.
Idempotent Consumers Event consumers that can safely process the same message multiple times without side effects. Essential for at-least-once delivery guarantees on Kafka.
Aggregator Pattern Composite service that calls multiple downstream services and merges results into a single response. Used for product detail pages (Catalogue + Pricing + Inventory).
Chained Pattern Synchronous service-to-service call chain where each step depends on the prior. Used in checkout flow: Cart → Pricing → Payment → Order. Risk: latency compounds per hop.
Branch Pattern Request fans out to multiple services in parallel, results merged. Used for search (Catalogue + Personalisation + Pricing queried simultaneously).
Client-Side UI Composition Frontend independently fetches from multiple BFF endpoints and assembles the page. Each UI section maps to a bounded context, enabling independent team deployment.
BFF (Backend for Frontend) Specialised API layer tailored to specific frontend clients (web, mobile). Minimises over-fetching and optimises response formats per client type.
IdP (Identity Provider) External authentication service managing user credentials and identity verification. Integrated via pluggable hexagonal adapter for vendor flexibility.
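The circuit-breaker parameters defined above (5 consecutive failures, 30s cooldown, half-open probe) can be sketched as a small state machine. This is illustrative only; in the platform the breaker runs in the service-mesh sidecar, not application code:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures, fail fast for
    `cooldown` seconds, then half-open to let one probe through."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0) -> None:
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, clock=time.monotonic):
        if self.opened_at is not None:
            if clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = clock()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit
        return result

breaker = CircuitBreaker()
def flaky():
    raise TimeoutError("downstream slow")

for _ in range(5):  # five consecutive failures trip the breaker
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass
print(breaker.opened_at is not None)  # True: now failing fast for 30s
```

Failing fast is what converts a slow downstream into an immediate error, protecting caller thread pools and producing the cascade reduction cited above.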

Protocols & Standards

OAuth 2.0 (RFC 6749) Delegated authorisation framework allowing third-party access to resources without sharing credentials.
OIDC OpenID Connect — identity layer on top of OAuth 2.0, providing authentication and user claims via ID tokens.
JWT (RFC 7519) JSON Web Token — compact, URL-safe format for securely transmitting claims between parties. Validated at API Gateway.
JWS (RFC 7515) JSON Web Signature — ensures data integrity. Used by payment providers and identity servers for signed transaction verification.
TLS 1.3 (RFC 8446) Transport Layer Security — encrypts data in transit. Mandatory for all external communication; mTLS for pod-to-pod.
mTLS Mutual TLS — both client and server authenticate each other. Implemented via the service mesh sidecar for zero-trust networking.
CloudEvents CNCF specification for interoperable event envelope format, ensuring consistent metadata across event-driven systems.
AsyncAPI Machine-readable specification for message-driven APIs (Kafka, AMQP, WebSockets), analogous to OpenAPI for REST.
OpenAPI Machine-readable specification for RESTful APIs. Used with AsyncAPI for contract governance across sync and async services.
GraphQL Query language for APIs allowing clients to request exactly the data they need. Planned for Q3 post-launch feature roadmap.
ACID Atomicity, Consistency, Isolation, Durability — database transaction guarantees. PostgreSQL provides ACID for orders/payments.

Deployment & Release Patterns

Blue-Green Deploy Two identical environments (blue/green). Deploy to inactive, switch load balancer. Rollback in <1 second by switching back.
Canary Deploy Route 5% of traffic to new version, monitor SLOs (p95, error rate). Auto-promote to 100% or auto-rollback within 60 seconds via Flagger.
Feature Flags Decouple deploy from release. Code is in production but behind a toggle — instant kill-switch. Enables % rollout and A/B testing.
Error Budget Gates Auto-pause releases when SLO reliability degrades beyond the error budget (e.g., 0.1% = ~43 min/month). Prevents shipping during instability.
Expand/Contract Backward-compatible schema migration pattern. Add new columns first (expand), migrate data, then remove old columns (contract). Zero-downtime DB changes.
Trunk-Based Dev All developers commit to a single main branch with short-lived feature branches. Enables continuous integration and zero-touch deploys.

Consistency Models

Strong Consistency All reads reflect the most recent write. Used for Orders, Payments, Identity — where correctness is non-negotiable.
Eventual Consistency Reads may temporarily return stale data, but will converge. Used for Catalogue (~5min CDN), Analytics, ML features. Bounded staleness.
Near-Real-Time Sub-second propagation delay. Used for Risk/Fraud Signals where freshness matters but strong consistency is unnecessary.
At-Least-Once Delivery Message delivery guarantee where events may be delivered more than once. Requires idempotent consumers. Used for Notifications.
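At-least-once delivery pairs with the idempotent-consumer entry above: duplicates WILL arrive, so the consumer dedupes on a stable event id. A minimal sketch; in production the seen-set would be a Redis SETNX or a DB unique constraint, not process memory:

```python
processed = set()  # stand-in for Redis SETNX or a DB unique key

def handle(event: dict):
    """Skip side effects for any event id seen before, so redelivery
    of the same Kafka message is a harmless no-op."""
    event_id = event["id"]
    if event_id in processed:
        return None  # duplicate: already handled
    processed.add(event_id)
    return f"notified:{event['userId']}"

evt = {"id": "evt-1", "userId": "u-9"}
print(handle(evt))  # notified:u-9
print(handle(evt))  # None  (redelivery is safely ignored)
```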

Infrastructure & Cloud Services

EKS Elastic Kubernetes Service — AWS-managed Kubernetes for container orchestration, auto-scaling, and rolling deployments. 3-AZ per region.
Istio / Service Mesh Service mesh providing mTLS, traffic management, circuit breaking, rate limiting, and distributed tracing between services via sidecar proxies.
API Gateway Entry point for all client requests. Handles AuthN/AuthZ, JWT validation, rate limiting, traffic routing, and acts as the Strangler Facade.
CloudFront / CDN Content Delivery Network — caches static assets at edge locations globally. L1 caching layer. Includes WAF (Web Application Firewall) for DDoS protection.
Route 53 AWS Global DNS service. Routes users to the nearest region via latency-based or geolocation routing policies.
Global Accelerator AWS Anycast routing service that directs traffic to optimal endpoints via AWS's global network, reducing internet hops and latency.
AWS Lambda Serverless compute for bursty workloads (receipt PDF gen, webhook dispatch). Pay-per-invocation. Risk: cold-start tail latency.
ArgoCD GitOps continuous delivery tool that syncs Kubernetes manifests from Git to clusters, ensuring declarative, auditable deployments.
Flagger Progressive delivery operator — automates canary rollouts (5%→100%) with metrics-driven auto-rollback via Prometheus.
Terraform / IaC Infrastructure as Code tool for provisioning cloud resources via declarative configuration. Parameterised regional modules for 5-region deployment.
HPA / VPA Horizontal Pod Autoscaler scales pod count by CPU/memory metrics (~30s). Vertical Pod Autoscaler right-sizes resource requests via 7-day analysis.
Cluster Autoscaler Kubernetes component that adds/removes worker nodes when pods are unschedulable or nodes are underutilised.
Vault (HashiCorp) Cloud-agnostic secrets management with dynamic secrets, auto-rotation, fine-grained RBAC, and K8s external-secrets operator. Chosen for multi-cloud portability.
AWS KMS Key Management Service — fully managed envelope encryption with IAM integration. The platform uses KMS for encrypting Vault's storage backend. Lower ops overhead but AWS-locked.
AWS Secrets Manager Managed key-value secret store with Lambda-based auto-rotation. Alternative to Vault for AWS-only deployments; the platform uses Vault instead for multi-cloud flexibility.
Multi-AZ Multi-Availability Zone — deploying across 3+ data centres within a region for fault tolerance. EKS, RDS, and Redis all run multi-AZ.
RDS Relational Database Service — AWS-managed database hosting. Runs PostgreSQL with multi-AZ failover and automated backups.
GitOps Operational model where Git is the single source of truth for infrastructure and application state. Changes via PRs with automated reconciliation.
DORA Metrics Four key software delivery metrics: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery (MTTR).
Pod Disruption Budget Kubernetes policy limiting how many pods can be unavailable during voluntary disruptions (upgrades, scaling). Enforces bulkhead isolation.
Spot / Reserved Instances Spot instances offer 70-90% savings for interruptible burst workloads. Reserved instances offer ~40-60% savings for baseline compute.
WAF (Web Application Firewall) Filters and monitors HTTP requests at the CDN edge, protecting against DDoS, SQL injection, XSS, and OWASP Top 10 attacks before traffic reaches the application.
Redis Sentinel High-availability solution for Redis providing automatic failover, monitoring, and configuration management across multi-AZ clusters.
ElastiCache / MemoryDB AWS managed Redis services: ElastiCache provides hosting with automatic backups and failover; MemoryDB adds Redis-compatible durability for primary data store use cases.
React / Next.js React is a component-based JavaScript UI library. Next.js is a React framework providing server-side rendering, static generation, and optimised production builds for the web storefront.
SwiftUI Apple's declarative UI framework for building native applications on iOS and other Apple platforms, with reactive state management and improved developer velocity.

Data & Messaging

Apache Kafka Distributed event streaming platform. Chosen over RabbitMQ (lower throughput), SQS (no replay, AWS-only), and Pulsar (smaller ecosystem). Provides durable log replay, 25K+ evt/sec, partitioned ordering, consumer groups, and schema registry integration.
PostgreSQL Open-source RDBMS providing ACID transactions for orders/payments. Runs with read replicas, PgBouncer pooling, and sharding (4→12).
Redis Open-source (BSD licence, free) in-memory key-value store. Sub-ms reads, data structures (sorted sets, hashes, streams), pub/sub, persistence. Chosen over Memcached (no data structures) and Hazelcast (higher overhead). Production runs on AWS ElastiCache (paid managed service).
Elasticsearch Distributed search engine for full-text product search, faceted navigation, and CQRS read models with sub-50ms query latency. Auto-sharded.
DynamoDB AWS fully managed NoSQL database with global tables, auto-sharding. Used for ML feature store with 6hr TTL and pay-per-request pricing.
PgBouncer Lightweight connection pooler for PostgreSQL. Manages ~600-700 connections, preventing DB connection exhaustion under load.
Read Replicas Database copies that serve read queries, offloading the primary. ~1-2 second replication lag. Used with PG and Redis for horizontal read scaling.
Schema Registry Centralised store for event schemas (Avro/Protobuf). Enforces compatibility rules to prevent event schema sprawl across Kafka topics.
DLQ (Dead Letter Queue) Queue for messages that fail processing after max retries. Prevents poison messages from blocking consumers. Monitored for manual review.
Change Data Capture CDC — captures database changes as events. Used to migrate from shared-schema monolith to per-service databases during extraction phases.
Dual-Write Writing to both old and new systems simultaneously during migration. Combined with outbox verification to ensure consistency before cutover.
Shadow Projections Running CQRS read models in parallel with existing queries (shadow mode) to compare results before switching over. Risk-free CQRS introduction.
ETag / HTTP 304 L4 client-side caching. Server returns entity tag; client sends it back. If unchanged, server responds 304 (Not Modified) — zero payload transfer.
RDB / AOF (Redis Persistence) Redis persistence mechanisms: RDB creates point-in-time snapshots for fast recovery; AOF (Append-Only File) logs every write for maximum durability. The platform uses AOF for session/cart data.
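The DLQ flow described in the data-flow slide (three retries with exponential backoff, then park and alert) can be sketched as follows; delays are shortened for illustration and the handler is a stand-in:

```python
import time

def consume_with_retry(handler, event, retries: int = 3, base_delay: float = 0.01):
    """Retry a failing handler with exponential backoff; after the last
    attempt, route the event to the DLQ for alerting and manual review."""
    dlq = []
    for attempt in range(retries):
        try:
            return handler(event), dlq
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x backoff
    dlq.append(event)  # poison message: park it instead of blocking the partition
    return None, dlq

def always_fails(event):
    raise ValueError("downstream unavailable")

result, dlq = consume_with_retry(always_fails, {"id": "evt-bad"})
print(result, len(dlq))  # None 1
```

Parking the poison message keeps the consumer group advancing through the partition; the alert-and-review step then decides whether to fix, replay, or discard the parked event.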

Security & Compliance

PSP Tokenisation Payment Service Provider replaces raw card credentials with a token, reducing PCI DSS 4.0 scope to the Payments service only (dedicated network segment, annual SAQ validation) and enabling multi-provider payment flows.
Payment Rails Regional payment networks (UPI in India, Pix in Brazil, SEPA in EU) requiring adapter integrations per market via hexagonal ports.
UPI (Unified Payments Interface) India's real-time interbank payment system enabling instant transfers via mobile devices. Operated by NPCI under RBI oversight.
IMPS (Immediate Payment Service) India's 24/7 electronic funds transfer system for inter-bank and intra-bank transactions. Complements UPI for non-mobile payment flows.
Pix Brazil's instant payment system enabling 24/7 real-time transfers with minimal fees. Regulated by the Central Bank of Brazil (BCB).
SEPA (Single Euro Payments Area) European payment infrastructure enabling cross-border euro transfers within the EU and EEA with standardised processing times and fees.
PCI DSS / PCI Scope Payment Card Industry Data Security Standard (v4.0). PSP tokenisation reduces scope to the Payments service only — isolation via dedicated network segment, annual SAQ validation.
AuthN / AuthZ Authentication (who are you?) and Authorisation (what can you do?). Handled at API Gateway via JWT validation and RBAC policies.
RBAC Role-Based Access Control — assigns permissions based on roles. Used in K8s for secret access and at the application layer for user authorisation.
SPIFFE Secure Production Identity Framework for Everyone — provides cryptographic service identities for zero-trust workloads.
Zero Trust (SP 800-207) NIST security model requiring continuous verification of all users, assets, and resources — never implicitly trust, always verify.
Cosign / Sigstore Container image signing and verification tools for supply chain security, ensuring only trusted images are deployed to K8s.
SBOM Software Bill of Materials — inventory of all components/dependencies in a build. Generated in CI for vulnerability tracking and supply chain security.
OWASP ASVS Application Security Verification Standard (v5.0.0) — security requirements verification framework for web applications. Used to define and validate controls across authentication, session management, access control, and data protection. Level 2 targeted.
SAST Static Application Security Testing — white-box code analysis that scans source code for vulnerabilities before deployment. Integrated into CI alongside DAST.
SCA Software Composition Analysis — scans third-party dependencies for known vulnerabilities (CVEs) and licence compliance risks. Complements SBOM generation.
DAST Dynamic Application Security Testing — black-box security scanning of running applications. Integrated into CI/CD pipeline for every release.
Rate Limiting Per-tenant request quotas enforced at the service mesh. Hard reject (HTTP 429) above threshold to prevent noisy-neighbour problems.
Request Throttling Gradual backpressure before hard limit — returns HTTP 429 with Retry-After header, using token bucket algorithm at gateway. Slows traffic rather than rejecting it outright. Distinct from rate limiting.
Load Shedding Under extreme load, intentionally drop low-priority requests (analytics, recs) to protect critical paths (checkout, payments). Triggered by CPU/memory thresholds. Last resort before circuit breaker opens.
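The token-bucket throttle mentioned above can be sketched as follows; a gateway would translate an `allow() == False` into HTTP 429 with a `Retry-After` header. Rate and capacity values are illustrative, and the injectable clock is just a testing convenience.

```python
import time

class TokenBucket:
    """Token bucket: refill at `rate` tokens/sec up to `capacity`.
    Each request consumes one token; an empty bucket means throttle."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds 429 + Retry-After
```

Load shedding sits behind this: even admitted requests may be dropped by priority class once CPU/memory thresholds trip.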
GDPR General Data Protection Regulation — EU data privacy law. Requires data residency (eu-west), lawful basis for processing, consent management, DPIA for high-risk processing, 72-hour breach notification, data subject rights (access, erasure, portability), and DPA with processors.
LGPD Lei Geral de Proteção de Dados — Brazil's data protection law. Requires data residency (sa-east), consent management, breach notification to ANPD, and DPO appointment. Key differences from GDPR: applies to legal entities, fines capped at 2% revenue / 50M BRL, enforced by ANPD.
RBI Compliance Reserve Bank of India regulations: payment data localisation (ap-south), mandatory 2FA for transactions, local payment rails (UPI/IMPS), Cyber Security Framework compliance, 48-hour incident reporting, and annual security audits.
Data Residency Legal requirement to store and process personal data within specific geographic boundaries. Drives region-local database masters.

Observability & Operations

OpenTelemetry (OTel) Vendor-neutral observability framework for collecting traces, metrics, and logs with consistent resource context across all service layers.
Prometheus K8s-native time-series database for metrics collection. Powers Grafana dashboards and Flagger canary analysis.
Grafana Dashboarding and visualisation platform. Hosts SLO burn-rate alerts, service maps, and DORA metrics dashboards.
PagerDuty / OpsGenie Incident alerting and on-call management platforms. Escalation chains for P1-P4 incidents. Follow-the-sun rotation across regions.
SLO / Error Budget Service Level Objective defines target reliability (99.9% browse, 99.95% checkout). Error budget = allowed failure margin — exhaustion gates deploys.
p95 Latency 95th percentile response time — 95% of requests complete faster than this threshold. Browse target: 200-400ms. Checkout: 600-1200ms.
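The two definitions above reduce to simple arithmetic: an availability SLO implies a downtime budget per window, and p95 is a rank in the sorted latency sample. A minimal sketch (30-day window is an assumption; the SLO values are the ones quoted above):

```python
import math

def error_budget_minutes(slo, days=30):
    """Allowed downtime for an availability SLO over a window, in minutes."""
    return (1 - slo) * days * 24 * 60

def p95(latencies_ms):
    """95th-percentile latency: 95% of requests complete faster than this."""
    ordered = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank method
    return ordered[idx]
```

So the 99.9% browse SLO allows about 43 minutes of monthly downtime, and 99.95% for checkout allows about 22; exhausting either budget gates deploys.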
Distributed Tracing Trace ID → Span ID → Service Map correlation across all hops. Head-based sampling + 15-day retention for high-cardinality traces.
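Head-based sampling means the keep/drop decision is made once, at the trace root, and must be deterministic in the trace ID so every downstream span agrees. A sketch of that decision (the 10% rate is an illustrative assumption):

```python
import hashlib

def head_sample(trace_id, sample_pct=10):
    """Deterministic head-based sampling decision for a trace ID.
    Hashing maps the ID uniformly into 0-99; keep if below the rate."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_pct
```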
Kubecost / FinOps Cloud financial management combining engineering and finance. Kubecost provides per-namespace cost visibility, right-sizing recommendations, and VPA integration.
SRE Site Reliability Engineering — operational discipline applying software engineering to infrastructure. Drives SLOs, error budgets, and blameless postmortems.
Chaos Engineering Scheduled fault injection (Litmus/Gremlin) to proactively discover weaknesses. Quarterly game days simulate region failures and cascade scenarios.
Blameless Postmortem Incident review focused on root cause and systemic improvements rather than individual fault. Action items tracked to completion within 48 hrs.
Dependabot / Snyk Automated dependency scanning tools that detect CVEs (Common Vulnerabilities and Exposures) in third-party libraries. Monthly review cycle.
DR (Disaster Recovery) Business continuity strategy. Progression: Pilot Light → Warm Standby → Active/Active. Quarterly failover drills across regions.

Business, ML & Governance

Collaborative Filtering ML technique recommending products based on similar users' behaviour. Makes up ~60% of the platform's hybrid recommendation engine.
Content-Based Filtering ML technique recommending products based on item attributes matching user preferences. ~30% of the recommendation engine mix.
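A toy scorer blending the two signals with the ~60/30 weighting described above (all data and the scoring shapes are illustrative; the remaining ~10% of the mix is out of scope here):

```python
def cosine(a, b):
    """Cosine similarity between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(similar_user_ratings, item_attrs, user_prefs):
    # Collaborative signal: mean rating of the item among similar users
    # (ratings assumed normalised to 0..1).
    collab = sum(similar_user_ratings) / len(similar_user_ratings)
    # Content signal: item attributes vs. the user's preference vector.
    content = cosine(item_attrs, user_prefs)
    return 0.6 * collab + 0.3 * content
```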
Feature Store Centralised repository for ML features (DynamoDB with 6hr TTL). Ensures consistent feature values between training and serving, with <100ms scoring.
A/B Testing Controlled experiment serving different variants to user segments to measure impact. Used with feature flags to validate ML models and UX changes.
RICE Scoring Prioritisation framework: Reach × Impact × Confidence ÷ Effort. Used in feature governance to rank roadmap items objectively.
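The formula is direct to apply; a sketch with made-up backlog items and inputs (reach in users/quarter, impact on the usual 0.25-3 scale, confidence 0-1, effort in person-months):

```python
def rice_score(reach, impact, confidence, effort):
    """RICE = (Reach x Impact x Confidence) / Effort."""
    return (reach * impact * confidence) / effort

# Hypothetical backlog items, not from the case study.
backlog = {
    "real-time-inventory": rice_score(8000, 2, 0.8, 4),   # 3200.0
    "recs-v1":             rice_score(5000, 1, 0.5, 6),   # ~416.7
}
ranked = sorted(backlog, key=backlog.get, reverse=True)
```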
RFC / ADR Request for Comments / Architecture Decision Record — formal documents for proposing and recording architectural decisions with impact analysis.
Consumer-Driven Contracts Testing pattern where API consumers define expected behaviour. Provider verifies against these contracts in CI, preventing breaking changes.
Connection Pooling Reusing database connections across requests instead of creating new ones. PgBouncer manages ~600-700 connections for PostgreSQL.
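PgBouncer does this externally to PostgreSQL; the underlying idea can be sketched in-process as a fixed set of connections created once and handed out per request. `connect` here is any zero-argument factory standing in for a real database connect call:

```python
import queue

class ConnectionPool:
    """Fixed-size pool: connections are created once and reused,
    so request handling never pays per-request connection setup."""

    def __init__(self, connect, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self, timeout=None):
        return self._pool.get(timeout=timeout)  # blocks when exhausted

    def release(self, conn):
        self._pool.put(conn)
```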

18 assumptions + 101 glossary terms — covering architecture patterns, consistency models, protocols, cloud infrastructure, data systems, security, compliance, observability, operations, ML, and governance

Thank you for your time. We welcome further discussion on any of the above.

sumaninster7@gmail.com