E-Commerce Platform
Solution Architecture
Scaling a grocery e-commerce platform for 3× traffic growth, five new regions, real-time inventory, and personalised recommendations.
The Platform Is Breaking Under Its Own Growth
The platform's traffic doubled in six months and is targeting 3× year-on-year growth. Peak hours bring slow responses and system crashes, and expansion into five new regions, each with local currencies, tax rules, and warehouses, is imminent.
Missing Business Capabilities
Key gaps identified from the case study that the current platform cannot address.
Real-Time Inventory
Customers need accurate stock levels during browsing and checkout — the monolith has no dedicated inventory service or caching layer.
Personalised Recommendations
The case study requires personalised product suggestions — no ML pipeline, feature store, or recommendation engine exists today.
Faster Delivery SLAs
Regional expansion demands local warehouse routing and fulfilment orchestration that the single-region monolith cannot support.
Cost Optimisation
The brief explicitly calls for cost-effective scaling — monolithic vertical scaling is expensive; independent service scaling is needed.
Multi-Region Operations
Five new regions require local currencies, tax rules, and warehouses — the monolith has no multi-tenancy or regionalisation layer.
Proposed Non-Functional Targets
These targets are proposed based on industry benchmarks — not specified in the original brief.
Four Candidates, One Evolutionary Path
Rather than a big-bang migration, we evaluate four architectures and recommend an evolutionary hybrid using the Strangler Fig pattern.
Requirement → Solution Traceability
Every architectural choice maps back to a specific case study requirement.
Traffic growth (3×): caching layer, connection pooling, independent service scaling, CQRS read models.
Regional expansion (5 regions): multi-region deploy, Pricing & Tax context, warehouse routing, local currency support.
Minimal downtime at peak: active/active multi-AZ, event backbone for decoupling, canary deploys, error budget gates.
Customer experience: real-time inventory service, search + recommendations via ML context, faster delivery SLAs.
Cost control: serverless for bursty workloads, independent scaling per service, FinOps phase.
Domain Decomposition (14 Bounded Contexts)
Four Architecture Candidates
Candidate A: Modular Monolith. Hexagonal ports & adapters. Strongest consistency. Lowest ops complexity. Best early velocity.

Candidate B: Microservices. Sync-first + async side flows. Independent scaling. Saga transactions. Team autonomy.

Candidate C: Streaming + CQRS. Kafka event backbone. Separate read/write. Multi-consumer fan-out. Highest throughput.

Candidate D: Serverless. Managed functions + event bus. Pay-per-use. Rapid elasticity. Cost-efficient spikes.

| Dimension | A: Monolith | B: Microservices | C: Stream+CQRS | D: Serverless |
|---|---|---|---|---|
| Delivery Velocity | High (early) | Medium | Medium-Low | High (small features) |
| Ops Complexity | Lowest | High | Very High | Medium-High |
| Consistency | Strongest (single TX) | Strong/svc; saga across | Eventual reads; strong writes | Eventual; orchestrator |
| Latency | Low variance | Hop-sensitive | Fast reads; write lag | Cold-start variance |
| Cost Shape | Predictable | Higher baseline | Highest (data dup) | Usage-based |
| Best Fit | Rapid iteration + consistency | Team autonomy + scaling | Many consumers + reads | Spiky, event-heavy |
| Data Migration | Lowest (in-process) | Medium (per-service DBs) | High (dual-write + projections) | Medium (event replay) |
| Team Skill Req. | General backend | Platform + DevOps maturity | Event modeling + schema governance | Cloud-native + managed svc |
| CAP Trade-off | CA (single node) | CP or AP per service | AP reads; CP writes | AP (eventual + retries) |
Choose A (Modular Monolith) when: strong correctness plus fast iteration is needed; minimal distributed complexity; extraction-friendly hexagonal boundaries.
Choose B (Microservices) when: multiple teams need independent deployability and platform maturity (CI/CD, tracing, contract testing) already exists.
Choose C (Streaming + CQRS) when: many downstream consumers need the same events; read volume dominates; bounded staleness is acceptable.
Choose D (Serverless) when: the workload is highly bursty and event-driven; managed services are strongly preferred; the team can engineer around retries and tail latency.
CAP Theorem — A distributed system can guarantee at most two of Consistency, Availability, and Partition-tolerance. The monolith sidesteps the trade-off (single node, no partitions); the hybrid makes per-context choices — CP for payments/orders (strong consistency), AP for catalogue/search (availability + eventual consistency).
Why Kafka? — Compared to RabbitMQ (push-based, lower throughput), SQS (no replay, AWS-only), and Pulsar (smaller ecosystem): Kafka provides durable log replay, high throughput (25K+ evt/sec), partitioned ordering, consumer groups for fan-out, and schema registry integration. Critical for event sourcing, outbox relay, and CQRS projections across 14 bounded contexts.
Brownfield vs Greenfield — The platform is a brownfield project (existing monolith → hybrid migration via Strangler Fig). A greenfield approach (building microservices from scratch) would bypass legacy constraints but forfeit existing business logic, data, and customer traffic. The evolutionary hybrid preserves brownfield value while introducing greenfield patterns (event backbone, CQRS, new bounded contexts) incrementally.
Service Integration Patterns
Single entry point for all clients. Handles auth, rate limiting, routing, and acts as the Strangler Facade. The platform's primary pattern.
A composite service calls multiple downstream services and merges results. Used for product detail pages (Catalogue + Pricing + Inventory + Reviews in one response).
Synchronous service-to-service call chain where each step depends on the prior. Used in checkout: Cart → Pricing → Payment → Order. Risk: latency compounds per hop.
Request fans out to multiple services in parallel, results merged. Used for search (Catalogue + Personalisation + Pricing queried simultaneously, fastest wins).
Frontend (React/Next.js) fetches from multiple BFFs independently and assembles the page. Each UI section maps to a bounded context. Enables independent team deployment.
The platform uses a mix: API Gateway for ingress, Aggregator for composite reads, Chained for transactional flows (with saga compensation), Branch for parallel search, and Client-Side Composition for the storefront.
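The Aggregator and Branch patterns above share the same mechanical core: fan out to several downstream services concurrently and merge the results. A minimal sketch, with the three downstream fetchers as illustrative stubs (real clients would call Catalogue, Pricing, and Inventory over gRPC/REST):

```python
import asyncio

# Hypothetical downstream fetchers -- stand-ins for real service clients.
async def fetch_catalogue(product_id: str) -> dict:
    await asyncio.sleep(0.01)  # simulate network latency
    return {"name": "Organic Apples", "product_id": product_id}

async def fetch_pricing(product_id: str) -> dict:
    await asyncio.sleep(0.01)
    return {"price": 3.49, "currency": "GBP"}

async def fetch_inventory(product_id: str) -> dict:
    await asyncio.sleep(0.01)
    return {"in_stock": True, "quantity": 42}

async def product_detail(product_id: str) -> dict:
    """Aggregator: fan out to downstream services in parallel, merge results."""
    catalogue, pricing, inventory = await asyncio.gather(
        fetch_catalogue(product_id),
        fetch_pricing(product_id),
        fetch_inventory(product_id),
    )
    return {**catalogue, **pricing, **inventory}

page = asyncio.run(product_detail("sku-123"))
```

Because the calls run in parallel, the composite latency is roughly the slowest branch rather than the sum of all hops — the key difference from the Chained pattern used in checkout.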
Recommended: Evolutionary Hybrid (Strangler Fig)
Three-stage transition from fragile monolith to scalable target. Each stage delivers value while managing risk.
Hybrid in Action: Purchase Flow
The checkout flow demonstrates how each candidate pattern contributes to the recommended hybrid architecture.
How We Get There — Strangler Fig Migration
The Strangler Fig Pattern
Named after the tropical strangler fig tree that germinates on a host tree, gradually enveloping it with aerial roots until the host decomposes and the fig stands independently.
In software migration, the new system (blue — new services) wraps around the legacy monolith (grey — old code) via an API Gateway facade. Traffic shifts incrementally. As bounded contexts are extracted, the monolith shrinks until safely decommissioned.
Key advantage: Zero big-bang risk. Each phase delivers value independently, and rollback is always possible.
Blue roots (new services & events) wrapping grey monolith trunk. Minimal, clean design.
Green fig roots enveloping a decaying host tree — the real-world inspiration for the pattern.
8-Phase Rollout
Incremental delivery — each phase produces a working system. Accelerated timelines assume AI-agent-driven development with human oversight for architecture decisions and code review.
| Phase | Scope | Timeline |
|---|---|---|
| 1 Observe & Baseline | Define SLOs, error budgets; instrument with OpenTelemetry | Day 1-3 |
| 2 Stabilise | Add caching layer, CDN, connection pooling; run load tests | Wk 1 |
| 3 Modularise | Hexagonal ports & adapters; enforce module boundaries with arch tests | Wk 2-3 |
| 4 Event Backbone | Introduce event bus + Outbox pattern; CloudEvents schema | Wk 3-4 |
| 5 Extract Services | Payment, Order, Catalogue — strangler fig with contract tests | Wk 5-7 |
| 6 CQRS + Stream | Selective CQRS projections; read model optimization | Wk 7-9 |
| 7 Multi-Region | 5 regions — IaC provisioning, data replication, DR runbooks | Wk 9-12 |
| 8 FinOps | Cost dashboards, right-sizing, reserved capacity planning | Ongoing |
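The Outbox pattern introduced in Phase 4 can be sketched in a few lines. This uses SQLite as a stand-in for the service database, and the table/topic names are illustrative; the point is that the domain write and the event-to-publish commit in one transaction:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, "
    "payload TEXT, published INTEGER DEFAULT 0)"
)

def place_order(order_id: str) -> None:
    # Domain state and the event row commit in ONE transaction, so
    # "order saved but event lost" cannot happen.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.events",
             json.dumps({"type": "OrderPlaced", "order_id": order_id})),
        )

def relay_outbox() -> list:
    # A separate relay process polls unpublished rows and forwards them
    # to the event backbone, then marks them published.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        # kafka_producer.send(topic, payload)  # real broker call goes here
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return rows

place_order("ord-1001")
pending = relay_outbox()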
Key Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Distributed monolith | Limit sync depth; async for non-critical flows |
| Event schema sprawl | AsyncAPI + schema registry + versioning |
| Module boundary erosion | Hexagonal ports + consumer-driven contracts |
Migration Principles
Strangler Fig wraps old system. Traffic shifts incrementally via gateway.
Stabilise with caching first. Don't extract from a broken monolith.
Kafka backbone before extraction prevents distributed monolith.
Dual-write verification + shadow CQRS projections before cutover.
Rollback & Safety Nets
| Mechanism | How It Works |
|---|---|
| Blue-Green Deploy | <1s rollback to previous version |
| Canary Auto-Rollback | Automated revert within 60s if p95 or error-rate SLO breached |
| Feature Flags | Decouple deploy from release; instant kill-switch |
| Error Budget Gates | Auto-pause releases when reliability degrades |
| Expand/Contract | Backward-compatible schema migrations |
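The canary auto-rollback gate reduces to a decision function over the canary's observed metrics. A minimal sketch; the SLO thresholds below are illustrative assumptions, not values from the brief, mirroring what a tool like Flagger evaluates before promoting:

```python
# Assumed SLO thresholds -- to be replaced with the agreed targets.
P95_LATENCY_SLO_MS = 500
ERROR_RATE_SLO = 0.01  # 1%

def canary_decision(p95_ms: float, error_rate: float) -> str:
    """Return 'promote' if the canary meets both SLOs, else 'rollback'."""
    if p95_ms > P95_LATENCY_SLO_MS or error_rate > ERROR_RATE_SLO:
        return "rollback"
    return "promote"

assert canary_decision(p95_ms=320, error_rate=0.002) == "promote"
assert canary_decision(p95_ms=750, error_rate=0.002) == "rollback"
```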
CI/CD Pipeline & Delivery Metrics
DORA metrics: deploy frequency, lead time, change failure rate, MTTR.
Pipeline: trunk-based development. Automated promotion gates. Zero-touch deploys.
Test strategy: monolith: unit + module tests; microservices: add contract tests; serverless: add event-replay tests.
API contracts: OpenAPI + AsyncAPI + schema registry for events.
Key insight: The evolutionary hybrid combines the best of all four candidates — strong transactions where correctness matters, async events for scale, and serverless for bursty edge workloads.
Target-State System Architecture
Full layered view of the evolutionary hybrid architecture — from client edge to data persistence, with technology choices and performance strategies.
End-to-End System Design
Holistic single-page view: actors, UI portals, API gateway, domain services with interconnections, message brokers, 3rd-party integrations, caching, databases, notifications, and sidecar observability.
Synchronous Calls (gRPC / REST via Istio mTLS)
| # | From | To | Call / Purpose |
|---|---|---|---|
| 1 | All services | Identity & Auth | validateToken() — JWT verification on every request |
| 2 | Cart | Catalogue | getPrice() — fetch current price & product details |
| 3 | Cart | Inventory | checkStock() — verify availability before adding to cart |
| 4 | Orders | Payments | chargePayment() — process payment during checkout (saga step) |
| 5 | Orders | Inventory | reserveStock() — reserve items during checkout (saga step) |
| 6 | Orders | Fraud & Risk | riskCheck() — fraud score before order confirmation |
| 7 | Payments | Stripe / Adyen | processPayment() — external payment gateway call |
| 8 | Fulfilment | Google Maps | geocode() / optimiseRoute() — delivery routing |
| 9 | Notification svc | Twilio / SES / FCM | send() — dispatch SMS, email, or push notification |
| 10 | Fulfilment | Delivery Partners | dispatch() — hand off to last-mile logistics partner |
Asynchronous Events (Kafka — non-blocking, eventual consistency)
| # | Producer | Consumer | Event / Topic |
|---|---|---|---|
| 11 | Orders | Fulfilment | OrderPlaced → order.events — trigger pick/pack/ship |
| 12 | Orders | Notifications | OrderConfirmed → notification.events — email + push |
| 13 | Payments | Orders | PaymentCompleted → payment.events — confirm order |
| 14 | Inventory | Catalogue | StockUpdated → inventory.events — reindex search |
| 15 | Fulfilment | Notifications | ShipmentDispatched → fulfilment.events — SMS/push |
| 16 | Returns | Inventory | RefundApproved → return.events — restock items |
| 17 | Returns | Payments | RefundApproved → return.events — issue refund |
| 18 | Promotions | Orders | CouponApplied → promo.events — apply discount |
| 19 | Fraud & Risk | Orders | FraudFlagged → fraud.events — block/review order |
| 20 | All services | Analytics & ML | *.* — fan-out consumer of all events for ML features |
| 21 | Orders | SAP / ERP | OrderCompleted → order.events — sync to finance system |
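Because Kafka delivers at-least-once, every consumer above must be idempotent: a redelivered event must not restock twice or refund twice. A minimal sketch of deduplication by event id, using the RefundApproved → Inventory flow (in-memory structures stand in for a persistent dedupe store; names are illustrative):

```python
# In-memory stand-ins for a persistent dedupe store and inventory table.
processed_ids: set = set()
restocked: dict = {"sku-42": 0}

def handle_refund_approved(event: dict) -> bool:
    """Apply the event once; redeliveries become no-ops."""
    if event["event_id"] in processed_ids:
        return False  # duplicate delivery: skip side effects
    restocked[event["sku"]] += event["quantity"]
    processed_ids.add(event["event_id"])
    return True

event = {"event_id": "evt-7", "type": "RefundApproved",
         "sku": "sku-42", "quantity": 2}
handle_refund_approved(event)  # applied
handle_refund_approved(event)  # duplicate: ignored
assert restocked["sku-42"] == 2
```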
Layered Architecture Detail
Detailed layered view with technology choices per component — zoom into any layer from the end-to-end design above.
Technology Choices, Performance & Security
Metrics are proposed targets based on industry benchmarks; final values to be validated during load testing.
Core Technology Stack
| Technology | Role |
|---|---|
| Kubernetes (EKS) | Container orchestration, HPA auto-scaling |
| Service Mesh (e.g. Istio) | mTLS, traffic mgmt, circuit breaking |
| Apache Kafka | Event streaming: 25K evt/sec, outbox relay |
| PostgreSQL | ACID transactions, read replicas, sharding |
| Redis Cluster | Sub-ms cache, sessions (80-90% DB offload) |
| Elasticsearch | Full-text search, CQRS read models |
| OpenTelemetry | Vendor-neutral traces, metrics, logs |
| DynamoDB | Feature store, global tables, pay-per-req |
| AWS Lambda | Serverless for bursty workloads + edges |
| ArgoCD + Flagger | GitOps, canary deploys, auto-rollback |
| Prometheus + Grafana | K8s-native monitoring, dashboards |
| Kubecost | FinOps: cost visibility, right-sizing |
| Terraform | IaC: parameterised regional modules |
Multi-Layer Caching
Security Architecture
| Layer | Control |
|---|---|
| Identity Layer | Hex adapter wraps external IdP; pluggable for future providers |
| Transport | TLS 1.3 external + mTLS pod-to-pod via service mesh |
| Payment Isolation | PCI DSS 4.0 scope reduced to 1 service via PSP tokenisation adapter |
| Zero Trust | NIST SP 800-207; service identities via SPIFFE |
| Secrets | Vault auto-rotation, K8s external-secrets + RBAC. See Vault vs AWS KMS below. |
| Encryption at Rest | AES-256 via AWS KMS for RDS, S3, EBS, Kafka (at-rest encryption), backups |
| Supply Chain | SBOM generation, image signing (Cosign/Sigstore), image scanning |
| Verification | OWASP ASVS 5.0.0 (Level 2) + SAST, SCA & DAST in CI |
Multi-Region Topology
Redis offers data structures (sorted sets, hashes, streams), pub/sub, Lua scripting, persistence (RDB/AOF), and multi-AZ Sentinel HA — all missing from Memcached. Hazelcast adds distributed compute but with higher memory overhead and a smaller managed-service ecosystem on AWS. Self-managed Redis is free to run; AWS ElastiCache/MemoryDB is the managed option (paid, ~$0.017/hr for cache.t3.micro). The platform uses ElastiCache for production HA.
Vault — cloud-agnostic, dynamic secrets, auto-rotation, fine-grained RBAC, audit log, K8s external-secrets operator. Best for multi-cloud or hybrid. AWS KMS — fully managed envelope encryption, tight IAM integration, lower ops overhead but AWS-locked. AWS Secrets Manager — managed key-value store with rotation via Lambda. The platform uses Vault for portability across regions (multi-cloud roadmap) + K8s-native secret injection, with KMS for envelope encryption of Vault's storage backend.
Resilience Patterns, Scaling & Observability
Circuit Breaker: 5 failures → fail-fast 30s → half-open test. ~58% cascade reduction.
Retry: exponential backoff with jitter. Idempotency keys.
Bulkhead: isolated thread/connection pools. Pod Disruption Budgets.
Health Probes: liveness (restart) + readiness (remove from LB).
Rate Limiting: per-tenant quotas via service mesh. Hard reject above threshold. Prevents noisy neighbours.
Throttling: gradual backpressure (HTTP 429 + Retry-After) before the hard limit. Token bucket at gateway level. Distinct from rate limiting: slows rather than rejects.
Load Shedding: under extreme load, drop low-priority requests (analytics, recommendations) to protect critical paths (checkout, payments). Priority-based queue with CPU/memory triggers.
Graceful Degradation: serve stale cache if an upstream fails. Priority queues. Reduced functionality beats a total outage.
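The circuit-breaker policy above (5 failures, 30-second cooldown, half-open probe) can be sketched as a small state machine. This is an illustrative sketch, not the mesh's actual implementation:

```python
import time

class CircuitBreaker:
    """5 failures open the circuit; calls fail fast for the cooldown;
    then one half-open trial call is allowed to test recovery."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        half_open = False
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            half_open = True  # cooldown elapsed: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if half_open or self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Once open, callers fail in microseconds instead of waiting on a dead upstream, which is what cuts cascade failures.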
Auto-Scaling
HPA: CPU >70% / Mem >80% → scale pods ~30s. VPA: 7-day analysis → right-size. Cluster Autoscaler: Add nodes for unschedulable pods.
DB Scaling
Read replicas: 1-2s lag. PgBouncer: ~600-700 conns. Sharding: by order_id (start 4, grow to 12).
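Routing by order_id can be sketched with a stable hash. Python's built-in hash() is salted per process, so a deterministic digest is used instead; shard counts mirror the plan above (start at 4, grow to 12):

```python
import hashlib

def shard_for(order_id: str, num_shards: int) -> int:
    """Deterministically map an order_id to a shard index."""
    digest = hashlib.sha256(order_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard = shard_for("ord-1001", num_shards=4)
assert 0 <= shard < 4
```

Note that growing 4 → 12 with plain modulo hashing remaps most keys; consistent hashing or pre-split logical shards would limit the data that has to move during resharding.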
Observability
OTel: Trace/span correlation across all hops. Dashboards: Grafana SLO burn-rate alerts + service maps. Cost: Head-based sampling + 15-day retention for high-cardinality traces.
Kubernetes Deployment Architecture
Physical deployment topology across 3 Availability Zones per region, showing how domain services map to EKS namespaces, pods, and supporting infrastructure.
Network & VPC Architecture
AWS VPC layout showing how traffic flows from the internet through public and private subnets to reach application pods and data stores.
CI/CD Pipeline Architecture
Trunk-based development with automated promotion gates. Zero-touch deployment from commit to production via ArgoCD + Flagger canary rollout.
Data Flow & Event-Driven Architecture
How domain events flow through the Kafka backbone between producers and consumers, including CQRS read/write separation and event sourcing paths.
Database Schema & Data Ownership Map
Each bounded context owns its data store exclusively — no shared databases. Shows which service owns which storage technology and key entities.
Observability Stack & Telemetry Pipeline
End-to-end observability: how metrics, logs, and traces flow from application services through OpenTelemetry collectors to dashboards and alerting.
Trade-Offs, Cost & Business Impact
Key Trade-Offs & Mitigations
| Trade-Off | Risk | Mitigation |
|---|---|---|
| Micro vs. mono | Ops overhead | Extract only when pain justifies |
| Eventual consistency | Stale data | Strong for financials; short-TTL |
| CQRS selective | Complexity | Only where read/write ratio needs it |
| Multi-region | Cost + sync | Pilot light → active/active |
| IdP / PSP coupling | Vendor changes | Hex adapter; pluggable identity + payment providers |
| Serverless lock-in | Migration | CloudEvents + adapter isolation |
Data Migration Tactics
| Challenge | Approach |
|---|---|
| DB ownership split | Shared-schema → per-service via Change Data Capture |
| Sync → async | Dual-write with outbox verification |
| CQRS introduction | Shadow projections; compare then switch |
| Data residency compliance | Region-local masters; cross-region replication policy |
Cost Optimisation (~$24K/mo est.)
Costs are estimated based on published cloud pricing at proposed scale; actual costs depend on provider and workload.
Compute Savings Strategy
Business Impact
Regional expansion: new regions launch via parameterised IaC modules.
Customer experience: sub-second search, real-time inventory, ML recommendations.
Scalability: 3× growth absorbed via auto-scaling + caching.
Cost control: evolutionary approach; pay only for what you need.
Key Business Features
Real-Time Inventory & Fulfilment
Event-driven system processes thousands of events/sec at peak. Per-SKU, per-region read models in Redis/Elasticsearch (<100ms queries). Warehouse routing and dispatch with delivery SLA tracking.
Personalised ML Recommendations
Hybrid engine: collaborative filtering (60%), content-based (30%), business rules (10%). DynamoDB feature store with 6hr TTL. End-to-end scoring <100ms. Proposed split — to be validated with A/B testing post-launch.
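The 60/30/10 blend reduces to a weighted sum over per-model scores. A minimal sketch; the component scorers are stand-ins for the ML context and feature store, and the weights are the proposed split pending A/B validation:

```python
# Proposed blend weights (60/30/10) -- to be validated with A/B testing.
WEIGHTS = {"collaborative": 0.60, "content": 0.30, "rules": 0.10}

def blend(scores: dict) -> float:
    """Combine per-model scores (each in [0, 1]) into one ranking score."""
    return sum(WEIGHTS[model] * scores[model] for model in WEIGHTS)

candidate = {"collaborative": 0.8, "content": 0.5, "rules": 1.0}
score = blend(candidate)  # 0.6*0.8 + 0.3*0.5 + 0.1*1.0 = 0.73
assert abs(score - 0.73) < 1e-9
```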
Evolutionary Hybrid Architecture — Start modular, add complexity only where scaling pain demands it.
Stabilise → Modularise → Event Backbone → Extract Services → Multi-Region → FinOps
Post-Production Support, Maintenance & Feature Upgrades
A mature operational model ensures the platform stays healthy, secure, and continuously improves after launch.
Support Tiers & SLAs
| Tier | Scope | Response | Resolution |
|---|---|---|---|
| P1 Critical | Payment/checkout down, data loss | 15 min | 4 hrs |
| P2 Major | Feature degraded, workaround exists | 1 hr | 8 hrs |
| P3 Minor | Non-critical bug, UI issue | 4 hrs | 48 hrs |
| P4 Request | Enhancement, cosmetic fix | 1 day | Sprint |
Scheduled Maintenance Windows
| Activity | Frequency | Impact |
|---|---|---|
| Security patching (OS/K8s) | Weekly | Zero downtime |
| DB maintenance (vacuum/index) | Bi-weekly | Zero downtime |
| Kafka broker rolling upgrade | Monthly | Zero downtime |
| Major version upgrades (EKS) | Quarterly | Blue-green |
| Disaster recovery drills | Quarterly | Failover test |
| Security compliance audit (PCI DSS / SOC 2) | Annual | Scheduled |
On-Call & Incident Management
On-call: follow-the-sun across regions. PagerDuty escalation chains.
Post-incident review: root cause + action items within 48 hrs. Tracked to completion.
Feature Governance & Prioritisation
Post-Launch Feature Roadmap
| Quarter | Feature Upgrades | Priority |
|---|---|---|
| Q1 | ML-powered recommendations, A/B testing infra | High |
| Q2 | Real-time fraud detection, loyalty programme | High |
| Q3 | GraphQL API layer, advanced analytics dashboards | Medium |
| Q4 | Edge computing (CDN functions), AI-driven inventory forecasting | Strategic |
Operational Maturity
Chaos engineering: scheduled fault injection (Litmus/Gremlin). Quarterly game days.
Capacity & cost: predictive scaling models. FinOps reviews. Right-sizing automation.
Dependency hygiene: Dependabot + Snyk scanning. Monthly CVE review cycle.
Version currency: EKS version policy (N-1). Rolling Kafka & PG major upgrades.
Operational excellence: Zero-downtime maintenance · Continuous feature delivery · SRE-driven reliability · Proactive security posture
Assumptions, Questions & Glossary
This section covers key assumptions, questions for leadership, and a glossary of terms referenced throughout this architecture.
Key Assumptions
The following assumptions were made where the case study did not provide specific values. These should be validated with stakeholders before finalising the design.
Traffic & Performance
| Assumption | Value Used | Rationale |
|---|---|---|
| Concurrent users | 10K+ | Estimated for a "rapidly growing" grocery platform with 2× traffic surge |
| Baseline QPS | 10K (scales to 30K) | Derived from 3× growth requirement in case study |
| Availability SLO | 99.95% | Industry standard for e-commerce; case study says "minimal downtime" |
| Browse latency (p95) | 200–400 ms | Competitive benchmark for grocery e-commerce search/browse |
| Checkout latency (p95) | 600–1200 ms | Acceptable threshold for payment processing flows |
| Kafka throughput | 25K evt/sec | Sized for 3× peak with headroom for event-driven flows |
Infrastructure & Cost
| Assumption | Value Used | Rationale |
|---|---|---|
| Cloud provider | AWS | Selected for mature K8s (EKS), global reach, serverless ecosystem |
| Existing database | PostgreSQL | Most common ACID DB for e-commerce monoliths |
| Monthly infra cost | ~$24K/mo | Estimated for EKS + managed services at 10K QPS baseline |
| Redis cache offload | 80–90% | Typical for read-heavy grocery catalogue/inventory lookups |
| Compute savings | RI ~40%, Spot ~60-70% | Published AWS pricing benchmarks for steady-state workloads |
Architecture & Migration
| Assumption | Value Used | Rationale |
|---|---|---|
| Migration pattern | Strangler Fig | Lowest risk for monolith-to-hybrid; incremental value delivery |
| Delivery timeline | ~12 weeks | AI-agent-accelerated migration; Strangler Fig with parallel workstreams and human oversight |
| Team size | ~6 engineers | Assumed cross-functional squad; to be validated with leadership |
| DR progression | Pilot Light → Active/Active | Phased approach to manage cost vs. resilience trade-off |
Business & Integrations
| Assumption | Value Used | Rationale |
|---|---|---|
| Identity & Payment providers | OAuth 2.0 / OIDC, PSP tokenisation, regional rails | Generic integrations; specific providers to be confirmed with leadership |
| ML recommendation split | 60/30/10 | Collaborative (60%), content-based (30%), business rules (10%) |
| Search latency target | <100 ms scoring | End-to-end ML scoring SLA for real-time personalisation |
Conduct a discovery workshop with product, infra, and finance stakeholders to validate these assumptions before committing to detailed sprint planning.
18 assumptions identified — all values are estimated from industry benchmarks and should be refined with actual platform telemetry and business inputs.
Questions for Leadership
To finalise the architecture and migration plan, we need leadership alignment on the following open items from the case study.
Which of the five new regions should we prioritise first, and what is the rollout sequence?
Different regions carry different tax compliance, currency, and warehouse integration complexity. A phased rollout order lets us pilot in lower-risk regions before scaling.
Is the 3× traffic growth expected to be gradual or driven by specific launch events (e.g., regional go-lives, promotions)?
This determines whether we invest in auto-scaling elasticity or pre-provisioned capacity — and how aggressively we optimise burst handling with Spot instances.
What is the acceptable downtime target during peak hours — 99.9% (8.7 hrs/yr) or 99.95% (4.4 hrs/yr)?
The case study requires “minimal downtime during peak hours.” A concrete SLO drives the multi-region failover strategy, error budget gates, and infrastructure cost.
Should real-time inventory checks be per-warehouse or aggregated per-region, and what latency is acceptable for stock updates?
Per-warehouse granularity enables faster delivery SLAs but requires tighter event-streaming integration with each new warehouse partner.
What user data is available for personalised recommendations — purchase history only, or also browsing behaviour and demographic data?
This shapes the ML model complexity (collaborative vs. content-based vs. hybrid) and determines data pipeline and privacy compliance requirements.
Is there a target monthly infrastructure budget, and should we optimise for lowest cost or fastest time-to-market?
The case study asks to “keep infrastructure costs in check while scaling.” A specific envelope helps us decide between Reserved Instances, Spot, and Serverless mix.
How aggressive should the Strangler Fig migration be — stabilise-first (lower risk, longer) or extract-early (faster, higher risk)?
With five new regions launching next quarter, we need to balance migration velocity against the risk of destabilising the monolith during expansion.
What is the current engineering team size, and are there plans to scale the team or adopt accelerated tooling for the migration?
Team capacity directly impacts how many services we can extract in parallel and whether the proposed ~12-week AI-accelerated phased timeline is realistic.
Glossary
Every technical term, acronym, pattern, and standard referenced across all six slides.
Architecture & Patterns
| Term | Definition |
|---|---|
| Monolith | A single deployable unit containing all application modules. The platform's current state — tightly coupled, single DB, single region. |
| Modular Monolith | Candidate A — monolith with enforced module boundaries (hexagonal ports). Strongest consistency, lowest ops complexity, best early velocity. |
| Microservices | Candidate B — independently deployable services communicating via sync (REST/gRPC) and async (events). Enables team autonomy and independent scaling. |
| Serverless | Candidate D — managed functions (e.g., AWS Lambda) triggered by events. Pay-per-use, rapid elasticity. Risk: cold-start latency and retry storms. |
| Evolutionary Hybrid | The recommended architecture — combines best of all four candidates. Start modular, add microservices/events/serverless only where scaling pain justifies. |
| Bounded Context | A DDD concept defining a clear boundary around a domain model, ensuring each service owns its data and logic (e.g., Orders, Payments, Catalogue). |
| DDD | Domain-Driven Design — software modelling approach that structures code around business domains. Drives the 14 bounded contexts in Slide 3. |
| Domain Decomposition | The process of breaking a system into bounded contexts aligned with business capabilities. The platform decomposes into 14 contexts. |
| Hexagonal / Ports & Adapters | Architecture pattern isolating domain logic from external systems (DB, APIs) via ports (interfaces) and adapters (implementations). Enables pluggable IdP/PSP. |
| Strangler Fig | Incremental migration pattern where new functionality wraps the legacy system via a facade (API Gateway), gradually replacing it without a big-bang rewrite. |
| CQRS | Command Query Responsibility Segregation — separates write (command) and read (query) models for independent scaling. Reads from ES/Redis, writes to PG. |
| Saga Pattern | Manages distributed transactions across services via a sequence of local transactions with compensating actions on failure (e.g., void payment if fulfilment fails). |
| Transactional Outbox | Persists domain state and an event-to-be-published in the same DB transaction, preventing "commit succeeded but event lost" failures. |
| Circuit Breaker | Fault-tolerance pattern: after N failures (5 in the platform), requests fail fast for a cooldown period (30s), then half-open to test recovery. ~58% cascade reduction. |
| Bulkhead | Isolates resources (thread pools, connections) so a failure in one component cannot cascade and exhaust shared resources. Enforced via Pod Disruption Budgets. |
| Event Sourcing | Stores state as an immutable, time-ordered sequence of events rather than mutable rows, enabling replay and full auditability. |
| CAP Theorem | States a distributed system can guarantee at most two of Consistency, Availability, and Partition-tolerance. The platform chooses CP for payments/orders (strong consistency) and AP for catalogue/search (eventual consistency + high availability). |
| Brownfield Project | Developing within an existing system — migrating or extending legacy code. The platform is brownfield: monolith → hybrid via Strangler Fig, preserving existing data and business logic. |
| Greenfield Project | Building a new system from scratch with no legacy constraints. New bounded contexts (e.g., ML/Personalisation) in the platform are effectively greenfield within the brownfield migration. |
| Cache-Aside (Lazy Load) | App checks cache first; on miss, reads from DB and populates cache. Default pattern for catalogue and inventory lookups in the platform. |
| Cache-Put (Write-Through) | Writes update both DB and cache atomically, ensuring cache is always fresh. Used for sessions and cart data where stale reads are unacceptable. |
| Write-Behind | Writes go to cache first, then asynchronously flush to DB. Used for analytics counters where eventual consistency is acceptable and write throughput matters. |
| Polyglot Persistence | Using different database technologies for different services based on their needs — PG for transactions, Redis for caching, ES for search, DynamoDB for ML features. |
| Graceful Degradation | Serving stale cached data or reduced functionality when an upstream dependency fails, rather than returning errors. Priority queues for critical paths. |
| Idempotent Consumers | Event consumers that can safely process the same message multiple times without side effects. Essential for at-least-once delivery guarantees on Kafka. |
| Aggregator Pattern | Composite service that calls multiple downstream services and merges results into a single response. Used for product detail pages (Catalogue + Pricing + Inventory). |
| Chained Pattern | Synchronous service-to-service call chain where each step depends on the prior. Used in checkout flow: Cart → Pricing → Payment → Order. Risk: latency compounds per hop. |
| Branch Pattern | Request fans out to multiple services in parallel, results merged. Used for search (Catalogue + Personalisation + Pricing queried simultaneously). |
| Client-Side UI Composition | Frontend independently fetches from multiple BFF endpoints and assembles the page. Each UI section maps to a bounded context, enabling independent team deployment. |
| BFF (Backend for Frontend) | Specialised API layer tailored to specific frontend clients (web, mobile). Minimises over-fetching and optimises response formats per client type. |
| IdP (Identity Provider) | External authentication service managing user credentials and identity verification. Integrated via pluggable hexagonal adapter for vendor flexibility. |
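To make the three caching entries above concrete, here is a minimal cache-aside sketch. The dicts standing in for Redis and PostgreSQL, the `get_product` helper, and the sample SKU are illustrative assumptions, not platform code:

```python
# Cache-aside (lazy load): check the cache, fall back to the store on a miss,
# then populate the cache so subsequent reads are served from memory.
# Plain dicts stand in for Redis (`cache`) and PostgreSQL (`db`).

def get_product(sku, cache, db, stats):
    if sku in cache:                      # cache hit: no DB round trip
        stats["hits"] += 1
        return cache[sku]
    stats["misses"] += 1
    value = db[sku]                       # cache miss: read the source of truth
    cache[sku] = value                    # populate for the next reader
    return value

db = {"sku-1": {"name": "Oat milk", "stock": 12}}
cache, stats = {}, {"hits": 0, "misses": 0}

first = get_product("sku-1", cache, db, stats)   # miss, fills the cache
second = get_product("sku-1", cache, db, stats)  # hit, served from cache
```

Write-through would update `cache[sku]` on the write path instead, and write-behind would enqueue the DB write for asynchronous flushing; only the read-side miss-and-populate step differs.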
Protocols & Standards
| OAuth 2.0 (RFC 6749) | Delegated authorisation framework allowing third-party access to resources without sharing credentials. |
| OIDC | OpenID Connect — identity layer on top of OAuth 2.0, providing authentication and user claims via ID tokens. |
| JWT (RFC 7519) | JSON Web Token — compact, URL-safe format for securely transmitting claims between parties. Validated at API Gateway. |
| JWS (RFC 7515) | JSON Web Signature — ensures data integrity. Used by payment providers and identity servers for signed transaction verification. |
| TLS 1.3 (RFC 8446) | Transport Layer Security — encrypts data in transit. Mandatory for all external communication; mTLS for pod-to-pod. |
| mTLS | Mutual TLS — both client and server authenticate each other. Implemented via the service mesh sidecar for zero-trust networking. |
| CloudEvents | CNCF specification for interoperable event envelope format, ensuring consistent metadata across event-driven systems. |
| AsyncAPI | Machine-readable specification for message-driven APIs (Kafka, AMQP, WebSockets), analogous to OpenAPI for REST. |
| OpenAPI | Machine-readable specification for RESTful APIs. Used with AsyncAPI for contract governance across sync and async services. |
| GraphQL | Query language for APIs allowing clients to request exactly the data they need. Planned for Q3 post-launch feature roadmap. |
| ACID | Atomicity, Consistency, Isolation, Durability — database transaction guarantees. PostgreSQL provides ACID for orders/payments. |
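As a sketch of how JWT and JWS fit together, the following signs and verifies the compact `header.payload.signature` form using only the standard library. It assumes symmetric HS256 for brevity; the gateway described above would use a vetted JWT library and asymmetric keys in practice, and `sign_hs256`, `verify_hs256`, and the demo key are illustrative names:

```python
import base64, hashlib, hmac, json

def b64url(data: bytes) -> str:
    # Base64url without padding, per the JWS compact serialisation
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(header: dict, payload: dict, key: bytes) -> str:
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
    sig = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"

def verify_hs256(token: str, key: bytes):
    head, body, sig = token.split(".")
    expected = hmac.new(key, f"{head}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        return None   # tampered payload or wrong key: reject
    padded = body + "=" * (-len(body) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

key = b"demo-secret"
token = sign_hs256({"alg": "HS256", "typ": "JWT"}, {"sub": "user-42"}, key)
claims = verify_hs256(token, key)            # {"sub": "user-42"}
```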
Deployment & Release Patterns
| Blue-Green Deploy | Two identical environments (blue/green). Deploy to inactive, switch load balancer. Rollback in <1 second by switching back. |
| Canary Deploy | Route 5% of traffic to new version, monitor SLOs (p95, error rate). Auto-promote to 100% or auto-rollback within 60 seconds via Flagger. |
| Feature Flags | Decouple deploy from release. Code is in production but behind a toggle — instant kill-switch. Enables % rollout and A/B testing. |
| Error Budget Gates | Auto-pause releases when SLO reliability degrades beyond the error budget (e.g., 0.1% = ~43 min/month). Prevents shipping during instability. |
| Expand/Contract | Backward-compatible schema migration pattern. Add new columns first (expand), migrate data, then remove old columns (contract). Zero-downtime DB changes. |
| Trunk-Based Dev | All developers commit to a single main branch with short-lived feature branches. Enables continuous integration and zero-touch deploys. |
Consistency Models
| Strong Consistency | All reads reflect the most recent write. Used for Orders, Payments, Identity — where correctness is non-negotiable. |
| Eventual Consistency | Reads may temporarily return stale data, but will converge. Used for Catalogue (~5min CDN), Analytics, ML features. Bounded staleness. |
| Near-Real-Time | Sub-second propagation delay. Used for Risk/Fraud Signals where freshness matters but strong consistency is unnecessary. |
| At-Least-Once Delivery | Message delivery guarantee where events may be delivered more than once. Requires idempotent consumers. Used for Notifications. |
Infrastructure & Cloud Services
| EKS | Elastic Kubernetes Service — AWS-managed Kubernetes for container orchestration, auto-scaling, and rolling deployments. 3-AZ per region. |
| Istio / Service Mesh | Service mesh providing mTLS, traffic management, circuit breaking, rate limiting, and distributed tracing between services via sidecar proxies. |
| API Gateway | Entry point for all client requests. Handles AuthN/AuthZ, JWT validation, rate limiting, traffic routing, and acts as the Strangler Facade. |
| CloudFront / CDN | Content Delivery Network — caches static assets at edge locations globally. L1 caching layer. Includes WAF (Web Application Firewall) for DDoS protection. |
| Route 53 | AWS Global DNS service. Routes users to the nearest region via latency-based or geolocation routing policies. |
| Global Accelerator | AWS Anycast routing service that directs traffic to optimal endpoints via AWS's global network, reducing internet hops and latency. |
| AWS Lambda | Serverless compute for bursty workloads (receipt PDF gen, webhook dispatch). Pay-per-invocation. Risk: cold-start tail latency. |
| ArgoCD | GitOps continuous delivery tool that syncs Kubernetes manifests from Git to clusters, ensuring declarative, auditable deployments. |
| Flagger | Progressive delivery operator — automates canary rollouts (5%→100%) with metrics-driven auto-rollback via Prometheus. |
| Terraform / IaC | Infrastructure as Code tool for provisioning cloud resources via declarative configuration. Parameterised regional modules for 5-region deployment. |
| HPA / VPA | Horizontal Pod Autoscaler scales pod count by CPU/memory metrics (~30s). Vertical Pod Autoscaler right-sizes resource requests via 7-day analysis. |
| Cluster Autoscaler | Kubernetes component that adds/removes worker nodes when pods are unschedulable or nodes are underutilised. |
| Vault (HashiCorp) | Cloud-agnostic secrets management with dynamic secrets, auto-rotation, fine-grained RBAC, and K8s external-secrets operator. Chosen for multi-cloud portability. |
| AWS KMS | Key Management Service — fully managed envelope encryption with IAM integration. The platform uses KMS for encrypting Vault's storage backend. Lower ops overhead but AWS-locked. |
| AWS Secrets Manager | Managed key-value secret store with Lambda-based auto-rotation. Alternative to Vault for AWS-only deployments; the platform uses Vault instead for multi-cloud flexibility. |
| Multi-AZ | Multi-Availability Zone — deploying across 3+ data centres within a region for fault tolerance. EKS, RDS, and Redis all run multi-AZ. |
| RDS | Relational Database Service — AWS-managed database hosting. Runs PostgreSQL with multi-AZ failover and automated backups. |
| GitOps | Operational model where Git is the single source of truth for infrastructure and application state. Changes via PRs with automated reconciliation. |
| DORA Metrics | Four key software delivery metrics: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery (MTTR). |
| Pod Disruption Budget | Kubernetes policy limiting how many pods can be unavailable during voluntary disruptions (upgrades, scaling). Enforces bulkhead isolation. |
| Spot / Reserved Instances | Spot instances offer 70-90% savings for interruptible burst workloads. Reserved instances offer ~40-60% savings for baseline compute. |
| WAF (Web Application Firewall) | Filters and monitors HTTP requests at the CDN edge, protecting against DDoS, SQL injection, XSS, and OWASP Top 10 attacks before traffic reaches the application. |
| Redis Sentinel | High-availability solution for Redis providing automatic failover, monitoring, and configuration management across multi-AZ clusters. |
| ElastiCache / MemoryDB | AWS managed Redis services: ElastiCache provides hosting with automatic backups and failover; MemoryDB adds Redis-compatible durability for primary data store use cases. |
| React / Next.js | React is a component-based JavaScript UI library. Next.js is a React framework providing server-side rendering, static generation, and optimised production builds for the web storefront. |
| SwiftUI | Apple's declarative UI framework for building native iOS apps with reactive state management and improved developer velocity. |
Data & Messaging
| Apache Kafka | Distributed event streaming platform. Chosen over RabbitMQ (lower throughput), SQS (no replay, AWS-only), and Pulsar (smaller ecosystem). Provides durable log replay, 25K+ evt/sec, partitioned ordering, consumer groups, and schema registry integration. |
| PostgreSQL | Open-source RDBMS providing ACID transactions for orders/payments. Runs with read replicas, PgBouncer pooling, and sharding (4→12). |
| Redis | Open-source (BSD licence, free) in-memory key-value store. Sub-ms reads, data structures (sorted sets, hashes, streams), pub/sub, persistence. Chosen over Memcached (no data structures) and Hazelcast (higher overhead). Production runs on AWS ElastiCache (paid managed service). |
| Elasticsearch | Distributed search engine for full-text product search, faceted navigation, and CQRS read models with sub-50ms query latency. Auto-sharded. |
| DynamoDB | AWS fully managed NoSQL database with global tables, auto-sharding. Used for ML feature store with 6hr TTL and pay-per-request pricing. |
| PgBouncer | Lightweight connection pooler for PostgreSQL. Manages ~600-700 connections, preventing DB connection exhaustion under load. |
| Read Replicas | Database copies that serve read queries, offloading the primary. ~1-2 second replication lag. Used with PG and Redis for horizontal read scaling. |
| Schema Registry | Centralised store for event schemas (Avro/Protobuf). Enforces compatibility rules to prevent event schema sprawl across Kafka topics. |
| DLQ (Dead Letter Queue) | Queue for messages that fail processing after max retries. Prevents poison messages from blocking consumers. Monitored for manual review. |
| Change Data Capture | CDC — captures database changes as events. Used to migrate from shared-schema monolith to per-service databases during extraction phases. |
| Dual-Write | Writing to both old and new systems simultaneously during migration. Combined with outbox verification to ensure consistency before cutover. |
| Shadow Projections | Running CQRS read models in parallel with existing queries (shadow mode) to compare results before switching over. Risk-free CQRS introduction. |
| ETag / HTTP 304 | L4 (client-side) tier of the caching hierarchy. The server returns an entity tag; the client echoes it back via If-None-Match. If the resource is unchanged, the server responds 304 (Not Modified) with zero payload transfer. |
| RDB / AOF (Redis Persistence) | Redis persistence mechanisms: RDB creates point-in-time snapshots for fast recovery; AOF (Append-Only File) logs every write for maximum durability. The platform uses AOF for session/cart data. |
Security & Compliance
| PSP Tokenisation | Payment Service Provider replaces raw card credentials with a token, reducing PCI DSS 4.0 scope to the Payments service only (dedicated network segment, annual SAQ validation) and enabling multi-provider payment flows. |
| Payment Rails | Regional payment networks (UPI in India, Pix in Brazil, SEPA in EU) requiring adapter integrations per market via hexagonal ports. |
| UPI (Unified Payments Interface) | India's real-time interbank payment system enabling instant transfers via mobile devices. Mandated by RBI for domestic transactions. |
| IMPS (Immediate Payment Service) | India's 24/7 electronic funds transfer system for inter-bank and intra-bank transactions. Complements UPI for non-mobile payment flows. |
| Pix | Brazil's instant payment system enabling 24/7 real-time transfers with minimal fees. Regulated by the Central Bank of Brazil (BCB). |
| SEPA (Single Euro Payments Area) | European payment infrastructure enabling cross-border euro transfers within the EU and EEA with standardised processing times and fees. |
| PCI DSS / PCI Scope | Payment Card Industry Data Security Standard (v4.0). PSP tokenisation reduces scope to the Payments service only — isolation via dedicated network segment, annual SAQ validation. |
| AuthN / AuthZ | Authentication (who are you?) and Authorisation (what can you do?). Handled at API Gateway via JWT validation and RBAC policies. |
| RBAC | Role-Based Access Control — assigns permissions based on roles. Used in K8s for secret access and at the application layer for user authorisation. |
| SPIFFE | Secure Production Identity Framework for Everyone — provides cryptographic service identities for zero-trust workloads. |
| Zero Trust (SP 800-207) | NIST security model requiring continuous verification of all users, assets, and resources — never implicitly trust, always verify. |
| Cosign / Sigstore | Container image signing and verification tools for supply chain security, ensuring only trusted images are deployed to K8s. |
| SBOM | Software Bill of Materials — inventory of all components/dependencies in a build. Generated in CI for vulnerability tracking and supply chain security. |
| OWASP ASVS | Application Security Verification Standard (v5.0.0) — security requirements verification framework for web applications. Used to define and validate controls across authentication, session management, access control, and data protection. Level 2 targeted. |
| SAST | Static Application Security Testing — white-box code analysis that scans source code for vulnerabilities before deployment. Integrated into CI alongside DAST. |
| SCA | Software Composition Analysis — scans third-party dependencies for known vulnerabilities (CVEs) and licence compliance risks. Complements SBOM generation. |
| DAST | Dynamic Application Security Testing — black-box security scanning of running applications. Integrated into CI/CD pipeline for every release. |
| Rate Limiting | Per-tenant request quotas enforced at the service mesh. Hard reject (HTTP 429) above threshold to prevent noisy-neighbour problems. |
| Request Throttling | Gradual backpressure before hard limit — returns HTTP 429 with Retry-After header, using token bucket algorithm at gateway. Slows traffic rather than rejecting it outright. Distinct from rate limiting. |
| Load Shedding | Under extreme load, intentionally drop low-priority requests (analytics, recs) to protect critical paths (checkout, payments). Triggered by CPU/memory thresholds. Last resort before circuit breaker opens. |
| GDPR | General Data Protection Regulation — EU data privacy law. Requires data residency (eu-west), lawful basis for processing, consent management, DPIA for high-risk processing, 72-hour breach notification, data subject rights (access, erasure, portability), and DPA with processors. |
| LGPD | Lei Geral de Proteção de Dados — Brazil's data protection law. Requires data residency (sa-east), consent management, breach notification to ANPD, and DPO appointment. Key differences from GDPR: applies to legal entities, fines capped at 2% revenue / 50M BRL, enforced by ANPD. |
| RBI Compliance | Reserve Bank of India regulations: payment data localisation (ap-south), mandatory 2FA for transactions, local payment rails (UPI/IMPS), Cyber Security Framework compliance, 48-hour incident reporting, and annual security audits. |
| Data Residency | Legal requirement to store and process personal data within specific geographic boundaries. Drives region-local database masters. |
Observability & Operations
| OpenTelemetry (OTel) | Vendor-neutral observability framework for collecting traces, metrics, and logs with consistent resource context across all service layers. |
| Prometheus | K8s-native time-series database for metrics collection. Powers Grafana dashboards and Flagger canary analysis. |
| Grafana | Dashboarding and visualisation platform. Hosts SLO burn-rate alerts, service maps, and DORA metrics dashboards. |
| PagerDuty / OpsGenie | Incident alerting and on-call management platforms. Escalation chains for P1-P4 incidents. Follow-the-sun rotation across regions. |
| SLO / Error Budget | Service Level Objective defines target reliability (99.9% browse, 99.95% checkout). Error budget = allowed failure margin — exhaustion gates deploys. |
| p95 Latency | 95th percentile response time — 95% of requests complete faster than this threshold. Browse target: 200-400ms. Checkout: 600-1200ms. |
| Distributed Tracing | Trace ID → Span ID → Service Map correlation across all hops. Head-based sampling + 15-day retention for high-cardinality traces. |
| Kubecost / FinOps | Cloud financial management combining engineering and finance. Kubecost provides per-namespace cost visibility, right-sizing recommendations, and VPA integration. |
| SRE | Site Reliability Engineering — operational discipline applying software engineering to infrastructure. Drives SLOs, error budgets, and blameless postmortems. |
| Chaos Engineering | Scheduled fault injection (Litmus/Gremlin) to proactively discover weaknesses. Quarterly game days simulate region failures and cascade scenarios. |
| Blameless Postmortem | Incident review focused on root cause and systemic improvements rather than individual fault. Action items tracked to completion within 48 hrs. |
| Dependabot / Snyk | Automated dependency scanning tools that detect CVEs (Common Vulnerabilities and Exposures) in third-party libraries. Monthly review cycle. |
| DR (Disaster Recovery) | Business continuity strategy. Progression: Pilot Light → Warm Standby → Active/Active. Quarterly failover drills across regions. |
Business, ML & Governance
| Collaborative Filtering | ML technique recommending products based on similar users' behaviour. Makes up ~60% of the platform's hybrid recommendation engine. |
| Content-Based Filtering | ML technique recommending products based on item attributes matching user preferences. ~30% of the recommendation engine mix. |
| Feature Store | Centralised repository for ML features (DynamoDB with 6hr TTL). Ensures consistent feature values between training and serving, with <100ms scoring. |
| A/B Testing | Controlled experiment serving different variants to user segments to measure impact. Used with feature flags to validate ML models and UX changes. |
| RICE Scoring | Prioritisation framework: Reach × Impact × Confidence ÷ Effort. Used in feature governance to rank roadmap items objectively. |
| RFC / ADR | Request for Comments / Architecture Decision Record — formal documents for proposing and recording architectural decisions with impact analysis. |
| Consumer-Driven Contracts | Testing pattern where API consumers define expected behaviour. Provider verifies against these contracts in CI, preventing breaking changes. |
| Connection Pooling | Reusing database connections across requests instead of creating new ones. PgBouncer manages ~600-700 connections for PostgreSQL. |
18 assumptions + 101 glossary terms — covering architecture patterns, consistency models, protocols, cloud infrastructure, data systems, security, compliance, observability, operations, ML, and governance
Thank you for your time. We welcome further discussion on any of the above.