
System overview

Atlas Infra is a shared AWS stack with one public edge, one shared ECS cluster, one Kafka backbone, and separate runtime surfaces for the workloads in this repository.

Runtime topology

| Layer | Current shape |
| --- | --- |
| Public ingress | Internet-facing ALB in public subnets with host-header routing for events ingestion, dashboard backend, and Kafka UI |
| Compute | ECS/Fargate in private subnets |
| Streaming | Amazon MSK with IAM + TLS |
| Storage | PostgreSQL RDS for dashboard backend and Camunda, ElastiCache for Valkey, plus S3 for optional Kafka exports |
| Secrets | AWS Secrets Manager |
| Scheduled operations | EventBridge Scheduler + Lambda for staged start/stop, plus manual Kafka cleanup; DevOps Agent investigation wiring exists only as a test path |
| Observability | CloudWatch logs, CloudWatch alarms, CloudWatch dashboard, SNS, AWS Budgets, a dedicated Slack alarm path, New Relic dashboards for ECS/RDS/MSK, New Relic AWS pull integration, and ClickHouse Cloud Prometheus agents for New Relic remote write |

Request flow

Events ingestion API

The default ALB target group and the events_ingestion_host rule both route to the events ingestion ECS service. The service runs on the shared ECS cluster, reads its runtime config from Secrets Manager, publishes telemetry to MSK, and can add a non-essential newrelic-infra sidecar when the root enables that observability path.

Dashboard backend

Requests for dashboard_backend_host are routed to a dedicated ECS service and target group. The service receives configuration from a dedicated Secrets Manager secret, persists data in PostgreSQL RDS, and reaches MSK through the shared internal ECS security-group path.

Kafka UI

Requests for kafka_ui_host are routed to a separate ECS service that connects to the same MSK cluster over IAM + TLS for operator inspection.
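The three public routes described above can be modeled as a small lookup with a default. This is a minimal sketch for reasoning about the edge, not the ALB configuration itself: the hostnames and the target labels are hypothetical placeholders, and only the variable names events_ingestion_host, dashboard_backend_host, and kafka_ui_host come from this page.

```python
from typing import Dict, Mapping

# Hypothetical labels for the three ECS services behind the public ALB.
EVENTS_INGESTION = "events-ingestion"
DASHBOARD_BACKEND = "dashboard-backend"
KAFKA_UI = "kafka-ui"


def build_rules(
    events_ingestion_host: str,
    dashboard_backend_host: str,
    kafka_ui_host: str,
) -> Dict[str, str]:
    """Map each configured hostname to the service that should receive it."""
    return {
        events_ingestion_host.lower(): EVENTS_INGESTION,
        dashboard_backend_host.lower(): DASHBOARD_BACKEND,
        kafka_ui_host.lower(): KAFKA_UI,
    }


def route(host_header: str, rules: Mapping[str, str]) -> str:
    """Pick a target for a request; unmatched hosts fall through to the default.

    Hostnames are compared case-insensitively. The default target is the
    events ingestion service, as described for the default ALB target group.
    """
    return rules.get(host_header.strip().lower(), EVENTS_INGESTION)


# Placeholder hostnames, used only for illustration.
rules = build_rules(
    events_ingestion_host="events.example.com",
    dashboard_backend_host="dashboard.example.com",
    kafka_ui_host="kafka-ui.example.com",
)
```

With these placeholder rules, route("dashboard.example.com", rules) selects the dashboard backend, while any unrecognized host lands on the events ingestion service because it backs both the default target group and its own host rule.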

Scoring + Camunda

The scoring service is internal-only. It accepts in-cluster HTTP calls from the dashboard backend on http://scoring:8083 over ECS Service Connect, publishes results to atlas.l3.user.score, and calls Camunda over the internal http://camunda:8080/engine-rest endpoint. It no longer consumes any Kafka topic; the previous atlas.l2.transaction.deposit consumer was removed. Camunda is not exposed through the public ALB. It persists process state in its own PostgreSQL RDS instance and is reachable only from internal ECS workloads.
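The call path in the paragraph above can be sketched with injected dependencies, so the flow is visible without a running cluster. Everything beyond the two internal URLs and the topic name is an assumption: the handler shape, the payload fields, the hypothetical process key score-review, the order of the Camunda call and the publish, and the use of the Camunda 7 style /process-definition/key/{key}/start path are illustrative and are not taken from the scoring service's code.

```python
from typing import Any, Callable, Dict

# Values stated on this page.
SCORING_URL = "http://scoring:8083"  # where the dashboard backend calls in
CAMUNDA_REST_URL = "http://camunda:8080/engine-rest"
SCORE_TOPIC = "atlas.l3.user.score"

# Hypothetical process key, used only for illustration.
PROCESS_KEY = "score-review"


def camunda_start_url(process_key: str) -> str:
    """Build a process-start URL under the internal engine-rest endpoint."""
    return f"{CAMUNDA_REST_URL}/process-definition/key/{process_key}/start"


def handle_score_request(
    request: Dict[str, Any],
    compute_score: Callable[[Dict[str, Any]], float],
    call_camunda: Callable[[str, Dict[str, Any]], None],
    publish: Callable[[str, Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Score an inbound HTTP request, notify Camunda, then publish the result.

    Nothing is consumed from Kafka here: the only input is the HTTP request.
    """
    score = compute_score(request)
    result = {"user_id": request["user_id"], "score": score}
    # Assumed ordering: Camunda first, then the Kafka publish.
    call_camunda(
        camunda_start_url(PROCESS_KEY),
        {"variables": {"score": {"value": score}}},
    )
    publish(SCORE_TOPIC, result)
    return result
```

Passing the HTTP client and the Kafka producer in as callables keeps the sketch testable with plain lambdas, and makes it easy to see that the service has one inbound surface (HTTP) and two outbound ones (Camunda and MSK).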

Valkey cache

Both roots now provision a dedicated ElastiCache Valkey replication group in the private subnets. The first rollout uses cluster mode disabled with one primary, one replica, Multi-AZ failover, IPv4 connectivity, one day of backup retention, and no at-rest or in-transit encryption. The cache security group allows 6379 only from the approved ECS workload security groups.
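As a compact restatement of the first-rollout settings, the sketch below collects them into a dictionary keyed by the ElastiCache CreateReplicationGroup parameter names. The repository provisions the cache through Terraform, so this is only a reading aid: the group ID, description, subnet group name, and security group IDs are hypothetical, and the ingress check is a simplified model of the security-group rule rather than real AWS evaluation logic.

```python
from typing import Any, Dict, Iterable, Set

VALKEY_PORT = 6379


def valkey_replication_group_params(
    group_id: str,
    subnet_group_name: str,
    security_group_ids: Iterable[str],
) -> Dict[str, Any]:
    """Return the first-rollout cache settings described above."""
    return {
        "ReplicationGroupId": group_id,
        "ReplicationGroupDescription": "Atlas Valkey cache (illustrative)",
        "Engine": "valkey",
        "ClusterMode": "disabled",          # cluster mode disabled
        "NumCacheClusters": 2,              # one primary + one replica
        "AutomaticFailoverEnabled": True,   # needed for Multi-AZ failover
        "MultiAZEnabled": True,
        "NetworkType": "ipv4",
        "SnapshotRetentionLimit": 1,        # one day of backup retention
        "AtRestEncryptionEnabled": False,
        "TransitEncryptionEnabled": False,
        "Port": VALKEY_PORT,
        "CacheSubnetGroupName": subnet_group_name,
        "SecurityGroupIds": list(security_group_ids),
    }


def ingress_allowed(source_sg: str, port: int, approved_sgs: Set[str]) -> bool:
    """Simplified model: only 6379 from approved ECS workload groups."""
    return port == VALKEY_PORT and source_sg in approved_sgs


# Hypothetical identifiers, for illustration only.
params = valkey_replication_group_params(
    group_id="atlas-valkey",
    subnet_group_name="atlas-private-cache",
    security_group_ids=["sg-cache"],
)
approved = {"sg-dashboard-backend", "sg-scoring"}
```

Because in-transit encryption is off in this rollout, clients would connect without TLS, which is one reason the security-group rule is the main access control in this shape.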

ClickHouse Cloud observability

The ClickHouse Cloud Prometheus agent is a dedicated internal ECS/Fargate service that can be enabled per root. The current staging and production outputs both expose the agent service, secret, and New Relic dashboard. The service has no public ingress, reads ClickHouse Cloud and New Relic credentials from its own Secrets Manager secret, scrapes one ClickHouse Cloud service through api.clickhouse.cloud with filtered_metrics=false, and sends the metrics to New Relic Prometheus remote write.
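To make the scrape target concrete, the sketch below builds the URL the agent would request. The host api.clickhouse.cloud and the filtered_metrics=false parameter come from this page; the /v1/organizations/.../services/.../prometheus path layout is an assumption based on the public ClickHouse Cloud API, and the organization and service IDs are placeholders.

```python
from urllib.parse import urlencode

CLICKHOUSE_API_HOST = "api.clickhouse.cloud"


def clickhouse_prometheus_url(
    organization_id: str,
    service_id: str,
    filtered_metrics: bool = False,
) -> str:
    """Build the per-service Prometheus scrape URL (assumed path layout)."""
    query = urlencode({"filtered_metrics": str(filtered_metrics).lower()})
    return (
        f"https://{CLICKHOUSE_API_HOST}/v1/organizations/{organization_id}"
        f"/services/{service_id}/prometheus?{query}"
    )


# Placeholder IDs; real values would come from the agent's Secrets Manager secret.
scrape_url = clickhouse_prometheus_url("ORG_ID", "SERVICE_ID")
```

With filtered_metrics=false the endpoint is expected to return the full metric set rather than a reduced subset, so the volume sent on to New Relic remote write is larger than it would be with filtering on.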

Shared building blocks

  • VPC: two availability zones, public and private subnets, NAT gateways, an S3 gateway endpoint, and workload-specific security groups.
  • ALB: one HTTPS edge with host-based routing for the public Atlas workloads that need internet ingress.
  • MSK: one Kafka cluster shared by the services in this repository, with outputs for internal and public IAM + TLS bootstrap brokers.
  • ElastiCache for Valkey: one private cache layer for backend workloads with root outputs for the primary and reader endpoints.
  • Monitoring: standard ECS service metrics, CloudWatch log groups, alarm wiring, a dedicated Slack alert path, a CloudWatch dashboard, New Relic ECS/RDS/MSK dashboards, RDS storage/load visibility, selective MSK enhanced monitoring, New Relic AWS pull integration, and ClickHouse Cloud Prometheus remote-write paths.
  • Scheduled operations: staging adds start, stop, cleanup, and alarm_investigation Lambdas; only start and stop are wired to EventBridge Scheduler. The alarm_investigation Lambda was used for a DevOps Agent proof of concept and is not an implemented operational alerting path.

Current operating model

  • terraform/staging is the current active environment root, but its default naming still deploys the poc-atlas-dev shape.
  • terraform/staging2 adds a second staging workload plane with its own ALB, ECS cluster, databases, cache, and secrets while reusing the shared staging VPC, approved security groups, MSK cluster, and application ECR repositories.
  • terraform/prod uses the same module graph with production-oriented values such as private MSK placement, multi-VPC connectivity, and private RDS placement.
  • The shared environment-operations module is enabled only in staging; prod keeps the same module wiring disabled, and both roots expose a toggle to keep the Scheduler entries disabled until explicitly re-enabled.
  • The repository provisions infrastructure only. The events ingestion API, dashboard backend, and scoring images are expected to be built and pushed before their ECS services can run. Camunda and the ClickHouse Prometheus agent use upstream images directly.
Tip: Use Environment model next if you need to understand why the staging directory and the environment = "dev" default are both present.