Network and security

Atlas uses one shared VPC per full environment root, split across two availability zones with explicit security group boundaries between the edge, workloads, database, and Kafka. The current exception is terraform/staging2, which reuses the VPC, subnets, and approved security groups already owned by terraform/staging instead of creating a second network foundation.

Network shape

| Concern | Current implementation |
| --- | --- |
| Availability zones | 2 dynamically selected AZs |
| Subnets | 2 public and 2 private subnets |
| Internet egress | 1 NAT gateway per public subnet |
| Private S3 path | S3 gateway VPC endpoint attached to private route tables |
| Flow logs | VPC Flow Logs to CloudWatch for rejected traffic in the committed roots |
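
As a rough illustration of three rows in this table (AZ selection, the S3 gateway endpoint, and reject-only flow logs), a minimal Terraform sketch might look like the following. The variable names, resource labels, log group name, and retention period are assumptions made for the example and are not taken from the Atlas modules.

```hcl
# Illustrative sketch only: every name below is hypothetical.
variable "aws_region" {
  type = string
}

variable "vpc_id" {
  type = string
}

variable "private_route_table_ids" {
  type = list(string)
}

variable "flow_log_role_arn" {
  type        = string
  description = "IAM role that lets VPC Flow Logs write to CloudWatch Logs"
}

# Select two AZs dynamically instead of hard-coding their names.
data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  azs = slice(data.aws_availability_zones.available.names, 0, 2)
}

# Gateway endpoint: S3 traffic from the private subnets bypasses the NAT gateways.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/vpc/flow-logs" # hypothetical name
  retention_in_days = 30               # hypothetical retention
}

# Capture rejected traffic only and ship it to CloudWatch Logs.
resource "aws_flow_log" "rejects" {
  vpc_id               = var.vpc_id
  traffic_type         = "REJECT"
  log_destination_type = "cloud-watch-logs"
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  iam_role_arn         = var.flow_log_role_arn
}

output "selected_azs" {
  value = local.azs
}
```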

staging2 reuses the same two public and two private subnets from staging. It does not create a second NAT, route-table set, or VPC endpoint path.
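
One common way for a second root to reuse another root's network is the `terraform_remote_state` data source. The sketch below assumes an S3 backend and hypothetical output names (`vpc_id`, `private_subnet_ids`, `ecs_sg_id`); this page does not say which mechanism or names terraform/staging2 actually uses.

```hcl
# Illustrative sketch only: backend settings and output names are hypothetical.
data "terraform_remote_state" "staging" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"   # hypothetical bucket
    key    = "staging/terraform.tfstate" # hypothetical key
    region = "us-east-1"                 # hypothetical region
  }
}

locals {
  # Read-only references to resources owned by terraform/staging.
  vpc_id             = data.terraform_remote_state.staging.outputs.vpc_id
  private_subnet_ids = data.terraform_remote_state.staging.outputs.private_subnet_ids
  ecs_sg_id          = data.terraform_remote_state.staging.outputs.ecs_sg_id
}
```

With a pattern like this, the consuming root only reads values, so renaming or removing an output in the owning root would break the consumer's plan.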

Security group model

| Security group | Allows |
| --- | --- |
| alb-sg | inbound 80 and 443 from alb_ingress_cidrs, outbound unrestricted |
| ecs-sg | inbound 8080 only from alb-sg, outbound unrestricted |
| dashboard backend SG | inbound target port from alb-sg, outbound unrestricted; attached together with the shared ecs-sg for common internal egress paths such as MSK |
| scoring SG | inbound 8083 from alb-sg and dashboard backend SG, outbound unrestricted |
| Camunda SG | inbound 8080 only from scoring SG, outbound unrestricted |
| Valkey SG | inbound 6379 from the shared ecs-sg plus the dashboard backend, scoring, and Camunda SGs; outbound unrestricted |
| ClickHouse Prometheus agent SG | no inbound, outbound unrestricted for HTTPS egress to ClickHouse Cloud, New Relic, and AWS service endpoints through NAT |
| msk-connect-sg | no inbound, outbound unrestricted for connector workers |
| msk-sg | inbound 9098 from ecs-sg and msk-connect-sg, inbound 9198 from msk_public_access_cidrs |
| RDS SG | inbound 5432 from allowed CIDRs plus the dashboard backend or Camunda SG |

staging2 attaches its duplicated workloads to these same security-group IDs. That means no *-sg2 copies exist for the shared edge, ECS, MSK, service, cache, or database paths.
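
The first two rows of the security group model (alb-sg and ecs-sg) could be expressed roughly as below. This is a sketch under assumptions: the resource labels and variable layout are hypothetical, and the actual VPC module may structure its rules differently. It uses the standalone rule resources from the AWS provider so that each allowed path is a separate, reviewable object.

```hcl
# Illustrative sketch only: resource labels and variable layout are hypothetical.
variable "vpc_id" {
  type = string
}

variable "alb_ingress_cidrs" {
  type = list(string)
}

resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "ecs" {
  name   = "ecs-sg"
  vpc_id = var.vpc_id
}

# alb-sg: HTTPS from each allowed CIDR (port 80 would be declared the same way).
resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  for_each          = toset(var.alb_ingress_cidrs)
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = each.value
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}

# ecs-sg: 8080 only from alb-sg, referenced by security group rather than by CIDR.
resource "aws_vpc_security_group_ingress_rule" "ecs_from_alb" {
  security_group_id            = aws_security_group.ecs.id
  referenced_security_group_id = aws_security_group.alb.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}

# "Outbound unrestricted": all protocols to any IPv4 destination.
resource "aws_vpc_security_group_egress_rule" "ecs_all" {
  security_group_id = aws_security_group.ecs.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "-1"
}
```

Referencing the source security group instead of a CIDR keeps the rule valid when task IP addresses change.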

Ingress model

  • The ALB is internet-facing and sits in public subnets.
  • Port 80 redirects to 443.
  • The default HTTPS listener action forwards to the events ingestion target group.
  • Additional listener rules route the dashboard backend, scoring service, and Kafka UI by hostname.
  • Internal dashboard-backend-to-scoring traffic uses ECS Service Connect with the scoring alias instead of the public scoring hostname.
  • Internal scoring-to-Camunda traffic uses ECS Service Connect with the camunda client alias instead of the public ALB hostname.
  • Internal ECS workloads reach Valkey through the dedicated cache security group on 6379.
  • The Camunda Service Connect config sets per_request_timeout_seconds = 30 so that the scoring worker's long-poll requests are not cut off by the shorter default Service Connect per-request timeout for HTTP.
  • The ClickHouse Prometheus agent has no inbound path and reaches api.clickhouse.cloud plus New Relic remote write over outbound HTTPS through private-subnet NAT.
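
The Camunda Service Connect settings described in this list could look roughly like the sketch below. Only the camunda alias, port 8080, the 30-second per-request timeout, and private-subnet placement come from this page; the variable names, the service name, the `http` port name, and the launch type are assumptions for illustration.

```hcl
# Illustrative sketch only: variables, service name, and port name are hypothetical.
variable "cluster_arn" {
  type = string
}

variable "camunda_task_definition_arn" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "camunda_sg_id" {
  type = string
}

variable "service_connect_namespace_arn" {
  type = string
}

resource "aws_ecs_service" "camunda" {
  name            = "camunda"
  cluster         = var.cluster_arn
  task_definition = var.camunda_task_definition_arn
  desired_count   = 1
  launch_type     = "FARGATE" # assumed launch type

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [var.camunda_sg_id]
    assign_public_ip = false
  }

  service_connect_configuration {
    enabled   = true
    namespace = var.service_connect_namespace_arn

    service {
      # Must match a named port mapping in the task definition. The per-request
      # timeout only applies when that mapping uses a non-TCP app protocol.
      port_name = "http"

      # Clients in the namespace reach the service at camunda:8080.
      client_alias {
        dns_name = "camunda"
        port     = 8080
      }

      # Give long-poll requests up to 30 seconds before the proxy times out.
      timeout {
        per_request_timeout_seconds = 30
      }
    }
  }
}
```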

Current exposure notes

ALB exposure

Both the staging example values and the committed production values currently allow alb_ingress_cidrs = ["0.0.0.0/0"]. The module supports a tighter allow-list, but the current committed state is wide open at the edge.
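
For comparison, a tighter allow-list would be supplied through the same input. The CIDRs below are reserved documentation ranges used purely as placeholders, not values from any Atlas environment.

```hcl
# terraform.tfvars (illustrative; placeholder documentation CIDRs, not real values)
alb_ingress_cidrs = [
  "203.0.113.0/24",   # e.g. an office egress range
  "198.51.100.10/32", # e.g. a single partner address
]
```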

MSK public access

The MSK module enables service-provided public broker EIPs when msk_enable_public_access = true. The VPC module separately opens inbound port 9198 on msk-sg to msk_public_access_cidrs, which controls which external client CIDRs can use the public IAM + TLS path.
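
The two halves of this arrangement can be sketched as follows. The input names `msk_enable_public_access` and `msk_public_access_cidrs` come from this page; `msk_sg_id`, the local, and the resource label are hypothetical, and the real modules may wire these values differently.

```hcl
# Illustrative sketch only: msk_sg_id, the local, and the resource label are hypothetical.
variable "msk_sg_id" {
  type = string
}

variable "msk_enable_public_access" {
  type    = bool
  default = false
}

variable "msk_public_access_cidrs" {
  type    = list(string)
  default = []
}

locals {
  # Value intended for the aws_msk_cluster argument
  # broker_node_group_info.connectivity_info.public_access.type.
  msk_public_access_type = var.msk_enable_public_access ? "SERVICE_PROVIDED_EIPS" : "DISABLED"
}

# Port 9198 is the public IAM + TLS path; one rule per external client CIDR.
resource "aws_vpc_security_group_ingress_rule" "msk_public_iam" {
  for_each          = toset(var.msk_public_access_cidrs)
  security_group_id = var.msk_sg_id
  cidr_ipv4         = each.value
  ip_protocol       = "tcp"
  from_port         = 9198
  to_port           = 9198
}
```

Because the two settings are independent, leaving `msk_public_access_cidrs` empty keeps port 9198 closed even if public broker EIPs are enabled.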

Dashboard database access

The staging example keeps RDS on a public subnet group with publicly_accessible = true and open CIDR defaults. Production committed values move the database to private subnets and disable public accessibility.

Operational implications

  • ECS tasks stay in private subnets with assign_public_ip = false.
  • The ElastiCache Valkey subnet group also stays in the private subnets and is not exposed publicly.
  • Hostname routing is managed at the ALB listener level, not inside a separate ingress service.
  • The dashboard backend joins the same Service Connect namespace as a client, but remains publicly reachable only through the ALB route.
  • Security hardening happens primarily through input values, not by changing the root module graph.
  • terraform/staging must remain the owner of the shared staging foundation outputs that terraform/staging2 consumes.
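
Listener-level hostname routing of the kind described above can be sketched as follows: a port 80 redirect, a default HTTPS forward to the events ingestion target group, and one host-based rule. All variable names, the rule priority, and the redirect status code are assumptions for the example; only the overall shape is taken from this page.

```hcl
# Illustrative sketch only: variable names, priority, and status code are hypothetical.
variable "alb_arn" {
  type = string
}

variable "certificate_arn" {
  type = string
}

variable "events_target_group_arn" {
  type = string
}

variable "scoring_target_group_arn" {
  type = string
}

variable "scoring_hostname" {
  type = string
}

# Port 80 redirects to 443.
resource "aws_lb_listener" "http" {
  load_balancer_arn = var.alb_arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

# Default HTTPS action forwards to the events ingestion target group.
resource "aws_lb_listener" "https" {
  load_balancer_arn = var.alb_arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = var.events_target_group_arn
  }
}

# Host-based rule: requests for the scoring hostname go to the scoring target group.
resource "aws_lb_listener_rule" "scoring" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 20

  action {
    type             = "forward"
    target_group_arn = var.scoring_target_group_arn
  }

  condition {
    host_header {
      values = [var.scoring_hostname]
    }
  }
}
```
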
Note: Some older OpenSpec pages describe stricter or different networking assumptions. The current Terraform code is the source of truth for what Atlas actually provisions today.