Network and security
Atlas uses one shared VPC per full environment root, split across two availability zones with explicit security group boundaries between the edge, workloads, database, and Kafka. The current exception is terraform/staging2, which reuses the VPC, subnets, and approved security groups already owned by terraform/staging instead of creating a second network foundation.
Network shape
| Concern | Current implementation |
|---|---|
| Availability zones | 2 dynamically selected AZs |
| Subnets | 2 public and 2 private subnets |
| Internet egress | 1 NAT gateway per public subnet |
| Private S3 path | S3 gateway VPC endpoint attached to private route tables |
| Flow logs | VPC Flow Logs to CloudWatch for rejected traffic in the committed roots |
staging2 reuses the same two public and two private subnets from staging. It does not create a second NAT, route-table set, or VPC endpoint path.
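The network shape in the table above could be expressed roughly as follows. This is a minimal sketch, not the Atlas module itself: every resource name, variable name, the log group name, and the retention period are assumptions made for illustration. Only the two-AZ selection, the S3 gateway endpoint on private route tables, and the rejected-traffic flow logs come from the table.

```hcl
# Hypothetical inputs; the real module's variable names may differ.
variable "vpc_id" { type = string }
variable "region" { type = string }
variable "private_route_table_ids" { type = list(string) }
variable "flow_log_role_arn" { type = string }

# Pick two AZs dynamically instead of hard-coding zone names.
data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  azs = slice(data.aws_availability_zones.available.names, 0, 2)
}

# Gateway endpoint so private subnets reach S3 without going through NAT.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/atlas/vpc/flow-logs" # assumed name
  retention_in_days = 30                     # assumed retention
}

# Capture only rejected traffic, delivered to CloudWatch Logs.
resource "aws_flow_log" "rejected" {
  vpc_id               = var.vpc_id
  traffic_type         = "REJECT"
  log_destination_type = "cloud-watch-logs"
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  iam_role_arn         = var.flow_log_role_arn
}
```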
Security group model
| Security group | Allows |
|---|---|
| alb-sg | inbound 80 and 443 from alb_ingress_cidrs, outbound unrestricted |
| ecs-sg | inbound 8080 only from alb-sg, outbound unrestricted |
| dashboard backend SG | inbound target port from alb-sg, outbound unrestricted; attached together with the shared ecs-sg for common internal egress paths such as MSK |
| scoring SG | inbound 8083 from alb-sg and dashboard backend SG, outbound unrestricted |
| Camunda SG | inbound 8080 only from scoring SG, outbound unrestricted |
| Valkey SG | inbound 6379 from the shared ecs-sg plus the dashboard backend, scoring, and Camunda SGs; outbound unrestricted |
| ClickHouse Prometheus agent SG | no inbound, outbound unrestricted for HTTPS egress to ClickHouse Cloud, New Relic, and AWS service endpoints through NAT |
| msk-connect-sg | no inbound, outbound unrestricted for connector workers |
| msk-sg | inbound 9098 from ecs-sg and msk-connect-sg, inbound 9198 from msk_public_access_cidrs |
| RDS SG | inbound 5432 from allowed CIDRs plus the dashboard backend or Camunda SG |
staging2 attaches its duplicated workloads to these same security-group IDs. That means no *-sg2 copies exist for the shared edge, ECS, MSK, service, cache, or database paths.
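The first two rows of the table chain one security group to another rather than to CIDR ranges. A minimal sketch of that pattern, assuming the standalone rule resources from the AWS provider, might look like this. The resource labels and the `vpc_id` variable are hypothetical, and only the HTTPS rule on alb-sg is shown.

```hcl
variable "vpc_id" { type = string }
variable "alb_ingress_cidrs" { type = list(string) }

resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "ecs" {
  name   = "ecs-sg"
  vpc_id = var.vpc_id
}

# Edge: one HTTPS rule per allowed CIDR (port 80 would follow the same shape).
resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  for_each          = toset(var.alb_ingress_cidrs)
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = each.value
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}

# Workloads: 8080 is reachable only from members of alb-sg, not from any CIDR.
resource "aws_vpc_security_group_ingress_rule" "ecs_from_alb" {
  security_group_id            = aws_security_group.ecs.id
  referenced_security_group_id = aws_security_group.alb.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}

# Unrestricted outbound, matching the "outbound unrestricted" column.
resource "aws_vpc_security_group_egress_rule" "ecs_all" {
  security_group_id = aws_security_group.ecs.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "-1"
}
```

Referencing a security group instead of a CIDR keeps the rule valid when task IP addresses change. It also fits staging2 attaching workloads to the same security-group IDs instead of creating copies.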
Ingress model
- The ALB is internet-facing and sits in public subnets.
- Port 80 redirects to 443.
- The default HTTPS listener action forwards to the events ingestion target group.
- Additional listener rules route the dashboard backend, scoring service, and Kafka UI by hostname.
- Internal dashboard-backend-to-scoring traffic uses ECS Service Connect with the scoring alias instead of the public scoring hostname.
- Internal scoring-to-Camunda traffic uses ECS Service Connect with the camunda client alias instead of the public ALB hostname.
- Internal ECS workloads reach Valkey through the dedicated cache security group on 6379.
- The Camunda Service Connect config sets per_request_timeout_seconds = 30 so the scoring worker long-poll request can exceed the AWS HTTP default safely.
- The ClickHouse Prometheus agent has no inbound path and reaches api.clickhouse.cloud plus New Relic remote write over outbound HTTPS through private-subnet NAT.
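The Camunda Service Connect settings described above might be declared along these lines. This is a sketch under assumptions: it presumes an AWS provider version that supports the `timeout` block, and the variable names, the `http` port name, the `discovery_name`, the desired count, and the Fargate launch type are all hypothetical. The camunda alias, port 8080, the 30-second per-request timeout, and `assign_public_ip = false` are taken from this page.

```hcl
variable "cluster_arn" { type = string }
variable "camunda_task_definition_arn" { type = string }
variable "service_connect_namespace_arn" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "camunda_sg_id" { type = string }

resource "aws_ecs_service" "camunda" {
  name            = "camunda"
  cluster         = var.cluster_arn
  task_definition = var.camunda_task_definition_arn
  desired_count   = 1         # assumed
  launch_type     = "FARGATE" # assumed

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [var.camunda_sg_id]
    assign_public_ip = false
  }

  service_connect_configuration {
    enabled   = true
    namespace = var.service_connect_namespace_arn

    service {
      # Must match a named port mapping in the task definition (name assumed).
      port_name      = "http"
      discovery_name = "camunda" # assumed

      # Clients in the namespace reach the service at camunda:8080.
      client_alias {
        dns_name = "camunda"
        port     = 8080
      }

      # Give long-poll requests more headroom than the default allows.
      timeout {
        per_request_timeout_seconds = 30
      }
    }
  }
}
```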
Current exposure notes
ALB exposure
Both the staging example values and the committed production values currently allow alb_ingress_cidrs = ["0.0.0.0/0"]. The module supports a tighter allow-list, but the current committed state is wide open at the edge.
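Tightening the edge is a matter of changing the input value. A hypothetical tfvars fragment is shown below. The CIDRs are RFC 5737 documentation ranges used purely as placeholders, not real Atlas addresses.

```hcl
# terraform.tfvars sketch: replace the open default with an explicit allow-list.
alb_ingress_cidrs = [
  "203.0.113.0/24",   # placeholder, e.g. an office egress range
  "198.51.100.16/28", # placeholder, e.g. a partner range
]
```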
MSK public access
The MSK module enables service-provided public broker EIPs when msk_enable_public_access = true. Separately, the VPC module opens inbound port 9198 on msk-sg to msk_public_access_cidrs, which controls which external client CIDRs can use the public IAM + TLS path.
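The two controls can be pictured as two independent pieces of configuration. The sketch below assumes hypothetical resource labels and a hypothetical `msk_sg_id` input. The variable names `msk_enable_public_access` and `msk_public_access_cidrs` and port 9198 come from this page.

```hcl
variable "msk_sg_id" { type = string }

variable "msk_enable_public_access" {
  type    = bool
  default = false
}

variable "msk_public_access_cidrs" {
  type    = list(string)
  default = []
}

locals {
  # Value that would feed the cluster's
  # broker_node_group_info.connectivity_info.public_access.type argument.
  msk_public_access_type = var.msk_enable_public_access ? "SERVICE_PROVIDED_EIPS" : "DISABLED"
}

# Network side: only the listed CIDRs can reach the public IAM + TLS port.
resource "aws_vpc_security_group_ingress_rule" "msk_public_iam" {
  for_each          = toset(var.msk_public_access_cidrs)
  security_group_id = var.msk_sg_id
  cidr_ipv4         = each.value
  ip_protocol       = "tcp"
  from_port         = 9198
  to_port           = 9198
  description       = "Public IAM + TLS broker access"
}
```

With an empty CIDR list no ingress rule is created, so public broker endpoints can exist while no external client CIDR is admitted on 9198.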
Dashboard database access
The staging example keeps RDS on a public subnet group with publicly_accessible = true and open CIDR defaults. Production committed values move the database to private subnets and disable public accessibility.
Operational implications
- ECS tasks stay in private subnets with assign_public_ip = false.
- The ElastiCache Valkey subnet group also stays in the private subnets and is not exposed publicly.
- Hostname routing is managed at the ALB listener level, not inside a separate ingress service.
- The dashboard backend joins the same Service Connect namespace as a client, but remains publicly reachable only through the ALB route.
- Security hardening happens primarily through input values, not by changing the root module graph.
- terraform/staging must remain the owner of the shared staging foundation outputs that terraform/staging2 consumes.
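This page does not say how terraform/staging2 reads the shared foundation outputs. One common mechanism is a remote state data source, sketched below as an assumption. The backend type, bucket, key, region, and output names are all hypothetical.

```hcl
# Read outputs published by the terraform/staging root (backend details assumed).
data "terraform_remote_state" "staging" {
  backend = "s3"

  config = {
    bucket = "example-atlas-tfstate"     # hypothetical
    key    = "staging/terraform.tfstate" # hypothetical
    region = "us-east-1"                 # hypothetical
  }
}

locals {
  # Hypothetical output names; staging must keep exporting whatever staging2 reads.
  vpc_id             = data.terraform_remote_state.staging.outputs.vpc_id
  private_subnet_ids = data.terraform_remote_state.staging.outputs.private_subnet_ids
  ecs_sg_id          = data.terraform_remote_state.staging.outputs.ecs_sg_id
}
```

Whatever the mechanism, removing or renaming an output in terraform/staging would break the consuming root, which is why ownership needs to stay in one place.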
Some older OpenSpec pages describe stricter or different networking assumptions. The current Terraform code is the source of truth for what Atlas actually provisions today.