Architecture decisions

These are the decisions that still matter when operating or extending the current Atlas stack.

One shared ALB instead of API Gateway

Atlas currently uses an internet-facing ALB as the single edge for the environment. The archive shows an earlier API Gateway plus VPC Link design, but the shipped implementation moved to the simpler and lower-cost shared ALB model.

Infrastructure-only repository scope

The repository owns shared infrastructure and runtime wiring, not application source. The events ingestion API, dashboard backend, and scoring service are expected to publish their own images from separate repositories. Camunda currently uses the upstream image directly.

Two roots, one module graph

terraform/staging and terraform/prod deliberately share the same root structure and shared modules. Environment differences should come from values, not from divergent infrastructure definitions.
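
One way to keep this rule honest is to compare the module calls declared in each root. The following is a minimal sketch, not part of the repository: the module names and the inline HCL strings are hypothetical stand-ins for the real root files, and the regex only recognizes simple `module "name" {` block openers.

```python
import re

# Matches the opening line of a Terraform module call, e.g. `module "network" {`.
MODULE_CALL = re.compile(r'^\s*module\s+"([^"]+)"\s*\{', re.MULTILINE)


def module_calls(hcl_text: str) -> set[str]:
    """Return the names of the module blocks declared in a chunk of HCL text."""
    return set(MODULE_CALL.findall(hcl_text))


def module_drift(staging_hcl: str, prod_hcl: str) -> dict[str, set[str]]:
    """Report module calls that exist in only one of the two roots."""
    staging = module_calls(staging_hcl)
    prod = module_calls(prod_hcl)
    return {"only_in_staging": staging - prod, "only_in_prod": prod - staging}


# Hypothetical root contents; the real roots live under terraform/staging and terraform/prod.
STAGING_ROOT = '''
module "network" {
  source = "../modules/network"
}
module "ecs_cluster" {
  source = "../modules/ecs_cluster"
}
'''

PROD_ROOT = '''
module "network" {
  source = "../modules/network"
}
module "ecs_cluster" {
  source = "../modules/ecs_cluster"
}
'''

drift = module_drift(STAGING_ROOT, PROD_ROOT)
```

An empty result on both sides suggests the roots still share one module graph, so any remaining differences should come from variable values.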

Shared ECS cluster, separate services

Atlas keeps one shared ECS cluster for the current workloads, while isolating the events ingestion API, dashboard backend, scoring service, Camunda, and Kafka UI into separate ECS services and task definitions.

Service Connect for internal scoring traffic

Atlas uses the public ALB only for edge traffic. The scoring service is not exposed on the public edge; internal callers use ECS Service Connect with the scoring alias, and scoring itself reaches Camunda with the camunda alias so east-west traffic stays inside the cluster.
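
From a caller's point of view, the Service Connect alias behaves like a short hostname. The sketch below shows how an internal client might build its target URLs. The aliases `scoring` and `camunda` come from the text above; the ports, the plain-HTTP scheme, and the example path are assumptions made for illustration only.

```python
from urllib.parse import urlunsplit

# Aliases named in this document.
SCORING_ALIAS = "scoring"
CAMUNDA_ALIAS = "camunda"

# Assumed ports; the real values depend on the Service Connect port mappings.
ASSUMED_PORTS = {SCORING_ALIAS: 8080, CAMUNDA_ALIAS: 8080}


def internal_url(alias: str, path: str = "/") -> str:
    """Build a cluster-internal URL from a Service Connect alias."""
    port = ASSUMED_PORTS[alias]
    return urlunsplit(("http", f"{alias}:{port}", path, "", ""))


# Hypothetical path, used only to show the shape of the result.
scoring_url = internal_url(SCORING_ALIAS, "/score")
```

Because the hostname is just the alias, nothing in the caller's configuration refers to the public ALB.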

MSK with IAM + TLS

Kafka access is standardized on Amazon MSK IAM authentication with TLS. Internal workloads use the internal bootstrap brokers, and the implementation also exposes a public IAM + TLS path where needed.
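
For a Java-based Kafka client, IAM + TLS usually comes down to a small set of client properties provided by the aws-msk-iam-auth library. The sketch below assembles them as a plain dictionary and checks that a bootstrap string uses the expected IAM port (9098 inside the VPC, 9198 for public access). The broker hostnames are made up, and how each Atlas workload actually loads these settings is not specified here.

```python
IAM_PORT_INTERNAL = 9098  # MSK IAM listener for clients inside the VPC
IAM_PORT_PUBLIC = 9198    # MSK IAM listener when public access is enabled


def msk_iam_client_properties(bootstrap_brokers: str) -> dict[str, str]:
    """Kafka client properties for MSK IAM authentication over TLS."""
    return {
        "bootstrap.servers": bootstrap_brokers,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "AWS_MSK_IAM",
        "sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;",
        "sasl.client.callback.handler.class": (
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler"
        ),
    }


def uses_expected_port(bootstrap_brokers: str, public: bool) -> bool:
    """Check that every broker in the list uses the IAM port for the chosen path."""
    port = str(IAM_PORT_PUBLIC if public else IAM_PORT_INTERNAL)
    brokers = [b.strip() for b in bootstrap_brokers.split(",")]
    return all(b.rsplit(":", 1)[-1] == port for b in brokers)


# Hypothetical broker string, for illustration only.
INTERNAL_BROKERS = "b-1.example.internal:9098,b-2.example.internal:9098"
props = msk_iam_client_properties(INTERNAL_BROKERS)
```

A port check like this can catch the common mistake of pointing an internal workload at the public bootstrap string, or the reverse.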

Secrets are provisioned by Terraform; values are populated operationally

Terraform creates the secret resources and placeholder document shape. Operators update the real values after apply so sensitive runtime configuration stays out of Git and out of long-lived Terraform diffs.
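
A small check can confirm that an operator has replaced every placeholder after apply. This is a sketch under stated assumptions: the `REPLACE_ME` marker and the key names are hypothetical, and the real placeholder document shape is whatever the Terraform modules define.

```python
import json

# Hypothetical marker; the actual placeholder value is defined by the Terraform modules.
PLACEHOLDER = "REPLACE_ME"


def unpopulated_keys(secret_string: str) -> list[str]:
    """Return the keys of a JSON secret document that still hold the placeholder."""
    document = json.loads(secret_string)
    return sorted(key for key, value in document.items() if value == PLACEHOLDER)


# Shape as Terraform might seed it (hypothetical keys).
seeded = json.dumps({"DB_PASSWORD": PLACEHOLDER, "API_KEY": PLACEHOLDER})

# The same document after an operator has filled in one value.
partially_filled = json.dumps({"DB_PASSWORD": "example-value", "API_KEY": PLACEHOLDER})

remaining = unpopulated_keys(partially_filled)
```

Reading the secret string itself (for example from Secrets Manager) is left out on purpose, so the sketch carries no credentials or account-specific names.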

AWS-native observability first, but cost-aware by default

The current roots rely on CloudWatch Logs, standard ECS service metrics, selective MSK enhanced monitoring, CloudWatch alarms, SNS, and AWS Budgets rather than introducing a separate observability stack. Atlas keeps higher-cost options such as ECS Container Insights disabled by default and enables them only when a concrete troubleshooting need justifies the cost.

Staging provisions environment automation but keeps Scheduler suspended by default

terraform/staging provisions the environment start and stop automation and the manual Kafka cleanup Lambda. It also contains test-only, SNS-driven DevOps Agent investigation wiring from an earlier proof of concept, but that path is not treated as implemented operations. Kafka topic cleanup remains manual because deleting and recreating Kafka topics can take long enough that a scheduled cleanup was not a good cost fit. The committed Terraform keeps the start and stop Scheduler entries disabled until they are explicitly re-enabled.

Tip: Use Notable changes when you need the short history of how Atlas arrived at the current shape.