Monitoring and alerting

Atlas uses AWS-native observability in both environment roots and layers New Relic dashboards on top of samples imported by the New Relic AWS pull integration. The current implementation emphasizes fast signal capture over deep customization, with one shared operator SNS topic plus an optional dedicated Slack SNS topic.

Signals by layer

Layer	Signal	Current implementation
ECS	task health, CPU, memory	CloudWatch ECS service metrics on the shared ECS cluster; Container Insights is enabled in the active staging and production values so running-task alarms can use `ECS/ContainerInsights`
RDS	storage pressure, CPU pressure, active session load, and connection visibility	CloudWatch metrics and dedicated alarms for the dashboard and Camunda databases
MSK	broker health, disk usage, CPU, memory, and replication	enhanced monitoring set through `msk_cloudwatch_enhanced_monitoring` plus broker alarms and the `msk_under_replicated` alarm
Logs	workload, connector, and network logs	CloudWatch log groups with environment-specific retention; VPC Flow Logs capture rejected traffic only in the committed roots
Dashboard	cross-service runtime view	CloudWatch dashboard per root with alarm, MSK, ECS, and RDS sections
Cost	monthly spend thresholds	AWS Budget notifications to `owner_email`
New Relic	account-level AWS service metrics and dashboards	pull integration for ALB, ECS, RDS, MSK, S3, and VPC; dashboards for non-events ECS services, RDS, MSK, events ingestion, and ClickHouse Cloud
ClickHouse Cloud	service metrics, query activity, capacity signals, and scrape health	ECS Prometheus agent scraping `api.clickhouse.cloud` with `filtered_metrics=false` and remote-writing to New Relic when enabled in a root

Provisioned alarms

Alarm	What it means
`<service-name>-high-cpu`	an ECS service stays above the configured `CPUUtilization` threshold
`<service-name>-high-memory`	an ECS service stays above the configured `MemoryUtilization` threshold
`<service-name>-running-tasks-below-desired`	an ECS service is running below the desired count committed in Terraform
`<service-name>-unhealthy-targets`	an ALB-backed ECS service has unhealthy targets in its target group
`<db-instance-identifier>-high-cpu`	the dashboard or Camunda RDS instance stays above `80%` CPU for `5` minutes
`<db-instance-identifier>-high-db-load`	the dashboard or Camunda RDS instance stays above `4` average active sessions (`DBLoad`) for `5` minutes
`<db-instance-identifier>-low-free-storage`	the dashboard or Camunda RDS instance drops below the configured `FreeStorageSpace` threshold
`msk_under_replicated`	the Kafka cluster has under-replicated partitions
`<prefix>-msk-broker-<id>-disk-usage-critical`	a broker is above the critical threshold for native `KafkaDataLogsDiskUsed`
`<prefix>-msk-broker-<id>-high-cpu-user`	a broker is above the configured native `CpuUser` threshold
`<prefix>-msk-broker-<id>-low-memory-available`	a broker is below the configured estimated available-memory threshold, calculated from native `MemoryFree + MemoryCached + MemoryBuffered`
`<prefix>-msk-broker-<id>-swap-used`	a broker reports native `SwapUsed` greater than or equal to the configured threshold; the default alerts on any swap usage
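
The estimated available-memory figure used by the `low-memory-available` alarm is a sum of three native broker metrics. A minimal sketch of that arithmetic, assuming all three metrics are reported in bytes; the function names, sample datapoints, and threshold are hypothetical and the real threshold comes from the root's configuration:

```python
GIB = 1024 ** 3


def estimated_available_memory(memory_free, memory_cached, memory_buffered):
    """Sum the three native broker metrics named in the alarm description."""
    return memory_free + memory_cached + memory_buffered


def is_low_memory(memory_free, memory_cached, memory_buffered, threshold_bytes):
    """Return True when the estimate is below the configured threshold."""
    estimate = estimated_available_memory(memory_free, memory_cached, memory_buffered)
    return estimate < threshold_bytes


# Hypothetical datapoints for one broker and a hypothetical 1 GiB threshold.
sample = {
    "memory_free": 0.25 * GIB,
    "memory_cached": 0.5 * GIB,
    "memory_buffered": 0.125 * GIB,
}

if __name__ == "__main__":
    print(is_low_memory(threshold_bytes=1 * GIB, **sample))
```

Counting cached and buffered pages as available reflects that the kernel can reclaim them, so `MemoryFree` alone would overstate memory pressure.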

All alarms publish to the shared operator SNS topic, which is named from the root prefix. The operator email configured in `owner_email` is subscribed to that topic.

The environment-operations module still contains `alarm_investigation` test wiring that was used to validate the AWS DevOps Agent generic webhook shape from SNS alarm notifications. Do not treat that path as implemented operational alerting. The supported alert delivery paths are the shared operator SNS topic with email subscription and, when enabled, the dedicated Slack SNS topic plus Slack notifier Lambda.

When `monitoring_slack_notifications_enabled = true`, the roots also create a dedicated `poc_alerts_slack` SNS topic and a separate Slack notifier Lambda. That path is isolated from environment-operations. It forwards CloudWatch alarm notifications to the configured Slack incoming webhook in a readable layout that shows state, region, resource, description, metric summary, and a direct CloudWatch link.
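
To make the Slack layout concrete, here is a minimal sketch of how a notifier could turn an SNS-delivered CloudWatch alarm into a Slack incoming-webhook payload. It is not the Lambda's actual code: the function names, the sample event, and the console link format are assumptions, and the sketch only builds the payload without posting it.

```python
import json
from urllib.parse import quote


def region_from_alarm_arn(alarm_arn):
    """Extract the region from arn:aws:cloudwatch:<region>:<account>:alarm:<name>."""
    return alarm_arn.split(":")[3]


def build_slack_payload(sns_event):
    """Build a Slack webhook payload from an SNS event carrying a CloudWatch alarm."""
    alarm = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    trigger = alarm.get("Trigger", {})
    region = region_from_alarm_arn(alarm["AlarmArn"])
    name = alarm["AlarmName"]

    # Resource line: the metric dimensions, for example "ServiceName=api".
    resource = ", ".join(
        f"{d.get('name')}={d.get('value')}" for d in trigger.get("Dimensions", [])
    ) or "n/a"
    metric = (
        f"{trigger.get('Namespace')}/{trigger.get('MetricName')} "
        f"{trigger.get('ComparisonOperator')} {trigger.get('Threshold')}"
    )
    # Assumed console deep-link format; adjust if your console URLs differ.
    link = (
        "https://console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#alarmsV2:alarm/{quote(name, safe='')}"
    )
    lines = [
        f"*{name}*",
        f"State: {alarm.get('NewStateValue')}",
        f"Region: {region}",
        f"Resource: {resource}",
        f"Description: {alarm.get('AlarmDescription') or 'n/a'}",
        f"Metric: {metric}",
        f"<{link}|Open in CloudWatch>",
    ]
    return {"text": "\n".join(lines)}


# Hypothetical alarm notification, shaped like a CloudWatch alarm SNS message.
SAMPLE_EVENT = {"Records": [{"Sns": {"Message": json.dumps({
    "AlarmName": "api-high-cpu",
    "AlarmDescription": "CPU above threshold",
    "NewStateValue": "ALARM",
    "AlarmArn": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:api-high-cpu",
    "Trigger": {
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/ECS",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 80.0,
        "Dimensions": [{"name": "ServiceName", "value": "api"}],
    },
})}}]}

if __name__ == "__main__":
    print(build_slack_payload(SAMPLE_EVENT)["text"])
```

Reading the region from the alarm ARN avoids depending on the human-readable `Region` field in the alarm message, which is a display name rather than a region code.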

The CloudWatch dashboard created in each root is organized into these sections:

- alarm status across ECS, MSK, and RDS
- MSK replication, disk usage, native CPU metrics, estimated available memory, swap usage, and throughput
- ECS CPU, memory, running task count, and unhealthy ALB targets
- RDS free storage in GiB, CPU utilization, `DBLoad`, and `DatabaseConnections`

The New Relic ClickHouse Cloud dashboard is created when `clickhouse_prometheus_agent_enabled = true`. It is scoped by the `prometheus_server` attribute generated from the root prefix and includes service info, query activity, capacity-oriented panels, metric discovery, and Prometheus scrape health. It intentionally does not create New Relic alert conditions or Slack workflows yet.

The New Relic AWS-metric dashboards are controlled separately from the events ingestion sidecar dashboard:

- `ecs_service_newrelic_dashboards_enabled` creates one dashboard each for dashboard backend, scoring, Camunda, and Kafka UI. These use `ComputeSample` for ECS service CPU, memory, and task count, plus `LoadBalancerSample` for ALB target metrics when a target group exists.
- `dashboard_backend_newrelic_apm_entity_guid` is optional. When populated, the dashboard backend dashboard adds APM throughput and error widgets without changing the CloudWatch-based panels.
- `rds_newrelic_dashboard_enabled` creates one dashboard for the dashboard and Camunda PostgreSQL instances, including CPU, connections, memory, free storage, `DBLoad`, IOPS, throughput, latency, disk queue depth, deadlocks, burst balance, and CPU credits.
- `msk_newrelic_dashboard_enabled` creates one dashboard for the MSK cluster, including replication health, offline partitions, broker disk, CPU, memory, swap, throughput, throttle pressure, produce/fetch latency, partitions, and leaders.

After apply, retrieve the dashboard permalinks with `terraform output ecs_service_newrelic_dashboard_permalinks`, `terraform output -raw rds_newrelic_dashboard_permalink`, and `terraform output -raw msk_newrelic_dashboard_permalink`, then open them in a browser.
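
If you prefer to collect all permalinks in one step, the following sketch parses the result of `terraform output -json`. It assumes `ecs_service_newrelic_dashboard_permalinks` is a map keyed by service name and that the other two outputs are plain strings; the helper names and the sample URLs are hypothetical.

```python
import json
import subprocess

PERMALINK_OUTPUTS = (
    "ecs_service_newrelic_dashboard_permalinks",
    "rds_newrelic_dashboard_permalink",
    "msk_newrelic_dashboard_permalink",
)


def extract_permalinks(outputs_json):
    """Flatten the dashboard permalink outputs into a {label: url} dict."""
    outputs = json.loads(outputs_json)
    links = {}
    for name in PERMALINK_OUTPUTS:
        if name not in outputs:
            continue  # output absent from this root's state
        value = outputs[name]["value"]
        if isinstance(value, dict):
            # Map-valued output: one entry per key.
            for key, url in value.items():
                links[f"{name}[{key}]"] = url
        else:
            links[name] = value
    return links


def read_terraform_outputs(root_dir):
    """Run `terraform output -json` in a root directory and return the raw JSON."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=root_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout


# Hypothetical output document in the shape `terraform output -json` produces.
SAMPLE_OUTPUTS = json.dumps({
    "ecs_service_newrelic_dashboard_permalinks": {
        "value": {"scoring": "https://example.invalid/scoring"},
    },
    "rds_newrelic_dashboard_permalink": {"value": "https://example.invalid/rds"},
})
```

Calling `extract_permalinks(read_terraform_outputs("<root-dir>"))` from a machine with Terraform configured for the root would return the labels and URLs in one dictionary.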

ClickHouse Cloud Prometheus agent

The ClickHouse Prometheus agent runs as a dedicated ECS service in private subnets when enabled in a root. It has no ALB, no Service Connect listener, no public IP, and no inbound security-group rules. Outbound HTTPS uses the existing NAT path to reach ClickHouse Cloud and New Relic.

Before expecting the ECS service to report healthy ingestion:

1. In the secret whose name is returned by `terraform output -raw clickhouse_prometheus_agent_secret_name`, replace the placeholder values for `CLICKHOUSE_ORG_ID`, `CLICKHOUSE_SERVICE_ID`, `CLICKHOUSE_API_KEY_ID`, `CLICKHOUSE_API_KEY_SECRET`, and `NEW_RELIC_LICENSE_KEY` with real values.
2. Check the `/ecs/<prefix>-clickhouse-prometheus-agent` log group for ClickHouse authentication errors, scrape failures, or New Relic remote-write errors.
3. Open the dashboard URL returned by `terraform output -raw clickhouse_cloud_newrelic_dashboard_permalink` and validate live metric names before designing alerts.
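
A quick pre-flight check on the secret can catch leftover placeholders before the collector starts. The sketch below assumes the secret value is a JSON object keyed by the five names above; the placeholder markers and the function name are hypothetical, so match them to whatever placeholder text the secret actually contains.

```python
import json

REQUIRED_KEYS = (
    "CLICKHOUSE_ORG_ID",
    "CLICKHOUSE_SERVICE_ID",
    "CLICKHOUSE_API_KEY_ID",
    "CLICKHOUSE_API_KEY_SECRET",
    "NEW_RELIC_LICENSE_KEY",
)

# Hypothetical markers; the real placeholder text is not defined here.
PLACEHOLDER_MARKERS = ("REPLACE_ME", "CHANGEME", "PLACEHOLDER")


def find_secret_problems(secret_string):
    """Return a list of problems found in the secret's JSON value."""
    values = json.loads(secret_string)
    problems = []
    for key in REQUIRED_KEYS:
        raw = values.get(key)
        value = "" if raw is None else str(raw).strip()
        if not value:
            problems.append(f"{key}: missing or empty")
        elif any(marker in value.upper() for marker in PLACEHOLDER_MARKERS):
            problems.append(f"{key}: still a placeholder")
    return problems
```

An empty list means every required key is present and none matches a known placeholder marker; it does not prove the credentials are valid, which is what the log check in step 2 is for.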

Log groups to expect

- `/ecs/<events-service-name>`
- `/ecs/<events-service-name>/newrelic-infra` when the events New Relic sidecar is enabled
- `/ecs/<dashboard-service-name>`
- `/ecs/<kafka-ui-service-name>`
- `/ecs/<prefix>-clickhouse-prometheus-agent` when the ClickHouse Prometheus agent is enabled
- `/msk-connect/<connector-name>` when the sink is enabled
- `/vpc/flow-logs` for rejected traffic only in the committed roots
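
The naming rules above can be turned into a small checklist generator. This is a sketch only: the function and parameter names are hypothetical, and the service and connector names must come from your own root.

```python
def expected_log_groups(prefix, events_service, dashboard_service, kafka_ui_service,
                        connector_name=None, events_newrelic_sidecar=False,
                        clickhouse_agent=False):
    """Build the list of log groups to expect for one root."""
    groups = [f"/ecs/{events_service}"]
    if events_newrelic_sidecar:
        groups.append(f"/ecs/{events_service}/newrelic-infra")
    groups += [f"/ecs/{dashboard_service}", f"/ecs/{kafka_ui_service}"]
    if clickhouse_agent:
        groups.append(f"/ecs/{prefix}-clickhouse-prometheus-agent")
    if connector_name:
        # Only present when the sink connector is enabled.
        groups.append(f"/msk-connect/{connector_name}")
    groups.append("/vpc/flow-logs")
    return groups


def missing_log_groups(expected, existing):
    """Return expected log groups that are not in the existing list, in order."""
    existing_set = set(existing)
    return [group for group in expected if group not in existing_set]
```

Comparing the result against the log group names that actually exist in the account gives a quick view of which optional components are not yet writing logs.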

Environment differences

- The staging example values keep most log retention at 1 day for cost control.
- The prod committed values keep application, connector, environment-operations, and flow-log retention at 3 days.
- Both roots keep the same alarm types, dashboard layout model, and budget threshold model.
- DevOps Agent investigation creation was only tested and is not an active runbook path in either environment.
- Staging currently enables the dedicated Slack delivery path through committed values; prod keeps that path disabled until a real webhook is injected outside Git.
- Both committed roots currently enable the New Relic AWS pull integration and New Relic dashboards for ECS, RDS, MSK, events ingestion, and ClickHouse Cloud.
- Both committed roots currently enable the ClickHouse Cloud Prometheus agent resources and dashboard; the collector service desired count is controlled separately and is currently 0 in the committed values.

Note: The current CloudWatch alarms cover AWS resource availability and pressure only. ClickHouse Cloud metrics are ingested into New Relic for dashboarding first; Slack alerts should be added later with New Relic NRQL conditions after live metrics and thresholds are validated.
