Monitoring and alerting
Atlas uses AWS-native observability in both environment roots and layers New Relic dashboards on top of samples imported by the New Relic AWS pull integration. The current implementation emphasizes fast signal capture over deep customization, with one shared operator SNS topic plus an optional dedicated Slack SNS topic.
Signals by layer
| Layer | Signal | Current implementation |
|---|---|---|
| ECS | task health, CPU, memory | CloudWatch ECS service metrics on the shared ECS cluster; Container Insights is enabled in the active staging and production values so running-task alarms can use ECS/ContainerInsights |
| RDS | storage pressure, CPU pressure, active session load, and connection visibility | CloudWatch metrics and dedicated alarms for the dashboard and Camunda databases |
| MSK | broker health, disk usage, CPU, memory, and replication | enhanced monitoring set through msk_cloudwatch_enhanced_monitoring plus broker alarms and the msk_under_replicated alarm |
| Logs | workload, connector, and network logs | CloudWatch log groups with environment-specific retention; VPC Flow Logs capture rejected traffic only in the committed roots |
| Dashboard | cross-service runtime view | CloudWatch dashboard per root with alarm, MSK, ECS, and RDS sections |
| Cost | monthly spend thresholds | AWS Budget notifications to owner_email |
| New Relic | account-level AWS service metrics and dashboards | pull integration for ALB, ECS, RDS, MSK, S3, and VPC; dashboards for non-events ECS services, RDS, MSK, events ingestion, and ClickHouse Cloud |
| ClickHouse Cloud | service metrics, query activity, capacity signals, and scrape health | ECS Prometheus agent scraping api.clickhouse.cloud with filtered_metrics=false and remote-writing to New Relic when enabled in a root |
Provisioned alarms
| Alarm | What it means |
|---|---|
| <service-name>-high-cpu | an ECS service stays above the configured CPUUtilization threshold |
| <service-name>-high-memory | an ECS service stays above the configured MemoryUtilization threshold |
| <service-name>-running-tasks-below-desired | an ECS service is running below the desired count committed in Terraform |
| <service-name>-unhealthy-targets | an ALB-backed ECS service has unhealthy targets in its target group |
| <db-instance-identifier>-high-cpu | the dashboard or Camunda RDS instance stays above 80% CPU for 5 minutes |
| <db-instance-identifier>-high-db-load | the dashboard or Camunda RDS instance stays above 4 average active sessions (DBLoad) for 5 minutes |
| <db-instance-identifier>-low-free-storage | the dashboard or Camunda RDS instance drops below the configured FreeStorageSpace threshold |
| msk_under_replicated | the Kafka cluster has under-replicated partitions |
| <prefix>-msk-broker-<id>-disk-usage-critical | a broker is above the critical threshold for native KafkaDataLogsDiskUsed |
| <prefix>-msk-broker-<id>-high-cpu-user | a broker is above the configured native CpuUser threshold |
| <prefix>-msk-broker-<id>-low-memory-available | a broker is below the configured estimated available-memory threshold, calculated from native MemoryFree + MemoryCached + MemoryBuffered |
| <prefix>-msk-broker-<id>-swap-used | a broker reports native SwapUsed greater than or equal to the configured threshold; the default alerts on any swap usage |
All alarms publish to the shared operator SNS topic, which is named from the root prefix. The operator email configured in owner_email is subscribed to that topic.
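The estimated available-memory value behind the low-memory-available broker alarm is a plain sum of three native MSK metrics, all reported in bytes. The following is a minimal sketch of that arithmetic, not the alarm definition itself: the function names, the sample values, and the strict "less than" comparison are assumptions made for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimated_available_memory(memory_free: float,
                               memory_cached: float,
                               memory_buffered: float) -> float:
    """Sum the native broker metrics MemoryFree, MemoryCached, and MemoryBuffered (bytes)."""
    return memory_free + memory_cached + memory_buffered


def is_low_memory(memory_free: float,
                  memory_cached: float,
                  memory_buffered: float,
                  threshold_bytes: float) -> bool:
    """Return True when the estimate is below the threshold (comparison operator assumed)."""
    estimate = estimated_available_memory(memory_free, memory_cached, memory_buffered)
    return estimate < threshold_bytes


# Hypothetical broker sample: 0.5 GiB free, 1 GiB cached, 0.25 GiB buffered.
estimate_gib = estimated_available_memory(0.5 * GIB, 1.0 * GIB, 0.25 * GIB) / GIB
print(f"estimated available memory: {estimate_gib:.2f} GiB")
```

Cached and buffered pages are counted because the kernel can usually reclaim them, so MemoryFree alone tends to understate what a broker can actually use.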
The environment-operations module still contains alarm_investigation test wiring that was used to validate the AWS DevOps Agent generic webhook shape from SNS alarm notifications. Do not treat that path as implemented operational alerting. The supported alert delivery paths are the shared operator SNS topic with email subscription and, when enabled, the dedicated Slack SNS topic plus Slack notifier Lambda.
When monitoring_slack_notifications_enabled = true, the roots also create a dedicated poc_alerts_slack SNS topic and a separate Slack notifier Lambda. That path is isolated from environment-operations and forwards CloudWatch alarm notifications to the configured Slack incoming webhook using a readable Slack layout with state, region, resource, description, metric summary, and a direct CloudWatch link.
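To make the Slack layout described above concrete, here is a rough sketch of how a notifier could turn a CloudWatch alarm notification into an incoming-webhook payload. It is not the Lambda shipped in the roots: the function name, the message wording, and the console URL shape are assumptions, and the input fields follow the JSON that CloudWatch alarms normally publish to SNS.

```python
import json
from urllib.parse import quote


def build_slack_payload(sns_message: str) -> dict:
    """Convert the SNS "Message" string of a CloudWatch alarm into a Slack payload."""
    alarm = json.loads(sns_message)
    trigger = alarm.get("Trigger", {})

    # Alarm ARNs look like arn:aws:cloudwatch:<region>:<account>:alarm:<name>,
    # so the fourth colon-separated field is the region code.
    region = alarm["AlarmArn"].split(":")[3]

    # Dimensions identify the resource, for example the MSK broker or ECS service.
    resource = ", ".join(
        f"{d.get('name')}={d.get('value')}" for d in trigger.get("Dimensions", [])
    ) or "n/a"

    metric_summary = (
        f"{trigger.get('Namespace')}/{trigger.get('MetricName')} "
        f"{trigger.get('ComparisonOperator')} {trigger.get('Threshold')}"
    )

    # Assumed console deep-link format; adjust if the console URL scheme differs.
    link = (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#alarmsV2:alarm/{quote(alarm['AlarmName'], safe='')}"
    )

    lines = [
        f"*{alarm['AlarmName']}* is *{alarm['NewStateValue']}*",
        f"Region: {region}",
        f"Resource: {resource}",
        f"Description: {alarm.get('AlarmDescription') or 'n/a'}",
        f"Metric: {metric_summary}",
        f"<{link}|Open in CloudWatch>",
    ]
    return {"text": "\n".join(lines)}


# Hypothetical alarm notification used only to exercise the formatter.
sample_message = json.dumps({
    "AlarmName": "demo-msk-broker-1-swap-used",
    "AlarmDescription": "Broker 1 is using swap",
    "AlarmArn": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:demo-msk-broker-1-swap-used",
    "NewStateValue": "ALARM",
    "Trigger": {
        "MetricName": "SwapUsed",
        "Namespace": "AWS/Kafka",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1.0,
        "Dimensions": [{"name": "Broker ID", "value": "1"}],
    },
})
payload = build_slack_payload(sample_message)
print(payload["text"])
```

The returned dictionary would be sent as the JSON body of a POST to the configured incoming webhook. Sending is left out so the sketch runs without network access.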
The CloudWatch dashboard created in each root organizes:
- alarm status across ECS, MSK, and RDS
- MSK replication, disk usage, native CPU metrics, estimated available memory, swap usage, and throughput
- ECS CPU, memory, running task count, and unhealthy ALB targets
- RDS free storage in GiB, CPU utilization, DBLoad, and DatabaseConnections
The New Relic ClickHouse Cloud dashboard is created when clickhouse_prometheus_agent_enabled = true. It is scoped by the prometheus_server attribute generated from the root prefix and includes service info, query activity, capacity-oriented panels, metric discovery, and Prometheus scrape health. It intentionally does not create New Relic alert conditions or Slack workflows yet.
The New Relic AWS-metric dashboards are controlled separately from the events ingestion sidecar dashboard:
- ecs_service_newrelic_dashboards_enabled creates one dashboard each for dashboard backend, scoring, Camunda, and Kafka UI. These use ComputeSample for ECS service CPU, memory, and task count, plus LoadBalancerSample for ALB target metrics when a target group exists.
- dashboard_backend_newrelic_apm_entity_guid is optional. When populated, the dashboard backend dashboard adds APM throughput and error widgets without changing the CloudWatch-based panels.
- rds_newrelic_dashboard_enabled creates one dashboard for the dashboard and Camunda PostgreSQL instances, including CPU, connections, memory, free storage, DBLoad, IOPS, throughput, latency, disk queue depth, deadlocks, burst balance, and CPU credits.
- msk_newrelic_dashboard_enabled creates one dashboard for the MSK cluster, including replication health, offline partitions, broker disk, CPU, memory, swap, throughput, throttle pressure, produce/fetch latency, partitions, and leaders.
After apply, run terraform output ecs_service_newrelic_dashboard_permalinks, terraform output -raw rds_newrelic_dashboard_permalink, and terraform output -raw msk_newrelic_dashboard_permalink to get the links to these dashboards.
ClickHouse Cloud Prometheus agent
The ClickHouse Prometheus agent runs as a dedicated ECS service in private subnets when enabled in a root. It has no ALB, no Service Connect listener, no public IP, and no inbound security-group rules. Outbound HTTPS uses the existing NAT path to reach ClickHouse Cloud and New Relic.
Before expecting the ECS service to report healthy ingestion:
- Replace the placeholder values in the secret named by terraform output -raw clickhouse_prometheus_agent_secret_name with real values for CLICKHOUSE_ORG_ID, CLICKHOUSE_SERVICE_ID, CLICKHOUSE_API_KEY_ID, CLICKHOUSE_API_KEY_SECRET, and NEW_RELIC_LICENSE_KEY.
- Check the /ecs/<prefix>-clickhouse-prometheus-agent log group for ClickHouse authentication errors, scrape failures, or New Relic remote-write errors.
- Open the dashboard link returned by terraform output -raw clickhouse_cloud_newrelic_dashboard_permalink and validate live metric names before designing alerts.
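The first check above, that no placeholder is left in the agent secret, is easy to script before scaling the service up. The sketch below assumes the secret is a JSON object keyed by the five names listed above. The placeholder markers and the function name are hypothetical, so adjust them to whatever the roots actually seed.

```python
import json
from typing import List

REQUIRED_KEYS = (
    "CLICKHOUSE_ORG_ID",
    "CLICKHOUSE_SERVICE_ID",
    "CLICKHOUSE_API_KEY_ID",
    "CLICKHOUSE_API_KEY_SECRET",
    "NEW_RELIC_LICENSE_KEY",
)

# Hypothetical markers; the real placeholder text in the secret may differ.
PLACEHOLDER_MARKERS = ("placeholder", "replace", "changeme")


def unresolved_keys(secret_string: str) -> List[str]:
    """Return the required keys that are missing, empty, or still look like placeholders."""
    secret = json.loads(secret_string)
    problems = []
    for key in REQUIRED_KEYS:
        value = str(secret.get(key, "")).strip()
        if not value or any(marker in value.lower() for marker in PLACEHOLDER_MARKERS):
            problems.append(key)
    return problems


# Hypothetical secret content with one value still unset.
example = {key: "abc123" for key in REQUIRED_KEYS}
example["NEW_RELIC_LICENSE_KEY"] = "REPLACE_ME"
print(unresolved_keys(json.dumps(example)))
```

The input would be the secret string fetched from the secret store. Fetching is omitted here so the check can run offline against a copy.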
Log groups to expect
- /ecs/<events-service-name>
- /ecs/<events-service-name>/newrelic-infra when the events New Relic sidecar is enabled
- /ecs/<dashboard-service-name>
- /ecs/<kafka-ui-service-name>
- /ecs/<prefix>-clickhouse-prometheus-agent when the ClickHouse Prometheus agent is enabled
- /msk-connect/<connector-name> when the sink is enabled
- /vpc/flow-logs for rejected traffic only in the committed roots
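When checking a root after apply, it can help to generate the expected log group names from the naming patterns above and compare them with what exists. This is a small sketch only: the function, its parameters, and the sample names are hypothetical, and the feature switches stand in for the corresponding root settings.

```python
from typing import List


def expected_log_groups(prefix: str,
                        events_service: str,
                        dashboard_service: str,
                        kafka_ui_service: str,
                        connector_name: str,
                        *,
                        events_newrelic_sidecar: bool = False,
                        clickhouse_agent: bool = False,
                        sink_enabled: bool = False) -> List[str]:
    """Build the log group names to expect for one root, following the patterns above."""
    groups = [f"/ecs/{events_service}"]
    if events_newrelic_sidecar:
        groups.append(f"/ecs/{events_service}/newrelic-infra")
    groups += [f"/ecs/{dashboard_service}", f"/ecs/{kafka_ui_service}"]
    if clickhouse_agent:
        groups.append(f"/ecs/{prefix}-clickhouse-prometheus-agent")
    if sink_enabled:
        groups.append(f"/msk-connect/{connector_name}")
    groups.append("/vpc/flow-logs")  # rejected traffic only in the committed roots
    return groups


# Hypothetical names for illustration.
groups = expected_log_groups(
    "demo", "demo-events", "demo-dashboard", "demo-kafka-ui", "demo-sink",
    clickhouse_agent=True,
)
print("\n".join(groups))
```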
Environment differences
- staging example values keep most log retention at 1 day for cost control.
- prod committed values keep application, connector, environment-operations, and flow-log retention at 3 days.
- Both roots keep the same alarm types, dashboard layout model, and budget threshold model.
- DevOps Agent investigation creation was only tested and is not an active runbook path in either environment.
- staging currently enables the dedicated Slack delivery path through committed values; prod keeps that path disabled until a real webhook is injected outside Git.
- Both committed roots currently enable the New Relic AWS pull integration and New Relic dashboards for ECS, RDS, MSK, events ingestion, and ClickHouse Cloud.
- Both committed roots currently enable the ClickHouse Cloud Prometheus agent resources and dashboard; the collector service desired count is controlled separately and is currently 0 in the committed values.
The current CloudWatch alarms cover AWS resource availability and pressure only. ClickHouse Cloud metrics are ingested into New Relic for dashboarding first; Slack alerts should be added later with New Relic NRQL conditions after live metrics and thresholds are validated.