
Monitoring and alerting

Atlas uses AWS-native observability in both environment roots and layers New Relic dashboards on top of samples imported by the New Relic AWS pull integration. The current implementation emphasizes fast signal capture over deep customization, with one shared operator SNS topic plus an optional dedicated Slack SNS topic.

Signals by layer

| Layer | Signal | Current implementation |
| --- | --- | --- |
| ECS | task health, CPU, memory | CloudWatch ECS service metrics on the shared ECS cluster; Container Insights is enabled in the active staging and production values so running-task alarms can use ECS/ContainerInsights |
| RDS | storage pressure, CPU pressure, active session load, and connection visibility | CloudWatch metrics and dedicated alarms for the dashboard and Camunda databases |
| MSK | broker health, disk usage, CPU, memory, and replication | enhanced monitoring set through msk_cloudwatch_enhanced_monitoring plus broker alarms and the msk_under_replicated alarm |
| Logs | workload, connector, and network logs | CloudWatch log groups with environment-specific retention; VPC Flow Logs capture rejected traffic only in the committed roots |
| Dashboard | cross-service runtime view | CloudWatch dashboard per root with alarm, MSK, ECS, and RDS sections |
| Cost | monthly spend thresholds | AWS Budget notifications to owner_email |
| New Relic | account-level AWS service metrics and dashboards | pull integration for ALB, ECS, RDS, MSK, S3, and VPC; dashboards for non-events ECS services, RDS, MSK, events ingestion, and ClickHouse Cloud |
| ClickHouse Cloud | service metrics, query activity, capacity signals, and scrape health | ECS Prometheus agent scraping api.clickhouse.cloud with filtered_metrics=false and remote-writing to New Relic when enabled in a root |

Provisioned alarms

| Alarm | What it means |
| --- | --- |
| <service-name>-high-cpu | an ECS service stays above the configured CPUUtilization threshold |
| <service-name>-high-memory | an ECS service stays above the configured MemoryUtilization threshold |
| <service-name>-running-tasks-below-desired | an ECS service is running below the desired count committed in Terraform |
| <service-name>-unhealthy-targets | an ALB-backed ECS service has unhealthy targets in its target group |
| <db-instance-identifier>-high-cpu | the dashboard or Camunda RDS instance stays above 80% CPU for 5 minutes |
| <db-instance-identifier>-high-db-load | the dashboard or Camunda RDS instance stays above 4 average active sessions (DBLoad) for 5 minutes |
| <db-instance-identifier>-low-free-storage | the dashboard or Camunda RDS instance drops below the configured FreeStorageSpace threshold |
| msk_under_replicated | the Kafka cluster has under-replicated partitions |
| <prefix>-msk-broker-<id>-disk-usage-critical | a broker is above the critical threshold for native KafkaDataLogsDiskUsed |
| <prefix>-msk-broker-<id>-high-cpu-user | a broker is above the configured native CpuUser threshold |
| <prefix>-msk-broker-<id>-low-memory-available | a broker is below the configured estimated available-memory threshold, calculated from native MemoryFree + MemoryCached + MemoryBuffered |
| <prefix>-msk-broker-<id>-swap-used | a broker reports native SwapUsed greater than or equal to the configured threshold; the default alerts on any swap usage |

All alarms publish to the shared operator SNS topic, which is named from the root prefix and has an email subscription for the operator address configured in owner_email.
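The MSK low-memory and swap alarms in the table above are the only ones that rely on a derived value or an inclusive comparison, so a small sketch may help when reasoning about thresholds. This is an illustration only, not the Terraform metric-math expression used in the roots: the class name, function names, and the example threshold values are hypothetical, and all values are assumed to be in bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrokerMemorySample:
    """One datapoint of native MSK broker memory metrics (assumed bytes)."""
    memory_free: float
    memory_cached: float
    memory_buffered: float


def estimated_available_memory(sample: BrokerMemorySample) -> float:
    """Estimate available memory as MemoryFree + MemoryCached + MemoryBuffered."""
    return sample.memory_free + sample.memory_cached + sample.memory_buffered


def low_memory_breached(sample: BrokerMemorySample, threshold_bytes: float) -> bool:
    """Breach when the estimate falls below the configured threshold."""
    return estimated_available_memory(sample) < threshold_bytes


def swap_breached(swap_used_bytes: float, threshold_bytes: float) -> bool:
    """Breach when SwapUsed is greater than or equal to the threshold."""
    return swap_used_bytes >= threshold_bytes


if __name__ == "__main__":
    sample = BrokerMemorySample(1.0e9, 5.0e8, 2.5e8)
    # The 2 GB threshold here is a made-up example value.
    print(estimated_available_memory(sample), low_memory_breached(sample, 2.0e9))
```

Because cached and buffered pages are counted as available, the estimate is more forgiving than MemoryFree alone, which tends to sit low on a healthy broker.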

The environment-operations module still contains alarm_investigation test wiring that was used to validate the AWS DevOps Agent generic webhook shape from SNS alarm notifications. Do not treat that path as implemented operational alerting. The supported alert delivery paths are the shared operator SNS topic with email subscription and, when enabled, the dedicated Slack SNS topic plus Slack notifier Lambda.

When monitoring_slack_notifications_enabled = true, the roots also create a dedicated poc_alerts_slack SNS topic and a separate Slack notifier Lambda. That path is isolated from environment-operations and forwards CloudWatch alarm notifications to the configured Slack incoming webhook using a readable Slack layout with state, region, resource, description, metric summary, and a direct CloudWatch link.
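To make the Slack layout described above concrete, here is a minimal sketch of turning a CloudWatch alarm notification into a Slack incoming-webhook payload. It is not the notifier Lambda's actual code. The input field names follow the JSON body CloudWatch places in SNS alarm notifications; the function names, the plain-text layout, and the console link shape are assumptions made for illustration.

```python
import json
from urllib.parse import quote


def region_from_alarm_arn(alarm_arn: str) -> str:
    """Extract the region code from arn:aws:cloudwatch:<region>:<account>:alarm:<name>."""
    return alarm_arn.split(":")[3]


def build_slack_payload(sns_message: str) -> dict:
    """Build a Slack payload from the SNS 'Message' string of a CloudWatch alarm."""
    alarm = json.loads(sns_message)
    trigger = alarm.get("Trigger", {})
    region = region_from_alarm_arn(alarm["AlarmArn"])

    # Dimensions identify the resource (service, DB instance, broker, ...).
    dims = ", ".join(
        "{}={}".format(d["name"], d["value"]) for d in trigger.get("Dimensions", [])
    )
    # Metric-math alarms carry no single MetricName, hence the "?" fallbacks.
    metric = "{}/{} {} {}".format(
        trigger.get("Namespace", "?"),
        trigger.get("MetricName", "?"),
        trigger.get("ComparisonOperator", "?"),
        trigger.get("Threshold", "?"),
    )
    # Assumed console URL shape for a direct link to the alarm.
    link = (
        "https://{r}.console.aws.amazon.com/cloudwatch/home?region={r}"
        "#alarmsV2:alarm/{name}"
    ).format(r=region, name=quote(alarm["AlarmName"], safe=""))

    lines = [
        "*{}*: {}".format(alarm["NewStateValue"], alarm["AlarmName"]),
        "Region: {}".format(region),
        "Resource: {}".format(dims or "n/a"),
        "Description: {}".format(alarm.get("AlarmDescription") or "n/a"),
        "Metric: {}".format(metric),
        "<{}|Open in CloudWatch>".format(link),
    ]
    return {"text": "\n".join(lines)}
```

The returned dictionary would be JSON-encoded and POSTed to the configured webhook URL; that step is omitted here so the sketch has no network dependency.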

The CloudWatch dashboard created in each root organizes:

  • alarm status across ECS, MSK, and RDS
  • MSK replication, disk usage, native CPU metrics, estimated available memory, swap usage, and throughput
  • ECS CPU, memory, running task count, and unhealthy ALB targets
  • RDS free storage in GiB, CPU utilization, DBLoad, and DatabaseConnections

The New Relic ClickHouse Cloud dashboard is created when clickhouse_prometheus_agent_enabled = true. It is scoped by the prometheus_server attribute generated from the root prefix and includes service info, query activity, capacity-oriented panels, metric discovery, and Prometheus scrape health. It intentionally does not create New Relic alert conditions or Slack workflows yet.

The New Relic AWS-metric dashboards are controlled separately from the events ingestion sidecar dashboard:

  • ecs_service_newrelic_dashboards_enabled creates one dashboard each for dashboard backend, scoring, Camunda, and Kafka UI. These use ComputeSample for ECS service CPU, memory, and task count, plus LoadBalancerSample for ALB target metrics when a target group exists.
  • dashboard_backend_newrelic_apm_entity_guid is optional. When populated, the dashboard backend dashboard adds APM throughput and error widgets without changing the CloudWatch-based panels.
  • rds_newrelic_dashboard_enabled creates one dashboard for the dashboard and Camunda PostgreSQL instances, including CPU, connections, memory, free storage, DBLoad, IOPS, throughput, latency, disk queue depth, deadlocks, burst balance, and CPU credits.
  • msk_newrelic_dashboard_enabled creates one dashboard for the MSK cluster, including replication health, offline partitions, broker disk, CPU, memory, swap, throughput, throttle pressure, produce/fetch latency, partitions, and leaders.

After apply, print the dashboard links with terraform output ecs_service_newrelic_dashboard_permalinks, terraform output -raw rds_newrelic_dashboard_permalink, and terraform output -raw msk_newrelic_dashboard_permalink.

ClickHouse Cloud Prometheus agent

The ClickHouse Prometheus agent runs as a dedicated ECS service in private subnets when enabled in a root. It has no ALB, no Service Connect listener, no public IP, and no inbound security-group rules. Outbound HTTPS uses the existing NAT path to reach ClickHouse Cloud and New Relic.
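For orientation when debugging the outbound path, the following sketch shows how the scrape request to api.clickhouse.cloud could be assembled. It is a hedged illustration, not the agent's configuration: the URL path is assumed from the ClickHouse Cloud Prometheus endpoint shape and should be confirmed against the ClickHouse Cloud documentation, HTTP basic authentication with the API key ID and secret is likewise an assumption, and the function names are hypothetical.

```python
import base64
from urllib.parse import urlencode


def prometheus_scrape_url(org_id: str, service_id: str, filtered_metrics: bool = False) -> str:
    """Build the assumed ClickHouse Cloud Prometheus endpoint for one service."""
    # filtered_metrics=false requests the full metric set, as described above.
    query = urlencode({"filtered_metrics": str(filtered_metrics).lower()})
    return (
        "https://api.clickhouse.cloud/v1/organizations/{}/services/{}/prometheus?{}"
    ).format(org_id, service_id, query)


def basic_auth_header(key_id: str, key_secret: str) -> dict:
    """Build an HTTP basic-auth header from the API key ID and secret (assumed scheme)."""
    token = base64.b64encode("{}:{}".format(key_id, key_secret).encode()).decode()
    return {"Authorization": "Basic {}".format(token)}


if __name__ == "__main__":
    # Placeholder identifiers, not real values.
    print(prometheus_scrape_url("<org-id>", "<service-id>"))
```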

Before expecting the ECS service to report healthy ingestion:

  1. Replace the placeholder values in the secret named by terraform output -raw clickhouse_prometheus_agent_secret_name with real values for CLICKHOUSE_ORG_ID, CLICKHOUSE_SERVICE_ID, CLICKHOUSE_API_KEY_ID, CLICKHOUSE_API_KEY_SECRET, and NEW_RELIC_LICENSE_KEY.
  2. Check /ecs/<prefix>-clickhouse-prometheus-agent for ClickHouse authentication errors, scrape failures, or New Relic remote-write errors.
  3. Open the dashboard at the URL printed by terraform output -raw clickhouse_cloud_newrelic_dashboard_permalink and validate live metric names before designing alerts.
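A quick pre-flight check for the first step above can catch a half-populated secret before the agent starts failing authentication. The sketch below assumes the secret is stored as a JSON object keyed by the five names listed in that step. The storage format, the placeholder markers, and the function name are hypothetical and should be adjusted to the real secret.

```python
import json

REQUIRED_KEYS = (
    "CLICKHOUSE_ORG_ID",
    "CLICKHOUSE_SERVICE_ID",
    "CLICKHOUSE_API_KEY_ID",
    "CLICKHOUSE_API_KEY_SECRET",
    "NEW_RELIC_LICENSE_KEY",
)

# Hypothetical markers; match these to whatever placeholder text the secret ships with.
PLACEHOLDER_MARKERS = ("placeholder", "changeme", "replace")


def find_unset_keys(secret_string: str) -> list:
    """Return required keys that are missing, empty, or still look like placeholders."""
    values = json.loads(secret_string)
    problems = []
    for key in REQUIRED_KEYS:
        value = str(values.get(key, "")).strip()
        looks_placeholder = any(marker in value.lower() for marker in PLACEHOLDER_MARKERS)
        if not value or looks_placeholder:
            problems.append(key)
    return problems


if __name__ == "__main__":
    example = json.dumps({"CLICKHOUSE_ORG_ID": "REPLACE_ME"})
    print(find_unset_keys(example))  # lists all five keys for this example
```

The substring check is a heuristic: a real value that happens to contain one of the markers would be reported as unset.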

Log groups to expect

  • /ecs/<events-service-name>
  • /ecs/<events-service-name>/newrelic-infra when the events New Relic sidecar is enabled
  • /ecs/<dashboard-service-name>
  • /ecs/<kafka-ui-service-name>
  • /ecs/<prefix>-clickhouse-prometheus-agent when the ClickHouse Prometheus agent is enabled
  • /msk-connect/<connector-name> when the sink is enabled
  • /vpc/flow-logs for rejected traffic only in the committed roots
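The naming rules in the list above can be expressed as a small helper, which is handy for checking that every expected group exists after an apply. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the feature flags stand in for whichever root variables enable each component.

```python
from typing import List, Optional


def expected_log_groups(
    prefix: str,
    events_service: str,
    dashboard_service: str,
    kafka_ui_service: str,
    connector_name: Optional[str] = None,
    events_newrelic_sidecar: bool = False,
    clickhouse_agent: bool = False,
) -> List[str]:
    """Return the CloudWatch log group names expected for one root."""
    groups = ["/ecs/{}".format(events_service)]
    if events_newrelic_sidecar:
        groups.append("/ecs/{}/newrelic-infra".format(events_service))
    groups.append("/ecs/{}".format(dashboard_service))
    groups.append("/ecs/{}".format(kafka_ui_service))
    if clickhouse_agent:
        groups.append("/ecs/{}-clickhouse-prometheus-agent".format(prefix))
    if connector_name:
        # Only present when the MSK Connect sink is enabled.
        groups.append("/msk-connect/{}".format(connector_name))
    groups.append("/vpc/flow-logs")
    return groups


if __name__ == "__main__":
    # Example names are invented for illustration.
    for name in expected_log_groups("demo", "demo-events", "demo-dashboard", "demo-kafka-ui"):
        print(name)
```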

Environment differences

  • staging example values keep most log retention at 1 day for cost control.
  • prod committed values keep application, connector, environment-operations, and flow-log retention at 3 days.
  • Both roots keep the same alarm types, dashboard layout model, and budget threshold model.
  • DevOps Agent investigation creation was only tested and is not an active runbook path in either environment.
  • staging currently enables the dedicated Slack delivery path through committed values; prod keeps that path disabled until a real webhook is injected outside Git.
  • Both committed roots currently enable the New Relic AWS pull integration and New Relic dashboards for ECS, RDS, MSK, events ingestion, and ClickHouse Cloud.
  • Both committed roots currently enable the ClickHouse Cloud Prometheus agent resources and dashboard; the collector service desired count is controlled separately and is currently 0 in the committed values.

Note: The current CloudWatch alarms cover AWS resource availability and pressure only. ClickHouse Cloud metrics are ingested into New Relic for dashboarding first; Slack alerts should be added later with New Relic NRQL conditions after live metrics and thresholds are validated.