DevOps Agent investigation test and broader alarm coverage

· 2 min read
Atlas Infra

Atlas Infra tested an AWS DevOps Agent investigation path from the existing staging alerts SNS topic, and the CloudWatch alarm set became more explicit about which services and databases are under pressure.

  • the environment-operations alarm_investigation Lambda was used as a proof-of-concept against the shared alerts SNS topic
  • the test Lambda only reacted when a CloudWatch alarm notification entered ALARM
  • the test webhook request was signed with HMAC and included the alarm name, environment, region, account, description, metrics, and a CloudWatch console link
  • ECS task-count alarms now cover events-ingestion, dashboard-backend, scoring, and camunda
  • Kafka UI stays out of the task-count alarm set
  • both RDS instances now have CPUUtilization and DBLoad alarms
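The filtering and signing steps described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the actual alarm_investigation Lambda: the function names, the `X-Signature-SHA256` header, the choice of SHA-256, and the console URL format are all assumptions. The field names read from the SNS message (`NewStateValue`, `AlarmArn`, `Trigger`, and so on) follow the standard CloudWatch alarm notification format.

```python
import hashlib
import hmac
import json
from urllib.parse import quote


def alarm_from_sns_record(record):
    """Parse one SNS record; return the alarm dict only if it entered ALARM."""
    message = json.loads(record["Sns"]["Message"])
    if message.get("NewStateValue") != "ALARM":
        return None  # ignore OK and INSUFFICIENT_DATA transitions
    return message


def build_payload(alarm, environment):
    """Collect the fields listed above into a webhook payload (hypothetical shape)."""
    # Alarm ARNs look like arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    region = alarm["AlarmArn"].split(":")[3]
    name = alarm["AlarmName"]
    trigger = alarm.get("Trigger", {})
    console_url = (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#alarmsV2:alarm/{quote(name, safe='')}"
    )
    return {
        "alarm_name": name,
        "environment": environment,
        "region": region,
        "account": alarm.get("AWSAccountId"),
        "description": alarm.get("AlarmDescription"),
        "metrics": [
            {"namespace": trigger.get("Namespace"), "name": trigger.get("MetricName")}
        ],
        "console_url": console_url,
    }


def sign(body, secret):
    """Return a hex HMAC-SHA256 signature over the exact request body bytes."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def prepare_requests(event, environment, secret):
    """Return (headers, body) pairs for every record that entered ALARM."""
    prepared = []
    for record in event.get("Records", []):
        alarm = alarm_from_sns_record(record)
        if alarm is None:
            continue
        payload = build_payload(alarm, environment)
        body = json.dumps(payload, sort_keys=True).encode("utf-8")
        headers = {
            "Content-Type": "application/json",
            "X-Signature-SHA256": sign(body, secret),  # hypothetical header name
        }
        prepared.append((headers, body))
    return prepared
```

The sketch stops short of sending the request, so it can be exercised without network access. A receiver would recompute the HMAC over the raw body and compare it with `hmac.compare_digest`.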

This update keeps the existing email notification flow in place. The DevOps Agent integration was only a test and is not treated as an implemented operational path.

The revised alarm set also removes the earlier ALB unhealthy-host alarm, which was not producing reliable signal for this stack.

The current monitoring shape is:

  • shared SNS topic for alarm fan-out
  • operator email subscription in both roots
  • DevOps Agent investigation test wiring only; not an active runbook path
  • explicit task-count alarms for the non-Kafka-UI ECS workloads
  • direct database pressure alarms for dashboard and Camunda RDS
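As a rough illustration of the alarm set above, the sketch below builds one parameter dict per alarm. Almost everything in it is an assumption: the alarm naming scheme, the cluster name, the RDS instance identifiers, the thresholds and periods, and the use of the Container Insights `RunningTaskCount` metric for task counts. Only the service list, the two RDS metrics, and the shared SNS topic as the alarm action come from the update itself.

```python
# ECS services covered by task-count alarms; Kafka UI is deliberately left out.
ECS_SERVICES = ["events-ingestion", "dashboard-backend", "scoring", "camunda"]

# Hypothetical RDS instance identifiers for the dashboard and Camunda databases.
RDS_INSTANCES = ["dashboard", "camunda"]


def ecs_task_count_alarm(env, cluster, service, topic_arn):
    """Alarm when a service has no running tasks (assumed threshold)."""
    return {
        "AlarmName": f"{env}-{service}-task-count",
        "Namespace": "ECS/ContainerInsights",  # assumes Container Insights is enabled
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",
        "AlarmActions": [topic_arn],
    }


def rds_alarms(env, db_id, topic_arn):
    """CPUUtilization and DBLoad alarms for one instance (assumed thresholds)."""
    specs = [("CPUUtilization", 80.0), ("DBLoad", 4.0)]
    return [
        {
            "AlarmName": f"{env}-{db_id}-rds-{metric.lower()}",
            "Namespace": "AWS/RDS",
            "MetricName": metric,
            "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_id}],
            "Statistic": "Average",
            "Period": 300,
            "EvaluationPeriods": 3,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [topic_arn],
        }
        for metric, threshold in specs
    ]


def all_alarms(env, cluster, topic_arn):
    """Full alarm set: one task-count alarm per service, two alarms per database."""
    alarms = [ecs_task_count_alarm(env, cluster, s, topic_arn) for s in ECS_SERVICES]
    for db_id in RDS_INSTANCES:
        alarms.extend(rds_alarms(env, db_id, topic_arn))
    return alarms
```

Each dict uses keyword names accepted by the CloudWatch `PutMetricAlarm` API, so with boto3 it could be applied as `boto3.client("cloudwatch").put_metric_alarm(**params)`. If the stack is managed through infrastructure-as-code, the same shape would map onto that tool's alarm resource instead.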

Prod does not use the DevOps Agent investigation path.