DevOps Agent investigation test and broader alarm coverage
Atlas Infra tested an AWS DevOps Agent investigation path from the existing staging alerts SNS topic, and the CloudWatch alarm set became more explicit about which services and databases are under pressure.
- the `environment-operations` `alarm_investigation` Lambda was used as a proof-of-concept against the shared alerts SNS topic
- the test Lambda only reacted when a CloudWatch alarm notification entered `ALARM`
- the test webhook request was signed with HMAC and included the alarm name, environment, region, account, description, metrics, and a CloudWatch console link
- ECS task-count alarms now cover `events-ingestion`, `dashboard-backend`, `scoring`, and `camunda`
- Kafka UI stays out of the task-count alarm set
- both RDS instances now have `CPUUtilization` and `DBLoad` alarms
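The test Lambda's behaviour described above (react only on `ALARM`, then send an HMAC-signed webhook body) can be sketched roughly as follows. This is an illustration under assumptions, not the actual `alarm_investigation` code: the function name `build_signed_requests`, the `X-Signature-SHA256` header, the SHA-256 digest, and the payload key names are all hypothetical. The input fields (`AlarmName`, `NewStateValue`, `AlarmArn`, `AWSAccountId`, `Trigger`) follow the standard CloudWatch alarm notification that SNS delivers.

```python
import hashlib
import hmac
import json
from urllib.parse import quote

# Hypothetical header name; the real webhook contract is not given in this post.
SIGNATURE_HEADER = "X-Signature-SHA256"


def build_signed_requests(event, secret, environment):
    """Return (body, headers) pairs for SNS records whose alarm entered ALARM."""
    requests = []
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        # Ignore OK and INSUFFICIENT_DATA transitions.
        if alarm.get("NewStateValue") != "ALARM":
            continue
        # AlarmArn looks like arn:aws:cloudwatch:<region>:<account>:alarm:<name>
        region = alarm["AlarmArn"].split(":")[3]
        trigger = alarm.get("Trigger", {})
        name = alarm["AlarmName"]
        payload = {
            "alarm_name": name,
            "environment": environment,
            "region": region,
            "account": alarm["AWSAccountId"],
            "description": alarm.get("AlarmDescription"),
            "metrics": [
                {
                    "namespace": trigger.get("Namespace"),
                    "name": trigger.get("MetricName"),
                }
            ],
            "console_url": (
                f"https://{region}.console.aws.amazon.com/cloudwatch/home"
                f"?region={region}#alarmsV2:alarm/{quote(name, safe='')}"
            ),
        }
        # Sign the exact bytes that would be sent, so the receiver can verify them.
        body = json.dumps(payload, sort_keys=True).encode("utf-8")
        signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
        headers = {
            "Content-Type": "application/json",
            SIGNATURE_HEADER: signature,
        }
        requests.append((body, headers))
    return requests
```

Keeping the filtering and signing separate from the HTTP call makes the sketch testable without network access; a receiver would recompute the HMAC over the raw body and compare it with `hmac.compare_digest`.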
This update keeps the existing email notification flow in place. The DevOps Agent integration was only a test and is not treated as an implemented operational path.
The revised alarm set also removes the earlier ALB unhealthy-host alarm, which was not producing reliable signal for this stack.
The current monitoring shape is:
- shared SNS topic for alarm fan-out
- operator email subscription in both roots
- DevOps Agent investigation test wiring only; not an active runbook path
- explicit task-count alarms for the non-Kafka-UI ECS workloads
- direct database pressure alarms for dashboard and Camunda RDS
`prod` does not use the DevOps Agent investigation path.
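As a rough illustration of the alarm coverage listed above, the sketch below builds one parameter set per alarm. Several things here are assumptions rather than facts from this update: the post does not say how the alarms are provisioned, which metric backs the task-count alarms (the Container Insights `RunningTaskCount` metric is assumed), or what thresholds are used. The cluster name, topic ARN, RDS instance identifiers, and helper names are placeholders.

```python
ECS_SERVICES = ["events-ingestion", "dashboard-backend", "scoring", "camunda"]
# Placeholder identifiers for the dashboard and Camunda RDS instances.
RDS_INSTANCES = ["dashboard", "camunda"]


def task_count_alarm(cluster, service, topic_arn, min_tasks=1):
    """Alarm when a service runs fewer tasks than min_tasks (assumed threshold)."""
    return {
        "AlarmName": f"{cluster}-{service}-task-count",
        "Namespace": "ECS/ContainerInsights",  # assumes Container Insights is enabled
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": min_tasks,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # no datapoints likely means no tasks
        "AlarmActions": [topic_arn],
    }


def rds_alarm(instance, metric, threshold, topic_arn):
    """Alarm when an RDS pressure metric stays above threshold (assumed values)."""
    return {
        "AlarmName": f"{instance}-rds-{metric}",
        "Namespace": "AWS/RDS",
        "MetricName": metric,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def build_alarms(cluster, topic_arn):
    """One task-count alarm per covered ECS service, two alarms per RDS instance."""
    alarms = [task_count_alarm(cluster, s, topic_arn) for s in ECS_SERVICES]
    for instance in RDS_INSTANCES:
        alarms.append(rds_alarm(instance, "CPUUtilization", 80.0, topic_arn))
        alarms.append(rds_alarm(instance, "DBLoad", 4.0, topic_arn))
    return alarms
```

The dictionary keys match the keyword arguments of boto3's `put_metric_alarm`, so each entry could be applied with `cloudwatch.put_metric_alarm(**alarm)`; the same shape would map onto an infrastructure-as-code definition if that is how the stack is managed. Kafka UI is deliberately absent from `ECS_SERVICES`, and every alarm points at the single shared SNS topic.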