CMS Alerts are managed by Sentry and DataDog.
Runtime issues are reported to Sentry via the Raven module.
Various metrics in the CI/CD phases and at runtime are reported to DataDog.
DataDog also includes some monitors that probe correct functionality of various responsibilities, e.g. Tugboat base previews completing successfully, CMS login pages being accessible, etc.
When these checks fail in some way, DataDog will generally respond in one of a few different ways:
-
notify Slack directly for the awareness of team members and stakeholders
-
notify PagerDuty for issues that should be remediated by the DevOps team.
Alert Level / Target | Action |
---|---|
Urgent | PagerDuty Critical >> Slack #cms-notifications |
High | @here in Slack |
Medium | Slack #cms-notifications @cms-alerts-medium |
Low | None |
QA | Slack #cms-notifications @cms-qa-engineers |
Drupal Level | Alert Level / Target |
---|---|
Emergency, Alert, Critical | Urgent |
Error | High |
Warning, Notice | None |
Informational, Debug | None |
Source | Alert Path | Support Severity Level |
---|---|---|
GraphQL API | DataDog to PagerDuty | High |
JSD Widget | Datadog to PagerDuty | Medium |
DNS | DataDog to PagerDuty | Urgent |
Drupal Emergency/Critical/Alert/Error | Drupal to Sentry to Slack | QA |
Drupal Warning/Info/Notice/Debug | None | None |
Prod CMS Down | Datadog to Slack, PagerDuty | Urgent |
Drupal Post Content Webhook | GovDelivery, Drupal post_api to Slack |
High |
Drupal Flag List (/flags_list ) |
Datadog to Slack | High |
Content Build Fails | Github Actions to Datadog to PagerDuty; also reports into #status-content-build in Slack |
High |
Forms API | Datadog to PagerDuty | Medium |
Health Service Descriptions | Drupal post_api to Slack |
Medium |
Tugboat Base Preview Accessible | DataDog to PagerDuty, Slack | Medium |
Tugoboat Server Resource | DataDog to PagerDuty, Slack | High |
GovDeliveryAPI | Trigger Drupal Error | High |
Periodic Job failure in Jenkins | Jenkins to Slack | Medium |
Daily Job in Jenkins | Jenkins to Slack | Medium |
Prod Deploy Warn | Jenkins to Slack | Medium |
Prod Deploy Start | Jenkins to Slack | Medium |
Prod Deploy Failure | Jenkins to Slack | High |
Prod Deploy Success | Jenkins to Slack | Medium |
Non-Prod Deploy Failures | Jenkins to Slack | Medium |
Non-Prod CMS Down | Datadog to PagerDuty (non-critical) | Medium |
Staging Test Failures | Jenkins to Slack | High |