
On Call Responsibilities


Overview

This document provides comprehensive guidelines on the responsibilities and expectations for team members who are on call. It consolidates all relevant procedures, including incident response, communication standards, monitoring requirements, and deployment responsibility.

Summary

  • On-call rotations align with each two-week sprint and consist of a primary and a secondary engineer.
  • The primary on-call person is the first point of contact for issues but not necessarily the one who resolves everything.
  • Responsibilities include but are not limited to monitoring application health, responding to outages, performing deployments, addressing SecRel failures, and reviewing Dependabot changes.
  • Each reported incident must be converted into a ZenHub ticket and will be prioritized based on severity.
  • Given the variety of responsibilities on-call engineers are tasked with, and because the primary is not expected to resolve everything personally (see the second bullet above), the on-call engineer should remain mindful of their capacity to juggle these responsibilities and reach out to the secondary on-call engineer, or to teammates outside the on-call rotation, for assistance.

On-Call Rotation

Each two-week sprint will have a story assigned to a primary and secondary engineer who will be responsible for the on-call responsibilities outlined below. The secondary on-call person from a sprint will typically become the primary for the following sprint, and an engineer will volunteer to become the new secondary.

The role of the primary contact is to be the first point of contact for issues but not necessarily the one who resolves everything - they should enlist the help of the team as needed and be responsible for tracking the completion of all on-call tasks. The secondary contact should remain accessible to the primary for assistance as required, and concentrate on addressing smaller tickets or collaborating on larger ones during the Sprint.

Responsibilities

Responding to Incidents

VRO considers an incident to be an unplanned disruption or degradation of service that impacts normal operations or user experience. On-call engineers must follow the incident response and communication standards agreed upon by VRO when responding to incidents. Each reported incident must be converted into a ZenHub ticket and included in the Sprint. When creating the ticket, ensure it is linked to the "Incident" epic and follow the guidance on labeling the root cause of the issue within the epic. Incidents initiated via the Slack workflow will automatically have a ZenHub ticket created, but that ticket still needs to be linked to the "Incident" epic.
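
For illustration only, the sketch below shows one way an incident ticket could be opened and labeled programmatically through the GitHub Issues API. The repository name, label, and token variable are assumptions, and linking the ticket to the "Incident" epic still happens in ZenHub.

```python
# Hypothetical sketch: open a labeled incident ticket via the GitHub Issues API.
# The repository, label name, and GITHUB_TOKEN environment variable are assumptions;
# linking the ticket to the "Incident" epic still has to be done in ZenHub.
import os
import requests

REPO = "department-of-veterans-affairs/abd-vro"  # assumed repository

resp = requests.post(
    f"https://api.github.com/repos/{REPO}/issues",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "title": "Incident: <short description>",
        "body": "What happened, when it started, and the current impact.",
        "labels": ["incident"],  # assumed label name
    },
    timeout=30,
)
resp.raise_for_status()
print("Created issue:", resp.json()["html_url"])
```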

After investigating the incident, determine whether the cause of the failure can be attributed to the VRO team, to LHDI, or to partner teams or external VA systems, and add the appropriate label to the incident ticket. VRO-caused failures tie directly to our team's quality metrics and are what we measure our MTTR (Mean Time to Resolve) against. Issues attributed to LHDI trace to the administration of ArgoCD, Aqua, k8s, SecRel, or Vault. Other failures are due to an issue in an application controlled by a partner team or to an external VA system not functioning appropriately; these issues are not within VRO's control, but we must work in partnership with our partner teams to investigate and assist in their resolution.
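
For reference, MTTR here is simply the average elapsed time from when a VRO-caused incident is reported to when it is resolved. A minimal sketch of the calculation, with made-up timestamps:

```python
# Minimal sketch of the MTTR (Mean Time to Resolve) calculation for
# VRO-attributed incidents; the incident data below is illustrative only.
from datetime import datetime, timedelta

incidents = [
    # (reported, resolved) -- hypothetical VRO-caused incidents
    (datetime(2024, 10, 1, 9, 0), datetime(2024, 10, 1, 11, 30)),
    (datetime(2024, 10, 7, 14, 15), datetime(2024, 10, 7, 15, 0)),
]

durations = [resolved - reported for reported, resolved in incidents]
mttr = sum(durations, timedelta()) / len(durations)
print(f"MTTR: {mttr}")  # 1:37:30 for the sample data above
```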

Performing Deployments

An additional responsibility of on-call engineers is to handle software deployments. The VRO Deployment Policy document describes the scheduled release cadence, pathway to requesting ad-hoc deployments, and the procedures for performing a deployment.

Deployments are scheduled to go out the first Wednesday of a new sprint at a time that works best for the release captain, and contain all VRO platform services as well as any partner team services that have been requested for release. The #benefits-vro-on-call channel receives an automatic reminder 24 hours before a scheduled release, at which point the on-call engineer needs to initiate the workflow that notifies partner teams of the release. During this window, partner teams should either specify the changes being released, including any specific hashes to be deployed, or opt out of the release, in which case only VRO platform services will be pushed. The on-call engineer should also be available to handle ad-hoc deployment requests from partner teams to push code to lower, non-prod environments or to push emergency fixes to production.
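
The authoritative procedure lives in the VRO Deployment Policy document, but as a rough sketch, an ad-hoc deployment could be kicked off through GitHub's workflow-dispatch API along these lines. The workflow file name, input names, and environment value are assumptions.

```python
# Hypothetical sketch: trigger an ad-hoc deployment through GitHub's
# workflow_dispatch API. The workflow file name, input names, and target
# environment are assumptions -- the actual procedure is defined in the
# VRO Deployment Policy document.
import os
import requests

REPO = "department-of-veterans-affairs/abd-vro"   # assumed repository
WORKFLOW = "deploy.yml"                            # assumed workflow file

resp = requests.post(
    f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW}/dispatches",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "ref": "develop",
        "inputs": {
            "environment": "dev",          # assumed: a lower, non-prod environment
            "image_tag": "<commit-hash>",  # specific hash requested by a partner team
        },
    },
    timeout=30,
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
print("Deployment workflow dispatched")
```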

Monitoring Channels

Below are the primary Slack channels within the VA’s Office of the Chief Technology Officer (OCTO) workspace which should be monitored for incidents and alerts. Follow the incident response guidelines above for incidents which impact partner team applications.

#benefits-vro-alerts

This channel should be monitored for alerts triggered through DataDog.

#benefits-vro-support

This channel should be monitored for communication from partner teams.

#benefits-vro-on-call

This channel should be monitored for incident report workflow notifications and PagerDuty alerts. The incident report workflow should be the preferred route for notifying the VRO team of incidents.

Monitoring Tools

PagerDuty

PagerDuty is a support tool for handling scheduling and alerts for on-call engineers. The tool is configured to align with the engineers assigned as primary and secondary contacts for incidents and will alert the team when new incidents are reported according to our custom escalation policies.
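
Schedules and escalation policies are configured in PagerDuty itself, but as an illustration, an alert could be raised programmatically through PagerDuty's Events API v2 along the lines below; the routing key variable and payload fields are placeholders.

```python
# Hypothetical sketch: raise a PagerDuty alert via the Events API v2.
# The routing key comes from a PagerDuty service integration; the environment
# variable name and payload details below are placeholders.
import os
import requests

event = {
    "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],  # assumed env var
    "event_action": "trigger",
    "payload": {
        "summary": "VRO incident reported via Slack workflow",
        "source": "benefits-vro-on-call",
        "severity": "critical",
    },
}

resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=30)
resp.raise_for_status()
print(resp.json())  # response includes a dedup_key for the new alert
```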

DataDog

DataDog (wiki reference) is a cloud-based application monitoring tool that supports custom alerts and visualizations generated from incoming events. These alerts, which mostly pertain to service availability and performance, are sent to the #benefits-vro-alerts Slack channel and should be addressed promptly when received.
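
As a hedged illustration of how such an alert might be defined, the sketch below creates a metric monitor through the Datadog Monitors API; the metric, tag, threshold, and Slack handle are placeholders rather than VRO's actual monitor definitions.

```python
# Hypothetical sketch: define a Datadog metric monitor via the Monitors API.
# The metric, tag, threshold, and Slack handle are placeholders; VRO's real
# monitors are managed in Datadog and route to the #benefits-vro-alerts channel.
import os
import requests

monitor = {
    "name": "VRO service CPU high",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{service:vro} > 90",  # assumed metric/tag
    "message": "CPU above 90% for 5 minutes. @slack-benefits-vro-alerts",  # assumed handle
}

resp = requests.post(
    "https://api.datadoghq.com/api/v1/monitor",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    json=monitor,
    timeout=30,
)
resp.raise_for_status()
print("Created monitor:", resp.json()["id"])
```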

SecRel

The secure release (SecRel) pipeline is a GitHub Actions workflow that supports continuous delivery outcomes by validating that all software changes have addressed security and privacy risks. SecRel is a critical part of VRO’s development workflow in that it allows for continuous authority to operate (cATO), which removes the burden of going through lengthy approval processes for each release. The pipeline can be manually run against any branch, is automatically run against the develop branch with each merge, and is scheduled to run weekday mornings at 7AM ET against the develop branch.

One of the steps in this pipeline serves to evaluate our dependencies against a database of known vulnerabilities. If new vulnerabilities are found between one scan and the next, the SecRel pipeline will fail and signed images will not be published to GitHub’s container registry (GHCR) to be available for release. On-call engineers will be responsible for reviewing SecRel build results and resolving vulnerabilities as they are discovered.
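
As a convenience, recent SecRel results on develop could be checked through the GitHub Actions API along the lines below; the workflow file name secrel.yml is an assumption.

```python
# Hypothetical sketch: check recent SecRel runs on develop via the GitHub
# Actions API. The repository and workflow file name are assumptions.
import os
import requests

REPO = "department-of-veterans-affairs/abd-vro"  # assumed repository

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/actions/workflows/secrel.yml/runs",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    params={"branch": "develop", "per_page": 5},
    timeout=30,
)
resp.raise_for_status()

for run in resp.json()["workflow_runs"]:
    # "conclusion" is e.g. success, failure, or None while the run is in progress
    print(run["created_at"], run["conclusion"], run["html_url"])
```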

Dependabot

This GitHub tool automatically updates dependencies to their latest versions. On-call engineers will be responsible for reviewing pull requests created by the bot and merging the proposed version updates into the develop branch. It’s important to note that integration and unit tests should be run against new builds since updating dependencies can introduce defects.
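
A quick way to see what is waiting for review might look like the sketch below, which lists open pull requests authored by Dependabot against develop; the repository name is an assumption.

```python
# Hypothetical sketch: list open Dependabot pull requests against develop
# so they can be reviewed and merged. The repository name is assumed.
import os
import requests

REPO = "department-of-veterans-affairs/abd-vro"  # assumed repository

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/pulls",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    params={"state": "open", "base": "develop", "per_page": 50},
    timeout=30,
)
resp.raise_for_status()

for pr in resp.json():
    if pr["user"]["login"] == "dependabot[bot]":
        print(f"#{pr['number']}: {pr['title']} ({pr['html_url']})")
```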

Incident Report Workflow

A Slack workflow was created to improve the process for notifying the VRO team of incidents. The workflow is initiated, typically by partner team members, by clicking a link in the #benefits-vro-support channel and completing a form intended to gather information about the incident. Once the form is submitted, a message containing guidelines for responding to the report is posted to both #benefits-vro-support and #benefits-vro-on-call. An incident reported through this workflow will generate a PagerDuty alert as well as create ZenHub and GitHub tickets; the ZenHub ticket will need to be linked to the "Incident" epic.

Aqua

Aqua scans images for operating system vulnerabilities, malware, and insecure configurations, and monitors images that are deployed into production as part of the SecRel pipeline. The results from this scan in part help determine whether a SecRel run passes. Results are shown in the SecRel run's summary and may include new vulnerabilities discovered between runs or as the result of version upgrades; they should be remediated through LHDI's Aqua webpage. Remediation usually involves requesting exceptions or upgrading to recommended versions of software.

For all AWS related vulnerabilities, please contact the Lighthouse cATO Technical Application Assessor (currently: Andrew Palopoli) to get them addressed.

Snyk

Snyk scans developer source code and third-party libraries, or dependencies, for known vulnerabilities as part of the SecRel pipeline. The results from this scan in part help determine whether a SecRel scan passes. Results from these scans are displayed in a SecRel run’s summary as well as on the repository’s Security tab.

For vulnerabilities that involve upgrading libraries, please look at the logs for the failed stage in the SecRel run and make any necessary PRs (example).

VRO Services, Points of Contact, and Issue Escalation Paths

Please see the VRO Services, Points of Contact, and Issue Escalation Paths document for these details. More information regarding VRO's dependencies can be found in this dependency diagram.
