“Claiming” a service means your squad will become operationally responsible for it. That responsibility comes in two phases:
Phase 1: Squad as Second Responder
This phase begins after EPD Leadership approves your squad’s service claims. This is scheduled to be ~2 days after your kickoff begins.
At that time, the Global SRE on-call will start posting in your squad channel when they’re investigating an alert related to one of your services. During the squad’s working hours, at least one backend or fullstack engineer must join the investigation. “Join the investigation” means joining a Zoom and/or Slack thread, and supporting Global SRE in resolving the alert.
The squad’s Engineering Manager decides which engineer joins and has flexibility in how to approach this. Two example approaches:
- Implement a backend/fullstack “on-call” rotation within the squad. Whoever’s “on-call” joins Global SRE investigations posted to their squad during their working hours. If your backend/fullstack engineers have different working hours, make sure the “on-call” coverage spans all of those hours. For example, if two of your engineers work from 9 AM - 5 PM and the others work from 12 PM - 8 PM, at least one engineer should cover 9 AM - 5 PM and another 12 PM - 8 PM.
- Require all backend/fullstack engineers to join Global SRE investigations during their working hours.
- [ ] EM TO DO: Implement an approach to ensure at least one engineer on your squad engages with Global SRE alerts. Document your approach here.
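If you go with a rotation, the coverage rule from the first example approach can be checked with a few lines of code. The following is only a minimal sketch, not an official tool: the function name `uncovered_gaps` is hypothetical, and it assumes all times fall on the same day (no overnight shifts) and that the span to cover runs from the earliest start to the latest end of your engineers’ working hours.

```python
from datetime import time


def uncovered_gaps(working_hours, shifts):
    """Return the parts of the squad's working span that no "on-call" shift covers.

    working_hours: list of (start, end) tuples, one per engineer.
    shifts: list of (start, end) tuples, one per "on-call" shift.
    Assumption: same-day times only; the span is earliest start to latest end.
    """
    span_start = min(start for start, _ in working_hours)
    span_end = max(end for _, end in working_hours)

    gaps = []
    cursor = span_start  # everything before the cursor is already covered
    for start, end in sorted(shifts):
        if start > cursor:
            # Nobody is on-call between the cursor and this shift's start.
            gaps.append((cursor, min(start, span_end)))
        cursor = max(cursor, end)
        if cursor >= span_end:
            break
    if cursor < span_end:
        gaps.append((cursor, span_end))
    return gaps


# The example from the text: two engineers on 9 AM - 5 PM, the others on 12 PM - 8 PM.
working_hours = [(time(9), time(17)), (time(9), time(17)),
                 (time(12), time(20)), (time(12), time(20))]

full_coverage = [(time(9), time(17)), (time(12), time(20))]
partial_coverage = [(time(9), time(17))]

print(uncovered_gaps(working_hours, full_coverage))
print(uncovered_gaps(working_hours, partial_coverage))
```

An empty result means the whole span is covered. With only the 9 AM - 5 PM shift, the sketch reports 5 PM - 8 PM as uncovered.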
At the same time, the opsbug process changes. The on-call Engineering Manager will triage opsbugs first to Subject Matter Experts (SMEs) by feature area, as defined here, but will do so by posting the ticket in the channel of the squad that owns that feature area and tagging the SME. The SME is responsible for leading the resolution during their working hours, and the engineers on the squad are responsible for participating in it. As with alerts, the Engineering Manager of each squad decides who engages, and when:
- [ ] EM TO DO: Implement an approach to ensure at least one engineer on your squad engages with opsbug Subject Matter Experts (SMEs). Document your approach here.
Phase 2: Squad as First Responder
This phase begins at or before the end of February, ideally after each squad’s engineers have had 3-5 rounds of alert and opsbug resolution participation.
- At this time, squad engineers, rather than Global SRE, become first responders to alerts regarding their services. As before, this applies only during the squad engineers’ working hours. Outside of these hours, Global SRE remains first responder.
- This also means squad members, rather than SRE, will more regularly call incidents and produce Incident Reports.
- Similarly, opsbugs triaged to the squad no longer have SMEs tagged. It is up to the squad to triage and get the support they need to resolve during their working hours.
- The squad is now also responsible for deploying all* PRs affecting their services. However, any engineer can still submit and review PRs related to those services.
- To facilitate other engineers submitting and reviewing PRs, the squad should provide internal documentation on what the service does, and other context engineers would need to submit and review PRs for it. It’s up to Engineering Managers to make sure this happens.
- *The exception is an incident outside the squad’s working hours, in which case the on-call engineer(s) may approve and deploy.
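The Phase 2 alert rule (squad first during working hours, Global SRE otherwise) can be expressed as a small lookup. This is an illustrative sketch only: the function name `first_responder` is hypothetical, times are assumed to fall on the same day, and an engineer’s end time is treated as outside their working hours.

```python
from datetime import time


def first_responder(alert_time, squad_hours):
    """Return who responds first to an alert on a squad-owned service in Phase 2.

    squad_hours: list of (start, end) working-hour tuples for the squad's engineers.
    Assumption: end times are exclusive and all times are on the same day.
    """
    for start, end in squad_hours:
        if start <= alert_time < end:
            return "squad"
    # Outside every squad engineer's working hours, Global SRE remains first responder.
    return "Global SRE"


squad_hours = [(time(9), time(17)), (time(12), time(20))]
print(first_responder(time(10), squad_hours))  # during working hours
print(first_responder(time(22), squad_hours))  # outside working hours
```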
<aside>
❓ If you need help, bring your questions to the Nest HQ room in Roam, or post them in your #epd-leads channel.
</aside>