On Tech

Aim for Operability, not SRE as a Cult

“In 2017, a Director of Ops asked me to turn their sysadmins into ‘SRE consultants’. I reminded them of their operability engineering team driving similar practices, and that I was their lead.

In 2018, a CTO at a gaming company told me SRE was better than DevOps, but recruitment was harder. They said they didn’t know much about SRE.

In 2020, I learned of a sysadmin team that were rebranded as an SRE team, received a small pay increase… and then carried on doing the same sysadmin work.

This is for decision makers who have been told SRE will solve their IT problems…”

Steve Smith

TL;DR:

  • SRE as a Philosophy means the Site Reliability Engineering principles from Google, and is associated with a lot of valuable ideas and insights.
  • SRE as a Cult refers to the marketing of SRE teams and SRE certifications as a panacea for technology problems.
  • Some aspects of SRE as a Philosophy are far harder to apply to enterprise organisations than others, such as SRE teams and error budgets.
  • Operability needs to be a key focus, not SRE as a Cargo Cult, and SRE as a Philosophy can supply solid ideas for improving operability.

Introduction 

A successful Digital transformation is predicated on a transition from IT as a Cost Centre to IT as a Business Differentiator. An IT cost centre creates segregated Delivery and Operations teams, trapped in an endless conflict between feature speed and service reliability. Delivery wants to maximise deployments, to increase speed. Operations wants to minimise deployments, to increase reliability.

In Accelerate, Dr Nicole Forsgren et al confirm this produces low performance IT, and has negative consequences for profitability, market share, and productivity. Accelerate also demonstrates speed and reliability are not a zero sum game. Investing in both feature speed and service reliability will produce a high performance IT capability that can uncover new product revenue streams.

SRE as a Philosophy

In 2004, Ben Treynor Sloss started an initiative called SRE within Google. He later described SRE as a software engineering approach to IT operations, with developers automating work historically owned outside Google by sysadmins. SRE was disseminated in 2016 by the seminal book Site Reliability Engineering, by Betsy Beyer et al. Key concepts include:

Availability levels are known by the nines of availability. 99.0% is two nines, 99.999% is five nines. 100% availability is unachievable, as less reliable user devices will limit the user experience. 100% is also undesirable, as maximising availability limits speed of feature delivery and increases operational costs. Site Reliability Engineering contains the astute observation that ‘an additional nine of reliability requires an order of magnitude improvement’. At any availability level, an amount of unplanned downtime needs to be tolerated, in order to invest in feature delivery.
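
To make the nines concrete, here is a minimal Python sketch that converts a count of nines into an availability percentage and a downtime allowance. The function names and the 30-day period are assumptions for illustration, and the sketch treats availability as time-based rather than request-based.

```python
def availability_percent(nines: int) -> float:
    """Availability for a count of nines, e.g. 2 -> 99.0 and 5 -> 99.999."""
    return 100 * (1 - 10 ** -nines)


def tolerated_downtime_minutes(nines: int, period_days: int = 30) -> float:
    """Unplanned downtime that fits inside the availability level for a period."""
    period_minutes = period_days * 24 * 60  # assumed 30-day period by default
    return period_minutes * 10 ** -nines


if __name__ == "__main__":
    for nines in range(2, 6):
        print(
            f"{nines} nines: {availability_percent(nines):.3f}% available, "
            f"{tolerated_downtime_minutes(nines):.2f} minutes of downtime per 30 days"
        )
```

Each extra nine divides the downtime allowance by ten, which is one way to see why the last nine is so much harder to reach than the first.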

A Service Level Objective (SLO) is a published target range of measurements, which sets user expectations on an aspect of service performance. A product manager chooses SLOs, based on their own risk tolerance. They have to balance the engineering cost of meeting an SLO with user needs, the revenue potential of the service, and competitor offerings. An availability SLO could be a median request success rate of 99.9% in 24 hours. 
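
As an illustration of how such an SLO might be checked, the sketch below reads "median request success rate in 24 hours" as the median of hourly success rates. That reading is an assumption, as are the function names and the (successful, total) tuple layout.

```python
from statistics import median


def hourly_success_rates(windows: list[tuple[int, int]]) -> list[float]:
    """Turn (successful, total) request counts per hour into success rates."""
    # Hours with no traffic are skipped, which is an assumption of this sketch.
    return [ok / total for ok, total in windows if total > 0]


def meets_availability_slo(windows: list[tuple[int, int]], target: float = 0.999) -> bool:
    """True if the median hourly success rate meets the target."""
    rates = hourly_success_rates(windows)
    return bool(rates) and median(rates) >= target


if __name__ == "__main__":
    # 21 clean hours and 3 degraded hours: the median is still 100%.
    day = [(1000, 1000)] * 21 + [(990, 1000)] * 3
    print(meets_availability_slo(day))
```

A median tolerates a few bad hours. A product manager with a lower risk tolerance might prefer a stricter statistic, such as the worst hour.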

An error budget is a quarterly amount of tolerable, unplanned downtime for a service. It is used to mitigate any inter-team conflicts between product teams and SRE teams, as found in You Build It Ops Run It. It is calculated as 100% minus the chosen nines of availability. For example, an availability level of 99.9% equates to an error budget of 0.1% unsuccessful requests. If 0.02% of the quarter's requests fail in one week, that consumes 20% of the error budget, and leaves 80% for the rest of the quarter.
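
The arithmetic can be sketched in a few lines of Python. The function names are illustrative assumptions. The sketch measures the budget as a percentage of requests in the quarter, so a 99.9% target gives 100% - 99.9% = 0.1%.

```python
def error_budget_percent(availability_percent: float) -> float:
    """Error budget: 100% minus the chosen availability level."""
    if not 0 < availability_percent < 100:
        # A 100% target leaves no budget at all, so it is rejected here.
        raise ValueError("availability must be above 0% and below 100%")
    return 100.0 - availability_percent


def budget_consumed(failed_percent: float, availability_percent: float) -> float:
    """Share of the error budget used by failed requests, as a fraction."""
    return failed_percent / error_budget_percent(availability_percent)


if __name__ == "__main__":
    budget = error_budget_percent(99.9)
    used = budget_consumed(0.02, 99.9)
    print(f"budget: {budget:.2f}% of requests")
    print(f"consumed: {used:.0%}, remaining: {1 - used:.0%}")
```

With a 99.9% target, failures amounting to 0.02% of the quarter's requests use one fifth of the budget.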

You Build It SRE Run It is a conditional production support method, where a team of SREs support a service for a product team. All product teams do You Build It You Run It by default, and there are strict entry and exit criteria for an SRE team. A service must have a critical level of user traffic, some elevated SLOs, and pass a readiness review. The SREs will take over on-call, and ensure SLOs are consistently met. The product team can launch new features if the service is within its error budget. If not, they cannot deploy until any errors are resolved. If the error budget is repeatedly blown, the SRE team can hand on-call back to the product team, who revert to You Build It You Run It.

This is SRE as a Philosophy. The biggest gift from SRE is a framework for quantifying availability targets and engineering effort, based on product revenue. SRE has also promoted ideas such as measuring partial availability, monitoring the golden signals of a service, building SLO alerts and SLI dashboards from the same telemetry data, and reducing operational toil where possible.  

SRE as a Cult

In the 2010s, the DevOps philosophy of collaboration was bastardised by the ubiquitous DevOps as a Cult. Its beliefs are:

  1. The divide between Delivery and Operations teams is always the constraint in IT performance. 
  2. DevOps automation tools, DevOps engineers, DevOps teams, and/or DevOps certifications are always solutions to that problem.

In a similar vein, the SRE philosophy has been corrupted by SRE as a Cult. The SRE cargo cult is based on the same flawed premise, and espouses SRE error budgets, SRE engineers, SRE teams, and SRE certifications as a panacea. Examples include Patrick Hill stating in Love DevOps? Wait until you meet SRE that ‘SRE removes the conjecture and debate over what can be launched and when’, and the DevOps Institute offering SRE certification.

SRE as a Cult ignores the central question facing the SRE philosophy – its applicability to IT as a Cost Centre. SRE originated from talented, opinionated software engineers inside Google, where IT as a Business Differentiator is a core tenet. Using A Typology of Organisational Cultures by Ron Westrum, the Google culture can be described as generative. Accelerate confirms this is predictive of high performance IT, and less employee burnout.

There are fundamental challenges with applying SRE to an IT as a Cost Centre organisation with a bureaucratic or pathological culture. Product, Delivery, and Operations teams will be hindered by orthogonal incentives, funding pressures, and silo rivalries. 

For availability levels, if failure leads to scapegoating or justice:

  • Heads of Product/Delivery/Operations might not agree 100% reliability is unachievable.
  • Heads of Product/Delivery/Operations might not accept an additional nine of reliability means an order of magnitude more engineering effort. 
  • Heads of Delivery/Operations might not consent to availability levels being owned by product managers.

For Service Level Objectives, if responsibilities are shirked or discouraged:

  • Product managers might decline to take on responsibility for service availability.
  • Product managers will need help from Delivery teams to uncover user expectations, calculate service revenue potential, and check competitor availability levels.
  • Sysadmins might object to developers wiring automated, fine-grained measurements into their own production alerts. 

For error budgets, if cooperation is modest or low:

  • Product managers/developers/sysadmins might disagree on availability levels and the maths behind error budgets.
  • Heads of Product/Development might not accept a block on deployments when an error budget is 0%.
  • A Head of Operations might not accept deployments at all hours when an error budget is above 0%.
  • Product managers/developers might accuse sysadmins of blocking deployments unnecessarily.
  • Sysadmins might accuse product managers/developers of jeopardising reliability.
  • A Head of Operations might arbitrarily block production deployments.
  • A Head of Development might escalate a block on production deployments.
  • A Head of Product might override a block on production deployments.

For You Build It SRE Run It, if bridging is merely tolerated or discouraged:

  • A Head of Operations might not consent to on-call Delivery teams on their opex budget.
  • A Head of Development might not consent to on-call Delivery teams on their capex budget.
  • A Head of Operations might be unable to afford months of software engineering training for their sysadmins on an opex budget.
  • Sysadmins might not want to undergo training, or be rebadged as SREs.
  • Developers might not want to do on-call for their services, or be rebadged as SREs.
  • Delivery teams will find it hard to collaborate with an Operations SRE team on errors and incident management.
  • A Head of Operations might be unable to transfer an unreliable service back to the original Delivery team, if it was disbanded when its capex funding ended.

In Site Reliability Engineering, Ben Treynor Sloss identifies SRE recruitment as a significant challenge for Google. SRE needs developers who excel in both software engineering and systems administration, and they are rare. He counters this by arguing an SRE team is cheaper than an Operations team, as headcount is reduced by task automation. Recruitment challenges will be exacerbated by smaller budgets in IT as a Cost Centre organisations. The touted headcount benefit is absurd, as salary rates are invariably higher for developers than sysadmins.

Aim for Operability, not SRE as a Cult

High performance IT requires Continuous Delivery and Operability. Operability refers to the ease of safely and reliably operating production systems. Increasing service operability will improve reliability, reduce operational rework, and increase feature speed. Operability practices include prioritising operational requirements, automated infrastructure, deployment health checks, pervasive telemetry, failure injection, incident swarming, learning from incidents, and You Build It You Run It.

These practices can be implemented with, and without SRE. In addition, some SRE concepts such as availability levels and Service Level Objectives can be implemented independently of SRE. In particular, product managers being responsible for calculating availability levels based on their risk tolerances is often a major step forward from the status quo.

SRE as a Cult obscures important questions about SRE applicability to SMEs and enterprise organisations. You Build It SRE Run It is a difficult fit for an IT as a Cost Centre organisation, and is not cost effective at every availability level. The amount of investment required in employee training, organisational change, and task automation to run an SRE team alongside You Build It You Run It teams is an order of magnitude more than You Build It You Run It itself. It is only warranted when multiple services exist with critical user traffic, and at an availability level of four nines or more.

An IT as a Cost Centre organisation would do well to implement You Build It You Run It instead. It unlocks daily deployments, by eliminating handoffs between Delivery and Operations teams. It minimises incident resolution times, via single-level swarming support prioritised ahead of feature development. Furthermore, it maximises incentives for developers to focus on operational features, as they are on-call out of hours themselves. It is a cost effective method of revenue protection, from two nines to five nines of availability.

In some cases, an SME or enterprise organisation will earn tens of millions in product revenues each day, its reliability needs will be extreme, and investing in SRE as a Philosophy could be warranted. Otherwise, heed the perils of SRE as a Cult. As Luke Stone said in Seeking SRE, ‘in the long run, SRE is not going to thrive in your organisation based purely on its current popularity’.

Acknowledgements

Thanks to Adam Hansrod, Dave Farley, Denise Yu, John Allspaw, Spike Lindsey, and Thierry de Pauw for their feedback.

You Build It SRE Run It

How does Site Reliability Engineering (SRE) approach production support? Why is it conditional, and how do error budgets try to avoid the inter-team conflicts of You Build It Ops Run It?

This is part of the Who Runs It series.

Introduction

The usual alternative to the You Build It Ops Run It production support method is You Build It You Run It. This means a development team is responsible for supporting its own services in production. It eliminates handoffs between developers and sysadmins, and maximises operability incentives for developers. It has the ability to unlock daily deployments, and improve production reliability. 

A less common alternative to You Build It Ops Run It is a Site Reliability Engineering (SRE) on-call team. This can be referred to as You Build It SRE Run It. It is a conditional production support method, with an operations-focussed development team supporting critical services owned by other development teams. 

SRE is a software engineering approach to IT operations. It started at Google in 2004, and was popularised by Betsy Beyer et al in the 2016 book Site Reliability Engineering. In The SRE model, Jaana Dogan states ‘what makes Google SRE significantly different is not just their world-class expertise, but the fact that they are optional’. An SRE on-call team has strict entry and exit criteria for services. The process is:

  1. A development team does You Build It You Run It by default. Their service has a quarterly error budget. 
  2. If user traffic becomes substantial, the development team requests SRE on-call assistance. Their service must pass a readiness review.
  3. If the review is successful, the development team shares the on-call rota with some SREs. 
  4. If user traffic becomes critical, the development team hands over the on-call rota to a team of SREs.
  5. The SRE team automates operational tasks to improve service availability, latency, and performance. They monitor the service, and respond to any incidents. 
  6. If the service is inside its error budget, the development team can launch new features without involving the SRE team.
  7. If the service is outside its error budget, the development team cannot launch new features until the SRE team is satisfied all errors are resolved. 
  8. If the service is consistently outside its error budget, the SRE team hands the on-call rota back to the development team. The service reverts to You Build It You Run It.
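
The entry and exit criteria above can be sketched as a small decision function in Python. The enum names, the traffic levels, and the boolean inputs are assumptions made for illustration, and a real readiness review would involve far more than a flag.

```python
from enum import Enum


class Traffic(Enum):
    NORMAL = 1
    SUBSTANTIAL = 2
    CRITICAL = 3


class Support(Enum):
    YOU_BUILD_IT_YOU_RUN_IT = "development team on-call"
    SHARED_ROTA = "development team and SREs share the rota"
    SRE_RUN_IT = "SRE team on-call"


def support_model(
    traffic: Traffic,
    passed_readiness_review: bool,
    consistently_outside_budget: bool,
) -> Support:
    """Pick the on-call model for a service (steps 1 to 4, and step 8)."""
    # Steps 1 and 8: the default, and the fallback after repeated budget loss.
    if consistently_outside_budget or not passed_readiness_review:
        return Support.YOU_BUILD_IT_YOU_RUN_IT
    if traffic is Traffic.CRITICAL:
        return Support.SRE_RUN_IT  # step 4
    if traffic is Traffic.SUBSTANTIAL:
        return Support.SHARED_ROTA  # step 3
    return Support.YOU_BUILD_IT_YOU_RUN_IT


def can_launch_features(inside_error_budget: bool, sre_satisfied: bool = False) -> bool:
    """Steps 6 and 7: launches need budget, or SRE sign-off once errors are resolved."""
    return inside_error_budget or sre_satisfied


if __name__ == "__main__":
    print(support_model(Traffic.CRITICAL, True, False).value)
    print(can_launch_features(inside_error_budget=False))
```

Keeping the rules this explicit shows how conditional the model is. Every branch other than the default depends on a review or a budget that both teams have to accept.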

In a startup with IT as a Business Differentiator, an SRE on-call team is a product team like any other development team. Those development teams might support their own services, or rely on the SRE on-call team.

In an SME or enterprise organisation with IT as a Cost Centre, You Build It SRE Run It is very different. There are segregated Delivery and Operations functions, due to COBIT and Plan-Build-Run. The SRE on-call team could be within the Delivery function, and report into the Head of Delivery.

Alternatively, the SRE on-call team could be within the Operations function, and report into the Head of Operations.

In IT as a Cost Centre, You Build It SRE Run It consists of single-level and multi-level support. An SRE on-call team participates in multiple support levels, with the Delivery teams that rely on them. A Delivery team supporting its own service has single-level swarming.

The Service Desk handles incoming customer requests. They can link a ticket in the incident management system to a specific web page or user journey, which reassigns the ticket to the correct on-call team. Delivery teams doing You Build It You Run It are L1 on-call for their own services. The SRE on-call team is L1 on-call for critical services, and when necessary they can escalate issues to the L2 Delivery teams building those services. 
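
A minimal Python sketch of that routing is shown below. The team names, the user journey keys, and the data model are all hypothetical, and a real incident management system would hold this mapping in its own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    delivery_team: str
    sre_supported: bool  # True for critical services taken on by the SRE team


SRE_TEAM = "sre-on-call"  # hypothetical team name


def on_call_levels(service: Service) -> list[str]:
    """Return the on-call teams in escalation order, L1 first."""
    if service.sre_supported:
        # SREs are L1, and escalate to the Delivery team that builds the service.
        return [SRE_TEAM, service.delivery_team]
    # You Build It You Run It: the Delivery team is the only level.
    return [service.delivery_team]


def route_ticket(user_journey: str, journeys: dict[str, Service]) -> str:
    """Assign a ticket to the L1 team for the service behind a user journey."""
    return on_call_levels(journeys[user_journey])[0]


if __name__ == "__main__":
    journeys = {
        "/checkout": Service("checkout", "payments-team", sre_supported=True),
        "/search": Service("search", "search-team", sre_supported=False),
    }
    print(route_ticket("/checkout", journeys))
    print(route_ticket("/search", journeys))
```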

If the SRE on-call team is in Delivery, they will be funded by a capex Delivery budget. The Service Desk will be funded out of an Operations opex budget.

If the SRE on-call team is in Operations, they will be funded by an Operations opex budget like the Service Desk team.

Continuous Delivery and Operability

In You Build It SRE Run It, delivery teams on-call for their own production services experience the usual benefits of You Build It You Run It. Using an SRE on-call team and error budgets is a different way to prioritise service availability and incident resolution. Delivery teams reliant on an SRE on-call team are encouraged to limit their failure blast radius, to protect their error budget. The option for an SRE on-call team to hand back an on-call rota to a delivery team is a powerful reminder that operability needs a continual investment.  

You Build It SRE Run It has these advantages for product development:

  • Short deployment lead times. Lead times are minimised as there are no handoffs to the SRE on-call team.
  • Focus on outcomes. Delivery teams are empowered to test product hypotheses and deliver outcomes.
  • Short incident resolution times. Incident response from the SRE on-call team is rapid and effective. 
  • Adaptive architecture. Services will be architected for failure, including Circuit Breakers and Canary Deployments.
  • Product telemetry. Delivery teams continually update dashboards and alerts for the SRE on-call team, according to the product context.

You Build It SRE Run It creates strong incentives for operability. Delivery teams on-call for their own services will have the maximum incentives to balance operational features with product features. There is 1 on-call engineer per team, at a low capex cost with no knowledge synchronisation costs between teams.

Delivery teams collaborating with an SRE on-call team do not have maximum operability incentives, as another team supports critical services with high levels of user traffic on their behalf. Theoretically, strong incentives remain due to error budgets. The ability of a delivery team to maintain a high deployment throughput without intervention depends on protecting service availability. This should ensure product managers prioritise operational features alongside product features. There is 1 on-call SRE for critical services at a capex or opex cost, and knowledge synchronisation costs between teams are inevitable.

Overinvesting in inapplicability

Production support is revenue insurance. At first glance, it might make sense to pay a premium for a high-powered SRE team to support highly available services with critical levels of user traffic. However, investing in an SRE on-call team should be questioned when its applicability to IT as a Cost Centre is so challenging. 

Funding an SRE on-call team will be constrained by cost accounting. An SRE team in Delivery will have a capex budget, and undergo periodic funding renewals. An SRE team in Operations will have an opex budget, and endure regular pressure to find cost efficiencies. Either approach is at odds with a long term commitment to a large team of highly paid software engineers.

Error budgets are unlikely to magically solve the politics and bureaucracy that exist between Delivery teams and an SRE on-call team. Product managers, developers, and/or sysadmins might not agree on a service availability level, availability losses in recent incidents, and/or the remaining latitude in an error budget. A Head of Product might not accept an SRE block on deployments, when an error budget is lost. A Head of Delivery or Operations might not accept deployments at all hours, even with an error budget in place. In addition, an SRE on-call team might be unable to hand an on-call rota back to a Delivery team, if it was disbanded when its capex funding ended.

In Site Reliability Engineering, Betsy Beyer et al describe near-universally applicable SRE practices, such as revenue-based availability targets and service level objectives. The authors also make the astute observation ‘an additional nine of reliability requires an order of magnitude improvement’. A 99.99% service requires 10x more engineering effort than 99.9%, and 100x more than 99.0%. You Build It SRE Run It is not easily applied to IT as a Cost Centre, and it requires a sizable investment in culture, people, process, and tools. It is best suited to organisations with a website that genuinely requires 99.99% availability, and where the maximum revenue loss in a large-scale failure could jeopardise the organisation itself. In a majority of scenarios, You Build It You Run It will be a simpler and more cost effective alternative.

Acknowledgements

Thanks to Thierry de Pauw.

The Who Runs It series:

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

© 2024 Steve Smith
