On Tech

Category: Operability (Page 1 of 2)

Aim for Operability, not SRE as a Cult

“In 2017, a Director of Ops asked me to turn their sysadmins into ‘SRE consultants’. I reminded them of their operability engineering team driving similar practices, and that I was their lead.

In 2018, a CTO at a gaming company told me SRE was better than DevOps, but recruitment was harder. They said they didn’t know much about SRE.

In 2020, I learned of a sysadmin team that were rebranded as an SRE team, received a small pay increase… and then carried on doing the same sysadmin work.

This is for decision makers who have been told SRE will solve their IT problems…”

Steve Smith

TL;DR:

  • SRE as a Philosophy means the Site Reliability Engineering principles from Google, and is associated with a lot of valuable ideas and insights.
  • SRE as a Cult refers to the marketing of SRE teams and SRE certifications as a panacea for technology problems.
  • Some aspects of SRE as a Philosophy are far harder to apply to enterprise organisations than others, such as SRE teams and error budgets.
  • Operability needs to be a key focus, not SRE as a Cargo Cult, and SRE as a Philosophy can supply solid ideas for improving operability.

Introduction 

A successful Digital transformation is predicated on a transition from IT as a Cost Centre to IT as a Business Differentiator. An IT cost centre creates segregated Delivery and Operations teams, trapped in an endless conflict between feature speed and service reliability. Delivery wants to maximise deployments, to increase speed. Operations wants to minimise deployments, to increase reliability.

In Accelerate, Dr Nicole Forsgren et al confirm this produces low performance IT, and has negative consequences for profitability, market share, and productivity. Accelerate also demonstrates speed and reliability are not a zero sum game. Investing in both feature speed and service reliability will produce a high performance IT capability that can uncover new product revenue streams.

SRE as a Philosophy

In 2004, Ben Treynor Sloss started an initiative called SRE within Google. He later described SRE as a software engineering approach to IT operations, with developers automating work historically owned outside Google by sysadmins. SRE was disseminated in 2016 by the seminal book Site Reliability Engineering, by Betsy Beyer et al. Key concepts include:

Availability levels are known by the nines of availability. 99.0% is two nines, 99.999% is five nines. 100% availability is unachievable, as less reliable user devices will limit the user experience. 100% is also undesirable, as maximising availability limits speed of feature delivery and increases operational costs. Site Reliability Engineering contains the astute observation that ‘an additional nine of reliability requires an order of magnitude improvement’. At any availability level, an amount of unplanned downtime needs to be tolerated, in order to invest in feature delivery.

A Service Level Objective (SLO) is a published target range of measurements, which sets user expectations on an aspect of service performance. A product manager chooses SLOs, based on their own risk tolerance. They have to balance the engineering cost of meeting an SLO with user needs, the revenue potential of the service, and competitor offerings. An availability SLO could be a median request success rate of 99.9% in 24 hours. 

An error budget is a quarterly amount of tolerable, unplanned downtime for a service. It is used to mitigate any inter-team conflicts between product teams and SRE teams, as found in You Build It Ops Run It. It is calculated as 100% minus the chosen nines of availability. For example, an availability level of 99.9% equates to an error budget of 0.1% unsuccessful requests. 0.02% of failing requests in a week would consume 20% of the error budget, and leave 80% for the quarter.
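As an aside, the arithmetic can be made concrete with a few lines of code. This is a minimal sketch, assuming the error budget is expressed as a tolerable fraction of failed requests; the function names are illustrative rather than taken from any SRE tooling.

```python
def error_budget(availability_target: float) -> float:
    """The error budget is the tolerable fraction of failed requests: 100% minus the target."""
    return 1.0 - availability_target


def budget_consumed(failed: int, total: int, availability_target: float) -> float:
    """Fraction of the error budget consumed by an observed failure rate."""
    return (failed / total) / error_budget(availability_target)


# A 99.9% availability level gives a 0.1% error budget.
# 20 failures in 100,000 requests (0.02%) consumes roughly 20% of that budget, leaving 80%.
print(budget_consumed(failed=20, total=100_000, availability_target=0.999))  # ~0.2
```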

You Build It SRE Run It is a conditional production support method, where a team of SREs support a service for a product team. All product teams do You Build It You Run It by default, and there are strict entry and exit criteria for an SRE team. A service must have a critical level of user traffic, some elevated SLOs, and pass a readiness review. The SREs will take over on-call, and ensure SLOs are consistently met. The product team can launch new features if the service is within its error budget. If not, they cannot deploy until any errors are resolved. If the error budget is repeatedly blown, the SRE team can hand on-call back to the product team, who revert to You Build It You Run It.

This is SRE as a Philosophy. The biggest gift from SRE is a framework for quantifying availability targets and engineering effort, based on product revenue. SRE has also promoted ideas such as measuring partial availability, monitoring the golden signals of a service, building SLO alerts and SLI dashboards from the same telemetry data, and reducing operational toil where possible.  
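For instance, deriving SLO alerts and SLI dashboards from the same telemetry data can be as simple as computing one success ratio from raw request counters and reusing it in both places. The sketch below is illustrative only; the class and counter names are hypothetical, not from any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class WindowedCounters:
    """Raw telemetry for a time window: total and failed request counts."""
    total_requests: int
    failed_requests: int

    @property
    def success_ratio(self) -> float:
        """The SLI: fraction of successful requests in the window."""
        if self.total_requests == 0:
            return 1.0
        return 1 - (self.failed_requests / self.total_requests)


def slo_breached(counters: WindowedCounters, slo: float = 0.999) -> bool:
    """Fire an SLO alert when the SLI drops below the target."""
    return counters.success_ratio < slo


window = WindowedCounters(total_requests=250_000, failed_requests=400)
print(window.success_ratio)   # 0.9984 - plotted on the SLI dashboard
print(slo_breached(window))   # True   - drives the SLO alert
```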

SRE as a Cult

In the 2010s, the DevOps philosophy of collaboration was bastardised by the ubiquitous DevOps as a Cult. Its beliefs are:

  1. The divide between Delivery and Operations teams is always the constraint in IT performance. 
  2. DevOps automation tools, DevOps engineers, DevOps teams, and/or DevOps certifications are always solutions to that problem.

In a similar vein, the SRE philosophy has been corrupted by SRE as a Cult. The SRE cargo cult is based on the same flawed premise, and espouses SRE error budgets, SRE engineers, SRE teams, and SRE certifications as a panacea. Examples include Patrick Hill stating in Love DevOps? Wait until you meet SRE that ‘SRE removes the conjecture and debate over what can be launched and when’, and the DevOps Institute offering SRE certifications.

SRE as a Cult ignores the central question facing the SRE philosophy – its applicability to IT as a Cost Centre. SRE originated from talented, opinionated software engineers inside Google, where IT as a Business Differentiator is a core tenet. Using A Typology of Organisational Cultures by Ron Westrum, the Google culture can be described as generative. Accelerate confirms this is predictive of high performance IT, and less employee burnout.

There are fundamental challenges with applying SRE to an IT as a Cost Centre organisation with a bureaucratic or pathological culture. Product, Delivery, and Operations teams will be hindered by orthogonal incentives, funding pressures, and silo rivalries. 

For availability levels, if failure leads to scapegoating or justice:

  • Heads of Product/Delivery/Operations might not agree 100% reliability is unachievable.
  • Heads of Product/Delivery/Operations might not accept an additional nine of reliability means an order of magnitude more engineering effort. 
  • Heads of Delivery/Operations might not consent to availability levels being owned by product managers.

For Service Level Objectives, if responsibilities are shirked or discouraged:

  • Product managers might decline to take on responsibility for service availability.
  • Product managers will need help from Delivery teams to uncover user expectations, calculate service revenue potential, and check competitor availability levels.
  • Sysadmins might object to developers wiring automated, fine-grained measurements into their own production alerts. 

For error budgets, if cooperation is modest or low:

  • Product managers/developers/sysadmins might disagree on availability levels and the maths behind error budgets.
  • Heads of Product/Development might not accept a block on deployments when an error budget is 0%.
  • A Head of Operations might not accept deployments at all hours when an error budget is above 0%.
  • Product managers/developers might accuse sysadmins of blocking deployments unnecessarily.
  • Sysadmins might accuse product managers/developers of jeopardising reliability.
  • A Head of Operations might arbitrarily block production deployments.
  • A Head of Development might escalate a block on production deployments.
  • A Head of Product might override a block on production deployments.

For You Build It SRE Run It, if bridging is merely tolerated or discouraged:

  • A Head of Operations might not consent to on-call Delivery teams on their opex budget
  • A Head of Development might not consent to on-call Delivery teams on their capex budget
  • A Head of Operations might be unable to afford months of software engineering training for their sysadmins on an opex budget
  • Sysadmins might not want to undergo training, or be rebadged as SREs
  • Developers might not want to do on-call for their services, or be rebadged as SREs
  • Delivery teams will find it hard to collaborate with an Operations SRE team on errors and incident management
  • A Head of Operations might be unable to transfer an unreliable service back to the original Delivery team, if it was disbanded when its capex funding ended 

In Site Reliability Engineering, Ben Treynor Sloss identifies SRE recruitment as a significant challenge for Google. Developers are needed who excel in both software engineering and systems administration, which is rare. He counters this by arguing an SRE team is cheaper than an Operations team, as headcount is reduced by task automation. Recruitment challenges will be exacerbated by smaller budgets in IT as a Cost Centre organisations. The touted headcount benefit is absurd, as salary rates are invariably higher for developers than sysadmins.

Aim for Operability, not SRE as a Cult

High performance IT requires Continuous Delivery and Operability. Operability refers to the ease of safely and reliably operating production systems. Increasing service operability will improve reliability, reduce operational rework, and increase feature speed. Operability practices include prioritising operational requirements, automated infrastructure, deployment health checks, pervasive telemetry, failure injection, incident swarming, learning from incidents, and You Build It You Run It.

These practices can be implemented with or without SRE. In addition, some SRE concepts such as availability levels and Service Level Objectives can be implemented independently of SRE. In particular, product managers being responsible for calculating availability levels based on their risk tolerances is often a major step forward from the status quo.

SRE as a Cult obscures important questions about SRE applicability to SMEs and enterprise organisations. You Build It SRE Run It is a difficult fit for an IT as a Cost Centre organisation, and is not cost effective at all availability levels. The amount of investment required in employee training, organisational change, and task automation to run an SRE team alongside You Build It You Run It teams is an order of magnitude more than You Build It You Run It itself. It is only warranted when multiple services exist with critical user traffic, and at an availability level of four nines or more. 

An IT as a Cost Centre organisation would do well to implement You Build It You Run It instead. It unlocks daily deployments, by eliminating handoffs between Delivery and Operations teams. It minimises incident resolution times, via single-level swarming support prioritised ahead of feature development. Furthermore, it maximises incentives for developers to focus on operational features, as they are on-call out of hours themselves. It is a cost effective method of revenue protection, from two nines to five nines of availability.

In some cases, an SME or enterprise organisation will earn tens of millions in product revenues each day, its reliability needs will be extreme, and investing in SRE as a Philosophy could be warranted. Otherwise, heed the perils of SRE as a Cult. As Luke Stone said in Seeking SRE, ‘in the long run, SRE is not going to thrive in your organisation based purely on its current popularity’.

Acknowledgements

Thanks to Adam Hansrod, Dave Farley, Denise Yu, John Allspaw, Spike Lindsey, and Thierry de Pauw for their feedback.

You Build It SRE Run It

How does Site Reliability Engineering (SRE) approach production support? Why is it conditional, and how do error budgets try to avoid the inter-team conflicts of You Build It Ops Run It?

This is part of the Who Runs It series.

Introduction

The usual alternative to the You Build It Ops Run It production support method is You Build It You Run It. This means a development team is responsible for supporting its own services in production. It eliminates handoffs between developers and sysadmins, and maximises operability incentives for developers. It has the ability to unlock daily deployments, and improve production reliability. 

A less common alternative to You Build It Ops Run It is a Site Reliability Engineering (SRE) on-call team. This can be referred to as You Build It SRE Run It. It is a conditional production support method, with an operations-focussed development team supporting critical services owned by other development teams. 

SRE is a software engineering approach to IT operations. It started at Google in 2004, and was popularised by Betsy Beyer et al in the 2016 book Site Reliability Engineering. In The SRE model, Jaana Dogan states ‘what makes Google SRE significantly different is not just their world-class expertise, but the fact that they are optional’. An SRE on-call team has strict entry and exit criteria for services. The process is:

  1. A development team does You Build It You Run It by default. Their service has a quarterly error budget. 
  2. If user traffic becomes substantial, the development team requests SRE on-call assistance. Their service must pass a readiness review.
  3. If the review is successful, the development team shares the on-call rota with some SREs. 
  4. If user traffic becomes critical, the development team hands over the on-call rota to a team of SREs.
  5. The SRE team automates operational tasks to improve service availability, latency, and performance. They monitor the service, and respond to any incidents. 
  6. If the service is inside its error budget, the development team can launch new features without involving the SRE team.
  7. If the service is outside its error budget, the development team cannot launch new features until the SRE team is satisfied all errors are resolved. 
  8. If the service is consistently outside its error budget, the SRE team hands the on-call rota back to the development team. The service reverts to You Build It You Run It.
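The deployment gate in steps 6 to 8 can be sketched as a simple policy check. This is a minimal illustration, assuming the remaining error budget is tracked as a fraction and overspends are counted per quarter; none of the names come from Google's tooling.

```python
def can_launch_features(budget_remaining: float) -> bool:
    """Steps 6 and 7: new features may launch only while error budget remains."""
    return budget_remaining > 0.0


def should_hand_back_on_call(quarters_over_budget: int, limit: int = 2) -> bool:
    """Step 8: repeated overspends return the on-call rota to the development team."""
    return quarters_over_budget >= limit


print(can_launch_features(budget_remaining=0.35))        # True  - launches allowed
print(can_launch_features(budget_remaining=0.0))         # False - blocked until errors are resolved
print(should_hand_back_on_call(quarters_over_budget=3))  # True  - revert to You Build It You Run It
```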

In a startup with IT as a Business Differentiator, an SRE on-call team is a product team like any other development team. Those development teams might support their own services, or rely on the SRE on-call team.

In an SME or enterprise organisation with IT as a Cost Centre, You Build It SRE Run It is very different. There are segregated Delivery and Operations functions, due to COBIT and Plan-Build-Run. The SRE on-call team could be within the Delivery function, and report into the Head of Delivery.

Alternatively, the SRE on-call team could be within the Operations function, and report into the Head of Operations.

In IT as a Cost Centre, You Build It SRE Run It consists of single-level and multi-level support. An SRE on-call team participates in multiple support levels, with the Delivery teams that rely on them. A Delivery team supporting its own service has single-level swarming support.

The Service Desk handles incoming customer requests. They can link a ticket in the incident management system to a specific web page or user journey, which reassigns the ticket to the correct on-call team. Delivery teams doing You Build It You Run It are L1 on-call for their own services. The SRE on-call team is L1 on-call for critical services, and when necessary they can escalate issues to the L2 Delivery teams building those services. 
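That reassignment step can be pictured as a lookup from web page or user journey to an on-call rota. A minimal sketch follows, with an invented mapping; a real incident management system would expose this differently.

```python
# Hypothetical mapping from user journey to the rota that is L1 on-call for it.
JOURNEY_TO_ONCALL = {
    "checkout": "sre-on-call",        # critical service, the SRE team is L1
    "product-search": "team-search",  # You Build It You Run It, the Delivery team is L1
    "order-history": "team-orders",
}


def assign_ticket(journey: str) -> str:
    """Route a Service Desk ticket to the L1 on-call rota for the affected journey."""
    return JOURNEY_TO_ONCALL.get(journey, "service-desk-triage")


print(assign_ticket("checkout"))         # sre-on-call
print(assign_ticket("unknown-journey"))  # service-desk-triage
```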

If the SRE on-call team is in Delivery, they will be funded by a capex Delivery budget. The Service Desk will be funded out of an Operations opex budget.

If the SRE on-call team is in Operations, they will be funded by an Operations opex budget like the Service Desk team.

Continuous Delivery and Operability

In You Build It SRE Run It, delivery teams on-call for their own production services experience the usual benefits of You Build It You Run It. Using an SRE on-call team and error budgets is a different way to prioritise service availability and incident resolution. Delivery teams reliant on an SRE on-call team are encouraged to limit their failure blast radius, to protect their error budget. The option for an SRE on-call team to hand back an on-call rota to a delivery team is a powerful reminder that operability needs a continual investment.  

You Build It SRE Run It has these advantages for product development:

  • Short deployment lead times. Lead times are minimised as there are no handoffs to the SRE on-call team.
  • Focus on outcomes. Delivery teams are empowered to test product hypotheses and deliver outcomes.
  • Short incident resolution times. Incident response from the SRE on-call team is rapid and effective. 
  • Adaptive architecture. Services will be architected for failure, including Circuit Breakers and Canary Deployments.
  • Product telemetry. Delivery teams continually update dashboards and alerts for the SRE on-call team, according to the product context.

You Build It SRE Run It creates strong incentives for operability. Delivery teams on-call for their own services will have the maximum incentives to balance operational features with product features. There is 1 on-call engineer per team, at a low capex cost with no knowledge synchronisation costs between teams.

Delivery teams collaborating with an SRE on-call team do not have maximum operability incentives, as another team supports critical services with high levels of user traffic on their behalf. Theoretically, strong incentives remain due to error budgets. The ability of a delivery team to maintain a high deployment throughput without intervention depends on protecting service availability. This should ensure product managers prioritise operational features alongside product features. There is 1 on-call SRE for critical services at a capex or opex cost, and knowledge synchronisation costs between teams are inevitable.

Overinvesting in inapplicability

Production support is revenue insurance. At first glance, it might make sense to pay a premium for a high-powered SRE team to support highly available services with critical levels of user traffic. However, investing in an SRE on-call team should be questioned when its applicability to IT as a Cost Centre is so challenging. 

Funding an SRE on-call team will be constrained by cost accounting. An SRE team in Delivery will have a capex budget, and undergo periodic funding renewals. An SRE team in Operations will have an opex budget, and endure regular pressure to find cost efficiencies. Either approach is at odds with a long term commitment to a large team of highly paid software engineers.

Error budgets are unlikely to magically solve the politics and bureaucracy that exist between Delivery teams and an SRE on-call team. Product managers, developers, and/or sysadmins might not agree on a service availability level, availability losses in recent incidents, and/or the remaining latitude in an error budget. A Head of Product might not accept an SRE block on deployments, when an error budget is lost. A Head of Delivery or Operations might not accept deployments at all hours, even with an error budget in place. In addition, an SRE on-call team might be unable to hand an on-call rotation back to a Delivery team, if it was disbanded when its capex funding ended.

In Site Reliability Engineering, Betsy Beyer et al describe near-universally applicable SRE practices, such as revenue-based availability targets and service level objectives. The authors also make the astute observation that ‘an additional nine of reliability requires an order of magnitude improvement’. A 99.99% service requires 10x more engineering effort than 99.9%, and 100x more than 99.0%. You Build It SRE Run It is not easily applied to IT as a Cost Centre, and it requires a sizable investment in culture, people, process, and tools. It is best suited to organisations with a website that genuinely requires 99.99% availability, and where the maximum revenue loss in a large-scale failure could jeopardise the organisation itself. In a majority of scenarios, You Build It You Run It will be a simpler and more cost effective alternative.

The Who Runs It series:

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

You Build It Ops Sometimes Run It

Why is a hybrid of You Build It Ops Run It and You Build It You Run It doomed to fail at scale?

This is part of the Who Runs It series.

Introduction

You Build It Ops Sometimes Run It refers to a mix of You Build It You Run It and You Build It Ops Run It. A minority of applications are supported by Delivery teams, whereas the majority are supported by a Monitoring team in Operations. This can be accomplished by splitting applications into higher and lower availability targets.

Some vendors may erroneously refer to this as Site Reliability Engineering (SRE). SRE refers to a central, on-call Delivery team supporting high availability, stable applications that meet stringent entry criteria. You Build It Ops Sometimes Run It is completely different to SRE, as the Monitoring team is disempowered and supports lower availability applications. The role and responsibilities of the Monitoring team are simply a hybrid of the Operations Bridge and Application Operations teams in the ITIL v3 Service Operation standard.

Out of hours, higher availability applications are supported by their L1 Delivery teams. Lower availability applications are supported by the L1 Monitoring team, who will receive alerts and respond to incidents. When necessary, the Monitoring team will escalate to L2 Delivery team members on best endeavours.

Support costs for Monitoring will be paid out of OpEx. Support costs for L1 Delivery teams should be paid out of CapEx, to ensure product managers balance desired availability with on-call costs. An L1 Delivery team member will be paid a flat standby rate, and a per-incident callout rate. An L2 Delivery team member will do best endeavours unpaid, and might be compensated per-callout with time off in lieu.
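The funding difference can be made concrete with a small calculation. The rates below are invented for illustration; only the structure of a flat standby rate plus a per-incident callout rate comes from the support model described above.

```python
def monthly_on_call_cost(nights_on_standby: int, callouts: int,
                         standby_rate: float = 30.0,
                         callout_rate: float = 75.0) -> float:
    """Paid on-call: a flat standby rate per night plus a rate per incident callout."""
    return nights_on_standby * standby_rate + callouts * callout_rate


# An L1 Delivery team member on standby every night of a 30 day month, with 2 callouts.
print(monthly_on_call_cost(nights_on_standby=30, callouts=2))  # 1050.0, charged to CapEx

# An L2 Delivery team member on best endeavours is unpaid; any compensation is time off in lieu.
```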

Inherited Discontinuous Delivery and inoperability

Proponents of You Build It Ops Sometimes Run It will argue the Monitoring team is a low cost, wafer thin support team that simply follows runbooks to resolve straightforward incidents, and Delivery teams can be called out when necessary. However, many of the disadvantages of You Build It Ops Run It are inherited:

  • Long time to restore – support ticket handoffs between Monitoring and Delivery teams will delay availability restoration during complex incidents
  • Very high knowledge synchronisation costs – application and incident knowledge will not be shared between multiple Delivery teams and the Monitoring team without significant coordination costs, such as handover meetings
  • Slow operational improvements – problems and workarounds identified by Monitoring will languish in Delivery backlogs for weeks or months, building up application complexity for future incidents
  • No focus on outcomes – applications will be built as outputs only, with little regard for product hypotheses
  • Fragile architecture – failure scenarios will not be designed into applications, increasing failure blast radius
  • Inadequate telemetry – dashboards and alerts by the Monitoring team will only be able to include low-level operational metrics
  • Traffic ignorance – challenges in live traffic management will be localised, and unable to inform application design decisions
  • Restricted collaboration – joint incident response between Monitoring and Delivery teams will be hampered by different ways of working and tools
  • Unfair on-call expectations – Delivery team members will be expected to be available out of hours without compensation for the inconvenience, and disruption to their lives

The Delivery teams on-call for high availability applications will have strong operability incentives. However, with a Monitoring team responsible for a majority of applications, most Delivery teams will be unaware of or uninvolved in incidents. Those Delivery teams will have little reason to prioritise operational features, and the Monitoring team will be powerless to do so. Widespread inoperability, and an increased vulnerability to production incidents is unavoidable.

The notion of a wafer thin Monitoring team is fundamentally naive. If an IT department has an entrenched culture of You Build It Ops Run It At Scale, there will be a predisposition towards Operations support. Delivery teams on-call for higher availability applications will be viewed as a mere exception to the rule. Over time, there will be a drift to the Monitoring team taking over office hours support, and then higher availability applications out of hours. At that point, the Monitoring team is just another Application Operations team, and all the disadvantages of You Build It Ops Run It At Scale are assured.

The Who Runs It series:

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

You Build It Ops Run It at scale

Why does Operations production support become less effective as Delivery teams and applications increase in scale?

This is part of the Who Runs It series.

Introduction

An IT As A Cost Centre organisation beholden to Plan-Build-Run will have a Delivery group responsible for building applications, and an Operations group responsible for deploying applications and production support. When there are 10+ Delivery teams and applications, this can be referred to as You Build It Ops Run It at scale. For example, imagine a single technology value stream used by 10 delivery teams, and each team builds a separate customer-facing application.

As with You Build It Ops Run It, there will be multi-level production support in accordance with the ITIL v3 Service Operation standard.

L1 and L2 Operations teams will be paid standby and callout costs out of Operational Expenditure (OpEx), and L3 Delivery team members on best endeavours are not paid. The key difference at scale is Operations workload. In particular, Application Operations will have to manage deployments and L2 incident response for 10+ applications. It will be extremely difficult for Application Operations to keep track of when a deployment is required, which alert corresponds to which application, and which Delivery team can help with a particular application.

Discontinuous Delivery and inoperability

At scale, You Build It Ops Run It magnifies the problems with You Build It Ops Run It, with a negative impact on both Continuous Delivery and operability:

  • Long time to restore – support ticket handoffs between Ops Bridge, Application Operations, and multiple Delivery teams will delay availability restoration on failure
  • Very high knowledge synchronisation costs – Application Operations will find it difficult to ingest knowledge of multiple applications and share incident knowledge with multiple Delivery teams
  • No focus on customer outcomes – applications will be built as outputs only, with little time for product hypotheses
  • Fragile architecture – failure scenarios will not be designed into user journeys and applications, increasing failure blast radius
  • Inadequate telemetry – dashboards and alerts from Application Operations will only be able to show low-level operational metrics
  • Traffic ignorance – applications will be built with little knowledge of how traffic flows through different dependencies
  • Restricted collaboration – incident response between Application Operations and multiple Delivery teams will be hampered by different ways of working
  • Unfair on-call expectations – Delivery team members will be expected to do unpaid on-call out of hours

These problems will make it less likely that application availability targets can consistently be met, and will increase Time To Restore (TTR) on availability loss. Production incidents will be more frequent, and revenue impact will potentially be much greater. This is a direct result of the lack of operability incentives. Application Operations cannot build operability into 10+ applications they do not own. Delivery teams will have little reason to do so when they have little to no responsibility for incident response.

A Theory Of Constraints lens on Continuous Delivery shows that reducing rework and queue times is key to deployment throughput. With 10+ Delivery teams and applications the Application Operations workload will become intolerable, and team member burnout will be a real possibility. Queue time for deployments will mount up, and the countermeasure to release candidates blocking on Application Operations will be time-consuming management escalations. If product demand calls for more than weekly deployments, the rework and delays incurred in Application Operations will result in long-term Discontinuous Delivery.

The Who Runs It series

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

You Build It Ops Run It

Why do disparate Delivery and Operations teams result in long-term Discontinuous Delivery of inoperable applications?

This is part of the Who Runs It series.

Introduction

An organisation modelled on IT As A Cost Centre and Plan-Build-Run will have an Operations group in its IT department. Operations teams will be responsible for all Run activities, including deployments and production support for all applications. This can be referred to as You Build It Ops Run It. For example, consider a technology value stream comprising 1 development team in Delivery and an Application Operations team.

You Build It Ops Run It usually involves multi-level production support, in line with the ITIL v3 Service Operation standard.

The Service Desk will receive customer requests, and Operations Bridge will monitor dashboards and receive alerts. Both L1 teams will be trained to resolve simple technology issues, and to escalate more complicated tickets to L2. Application Operations will respond to incidents that require technology specialisation, and when necessary will escalate to an L3 Delivery team to contribute their expertise to an incident.

Cost accounting in IT As A Cost Centre creates a funding divide. A Delivery team will be budgeted under Capital Expenditure (CapEx), whereas Operations teams will be under Operational Expenditure (OpEx). An Operations team member will be paid a flat standby rate and a per-incident callout rate. A Delivery team member will not be paid for standby, and might be unofficially compensated per-callout with time off in lieu. Operations will be under continual pressure to reduce OpEx spending, and the Service Desk, Ops Bridge, and/or Application Operations might be outsourced to third party suppliers.

Discontinuous Delivery and inoperability

In ITSM and why three-tier support should be replaced with Swarming, Jon Hall argues “the current organizational structure of the vast majority of IT support organisations is fundamentally flawed”. Multi-level support in You Build It Ops Run It means non-trivial tickets will go from Service Desk or Ops Bridge through triage queues until the best-placed responder team is found. Repeated, unilateral ticket reassignments can occur between teams and individuals. Those handoffs can increase incident resolution time by hours, days, or even weeks. Rework can also be incurred as Application Operations introduce workarounds and data fixes, which await resolution in a Delivery backlog for months before prioritisation.

In addition, You Build It Ops Run It has major disadvantages for fast customer feedback and iterative product development:

  • Long deployment lead times – handoffs with Application Operations will inflate lead times by hours or days
  • High knowledge synchronisation costs – Delivery team application knowledge and Application Operations incident knowledge will be lost in handoffs, without substantial synchronisation efforts
  • Focus on outputs – software will be built as an output, with little to no understanding of product hypotheses or customer outcomes
  • Fragile architecture – applications will be architected without limits on failure blast radius, and exposed to high impact incidents
  • Inadequate telemetry – dashboards and alerts created by Application Operations in isolation will only be able to use operational metrics
  • Traffic ignorance – challenges involved in managing live traffic will be localised and unable to inform design decisions
  • Restricted collaboration – Application Operations and Delivery teams will find joint incident response hard, due to differences in ways of working and tools, and lack of Delivery team access to production
  • Unfair on-call expectations – Delivery team members will be expected to be available out of hours without compensation for the inconvenience, and disruption to their lives

These problems can be traced back to incentives. With Application Operations responsible for production support, a Delivery team will be unaware of or uninvolved in production incidents. Application Operations cannot build operability into applications they do not own, and a Delivery team will have little reason to prioritise operational features. As a result, inoperability is inevitable.

You Build It Ops Run It injects substantial delays and rework into a technology value stream. This is likely to constrain Continuous Delivery if product demand is high. If weekly or fewer deployments are sufficient to meet demand, then Continuous Delivery is possible. However, if product demand calls for more than weekly deployments then You Build It Ops Run It can only lead to Discontinuous Delivery.

The Who Runs It series:

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

Implementing You Build It You Run It at scale

How can You Build It You Run It at scale be implemented? How can support costs be balanced with operational incentives, to ensure multiple teams can benefit from Continuous Delivery and operability at scale?

This is part of the Who Runs It series.

Introduction

Traditionally, an IT As A Cost Centre organisation with roots in Plan-Build-Run will have Delivery teams responsible for building applications, and Operations teams responsible for deployments and production support. You Build It You Run It at scale fundamentally changes that organisational model. It means 10+ Delivery teams are responsible for deploying and supporting their own 10+ applications.

Applying You Build It You Run It at scale maximises the potential for fast deployment lead times, and fast incident resolution times across an IT department. It incentivises Delivery teams to increase operability via failure design, product telemetry, and cumulative learning. It is a revenue insurance policy that offers high risk coverage at a high premium. This is in contrast to You Build It Ops Run It at scale, which offers much lower risk coverage at a lower premium.

You Build It You Run It at scale can be intimidating. It has a higher engineering cost than You Build It Ops Run It at scale, as the table stakes are higher. These include a centralised catalogue of service ownership, detailed runbooks, on-call training, and global operability measures. It can also have support costs that are significantly higher than You Build It Ops Run It at scale.

At its extreme, You Build It You Run It at scale will have D support rotas for D Delivery teams. The out of hours support costs for D rotas will be greater than 2 rotas in You Build It Ops Run It at scale, unless Operations support is on an exorbitant third party contract. As a result You Build It Ops Run It at scale can be an attractive insurance policy, despite its severe disadvantages on risk coverage. This should not be surprising, as graceful extensibility trades off with robust optimality. As Mary Patterson et al stated in Resilience and Precarious Success, “fundamental goals (such as safety) tend to be sacrificed with increasing pressure to achieve acute goals (faster, better, and cheaper)”. 

You Build It You Run It at scale does not have to mean 1 Delivery team on-call for every 1 application. It offers cost effectiveness as well as high risk coverage when support costs are balanced with operability incentives and risk of revenue loss. The challenge is to minimise standby costs without weakening operability incentives.

By availability target

The level of production support afforded to an application in You Build It You Run It at scale should be based on its availability target. In office hours, Delivery teams support their own applications, and halt any feature development to respond to an application alert. Out of hours, production support for an application is dictated by its availability target and rate of product demand.

Applications with a low availability target have no out of hours support. This is low cost, easy to implement, and counter-intuitively does not sacrifice operability incentives. A Delivery team responsible for dealing with overnight incidents on the next working day will be incentivised to design an application that can gracefully degrade over a number of hours. No on-call is also fairer than best endeavours, as there is no expectation for Delivery team members to disrupt their personal lives without compensation.

Applications with a high availability target and a high rate of product demand each have their own team rota. A team rota is a single Delivery team member on-call for one or more applications from their team. This is classic You Build It You Run It, and produces the maximum operability incentives as the Delivery team have sole responsibility for their application. When product demand for an application is filled, it should be downgraded to a domain rota.

Applications with a medium availability target share a domain rota. A domain rota is a single Delivery team member on-call for a logical grouping of applications with an established affinity, from multiple Delivery teams.

The domain construct should be as fine-grained and flexible as possible. It needs to minimise on-call cognitive load, simplify knowledge sharing between teams, and focus on organisational outcomes. The following constructs should be considered:

  • Product domains – sibling teams should already be tied together by customer journeys and/or sales channels
  • Architectural domains – sibling teams should already know how their applications fit into technology capabilities

The following constructs should be rejected:

  • Geographic domains – per-location rotas for teams split between locations would produce a mishmash of applications, cross-cutting product and architectural boundaries and increasing on-call cognitive load
  • Technology domains – per-tech rotas for teams split between frontend and backend technologies would completely lack a focus on organisational outcomes

A domain rota will create strong operability incentives for multiple Delivery teams, as they have a shared on-call responsibility for their applications. It is also cost effective as people on-call do not scale linearly with teams or applications. However, domain rotas can be challenging if knowledge sharing barriers exist, such as multiple teams in one domain with dissimilar engineering skills and/or technology choices. It is important to be pragmatic, and technology choices can be used as a tiebreaker on a product or architectural construct where necessary.

For example, a Fruits R Us organisation has 10 Delivery teams, each with 1 application. There are 3 availability targets of 99.0%, 99.5%, and 99.9%. An on-call rota is £3Kpcm in standby costs. If all 10 applications had their own rota, the support cost of £30Kpcm would likely be unacceptable.

Assume Fruits R Us managers assign minimum revenue losses of £20K, £50K, and £100K to their availability targets, and ask product owners to consider their minimum potential revenue losses per target. The Product and Checkout applications could lose £100K+ in 43 minutes, so they remain at 99.9% and have their own rotas. 4 applications in the same Fulfilment domain could lose £50K+ in 3 hours, so they are downgraded to 99.5% and share a Fulfilment domain rota across 4 teams. 4 applications in the Stock domain could lose £20K in 7 hours but no more, so they are downgraded to 99.0% with no out of hours on-call. This would result in a support cost of £9Kpcm while retaining strong operability incentives.
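The Fruits R Us figures can be reproduced with a short calculation. This is a minimal sketch, assuming the flat £3Kpcm standby cost per rota and the rota assignments described above.

```python
STANDBY_COST_PCM = 3_000  # GBP per on-call rota per calendar month

# Rotas after downgrading: Product and Checkout keep their own team rotas at 99.9%,
# the 4 Fulfilment applications share one domain rota at 99.5%, and the 4 Stock
# applications at 99.0% have no out of hours rota.
rotas = {
    "product-team": 1,
    "checkout-team": 1,
    "fulfilment-domain": 1,
    "stock-domain": 0,
}

total_cost = sum(rotas.values()) * STANDBY_COST_PCM
print(f"£{total_cost:,}pcm")  # £9,000pcm, versus £30,000pcm for 10 separate team rotas
```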

Optimising costs

A number of techniques can be used to optimise support costs for You Build It You Run It Per Availability Target:

  • Recalibrate application availability targets. Application revenue analytics should regularly be analysed, and compared with the engineering time and on-call costs linked to an availability target. Where possible, availability targets should be downgraded. It should also be possible to upgrade a target, including fixed time windows for peak trading periods
  • Minimise failure blast radius. Rigorous testing and deployment practices including Canary Deployments, Dark Launching, and Circuit Breakers should reduce the cost of application failure, and allow for availability targets to be gradually downgraded. These practices should be validated with automated and exploratory Chaos Engineering on a regular basis
  • Align out of hours support with core trading hours. A majority of website revenue might occur in one timezone, and within core trading hours. In that scenario, production support hours could be redefined from 0000-2359 to 0600-2200 or similar. This could remove the need for out of hours support 2200-0600, and alerts would be investigated by Delivery teams on the following morning (a sketch of this time-window check follows the list)
  • Automated, time-limited shuttering on failure. A majority of product owners might be satisfied with shuttering on failure out of hours, as opposed to application restoration. If so, an automated shutter with per-application user messaging could be activated on application failure, for a configurable time period out of hours. This could remove the need for out of hours support entirely, but would require a significant engineering investment upfront and operability incentives would need to be carefully considered
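The third technique in the list above, aligning out of hours support with core trading hours, amounts to a time-window check before paging anyone. This is a minimal sketch, assuming support hours of 0600 to 2200; overnight alerts are queued for the next working morning rather than paged.

```python
from datetime import datetime, time

SUPPORT_START = time(6, 0)   # 0600, aligned with core trading hours
SUPPORT_END = time(22, 0)    # 2200


def should_page_now(alert_time: datetime) -> bool:
    """Page on-call only inside support hours; overnight alerts wait for the morning."""
    return SUPPORT_START <= alert_time.time() < SUPPORT_END


print(should_page_now(datetime(2021, 3, 1, 14, 30)))  # True  - page the on-call engineer
print(should_page_now(datetime(2021, 3, 1, 2, 45)))   # False - investigate next working morning
```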

This list is not exhaustive. As with any other Continuous Delivery or operability practice, You Build It You Run It at scale should be founded upon the Improvement Kata. Ongoing experimentation is the key to success.

Production support is a revenue insurance policy, and implementing You Build It You Run It at scale is a constant balance between support costs and operability. You Build It You Run It Per Availability Target ensures on-call Delivery team members do not scale linearly with teams and/or applications, while trading away some operability incentives and some Time To Restore – but far less than You Build It Ops Run It at scale. Overall, You Build It You Run It Per Availability Target is an excellent starting point.

The Who Runs It series:

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

You Build It You Run It at scale

How can You Build It You Run It be applied to 10+ teams and applications without an overwhelming support cost? How can operability incentives be preserved for so many teams?

This is part of the Who Runs It series.

Introduction

You Build It You Run It at scale means 10+ Delivery teams are responsible for their own deployments and production support. It is the You Build It You Run It approach, applied to multiple teams and multiple applications.

There is an L1 Service Desk team to handle customer requests. Each Delivery team is on L1 support for their applications, and creates their own monitoring dashboard and alerts. There should be a consistent toolchain for anomaly detection and alert notifications for all Delivery teams, which can incorporate those dashboards and alerts.

The Service Desk team will tackle customer complaints and resolve simple technology issues. When an alert fires, a Delivery team will practice Stop The Line by halting feature development, and swarming on the problem within the team. That cross-functional collaboration means a problem can be quickly isolated and diagnosed, and the whole team creates new knowledge they can incorporate into future work. If the Service Desk cannot resolve an issue, they should be able to route it to the appropriate Delivery team via an application mapping in the incident management system. 

In On-Call At Any Size, Susan Fowler et al warn “multiple rotations is a key scaling challenge, requiring active attention to ensure practices remain healthy and consistent”. Funding is the first You Build It You Run It practice that needs attention at scale. On-call support for each Delivery team should be charged to the CapEx budget for that team. This will encourage each product manager to regularly work on the delicate trade-off between protecting their desired availability target out of hours and on-call costs. Central OpEx funding must be avoided, as it eliminates the need for product managers to consider on-call costs at all.

Continuous Delivery and Operability at scale

You Build It You Run It has the following advantages at scale:

  • Fast incident resolution – an alert will be immediately assigned to the team that owns the application, which can rapidly swarm to recover from failure and minimise TTR
  • Short deployment lead times – deployments can be performed on demand by a Delivery team, with no handoffs involved
  • Minimal knowledge synchronisation costs – teams can easily convert new operational information into knowledge
  • Focus on outcomes – teams are encouraged to work in smaller batches, towards customer outcomes and product hypotheses
  • Adaptive architecture – applications can be designed with failure scenarios in mind, including circuit breakers and feature toggles to reduce failure blast radius
  • Product telemetry – application dashboards and alerts can be constantly updated to include the latest product metrics
  • Situational awareness – teams will have a prior understanding of normal versus abnormal live traffic conditions that can be relied on during incident response
  • Fair on-call compensation – team members will be remunerated for the disruption to their lives incurred by supporting applications

In Accelerate, Dr Nicole Forsgren et al found “high performance is possible with all kinds of systems, provided that systems – and the teams that build and maintain them – are loosely coupled”. Accelerate research showed the key to high performance is for a team to be able to independently test and deploy its applications, with negligible coordination with other teams. You Build It You Run It enables a team to increase its throughput and achieve Continuous Delivery, by removing rework and queue times associated with deployments and production support. At scale, You Build It You Run It enables an organisation to increase overall throughput while simultaneously increasing the number of teams. This allows an organisation to move faster as it adds more people, which is a true competitive advantage.

You Build It You Run It creates a healthy engineering culture at scale, in which product development consists of a balance between product ideas and operational features. 10+ Delivery teams with on-call responsibilities will be incentivised to care about operability and consistently meeting availability targets, while increasing delivery throughput to meet product demand. Delivery teams doing 24×7 on-call at scale will be encouraged to build operability into all their applications, from inception to retirement.

You Build It You Run It can incur high support costs at scale. It can be cost effective if a compromise is struck between deployment targets and on-call costs that does not weaken operability incentives for Delivery teams.

Production support as revenue insurance

Production support should be thought of as a revenue insurance policy. As insurance policies, You Build It Ops Run It and You Build It You Run It are opposites at scale in terms of risk coverage and costs.

You Build It Ops Run It offers a low degree of risk coverage, limits deployment throughput, and has a potential for revenue loss on unavailability that should not be underestimated. You Build It You Run It has a higher degree of risk coverage, with no limits on deployment throughput and a short TTR to minimise revenue losses on failure.

You Build It You Run It becomes more cost effective as product demand and reliability needs increase, as deployment targets and availability targets are ratcheted up, and the need for Continuous Delivery and operability becomes ever more apparent. The right revenue insurance policy should be chosen based on the number of teams and applications, and the range of availability targets. A fuzzy model can be used to distinguish when You Build It You Run It is appropriate – when availability targets are demanding and the number of teams and applications is 10+.
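A minimal sketch of one such decision heuristic follows. The thresholds of 10+ teams and applications and a 99.9% availability target come from the paragraph above; the fallback branch is only illustrative, and this is not the original fuzzy model.

```python
def recommend_support_method(teams_and_apps: int, availability_target: float) -> str:
    """Illustrative heuristic: demanding targets at 10+ teams favour You Build It You Run It at scale."""
    demanding_target = availability_target >= 0.999
    at_scale = teams_and_apps >= 10
    if demanding_target and at_scale:
        return "You Build It You Run It at scale"
    if demanding_target:
        return "You Build It You Run It"
    return "You Build It Ops Run It (lower risk coverage, lower premium)"


print(recommend_support_method(teams_and_apps=12, availability_target=0.999))  # You Build It You Run It at scale
print(recommend_support_method(teams_and_apps=4, availability_target=0.99))    # You Build It Ops Run It (lower risk coverage, lower premium)
```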

The Who Runs It series

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

Availability targets

Why does reliability matter? How should a product manager select an availability target for their service?

TL;DR:

  • Reliability means balancing the risk of unavailability with the cost of sustaining availability.
  • Availability can be understood as a level of availability, from 99.0% to 99.999%.
  • Increasing an availability level incurs up to an order of magnitude more engineering effort.
  • An availability target is selected by a product manager based upon the maximum revenue loss they can tolerate for their service.

Introduction

Organisations must have reliable IT applications at the heart of their business if they are to innovate in changing markets. Reliability is defined by Patrick O’Connor and Andre Kleyner in Practical Reliability Engineering as “the probability that [a system] will perform a required function without failure under stated conditions for a stated period of time”. There must be an investment in reliability if propositions are to be rapidly delivered to customers and remain highly available.

Reliability means balancing the risk of application unavailability with the cost of sustaining application availability. Application unavailability will incur opportunity costs related to lower customer revenue, loss of confidence, and reputational damage. On the other hand, sustaining application availability also incurs opportunity costs, as engineering time must be devoted to operational work instead of new product features visible to customers. In Site Reliability Engineering, Betsy Beyer et al state “cost does not increase linearly… an incremental improvement in reliability may cost 100x more than the previous increment”.

Furthermore, the user experience of application availability will be constrained by lower levels of user device availability. For example, a smartphone with 99.0% availability will not allow a user to experience a website with 99.999% availability. 100% availability is never the answer, as the cost is too high and users will not perceive any benefits. Maximising feature delivery will harm availability, and maximising availability will harm feature delivery.

Availability targets

Application availability can be understood as an availability target. An availability target represents a desired level of availability, and is usually expressed as a number of nines. Each additional nine of availability represents an order of magnitude more engineering effort. For example, 99.0% availability means “two nines”, and if its engineering effort is N then 99.9% availability would require 10N in engineering effort.

An availability target should be coupled to product risk. This will ensure a product owner translates their business goals into operational objectives, and empowers their team to strike a balance between application availability and costs. The goal is to improve the operability of an application until its availability target is met, and can be sustained.

For example, consider a Fruits R Us organisation with 3 availability targets for its applications – 99.0% (“two nines”), 99.5% (“two and a half nines”), and 99.9% (“three nines”). The 99.9% availability target allows for a maximum of 0.1% unavailability per month, which in a 30 day month equates to a maximum of 43 mins 12 seconds unavailability. It also requires 10 times more engineering effort to sustain than the 99.0% availability target.
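The maximum unavailability figures can be reproduced with a short calculation. This is a minimal sketch, assuming a 30 day month as in the example above.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30 day month


def max_unavailability_minutes(availability_target: float) -> float:
    """Maximum tolerable unavailability per month for a given availability target."""
    return (1.0 - availability_target) * MINUTES_PER_MONTH


for target in (0.99, 0.995, 0.999):
    print(f"{target:.1%}: {max_unavailability_minutes(target):.1f} minutes")
# 99.0%: 432.0 minutes (7h 12m)
# 99.5%: 216.0 minutes (3h 36m)
# 99.9%: 43.2 minutes (43m 12s)
```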

In Site Reliability Engineering, the maximum unavailability per month for an availability target is expressed as an error budget. Error budgets are a method of formalising the shared ownership and prioritisation of product features versus operational features, and might be used to halt production deployments during periods of sustained unavailability.

Availability target selection

A product owner should select an availability target by comparing their projected revenue impact of application unavailability with the set of possible availability targets. They need to consider if their application is tied directly or indirectly to revenue, their application payment model, what expectations users will have, and what level of service is provided by competitors in the same marketplace.

First, an organisation needs to establish a minimum Cost Of Delay revenue loss for each availability target, on loss of availability. Then a product owner should estimate the Cost Of Delay for their application being unavailable for the duration of each target. The Value Framework by Joshua Arnold et al can be used to estimate the financial impact of the loss of an application:

  • Increase Revenue – does the application increase sales levels?
  • Protect Revenue – does the application sustain current sales levels?
  • Reduce Costs – does the application reduce current incurred costs?
  • Avoid Costs – does the application reduce potential for future incurred costs?

This will allow a product owner to balance their need for application availability with the opportunity costs associated with consistently meeting that availability level.

For example, at Fruits R Us a set of revenue bands is attached to existing availability targets, based on an analysis of existing revenue streams. The 99.0% availability target is intended for applications where the Cost Of Delay on unavailability is at least £50K in 7h 12m, whereas 99.9% is for unavailability that could cost £1M or more in only 43m 12s.

A proposed Bananas application is expected to produce a monthly revenue increase of £40K. It is intended to replace an Apples application, which has an availability target of 99.0% sustained by an average of 8 engineering hours per month. The Bananas product owner believes customers will have heightened reliability expectations due to superior competitor offerings in the marketplace, and that Bananas could lose the £40K revenue increase within 2 hours of unavailability in a month. The 99.0% availability target can fit 2 hours of unavailability into its 7h 12m ceiling, but cannot fit a £40K revenue loss. The 99.5% availability target is selected, and the Bananas product owner knows that, at 5N engineering effort, 40 engineering hours will be needed per month to invest in operational features.
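The engineering effort arithmetic in the example can be checked with a short calculation. This is a minimal sketch, assuming the Fruits R Us convention that 99.0% costs N, 99.5% costs 5N, and 99.9% costs 10N, with N taken from the Apples application; the lookup table is illustrative.

```python
# Relative engineering effort per availability target, following the Fruits R Us example.
EFFORT_MULTIPLIER = {0.990: 1, 0.995: 5, 0.999: 10}

N_HOURS_PER_MONTH = 8  # Apples sustains 99.0% with an average of 8 engineering hours per month


def engineering_hours(availability_target: float) -> int:
    """Monthly engineering hours needed to sustain a target, relative to N."""
    return EFFORT_MULTIPLIER[availability_target] * N_HOURS_PER_MONTH


print(engineering_hours(0.995))  # 40 hours per month for Bananas at 99.5%
print(engineering_hours(0.999))  # 80 hours per month if 99.9% had been selected
```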

Acknowledgements

Thanks to Thierry de Pauw for the review

You Build It You Run It

What is You Build It You Run It, and why does it have such a positive impact on operability? Why is it important to balance support cost effectiveness with operability incentives?

This is part of the Who Runs It series.

Introduction

The usual alternative to You Build It Ops Run It is for a Delivery team to assume responsibility for its Run activities, including deployments and production support. This is often referred to as You Build It You Run It.

You Build It You Run It consists of single-level swarming support, with developers on-call. There is also a Service Desk to handle customer requests. The toolchain needs to include anomaly detection, alert notifications, messaging, and incident management tools, such as Prometheus, PagerDuty, Slack, and ServiceNow.

As with You Build It Ops Run It, Service Desk is an L1 team that receives customer requests and will resolve simple technology issues wherever possible. A development team in Delivery is also L1, and they will monitor dashboards, receive alerts, and respond to incidents. Service Desk should escalate tickets for particular website pages or user journeys into the incident management system, which would be linked to applications.

Delivery engineering costs and on-call support will both be paid out of CapEx, and Operations teams such as Service Desk will be under OpEx. As with You Build It Ops Run It, the Service Desk team might be outsourced to reduce OpEx costs. CapEx funding for You Build It You Run It will compel a product manager to balance their desired availability with on-call costs. OpEx funding for Delivery on-call should be avoided wherever possible, as it encourages product managers to artificially minimise risk tolerance and select high availability targets regardless of on-call costs.

Continuous Delivery and operability

Swarming support means Delivery prioritising incident resolution over feature development, in line with the Continuous Delivery practice of Stop The Line and the Toyota Andon Cord. This encourages developers to limit failure blast radius wherever possible, and prevents them from deploying changes mid-incident that might exacerbate a failure. Swarming also increases learning, as it ensures developers are able to uncover perishable mid-incident information, and cross-pollinate their skills.

You Build It You Run It also has the following advantages for product development:

  • Short deployment lead times – lead times will be minimised as there are no handoffs
  • Minimal knowledge synchronisation costs – developers will be able to easily share application and incident knowledge, to better prepare themselves for future incidents
  • Focus on outcomes – teams will be empowered to deliver outcomes that test product hypotheses, and iterate based on user feedback
  • Short incident resolution times – incident response will be quickened by no support ticket handoffs or rework
  • Adaptive architecture – applications will be architected to limit failure blast radius, including bulkheads and circuit breakers
  • Product telemetry – dashboards and alerts will be continually updated by developers, to be multi-level and tailored to the product context
  • Traffic knowledge – an appreciation of the pitfalls and responsibilities inherent in managing live traffic will be factored into design work
  • Rich situational awareness – developers will respond to incidents with the same context, ways of working, and tooling
  • Clear on-call expectations – developers will be aware they are building applications they themselves will support, and that they should be remunerated for on-call

You Build It You Run It creates the right incentives for operability. When Delivery is responsible for their own deployments and production support, product owners will be more aware of operational shortfalls, and pressed by developers to prioritise operational features alongside product ideas. Ensuring that application availability is the responsibility of everyone will improve outcomes and accelerate learning, particularly for developers who in IT As A Cost Centre are far removed from actual customers. Empowering delivery teams to do on-call 24×7 is the only way to maximise incentives to build operability in.

Production support as revenue insurance

The most common criticism of You Build It You Run It is that it is too expensive. Paying Delivery team members for L1 on-call standby and callout can seem costly, particularly when You Build It Ops Run It allows for L1-2 production support to be outsourced to cheaper third party suppliers. This perception should not be surprising, given David Woods' assertion in The Flip Side Of Resilience that "graceful extensibility trades off with robust optimality". Implementing You Build It You Run It to increase adaptive capacity for future incidents may look wasteful, particularly if incidents are rare.

A more holistic perspective would be to treat production support as revenue insurance for availability targets, and consider risk in terms of revenue impact instead of incident count. A production support policy will cover:

  • Availability protection
  • Availability restoration on loss

You Build It You Run It maximises incentives for Delivery teams to focus from the outset on protecting availability, and it guarantees the callout of an L1 Delivery engineer to restore availability on loss. This should be demonstrable with a short Time To Restore (TTR), which could be measured via availability time series metrics or incident duration. That high level of risk coverage will come at a higher premium. This means You Build It You Run It will be more cost effective for applications with higher availability targets and greater potential for revenue loss.

You Build It Ops Run It offers a lower level of risk coverage at a lower premium, with weak incentives to protect application availability and an L2 Application Operations team to restore application availability. This will produce a higher TTR than You Build It You Run It. This may be acceptable for applications with lower availability targets and/or limited potential for revenue loss.

The cost effectiveness of a production support policy can be calculated per availability target by comparing its availability restoration capability with its support cost. For example, at Fruits R Us there are 3 availability targets, each with an estimated maximum revenue loss if the target is missed. Fruits R Us has a Delivery team with an on-call cost of £3K per calendar month and a TTR of 20 minutes, and an Application Operations team with a cost of £1.5K per month and a TTR of 1 hour.

Projected availability loss per team is a function of TTR and the £ maximum availability loss per availability target, and lower losses can be calculated for the Delivery team due to its shorter TTR.

At 99.0%, Application Operations is as cost effective at availability restoration of a 7 hour 12 minute outage as the Delivery team, and Fruits R Us might consider the merits of You Build It Ops Run It. However, this would mean Application Operations would be unable to build operability in and increase availability protection, and the Delivery team would have few incentives to contribute.

At 99.5%, the Delivery team is more cost effective at availability restoration of a 3 hour 36 minute outage than Application Operations.

At 99.9%, the Delivery team is far more cost effective at availability restoration of a 43 minute 12 second outage. The 1 hour TTR of Application Operations means their £ projected availability loss is greater than the £ maximum availability loss at 99.9%. You Build It You Run It is the only choice.
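
As a rough illustration, this comparison can be sketched in a few lines of Python. The support costs, TTRs, and the 99.0% and 99.9% revenue figures come from the Fruits R Us example above; the 99.5% revenue figure is a placeholder, and the linear loss model is an assumption rather than a prescription.

    # Minimal sketch: projected availability loss per production support policy.
    # The 99.5% maximum revenue loss is an assumed placeholder figure.
    teams = {
        "Delivery (You Build It You Run It)": {"cost": 3_000, "ttr_minutes": 20},
        "Application Operations (You Build It Ops Run It)": {"cost": 1_500, "ttr_minutes": 60},
    }

    # (availability target, downtime ceiling in minutes, maximum revenue loss at that ceiling)
    targets = [
        (0.990, 432.0, 50_000),
        (0.995, 216.0, 250_000),   # assumed figure for illustration
        (0.999, 43.2, 1_000_000),
    ]

    for availability, ceiling_minutes, max_loss in targets:
        loss_per_minute = max_loss / ceiling_minutes
        print(f"Availability target {availability:.1%}")
        for name, team in teams.items():
            projected_loss = loss_per_minute * team["ttr_minutes"]
            print(f"  {name}: projected loss £{projected_loss:,.0f}, "
                  f"support cost £{team['cost']:,} per month")

With these figures the Application Operations projection at 99.9% exceeds the £1M maximum loss, which is why You Build It You Run It becomes the only viable choice at that target.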

The Who Runs It series:

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

Operability measures

Why is it important to measure operability? What should the trailing indicators and leading indicators of operability be?

TL;DR:

  • The trailing indicators of operability are availability rate and time to restore availability.
  • The leading indicators of operability include the frequency of Chaos Days and the time to act upon incident review findings.

Introduction

In How To Measure Anything, Douglas Hubbard states organisations suffer from a Measurement Inversion, and waste their time measuring variables with a low information value. This is certainly true of IT reliability, which is usually measured badly, if at all. By proxy, this includes operability as well.

In many organisations, reliability is measured by equating failures with recorded production incidents. Incident durations are calculated for Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), or there is just an overall incident count. These are classic vanity measures. They are easy to implement and understand, but they have a low information value due to the following:

  • Quantitative measures such as incident count do not reflect business drivers, such as the percentage of unrecoverable user errors
  • Manual recording of incidents in a ticket system can be affected by data inaccuracies and cognitive biases, such as confirmation bias and recency bias
  • Goodhart’s Law means measuring incidents will result in fewer incident reports. People adjust their behaviours based on how they are measured, and measuring incidents will encourage people to suppress incident reports with potentially valuable information.

If operability is to be built into applications, there is a need to identify trailing and leading indicators of operability that are holistic and actionable. Measures of operability that encourage system-level collaboration rather than individual productivity will pinpoint where improvements need to be made. Without those indicators, it is difficult to establish a clear picture of operability, and where changes are needed.

Effective leading and trailing indicators of software delivery should be visualised and publicly communicated throughout an organisation, via internal websites and dashboards. Information radiators help engineers, managers, and executives understand at a glance the progress being made and alignment with organisational goals. Transparency also reduces the potential for accidents and bad behaviours. As Louis Brandeis said in Other People's Money, "sunlight is said to be the best of disinfectants; electric light the most efficient policeman".

Availability as a trailing indicator

Failures should be measured in terms of application availability targets, not production incidents. Availability measurements are easy to implement with automated time series metrics collection, easy to understand, and have a high information value. Measurements can be designed to distinguish between full and partial degradation, and between unrecoverable and recoverable user errors.

For example, a Fruits R Us organisation has 99.0%, 99.5%, and 99.9% as its availability targets. A product manager for an Oranges application selects 99.5% for at least the first 3 months.

Availability should be measured in the aggregate as Request Success Rate, as described by Betsy Beyer et al in Site Reliability Engineering. Request Success Rate can approximate degradation for customer-facing or back office applications, provided there is a well-defined notion of successful and unsuccessful work. It covers partial and full downtime for an application, and is more fine-grained than uptime versus downtime.

When an application has a Request Success Rate lower than its availability target, it is considered a failure. The average time to restore availability can be tracked as a Mean Time To Repair metric, and visualised in a graph alongside availability.

At Fruits R Us, the Oranges application communicates with upstream consumers via an HTTPS API. Its availability is constantly measured by Request Success Rate, which is implemented by checking the percentage of upstream requests that produce an HTTP response code lower than HTTP 500. When the Request Success Rate over 15 minutes is lower than the availability target of 99.5%, it is considered a failure and a production incident is raised. An availability graph can be used to illustrate availability, incidents, and time to repair as a trailing indicator of operability.
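
A minimal sketch of that check is shown below, assuming the response codes for the 15 minute window have already been collected from a metrics store; the function and variable names are chosen for illustration.

    # Minimal sketch: Request Success Rate for the Oranges HTTPS API.
    # A request is successful if it produces a response code lower than HTTP 500.
    from typing import Iterable

    AVAILABILITY_TARGET = 0.995   # the Oranges availability target

    def request_success_rate(response_codes: Iterable[int]) -> float:
        codes = list(response_codes)
        if not codes:
            return 1.0   # no traffic in the window is treated as no failure
        successes = sum(1 for code in codes if code < 500)
        return successes / len(codes)

    # 15 minutes of illustrative traffic: 10,000 requests, 60 server errors.
    window = [200] * 9_940 + [503] * 60
    rate = request_success_rate(window)
    if rate < AVAILABILITY_TARGET:
        print(f"Request Success Rate {rate:.2%} is below target - raise an incident")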

Leading indicators of operability

Failures cannot be predicted in a production environment as it is a complex, adaptive system. In addition, it is easy to infer a false narrative of past behaviours from quantitative data. The insights uncovered from an availability trailing indicator and the right leading indicators can identify inoperability prior to a production incident, and they can be pattern matched to select the best heuristic for the circumstances.

A leading indicator should be split into an automated check and one or more exploratory tests. This allows for continuous discovery of shallow data, and frees up people to examine contextual, richer data with a higher information value. Those exploratory tests might be part of an operational readiness assessment, or a Chaos Day dedicated to particular applications. Leading indicators of operability can include the following measures of learning.

Learning is a vital leading indicator of operability. An organisation is more likely to produce operable, reliable applications if it fosters a culture of continuous learning and experimentation. After a production incident, nothing should be more important than everyone in the organisation having the opportunity to accumulate new knowledge, for their colleagues as well as themselves.

The initial automated check of learning should be whether a post-incident review is published within 24 hours of an incident. This is easy to automate with a timestamp comparison between a post-incident review document and the central incident system, easy to communicate across an organisation, and highly actionable. It will uncover incident reviews that do not happen, are not publicly published, or happen too late to prevent information decay.
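
A minimal sketch of that automated check, assuming the incident system and the review document store each expose a timestamp, with retrieval of those values out of scope:

    # Minimal sketch: was a post-incident review published within 24 hours?
    from datetime import datetime, timedelta
    from typing import Optional

    REVIEW_DEADLINE = timedelta(hours=24)

    def review_published_in_time(incident_closed_at: datetime,
                                 review_published_at: Optional[datetime]) -> bool:
        """False if the review is missing, or was published more than 24 hours later."""
        if review_published_at is None:
            return False
        return review_published_at - incident_closed_at <= REVIEW_DEADLINE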

Another learning check should be the throughput of operability tasks, comprising the lead time to complete a task and the interval between completed tasks. Tasks should be created and stored in a machine readable format during operability readiness assessments, Chaos Days, exploratory testing, and other automated checks of operability. Task lead time should not be more than a week, and task interval should not exceed the cadence of the fastest learning source. For example, if operability readiness assessments occur every 90 days and Chaos Days occur every 30 days, then at least one operability task should be completed per month.
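
A minimal sketch of both checks, assuming each operability task is stored as a created and completed date pair in a machine readable format:

    # Minimal sketch: lead time and interval checks for completed operability tasks.
    from datetime import date, timedelta
    from typing import List, Tuple

    MAX_LEAD_TIME = timedelta(weeks=1)
    MAX_INTERVAL = timedelta(days=30)   # the fastest learning source, e.g. monthly Chaos Days

    def throughput_ok(tasks: List[Tuple[date, date]]) -> bool:
        """Every task completes within a week, and completions are at most 30 days apart."""
        lead_times_ok = all(completed - created <= MAX_LEAD_TIME
                            for created, completed in tasks)
        completions = sorted(completed for _, completed in tasks)
        intervals_ok = all(later - earlier <= MAX_INTERVAL
                           for earlier, later in zip(completions, completions[1:]))
        return lead_times_ok and intervals_ok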

Acknowledgements

Thanks as usual to Thierry de Pauw for reviewing this series
