Month: May 2020

Implementing You Build It You Run It at scale

19 May, 2020 / Steve Smith

How can You Build It You Run It at scale be implemented? How can support costs be balanced with operational incentives, to ensure multiple teams can benefit from Continuous Delivery and operability at scale?
This is part of the Who Runs It series.

Introduction

Traditionally, an IT As A Cost Centre organisation with roots in Plan-Build-Run will have Delivery teams responsible for building applications, and Operations teams responsible for deployments and production support. You Build It You Run It at scale fundamentally changes that organisational model. It means 10+ Delivery teams are responsible for deploying and supporting their own 10+ applications.

Applying You Build It You Run It at scale maximises the potential for fast deployment lead times, and fast incident resolution times across an IT department. It incentivises Delivery teams to increase operability via failure design, product telemetry, and cumulative learning. It is a revenue insurance policy, that offers high risk coverage at a high premium. This is in contrast to You Build It Ops Run It at scale, which offers much lower risk coverage at a lower premium.

You Build It You Run It at scale can be intimidating. It has a higher engineering cost than You Build It Ops Run It at scale, as the table stakes are higher. These include a centralised catalogue of service ownership, detailed runbooks, on-call training, and global operability measures. It can also have support costs that are significantly higher than You Build It Ops Run It at scale.

At its extreme, You Build It You Run It at scale will have D support rotas for D Delivery teams. The out of hours support costs for D rotas will be greater than 2 rotas in You Build It Ops Run It at scale, unless Operations support is on an exorbitant third party contract. As a result You Build It Ops Run It at scale can be an attractive insurance policy, despite its severe disadvantages on risk coverage. This should not be surprising, as graceful extensibility trades off with robust optimality. As Mary Patterson et al stated in Resilience and Precarious Success, “fundamental goals (such as safety) tend to be sacrificed with increasing pressure to achieve acute goals (faster, better, and cheaper)”.

You Build It You Run It at scale does not have to mean 1 Delivery team on-call for every 1 application. It offers cost effectiveness as well as high risk coverage when support costs are balanced with operability incentives and risk of revenue loss. The challenge is to minimise standby costs without weakening operability incentives.

By availability target

The level of production support afforded to an application in You Build It You Run It at scale should be based on its availability target. In office hours, Delivery teams support their own applications, and halt any feature development to respond to an application alert. Out of hours, production support for an application is dictated by its availability target and rate of product demand.

Applications with a low availability target have no out of hours support. This is low cost, easy to implement, and counter-intuitively does not sacrifice operability incentives. A Delivery team responsible for dealing with overnight incidents on the next working day will be incentivised to design an application that can gracefully degrade over a number of hours. No on-call is also fairer than best endeavours, as there is no expectation for Delivery team members to disrupt their personal lives without compensation.

Applications with a high availability target and a high rate of product demand each have their own team rota. A team rota is a single Delivery team member on-call for one or more applications from their team. This is classic You Build It You Run It, and produces the maximum operability incentives as the Delivery team have sole responsibility for their application. When product demand for an application is filled, it should be downgraded to a domain rota.

Applications with a medium availability target share a domain rota. A domain rota is a single Delivery team member on-call for a logical grouping of applications with an established affinity, from multiple Delivery teams.

The domain construct should be as fine-grained and flexible as possible. It needs to minimise on-call cognitive load, simplify knowledge sharing between teams, and focus on organisational outcomes. The following constructs should be considered:

Product domains – sibling teams should already be tied together by customer journeys and/or sales channels
Architectural domains – sibling teams should already know how their applications fit into technology capabilities

The following constructs should be rejected:

Geographic domains – per-location rotas for teams split between locations would produce a mishmash of applications, cross-cutting product and architectural boundaries and increasing on-call cognitive load
Technology domains – per-tech rotas for teams split between frontend and backend technologies would completely lack a focus on organisational outcomes

A domain rota will create strong operability incentives for multiple Delivery teams, as they have a shared on-call responsibility for their applications. It is also cost effective as people on-call do not scale linearly with teams or applications. However, domain rotas can be challenging if knowledge sharing barriers exist, such as multiple teams in one domain with dissimilar engineering skills and/or technology choices. It is important to be pragmatic, and technology choices can be used as a tiebreaker on a product or architectural construct where necessary.

For example, a Fruits R Us organisation has 10 Delivery teams, each with 1 application. There are 3 availability targets of 99.0%, 99.5%, and 99.9%. An on-call rota is £3Kpcm in standby costs. If all 10 applications had their own rota, the support cost of £30Kpcm would likely be unacceptable.

Assume Fruits R Us managers assign minimum revenue losses of £20K, £50K, and £100K to their availability targets, and ask product owners to consider their minimum potential revenue losses per target. The Product and Checkout applications could lose £100K+ in 43 minutes, so they remain at 99.9% and have their own rotas. 4 applications in the same Fulfilment domain could lose £50K+ in 3 hours, so they are downgraded to 99.5% and share a Fulfilment domain rota across 4 teams. 4 applications in the Stock domain could lose £20K in 7 hours but no more, so they are downgraded to 99.0% with no out of hours on-call. This would result in a support cost of £9Kpcm while retaining strong operability incentives.

Optimising costs

A number of techniques can be used to optimise support costs for You Build It You Run It Per Availability Target:

Recalibrate application availability targets. Application revenue analytics should regularly be analysed, and compared with the engineering time and on-call costs linked to an availability target. Where possible, availability targets should be downgraded. It should also be possible to upgrade a target, including fixed time windows for peak trading periods
Minimise failure blast radius. Rigorous testing and deployment practices including Canary Deployments, Dark Launching, and Circuit Breakers should reduce the cost of application failure, and allow for availability targets to be gradually downgraded. These practices should be validated with automated and exploratory Chaos Engineering on a regular basis
Align out of hours support with core trading hours. A majority of website revenue might occur in one timezone, and within core trading hours. In that scenario, production support hours could be redefined from 0000-2359 to 0600-2200 or similar. This could remove the need for out of hours support 2200-0600, and alerts would be investigated by Delivery teams on the following morning
Automated, time-limited shuttering on failure. A majority of product owners might be satisfied with shuttering on failure out of hours, as opposed to application restoration. If so, an automated shutter with per-application user messaging could be activated on application failure, for a configurable time period out of hours. This could remove the need for out of hours support entirely, but would require a significant engineering investment upfront and operability incentives would need to be carefully considered

This list is not exhaustive. As with any other Continuous Delivery or operability practice, You Build It You Run It at scale should be founded upon the Improvement Kata. Ongoing experimentation is the key to success.

Production support is a revenue insurance policy, and implementing You Build It You Run It at scale is a constant balance between support costs with operability. You Build It You Run It Per Availability Target ensures on-call Delivery team members do not scale linearly with teams and/or applications, while trading away some operability incentives and some Time To Restore – but far less than You Build It Ops Run It at scale. Overall, You Build It You Run It Per Availability Target is an excellent starting point.

The Who Runs It series:

You Build It Ops Run It
You Build It You Run It
You Build It Ops Run It at scale
You Build It You Run It at scale
You Build It Ops Sometimes Run It
Implementing You Build It You Run It at scale
You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

You Build It You Run It at scale

15 May, 2020 / Steve Smith

How can You Build It You Run It be applied to 10+ teams and applications without an overwhelming support cost? How can operability incentives be preserved for so many teams?
This is part of the Who Runs It series.

Introduction

You Build It You Run It at scale means 10+ Delivery teams are responsible for their own deployments and production support. It is the You Build It You Run It approach, applied to multiple teams and multiple applications.

There is an L1 Service Desk team to handle customer requests. Each Delivery team is on L1 support for their applications, and creates their own monitoring dashboard and alerts. There should be a consistent toolchain for anomaly detection and alert notifications for all Delivery teams, that can incorporate those dashboards and alerts.

The Service Desk team will tackle customer complaints and resolve simple technology issues. When an alert fires, a Delivery team will practice Stop The Line by halting feature development, and swarming on the problem within the team. That cross-functional collaboration means a problem can be quickly isolated and diagnosed, and the whole team creates new knowledge they can incorporate into future work. If the Service Desk cannot resolve an issue, they should be able to route it to the appropriate Delivery team via an application mapping in the incident management system.

In On-Call At Any Size, Susan Fowler et al warn “multiple rotations is a key scaling challenge, requiring active attention to ensure practices remain healthy and consistent”. Funding is the first You Build It You Run It practice that needs attention at scale. On-call support for each Delivery team should be charged to the CapEx budget for that team. This will encourage each product manager to regularly work on the delicate trade-off between protecting their desired availability target out of hours and on-call costs. Central OpEx funding must be avoided, as it eliminates the need for product managers to consider on-call costs at all.

Continuous Delivery and Operability at scale

You Build It You Run It has the following advantages at scale:

Fast incident resolution – an alert will be immediately assigned to the team that owns the application, and can rapidly swarm to recover from failure and minimise TTR
Short deployment lead times – deployments can be performed on demand by a Delivery team, with no handoffs involved
Minimal knowledge synchronisation costs – teams can easily convert new operational information into knowledge
Focus on outcomes – teams are encouraged to work in smaller batches, towards customer outcomes and product hypotheses
Adaptive architecture – applications can be designed with failure scenarios in mind, including circuit breakers and feature toggles to reduce failure blast radius
Product telemetry – application dashboards and alerts can be constantly updated to include the latest product metrics
Situational awareness – teams will have a prior understanding of normal versus abnormal live traffic conditions that can be relied on during incident response
Fair on-call compensation – team members will be remunerated for the disruption to their lives incurred by supporting applications

In Accelerate, Dr Nicole Forsgren et al found “high performance is possible with all kinds of systems, provided that systems – and the teams that build and maintain them – are loosely coupled”. Accelerate research showed the key to high performance is for a team to be able to independently test and deploy its applications, with negligible coordination with other teams. You Build It You Run It enables a team to increase its throughput and achieve Continuous Delivery, by removing rework and queue times associated with deployments and production support. At scale, You Build It You Run It enables an organisation to increase overall throughput while simultaneously increasing the number of teams. This allows an organisation to move faster as it adds more people, which is a true competitive advantage.

You Build It You Run It creates a healthy engineering culture at scale, in which product development consists of a balance between product ideas and operational features. 10+ Delivery teams with on-call responsibilities will be incentivised to care about operability and consistently meeting availability targets, while increasing delivery throughput to meet product demand. Delivery teams doing 24×7 on-call at scale will be encouraged to build operability into all their applications, from inception to retirement.

You Build It You Run It can incur high support costs at scale. It can be cost effective if a compromise is struck between deployment targets, operability incentives, and on-call costs that does not weaken operability incentives for Delivery teams.

Production support as revenue insurance

Production support should be thought of as a revenue insurance policy. As insurance policies, You Build It Ops Run It and You Build It You Run It are opposites at scale in terms of risk coverage and costs.

You Build It Ops Run It offers a low degree of risk coverage, limits deployment throughput, and has a potential for revenue loss on unavailability that should not be underestimated. You Build It You Run It has a higher degree of risk coverage, with no limits on deployment throughput and a short TTR to minimise revenue losses on failure.

You Build It You Run It becomes more cost effective as product demand and reliability needs increase, as deployment targets and availability targets are ratcheted up, and the need for Continuous Delivery and operability becomes ever more apparent. The right revenue insurance policy should be chosen based on the number of teams and applications, and the range of availability targets. The fuzzy model below can be used to distinguish when You Build It You Run It is appropriate – when availability targets are demanding and the number of teams and applications is 10+.

The Who Runs It series

You Build It Ops Run It
You Build It You Run It
You Build It Ops Run It at scale
You Build It You Run It at scale
You Build It Ops Sometimes Run It
Implementing You Build It You Run It at scale
You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

Availability targets

11 May, 2020 / Steve Smith

Why is it important to measure operability? What should be the trailing indicators and leading indicators of operability?

TL;DR:

Reliability means balancing the risk of unavailability with the cost of sustaining availability.
Availability can be understood as a level of availability, from 99.0% to 99.999%.
Increasing an availability level incurs up to an order of magnitude more engineering effort.
An availability target is selected by a product manager based upon the maximum revenue loss they can tolerate for their service.

Introduction

Organisations must have reliable IT applications at the heart of their business if they are to innovate in changing markets. Reliability is defined by Patrick O’Connor and Andre Kleyner in Practical Reliability Engineering as “the probability that [a system] will perform a required function without failure under stated conditions for a stated period of time”. There must be an investment in reliability if propositions are to be rapidly delivered to customers and remain highly available.

Reliability means balancing the risk of application unavailability with the cost of sustaining application availability. Application unavailability will incur opportunity costs related to lower customer revenue, loss of confidence, and reputational damage. On the other hand, sustaining application availability also incurs opportunity costs, as engineering time must be devoted to operational work instead of new product features visible to customers. In Site Reliability Engineering, Betsey Beyer et al state “cost does not increase linearly… an incremental improvement in reliability may cost 100x more than the previous increment”.

Furthermore, the user experience of application availability will be constrained by lower levels of user device availability. For example, a smartphone with 99.0% availability will not allow a user to experience a website with 99.999% availability. 100% availability is never the answer, as the cost is too high and users will not perceive any benefits. Maximising feature delivery will harm availability, maximising availability will harm feature delivery.

Availability targets

Application availability can be understood as an availability target. An availability target represents a desired level of availability, and is usually expressed as a number of nines. Each additional nine of availability represents an order of magnitude more of engineering effort. For example, 99.0% availability means “two nines”, and if its engineering effort is N then 99.9% availability would require 10N in engineering effort.

An availability target should be coupled to product risk. This will ensure a product owner translates their business goals into operational objectives, and empowers their team to strike a balance between application availability and costs. The goal is to improve the operability of an application until its availability target is met, and can be sustained.

For example, consider a Fruits R Us organisation with 3 availability targets for its applications – 99.0% (“two nines”), 99.5% (“two and a half nines”), and 99.9% (“three nines”). The 99.9% availability target allows for a maximum of 0.1% unavailability per month, which in a 30 day month equates to a maximum of 43 mins 12 seconds unavailability. It also requires 10 times more engineering effort to sustain than the 99.0% availability target.

In Site Reliability Engineering, the maximum unavailability per month for an availability target is expressed as an error budget. Error budgets are are a method of formalising the shared ownership and prioritisation of product features versus operational features, and might be used to halt production deployments during periods of sustained unavailability.

Availability target selection

A product owner should select an availability target by comparing their projected revenue impact of application unavailability with the set of possible availability targets. They need to consider if their application is tied directly or indirectly to revenue, their application payment model, what expectations users will have, and what level of service is provided by competitors in the same marketplace.

First, an organisation needs to establish a minimum Cost Of Delay revenue loss for each availability target, on loss of availability. Then a product owner should estimate the Cost Of Delay for their application being unavailable for the duration of each target. The Value Framework by Joshua Arnold et al can be used to estimate the financial impact of the loss of an application:

Increase Revenue – does the application increase sales levels
Protect Revenue – does the application sustain current sales levels
Reduce Costs – does the application reduce current incurred costs
Avoid Costs – does the application reduce potential for future incurred costs

This will allow a product owner to balance their need for application availability with the opportunity costs associated with consistently meeting that availability level.

For example, at Fruits R Us a set of revenue bands is attached to existing availability targets, based on an analysis of existing revenue streams. The 99.0% availability target is intended for applications where the Cost Of Delay on unavailability is at least £50K in 7h 12m, whereas 99.9% is for unavailability that could cost £1M or more in only 43m 12s.

A proposed Bananas application is expected to produce a monthly revenue increase of £40K. It is intended to replace an Apples application, which has an availability target of 99.0% sustained by an average of 8 engineering hours per month. The Bananas product owner believes customers will have heightened reliability expectations due to superior competitor offerings in the marketplace, and that Bananas could lose the £40K revenue increase within 2 hours of unavailability in a month. The 99.0% availability target can fit 2 hours of unavailability into its 7h 12m ceiling, but cannot fit a £40K revenue loss. The 99.5% availability target is selected, and the Bananas product owner knows at 5N engineering effort that 40 engineering hours will be needed per month to invest in operational features.

Acknowledgements

Thanks to Thierry de Pauw for the review

Month: May 2020

Implementing You Build It You Run It at scale

Introduction

By availability target

Optimising costs

Acknowledgements

You Build It You Run It at scale

Introduction

Continuous Delivery and Operability at scale

Production support as revenue insurance

Acknowledgements

Availability targets

TL;DR:

Introduction

Availability targets

Availability target selection

Acknowledgements

Recent Posts

Categories

Archives

Recent Posts

Recent Comments

Archives

Categories

Meta

Implementing You Build It You Run It at scale

Introduction

By availability target

Optimising costs

Acknowledgements

You Build It You Run It at scale

Introduction

Continuous Delivery and Operability at scale

Production support as revenue insurance

Acknowledgements

Availability targets

TL;DR:

Introduction

Availability targets

Availability target selection

Acknowledgements

Recent Posts

Tags

Categories

Archives

Recent Posts

Recent Comments

Archives

Categories

Meta