On Tech

Tag: Resilience As A Continuous Delivery Enabler

Resilience as a Continuous Delivery enabler

Why does optimising for robustness leave organisations in a state of Discontinuous Delivery, and vulnerable to failure? How does optimising for resilience improve reliability, and how can it encourage the adoption of Continuous Delivery?

The Resilience as a Continuous Delivery Enabler series:

  1. The cost and theatre of Optimising For Robustness
  2. When Optimising For Robustness fails
  3. The value of Optimising For Resilience
  4. Resilience as a Continuous Delivery enabler

TL;DR:

  1. Optimising For Robustness – prioritising MTBF over MTTR – is an antiquated, flawed approach to IT reliability that results in Discontinuous Delivery and an operational brittleness that begets failure
  2. If an organisation has previously optimised for robustness, a Continuous Delivery programme focussed on throughput is unlikely to succeed
  3. Optimising For Resilience – prioritising MTTR over MTBF – is a superior reliability strategy that enables an organisation to gracefully extend to limit the impact of failures, and position itself for sustained adaptability
  4. Resilience As A Continuous Delivery Enabler is a heuristic that advocates resilience as the focus of a Continuous Delivery programme
  5. Improving the resilience of services makes it easier to reduce Risk Management Theatre, and gradually adopt Continuous Delivery

The tradition of robustness

As software continues to eat the world, organisations must have reliable IT services at the heart of their business if they are to innovate in rapidly changing markets. Reliability is defined by Patrick O’Connor and Andre Kleyner in Practical Reliability Engineering as “the probability that [a system] will perform a required function without failure under stated conditions for a stated period of time“, or as a function of Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR).

The traditional IT reliability strategy is Optimising For Robustness. This means prioritising a higher MTBF over a lower MTTR for IT services, by attempting to maintain a failure-free production environment. It is based on the belief that a production environment is a complicated system, in which services are homogeneous processes with predictable interactions in repeatable conditions. Failures are believed to be caused by isolated, faulty changes and are considered entirely preventable. When an organisation optimises for robustness, it will usually rely upon:

  • End-To-End Testing to verify the functionality of a new service version against its unowned dependent services
  • Change Advisory Boards to assess, prioritise, and approve the deployment of new service versions
  • Change Freezes to restrict the deployment of new service versions for a period of time due to market conditions

These practices are inherently slow, and a form of Risk Management Theatre 1. End-To-End Testing incurs long execution times and significant maintenance time, and defects can still occur. Change Advisory Boards involve slow approval times, and deployments can still fail. Change Freezes cause huge productivity impediments, and failures can still happen. In addition, the long deployment lead times caused by robustness practices ensure a large batch of requirements and technology changes per release, which actually increases the risk of failure 2.

Optimising For Robustness constrains the stability and throughput of IT delivery such that business demand cannot be satisfied. It is the predominant reason why so many organisations are trapped in a state of Discontinuous Delivery.

The constancy of failure

Ironically, Optimising For Robustness leaves an organisation ill-equipped to deal with failure. In Resilience and Precarious Success, Mary Patterson and Robert Wears describe how “fundamental goals (such as safety) tend to be sacrificed with increasing pressure to achieve acute goals (faster, better, and cheaper)“. When an organisation optimises for robustness it will under-invest in its production environment, resulting in unimplemented “non-functional” requirements, inadequate telemetry 3, snowflake infrastructure, and a fragile service architecture. This will be considered acceptable, as failures are expected to be rare.

However, it is naive to think of a production environment of running services as a complicated system. A production environment is an intractable mass of heterogeneous processes, with unpredictable interactions occurring in unrepeatable conditions. It is a complex system of emergent behaviours, in which the cause and effect of an event can only be perceived in retrospect. Furthermore, as Richard Cook explains in How Complex Systems Fail the complexity of these systems makes it impossible for them to run without multiple flaws“. A production environment is perpetually in a state of near-failure.

A failure occurs when multiple faults unexpectedly coalesce such that one or more business operations cannot succeed. It will create a revenue cost expressed as a function of cost per unit time and duration, and in an organisation optimised for robustness the impact can be considerable. The sunk cost incurred until failure detection can be high, as unimplemented “non-functional” requirements and inadequate telemetry will restrict situational awareness. The opportunity cost until failure resolution can also be high, as snowflake infrastructure and a fragile architecture will increase failure blast radius. In addition, the loss of customer confidence and increased failure demand will create further opportunity costs.

Consider a Fruits-U-Like website optimised for robustness. Its third party registration service begins to suffer under load, and new customers are rejected on checkout. The failure has a static cost per day of £80k, but with no telemetry the failure is not detected for 3 days. The checkout team then produces a hotfix within a day, and it is deployed the following day. The revenue cost is £400K, with a £240K sunk cost and a £160K opportunity cost.

Optimising For Robustness encourages an attitude Sidney Dekker calls the Bad Apple Theory, in which a system is considered absolutely reliable except for the actions of unreliable employees. When a failure occurs, the combination of the Bad Apple Theory and hindsight bias will produce an oppressive culture of naming, blaming, and shaming the individuals involved. This discourages knowledge sharing and collaboration.

An interesting consequence of Optimising For Robustness is Dual Value Streams. An organisation optimised for robustness will have feature value streams with deployment lead times of weeks or months. When a failure is detected its sunk cost will create urgency, and people will want to immediately minimise the opportunity cost duration. That will lead to robustness practices being sacrificed for speed, in a truncated fix value stream with an MTTR of hours or days 4. The robustness practices omitted from the fix value stream should be considered theatre until proven otherwise.


Continuous Delivery improves the stability and throughput of IT delivery, but it is hard. A Continuous Delivery programme in an organisation optimised for robustness will not succeed if it is focussed solely on throughput. The most significant accelerator of deployment lead time will likely be the removal of robustness risk management theatre, but practices like End-To-End Testing will be woven into the fabric of the organisation 5. If they are forcibly removed, Continuous Delivery will be blamed for the first subsequent production failure. Resisters will lobby for more robustness practices, and a return to the status quo is all but inevitable. Unfortunately, it only takes one inopportune failure for a Continuous Delivery programme to be cancelled.

The value of resilience

A far more effective reliability strategy is Optimising For Resilience. This means prioritising a lower MTTR over a higher MTBF for IT services, by rapidly responding to failures in a production environment. Some classes of failure should never occur, some failures are more costly than others, and some safety-critical systems should never fail, but in general organisations should adhere to John Allspaw’s advice that “being able to recover quickly from failure is more important than having failures less often“.

Resilience can be thought of as graceful extensibility. In Four Concepts for Resilience and their Implications for Systems Safety in the Face of Complexity, David Woods describes graceful extensibility as “the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries“. The graceful extensibility of a system is derived from its adaptive capacity, which represents the capacity for adaptation when a failure occurs.

Erik Hollnagel et al break down resilience in Resilience Engineering In Practice using a conceptual model known as the Four Cornerstones of Resilience:

The cornerstones are non-linear, complementary aspects of resilience:

  • Anticipation is imagining the potential for future failures, and countering those scenarios in advance
  • Monitoring is inspecting operating conditions, and alerting when anomalies occur
  • Response is using guidelines, heuristics, improvisation, and situational awareness to mitigate a failure
  • Learning is understanding the circumstances of a near-miss or failure, and sharing the observations

Optimising For Resilience means creating a production environment in which running IT services can gracefully extend to deal with the unpredictable behaviours, unexpected changes, and periods of failure that will inevitably occur. When a service has sufficient adaptive capacity the cost per unit time and duration of production failures can potentially be minimised, reducing the direct revenue costs and indirect opportunity costs caused by a failure.

A lower MTTR can be achieved by investing in the operability of IT services. Operability is defined as “the ability to keep a system in a safe and reliable functioning condition“, and is associated with a set of practices:

Each of these will increase the capacity of a service to adapt to unexpected operating conditions, and produce a more effective incident response:

  • Development: an Adaptive Architecture limits the blast radius of a failure, and Feature Toggles allow features to be limited, tested in isolation, or turned off on failure
  • Testing: Smoke Testing verifies service health, and Chaos Engineering uncovers latent failures in production
  • Infrastructure: Automated Provisioning creates reproducible environments, and Self-Healing automatically restores failed service instances
  • Telemetry: Logging radiates data on traffic, errors, latency, and saturation, and Monitoring visualises service metrics and events in a time series. Anomaly detection identifies events that breach normal operating conditions and Alerting notifies operators of abnormalities to act on. User analytics show success rates for user journeys
  • People: Shared On-Call fosters a “You Build It, You Run It” culture and increases situational awareness, and Runbooks are a repository for operational knowledge. Blameless Post-Mortems uncover the multiple contributors to a near-miss or failure and suggest future preventative measures, while respecting the best efforts of individuals and the dangers of hindsight bias 1

If Fruits-U-Like was optimised for resilience its checkout team could receive an alert within 5 minutes of third party registration errors. A Circuit Breaker would allow some registrations to succeed, and a Bulkhead could trigger an anonymous checkout for failed registrations. This could decrease the cost per day to £5K, and a hotfix could be deployed within 3 hours. The revenue cost would be £625, with a £18 sunk cost and a £607 opportunity cost.

Optimising For Resilience sets the foundation for an organisation to act on market disruption and innovate. Once an organisation has the required level of graceful extensibility, it can continue to invest in its people and technology to achieve sustained adaptability. Sustained adaptability has been described by David Woods as “the ability to adapt to future surprises as conditions continue to evolve“, and can be thought of as innovation capability. An organisation that can quickly adapt to unexpected business events will hold a powerful First Mover Advantage over its competitors.

Resilience as a Continuous Delivery enabler

There is no recipe for success with Continuous Delivery, as every organisation is a complex, adaptive system with its own circumstances and constraints. However, if an organisation has previously optimised for robustness and is in a state of Discontinuous Delivery there is a heuristic that can be used:

Resilience as a Continuous Delivery enabler

This can be applied to bootstrapping Continuous Delivery:

This bootstrap sequence can guide the formative steps of a Continuous Delivery programme, and build confidence throughout an organisation. It demonstrates a commitment to stability, transparency, and reliability which will help to win over resisters. Storing all code, configuration, infrastructure definitions, documents, scripts etc. in version control eliminates the predominant source of failure demand. Creating stability and throughput indicators helps people to understand their delivery capabilities, and make better decisions 7.

Improving production reliability minimises the cost of failure, and lays the groundwork for challenging robustness risk management theatre later on. Automated anomaly detection and alerting will speed up the detection time of an anticipated failure, reducing its sunk cost duration to seconds or minutes. An adaptive architecture will limit the blast radius of a failure, decreasing both cost per unit time and duration.

Implementing production telemetry early on also provides insurance for unsafe-to-fail situations. Logging, monitoring, and analytics dashboards can identify the contributing technical faults to a failure, and when they first entered production. If resisters blame Continuous Delivery for a failure, the data will pinpoint which faults were recent and which were lying dormant in production beforehand.

Once the Continuous Delivery programme reaches the experimentation phase, other sources of adaptive capacity can be created with operability practices such as Capacity Planning, Self-Healing, Shared On-Call, and Blameless Post-Mortems. At the same time, the programme should widen its focus to include deployment throughput as well as deployment stability and production resilience.

The end of theatre

The key to removing robustness risk management theatre is to visualise its costs to stakeholders and offer a practical alternative, rather than rely on theoretical arguments about wait times or defect discovery rates. Using the Resilience As A Continuous Delivery Enabler heuristic ensures a Continuous Delivery programme can supply those visualisations, and outline an alternative approach from the outset.

Stakeholders should be made aware of their robustness risk management theatre with a showcase of the delivery awareness and production reliability improvements so far. The stability and throughput indicators will illustrate the historical cost of robustness practices, by visualising the disparity between deployment lead times and MTTR in the Dual Value Streams. Some carefully calibrated Chaos Engineering in a test environment 8 will demonstrate how MTTR has been shrunk to minutes or hours, by showing how failures can be managed with the new production telemetry and adaptive architecture. An MTTR an order of magnitude faster than deployment lead times will show stakeholders what a team can accomplish with minimal robustness practices.

Each robustness practice subsequently agreed to be risk management theatre should be incrementally replaced with the appropriate mix of Continuous Delivery and operability practices. End-To-End Testing should be superseded by a multi-faceted testing portfolio, in order to turn the resident testing strategy from a Test Ice Cream Cone into a Test Pyramid. This will reduce test execution times and maintenance costs, while simultaneously improving defect discovery rates:

Practice Quantity Frequency Duration Environment
Unit Testing 100 to 1000+ Per build < 30s total Local and Build
Acceptance Testing 10 to 100+ Per build < 10m total Local and Build
Exploratory Testing 10 to 100+ Per build Timebox Local and 3rd Party
Contract Testing ~20 Per 3rd party deploy < 1m 3rd Party
Smoke Testing ~5 Per deploy < 5m All
Monitoring 10 to 100+ Always < 10s All
Anomaly detection 10 to 100+ < 1m < 10s All
Adaptive architecture N/A Always N/A All

Change Advisory Boards and Change Freezes should end in favour of incremental deployments and incremental launches. Blue Green Deployments and Canary Deployments gradually direct users to a newly deployed service version, and users can be redirected to the old version on service failure. Dark Launching controls feature rollouts based on user demographics, and services can be operated in a degraded state on feature failure. Lightweight change management conversations should be reserved for unavoidably large releases, or turbulent market conditions.

Summary

Optimising For Robustness is an antiquated, flawed approach to IT reliability that results in long-term Discontinuous Delivery and an operational brittleness that begets failure. As John Allspaw has stated, reliability is “the presence of adaptive capacity, not the absence of failures“. Robustness is of value, but it must be rejected as an outcome if an organisation wants to innovate in changing markets.

Optimising For Resilience is a superior reliability strategy that enables an organisation to gracefully extend to limit the impact of failures, and position itself for sustained adaptability. It is a paradigm shift, in which people need to accept the inherent complexity within their IT services and the hard truth that failures are inevitable. This is neatly summarised by David Woods’ assertion that “graceful extensibility trades off with robust optimality“. An organisation optimised for robustness will reject sources of adaptive capacity such as Circuit Breakers as inefficiencies, but to an organisation optimised for resilience its graceful extensibility is more important than cost efficiencies.

If an organisation has optimised for robustness a Continuous Delivery programme focussed on throughput alone is unlikely to succeed. Resilience As A Continuous Delivery Enabler is a heuristic that advocates resilience as the focus of Continuous Delivery, and using it to bootstrap a Continuous Delivery programme improves production reliability from the outset. Improving the resilience of services by an order of magnitude makes it easier to offer a series of practical alternatives to robustness risk management theatre, and reduce deployment throughput until there is a single value stream that can satisfy business demand 9.

1 Other robustness practices include manual regression testing, segregation of duties, artificial deployment limits, and uptime incentives

2 The Principles of Product Development Flow by Don Reinertsen describes in detail how large batch sizes increase risk

3 The DevOps Handbook by Patrick Debois et al defines telemetry as a logical grouping of logging, monitoring, anomaly detection, alerting, and user analytics

4 In ITIL these are termed Normal and Emergency Changes

5 The Anxiety Of Learning by Edgar Schein describes how people resist change due to learning and survival anxieties

6 How Complex Systems Fail by Richard Cook explains why hindsight bias is such an obstacle to understanding failures, and why root causes do not exist

7 Measuring Continuous Delivery by the author details how to measure the stability and throughput of IT delivery

8 Chaos Engineering should be restricted to test environments in an unsafe-to-fail culture

9 In ITIL these are termed Standard Changes

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

The value of Optimising For Resilience

What does it mean to optimise for resilience? Why is resilience so valuable to an organisation, and how can operability contribute to the adaptive capacity of IT services?

This is part of the Resilience As A Continuous Delivery Enabler series:

  1. The cost and theatre of Optimising For Robustness
  2. When Optimising For Robustness fails
  3. The value of Optimising For Resilience
  4. Resilience as a Continuous Delivery enabler

The value of resilience

When an organisation wants to improve the reliability of its IT services it should Optimise For Resilience. Resilience is the ability to “absorb or avoid damage without suffering complete failure“, and it is immensely valuable in IT. A production environment is a complex system of partial failures in which the potential for catastrophe is ever-present, so an ability to resist failure is vital.

Resilience can be thought of as graceful extensibility. In Four Concepts for Resilience and their Implications for Systems Safety in the Face of Complexity, David Woods describes graceful extensibility as “the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries“. The graceful extensibility of a system is derived from its adaptive capacity, which represents the capacity for adaptation when a failure occurs.

Erik Hollnagel et al break down resilience in Resilience Engineering In Practice using a conceptual model known as the Four Cornerstones of Resilience:

The cornerstones are non-linear, complementary aspects of resilience:

  • Anticipation is knowing what to expect. This is imagining the potential for future failures, and mitigating for those scenarios in advance
  • Monitoring is knowing what to look for. This is inspecting past and present operating conditions, and alerting when anomalies occur
  • Response is knowing what to do. This is using guidelines, heuristics, improvisation skills, and situational awareness to mitigate a failure
  • Learning is knowing what has happened. This is understanding the circumstances of a near-miss or failure, and sharing the observations

Creating adaptive capacity with Operability

Optimising For Resilience means creating a production environment in which running IT services can gracefully extend to deal with the unpredictable behaviours, unexpected changes, and periods of failure that will inevitably occur. When a service has sufficient adaptive capacity the cost per unit time and duration of production failures can potentially be minimised, reducing the direct revenue costs and indirect opportunity costs caused by a failure.

The adaptive capacity of IT services can be increased by explicitly prioritising a lower Mean Time To Repair (MTTR) over a higher Mean Time Between Failures (MTTR). Some classes of failure should never occur, some failures are more costly than others, and safety-critical services should never have failures, but in general organisations should adhere to John Allspaw’s advice that “being able to recover quickly from failure is more important than having failures less often“.

A lower MTTR can be achieved by investing in the operability of IT services. Operability is defined as “the ability to keep a system in a safe and reliable functioning condition“, and is associated with a set of practices:

Each of these will increase the capacity of a service to adapt to unexpected operating conditions, and produce a more effective incident response:

  • Development: an Adaptive Architecture limits the blast radius of a failure, and Feature Toggles allow features to be limited, tested in isolation, or turned off on failure
  • Testing: Smoke Testing verifies service health, and Chaos Engineering uncovers latent failures in production
  • Infrastructure: Automated Provisioning creates reproducible environments, and Self-Healing automatically restores failed service instances
  • Telemetry: Logging radiates data on traffic, errors, latency, and saturation, and Monitoring visualises service metrics and events in a time series. Anomaly detection identifies events that breach normal operating conditions and Alerting notifies operators of abnormalities to act on. User analytics show success rates for user journeys
  • People: Shared On-Call fosters a “You Build It, You Run It” culture and increases situational awareness, and Runbooks are a repository for operational knowledge. Blameless Post-Mortems uncover the multiple contributors to a near-miss or failure and suggest future preventative measures, while respecting the best efforts of individuals and the dangers of hindsight bias 1

For example, incident response at Fruits-U-Like would be much improved if the organisation was optimising for resilience. Assume its third party registration service starts to struggle under load, new customers cannot checkout their purchases, and the failure cost per unit time is £80K per day. The checkout team would receive an automated alert for the failure, and their logging and monitoring dashboards would show a correlation between checkout and registration failures. The team would be able to triage a third party registration error within 5 minutes, and self-deploy an improvement to connection handling within a day. The failure would have a 1 day repair cost of £80K, with a detection sunk cost of £278 and a remediation opportunity cost of £79,722.

If the checkout team implemented an Adaptive Architecture they could combine a Circuit Breaker, a Bulkhead, and a Feature Toggle in anticipation of registration errors. If the registration service struggled under load the Circuit Breaker would regulate registration requests to allow a percentage to succeed, and the Bulkhead would warn the checkout frontend to skip registration for some customers. This approach would reduce the failure cost per unit time to a marketing opportunity cost of £5K a day. The checkout team would not receive an alert, but within minutes their dashboards would highlight registration errors and they could use a Feature Toggle to enable anonymous checkouts for new customers. This would allow them to deploy their connection handling fix within 3 hours with no customer impact. The result would be a 3 hour repair cost of £625, with a sunk cost of £18 and an opportunity cost of £607.

Optimising For Resilience sets the foundation for an organisation to act on market disruption and innovate. Once an organisation has the required level of graceful extensibility, it can continue to invest in its people and technology to achieve sustained adaptability. Sustained adaptability has been described by David Woods as “the ability to adapt to future surprises as conditions continue to evolve“, and can be thought of as innovation capability. An organisation that can quickly adapt to unexpected business events will hold a powerful First Mover Advantage over its competitors.

1 In How Complex Systems Fail, Richard Cook warns that “hindsight bias remains the primary obstacle to accident investigation. There is no such thing as a root cause in a complex production system, nor a blameworthy individual

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. Responding To Failure When Optimising For Robustness
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

When Optimising For Robustness fails

Why is it wrong to assume failures are preventable in IT? Why does optimising for robustness leave organisations ill-equipped to deal with failure, and what are the usual outcomes?

This is part of the Resilience as a Continuous Delivery enabler series:

  1. The cost and theatre of Optimising For Robustness
  2. When Optimising For Robustness fails
  3. The value of Optimising For Resilience
  4. Resilience as a Continuous Delivery enabler

Underinvesting in operability

An organisation that optimises for robustness will attempt to maintain a production environment free from failure. This approach is based on the belief that failures in IT services are caused by isolated, faulty changes that are entirely preventable. A production environment is viewed as a set of homogeneous processes, with predictable interactions occurring in repeatable conditions. This matches the Cynefin definition of a complicated system, in which expert knowledge can be used to predict the cause and effect of events.

Optimising for robustness will inevitably lead to an overinvestment in pre-production risk management, and an underinvestment in production risk management. Symptoms of underinvestment include:

  • Stagnant requirements – “non-functional” requirements are deprioritised for weeks or months at a time
  • Snowflake infrastructure – environments are manually created and maintained in an unreproducible state
  • Inadequate telemetry – logs and metrics are scarce, anomaly detection and alerting are manual, and user analytics lack insights
  • Fragile architecture – services are coupled, service instances are stateful, failures are uncontained, and load vulnerabilities exist
  • Insufficient training – operators are not given the necessary coaching, education, or guidance

This underinvestment creates an inoperable production environment, which makes it difficult for operators to keep IT services in a safe and reliable functioning condition. This will often be deemed acceptable, as production failures are expected to be rare.

The constancy of failure

A production environment of running IT services is not a complicated system. It is an intractable mass of heterogeneous processes, with unpredictable interactions occurring in unrepeatable conditions. It is a complex system of emergent behaviours, in which the cause and effect of an event can only be perceived in retrospect.

As Richard Cook explains in How Complex Systems Fail, “the complexity of these systems makes it impossible for them to run without multiple flaws being present“. A production environment always contains partial faults, and is constantly in a state of near-failure.

A failure will occur when unrelated faults unexpectedly coalesce such that one or more functions cannot succeed. Its revenue cost will be a function of cost per unit time and duration, with cost per unit time the economic impact and duration the time between start and end. Its opportunity costs will come from loss of customer confidence, and increased failure demand slowing feature development.

An organisation optimised for robustness will be ill-equipped to deal with a failure when it does occur. The inoperability of the production environment will produce a brittle incident response:

  • Stagnant requirements and insufficient training will make it difficult to anticipate how services might fail
  • Inadequate telemetry will impede the monitoring of normal versus abnormal operating conditions
  • Snowflake infrastructure and a fragile architecture will prevent a rapid response to failure

For example, at Fruits-U-Like a third party registration service begins to suffer under load. The website rejects new customers on checkout, and a failure begins with a static cost per unit time of £80K per day. A lack of telemetry means the operations team cannot triage for 3 days. After triage an incident is assigned to the checkout team, who improve connection handling within a day. The Change Advisory Board agrees the fix can skip End-To-End Testing, and it is deployed the following day. The failure has a 5 day repair cost of £400K, with a detection sunk cost of £240K and a remediation opportunity cost of £160K.

After a failure, the assumption that failures are caused by individuals will lead to a blame culture. There will be an attitude Sidney Dekker calls the Bad Apple Theory, in which production is considered absolutely reliable bar the actions of a few unreliable employees. The combination of the Bad Apple Theory and hindsight bias will create an oppressive culture of naming, blaming, and shaming the individuals involved. This discourages the sharing of operational knowledge and organisational learnings.

The Dual Value Streams countermeasure

An organisation optimised for robustness will be in a state of Discontinuous Delivery. Attempting to increase the Mean Time Between Failures (MTBF) with practices such as End-To-End Testing will increase feature lead times to the extent that business demand will be unsatisfiable. However, the rules for deploying a production fix will be very different.

When a production fix for a failure is available, people will share a sense of urgency. Regardless of how cost per unit time is estimated, there will be a recognition that a sunk cost has been incurred and an opportunity cost needs to be minimised. There will be a consensus that a different approach is required to avoid long feature lead times.

Dual Value Streams is a common countermeasure to failure when optimising for robustness. For each technology value stream in situ, there will actually be two different value streams. The feature value stream will retain all the advertised pre-production risk management practices, and will take weeks or months to complete. The fix value stream will strip out most if not all pre-production activities, and will take days to complete.

At Fruits-U-Like, that means a 12 week feature value stream from code to production and a 5 day fix value stream from failure start to end 2.

Dual Value Streams signify Discontinuous Delivery, but they also show potential for Continuous Delivery. The fix value stream indicates the lead times that can be accomplished when people have a shared sense of urgency, actively collaborate on releases, and omit the risk management theatre.

1 In The DevOps Handbook by Patrick Debois et al telemetry is defined as a logical grouping of logging, monitoring, anomaly detection, alerting, and user analytics

2 Measuring Continuous Delivery details why deployment failure recovery time should include development time and deployment lead time should not. Deployment failure recovery time is measured from failure start to failure end, while deployment lead time is measured from master commit to production deployment

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. Responding To Failure When Optimising For Robustness
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

The cost and theatre of Optimising For Robustness

Why do so many organisations optimise their IT delivery for robustness? What risk management practices are normally involved, and do their capabilities outweigh their costs?

This is part of the Resilience as a Continuous Delivery enabler series:

  1. The cost and theatre of Optimising For Robustness
  2. When Optimising For Robustness fails
  3. The value of Optimising For Resilience
  4. Resilience as a Continuous Delivery enabler

The tradition of robustness

As software continues to eat the world, organisations must position IT at the heart of their business strategy. The speed of IT delivery needs to be capable of satisfying customer demand, and at the same time the reliability of IT services must be ensured to protect daily business operations. In Practical Reliability Engineering, Patrick O’Connor and Andre Kleyner define reliability as “The probability that [a system] will perform a required function without failure under stated conditions for a stated period of time, or as a function of Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR). When an organisation has unreliable IT services its business operations are left vulnerable to IT outages, and the cost of downtime could prove ruinous if market conditions are unfavourable.

Many organisations have a lack of confidence in their IT services, and an ingrained fear of failure. There is often a simultaneous belief that failures are preventable, based on the assumption that IT services are predictable and failures are caused by isolated changes. In such circumstances an organisation will traditionally Optimise For Robustness. It will focus on maximising the ability of its IT services to “resist change without adapting [their] initial stable configuration, by implicitly favouring a higher MTBF over a lower MTTR. It will use robustness-centric risk management practices in its technology value streams to reduce the risk of future failures, such as 1:

  • End-To-End Testing to verify the functionality of a new service version against its unowned dependent services
  • Change Advisory Boards to assess, prioritise, and approve the deployment of new service versions
  • Change Freezes to restrict the deployment of new service versions for a period of time derived from market conditions

Consider a fictional Fruits-U-Like organisation, with development teams working to 2 week iterations and a quarterly release cycle. Fruits-U-Like has optimised itself for robustness ever since a 24 hour website outage 5 years ago. Each release goes through 6 weeks of End-To-End Testing with the testing team, a 2 week Change Advisory Board, and 1 week of preparation with the operations team. There are also several 4 week Change Freezes throughout the year, to coincide with marketing campaigns.

The costs and theatre of robustness

Robustness is a desirable capability of an IT service, but optimising for robustness invariably means spending too much time for too little risk reduction. The risk management practices used will be far more costly and less valuable than expected:

If the next Fruits-U-Like release was estimated to be worth £50K per day in new revenue, the 12 week lead time would create a total opportunity cost of £4.2 million. This would include the handover delays between the development, testing, and operations teams due to misaligned priorities. If a Change Freeze delayed the deployment by another 4 weeks the opportunity cost would increase to £5.6 million.

These risk management practices are what Jez Humble calls Risk Management Theatre. They are based on the misguided assumption that preventative controls on everyone will prevent anyone from making a mistake. Furthermore, they actually increase risk by ensuring a large batch size and a sizeable amount of requirements/technology changes per service version 2. They impede knowledge sharing, restrict situational awareness, create enormous opportunity costs, and doom organisations to a state of Discontinuous Delivery.

1 Other practices include manual regression testing, segregation of duties, and uptime incentives for operators

2 The Principles of Product Development Flow by Don Reinertsen describes in detail how large batch sizes increase risk

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

© 2020 Steve Smith

Theme by Anders NorénUp ↑