On Tech

GitOps is a placebo

“In 2017, I dismissed GitOps as a terrible portmanteau for Kubernetes Infrastructure as Code. Since then, Weaveworks has dialled up the hype, and GitOps is now promoted as a developer experience as well as a Kubernetes operating model.

I dislike GitOps because it’s a sugar pill, and it’s marketed as more than a sugar pill. It’s just another startup sharing what’s worked for them. It’s one way of implementing Continuous Delivery with Kubernetes. Its ‘best practices’ aren’t best for everyone, and can cause problems.

The benefits of GitOps are purely transitive, from the Continuous Delivery principles and Infrastructure as Code practices implemented. It’s misleading to suggest GitOps has a new idea of substance. It doesn’t.”

Steve Smith

TL;DR:

  • GitOps is defined by Weaveworks as ‘a way to do Kubernetes cluster management and application delivery’.
  • A placebo is defined by Merriam-Webster as ‘a usually pharmacologically inert preparation, prescribed more for the mental relief of the patient than for its actual effect on a disorder’.
  • GitOps is a placebo. Its usage may make people happy, but it offers nothing that cannot already be achieved with existing Continuous Delivery principles and Infrastructure as Code practices.

An unnecessary rebadging

In 2017, Alexis Richardson of Weaveworks coined the term GitOps in Operations by Pull Request. He defined GitOps as:

  • AWS resources and Kubernetes infrastructure are declarative.
  • The entire system state is version controlled in a single Git repository.
  • Operational changes are made by GitHub pull request into a deployment pipeline.
  • Configuration drift detection and correction happen via kubediff and Weave Flux.

The article outlines the Weaveworks deployment and operational capabilities for their Weave Cloud product, including declarative Kubernetes configuration in Git and a cloud native stack in AWS. There is an admirable caveat of ‘this is what works for us, and you may disagree’, which is unfortunately absent in later GitOps marketing.

As a name, GitOps is an awful portmanteau, ripe for confusion and misinterpretation. Automated Kubernetes provisioning does not require ‘Git’, nor does it encompass all the ‘Ops’ activities required for live user traffic. It is a small leap to imagine organisations selling GitOps implementations that Weaveworks do not recognise as GitOps. 

As an application delivery method, GitOps offers nothing new. Version controlling declarative infrastructure definitions, correcting configuration drift, and monitoring infrastructure did not originate from GitOps. In their 2010 book Continuous Delivery, Dave Farley and Jez Humble outlined infrastructure management principles:

  • Declarative. The desired state of your infrastructure should be specified through version controlled configuration.
  • Autonomic. Infrastructure should correct itself to the desired state automatically.
  • Monitorable. You should always know the actual state of your infrastructure through instrumentation and monitoring.

In 2016, The DevOps Handbook by Gene Kim et al cited the 2014 State Of DevOps Report finding that ‘the use of version control by Ops was the highest predictor of both IT and organisational performance’, and recommended a single source of truth for the entire system. In the same year, Kief Morris established infrastructure definition practices in Infrastructure as Code 1st Edition, and reaffirmed them in 2021 in Infrastructure as Code 2nd Edition.

GitOps is simply a rebadging of Continuous Delivery principles and Infrastructure as Code practices, with a hashtag for a name and contemporary tools. Those principles and practices predate GitOps by some years. 

No new ideas of substance

In 2018, Weaveworks published their Guide to GitOps, to explain how GitOps differs from Continuous Delivery and Infrastructure as Code. GitOps is redefined as ‘an operating model for Kubernetes’ and ‘a path towards a developer experience for managing applications’. Its principles are introduced as:

  • The entire system described declaratively.
  • The canonical desired system state versioned in Git.
  • Approved changes that can be automatically applied to the system.  
  • Software agents to ensure correctness and alert on divergence.

These are similar to the infrastructure management principles in Continuous Delivery.

GitOps best practices are mentioned, and expanded by Alexis Richardson in What is GitOps. They include declaring per-environment target states in a single Git repository, monitoring Kubernetes clusters for state divergence, and continuously converging state between Git and Kubernetes via Weave Cloud. These are all sound Infrastructure as Code practices for Kubernetes. However, positioning them as best practices for continuous deployment is wrong. 
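
The state convergence practice can be pictured as a reconciliation loop. Below is a minimal Python sketch of the idea, assuming hypothetical fetch_desired_state and fetch_actual_state stand-ins for a Git checkout and a cluster query; it illustrates the principle, and is not how Weave Flux or kubediff are implemented.

import time

def fetch_desired_state() -> dict:
    # Hypothetical stand-in for reading declarative definitions from a Git repository.
    return {"orders-deployment": {"image": "orders:1.4.2", "replicas": 3}}

def fetch_actual_state() -> dict:
    # Hypothetical stand-in for querying the cluster for its current configuration.
    return {"orders-deployment": {"image": "orders:1.4.1", "replicas": 3}}

def diff_states(desired: dict, actual: dict) -> dict:
    # Return the resources whose actual state has diverged from the desired state.
    return {name: spec for name, spec in desired.items() if actual.get(name) != spec}

def converge(interval_seconds: int = 60, iterations: int = 1) -> None:
    # On each cycle, detect drift, alert on it, and apply the desired state.
    for _ in range(iterations):
        for name, spec in diff_states(fetch_desired_state(), fetch_actual_state()).items():
            print(f"drift detected for {name}, applying {spec}")
        time.sleep(interval_seconds)

converge(interval_seconds=0, iterations=1)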

What works for Weaveworks will not necessarily work in other organisations, because each organisation is a complex, adaptive system with its own unique context. Continuous Delivery needs context-rich, emergent practices, borne out of heuristics and experimentation. Context-free, prescriptive best practices are unlikely to succeed. Examples include:

  • Continuous deployment. Multiple deployments per day can be an overinvestment. If customer demand is satisfied by fortnightly deployments or less, separate developer and operations teams might be viable. In that scenario, Kubernetes is a poor choice, as both teams would need to understand it well for their shared deployment process.
  • Declarative configuration. There is no absolute right or wrong in declarative versus imperative configuration. Declarative infrastructure definitions can become thousands of lines of YAML, full of unintentional complexity. It is unwise to mandate either paradigm for an entire toolchain. 
  • Feature branching. Branching in an infrastructure repository can encourage large merges to main, and/or long-lived per-environment branches. Both are major impediments to Continuous Delivery. The DevOps Handbook notes the 2015 State Of DevOps Report showed ‘trunk-based development predicts higher throughput and better stability, and even higher job satisfaction and lower rates of burnout’.
  • Source code deployments. Synchronising source code directly with Kubernetes clusters violates core deployment pipeline practices. Omitting versioned build packages makes it easier for infrastructure changes to reach environments out of order, and harder for errors to be diagnosed when they occur.
  • Kubernetes infatuation. Kubernetes can easily become an operational burden, due to its substantial onboarding costs, steep learning curve, and extreme configurability. It can be hard to justify investing in Kubernetes, due to the total cost of ownership. Lightweight alternatives exist, such as AWS Fargate and GCP Cloud Run. 

The article offers no compelling reason why GitOps differs from Continuous Delivery or Infrastructure as Code. It claims GitOps creates a freedom to choose the tools that are needed, faster deployments and rollbacks, and auditable functions. Those capabilities were available prior to GitOps. GitOps has no new ideas of substance.

Transitive and disputable benefits

In 2021, Weaveworks published How GitOps Boosts Business Performance: The Facts. GitOps is redefined as ‘best practices and an operational model that reduces the complexity of Kubernetes and makes it easier to deliver on the promise of DevOps’. The Weave Kubernetes Platform product is marketed as the easiest way to implement GitOps. 

The white paper lists the benefits of GitOps:

  • Increased productivity – ‘mean time to deployment is reduced significantly… teams can ship 30-100 times more changes per day’
  • Familiar developer experience – ‘they can manage updates and introduce new features more rapidly without expert knowledge of how Kubernetes works’
  • Audit trails for compliance – ‘By using Git workflows to manage all deployments… you automatically generate a full audit log’
  • Greater reliability – ‘GitOps gives you stable and reproducible rollbacks, reducing mean time to recovery from hours to minutes’
  • Consistent workflows – ‘GitOps has the potential to provide a single, standardised model for amending your infrastructure’
  • Stronger security guarantees – ‘Git already offers powerful security guarantees… you can secure the whole development pipeline’

The white paper also explains the State of DevOps Report 2019 by Dr. Nicole Forsgren et al, which categorises organisations by their Software Delivery and Operational (SDO) metrics. There is a description of how GitOps results in a higher deployment frequency, reduced lead times, a lower change failure rate, reduced time to restore service, and higher availability. There is a single Weaveworks case study cited, which contains limited data.

These benefits are not unique to GitOps. They are transitive. They are sourced from implementing Continuous Delivery principles and Infrastructure as Code practices upstream of GitOps. Some benefits are also disputable. For example, Weaveworks do not cite any data for their increased productivity claim of ‘30-100 times more changes per day’, and for many organisations operational workloads will not be the biggest source of waste. In addition, developers will need some working knowledge of Kubernetes for incident response at least, and Kubernetes has an arduous learning curve.

Summary 

GitOps started out in 2017 as Weaveworks publicly sharing their own experiences in implementing Infrastructure as Code for Kubernetes, which is to be welcomed. Since 2018, GitOps has morphed into Weaveworks marketing a new application delivery method that offers nothing new. 

GitOps is simply a rebadging of 2010 Continuous Delivery principles and 2016 Infrastructure as Code practices, applied to Kubernetes. Its benefits are transitive, sourced from implementing those principles and practices that came years before GitOps. Some of those benefits can also be disputed. 

GitOps is well on its way to becoming the latest cargo cult, as exemplified by Weaveworks announcing a GitOps certification scheme. It is easy to predict the inclusion of a GitOps retcon that downplays Kubernetes, so that Weaveworks can future-proof their sugar pill against the inevitable decline in Kubernetes demand.

References 

  1. Continuous Delivery [2010] by Dave Farley and Jez Humble
  2. The DevOps Handbook [2016] by Gene Kim et al
  3. Infrastructure as Code: Managing Servers in the Cloud [2016] by Kief Morris
  4. Operations by Pull Request [2017] by Alexis Richardson
  5. Guide to GitOps [2018] by Weaveworks
  6. What is GitOps Really [2018] by Alexis Richardson
  7. How GitOps boosts business performance – the facts [2021] by Weaveworks
  8. Infrastructure as Code: Dynamic Systems for the Cloud Age [2021] by Kief Morris

Acknowledgements

Thanks to Dave Farley, Kris Buytaert, and Thierry de Pauw for their feedback.

Investing in Continuous Delivery

“Is it possible to overinvest in Continuous Delivery?

The benefits of Continuous Delivery are astonishing, so it’s tempting to say no. Keep on increasing throughput indefinitely, and enjoy the efficiency gains! But that costs time and money, and if you’re already satisfying customer demand… should you keep pushing so hard?

If you’ve already achieved Continuous Delivery, sometimes your organisation should invest its scarce resources elsewhere – for a time. Continuously improve in the fuzzy front end of product development, as well as in technology.”

Steve Smith

TL;DR:

  • Deployment throughput levels describe the effort necessary to implement Continuous Delivery in an enterprise organisation. 
  • Investing in Continuous Delivery means experimenting with technology and organisational changes to find and remove constraints in build, testing, and operational activities.
  • When such a constraint does not exist, it is possible to overinvest in Continuous Delivery beyond the required throughput level.

Introduction

Continuous Delivery means increasing deployment throughput until customer demand is satisfied. It involves radical technology and organisational changes. Accelerate by Dr. Nicole Forsgren et al describes the benefits of Continuous Delivery:

  • A faster time to market, and increased revenues.
  • A substantial improvement in technical quality, and reduced costs.
  • An uptick in profitability, market share, and productivity. 
  • Improved job satisfaction, and less burnout for employees.

If an enterprise organisation has IT as a Cost Centre, funding for Continuous Delivery is usually time-limited and orthogonal to development projects. Discontinuous Delivery and a historic underinvestment in continuous improvement are the starting point.

Continuous Delivery levels provide an estimation heuristic for different levels of deployment throughput versus organisational effort. A product manager might hypothesise their product has to move from monthly to weekly deployments, in order to satisfy customer demand. It can be estimated that implementing weekly deployments would take twice as much effort as monthly deployments.

Once the required throughput level is reached, incremental improvement efforts need to be funded and completed as business as usual. This protects ongoing deployment throughput, and the satisfaction of customer demand. The follow-up question is then how much more time and money to spend on deployment throughput. This can be framed as:

Is it possible to overinvest in Continuous Delivery?

To answer this question, a deeper understanding of how Continuous Delivery happens is necessary.

Investing with a deployment constraint

In an organisation, a product traverses a value stream with a fuzzy front end of design and development activities, and a technology value stream of build, testing, and operational activities. 

With a Theory Of Constraints lens, Discontinuous Delivery is caused by a constraint within the technology value stream. Time and money must be invested in technology and organisational changes from the Continuous Delivery canon, to find and remove that constraint. An example would be from You Build It Ops Run It, where a central operations team cannot keep up with deployment requests from a development team.

The optimal approach to implementing Continuous Delivery is the Improvement Kata. Improvement cycles are run to experiment with different changes. This continues until all constraints in the technology value stream are removed, and the flow of release candidates matches the required throughput level.

The overinvestment question can now be qualified as:

Is it possible to overinvest in Continuous Delivery, once constraints on deployment throughput are removed and customer demand is satisfied?

The answer depends on the amount of time and money to be invested, and where else that investment could be made in the organisation.

Investing without a deployment constraint

Indefinite investment in Continuous Delivery is possible. The deployment frequency and deployment lead time linked to a throughput level are its floor, not its ceiling. For example, a development team at its required throughput level of weekly deployments could push for a one hour deployment lead time.

Deployment lead time strongly correlates with technical quality. An ever-faster deployment lead time tightens up feedback loops, which means defects are found sooner, rework is reduced, and efficiency gains are accrued. The argument for a one hour deployment lead time is to ensure a developer receives feedback on their code commits within the same day, rather than the next working day with a one day deployment lead time. 

Advocating for a one hour deployment lead time that exceeds the required throughput level is wrong, due to: 

  • Context. A one day deployment lead time might mean eight hours waiting for test feedback from an overnight build, before a 30 minute automated deployment to production. Alternatively, it might mean a 30 minute wait for automated tests to complete, before an eight hour manual deployment. In the latter case, a developer might already receive actionable feedback on the same day.
  • Cost. Incremental improvements are insufficient for a one hour deployment lead time. Additional funding is inevitably needed, as radical changes in build, testing, and operational activities are involved. In Lean Manufacturing, this is the difference between kaizen and kaikaku. It is the difference between four minutes refactoring a single test in an eight hour test suite, and four weeks parallelising all those tests into a 30 minute execution time.  
  • Culture. Radical changes necessary for a one hour deployment lead time can encounter strong resistance when customer demand is already satisfied, and there is no unmet business outcome. The lack of business urgency makes it easier for people to refuse changes, such as a change management team declining to switch from a pre-deployment CAB approval to a post-deployment automated audit trail.  
  • Constraints. A one hour deployment lead time beyond the required throughput level means the team is no longer in Discontinuous Delivery. There is no constraint to find and remove in the technology value stream. There is instead an upstream constraint in the fuzzy front end. Time and money would be better invested in business development or product design, rather than Continuous Delivery. Removing the fuzzy front end constraint could shorten the cycle time for product ideas, and uncover new revenue streams.

The correlation between deployment lead time and technical quality makes an indefinite investment in Continuous Delivery tempting, but overinvestment is a real possibility. Redirecting continuous improvement efforts at the fuzzy front end after Continuous Delivery has been achieved is the key to unlocking more customer demand, raising the required throughput level, and creating a whole new justification for funding further Continuous Delivery efforts.

Example – Gardenz

Gardenz is a retailer. It has an ecommerce website that sells garden merchandise. It takes one week to deploy a new website version, and it happens once a month. The product manager estimates weekly deployments of new product features would increase customer sales. 

The Gardenz website is in a state of Discontinuous Delivery, as the required throughput level is unmet. The developers previously needed five days to establish monthly deployments, so ten days is estimated for weekly.

The Gardenz constraint is manual regression testing. It causes so much rework between developers and testers that deployment lead time cannot be less than one week.

The Gardenz developers and testers run improvement cycles to merge into a single delivery team, and replace their manual regression tests with automated functional tests. After fourteen days of effort, a deployment lead time of one day is possible. This allows the website to be deployed once a week, in under a day.

Gardenz has moved up a throughput level to weekly deployments, and the product manager is satisfied. Now they need to decide whether to invest further in daily deployments, beyond customer demand. As weekly deployments took 14 days, 28 days of time and money can be estimated for daily. 

The removal of the testing constraint on weekly deployments causes a planning constraint to emerge for daily deployments. New feature ideas cannot move through product planning faster than two days, no matter whether the deployment lead time is one day, one hour, or one minute. The product manager decides to invest the available time and money into removing the planning constraint. Daily deployments are earmarked for future consideration.

Example – Puzzle Planet

Puzzle Planet is a media publisher. Every month, it sells a range of puzzle print magazines to newsagent resellers. Its magazines come from a fully automated content pipeline. It takes one day for a magazine to be automatically created and published to print distributors. 

The Puzzle Planet content pipeline is in a state of Continuous Delivery. The required throughput level of monthly magazines is met. It took two developers six months to reach a one week content lead time, and a further nine months to exceed the throughput level with a one day content lead time. 

There is no constraint within the content pipeline. The content pipeline also serves a subscription-based Puzzle Planet app for mobile devices, in an attempt to enter the digital puzzle market. Subscribers receive new puzzles each day, and an updated app version every two months. Uptake of the app is limited, and customer demand is unclear.

Puzzle Planet has benefitted from exceeding its required throughput level with a one day content lead time. The content pipeline is highly efficient. It produces high quality puzzles with near-zero mistakes, and handles print distributors with no employee costs. It could theoretically scale up to hundreds of magazine titles. However, a one week content lead time would serve similar purposes.

The problem is the lack of customer demand for the Puzzle Planet app. Digital marketing is the constraint for Puzzle Planet, not its content pipeline or print magazines. An app with few customers and bi-monthly features will struggle in the marketplace, regardless of content updates.

As it stands, the nine months spent on a one day content lead time was an overinvestment in Continuous Delivery. The nine months of funding for the content pipeline could have been invested in digital marketing instead, to better understand customer engagement, retention, and digital revenue opportunities. If more paying customers can be found for the Puzzle Planet app, the one day content lead time could be turned around into a worthy investment.

Acknowledgements

Thanks to Adam Hansrod for his feedback.

Continuous Delivery levels

“When I’m asked how long it’ll take to implement Continuous Delivery, I used to say ‘it depends’. That’s a tough conversation starter for topics as broad as culture, engineering excellence, and urgency.

Now, I use a heuristic that’s a better starter – ‘around twice as much effort as your last step change in deployments’. Give it a try!”

Steve Smith

TL;DR:

  • Continuous Delivery is challenging and time-consuming to implement in enterprise organisations.
  • The author has an estimation heuristic that ties levels of deployment throughput to the maximum organisational effort required.
  • A product manager must choose the required throughput level for their service.
  • Cost Of Delay can be used to calculate a required throughput level.

Introduction

Continuous Delivery is about a team increasing the throughput of its deployments until customer demand is sustainably met. In Accelerate, Dr. Nicole Forsgren et al demonstrate that Continuous Delivery produces:

  • High performance IT. Better throughput, quality, and stability.
  • Superior business outcomes. Twice as likely to exceed profitability, market share, and productivity goals.
  • Improved working conditions. A generative culture, less burnout, and more job satisfaction.

If an enterprise organisation has IT as a Cost Centre, Continuous Delivery is unlikely to happen organically in one delivery team, let alone many. Systemic continuous improvement is incompatible with incentives focussed on project deadlines and cost reduction targets. Separate funding may be required for adopting Continuous Delivery, and approval may depend on an estimate of duration. That can be a difficult conversation, as the pathways to success are unknowable at the outset.

Continuous Delivery means applying a multitude of technology and organisational changes to the unique circumstances of an organisation. An organisation is a complex, adaptive system, in which behaviours emerge from unpredictable interactions between people, teams, and departments. Instead of linear cause and effect, there is a dispositional state representing a set of possibilities at that point in time. The positive and/or negative consequences of a change cannot be predicted, nor the correct sequencing of different changes.

An accurate answer to the duration of a Continuous Delivery programme is impossible upfront. However, an approximate answer is possible. 

Continuous Delivery levels

In Site Reliability Engineering, Betsy Beyer et al describe reliability engineering in terms of availability levels. Based on their own experiences, they suggest ‘each additional nine of availability represents an order of magnitude improvement’. For example, if a team achieves 99.0% availability with x engineering effort, an increase to 99.9% availability would need a further 10x effort from the same team.

In a similar fashion, deployment throughput levels can be established for Continuous Delivery. Deployment throughput is a function of deployment frequency and deployment lead time, and common time units can be defined as different levels. When linked to the relative efforts required for technology and organisational changes, throughput levels can be used as an estimation heuristic.

Based on the author’s ten years of experience, this heuristic states that an increase in deployments to a new level of throughput requires twice as much effort as the previous level. For instance, if two engineers needed two weeks to move their service from monthly to weekly deployments, the same team would need one month of concerted effort for daily deployments.
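
As a rough sketch, the heuristic reduces to a doubling rule over an ordered list of throughput levels. The Python below is an illustration under that assumption; the level names and example numbers are invented, not prescriptive.

LEVELS = ["yearly", "quarterly", "monthly", "weekly", "daily", "hourly"]

def estimate_effort(current_level: str, target_level: str, last_step_effort_weeks: float) -> float:
    # Each step up in throughput is estimated at twice the effort of the previous step;
    # multi-level jumps sum the successive doublings.
    steps = LEVELS.index(target_level) - LEVELS.index(current_level)
    if steps <= 0:
        return 0.0
    return sum(last_step_effort_weeks * (2 ** n) for n in range(1, steps + 1))

# Two engineers needed two weeks to move from monthly to weekly deployments,
# so daily deployments are estimated at four weeks of concerted effort.
print(estimate_effort("weekly", "daily", last_step_effort_weeks=2))   # 4.0
print(estimate_effort("weekly", "hourly", last_step_effort_weeks=2))  # 12.0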

The optimal approach to implement Continuous Delivery is to use the Improvement Kata. Improvement cycles can be executed to exploit the current possibilities in the dispositional state of the organisation, by experimenting with technology and organisational changes. The direction of travel for the Improvement Kata can be expressed as the throughput level that satisfies customer demand.

A product manager selects a throughput level based on their own risk tolerance. They have to balance the organisational effort of achieving a throughput level with predicted customer demand. The easiest way is to simply choose the next level up from the current deployment throughput, and re-calibrate level selection whenever an improvement cycle in the Improvement Kata is evaluated.

Context matters. These levels will sometimes be inaccurate. The relative organisational effort for a level could be optimistic or pessimistic, depending on the dispositional state of the organisation. However, Continuous Delivery levels will provide an approximate answer of effort immediately, when an exact answer is impossible. 

Quantifying customer demand

A more accurate, slower way to select a deployment throughput level is to quantify customer demand, via the opportunity costs associated with potential features for a service. The opportunity cost of an idea for a new feature can be calculated using Cost of Delay, and the Value Framework by Joshua Arnold et al.

First, an organisation has to establish opportunity cost bands for its deployment throughput levels. The bands are based on the projected impact of Discontinuous Delivery on all services in the organisation. Each service is assessed on its potential revenue and costs, its payment model, its user expectations, and more. 

For example, an organisation attaches a set of opportunity cost bands to its deployment throughput levels, based on an analysis of revenue streams and costs. A team has a service with weekly deployments, and demand akin to a daily opportunity cost of £20K for planned features. It took one week of effort to achieve weekly deployments. The service is due to be rewritten, with an entirely new feature set estimated to be £90K in daily opportunity costs. The product manager selects a throughput level of daily deployments, and the organisational effort is estimated to be 10 weeks.
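
A minimal sketch of that band selection is shown below. The thresholds are invented for illustration only; each organisation would derive its own bands from its analysis of revenue streams and costs.

OPPORTUNITY_COST_BANDS = [
    # (minimum daily opportunity cost in £K, required throughput level)
    (80, "daily"),
    (15, "weekly"),
    (5, "monthly"),
    (0, "quarterly"),
]

def select_throughput_level(daily_opportunity_cost_k: float) -> str:
    # Walk the bands from highest to lowest and return the first level whose
    # threshold is met by the daily opportunity cost of planned features.
    for threshold, level in OPPORTUNITY_COST_BANDS:
        if daily_opportunity_cost_k >= threshold:
            return level
    return OPPORTUNITY_COST_BANDS[-1][1]

print(select_throughput_level(20))  # weekly, for £20K of daily opportunity cost
print(select_throughput_level(90))  # daily, for £90K of daily opportunity cost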

Acknowledgements

Thanks to Alun Coppack, Dave Farley, and Thierry de Pauw for their feedback.

Continuous Delivery target measures

“I’m often asked by senior leaders in different organisations how they should measure software delivery, and kickstart a Continuous Delivery culture. Accelerate and Measuring Continuous Delivery have some of the answers, but not all of them.

This is for anyone wondering how to effectively measure Product, Delivery, and Operations teams as one, in their own organisation…”

Steve Smith

TL;DR:

  • Teams working in an IT cost centre are often judged on vanity measures such as story points and incident count.
  • Teams need to be measured on outcomes linked to business goals, deployment throughput, and availability.
  • A product manager, not a delivery lead or tech lead, must be accountable for all target measures linked to a product and the teams working on it.

Introduction 

In A Typology of Organisational Cultures, Ron Westrum defines culture as power, rule, or performance-oriented. An organisation in a state of Discontinuous Delivery has a power or rule-oriented culture, in which bridging between teams is discouraged or barely tolerated.

A majority of organisations mired in Discontinuous Delivery have IT as a Cost Centre. Their value streams crosscut siloed organisational functions in Product, Delivery, and Operations. Each silo has its own target measures, which reinforce a power or rule-oriented culture. Examples include page views and revenue per customer in Product, story points and defect counts in Delivery, and incident counts and server uptime in Operations.

These are vanity measures. A vanity measure in a value stream is an output of one or a few siloed activities, in a single organisational function. As a target, it is vulnerable to individual bias, under-reporting, and over-reporting by people within that silo, due to Goodhart’s Law. Vanity measures have an inherently low information value, and incentivise people in different silos to work at cross-purposes.

Measure outcomes, not outputs

Measuring Continuous Delivery by the author describes how an organisation can transition from Discontinuous Delivery to Continuous Delivery, and dramatically improve service throughput and reliability. A performance-oriented culture needs to be introduced, in which bridging between teams is encouraged.

The first step in that process is to replace vanity measures with actionable measures. An actionable measure in a value stream is a holistic outcome for all activities. It has a high information value. It has some protection against individual bias, under-reporting, and over-reporting, because it is spread across all organisational functions. 

Actionable measures for customer success could include conversion rate, and customer lifetime value. Accelerate by Dr. Nicole Forsgren et al details the actionable measures for service throughput:

  • Deployment frequency. The time between production deployments. 
  • Deployment lead time. The time between a mainline code commit and its deployment. 

Accelerate does not extend to service reliability. The actionable measures of service reliability are availability rate, and time to restore availability.
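
As an illustration, all four measures can be derived from deployment and incident timestamps. The Python sketch below assumes hypothetical hard-coded records; in practice the data would come from a deployment pipeline and production monitoring.

from datetime import datetime, timedelta
from statistics import mean

deployments = [
    # (mainline commit time, production deployment time)
    (datetime(2021, 6, 1, 9, 0), datetime(2021, 6, 1, 15, 0)),
    (datetime(2021, 6, 8, 10, 0), datetime(2021, 6, 9, 11, 0)),
]

incidents = [
    # (availability lost, availability restored)
    (datetime(2021, 6, 3, 14, 0), datetime(2021, 6, 3, 14, 45)),
]

# Deployment frequency: the time between production deployments.
deploy_times = sorted(d for _, d in deployments)
frequency = mean((b - a).total_seconds() for a, b in zip(deploy_times, deploy_times[1:]))

# Deployment lead time: the time between a mainline commit and its deployment.
lead_time = mean((d - c).total_seconds() for c, d in deployments)

# Time to restore availability, and availability rate over a 30 day period.
time_to_restore = mean((end - start).total_seconds() for start, end in incidents)
period = timedelta(days=30).total_seconds()
downtime = sum((end - start).total_seconds() for start, end in incidents)
availability_rate = 100 * (period - downtime) / period

print(timedelta(seconds=frequency), timedelta(seconds=lead_time))
print(timedelta(seconds=time_to_restore), round(availability_rate, 3))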

Target measures for service throughput and reliability need to be set for services at the start of a Continuous Delivery programme. They increase information flow, cooperation, and trust between people, teams, and organisational functions within the same value stream. They make it clear the product manager, delivery team, and operations team working on the same service share a responsibility for its success. It is less obvious who is accountable for choosing the target measures, and ensuring they are met.  

Avoid delivery tech lead accountability

One way to approach target accountability is for a product manager to be accountable for a deployment frequency target, and a delivery tech lead accountable for a deployment lead time target. This is based on deployment lead time correlating with technical quality, and its reduction depending on delivery team ownership of the release process.

It is true that Continuous Delivery is predicated on a delivery team gradually assuming sole ownership of the release process, and fully automating it as a deployment pipeline. However, the argument for delivery tech lead accountability is flawed, due to: 

  • Process co-ownership. The release process is co-owned by delivery and operations teams at the outset, while it is manual or semi-automated. A delivery tech lead cannot be accountable for a team in another organisational function. 
  • Limited influence. An operations team is unlikely to be persuaded by a delivery tech lead that release activities should move to a delivery team for automation. A delivery tech lead is not influential in other organisational functions.
  • Prioritisation conflict. The product manager and delivery tech lead have separate priorities, for product features and deployment lead time experiments. A delivery tech lead cannot compete with a product manager prioritising product features on their own.
  • Siloed accountabilities. A split in throughput target accountability perpetuates the Product and IT divide. A delivery tech lead cannot force a product manager to be invested in deployment lead time.  

Maximise product manager accountability

In a power or rule-oriented culture, driving change across organisational functions requires significant influence. The most influential person across Product, Delivery, and Operations for a service will be its budget holder, and that is the product manager. They will be viewed as the sponsor for any improvement efforts related to the service. Their approval will lend credibility to changes, such as a delivery team owning and automating the release process. 

Product managers should be accountable for all service throughput and reliability targets. It will encourage them to buy into Continuous Delivery as the means to achieve their product goals. It will incentivise them to prioritise deployment lead time experiments and operational features alongside product features. It will spur them to promote change across organisational functions.

In The Decision Maker, Dennis Bakke advocates effective decision making as a leader choosing a decision maker, and the decision maker gathering information before making a decision. A product manager does not have to choose the targets themselves. If they wish, they can nominate their delivery tech lead to gather feedback from different people, and then make a decision for which the product manager is accountable. The delivery and operations teams are then responsible for delivering the service, with sufficient technical discipline and engineering skills to achieve those targets.

A product manager may be uncomfortable with accountability for IT activities. The multi-year research data in Accelerate is clear about the business benefits of a faster deployment lead time:

  • Less rework. The ability to quickly find defects in automated tests tightens up feedback cycles, which reduces rework and accelerates feature development.  
  • More revenue protection. Using the same deployment pipeline to restore lost availability, or apply a security patch in minutes, limits revenue losses and reputational damage on failure.
  • Customer demand potential. A deployment lead time one time unit lower than the deployment frequency demonstrates an ability to satisfy more demand, if it can be unlocked.

Summary

Adopting Continuous Delivery is based on effective target measures of service throughput and reliability. Establishing the same targets across all the organisational functions in a value stream will start the process of nudging the organisational culture from power or rule-oriented to performance-oriented. The product manager should be accountable for all target measures, and the delivery and operations teams responsible for achieving them.

It will be hard. There will be authoritative voices in different organisational functions, with a predisposition to vanity measures. The Head of Product, Head of Delivery, and Head of Operations might have competing budget priorities, and might not agree on the same target measures. Despite these difficulties, it is vital the right target measures are put in place. As Peter Drucker said in The Essential Drucker, ‘if you want something new, you have to stop doing something old’.

Acknowledgements

Thanks to Charles Kubicek, Dave Farley, Phil Parker, and Thierry de Pauw for their feedback.

Who Runs It

What are the different options for production support in IT as a Cost Centre? How can deployment throughput and application reliability be improved in unison? Why is You Build It You Run It so effective for both Continuous Delivery and Operability?

This series of articles describes a taxonomy for production support methods in IT as a Cost Centre, and their impact on both Continuous Delivery and Operability.

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

The series is summarised below. Availability targets should be chosen according to estimates of revenue loss on failure, which can be verified by Chaos Engineering or actual production incidents. There is an order of magnitude of additional engineering effort/time associated with each additional nine of availability. You Build It SRE Run It is best suited to four nines of availability or more, You Build It You Run It is required for weekly deployments or more, and You Build It Ops Run It remains relevant when product demand is low.

Rebuild The Thing Wrong

What happens when the launch of a software product rebuild is delayed due to feature parity? How can an organisation rebuild a software product the right way, and minimise opportunity costs as well as technology risks?

Introduction

To adapt to rapidly changing markets, organisations must enhance their established products as well as explore new product offerings. When the profitability of an existing product is limited by Discontinuous Delivery, a rewrite should be considered a last resort. However, it might become necessary if a meaningful period of code, configuration, and infrastructure rework cannot achieve the requisite deployment throughput and production reliability.

Rebuilding a software product from scratch is usually a costly and risky one-off project, with little attention paid to customer needs. Replicating all the features in an existing product will take weeks or months, and there may be a decision to wait for feature parity before customer launch. It can be a while before Continuous Delivery becomes a reality.

The production cutover will be a big bang migration, with a high risk of catastrophic failure. The probability and cost of cutover failure can be reduced by Decoupling Deploy From Launch, and regularly deploying versions of the new product prior to launch.

The Gamez rebuild

For example, a fictional Gamez retail organisation has an ecommerce website. The website is powered by a monolithic application known as eCom, which is hosted in an on-premise data centre. Recently viewed items, gift lists, search, and customer reviews are features that are often praised by customers, and believed to generate the majority of revenue. eCom deployments are quarterly, production reliability is poor, and data centre costs are too high.

There is a strong business need to accelerate feature delivery, and a rewrite of the Gamez website is agreed. The chosen strategy is a rip and replace, in which the eCom monolith will be rebuilt as a set of cloud-native services, hosted in a public cloud provider. Each service is intended to be an independently testable, deployable, and scalable microservice that can keep pace with market conditions.

There is an agreement to delay customer launch until feature parity is achieved, and all eCom business logic is replicated in the services. Months later, the services are deemed feature complete, the production cutover is accomplished, and the eCom application is turned off.

The feature parity fallacy

Feature parity is a fallacy. Rebuilding a software product with the same feature set is predicated on the absurd assumption that all existing features generate sufficient revenue to justify their continuation. In all likelihood, an unknown minority of existing features will account for a majority of revenue, and the majority of existing features can be deferred or even discarded.

The main motivation for rebuilding a software product is to accelerate delivery, validate product hypotheses, and increase revenues as a result. Waiting weeks or months for feature parity during a rebuild prolongs Discontinuous Delivery, and exacerbates all the related opportunity costs that originally prompted the rewrite.

Back at Gamez, some product research conducted prior to the rebuild is unearthed, and it contains some troubling insights. Search generates the vast majority of website revenue. Furthermore, product videos is identified as a missing feature that would quickly surpass all other revenue sources. This means a more profitable course of action would have been to launch a new product videos feature, replace the eCom search feature, and validate learnings with customers. An opportunity to substantially increase profitability was missed.

Strangler Fig

Rebuilding a software product right means incrementally launching features throughout the rebuild. This can be accomplished with the Strangler Fig architectural pattern, which can be thought of as applying the Expand and Contract pattern at the application level.

Strangler Fig means gradually creating a new software product around the boundaries of an existing product, and rebuilding one feature at a time until the existing product is no more. It is usually implemented by deploying an event interceptor in front of the existing product. New and existing features are added to the new software product as necessary, and user requests are routed between products. This approach means customers can benefit from new features much sooner, and many more learning opportunities are created.

Using the Strangler Fig pattern would have allowed Gamez to generate product videos revenue from the outset. At first, a customer router microservice would have been implemented in front of the eCom application. It would have sent all customer requests to eCom except product videos, which would have gone to the first new microservice. Later on, search and customer review services would have replaced the equivalent eCom functionality.
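
A customer router of that kind is a thin routing layer in front of the monolith. The Python sketch below is a minimal illustration with hypothetical paths and handlers; a real implementation would live in a reverse proxy or an edge service, not in application code like this.

NEW_PRODUCT_ROUTES = {"/product-videos"}  # grows as features are strangled out

def handle_with_ecom(path: str) -> str:
    # Stand-in for forwarding the request to the existing eCom monolith.
    return f"eCom monolith handled {path}"

def handle_with_microservice(path: str) -> str:
    # Stand-in for forwarding the request to a rebuilt microservice.
    return f"new microservice handled {path}"

def route(path: str) -> str:
    # Route to the new product if the feature has been rebuilt,
    # otherwise fall back to the existing product.
    prefix = "/" + path.strip("/").split("/")[0]
    if prefix in NEW_PRODUCT_ROUTES:
        return handle_with_microservice(path)
    return handle_with_ecom(path)

print(route("/product-videos/1234"))  # new microservice
print(route("/search?q=roses"))       # eCom monolith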

Case studies

  1. The Road To Continuous Delivery by Michiel Rook details how Strangler Fig helped a Dutch recruitment company replace its ecommerce website
  2. Strangulation Pattern Of Choice by Joshua Gough describes how Strangler Fig allowed a British auction site to replace its ecommerce website
  3. Legacy Application Strangulation Case Studies by Paul Hammant lists a series of product rewrites made possible by Strangler Fig
  4. How to Breakthrough the Old Monolith by Kyle Galbraith visualises how to slice a monolith up into microservices, using Strangler Fig
  5. Estimation Is Evil by Ron Jeffries mentions how incremental launches would have benefitted the eXtreme Programming (XP) birth project Chrysler C3, rather than a yearlong pursuit of feature parity

Build The Right Thing and Build The Thing Right

Should an organisation in peril start its journey towards IT enabled growth by investing in IT delivery first, or product development? Should it Build The Thing Right with Continuous Delivery first, or Build The Right Thing with Lean Product Development?

Introduction

The software revolution has caused a profound economic and technological shift in society. To remain competitive, organisations must rapidly explore new product offerings as well as exploiting established products. New ideas must be continuously validated and refined with customers, if product/market fit and repeatable sales are to be found.

That is extremely difficult when an organisation has the 20th century, pre-Internet IT As A Cost Centre organisational model. There will be a functionally segregated IT department, which is accounted for as a cost centre that can only incur costs. There will be an annual plan of projects, each with its scope, resources, and deadlines fixed in advance. IT delivery teams will be stuck in long-term Discontinuous Delivery, and IT executives will only be incentivised to meet deadlines and reduce costs.

If product experimentation is not possible and product/market fit for established products declines, overall profitability will suffer. What should be the first move of an organisation in such perilous conditions? Should it invest first in Lean Product Development to Build The Right Thing, and uncover new product offerings as quickly as possible? Or should it invest in Continuous Delivery to Build The Thing Right first, and create a powerful growth engine for locating future product/market fit?

Build The Thing Right first

In 2007, David Shpilberg et al published a survey of ~500 executives in Avoiding the Alignment Trap in IT. 74% of respondents reported an under-performing, undervalued IT department, unaligned with business objectives. 15% of respondents had an effective IT department, with variable business alignment, below average IT spending, and above average sales growth. Finally, 11% were in a so-called Alignment Trap, with negative sales growth despite above average IT spending and tight business alignment.

Avoiding the Alignment Trap in IT (Shpilberg et al) – source praqma.com

Shpilberg et al report “general ineffectiveness at bringing projects in on time and on the dollar, and ineffectiveness with the added complication of alignment to an important business objective”. The authors argue organisations that prematurely align IT with business objectives will fall into an Alignment Trap. They conclude organisations should build a highly effective IT department first, and then align IT projects to business objectives. In other words, Build The Thing Right before trying to Build The Right Thing.

The conclusions of Avoiding the Alignment Trap in IT are naive, because they ignore the implications of IT As A Cost Centre. An organisation with ineffective IT will undoubtedly suffer if it increases business alignment and investment in IT as is. However, that is a consequence of functionally siloed product and IT departments, and the antiquated project delivery model tied to IT As A Cost Centre. Projects will run as a Large Batch Death Spiral for months without customer feedback, and invariably result in a dreadful waste of time and money. When Shpilberg et al define success as “delivered projects with promised functionality, timing, and cost”, they are measuring manipulable project outputs, rather than customer outcomes linked to profitability.

Build The Right Thing first

It is hard to Build The Right Thing without first learning to Build The Thing Right, but it is possible. If flow is improved through co-dependent product and IT teams, the negative consequences of IT As A Cost Centre can be reduced. New revenue streams can be unlocked and profitability increased, before a full Continuous Delivery programme can be completed.

An organisation will have a number of value streams to convert ideas into product offerings. Each value stream will have a Fuzzy Front End of product and development activities, and a technology value stream of build, testing, and operational activities. The time from ideation to customer is known as cycle time.

Flow in a value stream can be improved by:

  • understanding the flow of ideas
  • reducing batch sizes
  • quantifying value and urgency

The flow of ideas can be understood by conducting a Value Stream Mapping with senior product and IT stakeholders. Visualising the activities and teams required for ideas to reach customers will identify an approximate cycle time, and the sources of waste that delay feedback. A Value Stream Mapping will usually uncover a shocking amount of rework and queue times, with the majority in the Fuzzy Front End.

For example, a Value Stream Mapping might reveal a 10 month cycle time, with 8 months spent on ideas bundled up as projects in the Fuzzy Front End, and 2 months spent on IT in the technology value stream. Starting out with Build The Thing Right would only tackle 20% of cycle time.

Reducing batch sizes means unbundling projects into separate features, reducing the size of features, and using Work In Process (WIP) Limits across product and IT teams. Little’s Law guarantees distilling projects into small, per-feature deliverables and restricting in-flight work will result in shorter cycle times. In Lean Enterprise, Jez Humble, Joanne Molesky, and Barry O’Reilly describe reducing batch sizes as “the most important factor in systemically increasing flow and reducing variability”.
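
Little’s Law states that average cycle time equals average work in process divided by average throughput. The Python below is a simple illustration with invented numbers, showing why restricting in-flight work shortens cycle times at a given throughput.

def average_cycle_time_weeks(work_in_process: float, throughput_per_week: float) -> float:
    # Little's Law: average cycle time = average work in process / average throughput.
    return work_in_process / throughput_per_week

# A team finishing 5 features per week with 40 features in flight averages an
# 8 week cycle time; halving the WIP limit halves the average cycle time.
print(average_cycle_time_weeks(40, 5))  # 8.0
print(average_cycle_time_weeks(20, 5))  # 4.0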

Quantifying value and urgency means working with product stakeholders to estimate the Cost Of Delay of each feature. Cost Of Delay is the economic benefit a feature could generate over time, if it was available immediately. Considering how time will affect how a feature might increase revenue, protect revenue, reduce costs, and/or avoid costs is extremely powerful. It encourages product teams to reduce cycle time by shrinking batch sizes and eliminating Fuzzy Front End activities. It uncovers shared assumptions and enables better trade-off decisions. It creates a shared sense of urgency for product and IT teams to quickly deliver high value features. As Don Reinertsen says in The Principles of Product Development Flow, “if you only measure one thing, measure the Cost Of Delay”.

For example, a manual customer registration task generates £100Kpa of revenue, and is performed by one £50Kpa employee. The economic benefit of automating that task could be calculated as £100Kpa of revenue protection and £50Kpa of cost reduction, so roughly £2.9Kpw is lost while the feature does not exist. If another feature has a Cost Of Delay greater than £2.9Kpw, it should be prioritised and the manual task should remain for now.
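
The arithmetic for that example can be sketched as below; the 52 week year is a simplifying assumption.

def weekly_cost_of_delay(revenue_protected_pa: float, costs_reduced_pa: float) -> float:
    # Annual economic benefit forgone, spread across a 52 week year.
    return (revenue_protected_pa + costs_reduced_pa) / 52

# £100Kpa of revenue protection plus £50Kpa of cost reduction is forgone at
# roughly £2.9K per week while the automation does not exist.
print(round(weekly_cost_of_delay(100_000, 50_000)))  # 2885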

Build The Right Thing case study – Maersk Line

Black Swan Farming – Maersk Line by Joshua Arnold and Özlem Yüce demonstrates how an organisation can successfully Build The Right Thing first, by understanding the flow of ideas, reducing batch sizes, and quantifying value and urgency. In 2010, Maersk Line IT was a £60M cost centre with 20 outsourced development teams. Between 2008 and 2010 the median cycle time in all value streams was 150 days. 62% of ~3000 requirements in progress were stuck in Fuzzy Front End analysis.

Arnold and Yüce were asked to deliver more value, flow, and quality for a global booking system with a median cycle time of 208 days, and quarterly production releases in IT. They mapped the value stream, shrank features down to the smallest unit of deliverable value, and introduced Cost Of Delay into product teams alongside other Lean practices.

After 9 months, improvements in Fuzzy Front End processes resulted in a 48% reduction in median cycle time to 108 days, an 88% reduction in defect count, and increased customer satisfaction. Furthermore, using Cost Of Delay uncovered that 25% of requirements were a thousand times more valuable than the alternatives, which led to a per-feature return on investment six times higher than the Maersk Line IT average. By applying the same Lean principles behind Continuous Delivery to product development prior to additional IT investment, Arnold and Yüce achieved spectacular results.

Build The Right Thing and Build The Thing Right

If an organisation in peril tries to Build The Right Thing first, it risks searching for product/market fit without the benefits of fast customer feedback. If it tries to Build The Thing Right first, it risks spending time and money on Continuous Delivery without any tangible business benefits.

An organisation should instead aim to Build The Right Thing and Build The Thing Right from the outset. A co-evolution of product development and IT delivery capabilities is necessary, if an organisation is to achieve the necessary profitability to thrive in a competitive market.

This approach is validated by Dr. Nicole Forsgren et al in Accelerate. Whereas Avoiding The Alignment Trap In IT was a one-off assessment of business alignment in IT As A Cost Centre, Accelerate is a multi-year, scientifically rigorous study of thousands of organisations worldwide. Interestingly, Lean product development is modelled as understanding the flow of work, reducing batch sizes, incorporating customer feedback, and empowering teams. The data shows:

  • Continuous Delivery and Lean product development both predict superior software delivery performance and organisational culture
  • Software delivery performance and organisational culture both predict superior organisational performance in terms of profitability, productivity, and market share
  • Software delivery performance predicts Lean product development

On the reciprocal relationship between software delivery performance and Lean product development, Dr. Nicole Forsgren et al conclude “the virtuous cycle of increased delivery performance and Lean product management practices drives better outcomes for your organisation”.

Exapting product development and technology

An organisational ambition to Build The Right Thing and Build The Thing Right needs to start with the executive team. Executives need to recognise an inability to create new offerings or protect established products equates to mortal peril. They need to share a vision of success with the organisation that articulates the current crisis, describes a state of future prosperity, and injects urgency into day-to-day work.

The executive team should introduce the Improvement Kata into all levels of the organisation. The Improvement Kata encourages problem solving via experimentation, to proceed towards a goal in iterative, incremental steps amid ambiguities, uncertainties, and difficulties. It is the most effective way to manage a gradual co-emergence of Lean Product Development and Continuous Delivery.

Experimentation with organisational change should include a transition from IT As A Cost Centre to IT As A Business Differentiator. This means technology staff moving from the IT department to work in long-lived, outcome-oriented teams in one or more product departments, which are accounted for as profit centres and responsible for their own investment decisions. One way to do this is to create a Digital department of co-located product and technology staff, with shared incentives to create new product offerings. Handoffs and activities in value streams will be dramatically reduced, resulting in much faster cycle times and tighter customer feedback loops.

Instead of an annual budget and a set of fixed scope projects, there needs to be a rolling budget that funds a rolling plan linked to desired outcomes and strategic business objectives. The scope, resources, and deadlines of the rolling plan should be constantly refined by validated learnings from customers, as delivery teams run experiments to find problem/solution fit and product/market fit for a particular business capability.

Those delivery teams should be cross-functional, with all the necessary personnel and skills to apply Design Thinking and Lean principles to problem solving. This should include understanding the flow of ideas, reducing batch sizes, and quantifying value and urgency. As Lean Product Development and Continuous Delivery capabilities gradually emerge, it will become much easier to innovate with new product offerings and enhance established products.

It might take months or years of investment, experimentation, and disruption before an organisation has adopted Lean Product Development and Continuous Delivery across all its value streams. It is important to protect delivery expectations and staff welfare by making changes one value stream at a time, starting with existing products or new product offerings stuck in Discontinuous Delivery.

Acknowledgements

Thanks to Emily Bache, Ozlem Yuce, and Thierry de Pauw for reviewing this article.

Further Reading

  1. Lean Software Development by Mary and Tom Poppendieck
  2. Designing Delivery by Jeff Sussna
  3. The Essential Drucker by Peter Drucker
  4. Measuring Continuous Delivery by Steve Smith
  5. The Cost Centre Trap by Mary Poppendieck
  6. Making Work Visible by Dominica DeGrandis

IT as a Business Differentiator

How can Continuous Delivery power innovation in an organisation?

When an organisation is in a state of Continuous Delivery, its technology strategy can be described as IT as a Business Differentiator. IT staff will work in one or more product departments, which are accounted for as profit centres in which profits are generated from incoming revenues and outgoing costs. A profit centre provides services to customers, and is responsible for its own investment decisions and profitability.

IT as a Business Differentiator promotes IT to be a front office function. There will be a rolling budget, and a rolling plan consisting of dynamic product areas with scope, resources, and deadlines constantly refined by feedback. Long-lived, outcome-oriented delivery teams will implement experiments to find product/market fit for a particular business capability.

This is in direct contrast to Nicholas Carr’s 2003 proclamation that IT Doesn’t Matter, to which history has not been kind. Carr failed to predict the rise of Agile Development, Lean Product Development, and in particular Cloud Computing, which has commoditised many lower-order technology functions. These advancements have contributed to the ongoing software revolution termed “Software Is Eating The World” by Marc Andreessen in 2011, which has caused a profound economic and technological shift in society.

Continuous Delivery as the norm

IT as a Business Differentiator is an Internet-inspired, 21st century technology strategy in which IT contributes to uncovering new revenue streams that increase overall profitability for an organisation. This means executives and managers are incentivised to maximise revenue generating activities, as well as controlling cost generating activities.

Continuous Delivery is table stakes for IT as a Business Differentiator, as IT executives and managers are accountable for delays between ideation and customer launch. There will be an ongoing investment in technology and organisational change, to ensure deployment throughput meets market demand. There will be a focus on optimising flow by eliminating handoffs, reducing work in progress, and removing wasteful activities. The reliability strategy will be to Optimise For Resilience, in order to minimise failure response time and blast radius.

IT as a Business Differentiator and Continuous Delivery were validated by Dr. Nicole Forsgren et al, in the 2018 book Accelerate. Surveys of 23,000 people working at 2,000 organisations worldwide revealed:

  • Continuous Delivery results in high performance IT
  • High performance IT leads to simultaneous improvements in the stability and throughput of IT delivery, without trade-offs
  • High performance IT means an organisation is twice as likely to exceed profitability, market share, and productivity goals
  • Continuous Delivery also results in less rework, an improved organisational culture, reduced team burnout, and increased job satisfaction

Leaving IT As A Cost Centre

If an organisation has institutionalised IT as a Cost Centre as its technology strategy, moving to IT as a Business Differentiator would be difficult. It would require an executive-level decision, in one of the following scenarios:

  • Competition – rival organisations are increasing their market share
  • Cognition – IT is recognised as the engine of future business growth
  • Catastrophe – a serious IT failure has an enormously negative financial impact

If the executive leadership of the organisation agree there is an existential crisis, they should publicly commit to IT as a Business Differentiator. That should include an ambitious vision of success that explains the current crisis, describes a state of future economic prosperity, and injects a sense of urgency into the day-to-day work of personnel.

There is no recipe for moving from IT as a Cost Centre to IT as a Business Differentiator. As a complex, adaptive system, an organisation will have a dispositional state of time-dependent possibilities, rather than linear cause and effect. A continuous improvement method such as the Improvement Kata should be used to experiment with different changes. Experiments could include:

  • co-locating IT delivery teams with their product stakeholders
  • removing cost accounting metrics from IT executive incentives
  • creating a Digital department of product and IT staff, as a profit centre

This leaves the open question of whether an IT department should adopt Continuous Delivery before, during, or after a move from IT as a Cost Centre to IT as a Business Differentiator.

Further Reading

  1. The Principles Of Product Development Flow by Don Reinertsen
  2. Measuring Continuous Delivery by the author
  3. Lean Enterprise by Jez Humble, Joanne Molesky, and Barry O’Reilly
  4. Utility vs. Strategic Dichotomy by Martin Fowler
  5. Products Not Projects by Sriram Narayan

Acknowledgements

Thanks to Thierry de Pauw for his feedback.

Deployment pipeline design and the Theory Of Constraints

How should you design a deployment pipeline? Short and wide, long and thin, or something else? Can you use a Theory Of Constraints lens to explain why pipeline flexibility is more important than any particular pipeline design?

TL;DR:

  • Past advice from the Continuous Delivery community to favour short and wide deployment pipelines over long and thin pipelines was flawed
  • Parallelising activities between code commit and production in a short and wide deployment pipeline is unlikely to achieve a target lead time 
  • Flexible pipelines allow for experimentation until a Goldilocks deployment pipeline can be found, which makes Continuous Delivery easier to implement

Introduction

The Deployment Pipeline pattern is at the heart of Continuous Delivery. A deployment pipeline is a pull-based automated toolchain, used from code commit to production. The design of a deployment pipeline should be aligned with Conway’s Law, and a model of the underlying technology value stream. In other words, it should encompass the build, testing, and operational activities required to launch new product ideas to customers. The exact tools used are of little consequence.

Advice on deployment pipeline design has remained largely unchanged since 2010, when Jez Humble recommended “make your pipeline wide, not long… and parallelise each stage as much as you can”. A long and thin deployment pipeline of sequential activities is easy to reason about, but in theory parallelising activities between build and production will shorten lead times and accelerate feedback loops. The trade-off is an increase in toolchain complexity and coordination costs between the different teams participating in the technology value stream.

For example, imagine a technology value stream with sequential activities for automated acceptance tests, exploratory testing, and manual performance testing. This could be modelled as a long and thin deployment pipeline.

If those testing activities could be run in parallel, the long and thin deployment pipeline could be re-designed as a short and wide deployment pipeline.

Since 2010, people in the Continuous Delivery community – including the author – have periodically recommended short and wide deployment pipelines over long and thin pipelines. That advice was flawed.

The Theory Of Constraints, Applied

The Theory Of Constraints is a management paradigm by Dr. Eli Goldratt, for improving organisational throughput in a homogeneous workflow. A constraint is any resource with capacity equal to or less than market demand, and its level of utilisation will limit the utilisation of other resources. The aim is to iteratively increase the capacity of a constraint, until the flow of items can be balanced according to demand. The Theory Of Constraints is applicable to Continuous Delivery, as a technology value stream should be a homogeneous workflow that is as deterministic and invariable as possible.

When a delivery team is in a state of Discontinuous Delivery, its technology value stream will contain a constrained activity with a duration that is less than the current lead time, yet too large for the target lead time. That duration might be greater than the target lead time itself, or simply the largest of all the activity durations. A short and wide deployment pipeline will not be able to meet the target lead time, as the duration of the parallel activities will be limited by the constrained activity.

In the above example, assume the current lead time is 14 days, and manual performance testing takes 12 days as it involves end-to-end performance testing with a third party.

Assume customer demand results in a target lead time of 7 days. This means the delivery team are in a state of Discontinuous Delivery, and a long and thin deployment pipeline would be unable to meet that target.

A short and wide deployment pipeline would also be unable to achieve the target lead time. The parallel testing activities would be limited by the 12 days of manual performance testing, and future release candidates would queue before the constrained activity. An obvious countermeasure would be for some release candidates to skip manual performance testing, but that would increase the risk of production incidents.
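The arithmetic behind that limit can be sketched in a few lines of Python. The 12 day manual performance test, the 14 day current lead time, and the 7 day target come from the example above; the one day durations for acceptance tests and exploratory testing are assumptions, chosen so the sequential total matches the current lead time.

```python
# A minimal sketch of the lead time arithmetic above. Durations are in days.
# The 12 day performance test and the 7 day target come from the example;
# the other testing durations are assumptions for illustration.

acceptance_tests = 1       # assumed
exploratory_testing = 1    # assumed
performance_testing = 12   # constrained activity: end-to-end tests with a third party
target_lead_time = 7

# Long and thin: testing activities run sequentially
long_and_thin = acceptance_tests + exploratory_testing + performance_testing      # 14 days

# Short and wide: testing activities run in parallel, so lead time is bounded
# by the slowest activity, which is the constraint
short_and_wide = max(acceptance_tests, exploratory_testing, performance_testing)  # 12 days

print(f"long and thin: {long_and_thin} days, short and wide: {short_and_wide} days")
print(f"7 day target met: {short_and_wide <= target_lead_time}")                  # False
```

Whichever design is chosen, the constrained activity sets the floor on lead time.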

This means long and thin vs. short and wide deployment pipelines is a false dichotomy.

Pipeline Design and The Theory Of Constraints

In The Goal, Dr. Eli Goldratt describes the Theory Of Constraints as an iterative cycle known as the Five Focussing Steps: identify a constraint, reduce its wasted capacity, regulate its item arrival rate, increase its capacity, and then repeat.

If the activities in a technology value stream can be re-sequenced, re-designing a deployment pipeline is one way to reduce wasted time at a constrained activity, and regulate the arrival of release candidates. Pipeline flexibility is more important than any particular pipeline design, as it enables experimentation until a Goldilocks deployment pipeline can be found.

In theory, the constrained activity should be the first activity after release candidate creation, as this would reduce subsequent release candidate queues and statistical fluctuations in unconstrained activities. However, constraint time should never be wasted on items with knowable defects, and most activities in a deployment pipeline are testing activities that can reject defective release candidates before they reach the constraint.

One Goldilocks deployment pipeline design is for all unconstrained testing activities to be parallelised before the constrained activity. This should be combined with other experiments to save constraint time, and regulate the flow of release candidates to minimise queues and statistical fluctuations. Such a pipeline design will make it easier for delivery teams to successfully implement Continuous Delivery.

In the above example, assume the short and wide deployment pipeline can be re-designed so manual performance testing occurs after the other parallelised testing activities. This ensures release candidates with knowable defects are rejected prior to performance testing, which saves 1 day in queue time per release candidate. End-to-end performance testing scenarios are gradually replaced with stubbed performance tests and contract tests, which saves 6 days and means the target lead time can be accomplished. 
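Treating those savings as a straight subtraction from the current lead time gives a minimal sketch of the arithmetic; the simplification is mine, the numbers are from the example above.

```python
# A minimal sketch of the Goldilocks redesign above, simplified to a straight
# subtraction from the current lead time. All numbers come from the example.

current_lead_time = 14   # days
queue_time_saved = 1     # defective release candidates rejected before performance testing
stub_tests_saved = 6     # end-to-end scenarios replaced with stubbed performance and contract tests
target_lead_time = 7

goldilocks_lead_time = current_lead_time - queue_time_saved - stub_tests_saved    # 7 days
print(f"7 day target met: {goldilocks_lead_time <= target_lead_time}")            # True
```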

If there is no constrained activity in a technology value stream, the delivery team is in a state of Continuous Delivery and a constraint will exist either upstream in product development or downstream in customer marketing. Further deployment pipeline improvements such as automated filtering of test scenarios could increase the speed of release candidate feedback, but the priority should be tackling the external constraint if product cycle time from ideation to customer is to be improved.

Acknowledgements

Thanks to Thierry de Pauw for his feedback on this article.

Continuous Delivery and the Theory Of Constraints

How should you actually implement Continuous Delivery?

Adopting Continuous Delivery takes time. You have a long list of technology and organisational changes to consider. You have to work within the unique circumstances of your organisation. You’re constantly surrounded by strange problems, half-baked theories, off-the-shelf solutions that just don’t work, and people telling you Amazon is nothing to worry about.

How do you identify and remove the major impediments in your build, testing, and operational activities? How do you avoid spending weeks, months, or years on far-reaching changes that ultimately have no impact on your time to market?

TL;DR:

  • Continuous Delivery means applying technology and organisational changes to the unique circumstances of an organisation
  • If a Continuous Delivery programme does not focus on the activities with the most rework and/or queue times, there is a high probability of sub-optimal outcomes
  • The Theory Of Constraints is a management paradigm for improving organisational throughput, while simultaneously decreasing both inventory and operating expense
  • The Theory Of Constraints can be applied to Continuous Delivery, as the build, testing, and operational activities in a technology value stream should be homogeneous
  • The Five Focussing Steps can be used to identify constrained activities, and then introduce the necessary technology and organisational changes to reduce rework and/or queue times

Introduction

Technology can bring benefits if, and only if, it diminishes a limitation – Dr. Eli Goldratt

An organisation will have one or more technology value streams. Each one will represent a sequence of activities that converts business ideas into value-adding software for customers. The first part of a technology value stream will be design and development activities, which are inherently non-deterministic and highly variable. The second part will be build, testing, and operational activities, which should be as deterministic and invariable as possible. When an organisation lacks the necessary stability and throughput in its technology value streams to satisfy market demand, it is in a state of Discontinuous Delivery.

Continuous Delivery is a set of holistic principles and practices to improve the stability and throughput of IT delivery. It needs a substantial investment in technology changes:

  • Version Control: put code, configuration, infrastructure definitions, migration scripts, etc. in version control to preserve change history
  • Environments: automate self-service, on-demand provisioning of short-lived test environments and incremental release patterns such as Canary Deployments
  • Development: use Test Driven Development with Pair-Programming to build quality into a codebase, and use Trunk Based Development with Continuous Integration to ensure it is always releasable
  • Architecture: establish an Evolutionary Architecture to encourage loosely-coupled, independently deployable services
  • Testing: automate parallelisable acceptance tests with dynamic test data, use Smoke Testing to validate deployments, and use Exploratory Testing to uncover new information
  • Operability: aggregate logs and metrics to create a centralised telemetry platform for visualising normal conditions, and alerting on abnormal conditions

It also requires a significant amount of organisational changes:

  • Batch Size: reduce features per release candidate to decrease variability, feedback delays, risk, overheads, inefficiencies, and lead time via Little’s Law (see the sketch after this list)
  • Management: devolve decision making to the employees closest to the work to empower better choices in design, testing etc.
  • Culture: grow a performance-oriented culture of cooperation and collaboration to increase information flow between teams
  • Responsibilities: change responsibilities for testing and on-call production support to encourage a shared sense of accountability
  • Risk: use continuous code reviews via Pair-Programming and traceability of production changes to replace Risk Management Theatre with adaptive risk assessment
  • Skills: invest in employee training to ease the introduction of new practices and tools e.g. infrastructure automation
  • Structures: convert siloed functional teams into cross-functional teams, and align team and software architecture with Conway’s Law to enable faster delivery
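The Batch Size change above leans on Little’s Law, which states that average lead time equals average work in process divided by average throughput. A minimal sketch with hypothetical numbers, holding throughput constant, shows why smaller batches shorten lead time.

```python
# A minimal sketch of Little's Law: average lead time = average WIP / average throughput.
# The numbers are hypothetical; only the batch size, and therefore WIP, changes.

throughput = 5           # features completed per week
large_batch_wip = 40     # features in process with large, infrequent release candidates
small_batch_wip = 10     # features in process with small, frequent release candidates

print(f"large batches: {large_batch_wip / throughput} weeks average lead time")   # 8.0 weeks
print(f"small batches: {small_batch_wip / throughput} weeks average lead time")   # 2.0 weeks
```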

These changes are challenging in isolation, but it is their application to the unique circumstances and constraints of an organisation that makes Continuous Delivery so difficult. Adopting Continuous Delivery means implementing widespread changes in the highly uncertain conditions of a complex, adaptive system, in which behaviours are emergent and interactions are unpredictable.

A Continuous Delivery programme should aim to reduce rework and queue times in a technology value stream, as they are the most common sources of waste. Rework is any activity that must be repeated due to a failure, such as a tester being given a defective release candidate. Queue time is time spent queuing for an activity, such as a deployment waiting for a tester to become available. There are many potential causes of rework and queue time, including the following:

  • Snowflake environments – manual environment provisioning
  • Brittle deployments – unreliable code, config, or infra changes
  • Environment contention – slow access to test data, services, or environments
  • End-To-End Testing – reams of slow, non-deterministic tests
  • Rigid architecture – application deployments coupled together
  • Excessive toil – manual, repetitive build, testing, and operations tasks

Over the years, the advice on how to implement Continuous Delivery has evolved. The Continuous Delivery book suggested using the Plan-Do-Check-Act Cycle to experiment with changes, and implement a maturity model. Lean Enterprise recommended the Improvement Kata be used to create a Continuous Improvement framework, with Plan-Do-Check-Act used to structure different experiments. This can be very effective, but there needs to be a way to focus on the activities where a reduction in rework and/or queue times will have an immediate, positive impact. Otherwise, there is a high probability of sub-optimal outcomes, stakeholder dissatisfaction, and Discontinuous Delivery remaining the norm.

Given sufficient Continuous Delivery expertise, heuristics can be used to locate those activities in a technology value stream. An alternative method is needed when those heuristics are unavailable, or insufficient to overcome resistance to change.

Example – Network Activation

Think of an 18 month Continuous Delivery programme for a multi-team, multi-application Network Activation platform. The delivery teams already use Pair-Programming, Test-Driven Development, Trunk Based Development, and Continuous Integration. The on-prem technology value stream has 2 pre-production activities – Functional Test and Integration Test – and a production deployment involves 10 semi-automated and manual tasks. Problems include snowflake environments, brittle deployments, and End-To-End Testing. There is no data on deployments, beyond an average lead time to production of 34 days. The product owner wants it to be 14 days.

Over 18 months there is a concerted effort to create a consistent deployment pipeline, with the Cost of Delay used to prioritise pipeline features. Aggregate Artifacts and Artifact Containers are used to automate deployments, infrastructure provisioning errors are eliminated, and pipeline dashboards are built. Verify Branch By Abstraction is used to incrementally decouple applications, by replacing object serialisation with RESTful APIs. API Examples and Consumer Driven Contracts are used to verify application interactions. There is also a series of organisational changes to policies, processes, and teams, to reduce queue time for Functional Test and Integration Test.

These changes produce many benefits, including a 95% build time improvement from 1 hour to 3 minutes, and a test deployment time improvement from 3.5 hours to 2.5 hours. However, there is no impact whatsoever on the average lead time of 34 days. On that basis, the Continuous Delivery programme is unsuccessful.

The Theory Of Constraints

In The Goal, Dr. Eli Goldratt sets out the Theory Of Constraints. The Theory Of Constraints is a management paradigm for improving organisational throughput, by optimising a small number of constraints in a homogeneous workflow. It states the goal of an organisation is to make money, by increasing throughput while simultaneously decreasing both inventory and operating expense.

A constraint is any resource with capacity equal to or less than market demand, and it will limit the throughput of the organisation. A constraint is internal when market demand is greater than throughput, and external when throughput is greater than market demand. An hour lost at a constraint is an hour lost for the whole organisation.

A non-constraint is any resource with capacity greater than market demand. The level of utilisation of a non-constraint is determined by a constraint elsewhere. An increase in non-constraint capacity is a waste of time and money, as it can only increase queued work before a constraint and cannot reduce work starvation after a constraint. An hour lost at a non-constraint is an illusion.
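These definitions can be sketched with hypothetical weekly capacities and market demand; none of the numbers or activity names below come from the article.

```python
# A minimal sketch of constraints vs. non-constraints. Capacities and market
# demand are hypothetical, measured in release candidates per week.

market_demand = 10   # release candidates per week

capacity = {
    "build": 50,
    "functional test": 25,
    "integration test": 8,      # capacity <= demand, so this is the constraint
    "production deployment": 12,
}

for resource, weekly in capacity.items():
    kind = "constraint" if weekly <= market_demand else "non-constraint"
    print(f"{resource}: {weekly}/week ({kind})")

# Throughput is bounded by the constraint, despite demand of 10 per week
print(f"throughput: {min(capacity.values())} release candidates per week")        # 8
```

Raising the capacity of build or functional test changes nothing; only raising the capacity of integration test increases throughput.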

The central premise of the Theory Of Constraints is to iteratively increase the capacity of a constraint, until the flow of work items can be balanced according to market demand. The Five Focussing Steps are used to achieve this:

  1. Identify the constraint(s): calculate market demand for a product, and which activities have capacity less than or equal to demand
  2. Exploit the constraint(s): increase constraint utilisation by eliminating idle time, and time wasted on lower priority and/or defective work items
  3. Subordinate everything else to the constraint(s): protect a constraint from excessive inventory or work starvation by maintaining a buffer of materials, and regulating the arrival of new work items into the system based on constraint capacity
  4. Elevate the constraint(s): offload constraint work items to a non-constraint by investing in additional equipment, people, and/or allocation of work items to third parties
  5. If a constraint has been broken, go back to step 1, but do not allow inertia to cause a constraint

Continuous Delivery and the Theory Of Constraints

The Theory Of Constraints can be applied to Continuous Delivery, as the build, testing, and operations activities in a technology value stream should be homogeneous. There will be one, or at most a few constrained activities, caused by rework and/or queue times. The Five Focussing Steps can be used to identify and optimise the constraint(s), until the flow of release candidates is balanced from mainline commit to production and Continuous Delivery is achieved. At that point, an external constraint will emerge upstream in product development or downstream in sales. If market demand increases later on, a new internal constraint might surface within the technology value stream.

Identify the constraint

A constraint is identified by establishing the market demand for a product, and then calculating how much time each activity contributes to meeting that demand. Overall market demand should be specified by the product owner as a target Lead Time lt, and measured from mainline commit to production launch.

lt = X minutes/hours/days

The ability of an activity aₙ to meet lt should be expressed as its Activity Time atₙ. It can be measured as the median of the differences between its finish time aftₙ and the finish time of the prior activity aftₙ₋₁, for all occurrences of aₙ in a time period t. This measurement ensures queue time and process time are both accounted for. The measurement unit should be minutes, hours, or days, based on lt. It might be necessary to measure variability in Activity Time as well, if one or more activities exhibit an undesirable amount of variability.

atₙ = median( aftₙ - aftₙ₋₁ ) for all occurrences of aₙ in t

If an activity has atₙ greater than lt, it is a constraint. If a high percentage of release candidates consistently fail to pass the activity on the first attempt, it is unstable and rework must be reduced. Alternatively, if release candidates are regularly delayed before starting the activity, it is too slow to begin and queue time must be reduced.

If no activity has atₙ greater than lt, there is no internal constraint. This means lt is not ambitious enough, or there is an external constraint outside the technology value stream.
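A minimal sketch of this measurement, assuming hypothetical activities, finish times, and a 14 day target lead time, is shown below.

```python
from statistics import median

# A minimal sketch of the Activity Time measurement above. Activity names,
# finish times (in days from mainline commit), and the target are hypothetical.

ACTIVITIES = ["Build", "Functional Test", "Integration Test", "Production"]

# Finish time of each activity, per release candidate, within the time period t
finish_times = [
    {"Build": 1, "Functional Test": 2, "Integration Test": 5, "Production": 21},
    {"Build": 1, "Functional Test": 3, "Integration Test": 7, "Production": 25},
    {"Build": 1, "Functional Test": 2, "Integration Test": 6, "Production": 19},
]

target_lead_time = 14  # days, the product owner's target lt

def activity_time(activity: str):
    """at_n: median of (aft_n - aft_n-1) over all release candidates in the period."""
    prior = ACTIVITIES[ACTIVITIES.index(activity) - 1]
    return median(rc[activity] - rc[prior] for rc in finish_times)

for activity in ACTIVITIES[1:]:
    at = activity_time(activity)
    verdict = "constraint" if at > target_lead_time else "not a constraint"
    print(f"{activity}: activity time {at} days ({verdict})")
```

In this sketch Production is the constraint: its median Activity Time of 16 days exceeds the 14 day target, while the testing activities are well within it.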

Applying the Theory Of Constraints to the Network Activation platform with its target lead time of 14 days reveals a constraint. Visualising each activity as queue time and process time shows Production queue time was often greater than 14 days. This was caused by a well-intentioned release management policy of deploying release candidates into production when ready, and then waiting for a launch date pre-agreed with an upstream billing provider months earlier. Launch dates were delayed when testing finished late, but never brought forward when testing finished early. 18 months of hard work reduced variability, Functional Test queue time, and Integration Test queue time, but those improvements merely led to an increase in Production queue time until launch dates occurred.

Optimise the constraint

A constraint is optimised by experimenting with the full range of technology and organisational changes recommended by Continuous Delivery. There needs to be a concerted effort to reduce the average and variation in Activity Time for the constrained activity, until the target Lead Time can be satisfied.

Exploiting a constrained activity means ensuring it is always working, and always doing valuable work. This aligns with Continuous Delivery emphasising the automation of repetitive tasks to free up people for higher-order problems. Automated infrastructure provisioning, automated unit and acceptance testing, and moving to an Evolutionary Architecture are all examples of technology changes to reduce time spent at a constrained activity. The most effective organisational change is to reduce batch sizes, as a smaller batch will have shorter process times and queue times.

Subordinating unconstrained activities means limiting the flow of mainline commits, builds, and deployments to match the constrained activity. This is best accomplished with Work In Process (WIP) Limits, which encourage people to collaborate on a few work items at any point in time. Stop The Line teaches people to prioritise a releasable codebase over developing more features, and eXtreme Programming practices such as Test-Driven Development and Pair-Programming can also foster a shared team cadence.

Elevating a constrained activity means investing in people and tools to increase its capacity. This might be achieved with technology changes, such as a move to short-lived test environments via fully automated infrastructure provisioning. It might also involve organisational changes such as paying third party suppliers for more test data, hiring more engineers, or running training courses to improve Staff Liquidity.

For instance, an assortment of changes should be tried if End-To-End Testing activity is a constraint. Time wasted should be eliminated by ensuring testers are uninterruptible, making more testers available, and running testing slots 24 hours a day. Defective release candidates should be minimised by moving all other testing activities in front of End-To-End Testing, adding automated contract tests, and rejecting release candidates with one or more test failures. Over time the end-to-end tests should be incrementally replaced with automated acceptance tests, monitoring dashboards, and automated anomaly detection.

Example: MediaTech

Consider a 7 month Continuous Delivery programme for a multi-team, multi-application MediaTech platform. The delivery teams use Trunk Based Development with some Pair Programming, but there is little Test-Driven Development. The on-prem technology value stream has 2 pre-production activities – Functional Test and Release Test – and a production deployment involves 60 semi-automated and manual tasks. Problems include snowflake environments, brittle deployments, environment contention, a rigid architecture, End-To-End Testing, and excessive toil. There is no data on deployments, beyond an average interval of 3 weeks. The product owner has set a target lead time of 1 week, at an interval of 1 week.

The range of problems at MediaTech makes Continuous Delivery very challenging. There is a proposal to create a fully automated deployment pipeline, with infrastructure provisioning, deployments, and telemetry. Other suggested options include promoting Test-Driven Development, implementing zero downtime deployments, and introducing immutable release candidates. However, there are concerns about the time needed to make such changes.

After 2 months, deployment data is scraped from chat channels, and applying the Five Focussing Steps shows sweeping changes are not immediately required. The constraint on a target lead time of 1 week is Release Test process time, due to perennial environment instability caused by configuration and test data issues. Interestingly, there is no obvious constraint on a target lead time of 2 weeks. As a result, the product owner agrees to a target lead time of 2 weeks at an interval of 2 weeks, as an intermediate milestone.

In the Theory Of Constraints, a lack of skilled people can constitute a constraint. When the above data is shared with the MediaTech operations team, they reveal an unseen constraint on the 14 day target lead time. There is a single release planner, who endures a heavy workload for each production deployment on top of their day to day work. Everyone agrees the Continuous Delivery programme needs to focus on automating the release planning tasks, and improving the existing automated tests.

Over the next few months, the release planning workload is reduced and the automated tests are stabilised. Callout rota emails by the release planner are replaced by a shared rota, co-owned by the delivery teams. A release note process overseen by the release planner and involving 3 different teams is replaced by a fully automated release note tool. Furthermore, end-to-end tests in Functional Test and Release Test are prioritised by their non-determinism, and gradually rewritten as build-time acceptance tests. The target lead time of 2 weeks at an interval of 2 weeks is successfully achieved, and after several production deployments the product owner decides the new throughput is sufficient. The Continuous Delivery programme is considered a success, without a fully automated deployment pipeline.

Further Reading

  1. Beyond The Goal by Dr. Eli Goldratt
  2. Measuring Continuous Delivery by the author
  3. Resilience as a Continuous Delivery Enabler by the author

Acknowledgements

Thanks to Thierry de Pauw for his feedback on this article.
