Steve Smith

On Tech

You Build It You Run It

What is You Build It You Run It, and why does it have such a positive impact on operability? Why is it important to balance support cost effectiveness with operability incentives?

This is part of the Who Runs It series.

Introduction

The usual alternative to You Build It Ops Run It is for a Delivery team to assume responsibility for its own Run activities, including deployments and production support. This is often referred to as You Build It You Run It.

You Build It You Run It consists of single-level swarming support, with developers on-call. There is also a Service Desk to handle customer requests. The toolchain needs to include anomaly detection, alert notifications, messaging, and incident management tools, such as Prometheus, PagerDuty, Slack, and ServiceNow.

As with You Build It Ops Run It, Service Desk is an L1 team that receives customer requests and will resolve simple technology issues wherever possible. A development team in Delivery is also L1, and they will monitor dashboards, receive alerts, and respond to incidents. Service Desk should escalate tickets for particular website pages or user journeys into the incident management system, which would be linked to applications.

Delivery engineering costs and on-call support will both be paid out of CapEx, and Operations teams such as Service Desk will be under OpEx. As with You Build It Ops Run It, the Service Desk team might be outsourced to reduce OpEx costs. CapEx funding for You Build It You Run It will compel a product manager to balance their desired availability with on-call costs. OpEx funding for Delivery on-call should be avoided wherever possible, as it encourages product managers to artificially minimise risk tolerance and select high availability targets regardless of on-call costs.

Continuous Delivery and operability

Swarming support means Delivery prioritising incident resolution over feature development, in line with the Continuous Delivery practice of Stop The Line and the Toyota Andon Cord. This encourages developers to limit failure blast radius wherever possible, and prevents them from deploying changes mid-incident that might exacerbate a failure. Swarming also increases learning, as it ensures developers are able to uncover perishable mid-incident information, and cross-pollinate their skills.

You Build It You Run It also has the following advantages for product development:

  • Short deployment lead times – lead times will be minimised due to no handoffs
  • Minimal knowledge synchronisation costs – developers will be able to easily share application and incident knowledge, to better prepare themselves for future incidents
  • Focus on outcomes – teams will be empowered to deliver outcomes that test product hypotheses, and iterate based on user feedback
  • Short incident resolution times – incident response will be quickened by no support ticket handoffs or rework
  • Adaptive architecture – applications will be architected to limit failure blast radius, including bulkheads and circuit breakers
  • Product telemetry – dashboards and alerts will be continually updated by developers, to be multi-level and tailored to the product context
  • Traffic knowledge – an appreciation of the pitfalls and responsibilities inherent in managing live traffic will be factored into design work
  • Rich situational awareness – developers will respond to incidents with the same context, ways of working, and tooling
  • Clear on-call expectations – developers will be aware they are building applications they themselves will support, and they should be remunerated

You Build It You Run It creates the right incentives for operability. When Delivery is responsible for their own deployments and production support, product owners will be more aware of operational shortfalls, and pressed by developers to prioritise operational features alongside product ideas. Ensuring that application availability is the responsibility of everyone will improve outcomes and accelerate learning, particularly for developers who in IT As A Cost Centre are far removed from actual customers. Empowering delivery teams to do on-call 24×7 is the only way to maximise incentives to build operability in.

Production support as revenue insurance

The most common criticism of You Build It You Run It is that it is too expensive. Paying Delivery team members for L1 on-call standby and callout can seem costly, particularly when You Build It Ops Run It allows for L1-2 production support to be outsourced to cheaper third party suppliers. This perception should not be surprising, given David Woods’ assertion in The Flip Side Of Resilience that “graceful extensibility trades off with robust optimality”. Implementing You Build It You Run It to increase adaptive capacity for future incidents may look wasteful, particularly if incidents are rare.

A more holistic perspective would be to treat production support as revenue insurance for availability targets, and consider risk in terms of revenue impact instead of incident count. A production support policy will cover:

  • Availability protection
  • Availability restoration on loss

You Build It You Run It maximises incentives for Delivery teams to focus from the outset on protecting availability, and it guarantees the callout of an L1 Delivery engineer to restore availability on loss. This should be demonstrable with a short Time To Restore (TTR), which could be measured via availability time series metrics or incident duration. That high level of risk coverage will come at a higher premium. This means You Build It You Run It will be more cost effective for applications with higher availability targets and greater potential for revenue loss.

You Build It Ops Run It offers a lower level of risk coverage at a lower premium, with weak incentives to protect application availability and an L2 Application Operations team to restore application availability. This will produce a higher TTR than You Build It You Run It. This may be acceptable for applications with lower availability targets and/or limited potential for revenue loss.

The cost effectiveness of a production support policy can be calculated per availability target by comparing its availability restoration capability with support cost. For example, at Fruits R Us there are 3 availability targets with estimated maximum revenue losses on availability target loss. Fruits R Us has a Delivery team with an on-call cost of £3K per calendar month and a TTR of 20 minutes, and an Application Operations team with a cost of £1.5K per month and a TTR of 1 hour.

Projected availability loss per team is a function of TTR and the £ maximum availability loss per availability target, and lower losses can be calculated for the Delivery team due to its shorter TTR.

At 99.0%, Application Operations is as cost effective at availability restoration of a 7 hour 12 minute outage as the Delivery team, and Fruits R Us might consider the merits of You Build It Ops Run It. However, this would mean Application Operations would be unable to build operability in and increase availability protection, and the Delivery team would have few incentives to contribute.

At 99.5%, the Delivery team is more cost effective at availability restoration of a 3 hour 36 minute outage than Application Operations.

At 99.9%, the Delivery team is far more cost effective at availability restoration of a 43 minute 12 second outage. The 1 hour TTR of Application Operations means their £ projected availability loss is greater than the £ maximum availability loss at 99.9%. You Build It You Run It is the only choice.
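The Fruits R Us comparison above can be sketched in a few lines. The support costs and TTRs are taken from the example; the £ maximum revenue losses per availability target are illustrative assumptions, chosen so the totals reproduce the narrative:

```python
# Support costs and TTRs are from the Fruits R Us example; the maximum
# revenue losses per availability target are illustrative assumptions.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30 day month

TEAMS = {
    "Delivery": {"cost": 3_000, "ttr_mins": 20},
    "Application Operations": {"cost": 1_500, "ttr_mins": 60},
}

# Assumed maximum revenue loss (in £) on loss of each availability target
MAX_LOSS = {0.990: 16_200, 0.995: 50_000, 0.999: 250_000}

def max_outage_mins(target):
    """The longest monthly outage permitted by an availability target."""
    return MINUTES_PER_MONTH * (1 - target)

def projected_loss(target, ttr_mins):
    """Scale the maximum loss by the fraction of the permitted outage
    consumed by a team's time to restore."""
    return MAX_LOSS[target] * (ttr_mins / max_outage_mins(target))

def monthly_cost(target, team):
    """Support premium plus projected availability loss for a team."""
    return TEAMS[team]["cost"] + projected_loss(target, TEAMS[team]["ttr_mins"])
```

With these assumed loss figures, both teams cost roughly £3,750 per month at 99.0%, the Delivery team is clearly cheaper at 99.5%, and at 99.9% the 1 hour TTR of Application Operations alone projects a loss above the maximum.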

The Who Runs It series:

  1. You Build It Ops Run It
  2. You Build It You Run It
  3. You Build It Ops Run It at scale
  4. You Build It You Run It at scale
  5. You Build It Ops Sometimes Run It
  6. Implementing You Build It You Run It at scale
  7. You Build It SRE Run It

Acknowledgements

Thanks to Thierry de Pauw.

Operability measures

Why is it important to measure operability? What should be the trailing indicators and leading indicators of operability?

Introduction

In How To Measure Anything, Douglas Hubbard states that organisations suffer from a Measurement Inversion, and waste their time measuring variables with a low information value. This is certainly true of IT reliability, which is usually measured badly, if at all. By proxy, this includes operability as well.

In many organisations, reliability is measured by equating failures with recorded production incidents. Incident durations are calculated for Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), or there is just an overall incident count. These are classic vanity measures. They are easy to implement and understand, but they have a low information value due to the following:

  • Quantitative measures such as incident count have no reflection on business drivers, such as percentage of unrecoverable user errors
  • Manual recording of incidents in a ticket system can be affected by data inaccuracies and cognitive biases, such as confirmation bias and recency bias
  • Goodhart’s Law means measuring incidents will result in fewer incident reports. People adjust their behaviours based on how they are measured, and measuring incidents will encourage people to suppress incident reports with potentially valuable information.

If operability is to be built into applications, there is a need to identify trailing and leading indicators of operability that are holistic and actionable. Measures of operability that encourage system-level collaboration rather than individual productivity will pinpoint where improvements need to be made. Without those indicators, it is difficult to establish a clear picture of operability, and where changes are needed.

Effective leading and trailing indicators of software delivery should be visualised and publicly communicated throughout an organisation, via internal websites and dashboards. Information radiators help engineers, managers, and executives understand at a glance the progress being made and alignment with organisational goals. Transparency also reduces the potential for accidents and bad behaviours. As Louis Brandeis said in Other People’s Money, “sunlight is said to be the best of disinfectants; electric light the most efficient policeman”.

Availability as a trailing indicator

Failures should be measured in terms of application availability targets, not production incidents. Availability measurements are easy to implement with automated time series metrics collection, easy to understand, and have a high information value. Measurements can be designed to distinguish between full and partial degradation, and between unrecoverable and recoverable user errors.

For example, a Fruits R Us organisation has 99.0%, 99.5%, and 99.9% as its availability targets. A product manager for an Oranges application selects 99.5% for at least the first 3 months.

Availability should be measured in the aggregate as Request Success Rate, as described by Betsy Beyer et al in Site Reliability Engineering. Request Success Rate can approximate degradation for customer-facing or back office applications, provided there is a well-defined notion of successful and unsuccessful work. It covers partial and full downtime for an application, and is more fine-grained than uptime versus downtime.

When an application has a Request Success Rate lower than its availability target, it is considered a failure. The average time to restore availability can be tracked as a Mean Time To Repair metric, and visualised in a graph alongside availability.

At Fruits R Us, the Oranges application communicates with upstream consumers via an HTTPS API. Its availability is constantly measured by Request Success Rate, which is implemented by checking the percentage of upstream requests that produce an HTTP response code lower than HTTP 500. When the Request Success Rate over 15 minutes is lower than the availability target of 99.5%, it is considered a failure and a production incident is raised. An availability graph can be used to illustrate availability, incidents, and time to repair as a trailing indicator of operability.
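A minimal sketch of that Request Success Rate check might look like this, assuming the HTTP status codes of upstream requests over the last 15 minutes are available as a list:

```python
# Sketch of the Oranges availability check: a request succeeds if its HTTP
# status code is below 500, and a failure is declared when the Request
# Success Rate over the measurement window drops below the target.

def request_success_rate(status_codes):
    """Percentage of requests with an HTTP status code below 500."""
    if not status_codes:
        return 100.0  # no traffic is treated as no observed degradation
    successes = sum(1 for code in status_codes if code < 500)
    return 100.0 * successes / len(status_codes)

def is_failure(status_codes, availability_target=99.5):
    """True when the success rate over the window is below the target."""
    return request_success_rate(status_codes) < availability_target
```

For example, 1 server error in 100 requests gives a 99.0% success rate, which breaches the 99.5% target and would raise a production incident.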

Leading indicators of operability

Failures cannot be predicted in a production environment as it is a complex, adaptive system. In addition, it is easy to infer a false narrative of past behaviours from quantitative data. The insights uncovered from an availability trailing indicator and the right leading indicators can identify inoperability prior to a production incident, and they can be pattern matched to select the best heuristic for the circumstances.

A leading indicator should be split into an automated check and one or more exploratory tests. This allows for continuous discovery of shallow data, and frees up people to examine contextual, richer data with a higher information value. Those exploratory tests might be part of an operational readiness assessment, or a Chaos Day dedicated to particular applications.

Learning is a vital leading indicator of operability. An organisation is more likely to produce operable, reliable applications if it fosters a culture of continuous learning and experimentation. After a production incident, nothing should be more important than everyone in the organisation having the opportunity to accumulate new knowledge, for their colleagues as well as themselves.

The initial automated check of learning should be whether a post-incident review is published within 24 hours of an incident. This is easy to automate with a timestamp comparison between a post-incident review document and the central incident system, easy to communicate across an organisation, and highly actionable. It will uncover incident reviews that do not happen, are not publicly published, or happen too late to prevent information decay.
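The timestamp comparison could be as simple as the following sketch, assuming the incident resolution time and the review publication time are available as datetimes from the incident system and the document store:

```python
from datetime import datetime, timedelta

# Sketch of the automated learning check: was a post-incident review
# published within 24 hours of the incident being resolved?

def review_published_in_time(incident_resolved, review_published,
                             window_hours=24):
    """True when the review was published within the window after resolution."""
    elapsed = review_published - incident_resolved
    return timedelta(0) <= elapsed <= timedelta(hours=window_hours)
```

A review published 23 hours after resolution passes the check; one published 25 hours later fails it, flagging a risk of information decay.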

Another learning check should be the throughput of operability tasks, comprising the lead time to complete a task and the interval between completing tasks. Tasks should be created and stored in a machine readable format during operability readiness assessments, Chaos Days, exploratory testing, and other automated checks of operability. Task lead time should not be more than a week, and task interval should not exceed the fastest learning source. For example, if operability readiness assessments occur every 90 days and Chaos Days occur every 30 days, then at least one operability task should be completed per month.
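Both throughput checks can be expressed as one-liners, assuming the cadence of each learning source is known in days:

```python
# Sketch of the operability task throughput checks: task lead time should be
# a week at most, and task interval should not exceed the fastest learning source.

def max_task_interval_days(learning_source_cadences):
    """learning_source_cadences maps each learning source to its cadence in days."""
    return min(learning_source_cadences.values())

def lead_times_acceptable(task_lead_times_days, limit_days=7):
    """True when every completed task took a week or less."""
    return all(days <= limit_days for days in task_lead_times_days)
```

With {"operational readiness assessment": 90, "Chaos Day": 30}, the maximum task interval is 30 days, i.e. at least one operability task per month.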

Acknowledgements

Thanks as usual to Thierry de Pauw for reviewing this series.

Build Operability in

What is operability, how does it promote resilience, and how does building operability into your applications drive Continuous Delivery adoption?

Introduction

The origins of the 20th century, pre-Internet IT As A Cost Centre organisational model can be traced to the suzerainty of cost accounting, and the COBIT management and governance framework. COBIT has recommended sequential Plan-Build-Run phases to maximise resource efficiency since its launch in 1996. The Plan phase is business analysis and product development, Build is product engineering, and Run is product support. The justification for this was the high compute costs and the high transaction cost of a release in the 1990s.

With IT As A Cost Centre, Plan happens in a Product department, and Build and Run happen in an IT department. IT will have separate Delivery and Operations groups, with competing goals:

  • Delivery will be responsible for building features
  • Operations will be responsible for running applications

Delivery and Operations will consist of functionally-oriented teams of specialists. Delivery will have multiple development teams. Operations will have Database, Network, and Server teams to administer resources, a Service Transition team to check operational readiness prior to launch, and one or more Production Support teams to respond to live incidents.

Siloisation causes Discontinuous Delivery

In The High-Velocity Edge, Dr. Steven Spear warns that over-specialisation leads to siloisation, and causes functional areas to “operate more like sovereign states”. Delivery and Operations teams with orthogonal priorities will create multiple handoffs in a technology value stream. A handoff means waiting in a queue for a downstream team to complete a task, and that task could inadvertently produce more upstream work.

Furthermore, the fundamentally opposed incentives, nomenclature, and risk appetites within Delivery and Operations teams will cause a pathological culture to emerge over time. This is defined by Ron Westrum in A Typology of Organisational Cultures as a culture of power-oriented interactions, with low cooperation and neglected responsibilities.

Plan-Build-Run was not designed for fast customer feedback and iterative product development. The goal of Continuous Delivery is to achieve a deployment throughput that satisfies product demand. Disparate Delivery and Operations teams will inject delays and rework into a technology value stream such that lead times are disproportionately inflated. If product demand dictates a throughput target of weekly deployments or more, Discontinuous Delivery is inevitable.

Robustness breeds inoperability

Most IT cost centres try to achieve reliability by Optimising For Robustness, which means prioritising a higher Mean Time Between Failures (MTBF) over a lower Mean Time To Repair (MTTR). This is based on the idea that a production environment is a complicated system, in which homogeneous application processes have predictable, repeatable interactions, and failures are preventable.

Reliability is dependent on operability, which can be defined as the ease of safely operating a production system. Optimising For Robustness produces an underinvestment in operability, due to the following:

  • A Diffusion Of Responsibility between Delivery and Operations. When Operations teams are accountable for operational readiness and incident response, Delivery teams have little reason to work on operability
  • A Normalisation of Deviation within Delivery and Operations. When failures are tolerated as rare and avoidable, Delivery and Operations teams will pursue cost savings rather than an ability to degrade on failure

That underinvestment in operability will result in Delivery and Operations teams creating brittle, inoperable production systems.

Symptoms of brittleness will include:

  • Inadequate telemetry – an inability to detect abnormal conditions
  • Fragile architecture – an inability to limit blast radius on failure
  • Operator burnout – an inability to perform heroics on demand
  • Blame games – an inability to learn from experience

This is ill-advised, as failures are entirely unavoidable. A production environment is actually a complex system, in which heterogeneous application processes have unpredictable, unrepeatable interactions, and failures are inevitable. As Richard Cook explains in How Complex Systems Fail, the complexity of these systems makes it “impossible for them to run without multiple flaws”. A production environment is perpetually in a state of near-failure.

A failure occurs when multiple flaws unexpectedly coalesce and impede a business function, and the costs can be steep for a brittle, inoperable application. Inadequate telemetry widens the sunk cost duration from failure start to detection. A fragile architecture expands the opportunity cost duration from detection until resolution, and the overall cost per unit time. Operator burnout increases all costs involved, and blame games allow similar failures to occur in the future.

Resilience needs operability

Optimising For Resilience is a more effective reliability strategy. This means prioritising a lower MTTR over a higher MTBF. The ability to quickly adapt to failures is more important than fewer failures, although some failure classes should never occur and some safety-critical systems should never fail.

Resilience can be thought of as graceful extensibility. In The Theory of Graceful Extensibility, David Woods defines it as “a blend of graceful degradation and software extensibility”. A complex system with high graceful extensibility will continue to function, whereas a brittle system would collapse.

Graceful extensibility is derived from the capacity for adaptation in a system. Adaptive capacity can be created when work is effectively managed to rapidly reveal new problems, problems are quickly solved and produce new knowledge, and new local knowledge is shared throughout an organisation. These can be achieved by improving the operability of a system via:

  • An adaptive architecture
  • Incremental deployments
  • Automated provisioning
  • Ubiquitous telemetry
  • Chaos Engineering
  • You Build It You Run It
  • Post-incident reviews

Investing in operability creates a production environment in which applications can gracefully extend on failure. Ubiquitous telemetry will minimise sunk cost duration, an adaptive architecture will decrease opportunity cost duration, operator health will aid all aspects of failure resolution, and post-incident reviews will produce shareable knowledge for other operators. The result will be what Ron Westrum describes as a generative culture of performance-oriented interactions, high cooperation, and shared risks.

Dr. W. Edwards Deming said in Out Of The Crisis that “you cannot inspect quality into a product”. The same is true of operability. You cannot inspect operability into a product. Building operability in from the outset will remove handoffs, queues, and coordination costs between Delivery and Operations teams in a technology value stream. This will eliminate delays and rework, and allow Continuous Delivery to be achieved.

Acknowledgements

Thanks to Thierry de Pauw for the review.

Multi-Demand Operations

How can Multi-Demand Operations eliminate handoffs and adhere to ITIL? Why are Service Transition, Change Management, and Production Support activities inimical to Continuous Delivery? How can such Policy Rules be turned into ITIL-compliant Policy Guidelines that increase flow?

This is part 5 of the Strategising for Continuous Delivery series

Know Operations activities

When an organisation has IT As A Cost Centre, its IT department will consist of siloed Delivery and Operations groups. This is based on the outdated COBIT notion of sequential Plan-Build-Run activities, with Delivery teams building applications and Operations teams running them. If the Operations group has adopted ITIL Service Management, its Run activities will include:

  • Service Transition – perform operational readiness checks for an application prior to live traffic
  • Change Management – approve releases for an application with live traffic 
  • Tiered Production Support – monitor for and respond to production incidents for applications with live traffic  

Well-intentioned, hard working Operations teams in IT As A Cost Centre will be incentivised to work in separate silos to implement these activities as context-free, centralised Policy Rules.

See rules as constraints

Policy Rules from Operations will inevitably inject delays and rework into a technology value stream, due to the handoffs and coordination costs involved. One of those Policy Rules will likely constrain throughput for all applications in a high demand group, even if it has existed without complaint in lower demand groups for years.

Service Transition can delay an initial live launch by weeks or months. Handing over an application from Delivery to Operations means operational readiness is only checked at the last minute. This can result in substantial rework on operational features when a launch deadline looms, and little time is available. Furthermore, there is little incentive for Delivery teams to assess and improve operability when Operations will do it for them.

Change Management can delay a release by days or weeks. Requesting an approval means a Change Advisory Board (CAB) of Operations stakeholders must find the time to meet and assess the change, and agree a release date. An approval might require rework in the paperwork, or in the application changeset. Delays and rework are exacerbated during a Change Freeze, when most if not all approvals are suspended for days at a time. In Accelerate, Dr. Nicole Forsgren et al demonstrate a negative correlation between external approvals and throughput, and conclude “it is worse than having no change approval process at all”.

Tiered Production Support can delay failure resolution by hours or days. Raising a ticket incurs a progression from a Level 1 service desk to Level 2 support agents, and onto Level 3 Delivery teams until the failure is resolved. Non-trivial tickets will go through one or more triage queues until the best-placed responder is found. A ticket might involve rework if repeated, unilateral reassignments occur between support levels, teams, and/or individuals. This is why Jon Hall argues in ITSM and why three-tier support should be replaced with Swarming that “the current organizational structure of the vast majority of IT support organisations is fundamentally flawed”.

These Policy Rules will act as Risk Management Theatre to varying degrees in different demand groups. They are based on the misguided assumption that preventative controls on everyone will prevent anyone from making a mistake. They impede knowledge sharing, restrict situational awareness, increase opportunity costs, and actively contribute to Discontinuous Delivery.

Example – MediaTech

At MediaTech, an investment in re-architecting videogames-ui and videogames-data has increased videogames-ui deployment frequency to every 10 days. Yet the Website Services demand group has a target of 7 days, and using the Five Focussing Steps reveals Change Management is the constraint for all applications in the Website Services technology value stream.

A Multi-Demand lens shows a Change Management policy inherited from the lower demand Supplier Integrations and Heritage Apps demand groups. All Website Services releases must have an approved Normal Change, as has been the case with Supplier Integrations and Heritage Apps for years. Normal Changes have a lead time of 0-4 days. This is the most time-consuming activity in Operations, due to the handoffs between approver groups. It is the constraint on Website Services like videogames-ui.

Create ITIL guidelines

Siloed Operations activities are predicated on high compute costs, and the high transaction cost of a release. That may be true for lower demand applications in an on-premise estate. However, Cloud Computing and Continuous Delivery have invalidated that argument for high demand applications. Compute and transaction costs can be reduced to near-zero, and opportunity costs are far more significant.

The intent behind Service Transition, Change Management, and Production Support is laudable. It is possible to re-design such Policy Rules into Policy Guidelines, and implement ITIL principles according to the throughput target of a demand group as well as its service management needs. Those Policy Rules can be replaced with Policy Guidelines, so high demand applications have equivalent lightweight activities while lower demand applications retain the same activities as before.

Converting Operations Policy Rules into Policy Guidelines will be more palatable to Operations stakeholders if a Multi-Demand Architecture is in place, and hard dependencies have previously been re-designed to shrink failure blast radius. A deployment pipeline for high demand applications that offers extensive test automation and stable deployments is also important.

Multi-Demand Service Transition

Service Transition can be replaced by Delivery teams automating a continual assessment of operational readiness, based on ITIL standards and Operations recommendations. Operational readiness checks should include availability, request throughput, request latency, logging dashboards, monitoring dashboards, and alert rules.

There should be a mindset of continual service transition, with small batch sizes and tight production feedback loops used to identify leading signals of inoperability before a live launch. For example, an application might have automated checks for the presence of a Four Golden Signals dashboard, and Service Level Objective alerts based on Request Success Rate.

Multi-Demand Change Management

Change Management can be streamlined by Delivery teams automating change approval creation and auditing. ITIL has Normal and Emergency Changes for irregular changes. It also has Standard Changes for repeatable, low risk changes which can be pre-approved electronically. Standard Changes are entirely compatible with Continuous Delivery.

Regular, low risk changes for a high demand application should move to a Standard Change template. Low risk, repeatable changes would be pre-approved for live traffic as often as necessary. The criteria for Standard Changes should be pre-agreed with Change Management. Entry criteria could be 3 successful Normal Changes, while exit criteria could be 1 failure.
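One way to encode those pre-agreed criteria is a check over an application's change history, with a single failed change resetting eligibility. The thresholds here are the hypothetical ones from the text:

```python
# Sketch of the pre-agreed Standard Change criteria: eligibility requires 3
# consecutive successful changes (entry criteria), and a single failed
# change resets eligibility (exit criteria).

def standard_change_eligible(change_history, entry_threshold=3):
    """change_history is a list of 'success' / 'failure' outcomes, oldest first."""
    streak = 0
    for outcome in change_history:
        streak = streak + 1 if outcome == "success" else 0
    return streak >= entry_threshold
```

An application with 3 consecutive successful Normal Changes becomes eligible; any failure sends it back to the start of the entry criteria.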

Irregular, variable risk changes for high demand applications should move to team-approved Normal Changes. The approver group for low and medium risk changes would be the Delivery team, and high risk changes would have Delivery team leadership as well. Entry criteria could be 3 successful Normal Changes and 100% on operational readiness checks.

A Change Freeze should be minimised for high demand applications. For 1-2 weeks before a peak business event, there could be a period of heightened awareness that allows Standard Changes and low-risk Normal Changes only. There could be a 24 hour Change Freeze for the peak business event itself, that allows Emergency Changes only.

The deployment pipeline should have traceability built in. A change approval should be linked to a versioned deployment, and the underlying code, configuration, infrastructure, and/or schema changes. This should be accompanied by a comprehensive engineering effort from Delivery teams for ever-smaller changesets, so changes can remain low risk as throughput increases. This should include Expand-Contract, Decouple Release From Launch, and Canary Deployments for zero downtime deployments.

Multi-Demand Production Support

Tiered Production Support can be replaced by Delivery teams adopting You Build It, You Run It. A Level 1 service desk should remain for any applications with direct customer contact. Level 2 and Level 3 support should be performed by Delivery team engineers on-call 24/7/365 for the applications they build. 

Logging dashboards, monitoring dashboards, and alert rules should be maintained by engineers, and alert notifications should be directed to the team. In working hours, a failure should be prioritised over feature development, and be investigated by multiple team members. Outside working hours, a failure should be handled by the on-call engineer. Teams should do their own incident management and post-incident reviews.

You Build It, You Run It maximises incentives for Delivery teams to build operability into their applications from the outset of development. Operational accountability should reside with the product owner. They should have to prioritise operational features against user features, from a single product backlog. There should be an emphasis on reliable live traffic over feature development, cross-functional collaboration within and between teams, and a cross-pollination of skills. 

Example – MediaTech

At MediaTech, a prolonged investment is made in Operations activities for Website Services. The Service Transition and Tiered Production Support teams are repurposed to concentrate solely on lower demand, on-premise applications. Website Services teams take on continual service transition and You Build It, You Run It themselves. This provokes a paradigm shift in how operability is handled at MediaTech, as Website Services teams start to implement their own telemetry and share their learnings when failures occur.

Change Management agree with the Website Services teams that any application with a deployment pipeline and automated rollback can move to Standard Change after 3 successful Normal Changes. In addition, agreement is reached on experimental, team-approved Normal Changes. Applications that meet the Standard Change entry criteria and have passed all operational checks no longer require CAB approval for irregular changes.

The elimination of handoffs and rework between Website Services and Operations teams means videogames-ui and videogames-data deployment frequency can be increased to every 5 days. The applications are finally in a state of Continuous Delivery, and the next round of improvements can begin elsewhere in the MediaTech estate.

This is part 5 of the Strategising for Continuous Delivery series

  1. Strategising for Continuous Delivery
  2. The Bimodal Delusion
  3. Multi-Demand IT
  4. Multi-Demand Architecture
  5. Multi-Demand Operations

Acknowledgements

Thanks to Thierry de Pauw for reviewing this series.

Multi-Demand Architecture

How can Multi-Demand Architecture accelerate reliability and delivery flow? Why should Policy Rules be based on Continuous Delivery predictors? What is the importance of a loosely-coupled architecture? How can architectural Policy Rules benefit Continuous Delivery and reliability?

This is Part 4 of the Strategising for Continuous Delivery series

Increase flow with policies

Policy Rules are not inherently bad. Some policies should be established across all demand groups, to drive Continuous Delivery adoption:

  • Software management should be based on Work In Progress (WIP) limits to reduce batch sizes, on visual displays, and on production feedback
  • Development should involve comprehensive version control, a loosely-coupled architecture, Trunk Based Development, and Continuous Integration
  • Testing should include developer-driven automated tests, tester-driven exploratory testing, and self-service test data

These practices have been validated in Accelerate as statistically significant predictors of Continuous Delivery. A loosely-coupled architecture is the most important, with Dr. Forsgren et al stating “high performance is possible with all kinds of systems, provided that systems – and the teams that build and maintain them – are loosely coupled”.

Design rules for loose coupling

Team and application architectures aligned with Conway’s Law enable applications to be deployed and tested independently, even as the number of teams and applications in an organisation increases. An application should represent a Bounded Context, and be an independently deployable unit.

The reliability level of an application cannot exceed the lowest reliability level of its hard dependencies. In particular, the reliability of an application in a lower demand group may be limited by an on-premise runtime environment. Therefore, a Policy Rule should be introduced to reduce coupling between applications, particularly those in different demand groups.
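The cap follows from basic probability: with hard (serial) dependencies and independent failures, availabilities multiply, so the composite figure always sits at or below the weakest link. A quick illustration with made-up figures:

```python
def serial_availability(availabilities):
    """Composite availability of an application and its hard dependencies,
    assuming independent failures and serial (hard) coupling."""
    composite = 1.0
    for a in availabilities:
        composite *= a
    return composite

# A 99.9% cloud service with a hard dependency on a 99.0% on-premise backend:
composite = serial_availability([0.999, 0.99])
print(round(composite, 5))  # 0.98901 -- below even the weakest dependency
assert composite <= min(0.999, 0.99)
```

This is why an on-premise runtime environment can silently cap the reliability of every application that hard-depends on it.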

Data should be stored in the same demand group as its consumers, with an asynchronous push if it continues to be mastered in a lower demand group. Interactions between applications should be protected with stability patterns such as Circuit Breakers and Bulkheads. This will allow teams to shift from Optimising For Robustness to Optimising For Resilience, and achieve new levels of reliability.

Example – MediaTech

At MediaTech, there is a commitment to re-architecting video game dataflows. An asynchronous data push is built from videogames-data to a new videogames-details service, which transforms the data format and stores it in a cloud-based database. When this is used by videogames-ui, a reliability level of 99.9% is achieved. Reducing requests into the MediaTech data centre also improves videogames-ui latency and videogames-data responsiveness.

Unlock testing guidelines

Reducing coupling between applications in different demand groups also allows for context-free Policy Rules to be replaced with context-rich Policy Guidelines. Re-designing a policy previously inherited from a lower demand group can eliminate constraints in a high demand group, and result in dramatic improvements in delivery flow. A Policy Rule that all applications must do End-To-End Testing can be replaced with a Policy Guideline that high demand applications do Contract Testing, while lower demand applications continue to do End-To-End Testing. Such a Policy Guideline could be revisited later on for lower demand applications unable to meet their own throughput target.
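The difference matters because a contract test checks only the fields a consumer actually relies on, rather than exercising both systems end-to-end. A minimal, consumer-side sketch, with all field names invented for illustration:

```python
# A consumer-driven contract: the fields and types a frontend relies on
# in payloads from its data provider. Field names are illustrative.
CONTRACT = {"id": str, "title": str, "price_pence": int}

def satisfies_contract(payload, contract=CONTRACT):
    """True if the payload carries every field the consumer depends on,
    with the expected type. Extra provider fields are ignored."""
    return all(
        key in payload and isinstance(payload[key], expected)
        for key, expected in contract.items()
    )

# Run against a captured or stubbed provider response in the provider's
# own pipeline, with no deployed consumer or environment needed:
sample = {"id": "vg-101", "title": "Spacewar", "price_pence": 4999, "extra": 1}
assert satisfies_contract(sample)
assert not satisfies_contract({"id": "vg-101"})
```

Tools such as Pact formalise this pattern, but even a check this small removes the cross-team scheduling that End-To-End Testing requires.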

At MediaTech, the End-To-End Testing between videogames-ui and videogames-data is stopped. Website Services teams take on more testing responsibilities, with Contract Testing used for the videogames-data asynchronous data push. Eliminating testing handoffs increases videogames-ui deployment frequency to every 10 days, but every 7 days remains unattainable due to operational handoffs.


Multi-Demand IT

What is Multi-Demand IT? How does it provide the means to drive a Continuous Delivery programme with incremental investments, according to product demand?

This is Part 3 of the Strategising for Continuous Delivery series

Introduction

Multi-Demand IT is a transformation strategy that recommends investing in groups of technology value streams, according to their product demand. While Bimodal IT recommends upfront, capital investments based on an architectural division of applications, Multi-Demand favours gradual investments in Continuous Delivery across an IT estate based on product Cost of Delay.

A technology value stream is a sequence of activities that converts product ideas into value-adding changes. A demand group is a set of applications in one or more technology value streams, with a shared throughput target that must be met for Continuous Delivery to be achieved. There may also be individual reliability targets for applications within a group, based on their criticality levels.

Uncover demand groups

An IT department should have at least three demand groups representing high, medium, and low throughput targets. This links to Dr. Nicole Forsgren’s research in The Role of Continuous Delivery in IT and Organizational Performance, and Simon Wardley’s Pioneers, Settlers, and Town Planners model in The Only Structure You’ll Ever Need. Additional demand groups representing very high and very low throughput targets may emerge over time. Talented, motivated people are needed to implement Continuous Delivery within the unique context of each demand group.

Multi-Demand creates a Continuous Delivery investment language. Demand groups make it easier to prioritise which applications are in a state of Discontinuous Delivery, and need urgent improvement. The aim is to incrementally invest until Continuous Delivery is achieved for all applications in a demand group.

Applications will rarely move between demand groups. If market disruption or upstream dependents cause a surge in product demand, a rip and replace migration will likely be required, as a higher demand group will have its own practices, processes, and tools. When product demand has been filled for an application, its deployment target is adjusted for a long tail of low investment. The new deployment target will retain the same lead time as before, with a longer interval between deployments. This ensures the application remains launchable on demand.

A high or medium demand group should contain a single technology value stream. This means all applications with similar demand undergo the same activities and tasks. This reduces cognitive load for teams, and ensures all applications will benefit from a single flow efficiency gain. A low demand group is more likely to have multiple technology value streams, especially if some of its applications are part of a legacy estate.

Example – MediaTech

Assume MediaTech adopts Multi-Demand for its IT transformation. There is a concerted effort to assess technology value streams, and forecast product demand. As a result, demand groups are created across the estate: videogames-ui is in the sole Website Services technology value stream, while videogames-data is in one of the Heritage Applications technology value streams.

Create Multi-Demand policies

A demand group will have a policy set which determines its practices, processes, and tools. Inspired by Cynefin, a policy can be a:

  • Policy Fix: single group, such as heightened permissions for teams in a specific group
  • Policy Rule: multi-group single implementation, such as mandatory use of a central incident management system for all groups
  • Policy Guideline: multi-group multi-implementation, such as mandatory test automation with different techniques in each group

A policy will shape one or more activities and tasks within a technology value stream. Each demand group should have a minimal set of policies, as Little’s Law dictates that the higher the throughput target, the fewer activities and tasks a value stream can sustain. Furthermore, applying the Theory Of Constraints to Continuous Delivery shows throughput in a technology value stream will likely be constrained by the impact of a single policy on a single activity.

At MediaTech, the Multi-Demand lens shows videogames-data is in a state of Continuous Delivery while videogames-ui is in Discontinuous Delivery. This is due to the inheritance of End-To-End Testing, CAB meetings, and central production support policies from Heritage Applications, which has lower product demand and a very different context.

Policy Rules should be treated with caution, as they ignore the context and throughput target of a particular demand group. A Policy Rule can easily incur handoffs and rework that constrain throughput in a high demand group, even if it has existed for lower demand groups for years. This can be resolved by turning a Policy Rule into a Policy Guideline, and re-designing an activity per demand group. For example, End-To-End Testing might be in widespread use for all medium and low demand applications. It will likely need to be replaced with Contract Testing or similar for high demand applications.


The Bimodal Delusion

Why is Bimodal IT so fundamentally flawed? Why is it just a rehash of brownfield versus greenfield IT? What is the delusion that underpins it?

This is Part 2 of the Strategising for Continuous Delivery series

Introduction

Bimodal IT is a notoriously bad method of IT transformation. In 2014, Simon Mingay and Mary Mesaglio of Gartner recommended in How to Be Digitally Agile Without Making a Mess that organisations split their IT departments in two. The authors proposed a Mode 1 for predictability and stability of traditional backend applications, and a Mode 2 for exploration and speed of digital frontend services. They argued this would allow an IT department to protect high risk, low change systems of record, while experimenting with low risk, high change systems of engagement.

Example – MediaTech

For example, a fictional MediaTech organisation has an on-premise application estate with separate development, testing, and operations teams. Product stakeholders demand an improvement from monthly to weekly deployments and from 99.0% to 99.9% reliability, so a commitment is made to Bimodal. Existing teams continue to work in the Mode 1 on-premise estate, while new teams of developers and testers start on Mode 2 cloud-based microservices.

This includes a Mode 2 videogames-ui team, who work on a new frontend that synchronously pulls data from a Mode 1 videogames-data backend application.

Money for old rope

Bimodal is a transformation strategy framed around technology-centric choices that recommends capital investment in systems of engagement only. It is understandable why these choices might appeal to IT executives responsible for large, mixed estates of applications. Saying Continuous Delivery is only for digital frontend services can be a rich source of confirmation bias for people accustomed to modernisation failures.

However, the truth is Bimodal is just money for old rope. The Bimodal division between Mode 1 and Mode 2 is the same brownfield versus greenfield dichotomy that has existed since the Dawn Of Computer Time. Bimodal has the exact same problems:

  • Mode 1 teams will find it hard to recruit and retain talented people
  • Mode 1 teams will trap the domain knowledge needed by Mode 2 teams
  • Mode 2 teams will depend on Mode 1 teams
  • Mode 2 services will depend on Mode 1 applications

The dependency problems are critical. Bimodal architecture is predicated on frontend services distinct from backend applications, yet the former will inevitably be coupled to the latter. A Mode 2 service will have a faster development speed than its Mode 1 dependencies, but its deployment throughput will be constrained by inherited Mode 1 practices such as End-To-End Testing and heavyweight change management. Furthermore, the reliability of a Mode 2 service can be no better than that of its least reliable Mode 1 dependency.

At MediaTech, the videogames-ui team are beset by problems:

  • Any business logic change in videogames-ui requires End-To-End Testing with videogames-data
  • Any failure in videogames-data prevents customer purchases in videogames-ui
  • Mode 1 change management practices still apply, including CABs and change freezes
  • Mode 1 operational practices still apply, such as a separate operations team and detailed handover plans pre-release

As a result, the videogames-ui team are only able to achieve fortnightly deployments and 99.0% reliability, much to the dissatisfaction of their product manager.

The delusion

This is the Bimodal delusion – that stability and speed are a zero-sum game. As Jez Humble explains in The Flaw at the Heart of Bimodal IT, “Gartner’s model rests on a false assumption that is still pervasive in our industry: that we must trade off responsiveness against reliability”. Peer-reviewed academic research by Dr. Nicole Forsgren et al such as The Role of Continuous Delivery in IT and Organizational Performance has proven this to be categorically false. Increasing deployment frequency does not need to have a negative impact on costs, quality, or reliability.


Strategising for Continuous Delivery

What strategy should an IT department adopt to incrementally and iteratively transform itself? How should different methods of software development, testing, and operations be managed at the same time?

TL;DR:

  • An organisation mired in IT As A Cost Centre needs to understand where and how to invest in Continuous Delivery
  • Bimodal IT by Gartner is merely a rehash of the brownfield vs. greenfield IT dichotomy, with all the same problems
  • Multi-Demand IT is a sophisticated approach to investing in Continuous Delivery according to product demand and capabilities
  • Multi-Demand IT explains how to overhaul traditional architecture, service transition, change management, and production support

Introduction

Organisations that wish to remain competitive in the years to come must explore new offerings, expand successful differentiators, and exploit established products. A 21st century, digital-first organisational model of IT As A Business Differentiator focussed on cloud computing, smart mobile devices, and big data analytics is required.

This is tremendously difficult when an organisation has the 20th century, pre-Internet IT As A Cost Centre organisational model. This refers to disparate Product and IT departments, in which IT is a cost centre with fixed scope, fixed resource, and fixed deadline projects. Segregated development, testing, and operations teams mired in long-term Discontinuous Delivery are ill-equipped to rapidly build and run products.

Strategy is choice

An organisation-wide transformation from IT As A Cost Centre to IT As A Business Differentiator is an arduous, multi-year journey. It is hard to know where to invest in Continuous Delivery when an organisation has a large, mixed estate of well-established on-premise applications and emergent cloud applications. Such applications can act as significant revenue generators, regardless of deployment frequency and runtime environment. This means visionary leadership and a sense of urgency are required from IT executives, as well as a Continuous Delivery strategy.

A strategy is not a vision, a plan, a list of best practices, or an affirmation of the status quo. A strategy is a unified set of choices including a desire to succeed, a declaration on where to succeed, and a statement of how to succeed. It is a commitment to hard choices, amongst options with asymmetric value.

An effective Continuous Delivery strategy can be used to establish a culture of Continuous Improvement, powered by the Improvement Kata. It allows a programme of radical, far-reaching changes to be built around the choices made. This helps people to understand, and be motivated by changes to how teams work, how they interact, and the processes and tools they use.

On that basis, how can an IT department devise a Continuous Delivery strategy? How can it incrementally and iteratively transform itself?

This is part 1 of the Strategising for Continuous Delivery series


Rebuild The Thing Wrong

What happens when the launch of a software product rebuild is delayed due to feature parity? How can an organisation rebuild a software product the right way, and minimise opportunity costs as well as technology risks?

Introduction

To adapt to rapidly changing markets, organisations must enhance their established products as well as explore new product offerings. When the profitability of an existing product is limited by Discontinuous Delivery, a rewrite should be considered a last resort. However, it might become necessary if a meaningful period of code, configuration, and infrastructure rework cannot achieve the requisite deployment throughput and production reliability.

Rebuilding a software product from scratch is usually a costly and risky one-off project, with little attention paid to customer needs. Replicating all the features in an existing product will take weeks or months, and there may be a decision to wait for feature parity before customer launch. It can be a while before Continuous Delivery becomes a reality.

The production cutover will be a big bang migration, with a high risk of catastrophic failure. The probability and cost of cutover failure can be reduced by Decoupling Deploy From Launch, and regularly deploying versions of the new product prior to launch.
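Decoupling Deploy From Launch can be as little as a launch flag: new versions are deployed to production routinely, while customer traffic only reaches them once the flag is flipped. A hedged sketch, where the flag store and the handler names are illustrative:

```python
# In-memory stand-in for a real flag store such as a config service.
FLAGS = {"new_product_launched": False}

def handle_request(request, legacy_product, new_product):
    """Deployments of new_product are routine and low risk; launch itself
    is just flipping the flag, and rollback is flipping it back."""
    if FLAGS["new_product_launched"]:
        return new_product(request)
    return legacy_product(request)

legacy = lambda req: "handled by legacy: " + req
new = lambda req: "handled by new product: " + req

assert handle_request("/checkout", legacy, new).startswith("handled by legacy")
FLAGS["new_product_launched"] = True   # launch, with no new deployment
assert handle_request("/checkout", legacy, new).startswith("handled by new")
```

Because the cutover becomes a reversible configuration change rather than a deployment, the probability and cost of failure both drop sharply.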

The Gamez rebuild

For example, a fictional Gamez retail organisation has an ecommerce website. The website is powered by a monolithic application known as eCom, which is hosted in an on-premise data centre. Recently viewed items, gift lists, search, and customer reviews are features that are often praised by customers, and believed to generate the majority of revenue. eCom deployments are quarterly, production reliability is poor, and data centre costs are too high.

There is a strong business need to accelerate feature delivery, and a rewrite of the Gamez website is agreed. The chosen strategy is a rip and replace, in which the eCom monolith will be rebuilt as a set of cloud-native services, hosted with a public cloud provider. Each service is intended to be independently testable, deployable, and scalable, so the website can keep pace with market conditions.

There is an agreement to delay customer launch until feature parity is achieved, and all eCom business logic is replicated in the services. Months later, the services are deemed feature complete, the production cutover is accomplished, and the eCom application is turned off.

The feature parity fallacy

Feature parity is a fallacy. Rebuilding a software product with the same feature set is predicated on the absurd assumption that all existing features generate sufficient revenue to justify their continuation. In all likelihood, an unknown minority of existing features will account for a majority of revenue, and the majority of existing features can be deferred or even discarded.

The main motivation for rebuilding a software product is to accelerate delivery, validate product hypotheses, and increase revenues as a result. Waiting weeks or months for feature parity during a rebuild prolongs Discontinuous Delivery, and exacerbates all the related opportunity costs that originally prompted the rewrite.

Back at Gamez, some product research conducted prior to the rebuild is unearthed, and it contains some troubling insights. Search generates the vast majority of website revenue. Furthermore, product videos is identified as a missing feature that would quickly surpass all other revenue sources. This means a more profitable course of action would have been to launch a new product videos feature, replace the eCom search feature, and validate learnings with customers. An opportunity to substantially increase profitability was missed.

Strangler Fig

Rebuilding a software product right means incrementally launching features throughout the rebuild. This can be accomplished with the Strangler Fig architectural pattern, which can be thought of as applying the Expand and Contract pattern at the application level.

Strangler Fig means gradually creating a new software product around the boundaries of an existing product, and rebuilding one feature at a time until the existing product is no more. It is usually implemented by deploying an event interceptor in front of the existing product. New and existing features are added to the new software product as necessary, and user requests are routed between products. This approach means customers can benefit from new features much sooner, and many more learning opportunities are created.

Using the Strangler Fig pattern would have allowed Gamez to generate product videos revenue from the outset. At first, a customer router microservice would have been implemented in front of the eCom application. It would have sent all customer requests to eCom bar product videos, which would have gone to the first microservice. Later on, search and customer review services would have replaced the same eCom functionality.
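The routing at the heart of this approach can be sketched as a table of paths claimed by new services, growing one feature at a time; the paths and service names below are illustrative, not from the Gamez example:

```python
# Routes claimed by new microservices so far; everything else still
# falls through to the legacy monolith. Paths are illustrative.
STRANGLED_ROUTES = {
    "/product-videos": "product-videos-service",
}

def route(path):
    """Send a customer request to the owning service, defaulting to legacy."""
    for prefix, service in STRANGLED_ROUTES.items():
        if path.startswith(prefix):
            return service
    return "ecom-monolith"

assert route("/product-videos/123") == "product-videos-service"
assert route("/search?q=mario") == "ecom-monolith"

# Later, search is strangled too, with no change to the router itself:
STRANGLED_ROUTES["/search"] = "search-service"
assert route("/search?q=mario") == "search-service"
```

The router is the only piece that must exist up front; every subsequent feature migration is a routing table entry plus a new service.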

Case studies

  1. The Road To Continuous Delivery by Michiel Rook details how Strangler Fig helped a Dutch recruitment company replace its ecommerce website
  2. Strangulation Pattern Of Choice by Joshua Gough describes how Strangler Fig allowed a British auction site to replace its ecommerce website
  3. Legacy Application Strangulation Case Studies by Paul Hammant lists a series of product rewrites made possible by Strangler Fig
  4. How to Breakthrough the Old Monolith by Kyle Galbraith visualises how to slice a monolith up into microservices, using Strangler Fig
  5. Estimation Is Evil by Ron Jeffries mentions how incremental launches would have benefitted the eXtreme Programming (XP) birth project Chrysler C3, rather than a yearlong pursuit of feature parity

Build The Right Thing and Build The Thing Right

Should an organisation in peril start its journey towards IT enabled growth by investing in IT delivery first, or product development? Should it Build The Thing Right with Continuous Delivery first, or Build The Right Thing with Lean Product Development?

Introduction

The software revolution has caused a profound economic and technological shift in society. To remain competitive, organisations must rapidly explore new product offerings as well as exploit established products. New ideas must be continuously validated and refined with customers, if product/market fit and repeatable sales are to be found.

That is extremely difficult when an organisation has the 20th century, pre-Internet IT As A Cost Centre organisational model. There will be a functionally segregated IT department, that is accounted for as a cost centre that can only incur costs. There will be an annual plan of projects, each with its scope, resources, and deadlines fixed in advance. IT delivery teams will be stuck in long-term Discontinuous Delivery, and IT executives will only be incentivised to meet deadlines and reduce costs.

If product experimentation is not possible and product/market fit for established products declines, overall profitability will suffer. What should be the first move of an organisation in such perilous conditions? Should it invest first in Lean Product Development to Build The Right Thing, and uncover new product offerings as quickly as possible? Or should it invest in Continuous Delivery to Build The Thing Right first, and create a powerful growth engine for locating future product/market fit?

Build The Thing Right first

In 2007, David Shpilberg et al published a survey of ~500 executives in Avoiding the Alignment Trap in IT. 74% of respondents reported an under-performing, undervalued IT department, unaligned with business objectives. 15% of respondents had an effective IT department, with variable business alignment, below average IT spending, and above average sales growth. Finally, 11% were in a so-called Alignment Trap, with negative sales growth despite above average IT spending and tight business alignment.


Avoiding the Alignment Trap in IT (Shpilberg et al) – source praqma.com

Shpilberg et al report “general ineffectiveness at bringing projects in on time and on the dollar, and ineffectiveness with the added complication of alignment to an important business objective”. The authors argue organisations that prematurely align IT with business objectives will fall into an Alignment Trap. They conclude organisations should build a highly effective IT department first, and then align IT projects to business objectives. In other words, Build The Thing Right before trying to Build The Right Thing.

The conclusions of Avoiding the Alignment Trap in IT are naive, because they ignore the implications of IT As A Cost Centre. An organisation with ineffective IT will undoubtedly suffer if it increases business alignment and investment in IT as is. However, that is a consequence of functionally siloed product and IT departments, and the antiquated project delivery model tied to IT As A Cost Centre. Projects will run as a Large Batch Death Spiral for months without customer feedback, and invariably result in a dreadful waste of time and money. When Shpilberg et al define success as “delivered projects with promised functionality, timing, and cost”, they are measuring manipulable project outputs, rather than customer outcomes linked to profitability.

Build The Right Thing first

It is hard to Build The Right Thing without first learning to Build The Thing Right, but it is possible. If flow is improved through co-dependent product and IT teams, the negative consequences of IT As A Cost Centre can be reduced. New revenue streams can be unlocked and profitability increased, before a full Continuous Delivery programme can be completed.

An organisation will have a number of value streams to convert ideas into product offerings. Each value stream will have a Fuzzy Front End of product and development activities, and a technology value stream of build, testing, and operational activities. The time from ideation to customer is known as cycle time.

Flow in a value stream can be improved by:

  • understanding the flow of ideas
  • reducing batch sizes
  • quantifying value and urgency

The flow of ideas can be understood by conducting a Value Stream Mapping with senior product and IT stakeholders. Visualising the activities and teams required for ideas to reach customers will identify an approximate cycle time, and the sources of waste that delay feedback. A Value Stream Mapping will usually uncover a shocking amount of rework and queue times, with the majority in the Fuzzy Front End.

For example, a Value Stream Mapping might reveal a 10 month cycle time, with 8 months spent on ideas bundled up as projects in the Fuzzy Front End, and 2 months spent on IT in the technology value stream. Starting out with Build The Thing Right would only tackle 20% of cycle time.

Reducing batch sizes means unbundling projects into separate features, reducing the size of features, and using Work In Process (WIP) Limits across product and IT teams. Little’s Law guarantees distilling projects into small, per-feature deliverables and restricting in-flight work will result in shorter cycle times. In Lean Enterprise, Jez Humble, Joanne Molesky, and Barry O’Reilly describe reducing batch sizes as “the most important factor in systemically increasing flow and reducing variability”.
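Little's Law makes this concrete: average cycle time equals average Work In Process divided by average throughput, so at a given throughput, halving WIP halves cycle time. For example:

```python
def avg_cycle_time(wip, throughput_per_week):
    """Little's Law, L = lambda * W, rearranged as W = L / lambda."""
    return wip / throughput_per_week

# 30 features in flight, 5 finished per week -> 6 weeks average cycle time
assert avg_cycle_time(30, 5) == 6.0
# Halving WIP at the same throughput halves the average cycle time
assert avg_cycle_time(15, 5) == 3.0
```

The law holds for long-run averages of any stable system, which is why WIP Limits are such a dependable lever on cycle time.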

Quantifying value and urgency means working with product stakeholders to estimate the Cost Of Delay of each feature. Cost Of Delay is the economic benefit a feature could generate over time, if it was available immediately. Considering how time will affect how a feature might increase revenue, protect revenue, reduce costs, and/or avoid costs is extremely powerful. It encourages product teams to reduce cycle time by shrinking batch sizes and eliminating Fuzzy Front End activities. It uncovers shared assumptions and enables better trade-off decisions. It creates a shared sense of urgency for product and IT teams to quickly deliver high value features. As Don Reinertsen says in The Principles of Product Development Flow, “if you only measure one thing, measure the Cost Of Delay”.

For example, a manual customer registration task generates £100Kpa of revenue, and is performed by one £50Kpa employee. The economic benefit of automating that task could be calculated as £100Kpa of revenue protection and £50Kpa of cost reduction, so approximately £2.9Kpw is lost while the feature does not exist. If there is a feature with a Cost Of Delay greater than £2.9Kpw, the manual task should remain for now.
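The weekly figure in an example like this is simply the annualised economic benefit divided by 52:

```python
def weekly_cost_of_delay(annual_revenue_protected, annual_cost_reduced):
    """Economic benefit foregone per week while a feature does not exist."""
    return (annual_revenue_protected + annual_cost_reduced) / 52

# 100K of revenue protection plus 50K of cost reduction per annum:
cod = weekly_cost_of_delay(100_000, 50_000)
print(round(cod))  # 2885 -- roughly 2.9K lost per week
assert 2800 < cod < 2900
```

Expressing every candidate feature in the same pounds-per-week unit is what makes the trade-off decisions directly comparable.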

Build The Right Thing case study – Maersk Line

Black Swan Farming – Maersk Line by Joshua Arnold and Özlem Yüce demonstrates how an organisation can successfully Build The Right Thing first, by understanding the flow of ideas, reducing batch sizes, and quantifying value and urgency. In 2010, Maersk Line IT was a £60M cost centre with 20 outsourced development teams. Between 2008 and 2010 the median cycle time in all value streams was 150 days. 62% of ~3000 requirements in progress were stuck in Fuzzy Front End analysis.

Arnold and Yüce were asked to deliver more value, flow, and quality for a global booking system with a median cycle time of 208 days, and quarterly production releases in IT. They mapped the value stream, shrank features down to the smallest unit of deliverable value, and introduced Cost Of Delay into product teams alongside other Lean practices.

After 9 months, improvements in Fuzzy Front End processes resulted in a 48% reduction in median cycle time to 108 days, an 88% reduction in defect count, and increased customer satisfaction. Furthermore, using Cost Of Delay uncovered 25% of requirements were a thousand times more valuable than the alternatives, which led to a per-feature return on investment six times higher than the Maersk Line IT average. By applying the same Lean principles behind Continuous Delivery to product development prior to additional IT investment, Arnold and Yüce achieved spectacular results.

Build The Right Thing and Build The Thing Right

If an organisation in peril tries to Build The Right Thing first, it risks searching for product/market fit without the benefits of fast customer feedback. If it tries to Build The Thing Right first, it risks spending time and money on Continuous Delivery without any tangible business benefits.

An organisation should instead aim to Build The Right Thing and Build The Thing Right from the outset. A co-evolution of product development and IT delivery capabilities is necessary for an organisation to achieve the profitability needed to thrive in a competitive market.

This approach is validated by Dr. Nicole Forsgren et al in Accelerate. Whereas Avoiding The Alignment Trap In IT was a one-off assessment of business alignment in IT As A Cost Centre, Accelerate is a multi-year, scientifically rigorous study of thousands of organisations worldwide. Interestingly, Lean product development is modelled as understanding the flow of work, reducing batch sizes, incorporating customer feedback, and team empowerment. The data shows:

  • Continuous Delivery and Lean product development both predict superior software delivery performance and organisational culture
  • Software delivery performance and organisational culture both predict superior organisational performance in terms of profitability, productivity, and market share
  • Software delivery performance predicts Lean product development

On the reciprocal relationship between software delivery performance and Lean product development, Dr. Nicole Forsgren et al conclude “the virtuous cycle of increased delivery performance and Lean product management practices drives better outcomes for your organisation”.

Exapting product development and technology

An organisational ambition to Build The Right Thing and Build The Thing Right needs to start with the executive team. Executives need to recognise an inability to create new offerings or protect established products equates to mortal peril. They need to share a vision of success with the organisation that articulates the current crisis, describes a state of future prosperity, and injects urgency into day-to-day work.

The executive team should introduce the Improvement Kata into all levels of the organisation. The Improvement Kata encourages problem solving via experimentation, to proceed towards a goal in iterative, incremental steps amid ambiguities, uncertainties, and difficulties. It is the most effective way to manage a gradual co-emergence of Lean Product Development and Continuous Delivery.

Experimentation with organisational change should include a transition from IT As A Cost Centre to IT As A Business Differentiator. This means technology staff moving from the IT department to work in long-lived, outcome-oriented teams in one or more product departments, which are accounted for as profit centres and responsible for their own investment decisions. One way to do this is to create a Digital department of co-located product and technology staff, with shared incentives to create new product offerings. Handoffs and activities in value streams will be dramatically reduced, resulting in much faster cycle times and tighter customer feedback loops.

Instead of an annual budget and a set of fixed scope projects, there needs to be a rolling budget that funds a rolling plan linked to desired outcomes and strategic business objectives. The scope, resources, and deadlines of the rolling plan should be constantly refined by validated learnings from customers, as delivery teams run experiments to find problem/solution fit and product/market fit for a particular business capability.

Those delivery teams should be cross-functional, with all the necessary personnel and skills to apply Design Thinking and Lean principles to problem solving. This should include understanding the flow of ideas, reducing batch sizes, and quantifying value and urgency. As Lean Product Development and Continuous Delivery capabilities gradually emerge, it will become much easier to innovate with new product offerings and enhance established products.

It might take months or years of investment, experimentation, and disruption before an organisation has adopted Lean Product Development and Continuous Delivery across all its value streams. It is important to protect delivery expectations and staff welfare by making changes one value stream at a time, looking for existing products or new product offerings stuck in Discontinuous Delivery.

Acknowledgements

Thanks to Emily Bache, Özlem Yüce, and Thierry de Pauw for reviewing this article.

Further Reading

  1. Lean Software Development by Mary and Tom Poppendieck
  2. Designing Delivery by Jeff Sussna
  3. The Essential Drucker by Peter Drucker
  4. Measuring Continuous Delivery by Steve Smith
  5. The Cost Centre Trap by Mary Poppendieck
  6. Making Work Visible by Dominica DeGrandis

© 2020 Steve Smith
