Month: November 2012

The Strangler Pipeline – Autonomation

9 November, 2012 / Steve Smith

The Strangler Pipeline is grounded in autonomation

The introduction of Continuous Delivery to an organisation is an exciting opportunity for Development and Operations to Automate Almost Everything into a Repeatable Reliable Process, and at Sky Network Services we aspired to emulate organisations such as LMAX, Springer, and 7Digital by building a fully automated Continuous Delivery pipeline to manage our Landline Fulfilment and Network Management platforms. We began by identifying our Development and Operations stakeholders, and establishing a business-facing programme to automate our value stream. We emphasised to our stakeholders that automation was only a step towards our end goal of improving upon our cycle time of 26 days, and that the Theory Of Constraints warns that automating the wrong constraint will have little or no impact upon cycle time.

Our determination to value cycle time optimisation above automation in the Strangler Pipeline was soon justified by the influx of new business projects. The unprecedented growth in our application estate led to a new goal of retaining our existing cycle time while integrating our greenfield application platforms, and as our core business domain is telecommunications, not Continuous Delivery, we concluded that fully automating our pipeline would not be cost-effective. By following Jez Humble and Dave Farley’s advice to “optimise globally, not locally”, we focussed pipeline stakeholder meetings upon value stream constraints and successfully moved to an autonomation model aimed at stakeholder-driven optimisations.

Described by Taiichi Ohno as one of “the two pillars of the Toyota Production System”, autonomation is defined as automation with a human touch. It refers to the combination of human intelligence and automation where full automation is considered uneconomical. While the most prominent example of autonomation is problem detection at Toyota, we have applied autonomation within the Strangler Pipeline as follows:

Commit stage. While automating the creation of an aggregate artifact when a constituent application artifact is committed would reduce the processing time of platform creation, it would have zero impact upon cycle time and would replace Operations responsibility for release versioning with arbitrary build numbers. Instead the Development teams are empowered to track application compatibilities and create aggregate binaries via a user interface, with application versions selectable in picklists and aggregate version numbers auto-completed in order to reduce errors.
Failure detection and resolution. Although creating an automated rollback or self-healing releases would harden the Strangler Pipeline, we agreed that such a solution was not a constraint upon cycle time and would be costly to implement. When a pipeline failure occurs it is recorded in the metadata of the application artifact, and we Stop The Line to prevent further use until a human has logged onto the relevant server(s) to diagnose and correct the problem.
Pipeline updates. Although the high frequency of Strangler Pipeline updates implies value in further automation of its own Production release process, a single pipeline update cannot improve cycle time and we wish to retain scheduling flexibility. As pipeline updates increase the probability of release failure, it would be unwise to release a new pipeline version immediately prior to a Production platform release. Instead a Production request is submitted for each signed-off pipeline artifact, and while the majority are immediately released, the Operations team reserve the right to delay if their calendar warns of a pending Production platform release.
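
The failure detection and resolution approach above can be illustrated with a minimal sketch. It assumes a hypothetical Artifact class with a free-form metadata dictionary and hypothetical run_stage and resolve functions; none of these names come from the Strangler Pipeline itself. The point is only that a failure is recorded against the artifact and blocks further use until a human intervenes, with no automated rollback.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Hypothetical application artifact carrying pipeline metadata."""
    name: str
    version: str
    metadata: dict = field(default_factory=dict)


class LineStopped(Exception):
    """Raised when a stopped artifact is submitted to a pipeline stage."""


def run_stage(artifact, stage_name, operation):
    """Run one pipeline operation, stopping the line if it fails."""
    if artifact.metadata.get("stopped"):
        raise LineStopped(
            f"{artifact.name} {artifact.version} is stopped: "
            f"{artifact.metadata['failure']}"
        )
    try:
        operation(artifact)
    except Exception as error:
        # Record the failure in the artifact metadata and re-raise.
        # There is deliberately no automated rollback or self-healing.
        artifact.metadata["stopped"] = True
        artifact.metadata["failure"] = f"{stage_name}: {error}"
        raise


def resolve(artifact, engineer, note):
    """Called by a human once the problem is diagnosed and corrected."""
    artifact.metadata["stopped"] = False
    artifact.metadata["resolution"] = f"{engineer}: {note}"
```

In this sketch the only way to restart the line is an explicit call to resolve, which mirrors the human touch that autonomation keeps in the loop.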

Autonomation emphasises the role of root cause analysis, and after every major release failure we hold a session to identify the root cause of the problem, the lessons learned, and the necessary counter-measures to permanently solve the problem. At the time of writing our analysis shows that 13% of release failures were caused by pipeline defects, 10% by misconfiguration of TeamCity Deployment Builds, and the majority originated in our siloed organisational structure. This data provides an opportunity to measure our adoption of the principles of Continuous Delivery according to Shuhari:

shu – By scaling our automated release mechanism to manage greenfield and legacy application platforms, we have implemented Repeatable Reliable Process, Automate Almost Everything, and Keep Everything In Version Control
ha – By introducing combinational static analysis tests and a pipeline user interface to reduce our defect rate and TeamCity usability issues, we have matured to Bring The Pain Forward and Build Quality In
ri – Sky Network Services is a Waterscrumfall organisation where Business, Development, and Operations work concurrently on different projects with different priorities, which means we sometimes fall foul of Conway’s Law and compete over constrained resources to the detriment of cycle time. We have yet to achieve Done Means Released, Everybody Is Responsible, and Continuous Improvement

An example of our organisational structure impeding cycle time would be the first release of the new Messaging application 186-13, which resulted in the following value stream audit:

While each pipeline operation was successful in less than 20 seconds, the disparity between Commit start time and Production finish time indicates significant delivery problems. Substantial wait times between environments contributed to a lead time of 63 days, far in excess of our average lead time of 6 days. Our analysis showed that Development started work on Messaging 186-13 before Operations ordered the necessary server hardware, and as a result hardware lead times restricted environment availability at every stage. No individual or team was at fault for this situation: the fault lay in the system, with Development and Operations working upon different business projects at the time with non-aligned goals.
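
The arithmetic behind such an audit can be sketched as follows. The audit function and the stage tuple layout are assumptions made for illustration, not the format of our actual audit data. Lead time is taken as the gap between the first stage starting and the last stage finishing, processing time as the sum of the stage durations, and wait time as the difference between the two.

```python
from datetime import timedelta


def audit(stages):
    """Summarise a value stream.

    stages is a hypothetical list of (name, start, finish) tuples,
    ordered from Commit to Production, with datetime start and finish.
    Returns (lead_time, processing_time, wait_time) as timedeltas.
    """
    first_start = stages[0][1]
    last_finish = stages[-1][2]
    lead_time = last_finish - first_start
    # Time the pipeline actually spent working on the artifact.
    processing_time = sum(
        (finish - start for _, start, finish in stages), timedelta()
    )
    # Everything else is time spent waiting between stages.
    wait_time = lead_time - processing_time
    return lead_time, processing_time, wait_time
```

When every operation completes in seconds but the lead time runs to weeks, almost all of the lead time is wait time, which is why further automation of the operations themselves would not help.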

With the majority of the Sky Network Services application estate now managed by the Strangler Pipeline, it seems timely to reflect upon our goal of retaining our original cycle time of 26 days. Our data suggests that we have been successful, with the cycle time of our Landline Fulfilment and Network Management platforms now 25 days and our greenfield platforms between 18 and 21 days. However, examples such as Messaging 186-13 remind us that cycle time cannot be improved by automation alone, and we must now redouble our efforts to implement Done Means Released, Everybody Is Responsible, and Continuous Improvement. By building the Strangler Pipeline we have followed Donella Meadows’ change management advice to “reduce the probability of destructive behaviours and to encourage the possibility of beneficial ones”, and given all we have achieved, I am confident that we can Continuously Improve together.

My thanks to my colleagues at Sky Network Services.

The Strangler Pipeline – Legacy and greenfield

2 November, 2012 / Steve Smith

The Strangler Pipeline uses the Stage Strangler pattern to manage legacy and greenfield applications

When our Continuous Delivery journey began at Sky Network Services, one of our goals was to introduce a Repeatable, Reliable Process for our Landline Fulfilment and Network Management platforms by creating a pipeline deployer to replace the disparate Ruby and Perl deployers used by Development and Operations. The combination of a consistent release mechanism and our newly-developed Artifact Container would have enabled us to Bring The Pain Forward from failed deployments, improve lead times, and easily integrate future greenfield platforms and applications into the pipeline. However, the simultaneous introduction of multiple business projects meant that events conspired against us.

While pipeline development was focussed upon improving slow platform build times, business deadlines for the Fibre Broadband project left our Fibre, Numbering, and Providers technical teams with greenfield Landline Fulfilment applications that were compatible with our Artifact Container but incompatible with the legacy Perl deployer. Out of necessity those teams dutifully followed Conway’s Law and created deployment buttons in TeamCity housing application-specific deployers as follows:

Fibre: A loathed Ant deployer
Numbering: A loved Ant deployer
Providers: A loved Maven/Java deployer

Over a period of months, it became apparent that this approach was far from ideal for Operations. Each Landline Fulfilment platform release became a slower, more arduous process as the Perl deployer had to be accompanied by a TeamCity button for each greenfield application. Not only did these extra steps increase processing times, but the use of a Continuous Integration tool ill-suited to release management also introduced symptoms of the Deployment Build antipattern, and errors started to creep into deployments.

While Landline Fulfilment releases operated via this multi-step process, a pipeline deployer was developed for the greenfield application platforms. The Landline Assurance, Wifi Fulfilment, and Wifi Assurance technical teams had no time to spare for release tooling and immediately integrated into the pipeline. The pipeline deployer proved successful and consequently demand grew for the pipeline to manage Landline Fulfilment releases as a single aggregate artifact – although surprisingly Operations requested the pipelining of greenfield applications first, due to the proliferation of per-application, per-environment deployment buttons in TeamCity.

A migration method was therefore required for pipelining the entire Landline Fulfilment platform that would not increase the risk of release failure or incur further development costs, and with those constraints in mind we adapted the Strangler pattern for Continuous Delivery as the Stage Strangler pattern. First coined by Martin Fowler and Michael Feathers, the Strangler pattern describes how to gradually wrap a legacy application in a greenfield application in order to safely replace existing features, add new features, and ultimately replace the entire application. By creating a Stage Interface for the different Landline Fulfilment deployers already in use, we were able to kick off a series of conversations with the Landline Fulfilment technical teams about pipeline integration.

We began the Stage Strangler process with the Fibre application deployer, as the Fibre team were only too happy to discard it. We worked together on the necessary changes, deleting the Fibre deployer and introducing a set of version-toggled pipeline deployment buttons in TeamCity. The change in release mechanism was advertised to stakeholders well in advance, and a smooth cutover built up our credibility within Development and Operations.

While immediate replacement of the Numbering application deployer was proposed due to the Deficient Deployer antipattern causing per-server deployment steps for Operations, the Numbering team successfully argued for its retention as it provided additional application monitoring capabilities. We updated the Numbering deployer to conform to our Stage Interface and eliminate the Deficient Deployer symptoms, and then wrote a Numbering-specific pipeline stage that delegated Numbering deployments to that deployer.

The Providers team had invested a lot of time in their application deployer, a custom Maven/Java deployer with an application-specific sign-off process embedded within the Artifactory binary repository. Despite Maven’s Continuous Delivery incompatibilities, build numbers being polluted by release numbers, and the sign-off process triggering the Artifact Promotion antipattern, the Providers team resolutely wished to retain their deployer due to their sunk costs. This resulted in a long-running debate over the relative merits of the different technical solutions, but the Stage Strangler helped us move the conversation forward by shaping it around pipeline compatibility rather than technical uniformity. We wrote a Providers-specific pipeline stage that delegated Providers deployments to that deployer, and the Providers team removed their sign-off process in favour of a platform-wide sign-off process managed by Operations.
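
The delegation described for the Numbering and Providers deployers can be sketched roughly as below. This is an illustration under assumptions rather than our actual code: the Stage, PipelineDeployerStage, DelegatingStage, and Pipeline names, the deploy signature, and the strangle method are all hypothetical. The idea is that the pipeline only ever talks to the Stage Interface, so an existing deployer can be wrapped today and removed later without the pipeline changing.

```python
from abc import ABC, abstractmethod


class Stage(ABC):
    """Hypothetical Stage Interface: the only contract the pipeline uses."""

    @abstractmethod
    def deploy(self, application, version, environment):
        """Deploy one application version and describe what was done."""


class PipelineDeployerStage(Stage):
    """Stand-in for the universal pipeline deployer."""

    def deploy(self, application, version, environment):
        return f"pipeline: {application} {version} -> {environment}"


class DelegatingStage(Stage):
    """Application-specific stage that delegates to a team's own deployer."""

    def __init__(self, team_deployer):
        # team_deployer is any callable taking the same three arguments.
        self._team_deployer = team_deployer

    def deploy(self, application, version, environment):
        return self._team_deployer(application, version, environment)


class Pipeline:
    """Routes each application to its own stage, else the default stage."""

    def __init__(self, default_stage):
        self._default_stage = default_stage
        self._stages = {}

    def register(self, application, stage):
        self._stages[application] = stage

    def strangle(self, application):
        # Retire the team deployer; the default stage takes over.
        self._stages.pop(application, None)

    def deploy(self, application, version, environment):
        stage = self._stages.get(application, self._default_stage)
        return stage.deploy(application, version, environment)
```

A team that wishes to keep its deployer registers a DelegatingStage around it, and calling strangle later switches that application to the universal deployer without callers of Pipeline.deploy noticing the difference.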

As all greenfield applications have now been successfully integrated into the pipeline and the remaining Landline Fulfilment legacy applications are in the process of being strangled, it would be accurate to say that the Stage Strangler pattern provided us with a minimal cost, minimal risk method of integrating applications and their existing release mechanisms into our Continuous Delivery pipeline. The use of the Strangler pattern has empowered technical teams to make their own decisions on release tooling, and a sign of our success is that development of new pipeline features continues unabated while the Numbering and Providers teams debate the value of strangling their own deployers in favour of a universal pipeline deployer.
