
The Strangler Pipeline – Autonomation

The Strangler Pipeline is grounded in autonomation

Previous entries in the Strangler Pipeline series:

  1. The Strangler Pipeline – Introduction
  2. The Strangler Pipeline – Challenges
  3. The Strangler Pipeline – Scaling Up
  4. The Strangler Pipeline – Legacy and Greenfield

The introduction of Continuous Delivery to an organisation is an exciting opportunity for Development and Operations to Automate Almost Everything into a Repeatable Reliable Process, and at Sky Network Services we aspired to emulate organisations such as LMAX, Springer, and 7Digital by building a fully automated Continuous Delivery pipeline to manage our Landline Fulfilment and Network Management platforms. We began by identifying our Development and Operations stakeholders, and establishing a business-facing programme to automate our value stream. We emphasised to our stakeholders that automation was only a step towards our end goal of improving upon our cycle time of 26 days, and that the Theory Of Constraints warns that automating the wrong constraint will have little or no impact upon cycle time.

Our determination to value cycle time optimisation above automation in the Strangler Pipeline was soon justified by the influx of new business projects. The unprecedented growth in our application estate led to a new goal of retaining our existing cycle time while integrating our greenfield application platforms, and as our core business domain is telecommunications not Continuous Delivery we concluded that fully automating our pipeline would not be cost-effective. By following Jez Humble and Dave Farley’s advice to “optimise globally, not locally”, we focussed pipeline stakeholder meetings upon value stream constraints and successfully moved to an autonomation model aimed at stakeholder-driven optimisations.

Described by Taiichi Ohno as one of “the two pillars of the Toyota Production System”, autonomation is defined as automation with a human touch. It refers to the combination of human intelligence and automation where full automation is considered uneconomical. While the most prominent example of autonomation is problem detection at Toyota, we have applied autonomation within the Strangler Pipeline as follows:

  • Commit stage. While automating the creation of an aggregate artifact when a constituent application artifact is committed would reduce the processing time of platform creation, it would have zero impact upon cycle time and would replace Operations responsibility for release versioning with arbitrary build numbers. Instead the Development teams are empowered to track application compatibilities and create aggregate binaries via a user interface, with application versions selectable in picklists and aggregate version numbers auto-completed in order to reduce errors.
  • Failure detection and resolution. Although automated rollbacks or self-healing releases would harden the Strangler Pipeline, we agreed that such a solution was not a constraint upon cycle time and would be costly to implement. When a pipeline failure occurs it is recorded in the metadata of the application artifact, and we Stop The Line to prevent further use until a human has logged onto the relevant server(s) to diagnose and correct the problem (see the sketch after this list).
  • Pipeline updates. Although the high frequency of Strangler Pipeline updates implies value in further automation of its own Production release process, a single pipeline update cannot improve cycle time and we wish to retain scheduling flexibility – as pipeline updates increase the probability of release failure, it would be unwise to release a new pipeline version immediately prior to a Production platform release. Instead a Production request is submitted for each signed-off pipeline artifact, and while the majority are immediately released the Operations team reserve the right to delay if their calendar warns of a pending Production platform release.
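
As a rough illustration of the Stop The Line behaviour described above, the sketch below shows how a promotion step might consult an artifact’s metadata before proceeding. The class and field names (ArtifactMetadata, failedStage, and so on) are illustrative assumptions rather than the actual pipeline code.

    import java.util.Optional;

    // Minimal sketch: refuse to promote an artifact whose metadata records an
    // unresolved pipeline failure, so a human must investigate before further use.
    public class StopTheLineGuard {

        // Hypothetical metadata attached to each application artifact.
        public record ArtifactMetadata(String application, String version, Optional<String> failedStage) {}

        public void promote(ArtifactMetadata artifact, String targetEnvironment) {
            if (artifact.failedStage().isPresent()) {
                // Stop The Line: the artifact stays where it is until the failure
                // has been diagnosed and the metadata cleared.
                throw new IllegalStateException(
                    artifact.application() + " " + artifact.version()
                    + " failed at stage '" + artifact.failedStage().get()
                    + "' and cannot enter " + targetEnvironment + " until resolved");
            }
            System.out.println("Releasing " + artifact.application() + " "
                + artifact.version() + " into " + targetEnvironment);
        }

        public static void main(String[] args) {
            StopTheLineGuard guard = new StopTheLineGuard();
            guard.promote(new ArtifactMetadata("messaging", "186-13", Optional.empty()), "test1");
        }
    }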

Autonomation emphasises the role of root cause analysis, and after every major release failure we hold a session to identify the root cause of the problem, the lessons learned, and the necessary counter-measures to permanently solve the problem. At the time of writing our analysis shows that 13% of release failures were caused by pipeline defects, 10% by misconfiguration of TeamCity Deployment Builds, and the majority originated in our siloed organisational structure. This data provides an opportunity to measure our adoption of the principles of Continuous Delivery according to Shuhari:

  • shu – By scaling our automated release mechanism to manage greenfield and legacy application platforms, we have implemented Repeatable Reliable Process, Automate Almost Everything, and Keep Everything In Version Control
  • ha – By introducing combinational static analysis tests and a pipeline user interface to reduce our defect rate and TeamCity usability issues, we have matured to Bring The Pain Forward and Build Quality In
  • ri – Sky Network Services is a Waterscrumfall organisation where Business, Development, and Operations work concurrently on different projects with different priorities, which means we sometimes fall foul of Conway’s Law and compete over constrained resources to the detriment of cycle time. We have yet to achieve Done Means Released, Everybody Is Responsible, and Continuous Improvement

An example of our organisational structure impeding cycle time would be the first release of the new Messaging application 186-13, which resulted in the following value stream audit:

Messaging 186-13 value stream

While each pipeline operation was successful in less than 20 seconds, the disparity between Commit start time and Production finish time indicates significant delivery problems. Substantial wait times between environments contributed to a lead time of 63 days, far in excess of our average lead time of 6 days. Our analysis showed that Development started work on Messaging 186-13 before Operations ordered the necessary server hardware, and as a result hardware lead times restricted environment availability at every stage. No individual or team was at fault for this situation – the fault lay in the system, with Development and Operations working upon different business projects with non-aligned goals at the time.

With the majority of the Sky Network Services application estate now managed by the Strangler Pipeline it seems timely to reflect upon our goal of retaining our original cycle time of 26 days. Our data suggests that we have been successful, with the cycle time of our Landline Fulfilment and Network Management platforms now 25 days and our greenfield platforms between 18 and 21 days. However, examples such as Messaging 186-13 remind us that cycle time cannot be improved by automation alone, and we must now redouble our efforts to implement Done Means Released, Everybody Is Responsible, and Continuous Improvement. By building the Strangler Pipeline we have followed Donella Meadows’ change management advice to “reduce the probability of destructive behaviours and to encourage the possibility of beneficial ones” and given all we have achieved I am confident that we can Continuously Improve together.

My thanks to my colleagues at Sky Network Services.

The Strangler Pipeline – Legacy and Greenfield

The Strangler Pipeline uses the Stage Strangler pattern to manage legacy and greenfield applications

Previous entries in the Strangler Pipeline series:

  1. The Strangler Pipeline – Introduction
  2. The Strangler Pipeline – Challenges
  3. The Strangler Pipeline – Scaling Up

When our Continuous Delivery journey began at Sky Network Services, one of our goals was to introduce a Repeatable, Reliable Process for our Landline Fulfilment and Network Management platforms by creating a pipeline deployer to replace the disparate Ruby and Perl deployers used by Development and Operations. The combination of a consistent release mechanism and our newly-developed Artifact Container would have enabled us to Bring The Pain Forward from failed deployments, improve lead times, and easily integrate future greenfield platforms and applications into the pipeline. However, the simultaneous introduction of multiple business projects meant that events conspired against us.

While pipeline development was focussed upon improving slow platform build times, business deadlines for the Fibre Broadband project left our Fibre, Numbering, and Providers technical teams with greenfield Landline Fulfilment applications that were compatible with our Artifact Container and incompatible with the legacy Perl deployer. Out of necessity those teams dutifully followed Conway’s Law and created deployment buttons in TeamCity housing application-specific deployers as follows:

  • Fibre: A loathed Ant deployer
  • Numbering: A loved Ant deployer
  • Providers: A loved Maven/Java deployer

Over a period of months, it became apparent that this approach was far from ideal for Operations. Each Landline Fulfilment platform release became a slower, more arduous process as the Perl deployer had to be accompanied by a TeamCity button for each greenfield application. Not only did these extra steps increase processing times, but the use of a Continuous Integration tool ill-suited to release management introduced symptoms of the Deployment Build antipattern, and errors started to creep into deployments.

While Landline Fulfilment releases operated via this multi-step process, a pipeline deployer was developed for the greenfield application platforms. The Landline Assurance, Wifi Fulfilment, and Wifi Assurance technical teams had no time to spare for release tooling and immediately integrated into the pipeline. The pipeline deployer proved successful and consequently demand grew for the pipeline to manage Landline Fulfilment releases as a single aggregate artifact – although surprisingly Operations requested the pipelining of greenfield applications first, due to the proliferation of per-application, per-environment deployment buttons in TeamCity.

A migration method was therefore required for pipelining the entire Landline Fulfilment platform that would not increase the risk of release failure or incur further development costs, and with those constraints in mind we adapted the Strangler pattern for Continuous Delivery as the Stage Strangler pattern. First coined by Martin Fowler and Michael Feathers, the Strangler pattern describes how to gradually wrap a legacy application in a greenfield application in order to safely replace existing features, add new features, and ultimately replace the entire application. By creating a Stage Interface for the different Landline Fulfilment deployers already in use, we were able to kick off a series of conversations with the Landline Fulfilment technical teams about pipeline integration.
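
To make the Stage Interface idea more concrete, here is a minimal sketch of how a pipeline stage might delegate to team-specific deployers behind a common contract. The interface and class names are assumptions made for illustration, not the pipeline’s actual code, and the wrapped commands are merely printed as placeholders.

    // Minimal sketch of a Stage Interface: the pipeline depends only on this
    // contract, while each team's existing deployer is wrapped behind it.
    public interface DeployStage {
        void deploy(String application, String version, String environment);
    }

    // Wraps the legacy Perl deployer (command shown is a placeholder).
    class PerlDeployStage implements DeployStage {
        @Override
        public void deploy(String application, String version, String environment) {
            System.out.printf("perl deploy.pl --app %s --version %s --env %s%n",
                application, version, environment);
        }
    }

    // Delegates Numbering deployments to the team's retained Ant deployer.
    class NumberingDeployStage implements DeployStage {
        @Override
        public void deploy(String application, String version, String environment) {
            System.out.printf("ant -f numbering-deploy.xml -Dversion=%s -Denv=%s%n",
                version, environment);
        }
    }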

We began the Stage Strangler process with the Fibre application deployer, as the Fibre team were only too happy to discard it. We worked together on the necessary changes, deleting the Fibre deployer and introducing a set of version-toggled pipeline deployment buttons in TeamCity. The change in release mechanism was advertised to stakeholders well in advance, and a smooth cutover built up our credibility within Development and Operations.

Deploying Fibre

While immediate replacement of the Numbering application deployer was proposed due to the Deficient Deployer antipattern causing per-server deployment steps for Operations, the Numbering team successfully argued for its retention as it provided additional application monitoring capabilities. We updated the Numbering deployer to conform to our Stage Interface and eliminate the Deficient Deployer symptoms, and then wrote a Numbering-specific pipeline stage that delegated Numbering deployments to that deployer.

Deploy Numbering

The Providers team had invested a lot of time in their application deployer – a custom Maven/Java deployer with an application-specific sign-off process embedded within the Artifactory binary repository. Despite Maven’s Continuous Delivery incompatibilities, build numbers being polluted by release numbers, and the sign-off process triggering the Artifact Promotion antipattern, the Providers team resolutely wished to retain their deployer due to their sunk costs. This resulted in a long-running debate over the relative merits of the different technical solutions, but the Stage Strangler helped us move the conversation forward by shaping it around pipeline compatibility rather than technical uniformity. We wrote a Providers-specific pipeline stage that delegated Providers deployments to that deployer, and the Providers team removed their sign-off process in favour of a platform-wide sign-off process managed by Operations.

Deploy Providers

As all greenfield applications have now been successfully integrated into the pipeline and the remaining Landline Fulfilment legacy applications are in the process of being strangled, it would be accurate to say that the Stage Strangler pattern provided us with a minimal cost, minimal risk method of integrating applications and their existing release mechanisms into our Continuous Delivery pipeline. The use of the Strangler pattern has empowered technical teams to make their own decisions on release tooling, and a sign of our success is that development of new pipeline features continues unabated while the Numbering and Providers teams debate the value of strangling their own deployers in favour of a universal pipeline deployer.

Deploy Anything

The Strangler Pipeline – Scaling Up

The Strangler Pipeline scales via an Artifact Container and Aggregate Artifacts

Previous entries in the Strangler Pipeline series:

  1. The Strangler Pipeline – Introduction
  2. The Strangler Pipeline – Challenges

While Continuous Delivery experience reports abound from organisations such as LMAX and Springer, the pipelines described tend to be focussed upon applying the Repeatable, Reliable Process and Automate Everything principles to the release of a single application. Our Continuous Delivery journey at Sky Network Services has been a contrasting experience, as our sprawling application estate has led to significant scalability demands in addition to more common challenges such as slow build times and unrepeatable release mechanisms.

When pipeline development began 18 months ago, the Sky Network Services application estate consisted of our Network Inventory and Landline Fulfilment platforms of ~25 applications, with a well-established cycle time of monthly Production releases.

However, in a short period of time the demand for pipeline scalability skyrocketed due to the introduction of Fibre Broadband, Landline Assurance, Wifi Fulfilment, Wifi Realtime, and Wifi Assurance.

This means that in under a year our application estate doubled in size to 6 platforms of ~65 applications with the following characteristics:

  • Different application technologies – applications are Scala or Java built by Ant/Maven/Ruby, with Spring/Yadic application containers and Tomcat/Jetty/Java web containers
  • Different platform owners – the Landline Fulfilment platform is owned by multiple teams
  • Different platforms for same applications – the Orders and Services applications are used by both Landline Fulfilment and Wifi Fulfilment
  • Different application lifecycles – applications may be updated every day, once a week, or less frequently

To attain our scalability goals without sacrificing cycle time we followed the advice of Jez Humble and Dave Farley that “the simplest approach, and one that scales up to a surprising degree, is to have a [single] pipeline”, and we built a single pipeline based upon the Artifact Container and Aggregate Artifact pipeline patterns.

For the commit stage of application artifacts, the pipeline provides an interface rather than an implementation. While a single application pipeline would be solely responsible for the assembly and unit testing of application artifacts, this strategy would not scale for multi-application pipelines. Rather than incur significant costs in imposing a common build process upon all applications, the commit interface asks that each application artifact be fully acceptance-tested, provide associated pipeline metadata, and conform to our Artifact Container. This ensures that application artifacts are readily accessible to the pipeline with minimal integration costs, and that the pipeline itself remains independent of different application technologies.
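
The commit interface can be pictured as a small contract that every application artifact must satisfy before the pipeline will accept it. The sketch below is an assumption about its shape; the method names and checks are illustrative rather than taken from the real Artifact Container specification.

    // Minimal sketch of the commit interface: the pipeline accepts any artifact
    // that carries the expected metadata and conforms to the Artifact Container,
    // regardless of how it was built (Ant, Maven, Ruby, etc.).
    public interface CommittedArtifact {
        String application();           // e.g. "orders"
        String version();               // e.g. "317"
        boolean acceptanceTested();     // the producing team vouches for quality
        java.nio.file.Path container(); // archive conforming to the Artifact Container
    }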

For the creation of platform artifacts, the pipeline contains a commit stage implementation that creates and persists aggregate artifacts to the artifact repository. Whereas an application commit is automatically triggered by a version control modification, a platform commit is manually triggered by a platform owner specifying the platform version and a list of pre-built constituent application artifacts. The pipeline compares constituent metadata against its aggregate definitions to ensure a valid aggregate can be built, before creating an aggregate XML file to act as a version manifest for future releases of that platform version. The use of aggregate artifacts provides a tool for different teams to collaborate on the same platform, different platforms to share the same application artifacts, and for different application lifecycles to be encapsulated behind a communicable platform release version.
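
A sketch of the platform commit step follows: it validates the requested constituents against an aggregate definition and renders a simple version manifest. The manifest layout, class names, and the example aggregate definition are assumptions made for illustration.

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Minimal sketch: validate a platform commit and render a version manifest
    // that later release stages can use to resolve constituent artifacts.
    public class AggregateCommit {

        public record Constituent(String application, String version) {}

        // Hypothetical aggregate definition: the applications a platform may contain.
        private final Map<String, Set<String>> aggregateDefinitions = Map.of(
            "wifi-fulfilment", Set.of("orders", "services"));

        public String commit(String platform, String platformVersion, List<Constituent> constituents) {
            Set<String> allowed = aggregateDefinitions.get(platform);
            for (Constituent c : constituents) {
                if (allowed == null || !allowed.contains(c.application())) {
                    throw new IllegalArgumentException(
                        c.application() + " is not a constituent of " + platform);
                }
            }
            StringBuilder manifest = new StringBuilder();
            manifest.append("<aggregate name=\"").append(platform)
                    .append("\" version=\"").append(platformVersion).append("\">\n");
            for (Constituent c : constituents) {
                manifest.append("  <artifact name=\"").append(c.application())
                        .append("\" version=\"").append(c.version()).append("\"/>\n");
            }
            manifest.append("</aggregate>\n");
            return manifest.toString(); // persisted to the artifact repository in the real pipeline
        }
    }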

While the Strangler Pipeline manages the release of application artifacts via a Repeatable Reliable Process akin to a single application pipeline, the use of the Aggregate Artifact pattern means that an incremental release mechanism is readily available for platform artifacts. When the release of an aggregate artifact into an environment is triggered, the pipeline inspects the metadata of each aggregate constituent and only releases the application artifacts that have not previously entered the target environment. For example, if Wifi Fulfilment 1.0 was previously released containing Orders 317 and Services 192, a release of Wifi Fulfilment 2.0 containing Orders 317 and Services 202 would only release the updated Services artifact. This approach reduces lead times and by minimising change sets reduces the risk of release failure.
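
The incremental release behaviour amounts to a set difference between the aggregate being released and what the target environment already contains. The sketch below replays the Wifi Fulfilment example; names such as releasedVersions are assumptions for illustration, as the real pipeline reads this information from artifact metadata.

    import java.util.List;
    import java.util.Map;

    // Minimal sketch of the incremental release: only constituents whose versions
    // differ from what the target environment already holds are released.
    public class IncrementalRelease {

        public List<Map.Entry<String, String>> changedConstituents(
                Map<String, String> aggregate,          // application -> version being released
                Map<String, String> releasedVersions) { // application -> version already in the environment
            return aggregate.entrySet().stream()
                .filter(e -> !e.getValue().equals(releasedVersions.get(e.getKey())))
                .toList();
        }

        public static void main(String[] args) {
            // Wifi Fulfilment 2.0 contains Orders 317 and Services 202; the target
            // environment already holds Wifi Fulfilment 1.0 (Orders 317, Services 192).
            IncrementalRelease release = new IncrementalRelease();
            List<Map.Entry<String, String>> toRelease = release.changedConstituents(
                Map.of("orders", "317", "services", "202"),
                Map.of("orders", "317", "services", "192"));
            System.out.println(toRelease); // only services=202 is released
        }
    }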

A good heuristic for pipeline scalability is that a state of Authority without Responsibility is a smell. For example, we initially implemented a per-application configuration whitelist as a hardcoded regex within the pipeline. That might have sufficed in a single application pipeline, but the maintenance cost in a multi-application pipeline became a painful burden as different application-specific configuration policies evolved. The problem was solved by making the whitelist itself configurable, which empowered teams to be responsible for their own configuration and allowed configuration to change independent of a pipeline version.
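
As an illustration of that change, the sketch below contrasts the hardcoded approach with a whitelist loaded from per-application configuration. The file format and method names are assumptions, not the pipeline’s actual configuration scheme.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.regex.Pattern;

    // Minimal sketch: the configuration whitelist is itself configuration, read
    // from a per-application file owned by the application team, rather than a
    // regex hardcoded into the pipeline.
    public class ConfigurationWhitelist {

        public static List<Pattern> load(Path whitelistFile) throws IOException {
            // One pattern per line, e.g. "orders\.db\..*"
            return Files.readAllLines(whitelistFile).stream()
                .filter(line -> !line.isBlank())
                .map(Pattern::compile)
                .toList();
        }

        public static boolean permitted(String property, List<Pattern> whitelist) {
            return whitelist.stream().anyMatch(p -> p.matcher(property).matches());
        }
    }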

In hindsight, while the widespread adoption of our Artifact Container has protected the pipeline from application-specific behaviours impeding pipeline scalability, it is the use of the Aggregate Artifact pattern that has so successfully enabled scalable application platform releases. The Strangler Pipeline has the ability to release application platform versions containing a single updated application, multiple updated applications, or even other application platforms themselves.

The Strangler Pipeline – Challenges

The Strangler Pipeline introduced a Repeatable Reliable Process for start/stop, deployment, and database migration

Previous entries in the Strangler Pipeline series:

  1. The Strangler Pipeline – Introduction

To start our Continuous Delivery journey at Sky Network Services, we created a cross-team working group and identified the following challenges:

  • Slow platform build times. Developers used brittle, slow Maven/Ruby scripts to construct platforms of applications
  • Different start/stop methods. Developers used a Ruby script to start/stop individual applications, server administrators used a Perl script to start/stop platforms of applications
  • Different deployment methods. Developers used a Ruby script to deploy applications, server administrators used a Perl script to deploy platforms of applications driven by a Subversion tag
  • Different database migration methods. Developers used Maven to migrate applications, database administrators used a set of Perl scripts to migrate platforms of applications driven by the same Subversion tag

As automated release management is not our core business function, we initially examined a number of commercial and open-source off-the-shelf products such as ThoughtWorks Go, LinkedIn Glu, Anthill Pro, and Jenkins. However, despite identifying Go as an attractive option we reluctantly decided to build a custom pipeline. As our application estate already consisted of ~30 applications, we were concerned that the migration cost of introducing a new release management product would be disproportionately high. Furthermore, a well-established Continuous Integration solution of Artifactory Pro and a 24-agent TeamCity build farm was in situ, and to recommend discarding such a large financial investment with no identifiable upfront value would have been professional irresponsibility bordering upon consultancy. We listened to Bodart’s Law and reconciled ourselves to building a low-cost, highly scalable pipeline capable of supporting our applications in order of business and operational value.

With trust between Development and Operations at a low ebb, our first priority was to improve platform build times. With Maven used to build and release the entire application estate, the use of non-unique snapshots in conjunction with the Maven Release plugin meant that a platform build could take up to 60 minutes, recompile the application binaries, and frequently fail due to transitive dependencies. To overcome this problem we decreed that using the Maven Release plugin violated Build Your Binaries Only Once, and we placed Maven in a bounded CI context of clean-verify. Standalone application binaries were built at fixed versions using the Axel Fontaine solution, and a custom Ant script was written to transform Maven snapshots into releasable artifacts. As a result of these changes platform build times shrank from 60 minutes to 10 minutes, improving release cadence and restoring trust between Development and Operations.

In the meantime, some of our senior Operations staff had been drawing up a new process for starting/stopping applications. While the existing release procedure of deploy -> stop -> migrate -> set current version -> start was compatible with the Decouple Deployment From Release principle, the start/stop scripts used by Operations were coupled to Apache Tomcat wrapper scripts due to prior use. The Operations team were aware that new applications were being developed for Jetty and Java Web Server, and collectively it was acknowledged that the existing model left Operations in the undesirable state of Responsibility Without Authority. To resolve this Operations proposed that all future application binaries should be ZIP archives containing zero-parameter start and stop shell scripts, and this became the first version of our Binary Interface. This strategy empowered Development teams to choose whichever technology was most appropriate to solve business problems, and decoupled Operations teams from knowledge of different start/stop implementations.
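
A rough sketch of how the pipeline might enforce that Binary Interface is shown below: it simply checks that an application archive contains the zero-parameter start and stop shell scripts. The entry names start.sh and stop.sh are assumptions; the actual interface may specify different paths.

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.zip.ZipFile;

    // Minimal sketch: verify an application binary honours the Binary Interface
    // by containing zero-parameter start and stop shell scripts.
    public class BinaryInterfaceCheck {

        public static void verify(Path binary) throws IOException {
            try (ZipFile zip = new ZipFile(binary.toFile())) {
                for (String script : new String[] {"start.sh", "stop.sh"}) {
                    if (zip.getEntry(script) == null) {
                        throw new IllegalStateException(
                            binary + " does not conform to the Binary Interface: missing " + script);
                    }
                }
            }
        }
    }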

Although the Binary Interface proved over time to be successful, the understandable desire to decommission the Perl deployment scripts meant that early versions of the Binary Interface also called for deployment, database migration, and symlinking scripts to be provided in each ZIP archive. It was successfully argued that this conflated the need for binary-specific start/stop policies with application-neutral deploy/migrate policies, and as a result the latter responsibilities were earmarked for our pipeline.

Implementing a cross-team plan of action for database migration has proven far more challenging. The considerable amount of customer-sensitive data in our Production databases encouraged risk aversion, and there was a sizeable technology gap. Different Development teams used different Maven plugins and database administrators used a set of unfathomable Perl scripts run from a Subversion tag. That risk aversion and gulf in knowledge meant that a cross-team migration strategy was slow to emerge, and its implementation remains in progress. However, we did experience a Quick Win and resolve the insidious Subversion coupling when a source code move in Subversion caused an unnecessary database migration failure. A pipeline stage was introduced to deliver application SQL from Artifactory to the Perl script source directories on the database servers. While this solution did not provide full database migration, it resolved an immediate problem for all teams and better positioned us for full database migration at a later date.
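
That delivery stage can be pictured as a straightforward copy of the application’s SQL out of the released artifact into the directories the existing migration scripts already read from. The directory layout below is a placeholder assumption, not the real server configuration.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.stream.Stream;

    // Minimal sketch of the SQL delivery stage: copy the application's SQL files
    // from the unpacked artifact into the directory the existing migration
    // scripts consume, decoupling them from Subversion source locations.
    public class SqlDeliveryStage {

        public static void deliver(Path unpackedArtifactSql, Path migrationScriptDir) throws IOException {
            Files.createDirectories(migrationScriptDir);
            try (Stream<Path> files = Files.list(unpackedArtifactSql)) {
                for (Path sql : files.filter(p -> p.toString().endsWith(".sql")).toList()) {
                    Files.copy(sql, migrationScriptDir.resolve(sql.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }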

With the benefit of hindsight, it is clear that the above tooling discrepancies, disparate release processes, and communications issues were rooted in Development and Operations historically working in separate silos, as forewarned by Conway’s Law. These problems were solved by Development and Operations teams coming together to create and implement cross-team policies, and this formed a template for future co-operation on the Strangler Pipeline.

The Strangler Pipeline – Introduction

Continuously Delivering greenfield and legacy applications en masse

I recently gave a talk at Agile Horizons 2012 on behalf of my amazing employer Sky Network Services, detailing our yearlong Continuous Delivery journey and the evolution of our Strangler Pipeline. As a follow-up I intend to write a series of articles on our pipeline, as it is a narrative far removed from the “pipelining a single greenfield application” model often found in Continuous Delivery experience reports.

Sky Network Services is an agile, innovative technology company that produces telecommunications middleware for BSkyB. Despite a plethora of talented technical/non-technical staff and an enviable reputation for delivering quality software, an in-house analysis in mid-2011 identified a number of problems:

  • Many applications used different methods of deployment, start, stop, and database migration in different environments
  • There was little visibility of which application versions were progressing through the test environments at any given time
  • Releasing a minor bug fix for an application necessitated a re-release of the parent platform
  • Development teams and Operations teams were constrained to separate silos

At this point we were attracted to the Continuous Delivery value proposition, albeit with the additional challenge of scaling our pipeline to manage an estate of legacy/greenfield applications that in the past year has doubled in size.

In this series of articles I aim to cover:

  1. Challenges – how we solved the more common Continuous Delivery challenges
  2. Scaling Up – how we scaled our pipeline to manage our ever-growing application estate
  3. Legacy and Greenfield – how we simultaneously release legacy and greenfield applications
  4. Autonomation – how we established a Continuous Delivery transformation across a Waterscrumfall organisation
