Zero Outage Map
The obvious question is why we chose an IT value chain concept and what it entails. Let’s start with what it is. The value chain is a well-known business concept, described by Michael Porter in 1985(1) (see graphic 2). The principles are actually pretty simple:
- A sequence of related steps (primary activities) that successively create value for related stakeholders, e.g. customers, shareholders. The key to this definition is that the value outcome of the chain is greater than the sum of the parts.
- The additional value is created by the synergy of working together, based on an integrating and automating commonality (supporting activities), which can be common business functions (e.g. procurement), technology (e.g. master data of a supply chain) or infrastructure (e.g. a conveyer belt setup in a production line). The key is that this common connecting tissue makes the value chain more efficient, repeatable and predictable.
IT cannot really claim predictability: the typical end-to-end maturity is fairly low, as is the level of collaboration across organisational and technology silos. IT has grown by answering technology disruptions with dedicated solutions in a siloed manner, never having the time to mature and truly integrate them into the overall landscape.
When thinking about this, it becomes glaringly obvious that the value chain concept is highly applicable to IT, and specifically to Zero Outage. First, when decomposing the steps required to deliver end-to-end services, one can quickly see how they need to be logically chained, still allowing iteration. Second, to deliver to the Zero Outage quality level, the chain needs to be highly automated, fault tolerant and most importantly predictable.
Leverage of IT4IT™
In order to design the Zero Outage Map as an end-to-end value chain, it seems obvious to leverage the IT4IT™ Value Chain standard(2), published by The Open Group, which describes the IT landscape and how to run IT as a business (see graphic 3). Effectively, The Open Group applied and translated the Porter Value Chain to the IT problem. The huge advantage is that this is defined as an open industry standard, continuously reviewed and evolved by a representative group of consumers and providers.
The IT4IT™ value chain is structured into four value streams, describing the well-known Plan, Build and Run phases, but adding the Deliver phase, which separates the concern of creating service release packages from their actual instantiation in production via service order and fulfilment catalogues. This is a direct response to IT trends such as DevOps, Cloud and Service Broker becoming pervasive.
Furthermore, based on the value chain concept, The Open Group developed and published the IT4IT Reference Architecture standard, which provides a functional and an information model prescribing how to deliver services in a business fashion. This is a very good starting point for the Zero Outage Industry Standard to expand on, adding Zero Outage-specific architectural policies and data model aspects.
(1) Porter, Michael E., "Competitive Advantage". 1985, Ch. 1, pp 11-15. The Free Press. New York.
(2) IT4IT™ is a trademark of The Open Group
When looking at the IT4IT™ Value Chain, its commonality with and applicability to the Zero Outage problem is obvious. All of its phases are significant in the Zero Outage interpretation of the value chain, as articulated by the Zero Outage Map:
- Properly rationalize the business demand and “Plan” for Zero Outage quality services within the given service delivery boundary conditions, such as strategic, financial, legal, and architectural. This entails strategic criteria, such as an operating model tuned towards Zero Outage, as well as specific design criteria, such as properly architecting the infrastructure stack to ensure the required level of resilience and security.
- Next, the services need to be “Built” to meet the criteria determined in the “plan” phase. In particular, the specific design criteria need to be translated into the appropriate non-functional requirements, such as availability, performance and security, which are key to delivering Zero Outage. The Zero Outage Industry Standard will provide practical guidance to develop actionable requirements, which guide the proper design and development of the service and are particularly useful to hold service providers accountable when sourcing elements of the service.
- Traditionally “plan” and “build” were followed by “run”, but the revolution of virtualisation technology, sourcing models and development methodologies (agile, DevOps) required the innovation of a fourth, intermediate step called “Deliver”. These revolutions all have one major consequence: complexity, making it much more difficult to see and track what is going on. That is in direct contradiction to the keys of the value chain concept, namely collaboration and predictability. The traditionalists would say “Keep cloud away from Zero Outage” and the New Agers would counter “Cloud solves your resilience problem by definition”. As always, both have a point. Therefore we need to make services work in hybrid environments (being the dynamic mix of traditional and cloud elements), and we need to make them manageable. One crucial element is the ability to construct services from various catalogues and various providers, and to activate such services in heterogeneous, hybrid infrastructure environments while keeping them under control. This includes control of usage and respective charging, as well as manageability. Even though there is more to it, this is the essence of “deliver”, which IT4IT calls Request to Fulfil.
- Finally, and probably best known in terms of processes and management maturity, is the “Run” phase, which assures that the services in use are delivered at the required Zero Outage quality level and stay that way through the reality of inevitable, continuous and dynamic change. However, Zero Outage changes the game significantly. Organisations used to be fairly nonchalant about the notion of “proactive operations”: it seemed to be “good enough” to automate the known and fatalistically accept the disruption of new surprises. However, Zero Business Outage means that there can’t be surprises. Therefore Zero Outage requires the traditional “run” to be innovated towards anticipating and preventing issues before they disrupt the business.
In addition to the four main phases of Plan, Build, Deliver and Run, the Zero Outage Map also reflects the fact that services become obsolete, hence a phase to “Retire” services has been added. Zero Outage services typically require costly and/or labour intensive components, which should be released and made available to other use cases when no longer required.
Like the IT4IT™ Value Chain, we place the Service Model in the centre as the connecting tissue and the heart of the Zero Outage Industry Standard, as described earlier. It is of specific relevance to the work of the platform and security workstreams.
We have chosen to depict the value chain as a circular rather than a linear model, knowing that modern IT requires continuous iteration between various capabilities within and between phases. In addition we chose to depict supporting functions as a surrounding frame and focused on selected functions specifically important for Zero Outage:
- Governance: Zero Outage has a lot to do with guaranteeing a certain quality, which in turn requires governance and control to ensure it actually happens. In addition, it requires governance continuously across the entire value chain. At any given point one needs to be able to determine the current state of service delivery and the required course of action.
- Analytics & Reporting: this is a key enabler as it provides crucial insights to continuously improve service delivery. It could be argued that this is a sub-function of Governance and Risk Management, but it has evolved to be a science. Big data technologies have evolved to become a source of innovation in and of themselves, e.g. the detection of anomalies and the anticipation of failure are only feasible through the level of analysis that can be done today.
- Risk Management: decisions always need to find the right balance between conflicting priorities and service delivery boundary conditions. In order to avoid any business degradation one can’t really afford surprises, hence risks need to be proactively understood and managed, especially the impact of actions and changes.
- Supplier Management: multi-supplier service delivery is mainstream, even though the maturity of it is often more on the low end. One of the objectives of the Zero Outage Industry Standard is to structure and streamline the cooperation of suppliers jointly delivering Zero Outage compliant services. One of the key elements is to make the touchpoints between suppliers transparent and measurable.
The service model provides the common context throughout the execution of the value chain, from Plan to Run. It is the source of truth that captures and shares the relevant information about a service at any given point in time.
One can compare the model to master data controlling a supply chain. The consistency and integrity of the service model is the basis for achieving transparency and traceability of the characteristics of a service throughout its lifetime.
The service model evolves over the lifecycle of a service throughout the value chain:
- Conceptual service – the conceptual model represents the output of planning the service, essentially the description “why” and “what” needs to be built in a Zero Outage compliant architectural context.
- Logical service – the logical model expands on the conceptual, adding conclusions “how” the service is built to meet the Zero Outage relevant non-functional requirements in a Zero Outage compliant technology and system architecture.
- Physical service – the physical model further expands on the logical model “with what” the service has been realised in the physical world and how it's being managed and kept current. We delineate between two instances of the physical service: the “Desired Service” model as output of the request fulfilment and the “Actual Service” model as being recognized and managed in operations.
So, throughout the value chain the respective capabilities are based upon a formal specification of the service, always talking about the same service but at different levels of granularity and specificity. The generic structure of the service model is described in the “Layered Model” section below, which the platform workstream will expand upon with detailed design.
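The evolution of the service model described above can be pictured as nested records, each stage keeping a reference to the one before it. The following is a minimal sketch only: the class and field names are hypothetical illustrations and are not taken from the IT4IT™ standard or from the Zero Outage specification.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptualService:
    """Output of planning: why and what needs to be built."""
    name: str
    purpose: str
    architectural_policies: list[str] = field(default_factory=list)


@dataclass
class LogicalService:
    """Adds how the service is built to meet its non-functional requirements."""
    conceptual: ConceptualService
    components: list[str]
    non_functional_requirements: dict[str, str]


@dataclass
class PhysicalService:
    """Adds with what the service has been realised."""
    logical: LogicalService
    desired: dict[str, str]  # "Desired Service": output of request fulfilment
    actual: dict[str, str]   # "Actual Service": as recognised in operations

    def trace(self) -> list[str]:
        """Walk back from the physical instance to the business purpose."""
        return [
            f"physical: {sorted(self.desired)}",
            f"logical: {self.logical.components}",
            f"conceptual: {self.logical.conceptual.purpose}",
        ]


# Illustrative values only.
concept = ConceptualService(
    name="payments",
    purpose="settle card payments without interruption",
    architectural_policies=["active-active across two sites"],
)
logical = LogicalService(concept, ["api", "ledger-db"], {"availability": "99.99%"})
physical = PhysicalService(
    logical,
    desired={"api": "v2", "ledger-db": "v5"},
    actual={"api": "v2", "ledger-db": "v5"},
)
```

Because every stage references its predecessor, a question raised in operations can be traced back to the original planning decision, which is the kind of traceability the service model is meant to provide.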
A value chain implementation typically includes tools from various vendors, hence it is mandatory to have a common interpretation of the service model data, requiring a common syntax and semantics of its attributes. This is what the IT4IT™ standard started to create and continues to evolve. However, it is likely that Zero Outage use cases will require additional specifications, therefore we plan to cooperate with The Open Group and drive those additions through their standardisation process.
After exploring the Zero Outage Map on the highest level, let’s drill down one level and look at the capabilities required for the individual phases of the value chain and their respective relevance for the Zero Outage use cases.
In this chapter the Zero Outage Map specification adds to the content of the IT4IT™ standard, but also slightly deviates from it. This is because the IT4IT™ standard focuses on the functional and information model rather than the capability level, which it only loosely articulates and mostly from a traditional IT perspective. Therefore this capability view adds forward thinking to reformulate known IT capabilities for the new requirements of the digital enterprise.
The overall capability view in Graphic 5 expands each phase with a cycle of 4 core capabilities that articulate what needs to happen. These cycles however neither work independently, nor in a fixed waterfall-type manner. To the contrary, there are typically iterations within the cycle and interoperability between the cycles at any given point in time.
The important fact is that all these interactions happen in a transparent and traceable fashion, tracked in the common context of the service model, which evolves in the level of detail and prescriptiveness over the course of the value chain. It is important to understand the service model concept before diving into the capabilities themselves.
Generic description of the Plan capabilities
- Service Strategy and Sourcing – description of the business boundary conditions for delivering the specific service quality level for the target market, e.g. the Zero Outage justification for a target market (market opportunity, financials etc.), the portfolio priorities and the related sourcing strategy. The resulting conceptual service model codifies the architectural consequences of delivering the service according to these strategic boundary conditions.
- Enterprise Architecture Management – the methodology, the architecture and technology guidelines for the overall service and the underlying application(s), e.g. native cloud applications need to be architected as a set of integrated micro services in order to leverage the resilience capabilities of the underlying cloud infrastructure. The resulting structure, design guidelines and architectural policies become part of the conceptual model.
- Business Demand Management – rationalizing, harmonizing and grouping the business and operational demand for existing and new services into an actionable portfolio backlog. Actionable means translating the business purpose into a qualified demand specifically describing the non-functional characteristics. That allows the backlog to be prioritized effectively according to the strategic and architectural guidelines, which in turn guides agile development.
- Portfolio Management – determining/analysing the scope and value of the “to be” portfolio based upon what has been done (“as is”), what should be done (value generating backlog priorities) and what can be done (budget, skills). The result is a conceptual model articulating a clear proposal/expectation for each desired service in the portfolio.
Specific relevance to Zero Outage: in order to achieve Zero Outage, it is NOT enough to simply improve the maturity of managing the service. The objective is to fundamentally change the approach from reactive to proactive management, to avoid issues before they occur. The prerequisites for addressing that problem are transparency and predictability of what is being done, which requires consistent and formal documentation throughout the lifecycle of a service, as the basis of learning what we don’t know. One could argue that all capabilities are important, but two seem most relevant:
- Enterprise Architecture – transparency and predictability start with a structured (service-oriented) architecture of the service model and the platform that provides the required architectural and technical criteria.
- Business Demand Management – one can avoid the majority of issues by building the right service and building it right in the first place. The latter is critical in avoiding service degradation: building the service based on the right availability, performance and security requirements. Practical guidelines and examples as to how to translate generic demand e.g. business continuity into the right architecture are critical for Zero Outage and will make a notable difference.
Generic description of the Build capabilities
- Requirements Management – for the services to be newly sourced/developed or updated, the related business demands must be translated into functional requirements. Most important for Zero Outage is the translation of architectural policies into non-functional requirements, e.g. security, performance, resilience. Analysing, scoping, prioritizing and planning the requirements as consumable, packaged value.
- Service Design Engineering – translating the conceptual into a logical model of the service. Developing user stories and detailed design from the respective requirements in the backlog. Driving sourcing decisions on the service component level.
- Service Development and Testing – the actual development of the service, which might entail SW development, sourcing or buying elements of the SW, system integration etc. There is no limitation regarding the choice or mix of agile and waterfall models. Once the consumable value is integrated, the function of the service and the non-functional constraints need to be verified under real-life conditions.
- Service Release Management – once the service is sufficiently tested (based on the intended quality level) and the risk analysis meets the release criteria, the service is ready for general release. A deployable release package is generated and it is verified that the package can be successfully deployed.
Specific relevance to Zero Outage: building the service continues the thread of designing and building the service right. The following two capabilities are especially exposed to that problem:
- Requirements Management – planning the right architecture is the first step. Now this needs to be broken down into actionable and measurable non-functional requirements that can be acted upon by developers to create better code, which is better suited to the target production environments. The ability to formulate high-quality non-functional requirements is typically at a low level of maturity in most organizations, hence practical best practices will help tremendously.
- Service Design Engineering – consequently the next critical step is to design a Zero Outage compliant service, which translates the non-functional requirements into the appropriate service architecture and its related logical service model. This involves decisions, such as which deployment model (traditional vs. cloud) to use for which layer, and taking the interdependencies between layers into account. Zero Outage design principles will prescribe the right approach to service modelling and its underlying technologies. Equally important for Zero Outage is the design of the required test cases and their level of automation.
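As an illustration of what a measurable non-functional requirement can look like, the sketch below converts an availability target into a yearly downtime budget and combines the availability of several layers. It assumes, for simplicity, that all layers are required and fail independently (a serial chain); redundancy within a layer changes the arithmetic. The layer names and figures are illustrative assumptions, not values from the Zero Outage standard.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year for a given availability target (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(layers: dict[str, float]) -> float:
    """End-to-end availability when every layer is required and
    layers are assumed to fail independently of each other."""
    result = 1.0
    for value in layers.values():
        result *= value
    return result


# Hypothetical per-layer targets for a three-layer service.
layers = {"network": 0.9999, "compute": 0.9995, "database": 0.9999}
end_to_end = serial_availability(layers)

print(f"end-to-end availability: {end_to_end:.6f}")
print(f"yearly downtime budget: {downtime_budget_minutes(end_to_end):.1f} minutes")
```

The end-to-end figure is always lower than the weakest layer, which is why the interdependencies between layers matter when choosing a deployment model for each of them.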
Generic description of the Deliver capabilities
- Service Offer Creation – instead of just passing a release along to the operations team, the activation of the service is based upon an offer catalogue specification from which services can be consumed. Therefore the release package is published as a service catalogue entry. In today’s virtualised world, this may involve aggregation of various service components from different catalogues into one consumable item.
- Service Consumption Management – providing a seamless consumption experience via the service catalogue (e.g. as part of the self-service portal) facilitating the shopping, request and ordering process, hiding the complexity of the service from the user (individual or business).
- Service Activation – when being ordered, the service needs to be activated based on the consumer requirements. This involves determining the required physical or virtual infrastructure on which to realise the service, interlocking with the change process. Again, in today’s hybrid IT world, this may well involve deploying components of the service to different physical infrastructures using different fulfilment engines.
- Service Usage Management – measuring the usage of the activated service and facilitating the appropriate charging, specified in the delivery model, e.g. cross-charge or general allocation through the IT Financial Management system in place.
Specific relevance to Zero Outage: the structure of the catalogue-based service activation is important to sustain data integrity, especially due to the complexity and lack of transparency of cloud and multi-supplier models. The following two capabilities prominently drive that characteristic:
- Service Offer Creation – it is critical to model the complexity of virtualization and distribution of the service components in the service catalogue system. The modeled structure of related components drives the automatic service aggregation of the components from the underlying fulfilment catalogue(s).
- Service Activation – parts of the service may reside in different deployment models controlled by different suppliers, hence it is critical to maintain access points in the desired service model (e.g. through a standardized API gateway). The deployment models might dynamically change, hence integration points are critical to manage these changes while maintaining the related Zero Outage requirements, such as the resilience level of a component. Again, the aforementioned design principles will include that prescription.
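The aggregation of a service offer from several fulfilment catalogues can be sketched as a lookup that only accepts components meeting the required resilience level. This is a simplified illustration under stated assumptions: the `CatalogueItem` fields, the integer resilience scale and the supplier names are hypothetical, and a real catalogue system would carry far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueItem:
    component: str
    supplier: str
    deployment_model: str  # e.g. "traditional" or "cloud"
    resilience_level: int  # hypothetical scale: higher means more resilient


class MissingComponentError(LookupError):
    """No catalogue offers the component at the required resilience level."""


def aggregate_offer(
    required: dict[str, int],
    catalogues: list[dict[str, CatalogueItem]],
) -> dict[str, CatalogueItem]:
    """Build one consumable offer from several fulfilment catalogues.

    `required` maps each component name to its minimum resilience level.
    Catalogues are searched in order; the first matching item is taken.
    """
    offer: dict[str, CatalogueItem] = {}
    for component, min_level in required.items():
        for catalogue in catalogues:
            item = catalogue.get(component)
            if item is not None and item.resilience_level >= min_level:
                offer[component] = item
                break
        else:
            raise MissingComponentError(
                f"{component} not available at resilience level {min_level}"
            )
    return offer


# Illustrative catalogues from two hypothetical suppliers.
cloud = {
    "web": CatalogueItem("web", "Supplier B", "cloud", 2),
    "database": CatalogueItem("database", "Supplier B", "cloud", 2),
}
on_premises = {
    "database": CatalogueItem("database", "Supplier A", "traditional", 3),
}

offer = aggregate_offer({"web": 2, "database": 3}, [cloud, on_premises])
```

In this example the web tier is taken from the cloud catalogue, while the database falls through to the on-premises catalogue because the cloud entry does not meet the required resilience level. The consumer sees one offer; the hybrid composition stays recorded in the model.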
Generic description of the Run capabilities
- Preventive Health Management – classically this includes monitoring the availability and performance of the services and driving event management. For Zero Outage that is not sufficient though: it is all about anticipating issues before they occur.
- Service Assurance – managing processes to assure that the required service levels are being met, which includes help desk, incident and problem management. This involves the ability to analyse and diagnose the relevance, potential impact and cause of potential issues, proactively avoiding the breach of service levels.
- Knowledge Management & Automation – translating learnings into predictable and repeatable actions. Determining the tasks, priorities and timelines required for repair while keeping the overall resilience level of the service. Translating best practices into automated runbooks, minimizing human intervention and error.
- Configuration & Change Management – discovering the actual state of the service, comparing it to the desired state and reconciling inconsistencies (via the physical service model). This provides the basis for managing the risk and the execution of required changes. Those actions could be executed and tracked automatically (e.g. adding bandwidth) or formally governed and requiring interaction (e.g. version upgrade), based on the guidelines specified in the service model. Patching is a specific type of change, like a release; the difference becomes marginal in the agile world.
Specific relevance to Zero Outage: the intention of Zero Outage is to guarantee no service degradation, while changes are performed on components of the service. That means we cannot afford surprises managing services, hence we need to manage key characteristics (e.g. resilience) proactively at all times, and we need to learn what we don’t know. One could argue that all Run capabilities (integrated) are critical for Zero Outage, but two seem to be most critical for proactive resilience:
- Preventive Health Management – we need to learn what we don’t know, to anticipate issues before they occur. While the end-to-end monitoring of structured data from the known environment is very important, it is not enough. It needs to evolve towards understanding patterns of behaviour, pinpointing abnormal behaviour, and investigating and mitigating/resolving it proactively. Transparency and integration of structured data is the basis for establishing a system of record, but the ability to capture vast amounts of unstructured data, and to integrate and analyse it in the context of the system of record, brings it to a different level. The use of modern big data technologies is a true innovation opportunity towards establishing a system of insight for IT that helps avoid surprises.
- Configuration & Change Management – still today, many issues resulting in outages are based on poorly managed changes. Managing the risk of change and sustaining the service model integrity is key to achieving Zero Outage. This is not only a process question, but also needs to be reflected in the underlying data structure, namely the service model and the way access points are defined and brokered.
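To make the idea of pinpointing abnormal behaviour more concrete, the sketch below flags measurements that deviate strongly from their recent history using a rolling z-score. This is a deliberately simple statistical stand-in, not the big data analytics the text refers to; the window size, threshold and latency figures are illustrative assumptions.

```python
from statistics import mean, stdev


def find_anomalies(
    samples: list[float], window: int = 5, threshold: float = 3.0
) -> list[int]:
    """Return the indices of samples that deviate from the preceding
    `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any different value is treated as abnormal.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds.
latencies = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
suspicious = find_anomalies(latencies)
```

One known limitation of this approach is that an outlier inflates the statistics of the windows that follow it, so later deviations may be masked. More robust baselines, and correlation with the service model, are where the analytics capabilities described above come in.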