Zero Outage Implications to Platform Architecture and Design
This document describes the architectural and design principles needed to avoid outages in any IT implementation that wants to comply with the Zero-Outage Industry Standard (ZOIS). Following the guidelines in this document will help to reach a “Zero Outage” compliant solution that delivers value to customers with uninterrupted uptime. In the remainder of the document, we call this a ZO implementation. From these guidelines, we intend to derive best practices in detailed documents.
A Zero-Outage (ZO) implementation will consist of multiple technology layers. Each layer has dependencies on “lower” and/or “higher” layers, plus dependencies within the layer itself.
We need to break down each layer into smaller elements, which we call “building blocks”, and detail for each of them the requirements that must be met for it to contribute to a comprehensively compliant ZO implementation. We also describe the interactions between building blocks and the requirements these interactions must respect to maintain an overall compliant infrastructure.
The following figure depicts the three layers (IT Infrastructure, Systems, and Business Solution) into which our architectural model is broken down. As shown, in a concrete implementation these layers run in data centres, deployed either on premise or in the cloud.
In a real operating scenario, issues can be caused by planned and unplanned events occurring on any of the layers noted above. Examples of such events are natural catastrophes, a series of HW/SW failures, and changes that must be made within the implementation. Countermeasures against adverse events are needed in all layers and in all phases of the life-cycle of a ZO implementation, from plan to build, deliver, and run. They also need to range across platform, security, people, and processes.
The approach we take is to identify possible issues for all parts of a ZO implementation and define countermeasures against them. Countermeasures first need to be implemented in the architecture and design of the infrastructure, because these are the most effective and efficient way to systematically avoid issues. There are different ways to reach a ZO implementation, also influenced by its purpose, so the concrete set of countermeasures has to be defined for each implementation. In the following sections we describe common current best practices that should help to achieve a ZO implementation.
The following figure describes the building blocks of our three main layers in a simplified way. Building blocks consist of a number of smaller parts, which we call components.
As an example, the building block Network is built of components like cabling, routers, switches, etc.
In ZOIS, the focus is on systems running critical business solutions which require an “always-on” (Zero Outage) service.
To reach an “always-on” solution and operational state, the components of a ZO implementation must meet certain resilience requirements. This needs to be implemented in tight connection with what is described in processes, people, and security work streams of our ZO Industry Standard.
The architecture of a ZOIS-compliant implementation must ensure that all building blocks are compliant, or that the corresponding ZO requirement is covered by a different building block in another, usually higher, level of the stack. No building block is allowed to introduce known unmanaged risks for ZO, and each must guarantee that all errors and failures that may occur either have no business impact or are handled in another layer. To achieve this, a compliant design must be able to manage problems caused by planned events (like updates) as well as unplanned events (like failures) without service outage or noteworthy business impact.
The architectural guiding mechanisms used to achieve ZO targets should not negatively impact the characteristics of the solution provided. For instance, performance should not be degraded by design in order to guarantee ZO. In some circumstances, a slight degradation in performance may be acceptable during a failure if the intended use is still guaranteed and the targeted performance SLA is still achieved.
Building blocks are descriptive and not mandatory; there can be ZO implementations where some building blocks are absent or converged.
The simple “stack” of layers and their building blocks shown in the previous chapter is useful to get a list of topics to be addressed; however, the way the building blocks interact is more complicated. To get a clearer view of the architectural needs, the following figure shows the interdependencies and connections (physical and logical) between all parts of a ZO implementation. Connections allow data exchange and, depending on what they connect, can be APIs, communication protocols, etc. Knowing how building blocks interact is important, for example, to identify where compatibility issues could occur. The most important connections are shown in the following figure:
Connections and interdependencies can be uni-directional or bi-directional. There can be connections between components of the same type (for scalability reasons) as well as between different components (to combine functions like storage and computing). These connections affect availability. Whenever a version change is required, which may introduce risks, knowledge of these connections helps to identify compatibility issues. Compatibility is a core requirement to be considered when setting up or upgrading a landscape or parts of it.
We can distinguish between two categories of connections:
- Connections in the same layer:
  - Applications are connected to each other; data exchange between applications can happen on one system or between applications running on different systems (in the figure, this is indicated by the systems being in one “box”).
  - Systems have connections to the operating system (OS) and to database(s).
  - Server, storage, and networking components are all tightly interdependent. They may run on one or more virtualization layers.
- Connections between different layers:
  - As shown in the figure, the “System Layer” sits in the middle of the three layers and has connections to and from applications as well as to and from the infrastructure layer.
While connections within each layer must be working to achieve ZO, this is not sufficient, as connections between layers are just as important. So, if there is a change applied to one component, the compatibility with all connected components within and between layers needs to be ensured. Therefore, understanding and tracking the connections is required to run and maintain a ZO implementation without degradation.
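The compatibility requirement described above can be sketched as a small lookup over a connection map. This is only an illustration under stated assumptions: the component names and the `CONNECTIONS` structure are hypothetical and are not defined by ZOIS.

```python
# Hypothetical connection map: each pair is one connection between components.
# All names are illustrative only and do not come from ZOIS.
CONNECTIONS = [
    ("app_crm", "app_orders"),      # same layer: application to application
    ("app_orders", "system_a"),     # between layers: application to system
    ("system_a", "database_1"),     # same layer: system to database
    ("system_a", "server_1"),       # between layers: system to infrastructure
    ("server_1", "storage_1"),      # same layer: server to storage
]


def directly_connected(component, connections):
    """Return the components whose compatibility must be verified
    when `component` changes (its direct neighbours in any layer)."""
    neighbours = set()
    for left, right in connections:
        if left == component:
            neighbours.add(right)
        elif right == component:
            neighbours.add(left)
    return sorted(neighbours)


# Components to re-check if "system_a" is changed.
to_verify = directly_connected("system_a", CONNECTIONS)
print(to_verify)
```

The sketch only follows direct connections; a real landscape might also need to follow indirect dependencies.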
A Zero Outage architecture is normally required for the critical systems of a customer or service provider. Parts of a ZO architecture can of course also be used for non-critical environments if deemed adequate. Therefore, the design of a business service starts with defining the criticality of the services and of the infrastructure.
The goal of a ZO implementation is to avoid outages of the business solution whilst building blocks of the infrastructure may suffer a failure or require routine (planned) maintenance.
Ideally, each building block should have, for all functions it provides, specific resilience internal to its operations that can address all issues that could arise during its operation preventing outages to the building block itself.
However, it may not always be appropriate or cost-effective to avoid defects within each building block of a ZO implementation; the goal of “no business impact” nevertheless remains the target of the design. Where resilience is missing in one building block, it needs to be delivered by a different, explicitly specified layer.
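The rule that every building block is either resilient itself or has its resilience delivered elsewhere can be sketched as a simple check. The `BuildingBlock` record and its fields are assumptions made for this illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuildingBlock:
    # Hypothetical record; field names are assumptions for this sketch.
    name: str
    resilient: bool                      # block handles its own failures
    covered_by: Optional[str] = None     # other block that delivers the resilience


def unmanaged_risks(blocks):
    """Return names of blocks that are neither resilient themselves
    nor covered (directly or transitively) by a resilient block."""
    by_name = {b.name: b for b in blocks}

    def is_covered(block, seen):
        if block.resilient:
            return True
        # No cover named, or a circular chain of covers: treat as uncovered.
        if block.covered_by is None or block.covered_by in seen:
            return False
        cover = by_name.get(block.covered_by)
        return cover is not None and is_covered(cover, seen | {block.name})

    return [b.name for b in blocks if not is_covered(b, set())]


blocks = [
    BuildingBlock("storage", resilient=False, covered_by="database"),
    BuildingBlock("database", resilient=True),
    BuildingBlock("network", resilient=False),
]
print(unmanaged_risks(blocks))
```

In this example only the hypothetical "network" block is reported, because nothing in the design covers its missing resilience.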
The architectural and functional ZO characteristics described in this document need to be met independently of the installation type, be it on premise, cloud, or hybrid. The mechanisms used as countermeasures to possible problems must be chosen accordingly.
There are different categories of errors that will affect layers and building blocks of a ZO implementation differently. All of these must be taken care of. See the following figure:
For example, a catastrophic event might be an earthquake, an environmental shortage of resources might be a breakdown of power supply, and a misconfiguration might be an inappropriate setting of critical parameters in a firewall.
Based on their nature, problems caused by a specific event category will hit the different layers with different levels of impact and more or less directly, but all need to be handled properly in a successful ZO implementation. Of course, this list may not be complete and needs to be adapted to any future findings.
Catastrophic events are physical events like fire, flooding, or earthquakes. These are clearly unplanned events.
By environmental shortages we mean the deterioration or loss of a critical physical resource that is regularly supplied to the data centre. For example, an environmental shortage could be a breakdown of power supply or a lack of cooling. These are mostly unplanned events, but some may be more likely depending on the season, which should be considered in the planning phase.
If no appropriate countermeasures are taken, updates may require planned downtime or cause unplanned downtime:
- Updates in many cases cause planned downtime if the applications cannot be updated while running (server restart required, etc.).
  - This category of problems should definitely be avoided in a ZO-compliant architecture through proper design of all software components and the required processes, which should be designed (either inherently or via redundancy) to avoid downtime during upgrades.
- Updates can cause unplanned downtime:
  - One example is an update process that runs longer than expected or terminates with an unexpected result.
  - Another example is a version change causing incompatibility problems between components.
  - All such problems need to be addressed with version control and knowledge of the current compatibility between component versions, plus specific testing (incl. staging) when needed.
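Keeping knowledge about compatible version combinations could be sketched as follows. The component names, version strings, and the `TESTED_TOGETHER` set are hypothetical; the sketch assumes compatibility is recorded per pair of connected components.

```python
# Hypothetical compatibility knowledge: pairs of (component, version) that
# have been tested together, e.g. in staging. All values are illustrative.
TESTED_TOGETHER = {
    frozenset({("router_os", "2.1"), ("load_balancer", "5.0")}),
    frozenset({("router_os", "2.2"), ("load_balancer", "5.1")}),
}


def untested_pairs(landscape, connections, tested=TESTED_TOGETHER):
    """List connected component pairs whose current version combination
    has not been recorded as tested together."""
    problems = []
    for left, right in connections:
        pair = frozenset({(left, landscape[left]), (right, landscape[right])})
        if pair not in tested:
            problems.append((left, right))
    return problems


current = {"router_os": "2.1", "load_balancer": "5.0"}
planned = {**current, "router_os": "2.2"}   # update only the router software
links = [("router_os", "load_balancer")]

print(untested_pairs(current, links))   # current state: nothing untested
print(untested_pairs(planned, links))   # planned state: combination untested
```

An untested combination does not prove incompatibility; it only flags that specific testing is needed before the update.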
Bugs in software are a well-known problem. None of these are planned. In a ZO implementation, they must be prevented from causing outages through robust SW design practices within each component, and through an architectural design of the ZO solution that has resiliency built in.
Hardware replacement may be necessary because of obsolescence or failure. Hardware defects are often confined to specific samples, but at times they can be present with greater probability in a certain lot, or affect the whole production. Hardware defects can cause downtime and can also require planned events (scheduled replacements, for example), which in turn could have ill effects, too. Failure of a specific sample shall not cause outages for the functions provided; this is achieved by countermeasures such as inserting redundancy in the design. Also, the whole replacement process should be designed such that it does not cause downtime (see also “serviceability”).
Misconfigurations are not planned but can easily happen, and they can happen for any part that can be configured. A misconfiguration is harder to detect if a configuration works for the configured component (software or hardware) but does not work in the end-to-end process. One example is configuring the usage of Unicode, which may cause problems in other systems that are not Unicode-enabled. Another is changing a system parameter, like a port number, that affects a URL used to call its function.
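The port-number example can be illustrated with a small end-to-end consistency check. The host name and port values below are hypothetical, and the default ports for `http` and `https` are an assumption of this sketch.

```python
from urllib.parse import urlsplit


def url_matches_port(consumer_url, provider_port):
    """Check that the port a consumer calls equals the port the provider
    is configured to listen on. Default ports are assumed for http/https."""
    parts = urlsplit(consumer_url)
    default_ports = {"http": 80, "https": 443}
    called_port = parts.port or default_ports.get(parts.scheme)
    return called_port == provider_port


# Hypothetical consumer configuration pointing at a provider system.
url = "https://orders.example.internal:8443/api"
print(url_matches_port(url, 8443))   # provider still listens on 8443
print(url_matches_port(url, 9443))   # provider port was changed to 9443
```

The second call shows the situation described above: the provider's own configuration is valid, but the end-to-end call no longer works.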
Security is one category that can affect all layers and building blocks. It cannot be solved purely on an architectural level, but strongly relies on people and processes.
Given the breadth of the topic, an entire section of the Zero Outage Industry Standard is dedicated to security.
We have seen that every part of a Zero Outage implementation can be affected by events causing outages. Since an outage of any layer will affect the whole solution, each single building block, including the connections of building blocks with each other, should be made as robust as possible against adverse events. While the ZO goal might be simpler to achieve by addressing redundancy within the same building block or corresponding layer, it may also be possible to provide redundancy mechanisms within different building blocks. Typically, these reside in a higher layer of the infrastructure.
It is important to note that when we mention “countermeasures” we are not referring only to a reactive approach, like making sure that the ZO solution reacts properly to an event. Countermeasures need to be considered in the different phases of the life-cycle (Plan, Build, Deliver, and Run), and the same event can be handled with different countermeasures in the different phases. For instance, redundancy of components must be planned in detail during design (Plan phase), but can only work if the components are properly tested (Build phase) and if people and processes (Run phase) are properly defined to take advantage of what was done in the design phase.
The following figure depicts an example of one planned and one unplanned event possibly causing problems for a layer of a ZO implementation and some examples for countermeasures to these problems including the phase where to implement the countermeasure.
As an example of a problem affecting an element of a ZO implementation plus the countermeasure implemented, building block “Network” is described on the level of components: Cabling, routers, and load balancer. Replacing a router or its software can cause problems if some connected components can’t deal with the new version. The countermeasure would be good planning based on knowledge of the compatibility in the network, and a proper replacement process. For the example of a cable being broken, a redundant, tested cabling is used as a countermeasure.
Note that more than one countermeasure may address the same problem and some countermeasures address several problems.
We have seen that problems can be:
- Generic – like a power outage, which will affect multiple layers and can be tackled at the level of each component.
- Specific – like an application version causing incompatibility with other connected components, which must be solved at the level of the specific applications or their APIs.
As a general principle of the countermeasure approach, at least generic problems should be addressed on the level of each affected component or at a higher level. For example, you can have redundant storage devices in each server or use (more) redundant servers with the same content as a countermeasure to avoid outages caused by a single storage failure.
No matter which mechanisms are used in a concrete Zero Outage implementation, there are some key features, described in the following, that need to be implemented in every Zero Outage implementation.
All functions need to have mechanisms to deal with a defect of any part. However, the design measure that prevents a Single-Point-of-Failure can be placed anywhere in the design of the solution. Redundancy is one of the most important ways to deal with these types of issues. As an example, the overall design must take care of failures in network connections within a data centre or between data centres. Robustness against this scenario can be provided by measures ranging from redundant cable interconnects up to a full mirror of the infrastructure in a second data centre.
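Finding a Single-Point-of-Failure can be sketched as a brute-force check: remove each component in turn and test whether two endpoints can still reach each other. The network in `LINKS` is hypothetical, and the sketch only considers failing components, not failing individual links such as cables.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search over undirected links, skipping `removed`."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue, seen = deque([start]), {start}
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Components whose individual failure disconnects start from goal."""
    nodes = {n for link in links for n in link} - {start, goal}
    return sorted(n for n in nodes
                  if not reachable(links, start, goal, removed=n))


# Hypothetical network: two switches in parallel, but only one router.
LINKS = [("server", "switch_a"), ("server", "switch_b"),
         ("switch_a", "router"), ("switch_b", "router"),
         ("router", "storage")]

print(single_points_of_failure(LINKS, "server", "storage"))
```

Here the single router is reported; adding a second, independently connected router would remove it from the result.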
A ZO implementation requires a high level of control over all building blocks; this can only be achieved if there are clear boundaries for all parts of that implementation. These boundaries shall make it possible to see clearly which parts (systems, components, services, etc.) belong to a ZO implementation and which do not. An appropriate diagram of the included layers and building blocks can help greatly.
The boundaries should cover all aspects of the ZO product, including hardware, software, people, and brought-in services. Services that are brought in to form part of the ZO implementation should have suitable OLAs (Operational Level Agreements) and SLAs (Service Level Agreements). Examples of such definitions are given later.
The Zero-Outage Industry Standard (ZOIS) makes no direct claim for a certain level of performance. Nevertheless, the following considerations apply:
- Obviously, the performance of the ZO implementation must be sufficient for the intended use. This is important to keep in mind, because some measures meant to ensure availability might negatively affect the performance. An example would be processes to ensure data integrity and consistency in distributed storage.
- Also, the performance of each single component must ensure that the internal processes of the concrete ZO implementation keep working. Excessive latency is one example: it can cause outages when time-out thresholds are hit.
- In addition, specific features of a ZO implementation require special attention to performance. During a failure event, the infrastructure may run with fewer available resources than normal. We call this a “ZO Risk Phase”. Even in this scenario, the performance of all parts needs to be sufficient to still ensure ZO capability.
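The performance requirement for a “ZO Risk Phase” can be sketched as a capacity check with failed units removed. The capacity figures and units (for example requests per second) are hypothetical, and the worst-case assumption that the largest units fail is a choice made for this sketch.

```python
def survives_risk_phase(unit_capacities, peak_load, failed_units=1):
    """Return True if the remaining units can carry `peak_load` after the
    `failed_units` largest units are lost (worst case)."""
    keep = max(len(unit_capacities) - failed_units, 0)
    remaining = sorted(unit_capacities)[:keep]
    return sum(remaining) >= peak_load


# Hypothetical capacities and peak load, all in the same unit.
print(survives_risk_phase([100, 100, 100], peak_load=180))  # two units remain
print(survives_risk_phase([100, 100], peak_load=180))       # one unit remains
```

With three units the load is still carried after one failure; with two units it is not, so the design would not keep ZO capability during the risk phase.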
A ZO implementation should include the ability to detect issues, and in particular to detect them early enough, before they cause failures and outages affecting the business service. This requires comprehensive monitoring of the complete ZO implementation (incl. analytical and maybe cognitive capabilities).
Security is independent of all other features, since attacks might target almost any layer of a ZO implementation. Security issues can cause outages in the system (e.g. denial-of-service) or lead to the system being compromised. This must be taken care of in the architecture by only using parts that allow state-of-the-art security to be run, and must be accompanied during delivery, implementation, and operation by all the requirements detailed in the ESARIS security framework.
Serviceability [see https://en.wikipedia.org/wiki/Serviceability_(computer)], or supportability, is not so much a functional aspect addressing the business functions provided by a ZO implementation, but the set of measures by which the functions are kept working.
One aspect specific to the ZOIS approach is the replaceability of all components in a ZO implementation. This quality is closely related to redundancy. Since parts can fail, replacement must either be possible as an “online function” without outages or degradation in operational performance or, if a lower level of redundancy is implemented, all functions must be distributed over a sufficient number of components so that the failure of one of them does not impact the overall functionality.
After planning, designing and building a ZO implementation, in order to be able to keep it up-and-running, you need to control it. We need at least three things to achieve this:
- Landscape data and infrastructure data: Knowing your implementation by having information available about all components, their versions, relations and dependencies.
- Monitoring: You can only control what you can measure. You need to monitor the infrastructure for abnormal behaviour and send notifications of such situations to resolution teams or dedicated systems, ideally before a failure occurs. To have the required data, control mechanisms must be part of the ZO implementation’s architecture. This is a prerequisite for keeping any ZO implementation within the defined parameters.
- Staging / Test Lab: Testing that any new software and hardware really works. This can only be done end-to-end in an environment similar to that used in production; however, this test system must be separated sufficiently from production, so that there cannot be side effects from failed tests.
Whenever a change to a component within the landscape is planned, an impact analysis must be done to determine potential impacts on other building blocks or on the business solution overall. This requires infrastructure data, system data, and application version data; we call this set of information “Landscape Data”. Such data must also include relationships between building blocks working together on the same or different levels.
Landscape data includes multiple sets of information, like the following:
- System data, including relations to host, database host, and application versions running in the system
- Databases, including type and version
- Dependencies between systems:
  - Between the system roles of development, test, quality assurance, consolidation, and production: these roles usually need a well-defined sequence of updates, which has to be planned accordingly.
  - Between systems that are used in one business process and exchange data: for example, data sent from a system used to handle customer information needs to be “understood” by the system used to create the purchase.
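Landscape data of this kind, together with a check of the update sequence across system roles, could be sketched as follows. The `SystemRecord` fields, the integer version scheme, and the assumption of one system per role are simplifications made for this illustration.

```python
from dataclasses import dataclass

# Update sequence of system roles, taken from the phases named in the text.
ROLE_ORDER = ["development", "test", "quality assurance",
              "consolidation", "production"]


@dataclass
class SystemRecord:
    # Hypothetical landscape record; field names are assumptions.
    name: str
    role: str            # one of ROLE_ORDER
    host: str
    database: str        # database type and version
    app_version: int     # simplified to an integer for comparison


def sequence_violations(systems):
    """Return (earlier, later) name pairs where a system later in the
    sequence runs a newer application version than the one before it."""
    ordered = sorted(systems, key=lambda s: ROLE_ORDER.index(s.role))
    return [(a.name, b.name)
            for a, b in zip(ordered, ordered[1:])
            if b.app_version > a.app_version]


systems = [
    SystemRecord("dev", "development", "host1", "exampledb 12", 3),
    SystemRecord("tst", "test", "host2", "exampledb 12", 3),
    SystemRecord("prd", "production", "host3", "exampledb 12", 2),
]
print(sequence_violations(systems))
```

An empty result means that no system received a version before the systems earlier in the sequence.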
Infrastructure data is just as important and needs to include information such as:
- Information on your virtualization layer
- All servers including versions
- All storage devices including versions
- All network components including versions
- Data related to items like power supply, conditioning systems, and the like may also be important
Any proposed changes to a ZO implementation must be planned using the latest, up-to-date data of the existing landscape. A consistent target state can only be achieved based on current information on the state of an implementation.
Monitoring is based on landscape and infrastructure data and needs to include all building blocks of all layers and must be woven into the ZO implementation. For that purpose, landscape and infrastructure data are required to establish a baseline for monitoring. Monitoring must work on all levels: It must provide information on low-level building blocks and must also include information on processes working across multiple building blocks.
The following figure shows the layers and which minimum monitoring types are required to ensure ZO capabilities; detailed knowledge across layers needs to be available, and this must be constantly kept updated, to allow for effective monitoring.
Monitoring shall be accurate enough to allow for predictive maintenance and not just to react on occurring problems: by monitoring parameters with sufficient “granularity” in terms of values and in terms of sampling time, it is possible to derive curves that highlight trends which could lead to degraded performance or failures.
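Deriving a trend from monitored samples can be sketched with an ordinary least-squares line. The sample values, the threshold, and the time unit are hypothetical, and a linear trend is itself an assumption; real degradation curves may look different.

```python
def time_to_threshold(samples, threshold):
    """Fit a least-squares line to (time, value) samples and return the time
    remaining, from the last sample, until the line reaches `threshold`.
    Returns None if there are too few samples or the trend is not rising."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    return max(crossing_time - samples[-1][0], 0.0)


# Hypothetical disk utilisation in percent, sampled once per day.
usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
print(time_to_threshold(usage, threshold=80.0))
```

The estimate can then be compared with the time a resolution team needs to intervene, which is what makes the maintenance predictive rather than reactive.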
The monitoring can be grouped as follows:
- Infrastructure management: You need to be aware of the status of all elements in the IT infrastructure layer. Monitoring of critical resources (like CPU and memory utilization) is required. All critical control points, like ports and interfaces, should also be monitored. This domain should also include the management of the physical data-centre infrastructure, including the connections to the outside world.
- Landscape management: Monitoring and management of the entire systems landscape including taking into account connections between systems.
- Application/Process monitoring gives the end-to-end view of your business applications and processes. Monitoring the status of the single applications, of the running processes and the amount of critical resources that these are consuming is key to provide indication of adverse events that have happened or are about to happen.
Monitoring is a prerequisite for predictive maintenance, which is highly desirable in a ZO implementation. Predictive Maintenance is possible if:
- Architecture and design allow non-intrusive maintenance to be performed (which, as stated in previous chapters, is a fundamental requirement for a ZO implementation)
- Monitoring is detailed and granular enough to detect trends
- The speed of intervention is higher than the speed of degradation detected by monitoring
- The design is such that the complete failure of the parameter under monitoring does not cause any outage (built-in redundancy)
The connections between components are monitored through the “attachment points” on the components implementing or using them. Normally, both components need to monitor the incoming and outgoing state of their connections to other components. If more than one connection is used, all involved elements need to be monitored together.
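Combining the views from both attachment points of a connection can be sketched as follows. The report structure with `incoming_ok` and `outgoing_ok` flags is an assumption of this sketch, as are the component names.

```python
def connection_state(endpoint_reports):
    """Combine the per-endpoint views of one connection.
    `endpoint_reports` maps component name -> dict with boolean
    'incoming_ok' and 'outgoing_ok' as seen at that attachment point."""
    failing = sorted(
        name for name, report in endpoint_reports.items()
        if not (report["incoming_ok"] and report["outgoing_ok"])
    )
    return ("ok", []) if not failing else ("degraded", failing)


# Hypothetical reports from the two components sharing one connection.
reports = {
    "system_a": {"incoming_ok": True, "outgoing_ok": True},
    "database_1": {"incoming_ok": False, "outgoing_ok": True},
}
print(connection_state(reports))
```

The connection counts as healthy only if every involved endpoint reports both directions as working; otherwise the endpoints that see a problem are listed.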
Staging makes it possible to test any changes in component versions (hardware as well as software) before using them productively. The staging area must be completely isolated from the productive area. When we talk about a ZO implementation, staging means a reduced landscape that is representative of the production environment and that can be used for testing and verification. The staging landscape must be suitable for testing versions of any building block.
The following figure depicts a generic ZO landscape.
In a virtual software environment, isolation is not as obvious as shown in the figure above. Still, you need to find a way to test changes in a ZO implementation in isolation, without affecting productive use.
In principle, staging is a very comprehensive approach to testing in which the change is tested in a setting as close as possible to the production environment, to ensure the maximum degree of compatibility and risk reduction. Staging is not always a must if the same level of risk reduction can be achieved with other mechanisms.
We will highlight a few examples of countermeasures in the following text, to show that different countermeasures can be applied to a problem, and that they can be applied in different phases of the ZO implementation life-cycle.
More details of the different classes and countermeasures impacting the different building blocks are described in the corresponding building block sections.
Catastrophic events are problems physically affecting the implementation. Some of them affect a larger geographical area, so the countermeasures need to take the geographical location into account. For example, while the effect of an earthquake can be handled in the physical ZO implementation by making the building resistant to earthquakes, this may also be accompanied by installing a redundant data centre. The second data centre in this case needs to be at such a distance that it will not be affected by the same catastrophic event as the first one. Setting up the redundancy also needs to take into account all connections between the data centres, so that a switch between the operating data centres can be performed at any time without an outage.
In most cases, environmental shortages need to be solved by adding redundancy to the source of critical supply provided to the data centre, for instance:
- To avoid power outages, it is recommended to provide redundant power supply, backup power generators, or temporary sources of power like batteries and UPS systems
- Cooling systems should also be made redundant, using more cooling units than strictly needed, so that one or more failures do not jeopardize the data-centre temperature. If the source of cooling is water, provision should be made for redundant sources of it.
- A loss of data connection caused, for example, by a cable cut during street maintenance requires a separate connection on a diverse fibre path, for instance one entering the building via a completely different and disjoint route. For better coverage, street maintenance in the vicinity of the data centre might even be planned together with the responsible authorities.
For software bugs, there are numerous approaches to avoid them or their ill effects in a productive environment. Software development, and the processes involved, is not so much a platform topic. However, it obviously affects the ZO implementation.
For hardware, its specific quality (measured as the probability of failure, or MTBF) needs to be considered when planning redundancy levels. As an example, wear-out in hardware might be addressed by selecting high-quality hardware to minimize unexpected downtime, or by increasing the level of redundancy if less reliable items are used. Also, a process to exchange defective hardware must be in place. Finally, testing (incl. staging) may be used to identify problems in version compatibility.
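The trade-off between hardware quality and redundancy level can be illustrated with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The repair time (MTTR), the numeric values, and the assumption that units fail independently are additions made for this sketch; defects that affect a whole production lot, for example, would violate the independence assumption.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def units_needed(mtbf_hours, mttr_hours, target, max_units=10):
    """Smallest number of parallel units so that at least one is available
    with probability >= target, assuming independent failures.
    Returns None if `max_units` is not enough."""
    unavailable = 1.0 - availability(mtbf_hours, mttr_hours)
    for n in range(1, max_units + 1):
        if 1.0 - unavailable ** n >= target:
            return n
    return None


# Hypothetical figures: same repair time, different hardware quality.
print(units_needed(mtbf_hours=1000, mttr_hours=10, target=0.9999))
print(units_needed(mtbf_hours=100, mttr_hours=10, target=0.9999))
```

Under these assumed figures, the less reliable hardware needs more parallel units to reach the same target, which is the relationship the paragraph above describes.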
Let’s use misconfiguration to discuss how a problem class needs to be addressed on different levels and in different phases of the life-cycle. To prevent misconfigurations from causing outages:
- In the plan phase, clearly design and document configurations. Ensure that all components are provided with this information
- In the build phase, make sure that design details are documented exactly and tested. The best option is to test all changes end-to-end in the staging environment. As a minimum, the changed component and the ones to which it connects should be tested
- In the deliver phase, make sure that implementation details (like configurations) are thoroughly tested and documented
- In the run phase, no configuration changes should be implemented without a verification in a staging area first
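The run-phase rule in the list above can be sketched as a gate that only accepts configurations verified in staging. Identifying a configuration by a SHA-256 fingerprint of its canonical JSON form is an assumption of this sketch, as are the configuration keys.

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable SHA-256 fingerprint of a configuration dictionary."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def may_apply_in_production(config, verified_in_staging):
    """Allow a change only if exactly this configuration was verified
    in staging (its fingerprint is in `verified_in_staging`)."""
    return config_fingerprint(config) in verified_in_staging


# Hypothetical configuration that passed verification in staging.
staged = {"port": 8443, "charset": "UTF-8"}
verified = {config_fingerprint(staged)}

print(may_apply_in_production({"charset": "UTF-8", "port": 8443}, verified))
print(may_apply_in_production({"charset": "UTF-8", "port": 9443}, verified))
```

Key order does not matter because the fingerprint is computed over a sorted form; any change in a value produces a different fingerprint and is rejected until it has been verified.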
Now that we have a list of problem classes, we can distinguish whether countermeasures to problem categories are generic or business-process-specific in nature, because some countermeasures will influence the functions of the implementation. This differentiation is necessary to give people interested in ZOIS ideas on how to use our approach reasonably.
ZOIS can apply to implementations independently of the deployment model used for their applications, be it on premise, cloud, or hybrid.
Note that while all deployment models are valid, each has specific characteristics in the way it is handled, so the concrete deployment of any ZO implementation needs to be clearly defined to allow appropriate handling.
Handling, especially in hybrid and cloud deployments, will usually be distributed between several parties: the users of the business solution and the providers of layers as a service. Following our architectural model, there are the following models of consumption and related responsibilities:
- Owning all layers: The customer (usually a company) using the business solutions is the same party that implements and maintains them.
- Consuming Infrastructure as a Service (IaaS): The infrastructure layer is consumed from a provider, system and apps/process layer are owned.
- Consuming Platform as a Service (PaaS): Infrastructure and system layers are consumed from a provider, apps/process layer is owned.
- Consuming a Business Solution as a Service: Here, all layers would be consumed.
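The consumption models listed above can be written down as a small lookup that shows which layers remain with the customer. The dictionary keys and layer labels are illustrative names chosen for this sketch.

```python
LAYERS = ("infrastructure", "system", "business solution")

# Which layers are consumed from a provider in each consumption model,
# following the list above. The dictionary keys are illustrative labels.
CONSUMED = {
    "own all layers": (),
    "IaaS": ("infrastructure",),
    "PaaS": ("infrastructure", "system"),
    "business solution as a service": LAYERS,
}


def owned_layers(model):
    """Layers that remain with the customer for a consumption model."""
    return [layer for layer in LAYERS if layer not in CONSUMED[model]]


for model in CONSUMED:
    print(model, "->", owned_layers(model))
```

As the following paragraphs point out, such a table describes consumption only; it does not settle who is responsible for operating each layer.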
So even if there are clearly separated deployment models and responsibilities in ZO implementations, the handling of responsibilities is not part of the architecture document for the following reasons:
- Independently of the deployment model, the architecture needs to fulfil the ZO architectural needs on all layers
- The responsibilities cannot be clearly assigned to the deployment model: all layers can be handled by one company in an on-premise deployment, but in a private cloud as well.
- Even in an on premise implementation, a 3rd-Party provider can be responsible for the infrastructure layer or certain systems, for example.
On a very generic level we can say that availability is the key asset of a ZO implementation. This, however, can be achieved in two ways:
- Avoiding errors by design and putting very high quality into the implementation of parts, installing high levels of control, etc. – we can call this approach “quality-focused”.
- Reducing the impact and duration of errors, making the implementation resilient so that it returns to the desired state with minimum effort – we can call this approach “resiliency-focused”.
Each of these approaches has its own effects on the characteristics of the implementation. With ZOIS we strive for the best implementation possible, so very probably we need to use ideas from both approaches in a ZO implementation.
 [From https://en.wikipedia.org/wiki/Countermeasure] A countermeasure is a measure or action taken to counter or offset another one. As a general concept, it implies precision, and is any technological or tactical solution or system […] designed to prevent an undesirable outcome in the process. […]