Zero Outage Design Principles
The establishment of design principles for IT elements or entire topical areas, e.g., cloud or on-premise environments, is one of the most formidable challenges in the IT business. From our point of view, it is vital to address this issue right at the start.
This chapter provides an overview of our understanding of design principles and explains why we consider them both beneficial and necessary for meeting the current challenges in the IT business world.
First, it is important to establish a harmonized view on what design principles mean.
One important cornerstone of the Zero Outage Quality Standard entails design principles for IT-elements and IT-infrastructures. On the one hand, we need design principles that describe the features an IT-element must have in order to achieve the Zero Outage Industry Standard. On the other hand, we need to compile the principles required for becoming, for instance, a Zero Outage cloud environment that facilitates agile and proactive operations, with the aim of tackling the challenge of meeting current customer expectations.
The Zero Outage Design Principles are geared to the technical features of the IT-elements and related services and make them work collaboratively in complex IT-environments. We are not focused on specific technology or how, for example, high availability or resilience will be implemented by various companies.
The Zero Outage Industry Standard will describe the necessary combination of features and services in conjunction with IT-elements in order to contribute to the whole Zero Outage service.
From the technical point of view, two different types of Zero Outage Design Principles exist:
- General Zero Outage Design Principles that suit all Zero Outage IT-elements, e.g., high availability of power supplies
- Specific Zero Outage Design Principles that apply to specific IT-elements or IT-environments, such as LAN or storage devices
There is another crucial point to observe. Technical design principles lose much of their value if the features they describe are not implemented across the whole IT-environment and the operational model.
An additional point is the need to look at these design principles from the angle of the operational model. Only in combination with simplified and standardized day-to-day operations and best practices for the further development of the IT-infrastructure can the customers’ expectations be fulfilled in the long run.
Viewed from a life-cycle aspect, there are currently three phases:
- Per-domain Zero Outage requirements (general and specific principles of redundant design, as detailed in the specific per-domain chapters) are to be met by all platforms (hardware, software) involved in the solution.
- Each platform should adhere to the agreed-upon interfaces and APIs, where applicable.
- A validation plan should be prepared and approved. Proper validation documentation (reports) shall be provided at the end of the phase.
- A deployment plan should be prepared and approved. Proper checks and tests at the end of the deployment are to be provided to confirm a correct deployment.
- Proper support is to be provided during an agreed soak-time period after the initial build phase.
- This phase consists mainly of all life-cycle management activities, e.g., standard changes.
- CIP (continuous improvement processes), for instance, automation procedures, are used for increasing the efficiency of the on-premise or cloud operations.
These three phases of the IT solution life-cycle require different Zero Outage Design Principles.
The general idea here is to establish reference models based on the Zero Outage Design Principles for various solutions, with the goal of guaranteeing the highest level of resilience and quality. A reference model with an open design standard is necessary so that proactive IT-environments can be established, allowing new features to be applied faster and with more flexibility.
General Zero Outage Design Principles describe best practices in connection with IT-products, being universal for all elements and services. These principles are directed at both the general technology-based features and the prerequisites required for having them implemented in a shared IT-infrastructure in an effective way.
Apart from the definition of the design principles, we will establish best practice certification procedures for testing, among other things, the high availability of the elements in several situations. This results in a two-step approach:
- Define and describe the design principles
- Define and provide test procedures for certification of the design principles
With this method, you will be in the position to increase the quality and resilience of your IT-environment.
Our target is to start with the definition of these principles for the upcoming release. The following chapter provides an outlook on how such Design Principles may look and on the current status of our discussion.
General Zero Outage Design Principles are the basis of all Zero Outage IT-elements. Together with the specific Zero Outage Design Principles, they allow us to define reference models and certification procedures.
This list provides an initial view of the general Zero Outage Design Principles:
- On the IT-element level, the following redundancy points need to be considered:
- Redundant power supplies with online replacement procedures
- Redundant interfaces with online replacement procedures
- Battery back-ups
- Reliable IT-elements with a minimum number of installed base systems
- All types of upgrades, patches, and the like have to be non-disruptive
- A health check procedure that provides a brief system status has to be available
- Online replacement procedures without disruptions need to be available
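The element-level redundancy points listed above lend themselves to an automated check. The following is a minimal sketch, assuming a hypothetical inventory record; the class `ITElement`, its field names, and the threshold of two units per component are illustrative assumptions and are not defined by the Zero Outage standard.

```python
from dataclasses import dataclass


@dataclass
class ITElement:
    """Hypothetical inventory record; all field names are illustrative."""
    name: str
    power_supplies: int
    interfaces: int
    has_battery_backup: bool
    nondisruptive_upgrade: bool


def redundancy_findings(element: ITElement) -> list[str]:
    """Return the element-level redundancy points that are not met."""
    findings = []
    # Assumption: "redundant" means at least two units are installed.
    if element.power_supplies < 2:
        findings.append("power supply is not redundant")
    if element.interfaces < 2:
        findings.append("interface is not redundant")
    if not element.has_battery_backup:
        findings.append("no battery back-up")
    if not element.nondisruptive_upgrade:
        findings.append("upgrades are disruptive")
    return findings


# Example element with a single interface and otherwise full redundancy.
array = ITElement("storage-array-01", power_supplies=2, interfaces=1,
                  has_battery_backup=True, nondisruptive_upgrade=True)
print(redundancy_findings(array))  # ['interface is not redundant']
```

An empty result would indicate that the element meets the points covered by this sketch; a real check would also have to cover the online replacement procedures, which cannot be derived from inventory counts alone.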
On-premise or cloud level:
- All connections have to have the following redundancies and have to be checked regularly:
- Redundant cable paths
- Redundant virtual paths
- Redundant drive paths
- Fully redundant supply-chain to guarantee supply continuity
- A redundancy and resilience check needs to be defined
- A standardized risk report (elements and general view of the platform) needs to be implemented
- The on-premise or cloud environment has no out-of-support components, neither hardware nor software
- The architecture of the infrastructure does not contain single-point-of-failure (SPOF) situations (this should be checked on a yearly basis)
- The environment is built according to the configuration recommendations of the hardware and software vendors.
- Any deviation will be analyzed by the related vendor(s) – a positive analysis result will be documented with a certificate.
- The environment is regularly tested end to end and these results are to be documented. These tests also include:
- redundancy tests
- failover procedures
Our objective is to complete this list in the next release, with your support.
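The single-point-of-failure check mentioned in the list can be illustrated on a simplified topology. The sketch below is one possible approach under stated assumptions: the environment is modeled as an undirected graph of hypothetical component names, and a component counts as a SPOF if removing it disconnects a given source from a given target. It uses a brute-force search for clarity and is not a procedure prescribed by the standard.

```python
from collections import deque


def reachable(adjacency, start, goal, removed):
    """Breadth-first search that ignores the node given in `removed`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links, source, target):
    """Return the nodes whose loss disconnects source from target."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # The end-points themselves are not tested as candidates.
    candidates = sorted(set(adjacency) - {source, target})
    return [node for node in candidates
            if not reachable(adjacency, source, target, removed=node)]


# Hypothetical topology: dual switches, but only one core device.
links = [
    ("server", "switch-a"), ("server", "switch-b"),
    ("switch-a", "core"), ("switch-b", "core"),
    ("core", "storage"),
]
print(single_points_of_failure(links, "server", "storage"))  # ['core']
```

For large topologies, an articulation-point algorithm would be more efficient than removing each node in turn; the brute-force variant is shown here only because it is easy to verify by hand.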
In addition to the general design principles for all IT-elements, design principles for specific types of elements, such as storage devices, server devices, and the like, will be defined. The following chapter depicts the current status and includes ideas from our discussions on how Zero Outage Design Principles may look.
The specific Design Principles are layered on top of the general Design Principles.
This chapter will provide an initial list of these specific design principles for storage devices.
Operational or life-cycle-oriented design principles for storage devices:
- Availability of the following online operational activities:
- Migration procedure for hardware replacement
- Up and down scaling functionalities, for example, in connection with the workload (I/O, CPU power, ...) increase or decrease
- Online implementation of software updates, patches, and more
- Predefined rollout procedures for software updates, patches, etc. in cloud environments, for instance
- Check-procedures for storage decommissioning
- Single point of failure check for main features, such as checking for mirror consistency
- Check-procedure for missing data replication or back-up
- Definition of standards for application-integrated back-up, such as SAP or Oracle integration
- Automatic capacity on demand with call-back functionality
- Secure and traceable purging of data from replaced disks
- Establishment of runbooks or best practice guides for regular maintenance tasks
- Call-back functionality for hardware and software failures
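The check-procedure for missing data replication or back-up named in the list could take roughly the following shape. This is a minimal sketch under assumptions: the volume records, their keys (`name`, `replica`, `last_backup`), and the 24-hour back-up age limit are hypothetical and would have to be replaced by data from the actual storage platform and by the agreed service levels.

```python
from datetime import datetime, timedelta


def protection_gaps(volumes, now, max_backup_age=timedelta(hours=24)):
    """Map each unprotected volume name to the list of detected gaps."""
    gaps = {}
    for volume in volumes:
        problems = []
        if not volume.get("replica"):
            problems.append("no replication target")
        last_backup = volume.get("last_backup")
        if last_backup is None or now - last_backup > max_backup_age:
            problems.append("back-up missing or stale")
        if problems:
            gaps[volume["name"]] = problems
    return gaps


# Hypothetical inventory; a fixed "now" keeps the example reproducible.
now = datetime(2024, 1, 10, 12, 0)
volumes = [
    {"name": "vol-erp", "replica": "vol-erp-dr",
     "last_backup": datetime(2024, 1, 10, 2, 0)},
    {"name": "vol-logs", "replica": None,
     "last_backup": datetime(2024, 1, 10, 2, 0)},
    {"name": "vol-archive", "replica": "vol-archive-dr",
     "last_backup": datetime(2024, 1, 7, 2, 0)},
]
for name, problems in protection_gaps(volumes, now).items():
    print(name, problems)
```

In this example, `vol-logs` is reported for its missing replication target and `vol-archive` for its outdated back-up, while `vol-erp` passes both checks.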
This chapter will provide an initial list of specific design principles for network devices.
Technological or feature-oriented design principles for network elements:
- Redundant memory and routing engines (processors)
- Redundant cards and ports
- Redundant links
- fiber path diversity, to avoid single points of failure due to fiber cuts
- multi-homing, for the avoidance of single connections between connection end-points
- Implementation of geo-redundancy to establish a robust WAN connectivity
- In-building redundancy through segregation into different data center areas, for example, for fire protection
- Online simulations for cluster tests for network elements to ensure functionality (firewalls, switches, load balancers, configuration files)
- Support for the implementation of proprietary features of one vendor in a multi-vendor environment
- Call-back functionality for hardware and software failures with pattern detections
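Fiber path diversity, as listed above, can be verified by comparing the physical routes of links that are meant to be redundant. The following sketch assumes that each link is documented as a list of physical segment identifiers (ducts, manholes, and so on); the link names and segment identifiers are invented for illustration.

```python
from itertools import combinations


def diversity_violations(links):
    """Return pairs of links that share at least one physical segment.

    `links` maps a link name to the list of segment IDs it runs through.
    """
    violations = {}
    for (name_a, segs_a), (name_b, segs_b) in combinations(sorted(links.items()), 2):
        shared = set(segs_a) & set(segs_b)
        if shared:
            violations[(name_a, name_b)] = sorted(shared)
    return violations


# Hypothetical WAN links; two of them run through the same duct.
links = {
    "wan-primary": ["duct-12", "duct-40", "manhole-7"],
    "wan-secondary": ["duct-18", "duct-40", "manhole-9"],
    "wan-tertiary": ["duct-21", "duct-55"],
}
print(diversity_violations(links))
# {('wan-primary', 'wan-secondary'): ['duct-40']}
```

A shared segment such as `duct-40` means that a single fiber cut could take down both links, so the pair would not count as redundant in the sense of the list above.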
Taking a closer look at the description of our scope of Zero Outage Design Principles, you will observe that our main emphasis is not on new technologies. The objective of the Zero Outage Design Principles initiative is an ideal implementation of the features of the IT-elements in a shared environment, within the Zero Outage Quality Standard. Additionally, we will support this by means of general, practice-oriented certification procedures.