General challenges in the life-cycle of shared IT infrastructures
We use release and patch implementation and health checks as examples of how process best practices for “General challenges in the life cycle of shared IT infrastructures” should look. These technical procedures, which are closely related to the implemented process framework, are a key factor in achieving the Zero Outage quality standard.
The following chapters build on well-structured, advanced procedures for individual IT elements in IT environments. The challenge lies in connecting these element-specific solutions into E2E solutions.
There is a three-step approach to best practices:
- Start with the situation at hand: define the current best practice from an E2E perspective as a “best practice lowest common denominator”, with a strong focus on interfaces, prerequisites, and dependencies.
- Design an FMO (future mode of operation), a best-case solution from the customer, vendor, and service provider perspectives for the specific challenge, and determine an applicable, easy-to-realize solution.
- Share the new best practice together with its implementation procedure.
The examples of release and patch implementation in shared-IT infrastructures and health checks will provide the first-phase results of the three-step approach.
Release and patch implementation in shared-IT infrastructures, for instance in cloud environments of all kinds, is a major challenge for customers, vendors, and service providers. It is a time-consuming process; moreover, a risk remains even if all parties apply all of their quality assurance measures.
From the point of view of technology and the associated operational model, the various cloud models differ widely, for instance the open cloud and the so-called enterprise cloud. The implementation procedures for new software vary between the cloud models. In this work, we focus on shared-IT environments, which we call enterprise cloud or on-premise models. The next publication will emphasize the implementation procedures for open-cloud models.
On-premise or enterprise cloud environments provide diverse services that are not designed on an open-stack architecture. Services of this kind, such as database services, depend mainly on resilient, highly available hardware. Owing to the enormous number of different applications and customer demands, these environments need to be extremely flexible and adaptable to new features. They play an important role in IT consolidation for global companies. As applications outside the open-stack architecture will remain in use in the long term, enterprise cloud or on-premise environments remain necessary. This leads us to hybrid cloud models.
There are some general challenges for the “legacy” cloud environment in terms of patch and release implementations:
- Each element of the shared-IT environment, for instance, storage devices, network devices, or operating systems, has a different implementation procedure
- Each vendor delivers updates, patches, and bug fixes in a different way and at varying time intervals
- Coordinating downtimes and risk windows is becoming more and more complex, for example because of change freezes (due to customer business)
- The growing number of delivered patches and releases leads to several parallel patch cycles with effects on monitoring, automation, and more
- The effort for patch cycles accounts for about 35% of the maintenance effort for platform life-cycle management
Vendors deliver advanced and highly sophisticated implementation procedures for each IT element in a shared-IT infrastructure. Each vendor has to invest considerable effort in the quality assurance and automation of its technical patch and release implementation procedures.
Our initial scope is to provide a best-practice procedure for implementation on the platform from an E2E perspective. We focus on the technical aspects, which should align with your implemented process framework.
There are solutions for reducing the number of implementation activities, which serve to simplify and lower the number of patch cycles.
- Some vendors cooperate with other partners and deliver bundles or appliances of IT elements, such as storage and networks, as building blocks. They regularly provide patch sets for the entire appliance. This is a giant step towards minimizing the complexity of enterprise cloud or on-premise maintenance.
- Another solution is to establish joint test laboratories for cloud or on-premise environments, where the service providers share their test environments with partners and define patch sets together.
Within the Zero Outage initiative, we discuss the prerequisites for a general reduction of patch cycles in a shared infrastructure.
The following chapters will provide an overview of our best-practice approach for patch and release implementation in enterprise cloud platforms with multiple suppliers.
At the current stage, we can provide a checklist for the implementation procedure. The main steps include:
- Regular check procedures, for instance for mandatory patches or new releases
- A preparation phase for the patch implementation
- Testing of the implementation procedure (incl. clarifying effects on other CIs)
- Implementation of the patch
The following chapter describes each of these steps in detail.
The first step of the procedure is to establish regular checkpoints for exchanging information on new patches or releases. In our experience, the best method is to make this a fixed item in the regular service meetings with your partners.
Check procedures include the following steps:
- Regularly check for the availability of new patches and releases
- Make the check part of the service meeting with your vendor
- Set up automatic notification for mandatory bug fixes
- Gather information on new and abandoned features
- Check for dependencies on other components, such as parameter settings or required firmware levels
A formidable challenge for both parties, the vendor and the provider, is deciding whether a “mandatory” patch or bug fix is actually mandatory for the specific IT infrastructure.
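The dependency check and the "is it mandatory for us" question can be sketched in code. The following is a minimal illustration, assuming a hypothetical notice format and inventory structure; the class names, fields, and the sample data are assumptions made for this sketch and are not part of any vendor interface.

```python
from dataclasses import dataclass, field


@dataclass
class PatchNotice:
    """A vendor notice about a patch (hypothetical structure)."""
    patch_id: str
    component: str              # product the patch applies to
    affected_versions: set      # versions the vendor lists as affected
    vendor_mandatory: bool      # the vendor's own classification
    requires: dict = field(default_factory=dict)  # e.g. {"firmware": "2.4"}


@dataclass
class InstalledElement:
    """One IT element in the shared infrastructure (hypothetical structure)."""
    name: str
    component: str
    version: str
    levels: dict = field(default_factory=dict)    # e.g. {"firmware": "2.3"}


def parse_version(text):
    """Turn '2.10' into (2, 10) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def assess(notice, inventory):
    """Map each affected element to the prerequisites it does not yet meet."""
    findings = {}
    for element in inventory:
        if element.component != notice.component:
            continue  # different product, notice does not apply
        if element.version not in notice.affected_versions:
            continue  # version not listed as affected
        findings[element.name] = [
            f"{key} >= {minimum}"
            for key, minimum in notice.requires.items()
            if parse_version(element.levels.get(key, "0")) < parse_version(minimum)
        ]
    return findings


notice = PatchNotice("P-001", "storage-os", {"9.1", "9.2"}, True, {"firmware": "2.4"})
inventory = [
    InstalledElement("array-a", "storage-os", "9.1", {"firmware": "2.3"}),
    InstalledElement("array-b", "storage-os", "9.3", {"firmware": "2.4"}),
    InstalledElement("switch-a", "network-os", "9.1"),
]
result = assess(notice, inventory)
print(result)
```

An empty result would suggest that the vendor's "mandatory" flag does not apply to this particular infrastructure. The final decision still belongs to the vendor and the provider together.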
Most of the change runbook content is defined during the following steps. Change runbook templates and standardized building blocks, for instance back-out methods or the health care phase, support the preparation of the implementation.
Practice shows that, above a certain level of criticality, the vendor should be involved in creating the runbook. The use of change runbooks is described in the process section.
Use of catalogs
Most of the following steps are highly recurrent. Catalogs with reusable templates are extremely useful and can drastically reduce the work involved in these steps.
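As an illustration of such a catalog, the sketch below assembles a runbook from reusable building blocks. The block names, the wording of the templates, and the parameters are hypothetical; a real catalog would hold the templates agreed with your partners.

```python
from string import Template

# Hypothetical catalog of reusable runbook building blocks.
CATALOG = {
    "health_check": Template("Run health check '$check' on $target"),
    "implement": Template("Apply $patch to $target"),
    "back_out": Template("Back out $patch on $target using method '$method'"),
    "health_care": Template("Observe $target for $hours h after the change"),
}


def build_runbook(block_names, **values):
    """Render the chosen building blocks, in order, into numbered steps."""
    steps = []
    for number, name in enumerate(block_names, start=1):
        # substitute() raises KeyError if a block needs a value that is missing
        text = CATALOG[name].substitute(values)
        steps.append(f"{number}. {text}")
    return steps


runbook = build_runbook(
    ["health_check", "implement", "health_check", "health_care"],
    check="storage-basic", target="array-a", patch="P-001", hours=48,
)
print("\n".join(runbook))
```

Because every block is filled from the same set of values, a missing parameter surfaces while the runbook is being prepared rather than during the change itself.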
The preparation should include the following steps:
- Check for the following dependencies:
  - Check whether other IT components have to be implemented as well, for instance, by means of known-error databases
  - Check for dependencies on applications or operating systems
  - Check for interface parameter settings
- Define the test procedure for checking the implementation procedure, incl. tests of dependent components:
  - Define the elements to be tested and the test procedure
  - Define test teams
  - Define acceptance criteria for the test
- Define the fall-back scenario
- Define the health check procedure
- Check the operational and maintenance procedures for necessary adaptations, e.g. new or changed features can lead to other maintenance tasks
- Check for necessary changes in automation tools, e.g. new releases can result in different start & stop procedures
- Check for necessary changes in monitoring and event management tools
- Inform partners about the planned implementation date
Not all of the steps are necessary for each and every patch or release implementation. Rules should define when specific steps need to be prepared and when the vendors need to be involved to support the preparation process.
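Such rules can be kept as a simple table. The sketch below assumes a three-level criticality scale and assigns each preparation step a minimum criticality. Both the scale and the thresholds are illustrative assumptions and would have to be agreed for the specific environment.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Hypothetical criticality scale for a patch or release."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical rule table: minimum criticality from which a step is required.
STEP_RULES = {
    "check dependencies": Criticality.LOW,
    "define fall-back scenario": Criticality.LOW,
    "define health check procedure": Criticality.LOW,
    "define test procedure": Criticality.MEDIUM,
    "check monitoring and automation changes": Criticality.MEDIUM,
    "involve vendor in runbook preparation": Criticality.HIGH,
}


def required_steps(criticality):
    """Return the preparation steps required at the given criticality."""
    return [step for step, minimum in STEP_RULES.items() if criticality >= minimum]


for level in Criticality:
    print(level.name, required_steps(level))
```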
The testing process should be based on predefined test plans for specific maintenance activities, which are adapted to the current maintenance activity. These templates should be defined once for the specific environment, together with all partners.
For complex changes (for instance, new versions with new functionality), joint tests with the partner are useful, especially when the new versions require changes in the monitoring, maintenance, and automation procedures.
The test phase should include the following activities:
- Prepare the test environment so that it matches the production environment as closely as possible
- Execute tests with or without involvement of the partner, depending on risks and criticality
- Test the implementation procedure
- Test the back-out scenario
- Test the health check procedure
- Test the monitoring and event management adaptations
- Test the automation and operational tool adaptations
- Gather feedback and approval from the test teams
- Share test results with your partners, for example, as feedback for their known-error database
- Adapt the change runbook based on findings from the tests (if necessary)
- Adapt the test plan based on findings from the tests (if necessary)
- Prepare follow-up changes, for instance to the CMDB
Our experience shows that the necessary changes to operational processes are frequently not part of the test procedure. Monitoring and event management, automation tools, and the automata themselves are generally left out of the test procedures, which then causes complications in the operational units.
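One way to catch this gap early is to check a test plan against the list of areas it is expected to cover. The sketch below is a minimal illustration; the area names follow the activity list above, while the plan structure and the sample test cases are assumptions.

```python
# Areas a test plan is expected to cover, taken from the activity list above.
REQUIRED_AREAS = {
    "implementation procedure",
    "back-out scenario",
    "health check procedure",
    "monitoring and event management",
    "automation and operational tools",
}


def missing_areas(test_plan):
    """Return the required areas for which the plan contains no test case."""
    covered = {case["area"] for case in test_plan}
    return sorted(REQUIRED_AREAS - covered)


# Hypothetical plan that tests the patch itself but not the operational tooling.
plan = [
    {"area": "implementation procedure", "case": "apply patch on test array"},
    {"area": "back-out scenario", "case": "roll back to previous release"},
    {"area": "health check procedure", "case": "run health check after patch"},
]
gaps = missing_areas(plan)
print(gaps)
```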
In a ZO environment, all standard maintenance activities should be online activities. Implementing the patch itself is usually the least time-consuming part of the whole change procedure. However, the whole patch cycle for large cloud environments takes weeks to months.
Because multiple patch cycles run in parallel, the IT elements in the environment are at different software versions at any given time. The operational teams need to take this into account.
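To make this version mix visible to the operational teams, the inventory can be grouped by component and version. The sketch below assumes the inventory is available as simple (name, component, version) tuples; the sample data is invented for illustration.

```python
from collections import defaultdict


def version_spread(inventory):
    """Group element names by component and then by version."""
    spread = defaultdict(lambda: defaultdict(list))
    for name, component, version in inventory:
        spread[component][version].append(name)
    return spread


def mixed_components(inventory):
    """Return the components that currently run more than one version."""
    spread = version_spread(inventory)
    return sorted(c for c, versions in spread.items() if len(versions) > 1)


# Hypothetical inventory in the middle of a patch cycle.
inventory = [
    ("array-a", "storage-os", "9.2"),
    ("array-b", "storage-os", "9.1"),
    ("switch-a", "network-os", "4.0"),
    ("switch-b", "network-os", "4.0"),
]
mixed = mixed_components(inventory)
print(mixed)
```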
The implementation phase should include the following activities:
- Inform involved partners about the date, time and health care phase of the implementation
- Execute the implementation including all other component dependencies
- Check the implementation:
  - Run the health check before and after implementation
  - Check feedback from the change team and the involved operational teams
- Adapt support tools:
  - Monitoring and event management
  - Change of data in the CMDB (automatically or manually)
  - Change of operational and automation tools
- Activate the adapted operational procedures
- Run health care phase
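The core of the implementation step, a health check before and after the change combined with the prepared fall-back scenario, can be sketched as follows. The function names and the simulated element are hypothetical, and backing out automatically when the second check fails is an assumption of this sketch rather than a rule stated above.

```python
def implement_with_checks(apply_patch, back_out, health_check):
    """Run a health check before and after the patch; back out on failure.

    All three arguments are caller-supplied callables. health_check must
    return True when the element is healthy.
    """
    if not health_check():
        return "aborted: unhealthy before change"
    apply_patch()
    if health_check():
        return "implemented"
    back_out()  # assumption: the fall-back scenario is triggered automatically
    return "backed out: unhealthy after change"


# Simulated element whose patch breaks the health check.
state = {"version": "9.1", "healthy": True}


def apply_patch():
    state["version"] = "9.2"
    state["healthy"] = False


def back_out():
    state["version"] = "9.1"
    state["healthy"] = True


outcome = implement_with_checks(apply_patch, back_out, lambda: state["healthy"])
print(outcome, state["version"])
```

Running the check before the change separates problems that already existed from problems introduced by the patch.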
Because of the complexity and the number of high-risk implementations, practice shows the advantages of a health care phase. Information on the health care phase can be found in the process work stream.
There are two major objectives for the next steps concerning the described subject:
- The provided best practice is focused on on-premise and enterprise cloud environments. We need to define a similar procedure for open cloud environments as well.
- In addition to improving the current best-practice model by way of the feedback we hope you will provide, we will start the second phase of our three-step approach and define the best-case scenario. The scope of this best-case scenario is a substantial simplification of the technical procedures in light of the proactive Zero Outage approach.
Please provide your feedback via “firstname.lastname@example.org”.