Ask ten different people what a hybrid model (or a similar term) means and you will hear at least eight different viewpoints, most of them right. We therefore outline our own position on the hybrid model and define our area of focus.
A hybrid model or scenario deals with the merging of on-premise services and all kinds of cloud services. Together with the corresponding operational model, it needs to deliver a single bundled service to customers. The Zero Outage best practices supplement these services by applying a proactive quality management framework.
As this is our first publication on this subject, we provide an initial overview of the following points:
- Comparison of the technical and operational model features and the aspects of open and enterprise cloud environments
- Two hands-on experiences derived from current hybrid models
- the first concerns the technical perspective regarding changes
- the second covers monitoring and event management
The objective of this work is to establish a strong connection with the other key work stream topics, e.g., Zero Outage Design Principles, typical IT challenges, and best practices. All of this culminates in a proactive Zero Outage quality model that covers both the technological and the operational model.
We are using the following terms to show two poles of the hybrid model in a black and white view.
- Enterprise cloud solutions:
We described this model in Chapter 1.3 with the pets analogy. In simple terms, we are talking about shared infrastructures with, e.g., large databases and a strong relationship between the hardware and stability.
- Open cloud solutions:
This model is also described in Chapter 1.3 and applies the cattle analogy. The application is the driver of service resilience, and the hardware has comparatively little influence.
We use these terms and analogies to share our experience in this area and provide an initial view of our architectures.
In Chapter 1.2, we will apply the analogy of pets and cattle, which serves as an overview of two different IT architectures.
In black and white:
- The resilience and availability of the open cloud architecture are predominantly steered by the specific software design, such as web services
- The resilience and availability of the enterprise cloud environment are mainly driven by the hardware design, e.g., database systems and legacy cloud
Despite current efforts to move software solutions into open cloud environments, there are two reasons why hybrid models are bound to endure.
- Some applications cannot be converted into open cloud environments, for instance, for commercial or legal reasons. Both environments are therefore necessary for customer solutions, as is interaction between vendors and providers.
- Technical aspects of an open stack environment using enterprise cloud procedures, e.g., hardware replacement, still exist.
With this in mind, to illustrate the differences and the similarities between both cloud strategies, we will draw on the analogy of “pets and cattle”.
This chapter provides an overview of the differences between the operational models of enterprise computing/cloud and the public cloud. From our point of view, it is necessary that both operational models be integrated into a service from a single source.
| Enterprise Computing/Cloud: Scale UP / Pets | Public Cloud Computing: Scale OUT / Cattle |
|---|---|
| Infrastructure emulates static, highly available application platform | Modular, loosely coupled distributed application architecture, APIs for each service |
| Application designed from few large instances, designed to never break | Horizontal (automatic) scaling for each service |
| Technical instances are closely monitored | Services are closely monitored |
| Powerful, complex, reliable hardware and software | Open source building blocks |
| More flexibility by virtualization, private clouds | Application is designed to cope with frequent failures |
| Expensive, proprietary hardware | Inexpensive, commodity hardware |
| Monitoring of configuration items (CIs) and stable correlation between the CIs | Monitoring of dynamic virtual layers, elements, … |
| Configuration management is designed for static CIs and CI correlation | Configuration management is designed for dynamic CI correlations |
| Lower frequency of changes, with partly high risk to the entire environment | Partly high frequency of changes (daily/weekly) with less risk. For some layers, e.g., the connectivity layer, lower frequency of changes with partly high risk to the environment |
In the next chapter we will show what these differences mean for management tools and operation.
The list above describes two extreme values of the hybrid model. In this chapter, we will provide a short summary of the effects regarding monitoring and alerting.
Challenges for hybrid model monitoring and alerting
- Within on-premise or enterprise cloud environments, the main focus of monitoring lies in continuously tracking the health status of each component. Even end-to-end monitoring results are largely achieved by identifying every single event and understanding its impact on the whole service chain.
- Within hybrid or open cloud environments the main focus of monitoring has shifted towards security, growth, and resource challenges. How do I control dynamic user growth? How can we continue to provide excellent support to such a distributed environment? What can you do now that will help support a healthier cloud in the future?
Do we need a paradigm shift in monitoring and alerting?
- In fact, not every failure of a physical device means that there is a critical incident, or even an event. In a public cloud, the applications should handle a server outage themselves, simply by starting another VM from an always available pool of sufficient machines. If business transactions are moved fully or partly to the cloud, their transactional character needs to be preserved when switching between machines: whatever mechanism is used, erroneous or broken transactions need to be stopped and their DB changes reverted.
As long as you can reach, e.g., the availability zone of a public cloud, the failure of one or even more physical devices is not an issue at all.
- As a result, application monitoring is getting more and more important. The first step in ensuring application availability is determining when there is a problem so that the appropriate corrective action can be triggered. Application monitoring can be used to check that users can access applications, and, for remotely hosted servers, network monitoring can determine if network bandwidth restrictions are contributing to availability problems.
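The reasoning above can be sketched as a small classification rule. This is only an illustration under stated assumptions: the names `PoolStatus` and `classify_host_failure`, the spare-capacity threshold and the three severity labels are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolStatus:
    """Snapshot of a VM pool in one availability zone (hypothetical model)."""
    zone_reachable: bool   # can the availability zone still be reached?
    spare_hosts: int       # healthy hosts able to take over workloads
    required_spares: int   # assumed minimum spare capacity to keep in reserve


def classify_host_failure(pool: PoolStatus) -> str:
    """Map the failure of one physical host to a severity label."""
    if not pool.zone_reachable:
        # The service itself is at risk, so this is a real incident.
        return "incident"
    if pool.spare_hosts < pool.required_spares:
        # Still running, but the safety margin is shrinking.
        return "warning"
    # The application restarts a VM from the pool; nothing to investigate.
    return "event"
```

With this rule, a failed server in a healthy pool is recorded as an event only, while the loss of the zone itself is still escalated as an incident.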
What is our strategy?
Regardless of the type of solution you work with, there are some sound considerations for cloud monitoring, management and health maintenance:
- Utilize automation and proactive remediation services wherever possible.
- Never forget to set good access control policies and always monitor security access.
- “Who watches the watchmen?” Always ensure that your monitoring system is running optimally and that configurations are kept updated.
- Not all workloads, apps, or data sets are alike – make sure to create appropriate monitoring profiles as needed.
- Take the time to understand your own cloud and all its intricacies and dependencies before selecting a monitoring solution. The more you know, the better a monitoring tool can fit in.
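As one illustration of the "Who watches the watchmen?" point in the list above, a watchdog can check whether the monitoring system itself is still reporting. The sketch below assumes that the monitoring system writes a heartbeat timestamp somewhere; the function name `monitoring_is_stale` and the five-minute default are hypothetical choices.

```python
from datetime import datetime, timedelta, timezone


def monitoring_is_stale(last_heartbeat: datetime,
                        now: datetime,
                        max_silence: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when the monitoring system has been silent for too long."""
    return now - last_heartbeat > max_silence


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Heartbeat two minutes ago: the watchman is still awake.
    print(monitoring_is_stale(now - timedelta(minutes=2), now))
    # Heartbeat ten minutes ago: raise an alert about the monitoring itself.
    print(monitoring_is_stale(now - timedelta(minutes=10), now))
```

The check is deliberately independent of the monitoring tool, so that a failure of that tool cannot silence its own alarm.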
This example shows that every bullet point in Chapter 1.4.1 has an impact on the technology, the operations model and the management systems. On the following pages, we will describe some of these effects and present a first possible solution for building up a hybrid model.
“IT goes agile” is the slogan for open cloud solutions. There are several methods to choose from for attaining this agility:
- Agile PM
There are solutions which combine two of these methods and thus take the best of two worlds:
- Implement ‘continuous intention’ of DevOps into the open cloud environment deployment process
- Reduce efforts by identifying and eliminating defects at the earliest possible time of deployment
- Use selected Agile project management techniques to gain speed
- lean decision and steering attitude
- high flexibility due to proactive change culture
- force continuous improvement by ‘Retrospectives’
As the customer expects highly integrated services, the challenging target entails combining these methods with on-premise models to form a hybrid model. This leads to high demands on people, processes and technology.
Besides the differences described above, both models are based on the same concept and call for standardized handovers.
Both cloud strategies need to use the same process framework, such as ITIL. The difference lies in the usage or handling of the framework, for instance, the change management processes.
A good example concerns how change management should be dealt with. In the enterprise cloud environment, a change (hotfixes, patches, new releases) leads to a change request that covers, for instance, analysis, development time, the update procedure and planned downtime during the update. In a public cloud environment, a change is more like a small event and does not necessarily lead to an in-depth analysis or planned downtime.
Example Enterprise Cloud: Software Change Management Process
This chapter focuses on the Software Change Management process and on how this process incorporates the cattle analogy, while demonstrating how best practices (high automation) facilitate a business scale up. A Software Change Management process review is relevant for a platform-as-a-service (PaaS) that provides an in-memory database with multiple application services.
Due to online customer ordering, approval workflows, and automated provisioning systems within the cloud service, change control will be significantly affected, and existing processes will need to be adapted.
In order to better understand the process, certain details on the in-memory PaaS should first be provided:
- This PaaS technology was invented to serve people and businesses intent on easily extending cloud and on-premise applications by means of adding a new functionality for optimizing existing investments.
- Connectivity and integration of applications of on-premise and cloud environments eliminate the need for data silos and make digital access simple, secure and moreover, scalable.
- The platform facilitates a rapid building and running of your applications; it solves old and new problems and empowers you to engage with customers as well.
The benefits of Cloud Computing as presented in Chapter 5.3.1 were mapped to the Software Change Management process of that PaaS.
The process itself, Software Change Management for the in-memory PaaS, is presented below at a high level:
- Modular, loosely coupled distributed application architecture, APIs for each service.
The development culture is moving more and more towards an API-based microservices approach, in which a single application is developed as a suite of small services, each running in its own process and communicating via lightweight mechanisms, often an HTTP resource API. These microservices are built around business capabilities and are independently deployable by fully automated deployment machinery. For software changes or updates, this means that the proper functionality of the APIs needs to be ensured with each and every software update. Once this is ensured, the new application architecture enables frequent updates, easy rollbacks etc., as stated in 2. & 3.
- Horizontal (automatic) scaling for each service.
The PaaS, as well as some of the applications on top of it, already follows this highly automated change process. A prerequisite for this process is that every change is planned, tested, authorized, documented and deployed in a controlled manner. All these activities are automated to ensure that scaling up the business and the efficiency of each service remain manageable even with little workforce adjustment.
- Services are monitored & Application is designed to cope with frequent failures:
All changes or change requests coming from Development and Operation Teams, Customers and Stakeholders are monitored and documented in change management tools. Through frequent testing and continuous delivery methodologies small parts of the service can always be rolled back to cope with frequent failures. This applies also to the overall design of the applications that need to follow a constant cycle of fast releases and continuous delivery:
The company is following an Innovation Cycle with DevOps methodologies embedded – frequent testing during software increment deployment activities is a must.
In order to provide a functioning and smooth testing environment the different virtual layers are constantly monitored and improved to reach near zero downtime operating mode.
- Open Source Building blocks
Open source building blocks are used in our IaaS layer, which is primarily used by the in-memory database. This enables customers and partners of that PaaS technology to leverage a continuous stream of cloud-based innovation in an open source environment, which supports application and skill portability across cloud services or on-premise software products that offer Cloud Foundry and OpenStack.
The technological details of Cloud Foundry as the industry standard for Cloud Platforms and the path to being cloud enabled are explained in the next picture:
For more information publicly available, please visit: http://docs.cloudfoundry.org/concepts/overview.html
- Inexpensive, commodity hardware
Since the costs of data storage, compute power and virtual machines have dropped significantly, scaling up for mass business is no longer a big obstacle. A data center strategy is in place that comprises own and partner data centers that follow the highest quality and efficiency standards, such as virtual, reactive and proactive distribution of infrastructure in the Infrastructure-as-a-Service layer. With regard to Software Change Management, different testing and live landscapes for redundancy purposes are easily affordable.
- Configuration management is designed for dynamic CI correlations
The fast dynamics of changing configuration items (e.g., 100 new/changed IPs per minute) can no longer be handled without highly automated configuration management (CM) applications that are built to perform the majority of configuration tasks automatically.
- High frequency of changes (daily/weekly) with less risk. For some layers e.g. connectivity layer, less frequency of changes with partly high risk to the environment
With the introduction of a Cloud Innovation Cycle, the delivery of features and releases has reached a frequency at which small product increments can be deployed quickly and securely, with easy roll-back functionality in case of a failure, so that risk is minimized.
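The idea of small increments with an easy roll-back can be sketched as follows. This is a simplified illustration, not the deployment machinery of the PaaS described here: the function `deploy_increment`, the version history list and the `health_check` callback are all hypothetical.

```python
from typing import Callable, List


def deploy_increment(versions: List[str],
                     new_version: str,
                     health_check: Callable[[str], bool]) -> str:
    """Activate new_version and roll back if its health check fails.

    versions is the deployment history with the active version last.
    Returns the version that is active after the call.
    """
    if not versions:
        raise ValueError("a previously deployed version is required")
    previous = versions[-1]
    versions.append(new_version)      # deploy the small increment
    if health_check(new_version):
        return new_version            # increment stays live
    versions.pop()                    # automated roll-back
    return previous
```

Because each increment is small, the roll-back only has to undo one step, which is what keeps the risk of a single change low.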
With these improvements, the platform can be continuously improved to support future developments (e.g., IoT or Artificial Intelligence), while achieving a high availability of service. In general, it is important that there is a paradigm change in workforce enablement and corresponding responsibility. The DevOps approach, which combines Development, Operations and Quality Management, brings these three elements closer together on both an organizational and a tools level. Without the right enablement of the development workforce, operations responsibility will not be taken, and vice versa. In the past, Development and Operations were separated because of certification and attestation demands, which require clear fraud protection. These changes lead to updates in the established segregation of duties and other fraud-mitigating measures to reflect the closer working and enablement model.
The description of differences and similarities of the hybrid model, provided in chapter 1.3.1, shows the framework conditions for a hybrid operational model. We will focus on the following core topics of the operational model:
The coming pages will serve to provide an overview of the main elements of an operational model for a hybrid solution.
In most cases, open cloud environments will be implemented in existing IT ecosystems with operations teams, CMDBs, ITS, etc. Existing ecosystems already possess an extremely high level of flexibility, but a revolution would have to take place to make a traditional ecosystem with a hybrid model agile within a Zero Outage quality model.
The way we see it, the following steps are pivotal for establishing a ZO hybrid operational model:
Develop and roll out a combined “Cattle and Pet” operational model
- Technology: Define architectural and technical framework of the ‘cattle environment’ from an operational view. When we implement a cattle environment in an existing IT ecosystem, it is necessary to enlarge the functionalities and features of the attached management systems.
These are major design principles for the implementation of an open cloud infrastructure.
- Process: Identify basic process requirements; identify conflicts and solutions (also towards ITIL). From a process perspective, there are several parts of the operational model which call for reviewing:
- Automation process
- Application of ITIL processes (not the framework itself)
- Management tools
- Maintenance: The implementation of an open cloud environment in a traditional ecosystem will need an agile and lean maintenance framework. In black and white, we need to combine the “never-touch-a-running-system” world with the agility of an “online game”.
Define and implement operations policies within an “agile” operational framework, for instance, Continuous Deployment
- Define and implement the operational key elements within the open cloud ‘Continuous Deployment’ framework. It is vital to establish the most efficient operational framework for the open cloud environment, but it is equally important to keep an eye on the current management systems and operational models. A hardware change is a good example of this. There are various ways to change hardware in an open cloud and in a traditional cloud environment, but elements such as triggers, process framework, management tools and configuration databases will be the same.
- Set up roles, requirements & responsibilities of open cloud operations within ‘Continuous Deployment’.
- Implement adequate tools and procedures as well as the support of steering logic and governance
Ensure staff with suitable skills & mindset
- Adapt job description to agile skill & attitude requirements for new hirings
- Implement adequate measures, training schemes and controls for ongoing development of current staff
- Necessary lean-operations attitude in a nutshell:
- Agile operations emphatically does not equate to:
- Cowboy administrators running amok on the systems, with no plan or documentation.
- Quite the opposite:
- Agile operations require great self-discipline.
- Operators must commit to putting everything into version control.
- They must accept nothing less than 100% automation.
- Manual actions must never be permitted.
Our experience showed that hybrid models are necessary to provide current customer services. This operations model necessitates additional skills and broader understanding for the other side of the hybrid model.
The key requirement for operational teams in a hybrid model is to combine both operational models, open cloud and on-premise or enterprise cloud, in the mindset of these teams, which can sometimes lead to a “split personality”.
Due to the different operations and approach of a public cloud it might be reasonable to isolate the central public cloud team from the individuals operating the legacy elements of an infrastructure. When deploying and planning to operate your own private cloud, using and adapting current data center operational staff is more complex. There are many skillsets and legacy processes that need to evolve when managing cloud-based services; virtualized servers, networking and storage, back-up and recovery, software updates and patching, and application installation and maintenance. Experience has shown that many organizations encounter resistance or push-back from existing operational personnel, who are now being asked to adapt to new processes and techniques. Usually the legacy staff have every good intention of doing well at their job but might have difficulty accepting that their legacy skills and experience are not appropriate or no longer a “best practice” in this new style of IT.
Many legacy datacenters and operational teams are organized by technology or skillsets. Often, teams are organized into silos with a department manager, based on technologies such as servers, OSs, storage, back-up and recovery, networking, monitoring, operations, and patching and updates. Because a private cloud involves new processes and best practices that span all of these areas, this is a good time to consider shifting personnel into different team structures more in line with a service-oriented architecture.
We recommend that you keep the following example in mind while reading the next chapter. An operations engineer obtains information on defective hardware. In black and white, there are at least two possibilities for resolving the issue:
- In an enterprise cloud infrastructure, repair or replacement procedures will start, which lead to several change activities and tests. The defective IT element results in incidents and changes.
- In an open cloud environment, there is a third option: keep the defective IT element in the IT environment and deactivate it. Due to the use of lower-cost IT elements, this solution is becoming more and more important. In such a procedure, the “incident” is more like an event, without any investigation by the operations engineers.
The operational engineer needs all the necessary information for both flows and needs rulesets as to which direction to take.
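Such a ruleset could look like the following minimal sketch. It only encodes the two flows of the defective hardware example; the names `Environment` and `hardware_defect_flow`, the flow labels and the extra `redundancy_ok` condition are assumptions made for illustration.

```python
from enum import Enum


class Environment(Enum):
    ENTERPRISE_CLOUD = "enterprise"   # the "pets" side of the hybrid model
    OPEN_CLOUD = "open"               # the "cattle" side of the hybrid model


def hardware_defect_flow(env: Environment, redundancy_ok: bool) -> str:
    """Pick the operational flow for a defective hardware element."""
    if env is Environment.OPEN_CLOUD and redundancy_ok:
        # Leave the element in place, deactivate it, record an event only.
        return "deactivate_and_log_event"
    # Repair or replacement with incident, change request and tests.
    return "incident_and_change_request"
```

In this sketch an open cloud element without sufficient redundancy falls back to the traditional flow, which is one possible way to keep the rule on the safe side.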
The list below provides an overview of our experience with regard to operational teams, with the key aspect of combining the different cloud models into a single customer service.
The will to run trustworthy IT ecosystems (“keep the lights ON”) in a highly standardized and automated way. Apart from the well-known key words, e.g., secure, resilient, performant, proactive (monitoring services which avoid problems before they recur), it is necessary to combine both sides of the hybrid model into a single service for the customer.
Focus on the long-term strategic steps over the short-term tactical measures, which means:
- Avoid the preference for building short-term micro services and work-arounds triggered by narrow viewpoints.
- Prevent a mishmash of techniques and technologies that leads to a maintenance nightmare. A ruleset that defines when the operations teams use an agile procedure and when they use a traditional procedure or technology will support the overall resilience of the IT ecosystem.
- Strive for a long-term technical vision in cooperation with architects, product management and management systems and align with a long-term business outlook.
A higher rate of changes and adaptations, at least on the open cloud side but increasingly on the enterprise cloud side too, puts one personal characteristic at the center: the constant impulse to streamline the overall flow of work.
The following table shows four bullet points which are essential for the mindset of a hybrid model operations engineer:
- includes not only operational work
- BUT includes work which is coming into operations
- AND includes work which is going out of operations
- disciplined and structured approach of operations towards DevOps philosophy including
- operational aspects of a Release Management
- operational aspects of a Data Management
The number of changes makes documentation more critical and requires an agile document management system. Sufficient and efficient documentation consists of the following elements:
- Concise, accurate, also high-level and technically up to date (videos, wiki, …)
- infrastructure overviews
- release procedures
- training materials and media
- critical aspects of infrastructure (e.g. security; data- and network architecture)
A training scheme that provides the opportunity to simulate procedures, much as pilots train in flight simulators, will increase quality and effectiveness.
Combining different cloud models into individual customer services places additional high demands on areas other than the technology and operations model. We point out two main obstacles:
Audit & compliance
- The combination of traditional and agile operational models, and particularly DevOps, will raise all kinds of red flags with auditors. One reference example from practice shows that strict separation between development and operations no longer has to be the standard.
- elaborate and supply compensating controls, e.g. transparent monitoring and logging of automated flows
- meet with key auditors before implementing DevOps or other agile tools and methods in an existing IT-Ecosystem
- Toolset: high level of resistance if it is proposed to fulfill the ITIL processes through smart toolset instead of the corporate standard
- “Don’t fight city hall!”: Respect standard, but use APIs for connecting to it
There are several aspects we need to keep in mind when integrating an open cloud operations model into an existing IT ecosystem. Most of these aspects are closely related to the integration of a Continuous Development approach.
It will definitely be necessary to integrate open cloud operations into a future open cloud agile framework (Continuous Deployment) to avoid bottlenecks or brake-shoe effects for the product delivery pipeline.
The following list shows the main concerns in this area:
- In the classic agile area, we expect team members to be fully dedicated to the effort, and to have a “musketeer” attitude (all for one and one for all)
- But this could be in conflict with classic operations because
- Inability to dedicate team members to agile DevOps/SCRUM teams
- Because of the nature of their work (emergent and ticket based), planning can be difficult
- Ops and infrastructure work often does not fit well within a 2-week sprint
- Ops tends to be a shared service among multiple teams
- IT tends to have “crunch times” during major implementations, production releases, and unexpected infrastructure mishaps
- IT tends to operate as a “pool” of resources, with no one person knowing it all
- Challenge of emergent work
- Ops work is ticket and event driven – development work is plan and portfolio driven
- Ops work has to be done now and fast (if a network goes down you can’t wait for the next ‘sprint’ to fix it)
- It is very different from projectable software development and deployment activities
- whole DevOps team also has to understand infrastructure needs and requirements
- suggestion of alternatives that will make life easier for the teams and for operations
- understanding of what the workload is for a given configuration
- pairing with team members on operations-related tasks to help spread understanding
- Conflicting timelines of infrastructure /operations’ and developers’ work
- The timelines of Ops’ work, especially when it comes to racks, stacks, data centers, and network configuration, are often measured in weeks or months instead of days
- Agile teams tend to work in 2-week cycles, often not looking too far ahead at any given time
- In addition: default answer for timeline from operations is “it depends!” – because key metrics are not known
- You can’t improve what you don’t measure; key metrics within operations must be known:
- Lead time: How long it takes between the time you get a request and the time you finish it.
- Cycle time: How long it takes between the time you start working a request and you finish it.
- As you collect better and deeper metrics, you can give lead and cycle times for different kinds of work
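The two metrics defined in the list above can be computed directly from ticket timestamps. The sketch below is a minimal illustration; the `Ticket` record and its field names are hypothetical and would need to be mapped to whatever the ticketing system actually stores.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Ticket:
    """One operations request with its three key timestamps (hypothetical)."""
    requested: datetime   # when the request was received
    started: datetime     # when work on it began
    finished: datetime    # when it was completed


def lead_time(ticket: Ticket) -> timedelta:
    """Time from receiving the request to finishing it."""
    return ticket.finished - ticket.requested


def cycle_time(ticket: Ticket) -> timedelta:
    """Time from starting work on the request to finishing it."""
    return ticket.finished - ticket.started


def mean_duration(durations: List[timedelta]) -> timedelta:
    """Average of a non-empty list of durations, e.g. per kind of work."""
    return sum(durations, timedelta()) / len(durations)
```

Grouping tickets by kind of work and applying `mean_duration` to each group gives the per-category lead and cycle times that replace the default answer "it depends".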
The field of tension between traditional operations, agile operations, the associated technology and the Zero Outage quality model is a big challenge for current IT ecosystems. This first release provides an overview of the main topics and collects some aspects of our experience with this topic.
This publication will be our starting point to develop, hopefully with your support, a hybrid model which achieves the Zero Outage Industry Standard.