Monitoring

Monitoring is an important part of any zero outage infrastructure. Monitoring in the traditional sense, covering the hardware layers and components that make up the environment, is important, as is monitoring the business application and the user experience.

In this chapter we describe approaches that can be adopted as part of a zero outage solution to monitor the environment, whether it is hosted in a local datacentre, a hybrid environment or the cloud.

Efficient monitoring requires the application and business process to be fully understood. As an example, checking that a website is up and online is a basic monitor, but the speed at which it responds, and the resulting user experience, is a more valuable measure.

As the complexity of the environment has increased, the level of monitoring possible has also increased. As part of this evolution we are becoming more reliant on automation to perform or implement known actions. Further information on the evolution of monitoring can be found within the Platform section.

Traditionally we have measured the physical components that make up the service. For example, we would monitor a physical server for health, and the view from the monitoring station or console would be green if all components were healthy. This green status is derived as an overall status of the server; different equipment vendors may monitor different components to establish server health. For example, the temperature and output voltage levels of a power supply would be monitored, and should the temperature rise above the defined threshold an alarm would be generated.

Monitoring is traditionally achieved by the monitored device sending an alert message to the monitoring console. It is also important to be pro-active in monitoring; a good example of this is to send a “ping” packet to the server and monitor for a response. If the server lost its network connection it would not be able to send out an alert message, so actively checking that the server still responds confirms that it is able to communicate.

The frequency of pro-active monitoring is also an important factor: polling too frequently generates unnecessary load on the server, while polling too infrequently may mean the device is offline for some time before the alert is received.
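To illustrate this trade-off, the following minimal sketch (in Python, assuming a Linux-style ping command is available) polls a host at a fixed interval and raises an alert after several consecutive missed responses. The host name, interval and alert handling are illustrative assumptions rather than part of any particular monitoring product.

```python
import subprocess
import time

HOST = "app-server.example.com"   # illustrative host name
INTERVAL_SECONDS = 60             # polling interval: a balance between load and detection time
FAILURES_BEFORE_ALERT = 3         # tolerate transient packet loss before alarming

def host_responds(host: str) -> bool:
    """Send a single ICMP echo request and report whether a reply was received."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host], capture_output=True)
    return result.returncode == 0

def raise_alert(host: str) -> None:
    # Placeholder: in practice this would forward an event to the monitoring console.
    print(f"ALERT: {host} has not responded to {FAILURES_BEFORE_ALERT} consecutive checks")

consecutive_failures = 0
while True:
    if host_responds(HOST):
        consecutive_failures = 0
    else:
        consecutive_failures += 1
        if consecutive_failures == FAILURES_BEFORE_ALERT:
            raise_alert(HOST)
    time.sleep(INTERVAL_SECONDS)
```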

Monitoring also needs to extend to the physical location of the datacentre. The datacentre may be a local facility, or space rented within a serviced datacentre. Where the datacentre is local, the environmental factors need to be monitored: power, thermal and humidity levels, external connectivity and security are the key factors. Where the service is provided in a shared facility, these factors are the responsibility of the service provider and should be delivered as part of the operational agreement.

When multiple applications work together to provide the service, it is also important to ensure that the connectivity between the applications is monitored, and where they are time sensitive that this is monitored too. Traditionally a test process is run to verify the connectivity between the applications, and when the test process fails an alarm is raised.

Server Monitoring Elements

  • CPU
  • Memory
  • Disk space
  • Processes

Application Metrics

  • Error and success rates
  • Service failures and restarts
  • Performance and latency of responses
  • Resource usage

Network

  • Connectivity
  • Error rates and packet loss
  • Latency
  • Bandwidth utilization
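As an illustration of how some of the server elements listed above could be sampled, the sketch below uses the third-party psutil library (an assumption; any agent or operating system interface could provide the same values). The thresholds are illustrative and would normally be derived from a baseline of the server's typical behaviour.

```python
import psutil  # third-party library, assumed to be installed

# Illustrative thresholds; real values would come from a baseline of normal behaviour.
CPU_THRESHOLD = 90.0      # percent
MEMORY_THRESHOLD = 85.0   # percent
DISK_THRESHOLD = 80.0     # percent of the root filesystem

def collect_server_metrics() -> dict:
    """Sample the basic server monitoring elements: CPU, memory, disk and processes."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "process_count": len(psutil.pids()),
    }

def evaluate(metrics: dict) -> list:
    """Compare sampled metrics against thresholds and return any breaches."""
    alerts = []
    if metrics["cpu_percent"] > CPU_THRESHOLD:
        alerts.append("CPU utilisation above threshold")
    if metrics["memory_percent"] > MEMORY_THRESHOLD:
        alerts.append("Memory utilisation above threshold")
    if metrics["disk_percent"] > DISK_THRESHOLD:
        alerts.append("Disk space utilisation above threshold")
    return alerts

if __name__ == "__main__":
    metrics = collect_server_metrics()
    for alert in evaluate(metrics):
        print("ALERT:", alert)
```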

Cloud monitoring is a wide category that includes many aspects which traditionally would not have needed to be monitored. Monitoring should include web and cloud applications, infrastructure, network, platform, application, user experience and microservices. A new generation of monitoring tools provides cloud monitoring, and these frequently also offer cloud capacity tracking and measurement of website response times as experienced by users.

The adoption of cloud based solutions has also introduced a new term, “elasticity.” This term describes how the level of resources needed for the service or application can be tuned to ensure efficient use of resources and operational costs. Traditionally, hardware is provided to run the application to meet the business requirements in terms of performance. There will be times when resources are excessive because workload requirements are lower (low demand, off peak) and times when performance starts to fall off because insufficient resources are available (high demand, peak operating hours). As cloud providers charge for computing power based on usage, resources should be scaled up and down to optimise application performance: additional computing power is provided so the application can grow to meet peak demand, and is then released again off peak. This operating model provides the most cost effective solution for businesses operating services in the cloud.

Monitoring the application performance then allows additional compute resources to be added or removed. Elastic applications provide a method to deliver the optimal level of resources to meet the business requirements; a simple scaling sketch follows the list below.

  • Over-provisioning – allocating more resources than required, which may incur additional operating costs
  • Under-provisioning – allocating fewer resources than required, which may impact the user experience
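The following minimal sketch shows how monitored utilisation could drive such an elasticity decision. The thresholds, pool limits and the idea of keeping headroom for provisioning time are illustrative assumptions, not any specific cloud provider's API.

```python
from dataclasses import dataclass

# Illustrative policy values; real thresholds would come from baseline data.
SCALE_OUT_THRESHOLD = 75.0   # average CPU % above which capacity is added
SCALE_IN_THRESHOLD = 30.0    # average CPU % below which capacity is removed

@dataclass
class Pool:
    instances: int
    min_instances: int = 2
    max_instances: int = 10

def scaling_decision(pool: Pool, avg_cpu_percent: float) -> str:
    """Decide whether to scale out, scale in or hold, based on monitored load.

    The scale-out threshold is deliberately set below saturation so that new
    capacity, which takes time to provision, is ready before performance drops.
    """
    if avg_cpu_percent > SCALE_OUT_THRESHOLD and pool.instances < pool.max_instances:
        return "scale_out"
    if avg_cpu_percent < SCALE_IN_THRESHOLD and pool.instances > pool.min_instances:
        return "scale_in"
    return "hold"

# Example: a pool of three instances running hot triggers a scale out.
print(scaling_decision(Pool(instances=3), avg_cpu_percent=82.0))  # -> scale_out
```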

It is also important to factor in the provisioning time: the time taken for additional resources to be brought online or taken offline. This will depend on the resource and the complexity of the environment, and must be factored into the monitoring so that there is sufficient time to make the resources available.

It is important to understand what needs to be monitored, and to select a tool or set of tools that provides this capability.

The complexity of IT infrastructure is constantly increasing and, with the demand resulting from trends like digitisation, that will not stop any time soon. Ever more devices, software and sensors in the datacentre produce events or data that are important in order to understand their state and the state of the services running on that infrastructure.

Services are the key delivery to customers, and they are composed of various elements of the layered stack (defined as the infrastructure, system and application layers). Services delivered to the customer can be of many types, starting with pure infrastructure services (IaaS) and going all the way up the layers to Database as a Service (DBaaS) or Software as a Service (SaaS). The further we climb the stack, the more complex troubleshooting can become.

In order to improve important KPIs such as mean time to detect (MTTD), mean time to repair (MTTR), uptime of services and the general business impact, it is important to monitor the complete stack that delivers the service to the customer and to be able to understand the possible impact of an issue on that service, e.g. what impact a failed network port has on a SaaS service.
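As a simple illustration of these KPIs, the sketch below computes MTTD and MTTR from hypothetical incident records; the record layout and the convention of measuring MTTR from detection to resolution are assumptions made for the example.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the fault occurred, when it was detected
# and when service was restored.
incidents = [
    {"occurred": datetime(2023, 5, 1, 10, 0), "detected": datetime(2023, 5, 1, 10, 12),
     "resolved": datetime(2023, 5, 1, 11, 30)},
    {"occurred": datetime(2023, 5, 7, 22, 5), "detected": datetime(2023, 5, 7, 22, 8),
     "resolved": datetime(2023, 5, 7, 23, 0)},
]

# Mean time to detect: average delay between occurrence and detection (minutes).
mttd = mean((i["detected"] - i["occurred"]).total_seconds() / 60 for i in incidents)

# Mean time to repair: average delay between detection and resolution (minutes).
mttr = mean((i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents)

print(f"MTTD: {mttd:.1f} minutes, MTTR: {mttr:.1f} minutes")
```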

Finally, the customer application itself also needs monitoring in order to understand the end user experience: does the service work, and is its performance within the SLA boundaries, for example.

Figure 1: Cross Domain Events from Layered Services Stack

The approaches taken up until now are very domain specific: networking operators monitor their network, storage operators their storage. In the case of a failure or incident, each of these groups has to start its own investigation. Often issues are noticed in another area before they are recognised within the specific domain.

The number of events being reported is increasing, and the static filtering used to get them under control (which does not cover dynamic changes in the environment) is a huge challenge for operating staff. The sheer number of events is already hard to handle today, and that is only the tip of the iceberg. With the current approach, monitoring of ever more complex environments will not work; a new approach is necessary, and current monitoring environments have to evolve towards it.

What might such a new approach look like, and what does it need to offer in order to satisfy the KPIs for monitoring environments and operations?

A new approach needs to include, and partially automate, capabilities such as:

  • Event reduction using dynamic rules instead of static filters
  • Cross-domain detection of issues using machine learning and analytical algorithms
  • Cross-domain event correlation (see the sketch after this list)
  • Automated notification of identified issues
  • Collaborative process support to resolve issues
  • Knowledge recycling, with automated delivery of solutions for recurring issues
  • Automated or semi-automated issue resolution
  • Integration of upper- and lower-level monitoring tools to support end user experience monitoring of the complete stack comprising the service
  • Integration of various data sources (e.g. syslog, SNMP, sensor data and many more) and third party tools (e.g. Remedy, ServiceNow, CMDBs, support databases, etc.)
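A minimal sketch of the first two of these capabilities, assuming an illustrative event format: exact duplicates are reduced away, and the remaining events from different domains are grouped into a single candidate incident when they fall within a short time window.

```python
from datetime import datetime, timedelta

# Illustrative raw events from different domains (the format is an assumption).
events = [
    {"time": datetime(2023, 5, 1, 10, 0, 5), "domain": "network", "msg": "port eth1/7 down"},
    {"time": datetime(2023, 5, 1, 10, 0, 5), "domain": "network", "msg": "port eth1/7 down"},
    {"time": datetime(2023, 5, 1, 10, 0, 9), "domain": "storage", "msg": "path to LUN 12 lost"},
    {"time": datetime(2023, 5, 1, 10, 0, 12), "domain": "application", "msg": "DB connection timeout"},
]

CORRELATION_WINDOW = timedelta(seconds=30)  # illustrative window

def deduplicate(events):
    """Event reduction: drop exact duplicates of the same domain/message pair."""
    seen, reduced = set(), []
    for e in events:
        key = (e["domain"], e["msg"])
        if key not in seen:
            seen.add(key)
            reduced.append(e)
    return reduced

def correlate(events):
    """Cross-domain correlation: group events occurring within the same window."""
    groups, current = [], []
    for e in sorted(events, key=lambda e: e["time"]):
        if current and e["time"] - current[0]["time"] > CORRELATION_WINDOW:
            groups.append(current)
            current = []
        current.append(e)
    if current:
        groups.append(current)
    return groups

for group in correlate(deduplicate(events)):
    domains = {e["domain"] for e in group}
    print(f"Candidate incident spanning domains {sorted(domains)}: {len(group)} events")
```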

The above approach should also allow a start with a “base monitoring” solution containing some of the features listed, with the system then built out further in an agile manner using additional toolsets. These can include COTS (commercial off-the-shelf) products as well as open source tools.

Figure 2: HL Event to Remediation to Resolution Process

The way of monitoring cloud services outlined above clearly takes an approach that includes analytics, data science and artificial intelligence. Therefore, we need to understand the data fed into such a solution, whether it is produced today or might be produced in the future. Looking at the events and monitored data found in the datacentre, it can be categorised into three major buckets:

  1. Event driven data such as syslog events, SNMP traps, application logs or event data from chat systems used by the customer to communicate about the service. Twitter data could also be part of this category, depending on the service monitored.
  2. Network data captured from the real-time network traffic.
  3. Sensor data measured from hardware components (e.g. disks, SFPs, environmental data, etc.).

These three buckets need to be analysed further for their importance and value to event monitoring. According to the nature of the data, it needs to be extracted, transformed and loaded (an ETL process) into the toolsets that comprise the event monitoring solution. At this stage, enrichment of the data can additionally take place and is an essential step in making the data more valuable.

This pre-processing of event data also needs to include classification, definition of importance, possible recurrence and other necessary attributes.
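A minimal sketch of such an ETL and enrichment step for syslog-style events is shown below; the parsing rule, classification table and importance levels are illustrative assumptions.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Illustrative classification rules used during enrichment.
CLASSIFICATION = {
    "link down": ("network", "high"),
    "disk error": ("storage", "high"),
    "login failed": ("security", "medium"),
}

SYSLOG_PATTERN = re.compile(r"^(?P<host>\S+)\s+(?P<process>\S+):\s+(?P<message>.*)$")

def extract(line: str) -> Optional[dict]:
    """Extract: parse a simplified syslog line into structured fields."""
    match = SYSLOG_PATTERN.match(line)
    return match.groupdict() if match else None

def transform(event: dict) -> dict:
    """Transform and enrich: normalise the message and attach class and importance."""
    event["message"] = event["message"].strip().lower()
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    event["class"], event["importance"] = "unclassified", "low"
    for keyword, (cls, importance) in CLASSIFICATION.items():
        if keyword in event["message"]:
            event["class"], event["importance"] = cls, importance
            break
    return event

def load(event: dict) -> None:
    """Load: hand the enriched event to the monitoring toolset (stubbed here)."""
    print(event)

for raw in ["switch01 ifmgr: Link down on port 7",
            "db02 smartd: Disk error on /dev/sdb"]:
    parsed = extract(raw)
    if parsed:
        load(transform(parsed))
```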

When establishing a new application it is important that monitoring is also established during the planning, build and deployment phases. This data can then be used to generate a baseline data set, and once a profile has been established it is easy to spot anomalies. The baseline data also helps when setting monitoring thresholds. A good example is an accounting or invoicing system. During the month the requirements of the application can be considered low (invoices being paid and new orders being registered). As the month ends, the resource requirements of the system increase as the accounts need to be reconciled against the monthly transactions and invoices paid, and fresh statements generated. Because this workload is expected, thresholds can be raised automatically to prevent false alerts being triggered by the increased workload.

The same baseline data can be used to spot trends and take action to prevent service performance impacts. Taking the month end reconciliation example, additional compute resources can be provided in advance as the workload of the service increases, and once the peak is over the resources can be returned to the pool for use elsewhere. When combined with a cloud based application, cloud service monitoring and a level of automation, additional resources can be allocated automatically when using an elastic cloud service.
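A minimal sketch of how baseline data could drive a dynamic threshold, so that an expected peak such as month end does not trigger false alerts; the sample data and the three-sigma rule are illustrative assumptions.

```python
from statistics import mean, stdev

def dynamic_threshold(baseline_samples, sigma_multiplier=3.0):
    """Derive an alert threshold from baseline behaviour rather than a static value.

    Anything within sigma_multiplier standard deviations of the baseline mean is
    treated as expected load (for example the month end reconciliation peak),
    so only genuine anomalies raise an alert.
    """
    return mean(baseline_samples) + sigma_multiplier * stdev(baseline_samples)

# Illustrative baseline of daily CPU utilisation (%) including a month end peak.
baseline = [22, 25, 24, 23, 26, 28, 55, 60, 58, 24, 23, 25]

threshold = dynamic_threshold(baseline)
current_load = 92
if current_load > threshold:
    print(f"ALERT: load {current_load}% exceeds dynamic threshold {threshold:.1f}%")
```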

Building the event monitoring solution should also consider interfaces to external systems that provide supportive data or other data enhancing the detection and/or remediation processes. For example, including data about resource availability, such as compute power, storage space and networking resources, can help to find space for failover of virtual machines in case the server they run on needs to be serviced or fails.
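As a minimal sketch, such resource availability data could feed a simple placement check that selects a host with enough free capacity to receive a failing-over virtual machine; the host inventory, VM sizing and selection rule are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Host:
    name: str
    free_cpu_cores: int
    free_memory_gb: int

@dataclass
class VirtualMachine:
    name: str
    cpu_cores: int
    memory_gb: int

def find_failover_target(vm: VirtualMachine, hosts: list) -> Optional[Host]:
    """Pick the host with the most free memory that can still hold the VM."""
    candidates = [h for h in hosts
                  if h.free_cpu_cores >= vm.cpu_cores and h.free_memory_gb >= vm.memory_gb]
    return max(candidates, key=lambda h: h.free_memory_gb, default=None)

# Illustrative inventory pulled from an external resource availability system.
hosts = [Host("esx01", 4, 32), Host("esx02", 16, 128), Host("esx03", 8, 24)]
vm = VirtualMachine("db-node-2", cpu_cores=8, memory_gb=64)

target = find_failover_target(vm, hosts)
print(f"Failover target for {vm.name}: {target.name if target else 'none available'}")
```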

Furthermore, a connection to the vendors’ support systems could also help, providing data about software corrections, security vulnerabilities, hardware issues and other information relevant to troubleshooting.

There are many more systems that can provide valuable data to an event monitoring solution, but it is important to start with the minimum required and gradually grow into a more robust and comprehensive 360° monitoring solution.

Documentation is a mandatory requirement. It is key to being able to operate successfully, troubleshoot, escalate or simply to know who to contact. As many applications make up the different business processes, it is important to know, understand and document the estate.

Accurate and up to date documentation of how an application is built, its dependencies and its performance must be available. If the business requirements are not known, it will be impossible to determine whether there is an issue, where it lies and the steps needed to resolve it.

Documentation covers many areas of the business application, including details of the application, its function, interdependencies and performance criteria. Additional information on troubleshooting, next stage escalation and the business owner are the minimum details that should be recorded.

Predictive and preventive maintenance are often “features” on the wish list of solutions using ML (Machine Learning) and AI (Artificial Intelligence). There are also other names suggesting the same functionality.

What they all have in common is that the system must be able to “learn” and “understand” the data it is fed, in order to predict certain issues or prevent certain failures.

Predictive maintenance is more about real-time or near real-time processing to learn about events that predict an issue in the near future.

Preventive maintenance is a type of batch processing of datasets that allows, for example, a hardware failure to be prevented. It typically uses large sets of historical and current data to find patterns that clearly identify a situation leading to the failure of a hardware component. That could be identified ECC memory errors or abnormal performance behaviour of one or more disks.
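As a minimal illustration of the preventive idea, the sketch below scans historical counters for a rising trend of ECC memory errors and flags the component before it is likely to fail; the data layout and the flagging rule are illustrative assumptions, not a production algorithm.

```python
# Illustrative historical counters: ECC corrected-error counts per day for two DIMMs.
history = {
    "DIMM_A1": [0, 0, 1, 0, 2, 1, 3, 5, 8, 13],  # steadily rising: a candidate for replacement
    "DIMM_B2": [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],   # occasional noise: no action needed
}

ERROR_BUDGET_PER_DAY = 2      # illustrative rate above which we act preventively
TREND_WINDOW = 5              # look at the last N days

def needs_preventive_action(daily_counts) -> bool:
    """Flag a component whose recent error rate exceeds the budget and is rising."""
    recent = daily_counts[-TREND_WINDOW:]
    rising = all(b >= a for a, b in zip(recent, recent[1:]))
    return rising and sum(recent) / len(recent) > ERROR_BUDGET_PER_DAY

for dimm, counts in history.items():
    if needs_preventive_action(counts):
        print(f"Preventive maintenance recommended for {dimm}")
```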

Some of the COTS monitoring tools provide functionality for predictive monitoring, but not for preventive maintenance. Nevertheless, both should be considered for a complete solution.

The siloed way of operating today is not ideal for monitoring cloud services. As mentioned before, dedicated teams are responsible for monitoring their own domains. Monitoring a web service needs cross functional and cross domain operations. That means operations teams need to work much more closely together, using a common tool landscape that helps them see all the issues that are part of an incident and work those incidents through in a collaborative process. To do so, teams consisting of infrastructure, system and application specialists should work closely together to ensure a consistent end user experience.

Parts of the remediation process, such as issue detection, will also be taken over by machine learning algorithms. Furthermore, remediation of issues could be automated where there is a known resolution that lends itself to automation. Thus, parts of the remediation process will be taken away from operations staff and new tasks will be added to their responsibilities.

The monitoring approach discussed here focuses on the key principles of the ZO initiative: find and resolve issues before they affect customer services, since such issues will in most cases also lead to a business impact for the customer and thus for the service provider as well. Keep the time to detect those issues as short as possible and resolve incidents as fast as possible where they cannot be avoided.

Add predictive and preventive components in order to avoid issues and minimise incidents. Continuously check the environment for further improvements or necessary changes and add them as needed. Start with a minimum and build the system out in an agile manner.

Gartner’s AIOps definition can help to understand the value of such systems and provides information on what is available.

It is key to set important KPIs against a baseline that allows comparison and measurement of improvements or setbacks, both with this approach to event monitoring and with every new solution.

Efficient monitoring requires a Network Operations Centre or Network Monitoring Centre (NOC), where staff are on hand to monitor the business applications and ensure that they are operational. The role of the NOC has changed over recent years and continues to grow and evolve as application complexity increases and business and emerging requirements develop further; it now includes many different functions.

Fault Management – Monitors the network for out of tolerance conditions that may or may not impact operations. When such a condition occurs, the fault is recognised, tracked and resolved.

Configuration Management – Collection and recording of the configuration of the devices that form the business application. Many network issues are related to configuration changes that did not work as planned, and recording and documenting them is a vital part of proactive change management.

Accounting – Monitoring and recording of service provider charges, and ensuring they are charged back to the department using the service. Frequently administration work also falls under this category.

Performance Management – Monitoring the application performance and tracking the performance of individual functions within the layers to ensure the user experience or defined SLAs are achieved.

Security – Another important role that is usually part of the services provided or followed up by the NOC.

NOCs are being driven to focus on managing application performance rather than just availability, in part because of the greater emphasis on ensuring acceptable performance for key applications, on implementing more effective IT processes and on being able to troubleshoot performance problems faster.

Many NOCs have begun the shift away from having NOC personnel sitting at screens all day waiting for green lights to turn yellow or red. Many organizations have implemented tools to automate most Level 1 issues.

This trend, combined with the trend to increase the skill set of NOC personnel, indicates that more intelligence is being placed in the NOC, and that intelligence is a combination of people and tools.

Although the development of automation has continued, the majority of the NOC’s work is still reactive, identifying a problem only after end users are impacted. Further development in this area is required, especially where the business application is cloud based.