General Business Continuity Process guidance

Process Overview

Figure 4: Process Overview

Business Impact Analysis

  • The Business Impact Analysis (BIA) provides essential information about the customer’s critical business processes and the supporting IT services (including the necessary resources), about damage scenarios and damage progression over the course of time (criticality), about recovery time objectives and priorities, and about the continuity and availability requirements.
  • A separate risk assessment provides essential information about possible risks to service continuity and measures for risk minimization and options for action.
  • The service recipient (customer) is responsible for performing the BIA; the IT provider is involved in a supporting capacity in accordance with its IT service responsibilities. If the customer does not conduct a BIA, the service provider should assess the potential business impact in consultation with the customer as part of proactive ITSCM.
  • During this phase of the IT service continuity management process, only the possible effects of an emergency are of interest, not its potential causes; the focus is purely on the consequences to be anticipated. Categorizing a business process as ‘uncritical’ within the BIA automatically implies a lower recovery priority. In the context of ITSCM, ‘critical’ means time-critical, i.e. fast resumption is required in order to prevent major damage.

Business Process Analysis

In order to carry out a BIA, it is necessary to have an overview of all the relevant IT services involved in each of the customer’s business processes. Details of the responsible contacts must also be included. The business processes are sorted by type according to whether they are core processes or supporting processes. Core processes are processes that contribute directly towards achieving one or more of the company’s business objectives. The following table illustrates a potential way of structuring the business processes:

Type of business process, followed by what an outage/suspension of the business process results in:

Core business process
  • Extremely high business losses that pose a threat to the company's existence
  • Serious impairment of the ability to perform business tasks (only possible externally or not at all)
  • Violation of laws with consequences for the business or employees
  • Severe damage to business reputation accompanied by serious loss of market

Important business process
  • Considerable financial losses that do not threaten the company’s existence
  • Activities are restricted to an extent that cannot be tolerated
  • Violation of laws with tolerable consequences
  • Damage to the company's image that is strongly perceived by the market

Supporting business process with medium significance
  • Tolerable business losses
  • Tolerable impairment of the ability to perform tasks
  • Violation of laws with minor consequences
  • Insignificant damage to the company's image that goes unperceived

Supporting business process with low significance
  • No noteworthy business losses
  • No noteworthy restrictions imposed on activities
  • No consequences as a result of violating laws
  • No noteworthy damage to the company's image

Supporting business process with very low significance
  • No impairment of business activities; disruption does not affect the customer’s business activities or operations in any way

Criticality Analysis (Analysis of Continuity Requirements)

The criticality analysis is carried out in consultation with the customer. It identifies the “critical” business processes from among all business processes, including the associated resources such as IT services, and determines the service continuity requirements.

The term ‘critical’ describes business processes whose disruption or outage would have serious consequences for the customer’s business. Accordingly, criticality serves as a measure of the progression of damage over the course of time.

The main objective of the criticality analysis is to identify the customer’s continuity requirements and to define emergency prevention and recovery measures that focus directly on the customer’s essential business activities. Various conditions must be defined before the continuity requirements can be analysed. These include:

  • Damage scenarios and criticality categories
  • Damage progression periods and critical dates/special business periods during which the required process availability deviates from the average

Subsequently, the following must be conducted:

  • Evaluation of criticality of the individual business processes
  • Determination of the Maximum Tolerable Period of Disruption (MTPD) of the business processes
  • Determination of the Recovery Time Objective (RTO) of the process resources (here: the services)
  • Determination of the Recovery Point Objective (RPO) of the data

Damage Scenarios and Criticality Categories

As process criticality cannot be calculated quantitatively, it is classed in qualitative criticality categories. We believe the categories should be based on the following five levels:

  • Very high / critical
  • High
  • Medium
  • Low
  • None

Possible damage scenarios could include:

  • Financial loss
  • Impairment of business activities
  • Violations/breaches of laws, regulations, or contracts
  • Impairment of business reputation (image loss)
  • Impairment of personal integrity
  • Impairment of the right of informational self-determination

In consultation with the customer, it must be agreed which damage scenarios and their individual characteristics are to apply when assessing a process.
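The five categories and the agreed damage scenarios can be combined into a single process rating. The following is a minimal sketch, assuming that the worst-rated scenario determines the overall criticality (a "maximum principle"); the text above does not prescribe an aggregation rule, and the names `Criticality`, `overall_criticality` and the sample ratings are hypothetical.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """The five qualitative categories listed above, ordered by severity."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4


def overall_criticality(ratings: dict[str, Criticality]) -> Criticality:
    """Return the highest rating across all agreed damage scenarios.

    Assumption: the worst scenario determines the process criticality.
    """
    if not ratings:
        raise ValueError("at least one damage scenario must be rated")
    return max(ratings.values())


# Hypothetical ratings for an illustrative order-processing process.
order_processing = {
    "financial_loss": Criticality.HIGH,
    "impairment_of_business_activities": Criticality.MEDIUM,
    "violation_of_laws": Criticality.LOW,
    "image_loss": Criticality.VERY_HIGH,
}
print(overall_criticality(order_processing).name)  # prints VERY_HIGH
```

Other aggregation rules (for example weighting scenarios differently) are equally possible and should be agreed with the customer.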

Maximum Tolerable Period of Disruption (MTPD)

  • The maximum tolerable period of disruption (MTPD) describes the time frame within which all the necessary business processes or activities including the resources that support these processes (such as an IT service) must have been restored (so that the company does not enter a phase where its very existence is under threat). [BSI100-4]
  • The former British Standard BS 25999 defines the MTPD as the duration after which a company’s viability will be irrevocably threatened.
  • The new international standard ISO 22301 makes the MTPD more concrete and defines it as the time it would take for adverse impacts, which might arise as a result of not providing a product/service or performing an activity, to become unacceptable for a company. The standard also uses the term “maximum acceptable outage (MAO)” as a synonym.

Definition as per Germany’s Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik – BSI)

Type of incident | Explanation | Treatment
Disruption | Short-term outage of processes or resources incurring only slight damage | Treatment is part of usual troubleshooting. Restoration within SLAs
Emergency | Longer-term outage of processes or resources with high or very high damage | Treatment requires specific emergency organisation
Crisis | Serious emergency, limited substantially to the institution, threatening its existence or adversely affecting people’s lives. | As crises do not adversely affect surroundings or public life extensively, they can mostly be overcome within the institution itself
Catastrophe | Major emergency not limited in terms of space and time, for example as a consequence of flood or earthquake | From an institution’s point of view, a catastrophe is itself a crisis, and is handled internally by its own emergency organisation in conjunction with external relief organisations.

 

RTO vs RPO Explained

Figure 5: RTO vs RPO

Recovery Time Objective (RTO)

  • The recovery time objective (RTO) indicates the period (starting from the time of suspension) within which the IT service must be restored at least to minimum operation. The RTO must be less than the MTPD [BSI100-4].
  • The ISO 22301 international standard defines the RTO as the period of time following an incident within which the business product or business service must be resumed, activity must be resumed, or resources must be recovered.
  • The IT service can be recovered either by starting emergency operation with reduced capacities and resources, within the same environment as normal operation or at an alternative location, or by providing an alternative IT service using other resources. The “emergency service level” may therefore lie below the service level target originally agreed with the customer.
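The requirement that the RTO be less than the MTPD can be checked mechanically. The sketch below assumes that an IT service supporting several business processes is bound by the shortest MTPD among them; that rule, the function names, and the sample values are illustrative assumptions rather than part of the cited standards.

```python
from datetime import timedelta


def binding_mtpd(process_mtpds: dict[str, timedelta]) -> timedelta:
    """Return the shortest MTPD among the processes a service supports.

    Assumption: the most time-critical process sets the limit for the service.
    """
    if not process_mtpds:
        raise ValueError("the service must support at least one process")
    return min(process_mtpds.values())


def rto_is_valid(rto: timedelta, process_mtpds: dict[str, timedelta]) -> bool:
    """True if the RTO is strictly less than the binding MTPD."""
    return rto < binding_mtpd(process_mtpds)


# Hypothetical example: one IT service supporting two business processes.
mtpds = {"invoicing": timedelta(days=3), "order intake": timedelta(hours=12)}
print(rto_is_valid(timedelta(hours=8), mtpds))   # True: 8 h is below 12 h
print(rto_is_valid(timedelta(hours=24), mtpds))  # False: 24 h exceeds 12 h
```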

Possible RTO (Recovery Time Objective):

DRC 0: No Service Level
DRC 1: < 1 month
DRC 2: < 2 weeks
DRC 3: < 48 hours

Recovery Point Objective (RPO)

  • This value indicates the maximum tolerable level of data loss and imposes requirements in terms of a company’s data backup strategy. It is always important to determine this value in cases where information plays a key role in ensuring the operational capability of a business process [BSI100-4].
  • The ISO 22301 international standard describes the RPO as the point at which the information required for an activity must be restored in order for the activity to resume in the context of emergency operation.
  • The recovery point objective is the time period that is allowed to elapse between two data backups, i.e., the maximum amount of data/transactions that can be lost between the time of the last backup and the system outage.
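The relationship between backup interval and RPO described above can be sketched as follows. The example assumes that every backup completes successfully and that the backup duration is negligible, so the worst-case loss equals one full backup interval; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, outage: datetime) -> timedelta:
    """Time span of data written after the last backup and lost in the outage."""
    if outage < last_backup:
        raise ValueError("the outage cannot precede the last backup")
    return outage - last_backup


def schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case loss (one full backup interval) stays within the RPO."""
    return backup_interval <= rpo


# Hypothetical example: nightly backup at 02:00, outage at 09:30 the same day.
loss = data_loss_window(datetime(2024, 1, 10, 2, 0), datetime(2024, 1, 10, 9, 30))
print(loss)  # 7:30:00
print(schedule_meets_rpo(timedelta(hours=24), rpo=timedelta(days=1)))  # True
```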

Possible RPO (Recovery Point Objective):

DRC 0: No Service Level
DRC 1: < 2 days
DRC 2: < 1 day
DRC 3: < 15 minutes
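As a rough illustration, the two lists above can be used to pick the lowest disaster recovery class that satisfies a given RTO and RPO requirement. This is a sketch under assumptions: "1 month" is approximated as 30 days, a class is treated as sufficient when its bound does not exceed the requirement, and the names `DRC_LEVELS` and `lowest_sufficient_drc` are hypothetical.

```python
from datetime import timedelta
from typing import Optional

# (class, RTO bound, RPO bound) taken from the "Possible RTO" and
# "Possible RPO" lists. Assumption: "1 month" is approximated as 30 days.
DRC_LEVELS = [
    ("DRC 1", timedelta(days=30), timedelta(days=2)),
    ("DRC 2", timedelta(weeks=2), timedelta(days=1)),
    ("DRC 3", timedelta(hours=48), timedelta(minutes=15)),
]


def lowest_sufficient_drc(required_rto: Optional[timedelta],
                          required_rpo: Optional[timedelta]) -> Optional[str]:
    """Return the lowest class whose bounds satisfy both requirements.

    None as a requirement means "no requirement"; a return value of None
    means that no listed class is sufficient.
    """
    if required_rto is None and required_rpo is None:
        return "DRC 0"  # no service level needed
    for name, rto_bound, rpo_bound in DRC_LEVELS:
        rto_ok = required_rto is None or rto_bound <= required_rto
        rpo_ok = required_rpo is None or rpo_bound <= required_rpo
        if rto_ok and rpo_ok:
            return name
    return None


print(lowest_sufficient_drc(timedelta(weeks=3), timedelta(days=2)))   # DRC 2
print(lowest_sufficient_drc(timedelta(hours=72), timedelta(hours=1)))  # DRC 3
```

A requirement stricter than DRC 3 (for example an RTO of one hour) returns `None`, signalling that a solution outside the listed classes would have to be agreed.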

Disaster Recovery Classes by area

DRC 0: No recovery
DRC 1: New assets, recovery in months
DRC 2: Existing assets, recovery in weeks
DRC 3: Dedicated assets, recovery in hours

Area: Disaster Recovery

DRC 0:
  • Backup exists
  • Data loss possible
  • No recovery
  • Existing assets are used for DRC 2

DRC 1:
  • Backup exists
  • Data loss possible
  • Recovery on basis of new assets
  • Existing assets are used for DRC 2

DRC 2:
  • Backup exists
  • Data loss possible
  • Recovery on basis of existing assets (DRC 0 and DRC 1)

DRC 3:
  • Redundant data management
  • Redundant assets
  • Distribution over different fire areas
  • Recovery to up to 50% performance

Area: Supplier responsibility

DRC 0:
  • None

DRC 1:
  • Procurement of assets
  • Network setup
  • Recovery from backup

DRC 2:
  • Procurement of assets
  • DRC 2 -> DRC 0/1 assignment for use of assets
  • Network setup
  • Recovery from backup

DRC 3:
  • Procurement of redundant assets
  • Activation of passive assets
  • Network rerouting

Area: Client responsibility

DRC 0:
  • Classification of systems
  • Prioritisation of system for use for DRC 2

DRC 1:
  • Classification of systems
  • Prioritisation of systems for use for DRC 2
  • Creation of data consistency (Planning before disaster, execution after disaster)

DRC 2:
  • Classification of systems
  • Prioritisation of recovery
  • Creation of data consistency (Planning before disaster, execution after disaster)

DRC 3:
  • Creation of data consistency (Planning before disaster, execution after disaster)

Area: Service level

DRC 0:
  • RTO: None
  • RPO: None
  • EMO: None

DRC 1:
  • RTO: < 1 month
  • RPO: < 2 days
  • EMO: Purchase assets after event

DRC 2:
  • RTO: < 2 weeks
  • RPO: < 2 days
  • EMO: Existing assets for exchange

DRC 3:
  • RTO: < 48 hours
  • RPO: < 48 hours
  • EMO: 50% of redundant assets

 

IT Service Continuity Strategy

The IT Service Continuity Strategy serves to analyse and document potential failure scenarios and to develop suitable options for the fast resumption of the critical IT service and, with it, the critical business activities.

The ITSC strategy ensures that a balance is maintained between the costs arising from the implementation of risk minimization measures and the costs arising from the provision of options for recovering the critical IT services that support critical business processes within a specified timescale.

Among other things, the IT Service Continuity Strategy comprises the following:

  • specification of IT services and thus business processes to be specifically protected,
  • identification of crucial damage scenarios,
  • definition of types of disruption evaluated by the customer as threatening to the organization’s existence,
  • definition of respective response measures

Figure 6: IT Service Continuity Strategy

IT Service Continuity Strategy Options

Recovery of normal service operation, and thus resumption of business activities, can usually be achieved in different ways. The alternatives may, however, differ in parameters such as Recovery Time Objective (RTO), cost, and reliability of the solution. The primary objective is to identify the main alternatives and to choose the most suitable one.

Based on the customer’s corporate goals and core business, a rough IT Service Continuity Strategy is defined. The table below shows possible strategies:

Strategy option | Examples of strategies | Assessment of remaining risk
Minimum | Only business processes/IT services whose criticality is recognized as very high (maximum criticality) are protected. Total costs must be limited to x €. Damage potentials are covered to a great extent by insurances. | High remaining risk
Low | Priority 1 business processes/IT services are protected. Total costs must be limited to x €. | Medium to high remaining risk
Medium | The most important core business processes/IT services are protected. For implementation of continuity measures internal resources should be used as much as possible. | Medium remaining risk
High | Extensive protection of critical business processes/IT services. Compliance with legal regulations and prevention of loss of reputation has top priority. | Low remaining risk

Conditions of ITSC strategy: RTO and Cost-Effectiveness

ITSC Strategy options provide different possibilities for closing the gap between the actual and target situation. Still, the following two requirements must be met when implementing strategy options:

  • The Recovery Time Objective (RTO) defined for the IT service and its resources must be fulfilled.
  • Implementation of the strategy option must be cost-efficient, i.e. costs arising from the implementation of the strategy must be lower than costs expected to arise from the damage caused by the absence or failure of resources needed for service delivery.
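The two conditions above can be expressed as a simple filter over candidate strategy options. The following is a sketch only: the `StrategyOption` fields, the sample figures, and the rule of choosing the cheapest acceptable option are illustrative assumptions, not part of the guidance itself.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class StrategyOption:
    name: str
    achievable_recovery_time: timedelta
    implementation_cost: float  # same currency as expected_damage_cost


def is_acceptable(option: StrategyOption, rto: timedelta,
                  expected_damage_cost: float) -> bool:
    """Check both conditions: RTO fulfilled and cost below expected damage."""
    meets_rto = option.achievable_recovery_time <= rto
    cost_efficient = option.implementation_cost < expected_damage_cost
    return meets_rto and cost_efficient


def cheapest_acceptable(options: list[StrategyOption], rto: timedelta,
                        expected_damage_cost: float) -> Optional[StrategyOption]:
    """Assumption: among acceptable options, the cheapest one is preferred."""
    candidates = [o for o in options if is_acceptable(o, rto, expected_damage_cost)]
    return min(candidates, key=lambda o: o.implementation_cost, default=None)


# Hypothetical options and figures.
options = [
    StrategyOption("restore on new hardware", timedelta(days=20), 10_000.0),
    StrategyOption("standby site", timedelta(hours=24), 80_000.0),
    StrategyOption("active-active", timedelta(hours=1), 400_000.0),
]
choice = cheapest_acceptable(options, rto=timedelta(hours=48),
                             expected_damage_cost=250_000.0)
print(choice.name if choice else "no acceptable option")  # standby site
```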

IT Service Continuity Plan

IT service continuity management is divided into two core processes: proactive implementation of emergency prevention measures (Service Continuity Management) and reactive execution of emergency recovery measures in case of an actual emergency (Service Recovery / Emergency Management). In the following subchapters, different ITSCM approaches, procedures and plans are described.

IT Service Continuity Plans summarize the responsibilities to be assumed and the tasks to be fulfilled in case of a disruption/outage of a critical IT service. The IT Service Continuity Plan serves to proactively identify disruption potentials and to develop suitable strategies for a fast recovery of the disrupted service and thus of the critical business process. The IT Service Continuity Plan should comprise the following:

  • Scope
  • Responsibilities and competencies
  • ITSCM organization, tasks, and procedures
  • Business process & damage analysis (BIA), business continuity requirements, criticality analyses, risk assessment, list of priorities, IT Service continuity requirements
  • ITSCM strategy and options for the various scenarios, cost-benefit analysis, assumption of risk, description of measures for reducing the risk
  • Preparatory recovery measures (organizational and technical)
    (Main objective is to supply, install or upgrade the required equipment within a very short time)
  • Recovery measures (emergency operation)
    (Description of recovering in the emergency operation mode or different alternatives)
  • Measures for returning to normal operation
    (Description of returning to normal operation mode; Given that there are dependencies, the return to normal operation must be coordinated)
  • Alerting and Escalation
    (The alerting and escalation procedures to be adopted in the event of an outage or a disruption are an integral part of the continuity plan.)
  • Testing and exercises
    (Specific scheduling and description of the selected test and practice variations)
  • Maintenance and control
    (Continuous improvement of IT service continuity management on the basis of review and audit results)

IT Service Recovery Plan

IT Service Recovery Plans contain the specific instructions and necessary information for recovering and restoring the IT service. Service recovery plans thus supplement the service continuity plans and provide the basis for the work undertaken by the relevant emergency/recovery teams. Access to these plans (including the contact lists and CMDB) must be ensured even in an emergency. The Service Recovery Plan should comprise the following:

  • Responsibilities
  • Alerting and escalation, central situation office
  • ITSC/Emergency Management Organization, Emergency/Crisis Squad
  • Tasks and competencies
  • Immediate measures
  • Recovery and restoration strategy: Possible options for action, maximum tolerable period of disruption (MTPD), recovery time objective (RTO)
  • Recovery procedure: Implementation of measures, monitoring to ensure timely implementation, assessment of measures
  • Emergency operation: A description of emergency operation, including any restrictions that apply to the SLA or service support time
  • Interfaces and dependent components
  • Return to normal operation: Description of an approach for returning to normal operation mode; once the resources required for normal operation are available again, the service has to be switched from emergency operation back to normal operation.
  • Analysis of emergency response and service recovery: The emergency response and service recovery should be analysed so that improvement measures can be adopted for any weaknesses identified. In addition to the suggestions for improvement, the follow-up work on the emergency response, including service recovery and restoration, also includes producing an overall report.

Continuity Tests

Testing is a critical element of IT Service Continuity Management and is the only method for verifying whether the strategies, agreements, plans and procedures implemented work in practice. In addition, tests and exercises verify the assumptions underlying the concepts applied.

Tests and exercises verify that individual measures or whole sets of measures have been implemented correctly and that they function technically. Furthermore, exercises provide information about the quality of the documentation, i.e. whether it enables service staff members to perform their tasks in an emergency or during unplanned non-availability.

The IT service continuity plans must be tested once a year as part of a regular check, particularly with regard to the service continuity & service availability requirements.

There are different types of testing and exercising. These range from a straightforward review of individual measures right through to complex testing/exercising of a simulated emergency situation. Tests and exercises require very close cooperation from the customer.

Walk-through Test

During the straightforward desk check (also referred to as the “table top exercise” or “walk-through test”), the participants work through the plans on a theoretical level and check the plausibility of their contents and the assumptions made. The functionality of the contents described is evaluated “as is”. A more complex variant is also possible, whereby a scenario is defined and worked through in theory. This test type is easy to implement and is used for initial validation.

Functional Test

During this type of test, individual elements (such as procedures, sub-processes or groups of systems) are tested for their functionality. In ITSCM, the recovery of individual elements, e.g. a remote connection, should be tested in terms of functionality. This type of test should supplement the full test and must not be used as a substitute for it.

Simulation Test

This test can be used to test particular responses and schedules under specific conditions through realistic simulations for specific events or situations.

Communication and Alerting Exercise in ITSCM

This exercise verifies the ITSCM procedures for reporting, escalation and alerting.

Emergency/Crisis Squad Exercise in ITSCM

The exercise of the emergency/crisis squad (also called an emergency/crisis management exercise) trains cooperation within the SI Emergency Squad and collaboration between the SI Emergency Squad and the operational emergency teams.

Continuity Tests Guidance

We recommend performing at least one or two functional tests per year on the critical service infrastructure. Each test should cover selected infrastructure components rather than all of them. For example, the general data center failover mechanisms could be tested in year one, selected platform failovers in year two, and the application failovers in year three.

It is neither possible nor sensible to test everything each year. Careful multi-year planning should therefore be used to keep everyone informed about what has been tested and with which results, which improvement measures are outstanding, and what will be tested in the near future.
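A multi-year rotation of this kind can be written down as a small planning helper. The sketch below uses the three scopes from the example in the text; repeating the cycle after year three is an assumption, and the names `TEST_SCOPES`, `scope_for_year` and `plan` are hypothetical.

```python
# Scopes taken from the example in the text; repeating the cycle is an assumption.
TEST_SCOPES = (
    "general data center failover mechanisms",
    "platform failovers",
    "application failovers",
)


def scope_for_year(year_number: int) -> str:
    """Return the functional-test scope for a 1-based planning year."""
    if year_number < 1:
        raise ValueError("year_number starts at 1")
    return TEST_SCOPES[(year_number - 1) % len(TEST_SCOPES)]


def plan(years: int) -> list[tuple[int, str]]:
    """Build the (year, scope) schedule for the given number of years."""
    return [(year, scope_for_year(year)) for year in range(1, years + 1)]


for year, scope in plan(4):
    print(f"Year {year}: {scope}")
```

In practice such a plan would also record the results and the open improvement measures for each completed test.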