Surviving Disaster
Felix Mohan, CEO - SecureSynergy
 

Introduction
In a world of real time transactions and just-in-time inventory, losing a phone system may wreak more havoc than a fire in the building. Every business faces minor downtimes, and major unknowns; hence it is important to have plans in place that guarantee business continuity.

Before the September 11 attacks, Business Continuity Planning (BCP) was often regarded as an expense that brought no return on investment. Events like 9/11 serve as constant reminders that every company needs plans to ensure its own continuity and the continuity of its suppliers and logistics. BCP costs relatively little compared with what a company could lose in a major incident. It is therefore prudent for organizations of all sizes to research and develop a plausible and efficient contingency plan.

Business continuity and disaster recovery planning are now accepted as basic requirements for every business organisation. It is widely accepted that a detailed Disaster Recovery Plan should not only exist, but should be up to date. It should reflect the actual on-going needs of the business activity or function.

In recent years, data has become a critical corporate asset essential to business continuity. The ability to recover crucial data quickly after a disaster is a fundamental requirement of economic viability.

With good plans in place, frequently tested, and well audited, there is every possibility that an organisation will cope with adverse events and continue in business, satisfying its customers, meeting its commitments and making a return on investment.

Contingency planning is considered a part of an organization's risk assessment and other security programs. The team responsible for contingency planning must be aware of risks to the system and recognize whether the current contingency plan addresses residual risks completely and effectively.

 
 

IT contingency planning represents a broad scope of activities designed to sustain and recover critical IT services following an emergency. Contingency planning fits into a much broader emergency preparedness environment that includes organisational and business process continuity and recovery planning.

An organisation selects the plans it needs to prepare for response, recovery, and continuity in case of disruptions affecting its IT systems, business processes, and facilities. Because there is an inherent relationship between an IT system and the business process it supports, the plans should be coordinated during development and updating to ensure that the strategies and supporting resources are in line with the contingency objective.

Crisis Management Plan
A Crisis Management Plan is designed to maximize human survival and preservation of property, minimize danger, restore normal operations, and assure responsive communications. Crisis Management Plan procedures should be coordinated with all other plans and can become an appendix to the Business Continuity Plan.

Cyber Incident Response Plan
The Cyber Incident Response Plan establishes the procedures to address cyber attacks against an organization's IT system. This plan details the procedures that enable security personnel to identify, mitigate, and recover from malicious security incidents, such as unauthorized access to a system or data, denial of service, or unauthorized changes to system hardware, software or data (for example, by malicious logic such as a virus, worm or Trojan horse).

Disaster Recovery Plan
Disaster Recovery Planning involves developing a plan and preparing for a disaster before it takes place, in the hope of minimizing loss and ensuring the availability of critical systems and personnel. It consists of a set of activities aimed at reducing the likelihood and limiting the impact of disaster events on critical business processes.

Other objectives of disaster recovery planning include:

 
:: Providing a sense of security
:: Minimizing risk of delays
:: Guaranteeing the reliability of standby systems
:: Providing a standard for testing the plan
:: Minimizing decision-making during a disaster
 

More than off-site storage or backup processing, organizations should develop written, comprehensive disaster recovery plans that address all the critical operations and functions of the business. The plan should include documented and tested procedures, which, if followed, will ensure the ongoing availability of critical resources and continuity of operations.

A disaster recovery plan is similar to liability insurance: it provides a certain level of comfort in knowing that if a major catastrophe occurs, it will not result in financial disaster. Insurance alone is not adequate, however, because it may not compensate for the incalculable loss of business during the interruption or for business that never returns.

Elements of a Successful Disaster Recovery Plan Program

Information collection

The business processes of a company need to be identified, together with the components of IT infrastructure used to support each process. IT infrastructure components may include application software, servers, operating systems, data and storage systems, local and wide area networks and client systems including PCs and peripherals.

This process also measures the impact of unplanned interruption on each business process and IT infrastructure.

Risk analysis
The basic steps of risk analysis include:

 
:: Identify business processes and their associated IT infrastructure resource requirements (i.e., the data, application, systems, and networks that are used in delivering the business process). Prioritize business processes according to time sensitivity and criticality.
:: Identify existing threats to business processes and infrastructure resources
:: Define objectives for strategies to eliminate avoidable risks and to minimize the impact of risks that cannot be eliminated

 

The ability of a company to cope with the interruption of a business process determines the tolerance of the business process. In practical terms, tolerance may be expressed as a rupee value: the cost to the company if the business process is interrupted for a given period of time.
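
Expressing tolerance as a rupee value can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: the process names and hourly loss figures are hypothetical, and the cost is assumed to grow linearly with outage duration, which real businesses may not experience.

```python
from dataclasses import dataclass


@dataclass
class BusinessProcess:
    name: str
    cost_per_hour: float  # rupees lost per hour of interruption (hypothetical figure)


def outage_cost(process, hours):
    """Cost of interrupting `process` for `hours`, assuming a constant hourly loss."""
    return process.cost_per_hour * hours


def rank_by_cost(processes, hours):
    """Order processes from most to least costly for an outage of `hours`."""
    return sorted(processes, key=lambda p: outage_cost(p, hours), reverse=True)


# Hypothetical processes and loss rates, for illustration only.
processes = [
    BusinessProcess("payroll", 20_000),
    BusinessProcess("online order entry", 500_000),
    BusinessProcess("internal reporting", 5_000),
]

for p in rank_by_cost(processes, hours=8):
    print(f"{p.name}: Rs {outage_cost(p, 8):,.0f}")
```

Ranking processes this way gives a rough first ordering of time sensitivity and criticality, which the risk analysis can then refine.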

Critical applications are defined as such because, regardless of the duration of the outage or the time of month in which it occurs, there are no substitute methods for providing the functions of the application. Electronic commerce applications used by on-line brokers, for example, are clearly mission critical.

These critical resources are described as single points of failure. Within any complex system, there are usually components or processes that, if not replicated or otherwise backed up by redundant capabilities, represent points of failure for the entire system. A large part of disaster avoidance planning comes down to identifying single points of failure, wherever they exist, and eliminating them.
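
One simple way to look for single points of failure is to list, for each infrastructure role, the components that can perform it, and flag every role served by only one component. The sketch below assumes such an inventory exists; the role and component names are hypothetical.

```python
def single_points_of_failure(inventory):
    """Return the roles that are served by exactly one component (no redundancy)."""
    return sorted(
        role for role, components in inventory.items() if len(components) == 1
    )


# Hypothetical inventory: role -> components able to perform that role.
inventory = {
    "database server": ["db-01"],
    "web server": ["web-01", "web-02"],
    "WAN link": ["isp-a"],
    "power feed": ["utility", "ups"],
}

for role in single_points_of_failure(inventory):
    print(f"No redundancy for: {role}")
```

A real assessment would also have to consider shared dependencies (for example, two servers on the same power feed), which this flat inventory does not capture.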

Identify risks by phenomenon, including:

 
:: Water damage (whether from leaky pipes or floods)
:: Fire (heat) damage (whether from arson, equipment overheating, environment contamination, lightning strikes, etc)
:: Power failure (originating at the customer premises or across the power grid)
:: Network failure (LAN or WAN, whether component or link based)
:: Mechanical hardware failure or software failure (whether due to human error, short circuits, normal parts wear and tear, or building collapse following an earthquake)
:: Accidental or deliberate destruction or corruption of hardware, software, or data (by hackers, disgruntled employees, industry saboteurs, terrorists, or misbehaving software)
:: Other causes (including forced evacuation for environmental hazards, aircraft crashes, etc.)
 

Disaster Avoidance System
The purpose of disaster avoidance systems is to provide an automated mechanism for detecting certain disaster potentials (and to respond to them, where possible) before they develop into unplanned interruptions of normal business processes.

These include:

 
:: Systems for water detection that can provide early warning of leaks and water-related hazards
:: Systems for the detection of pre-ignition gases, smoke and other indicators of impending fire to enable proactive response that will ensure the health and safety of personnel and prevent the loss of data and equipment to fire
:: Systems for the detection of airborne contamination levels that are associated with employee illness, data loss, equipment malfunction, and fires
:: Systems for the suppression of fires
:: Systems for the continuation of electrical power in the presence of a utility power outage
:: Systems for the physical security of corporate computing and telecommunication facilities
 

Water Detection
Water can intrude into sensitive information processing and storage facilities, as well as user work areas, in a variety of ways and from a variety of sources. Some common sources of flooding are:

 
:: Facility plumbing leaks
:: Air conditioning
:: Water cooling systems
:: Sprinkler systems
 

Detection systems, ranging from simple battery-operated alarms to sophisticated sensing cables and ceiling grids, are available to detect the presence of water wherever it is found and either sound an audible alarm or relay a hazard alert message to a system or network management console.

Fire Suppression
Fire prevention begins with facility design and construction. Fire-resistant construction materials, firewall placement and facility compartmentalization can play major roles in limiting the scope, duration and destructiveness of a fire.

Contamination Detection
The entry of airborne contaminant particles into electrical equipment can cause short circuits and even flash fires in electronic equipment.

White-glove method: This method assesses the level of contamination by wiping the exposed surface of a piece of equipment with a white glove. The particulate on the glove is analysed to determine the type of contaminant in the data center environment.

Aspirating pump: This method involves the installation of a pump. Air samples are collected through an air intake, and the contents are analysed.

Precombustion Detection
Heat and smoke detectors are available in a variety of types, shapes, and sizes to alert personnel to hazardous conditions. These include:

 
:: Photoelectric detectors, which detect the smoke produced by smoldering fires that involve PVC insulation
:: Ionization detectors, which detect fires involving more flame than smoke
:: Temperature detectors, which detect heat in excess of a preset value
:: Rate of rise heat detectors, which trigger when the rate of temperature increase exceeds a preset threshold (useful in environments subject to significant ambient temperature changes such as nuclear power generation facilities or heavy manufacturing environments)
:: Air sampling detectors, which detect the invisible by-products of materials as they degrade during the pre-combustion stages of fire
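
The difference between a fixed-temperature detector and a rate of rise detector can be shown with a short sketch. The readings, sampling interval and thresholds below are hypothetical values chosen for illustration, not recommended settings.

```python
def fixed_temperature_alarm(readings_c, threshold_c):
    """True if any reading exceeds the preset temperature value."""
    return any(t > threshold_c for t in readings_c)


def rate_of_rise_alarm(readings_c, interval_min, max_rise_c_per_min):
    """True if temperature rises faster than the preset rate between consecutive readings."""
    return any(
        (later - earlier) / interval_min > max_rise_c_per_min
        for earlier, later in zip(readings_c, readings_c[1:])
    )


# Hypothetical readings taken one minute apart, in degrees Celsius.
readings = [24.0, 25.0, 33.0, 41.0]

print("fixed-temperature alarm:", fixed_temperature_alarm(readings, threshold_c=57.0))
print("rate of rise alarm:", rate_of_rise_alarm(readings, interval_min=1.0, max_rise_c_per_min=6.0))
```

In this example the room is still well below the fixed threshold, yet the rapid climb between readings is enough to trip the rate of rise check, which is why the two detector types are complementary.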
 

Power Failure
Interruptions in electrical power can result from a variety of factors, such as:

 
:: Transformer failure or line damage
:: Natural disasters and damage from severe weather.
:: Utility company outages
:: Inadequate power-handling capacity in multi or single-tenant buildings.
:: Sabotage and terrorism
 

Providing alternatives or backups for the facility power supply is one method to insulate the company against external conditions that are beyond its ability to control.

Day-to-day power-related problems should also be addressed, such as line dips and surges, transverse- and common-mode interference or noise and, in some areas, brownouts.

An Uninterruptible Power Supply (UPS) affords protection against both momentary and prolonged outages.

UPS vendors also add further intelligence, such as:

 
:: Simple Network Management Protocol (SNMP) support, enabling the transmission of information about UPS status to system or network management software
:: Event logging, providing the means to store information about power events to facilitate the troubleshooting of power-related problems
:: Temperature monitoring, capturing temperature information on UPS components as well as power outlets and signaling operations if preset thresholds are exceeded
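
The event logging and temperature monitoring features can be pictured with the minimal sketch below. It does not use any vendor's interface: the `UpsMonitor` class, its method names and the threshold value are all hypothetical, and a real UPS would typically expose this information through its own management software or SNMP.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class UpsMonitor:
    """Hypothetical monitor that logs events and checks a preset temperature threshold."""
    max_temp_c: float
    events: list = field(default_factory=list)

    def log(self, kind, detail, when=None):
        # Store a timestamped record to help later troubleshooting.
        self.events.append((when or datetime.now(), kind, detail))

    def record_temperature(self, sensor, temp_c, when=None):
        """Log the reading; return True if operations staff should be signalled."""
        exceeded = temp_c > self.max_temp_c
        kind = "temperature_alarm" if exceeded else "temperature"
        self.log(kind, f"{sensor}={temp_c}C", when)
        return exceeded


monitor = UpsMonitor(max_temp_c=40.0)  # hypothetical threshold
monitor.log("power_event", "utility power lost, running on battery")
if monitor.record_temperature("battery", 45.0):
    print("Signal operations: battery temperature above threshold")
```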

 
Other alternatives include:
 
:: On-site power generation: Vendors offer self-generation packages, sized to meet specific load needs, that are operated and maintained by the provider
:: Deliverable power generation: For companies that prefer not to go into power generation, a second class of portable power providers sells UPS systems and wiring services together with a contract to appear on site with a portable generator (or to deliver power from a separate generating facility) in an emergency or during periods when utility outages are expected
:: Electrical power loss insurance: In certain countries, policies are available that compensate a business customer for losses incurred during a power outage

 

Data Recovery Planning
A successful business recovery comes down to a simple axiom: shorten the time to data.

For the company experiencing an unplanned interruption in time-sensitive, mission-critical business processes, the primary objective is to establish access to application data quickly and by whatever means possible.

Time to data is a determinant of post-disaster business survival. Creating strategies to shorten time to data is the primary mission of disaster recovery planning.

Once provisions have been made to minimize the likelihood of avoidable disasters, attention turns to developing strategies to restore infrastructure supports for critical business processes in the wake of disasters that cannot be effectively avoided.

The required speed of restoring data to a usable form is determined by the sensitivity of the company to the duration of an unplanned interruption. To address different degrees of sensitivity, several techniques of data restoration have evolved over time. These include:

 
:: Routine data backup to magnetic tape using backup/restore software, and the removal of backups to off-site storage. Recovery requires the retrieval of stored backup tapes, their transport to the system recovery facility, and the restoration of data to a new storage platform via software
:: Routine data backup to an electronic "tape vault" via a wide area network interconnect or the internet. Restoration may require the physical retrieval of tapes and their transport to a system recovery site, or it may be possible using a WAN link between the recovery system and the electronic tape vault
:: Remote mirroring of data to a second (or third) storage platform via WAN interconnect. Restoration is unnecessary: the recovery system is connected via WAN links to the remote mirror array, or the remote mirror may be located at the system recovery site
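
A rough estimate of time to data helps when comparing these techniques. The sketch below is a back-of-the-envelope model only: the data volume, retrieval and transport times, restore rate and link speed are hypothetical, and it ignores protocol overhead, verification and application restart time.

```python
def tape_restore_hours(data_gb, retrieval_h, transport_h, restore_gb_per_h):
    """Time to data for off-site tape: fetch tapes, transport them, then restore."""
    return retrieval_h + transport_h + data_gb / restore_gb_per_h


def wan_restore_hours(data_gb, link_mbps):
    """Time to data when restoring from an electronic vault over a WAN link.

    Uses decimal units: 1 GB = 8000 megabits. Assumes the link is fully available.
    """
    seconds = data_gb * 8000 / link_mbps
    return seconds / 3600


# Hypothetical figures for a 500 GB data set.
tape = tape_restore_hours(data_gb=500, retrieval_h=2, transport_h=3, restore_gb_per_h=100)
wan = wan_restore_hours(data_gb=500, link_mbps=100)

print(f"Off-site tape:    {tape:.1f} hours")
print(f"Electronic vault: {wan:.1f} hours")
print("Remote mirror:    no restore step; data is already on the mirror array")
```

Which technique comes out ahead depends heavily on the inputs, so the point of the calculation is to compare each estimate against the tolerance of the business process rather than to declare one technique faster in general.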

 

System Recovery
There is interdependence between the centralized system backup strategies and the data protection strategies, along with other disaster recovery plan elements.

Once application criticality is defined, the risk analysis goes further to identify the hardware (both CPUs and storage devices) used by the application in performing the critical or vital business function. During emergency operations the business may be able to settle for far less processor and storage capacity than it normally uses.

If critical and vital applications run on several homogeneous or compatible processors in normal business processes, it may be possible to replace several low-end servers with one higher-end server (server consolidation). With the right operating system software, even applications that reside on heterogeneous processors may be able to run on one processor. Again, the total capacity requirement for the backup server and related storage devices may be substantially less than that of the production environment.

The net result of this analysis is called the minimum acceptable hardware configuration, which must be implemented quickly in the event of a disaster.
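
As a rough illustration, the minimum acceptable hardware configuration can be approximated by adding up the emergency-mode requirements of only the critical applications. The application names and capacity figures below are hypothetical, and simple addition is itself an assumption: consolidation onto one server may need more or less headroom in practice.

```python
from dataclasses import dataclass


@dataclass
class Application:
    name: str
    critical: bool
    cpu_cores: int   # cores needed during emergency operations (hypothetical)
    storage_gb: int  # storage needed during emergency operations (hypothetical)


def minimum_configuration(apps):
    """Sum the emergency-mode requirements of critical applications only."""
    critical = [a for a in apps if a.critical]
    return {
        "cpu_cores": sum(a.cpu_cores for a in critical),
        "storage_gb": sum(a.storage_gb for a in critical),
    }


apps = [
    Application("order entry", critical=True, cpu_cores=8, storage_gb=400),
    Application("billing", critical=True, cpu_cores=4, storage_gb=250),
    Application("data warehouse", critical=False, cpu_cores=32, storage_gb=5000),
]

print(minimum_configuration(apps))
```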

Hot Sites
Hot sites are fully equipped IT operations facilities ready to operate within a few hours. The alternate data processing site contains the same set of hardware and software as the primary site.

Cold Site
Under the cold site strategy, the organisation prepares in advance a facility with the requisite physical capabilities to serve as an alternate data processing site.
The facility may be used for other purposes, including off-site storage or new employee training, when not in use for disaster recovery.

Reciprocal Backup Agreement
In this arrangement, two companies having spare processing time and compatible hardware capabilities agree, formally or informally, to back up each other's critical applications. For example, in a simple arrangement, if Company A experienced a disaster, Company B would allow Company A to restore its critical applications on Company B's hardware. The reverse would be the case if Company B had a failure.

Redundant System
In the event of a disaster, redundant systems at a separate facility, which must be distant enough not to have been affected by the same disaster, are brought online. Users are either transported to an operations center that is co-located with the backup site or are provided remote access to the backup CPU via a pre-established data communications network.

Service Bureaus
An organisation may contract with a service bureau to provide all alternate backup-processing services. The main advantages of this type of arrangement are the quick response and availability of the service bureau, the ability to test, and the fact that the service bureau may be available for more than backup alone.

Network Recovery
Network recovery plan formulation involves the development of at least three discrete recovery strategies to cover:

:: Internal enterprise network (defined as departmental or workgroup LANs interconnected via a switched or routed backbone network, as well as separate or converged telephony networks)
:: "Local Loop" (Local Exchange Carrier services connecting the company facility to the LEC central office) and WAN
:: Network relocation, providing a means to rebuild mission-critical internal network services and to reroute WANs and telephony services to alternate end user and/or systems recovery sites in the wake of a disaster

To assist in formulating effective strategies, it may be useful to define a loss scenario that will guide planning. In developing an internal network recovery strategy, for example, DR coordinators may wish to use a scenario of media or equipment failure. This scenario-based approach has the benefit of enabling flexible response to network interruptions of different kinds. It also provides a basis for analyzing and implementing preventive measures to protect against certain types of outages.

Strategies for End User Recovery
End user recovery includes:

 
:: The location and provisioning of backup end user work facilities
:: The notification of employees who will staff the recovery site
:: The transportation of employees to the recovery site
:: The redirection of ground mail, telecommunications, and data networks to the recovery site
:: The acquisition of supplies at the recovery site
:: The application of remote access technologies for operating mission critical applications from the user recovery site
:: An emergency evacuation plan for personnel from the corporate premises in the event of a hazardous or life-threatening disaster event
 

Testing
Regular testing is required for effective DRP implementation. Effective DRP testing should ensure that:

 
:: The DRP is tested to the fullest extent possible
:: The associated costs are not prohibitive
:: Service disruptions are minimal or non-existent
:: The tests provide a high degree of assurance in recovery capability
:: Evaluation of test results provides quality input to DRP maintenance
 

The Cycle Testing Paradigm:
Cycle testing consists of a series of exercises utilizing multiple methodologies that often increase in complexity and length from one phase to the next. The results of each test are assessed individually; improvements and error corrections are applied to the plan prior to beginning the next phase. At the end of the cycle the entire plan has been completely evaluated; in fact, many portions of the plan will have been tested, assessed and updated multiple times. Small logistical errors that could prove to be major obstacles in full scale testing are isolated and removed from the plan. The iterative framework of the test cycle provides continuous DRP evolution.

In the volatile world of Information Technology, hardware and software upgrades, configuration changes and even business process life cycling can occur quickly in response to market demands and new service requirements. Cyclic recovery tests provide an efficient pathway to DRP maintenance by early recognition and correction of such problems. At the end of each exercise and prior to the next, comprehensive debriefing, audit and analysis are required in order to update the current test plan as well as each of the following phases of the cycle.

 
Illustration of a DRP Cycle Testing Scenario
 

Checklist testing:
Checklists are the DRP consultant's most valuable tool. They are inexpensive to implement and maintain and provide the backbone of the testing cycle. The checklists are team oriented and if used to their full potential provide multiple benefits.

For each business process, partition out areas of responsibility, select teams appropriate to the specific nature of the partition and allow the cumulative experience of the group to develop the checklist as appropriate. The grassroots involvement heightens recovery awareness and buy-in as the team members get a sense that their input is an integral component of the process.

A checklist test can be used to validate multiple components of the DRP, for example:

 
:: Emergency Call Tree verification
:: Key procedure validation
:: Hardware and software configuration documentation complete and current
:: Availability of process specific resources during DRP implementation
:: Tape backup libraries are complete and current with existing configuration
:: Recovery plan and all necessary operational manuals
 

Walk Through Testing:
Team members verbally "walk through" the specific steps as documented in the plan to confirm effectiveness, identify gaps, bottlenecks or other weaknesses in the plan. Often used in conjunction with previously validated checklist plans, this test provides the opportunity to review the plan with a larger subset of people allowing you to draw upon a correspondingly increased pool of knowledge and experiences. Staff will be familiarized with procedures, equipment and offsite facilities if required.

Simulation Testing:
In a simulation test, a disaster is simulated so that normal operations are not interrupted. Hardware, software, personnel, communications, procedures, supplies and forms, documentation, transportation, utilities, and alternate site processing should be thoroughly tested in a simulation test. Extensive travel, moving equipment, and eliminating voice or data communications may not be practical or economically feasible during a simulated test. However, validated checklists can provide a reasonable level of assurance for many of these scenarios.

The simulation test should be considered advanced and only implemented after the previous checklist and walk through tests have been validated. The output of the previous tests should be carefully analyzed before the proposed simulation to ensure that the lessons learned during the previous phases of the cycle have been applied.

Parallel testing:
A parallel test can be performed in conjunction with the checklist test or simulation test. Under this scenario, historical transactions, such as the prior business day's transactions, are processed against the preceding day's backup files at the contingency processing site or hot site. All reports produced at the alternate site for the current business date should agree with those produced at the primary processing site.
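
The comparison step of a parallel test can be sketched as follows. This is an illustrative assumption about how the check might be automated, not a prescribed procedure: the report names and contents are hypothetical, and digests are compared so that only short fingerprints, rather than full reports, need to be exchanged between sites.

```python
import hashlib


def digest(report_text):
    """SHA-256 fingerprint of a report's contents."""
    return hashlib.sha256(report_text.encode("utf-8")).hexdigest()


def mismatched_reports(primary, alternate):
    """Names of reports that are missing at either site or whose contents differ."""
    names = set(primary) | set(alternate)
    return sorted(
        name
        for name in names
        if name not in primary
        or name not in alternate
        or digest(primary[name]) != digest(alternate[name])
    )


# Hypothetical reports keyed by name, as produced at each site.
primary_site = {"daily_ledger": "total=1500", "settlements": "count=42"}
alternate_site = {"daily_ledger": "total=1500", "settlements": "count=41"}

problems = mismatched_reports(primary_site, alternate_site)
print("Reports that do not agree:", problems or "none")
```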

Full-interruption testing:
A full-interruption test activates the total disaster recovery plan. The test is likely to be costly and could disrupt normal operation, and therefore should be approached with caution. Again, the importance of due diligence with respect to previous phases of the cycle cannot be overstated.

It is important to note that the test cycle can include one or more of the advanced testing methods. A test cycle should consist of a minimum of three phases: a checklist test, a walk through test, and at least one of the advanced testing methods.

Training
Training should be provided at least annually; newly recruited personnel who will be assigned planning responsibilities should receive training shortly after they join. Contingency planning personnel should be trained to execute their recovery procedures without referring to the actual document.

Plan Maintenance
A Disaster Recovery Plan may become obsolete if the organisation reorganizes and its critical business units differ from those in place when the plan was first created. Most commonly, a plan becomes outdated through changes in the location or configuration of hardware, software, and other components.

Role of a consultant
A consultant brings specialized knowledge to the planning that may facilitate the speedy development of an effective plan. A consultant who works within a specific industry may combine an understanding of the industry with a methodology for disaster recovery planning. This reduces the learning curve, which in turn can help to speed plan development. A consultant can also bring a fresh eye to the project, noticing recovery requirements that may be overlooked by someone who is too close to the data center he or she is seeking to protect.

Conclusion
Disaster recovery planning involves more than off-site storage or backup processing. Organizations should develop written, comprehensive disaster recovery plans that address all the critical operations and functions of the business. The plan should include documented and tested procedures, which, if followed, will ensure the ongoing availability of critical resources and continuity of operations.

 
 
Disaster statistics:
http://www.drplanning.org/
Other references:
http://www.disasterrecoveryworld.com/
http://www.disaster-recovery-plan.com/
http://www.drii.org/index.cfm
http://www.drj.com
http://www.disaster-resource.com/
http://csrc.nist.gov/publications/nistpubs/800-34
 
 
Updated: 01 June 2004
 
 
All rights reserved :: Confederation of Indian Industry (CII) © Copyright 2004-2008