Surviving Disaster
Felix Mohan, CEO - SecureSynergy
Introduction
In a world of real-time transactions and just-in-time
inventory, losing a phone system may wreak more
havoc than a fire in the building. Every business
faces minor downtimes, and major unknowns; hence
it is important to have plans in place that guarantee
business continuity.
Before the September 11 attack, Business Continuity
Planning was considered an expenditure that brought
no return on investment. Events like 9/11 serve as
constant reminders that it is vital for every company
to have plans in place to ensure business continuity,
and the continuity of suppliers and logistics. BCP
costs relatively little compared with what the company
could potentially lose in a major incident. It is
therefore prudent for organizations of all sizes to
research and develop a plausible and efficient
contingency plan.
Business continuity and disaster recovery planning
are now accepted as basic requirements for every
business organisation. It is widely accepted that
a detailed Disaster Recovery Plan should not only
exist, but should be up to date. It should reflect
the actual on-going needs of the business activity
or function.
In recent years, data has become a critical corporate
asset essential to business continuity. The ability
to recover crucial data quickly after a disaster
is a fundamental requirement of economic viability.
With good plans in place, frequently tested, and
well audited, there is every possibility that
an organisation will cope with adverse events
and continue in business, satisfying its customers,
meeting its commitments and making a return on
investment.
Contingency planning is considered part of an organization's
risk assessment and other security programs. The
team responsible for contingency planning must
be aware of risks to the system and recognize
whether the current contingency plan is able to
address residual risks completely and effectively.
IT contingency planning represents
a broad scope of activities designed to sustain
and recover critical IT services following an
emergency. Contingency planning fits into a much
broader emergency preparedness environment that
includes organisational and business process continuity
and recovery planning.
An organisation selects plans to prepare properly
for response, recovery, and continuity in case of
disruptions affecting its IT systems, business
processes, and facilities. Because there is an
inherent relationship between an IT system and the
business process it supports, the plans should be
coordinated during development and updating to ensure
that the strategies and supporting resources are in
line with the contingency objective.
Crisis Management Plan
A Crisis Management Plan is designed to maximize
human survival and preservation of property, minimize
danger, restore normal operations, and assure
responsive communications. Crisis Management Plan
procedures should be coordinated with all other
plans and can become an appendix to the Business
Continuity Plan.
Cyber Incident Response Plan
The Cyber Incident Response Plan establishes
the procedures to address cyber attacks against
an organization's IT system. This plan details
the procedures that enable security personnel to
identify, mitigate, and recover from malicious
security incidents, such as unauthorized access
to a system or data, denial of service, or unauthorized
changes to system hardware, software or data (malicious
logic, such as a virus, worm or Trojan horse).
Disaster Recovery Plan
Disaster Recovery Planning involves developing
a plan and preparing for a disaster before it
takes place in the hopes of minimizing loss and
ensuring the availability of critical systems
and personnel. It consists of a set of activities
aimed at reducing the likelihood and limiting
the impact of disaster events on critical business
processes.
Other objectives of disaster recovery planning
include:
:: Providing a sense of security
:: Minimizing risk of delays
:: Guaranteeing the reliability of standby systems
:: Providing a standard for testing the plan
:: Minimizing decision-making during a disaster
More than off-site storage or
backup processing, organizations should develop
written, comprehensive disaster recovery plans
that address all the critical operations and functions
of the business. The plan should include documented
and tested procedures, which, if followed, will
ensure the ongoing availability of critical resources
and continuity of operations.
A disaster plan is similar to liability
insurance: it provides a certain level of comfort
in knowing that if a major catastrophe occurs,
it will not result in financial disaster. Insurance
alone is not adequate, however, because it may not
compensate for the incalculable loss of business
during the interruption, or for business that never
returns.
Elements of a Successful Disaster Recovery Plan
Program
Information collection
The business processes of a company need to be
identified, together with the components of IT
infrastructure used to support each process. IT
infrastructure components may include application
software, servers, operating systems, data and
storage systems, local and wide area networks
and client systems including PCs and peripherals.
This process also measures the impact of an unplanned
interruption on each business process and its IT
infrastructure.
Risk analysis
The basic steps of risk analysis include:
:: Identify business processes and their associated
   IT infrastructure resource requirements (i.e.,
   the data, applications, systems, and networks
   that are used in delivering the business process).
   Prioritize business processes according to time
   sensitivity and criticality.
:: Identify existing threats to business processes
   and infrastructure resources.
:: Define objectives for strategies to eliminate
   avoidable risks and to minimize the impact of
   risks that cannot be eliminated.
The ability of a company to cope
with the interruption of a business process determines
the tolerance of the business process. In practical
terms, tolerance may be expressed as a rupee value:
the cost to the company if the business process
is interrupted for a period of time.
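As a rough illustration of expressing tolerance as a rupee value, the sketch below multiplies an assumed hourly loss by the length of the outage. The process names, the hourly figures and the linear cost model are all hypothetical; a real assessment may find that losses grow faster than linearly with time.

```python
from dataclasses import dataclass


@dataclass
class BusinessProcess:
    name: str
    loss_per_hour_inr: float  # assumed cost of one hour of interruption


def outage_cost(process: BusinessProcess, hours: float) -> float:
    """Estimated cost in rupees of interrupting `process` for `hours`."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return process.loss_per_hour_inr * hours


# Hypothetical figures for illustration only.
processes = [
    BusinessProcess("payroll", 15_000.0),
    BusinessProcess("order entry", 200_000.0),
]

# Rank processes by the cost of an assumed 8-hour outage, most costly first.
ranked = sorted(processes, key=lambda p: outage_cost(p, 8), reverse=True)
for p in ranked:
    print(f"{p.name}: Rs {outage_cost(p, 8):,.0f}")
```

Ranking processes this way gives one possible input to the prioritization by time sensitivity and criticality.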
Critical applications are defined as such because,
regardless of duration of the outage or the time
of month in which an outage occurs, there are
no substitute methods for providing the functions
of the application. Electronic commerce applications
used by on-line brokers, for example, are clearly
mission critical.
These critical resources are described as Single
Points of Failure. Within any complex system, there
are usually components or processes that, if not
replicated or otherwise backed up by redundant
capabilities, represent points of failure for
the entire system. A large part of disaster avoidance
planning comes down to identifying Single Points
of Failure, wherever they exist, and eliminating
them.
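A minimal sketch of this identification step, assuming the planner keeps an inventory that records how many redundant instances of each component exist. The component names and counts are hypothetical, and a real analysis would also have to consider dependencies between components.

```python
from typing import Dict, List


def single_points_of_failure(instances: Dict[str, int]) -> List[str]:
    """Return components with fewer than two instances (no redundancy)."""
    return sorted(name for name, count in instances.items() if count < 2)


# Hypothetical inventory: component name -> number of deployed instances.
inventory = {
    "core router": 1,
    "database server": 2,
    "WAN link": 1,
    "UPS": 2,
}

for component in single_points_of_failure(inventory):
    print("Single point of failure:", component)
```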
Identify risks based on phenomena, which include:
:: Water damage (whether from leaky pipes or floods)
:: Fire (heat) damage (whether from arson, equipment
   overheating, environment contamination, lightning
   strikes, etc.)
:: Power failure (originating at the customer premises
   or across the power grid)
:: Network failure (LAN or WAN, whether component
   or link based)
:: Mechanical hardware failure or software failure
   (whether due to human error, short circuits,
   normal parts wear and tear, or building collapse
   following an earthquake)
:: Accidental or deliberate destruction or corruption
   of hardware, software, or data (by hackers,
   disgruntled employees, industry saboteurs,
   terrorists, or misbehaving software)
:: Other causes (including forced evacuation for
   environmental hazards, aircraft crashes, etc.)
Disaster Avoidance System
The purpose of disaster avoidance systems is to
provide an automated mechanism for detecting certain
disaster potentials (and to respond to them, where
possible) before they develop into unplanned interruptions
of normal business processes.
These include:
:: Systems for water detection that can provide
   early warning of leaks and water-related hazards
:: Systems for the detection of pre-ignition gases,
   smoke and other indicators of impending fire to
   enable proactive response that will ensure the
   health and safety of personnel and prevent the
   loss of data and equipment to fire
:: Systems for the detection of airborne contamination
   levels that are associated with employee illness,
   data loss, equipment malfunction, and fires
:: Systems for the suppression of fires
:: Systems for the continuation of electrical power
   in the presence of a utility power outage
:: Systems for the physical security of corporate
   computing and telecommunication facilities
Water Detection
Water can intrude into sensitive information processing
and storage facilities as well as user work areas,
in a variety of ways and from a variety of sources.
Some common sources of flooding are:
:: Facility plumbing leaks
:: Air conditioning
:: Water cooling systems
:: Sprinkler systems
Detection systems, ranging from
simple battery-operated alarms to sophisticated
sensing cables and ceiling grids, are available
to detect the presence of water wherever it is
found and either sound an audible alarm or relay
a hazard alert message to a system or network
management console.
Fire Suppression
Fire prevention begins with facility design and
construction. Fire-resistant construction materials,
firewall placement and facility compartmentalization
can play major roles in limiting the scope, duration
and destructiveness of a fire.
Contamination Detection
The entry of airborne contaminant particles into
electrical equipment can cause short circuits
and even flash fires in electronic equipment.
White-glove method: This method assesses
the level of contamination by wiping the exposed
surface of a piece of equipment with a white glove.
The particulate on the glove is analysed to determine
the type of contaminant in the data center environment.
Aspirating pump: This method involves the
installation of a pump. Air samples are collected
through an air intake, and the contents are analysed.
Precombustion Detection
Heat and smoke detectors are available in a variety
of types, shapes, and sizes to alert personnel
to hazardous conditions. They include:
:: Photoelectric detectors, which detect the smoke
   produced by smoldering fires that involve PVC
   insulation
:: Ionization detectors, which detect fires involving
   more flame than smoke
:: Temperature detectors, which detect heat in
   excess of a preset value
:: Rate of rise heat detectors, which monitor rates
   of temperature rise exceeding a preset threshold
   (useful in environments subject to significant
   ambient temperature changes such as nuclear
   power generation facilities or heavy manufacturing
   environments)
:: Air sampling detectors, which detect the invisible
   by-products of materials as they degrade during
   the pre-combustion stages of fire
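The difference between a temperature detector and a rate of rise detector can be sketched as two threshold checks. This is an illustrative model only, not the logic of any particular product; the function names, the reading format and the thresholds are assumptions.

```python
from typing import List, Tuple


def fixed_temperature_alarm(temp_c: float, preset_c: float) -> bool:
    """Temperature detector: alarm when heat exceeds a preset value."""
    return temp_c > preset_c


def rate_of_rise_alarm(
    readings: List[Tuple[float, float]], max_rise_c_per_min: float
) -> bool:
    """Rate of rise detector: alarm when the temperature climbs faster
    than the preset threshold between two consecutive readings.

    `readings` holds (minutes, degrees Celsius) pairs in time order.
    """
    for (t0, c0), (t1, c1) in zip(readings, readings[1:]):
        if t1 <= t0:
            raise ValueError("readings must be in increasing time order")
        if (c1 - c0) / (t1 - t0) > max_rise_c_per_min:
            return True
    return False


# Hypothetical readings: a jump of 9 degrees within one minute.
readings = [(0.0, 30.0), (1.0, 31.0), (2.0, 40.0)]
print(rate_of_rise_alarm(readings, max_rise_c_per_min=8.0))
print(fixed_temperature_alarm(readings[-1][1], preset_c=57.0))
```

With these assumed numbers the rate of rise check fires while the fixed threshold does not, which is why the two detector types are often treated as complementary.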
Power Failure
Interruptions in electrical power can result from
a variety of factors, such as:
:: Transformer failure or line damage
:: Natural disasters and damage from severe weather
:: Utility company outages
:: Inadequate power-handling capacity in multi-
   or single-tenant buildings
:: Sabotage and terrorism
Providing alternatives or backups
for the facility power supply is one method to
insulate the company against external conditions
that are beyond its ability to control.
Day-to-day power-related problems should also be
addressed, such as line dips and surges, transverse-
and common-mode interference or noise and, in some
areas, brownouts.
An Uninterruptible Power Supply (UPS) affords
protection against both momentary and prolonged
outages. UPS vendors also add intelligence, such as:
:: Simple Network Management Protocol (SNMP)
   support, enabling the transmission of information
   about UPS status to system or network management
   software
:: Event logging, providing the means to store
   information about power events to facilitate
   the troubleshooting of power-related problems
:: Temperature monitoring, capturing temperature
   information on UPS components as well as power
   outlets and signaling operations if preset
   thresholds are exceeded
Other alternatives include:
:: On-site power generation: Vendors offer
   self-generation packages sized to meet specific
   load needs that are operated and maintained by
   the provider.
:: Deliverable power generation: For companies
   that prefer not to go into power generation,
   a second class of portable power providers sells
   UPS systems and wiring services and a contract
   to appear on site with a portable generator (or
   to deliver power from a separate generating
   facility) in an emergency or during periods
   when utility outages are expected.
:: Electrical power loss insurance: Policies in
   certain countries allow an organisation to buy
   insurance that compensates a business customer
   for losses incurred during a power outage.
Data Recovery Planning
A successful business recovery comes down to a
simple axiom: shorten the time to data.
For the company experiencing an unplanned interruption
in time-sensitive, mission-critical business processes,
the primary objective is to establish access to
application data quickly and by whatever means
possible.
Time to data is a determinant of post-disaster
business survival. Creating strategies to shorten
time to data is the primary mission of disaster
recovery planning.
Once provisions have been made to minimize the
likelihood of avoidable disasters, attention turns
to developing strategies to restore infrastructure
supports for critical business processes in the
wake of disasters that cannot be effectively avoided.
Restoration speed of data to a usable form is
determined by the sensitivity of the company to
the duration of an unplanned interruption. To
address different degrees of sensitivity, several
techniques of data restoration have evolved over
time. These include:
:: Routine data backup to magnetic tape using
   backup/restore software and the removal of
   backups to off-site storage. It requires the
   retrieval of stored backup tapes, transport to
   the system recovery facility, and restoration
   of data to a new storage platform via software.
:: Routine data backup to an electronic "tape
   vault" via a wide area network interconnect or
   the internet. Restoration may require the physical
   retrieval of tapes and their transport to a
   system recovery site, or it may be possible
   using a WAN link between the recovery system
   and the electronic tape vault.
:: Remote mirroring of data to a second (or third)
   storage platform via WAN interconnect. Restoration
   is unnecessary. The recovery system is connected
   via WAN links to the remote mirror array, or
   the remote mirror may be located at the system
   recovery site.
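To compare how these techniques affect time to data, the following sketch estimates hours to usable data for each. The formulas are deliberate simplifications and every number is a hypothetical input: tape restoration adds up retrieval, transport and restore time; the vault estimate covers only the transfer over the WAN link; the mirror is treated as needing no restoration and ignores connection or failover time.

```python
def tape_time_to_data(retrieve_h: float, transport_h: float, restore_h: float) -> float:
    """Off-site tape: retrieve the tapes, transport them, then restore."""
    return retrieve_h + transport_h + restore_h


def vault_time_to_data(data_gb: float, wan_gb_per_hour: float) -> float:
    """Electronic tape vault restored over a WAN link."""
    if wan_gb_per_hour <= 0:
        raise ValueError("WAN throughput must be positive")
    return data_gb / wan_gb_per_hour


def mirror_time_to_data() -> float:
    """Remote mirror: restoration is unnecessary in this simple model."""
    return 0.0


# Hypothetical inputs for a 500 GB application data set.
estimates = {
    "off-site tape": tape_time_to_data(2.0, 4.0, 10.0),
    "electronic tape vault": vault_time_to_data(500.0, 50.0),
    "remote mirror": mirror_time_to_data(),
}

# List the techniques from shortest to longest estimated time to data.
for technique, hours in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(f"{technique}: {hours:.1f} h")
```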
System Recovery
There is interdependence between the centralized
system backup strategies and the data protection
strategies, along with other disaster recovery
plan elements.
Once application criticality is defined, the risk
analysis goes further to identify the hardware
(both CPUs and storage devices) used by the application
in performance of the critical or vital business
function. During emergency operations the
business may be able to settle for far less processor
and storage capacity than it normally utilizes.
If critical and vital applications run on several
homogeneous or compatible processors in normal
business processes, it may be possible to replace
several low-end servers with one higher-end server
(server consolidation). Through the use of the right
operating system software, even applications that
reside on heterogeneous processors may be able
to run on one processor. Again, the total capacity
requirement for the backup server and related storage
devices may be substantially less than that of
the production environment.
The net result of this analysis is called the minimum
acceptable hardware configuration, which must be
implemented quickly in the event of a disaster.
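One way to picture the minimum acceptable hardware configuration is as the sum of the emergency-mode requirements of the critical and vital applications only. The sketch below assumes each application has been assigned a criticality label and reduced emergency requirements; the field names, labels and figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Application:
    name: str
    criticality: str           # e.g. "critical", "vital" or "non-critical"
    emergency_cpus: int        # processors needed during emergency operations
    emergency_storage_gb: int  # storage needed during emergency operations


def minimum_configuration(apps: Iterable[Application]) -> Dict[str, int]:
    """Total the emergency needs of critical and vital applications."""
    needed = [a for a in apps if a.criticality in ("critical", "vital")]
    return {
        "cpus": sum(a.emergency_cpus for a in needed),
        "storage_gb": sum(a.emergency_storage_gb for a in needed),
    }


# Hypothetical application inventory.
apps = [
    Application("order entry", "critical", 2, 500),
    Application("payroll", "vital", 1, 200),
    Application("intranet wiki", "non-critical", 1, 100),
]

print(minimum_configuration(apps))
```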
Hot Sites
Hot sites are fully equipped IT operations facilities
ready to operate within a few hours. The alternate
data processing site contains the same set of
hardware and software as the primary site.
Cold Site
By using the cold site strategy, the organisation
has already prepared a facility with the requisite
physical capabilities to serve as an alternate
data processing site.
The facility may be used for other purposes, including
off-site storage or new employee training, when
not in use for disaster recovery.
Reciprocal Backup Agreement
In this arrangement, two companies having spare
processing time and compatible hardware capabilities
agree formally or informally to back up each
other's critical applications. For example, in
a simple arrangement, if Company A experienced a
disaster, Company B would allow Company A to restore
its critical applications on Company B's hardware.
The reverse would be the case if Company B had
a failure.
Redundant System
In the event of a disaster, redundant systems
at a separate facility, which must be far enough
away not to have been affected by the same
disaster, are brought online. Users are either
transported to an operations center that is co-located
with the backup site or are provided remote access
to the backup CPU via a pre-established data communications
network.
Service Bureaus
An organisation may contract with a service bureau
to provide all alternate backup-processing
services. The big advantages of this type of
arrangement are the quick response and availability
of the service bureau, the ability to test, and
the fact that the service bureau may be available
for more than backup alone.
Network Recovery
Network recovery plan formulation involves the
development of at least three discrete recovery
strategies to cover:
Internal enterprise network (defined as departmental
or workgroup LANs) interconnected via a switched
or routed backbone network, as well as separate
or converged telephony networks.
"Local Loop" (Local Exchange Carrier
services connecting the company facility to the
LEC central office) and the WAN.
Network Relocation, providing a means to rebuild
mission-critical internal network services and to
reroute WANs and telephony services to alternate
end user and/or systems recovery sites in the wake
of a disaster.
To assist in formulating effective strategies,
it may be useful to define a loss scenario that
will guide planning. To assist in the development
of an internal network recovery strategy, for example,
DR coordinators may wish to use a scenario
of media or equipment failure. This scenario-based
approach has the benefit of enabling flexible
response to network interruptions of different
kinds. It also provides a basis for analyzing
and implementing preventive measures to protect
against certain types of outages.
Strategies for End User Recovery
End user recovery includes:
:: The location and provisioning of backup end
   user work facilities
:: The notification of employees who will staff
   the recovery site
:: The transportation of employees to the recovery
   site
:: The redirection of ground mail, telecommunications,
   and data networks to the recovery site
:: The acquisition of supplies at the recovery site
:: The application of remote access technologies
   for operating mission-critical applications
   from the user recovery site
:: An emergency evacuation plan for personnel from
   the corporate premises in the event of a hazardous
   or life-threatening disaster event
Testing
Regular testing is required for effective DRP
implementation. The following criteria should be
considered for effective DRP testing:
:: The DRP is tested to the fullest extent possible
:: The associated costs are not prohibitive
:: Service disruptions are minimal or non-existent
:: The tests provide a high degree of assurance
   in recovery capability
:: Evaluation of test results provides quality
   input to DRP maintenance
The Cycle Testing Paradigm:
Cycle testing consists of a series of exercises
utilizing multiple methodologies that often increase
in complexity and length from one phase to the
next. The results of each test are assessed individually;
improvements and error corrections are applied
to the plan prior to beginning the next phase.
At the end of the cycle the entire plan has been
completely evaluated; in fact, many portions of
the plan will have been tested, assessed and updated
multiple times. Small logistical errors that could
prove to be major obstacles in full scale testing
are isolated and removed from the plan. The iterative
framework of the test cycle provides continuous
DRP evolution.
In the volatile world of Information Technology,
hardware and software upgrades, configuration
changes and even business process life cycling
can occur quickly in response to market demands
and new service requirements. Cyclic recovery
tests provide an efficient pathway to DRP maintenance
by early recognition and correction of such problems.
At the end of each exercise and prior to the next,
comprehensive debriefing, audit and analysis are
required in order to update the current test plan
as well as each of the following phases of the
cycle.
[Figure: Illustration of a DRP Cycle Testing Scenario]
Checklist testing:
Checklists are the DRP consultant's most valuable
tool. They are inexpensive to implement and maintain
and provide the backbone of the testing cycle.
The checklists are team oriented and if used to
their full potential provide multiple benefits.
For each business process, partition out areas
of responsibility, select teams appropriate to
the specific nature of the partition and allow
the cumulative experience of the group to develop
the checklist as appropriate. The grassroots involvement
heightens recovery awareness and buy-in as the
team members get a sense that their input is an
integral component of the process.
A checklist test can be used to validate multiple
components of the DRP, for example:
:: Emergency Call Tree verification
:: Key procedure validation
:: Hardware and software configuration documentation
   complete and current
:: Availability of process specific resources
   during DRP implementation
:: Tape backup libraries are complete and current
   with existing configuration
:: Recovery plan and all necessary operational
   manuals
Walk Through Testing:
Team members verbally "walk through"
the specific steps as documented in the plan to
confirm effectiveness, identify gaps, bottlenecks
or other weaknesses in the plan. Often used in
conjunction with previously validated checklist
plans, this test provides the opportunity to review
the plan with a larger subset of people allowing
you to draw upon a correspondingly increased pool
of knowledge and experiences. Staff will be familiarized
with procedures, equipment and offsite facilities
if required.
Simulation Testing:
Because the disaster is only simulated, normal
operations are not interrupted. Hardware, software, personnel,
communications, procedures, supplies and forms,
documentation, transportation, utilities, and
alternate site processing should be thoroughly
tested in a simulation test. Extensive travel,
moving equipment, and eliminating voice or data
communications may not be practical or economically
feasible during a simulated test. However, validated
checklists can provide a reasonable level of assurance
for many of these scenarios.
The simulation test should be considered advanced
and only implemented after the previous checklist
and walk through tests have been validated. The
output of the previous tests should be carefully
analyzed before the proposed simulation to ensure
that the lessons learned during the previous phases
of the cycle have been applied.
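The rule that a simulation test follows validated checklist and walk through tests can be sketched as a simple gate. The phase names used as keys and the function names are assumptions for illustration; only the simulation rule is taken from the text, and other tests are left ungated here.

```python
from typing import Dict, Set

# Prerequisites per test type. Only the simulation rule comes from the
# text above; tests not listed have no prerequisites in this sketch.
PREREQUISITES: Dict[str, Set[str]] = {
    "simulation": {"checklist", "walk-through"},
}


def missing_prerequisites(test_name: str, validated: Set[str]) -> Set[str]:
    """Return the prerequisite tests that have not yet been validated."""
    return PREREQUISITES.get(test_name, set()) - validated


def may_run(test_name: str, validated: Set[str]) -> bool:
    """True when every prerequisite of `test_name` has been validated."""
    return not missing_prerequisites(test_name, validated)


validated_so_far = {"checklist"}
if not may_run("simulation", validated_so_far):
    print("Validate first:", sorted(missing_prerequisites("simulation", validated_so_far)))
```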
Parallel testing:
A parallel test can be performed in conjunction
with the checklist test or simulation test. Under
this scenario, historical transactions such as
the prior business day's transactions are processed
against the preceding day's backup files at the
contingency processing site or hot site. All reports
produced at the alternate site for the current business
date should agree with those reports produced
at the existing (primary) processing site.
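The agreement check in a parallel test can be sketched as a comparison between the reports from normal production processing and those produced at the alternate site. The report names and contents below are hypothetical, and the reports are represented as plain strings for simplicity.

```python
from typing import Dict, List


def compare_reports(primary: Dict[str, str], alternate: Dict[str, str]) -> List[str]:
    """Return the names of reports that differ or exist at only one site."""
    names = set(primary) | set(alternate)
    return sorted(n for n in names if primary.get(n) != alternate.get(n))


# Hypothetical report contents keyed by report name.
primary_reports = {"daily ledger": "total=1200", "settlements": "count=45"}
alternate_reports = {"daily ledger": "total=1200", "settlements": "count=44"}

mismatches = compare_reports(primary_reports, alternate_reports)
if mismatches:
    print("Reports that do not agree:", mismatches)
else:
    print("All reports agree")
```

An empty result means the two sites agree for that business date; any name in the list points to a report that needs investigation before the test can be considered validated.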
Full-interruption testing:
A full-interruption test activates the total disaster
recovery plan. The test is likely to be costly
and could disrupt normal operation, and therefore
should be approached with caution. Again, the
importance of due diligence with respect to previous
phases of the cycle cannot be overstated.
It is important to note that the test cycle can
consist of one or more of the advanced testing
methods. A Test Cycle should consist of a minimum
of three phases: a Checklist, a Walkthrough and
at least one of the advanced testing methods.
Training
Training should be provided at least annually;
newly recruited personnel who will be assigned
planning responsibilities should receive training
shortly after they have joined. Contingency planning
personnel should be trained to execute their recovery
procedures without referring to the actual document.
Plan Maintenance
A Disaster Recovery Plan may become obsolete if the
organisation reorganizes and its critical business
units differ from those in place when the plan was
first created. Most commonly, the changes involve
the location or configuration of hardware, software,
and other components.
Role of a consultant
A consultant brings specialized knowledge to the
planning that may facilitate the speedy development
of an effective plan. A consultant who works within
a specific industry may combine an understanding
of the industry with a methodology for disaster
recovery planning. This reduces the learning curve,
which in turn can help to speed plan development.
A consultant can also bring a fresh eye to the
project, noticing recovery requirements that may be
overlooked by someone who is too close to the data
center he or she is seeking to protect.
Conclusion
Disaster recovery planning involves more than
off-site storage or backup processing. Organizations
should develop written, comprehensive disaster
recovery plans that address all the critical operations
and functions of the business. The plan should
include documented and tested procedures, which,
if followed, will ensure the ongoing availability
of critical resources and continuity of operations.