3. Anticipating Disruptions and Assessing Their Likelihood

One of the bedrock characteristics of disruptions is that they are almost never the result of a single failure. A large-scale disruption is usually the result of a confluence of several factors. Furthermore, there are typically many signs that a disruption is about to take place. Like the tremors that precede a volcanic eruption, these telltale signs point to an impending catastrophe. Such signs are often missed or ignored by managers. But when the conditions for a disruption are present and not addressed, the likelihood of a disruption—even a low-probability one—is not very low anymore. When the telltale signs start to appear, a disruption may be imminent even though its timing, place, and exact form may be unknown.

The Confluence of Causes

On the night between December 2 and December 3, 1984, about 500 liters of water inadvertently entered Methyl-Isocyanate (MIC)1 storage tank #610 in the Union Carbide plant in Bhopal, India. MIC, one of many “intermediates” used in pesticide production, is a dangerous chemical. As a liquid it is lighter than water, but its gas is twice as heavy as air, so when it escapes into the atmosphere it remains close to the ground.2

The water leakage resulted in a runaway chemical reaction in the tank, with a rapid rise in temperature and pressure. The heat generated by the reaction and the presence of an iron catalyst (produced by the corrosion of the stainless steel tank wall) resulted in a reaction of such momentum that the gases that formed could not be contained by the safety systems. As a result, 40 tons of MIC poured out of the tank for nearly two hours and escaped into the air, spreading as far as eight kilometers downwind over a city of nearly 900,000.3

The effect on the people living next to the plant—just over the fence—was immediate and devastating. Many died in their sleep, others fled in terror from their homes, blinded and choking, only to die in the street. Many others died days and weeks later.

Most people who inhaled a large amount of MIC suffocated when their lungs clogged with fluids and their bronchial tubes constricted. The human toll has been estimated at 4,000 deaths and 500,000 lingering injuries.

The Bhopal disaster, which is often referred to as the worst industrial accident in history, did not take place in a perfectly maintained plant with a faultless safety record and well-rehearsed emergency notification and response systems. Several factors, including deteriorating safety standards, poor maintenance, and lack of training, contributed to the deadly environment at the plant. In retrospect it was easy to see it coming:4

• Gauges measuring temperature and pressure in the various parts of the plant showed signs of trouble, but those gauges, including those in the MIC storage tanks, were so notoriously unreliable that workers ignored them.

• The refrigeration unit for keeping MIC chilled (intended to prevent overheating and expansion) had been shut off for some time.

• The gas scrubber, designed to neutralize any escaping MIC, had been shut off for maintenance. Even if it had been operating, post-disaster inquiries revealed, it could have handled only one quarter of the pressure actually reached in the accident.

• The flare tower, designed to burn off MIC escaping from the scrubber, was also turned off, waiting for replacement of a corroded piece of pipe. The tower was also inadequately designed for its task, being capable of handling only a quarter of the volume of the gas actually released.

• The water curtain, designed to neutralize any remaining gas, was too short to reach the top of the flare tower, from which the MIC was billowing.

• The alarm on the storage tank failed to signal the increase in temperature on the night of the disaster.5

• MIC storage tank #610 was filled beyond recommended capacity.

• A storage tank that was supposed to be held in reserve for excess MIC already contained MIC.

It is likely that any one of these factors, had it not failed, could have forestalled or mitigated the disaster. It took the confluence of all of them to create the environment in which the tragedy was just waiting to happen. In addition, the lack of trained doctors and the efforts of the Indian government to distance itself from any responsibility exacerbated the impact of the accident.

It’s Not the Only Case

Large-scale disasters grab media headlines. In many cases, they are investigated thoroughly, revealing the various conditions and unheeded warning signs that led to the catastrophe. In other cases, the resulting lawsuits bring the story into the public eye. These investigations and court cases show how management failures to identify and remedy unsafe, insecure, or otherwise dangerous conditions are likely to foreshadow disaster.6 Examples include the following:

1. The Challenger Explosion On January 28, 1986, the space shuttle Challenger exploded in midair less than two minutes after takeoff from Cape Kennedy in Florida. The investigation blamed failed rubber “O-rings,” designed to seal the joints between sections of the shuttle’s solid rocket boosters. Engineers had identified and reported degraded O-ring seals on previous missions dating back to 1982, with degradation increasing as ambient lift-off temperature fell below 53°F (12°C). The night before the accident, two engineers from Morton Thiokol, the firm responsible for making the solid-fuel rocket booster, recommended against the launch and went home, convinced that the launch would be canceled. Under pressure from NASA, Morton Thiokol management overruled the engineers. The temperature during liftoff was 36°F (2°C).7

Most of the factors contributing to the ill-fated launch had nothing to do with technology. In 1986, NASA was under pressure to launch a large number of missions; Morton Thiokol was due incentive pay for each successful flight; the Challenger featured the heavily publicized “teacher in space”; and the launch had already been scrubbed the day before. All of these factors added to the pressure on NASA management and contributed to its decision to overrule the engineers’ warnings and proceed with the launch. Interestingly, the Columbia Accident Investigation Board that probed the 2003 Columbia shuttle crash 17 years later concluded that “cultural traits and organizational practices detrimental to safety were allowed to develop [in NASA].”8

2. The Paddington Train Crash On October 5, 1999, a train passed a red signal, number 109, in the midst of the morning rush hour outside Paddington Station in West London and continued for some 700 meters into the path of a high-speed train. As a result of the collision and the subsequent fires, 31 people died and 227 were hospitalized.9

Between 1993 and 1999, eight near-misses, or “signals passed at danger” (SPADs), had occurred at signal 109, the site of the eventual collision. At the time of the crash, signal 109 was one of the 22 signals with the greatest number of SPADs.10 In addition, the train engineer had been on the job for less than two weeks, with no special training on navigating the complicated route outside Paddington Station.

3. The Morton Explosion On the evening of April 8, 1998, a nine-foot-tall chemical reaction tank (“kettle”) containing 2,000 gallons of chemicals in Morton International’s plant in Paterson, New Jersey, exploded. A fiery stream of gas and liquid erupted through the roof of the kettle’s building, raining chemicals onto the rest of the plant and the surrounding community. The explosion was the result of a runaway reaction inside the kettle, which was set to produce a run of Yellow 96, a dye used in tinting petroleum fuel products. Nine people were injured, two seriously.

The subsequent investigation revealed a series of failures. In two assessments conducted in 1990 and 1995, Morton did not consider the possibility of a runaway chemical reaction. As a result, workers were not prepared or trained to face a runaway reaction in the production of Yellow 96; they did not know, for example, that at 380°F (193°C) the chemicals inside the kettle begin to decompose, initiating an even more violent runaway reaction. In addition, the kettle was not provided with sufficient cooling capacity. Furthermore, because a runaway reaction was not considered a possibility, raw materials were introduced to the kettle in bulk rather than in small batches, step by step. Finally, in 1996 the kettle size had been increased, making it harder to control.

But there were also telltale signs that pointed to an impending disruption. There were eight prior instances in which the process temperature in the kettle exceeded the normal range. None was investigated, even though the process and design changes resulting from any one of these investigations could have prevented the 1998 explosion.11

4. The Firestone Tire/Ford Explorer Rollovers In August 2000, Firestone recalled 6.5 million tires, and Ford recalled another 13 million Firestone tires the following year. The recalls were the result of 148 U.S. deaths and 525 injuries related to tread separations, blowouts, and other problems with Firestone tires on Ford Explorer vehicles.12 The companies spent more than $3 billion on the recalls.13 Ford sales dropped 11 percent in the following year and Firestone tire sales dropped even further. (Ford lost almost a full percentage point of market share despite generous incentives to buyers.) A century-long relationship between the two companies ended in acrimony.

Several contributing factors have been cited as causes. The Firestone plant in Decatur, Illinois, where most of the failed tires were made, used a unique pelletized rubber process that led to lower tread adhesion.14 In addition, the shoulder pocket15 design of the tires could have led to cracking, creating a starting point for a tire failure. But Ford also contributed to the problem. The high center of gravity of the Ford Explorer aggravated tire separation accidents by increasing the likelihood of vehicle rollovers after a tire blow-out at highway speed. In addition, Ford recommended under-inflating the Explorer tires (30 psi instead of Firestone’s recommended 36 psi) for improved ride quality, resulting in excess wear and heat build-up inside the tires and reducing the margin of safety on tire performance.

Some of the damage to Ford and Firestone stemmed from how long it took the companies to realize that they had a problem. Each month of delay and finger-pointing between the two companies added more tires to the recall and fatalities to the toll. This issue of detection, which concerns how companies discover and internalize ongoing disruptive events in order to start recovery processes, is discussed in chapter 9.

5. The Chernobyl Nuclear Accident In the early hours of April 26, 1986, reactor number 4 at the Chernobyl nuclear power plant blew up during a routine test of the facility, as a chain reaction went out of control, releasing 30 to 40 times the radioactivity of the atomic bombs dropped on Hiroshima and Nagasaki.16 About 30 people were killed in the immediate aftermath, but tens of thousands of others were affected with radiation illness. Cities and villages around the plant had to be evacuated and 200,000 people resettled.17 Cancer rates in Belarus and even in Ukraine are abnormally high almost 20 years after the accident.

Many factors contributed to the disaster. The RBMK18 reactor used in Chernobyl had many design flaws: It lacked a containment shell; it used a graphite (carbon) moderator rather than water (the graphite caught fire during the explosion); and the design included a “positive void coefficient” that caused reactions to speed up rather than slow down (as in other designs) when the water in the reactor boiled. More important, the reactor safety systems were disabled prior to the test and, contrary to the requirement to use at least 30 control rods to maintain control during the test, the operators used only eight. These were indications of the fundamental lack of a safety culture at the plant.19

Interestingly, Yuri Andropov, then chairman of the KGB, enumerated several construction and safety weaknesses at the Chernobyl nuclear plant seven years before the accident. In a 1979 letter, he informed the leadership of the KGB of the problems, letting them know that the Ukrainian KGB had notified the secretariat of the Central Committee of the Soviet Union’s Communist Party not only about the design flaws, but also about a series of problems that caused 170 workers to suffer injuries at the plant during the first three quarters of 1978.20

As these examples illustrate, failure to use precursor data to identify and remedy systemic flaws can lead to a catastrophe. Even the 9/11 terrorist attacks were preceded by numerous other attacks on the United States, as well as a declaration of war by Osama bin Laden, but none of those indicators was given the proper weight, and no official was able to “connect the dots.” Each of these “dots” represented an unheeded warning sign. Confucius is quoted as saying, “The common man marvels at the uncommon thing. The uncommon man marvels at the common thing.” Highly disruptive, uncommon events can often be tamed by those who make a policy of noticing the common danger signs.

It is typical for an escalating series of failures to lead to an eventual disaster. To reduce the likelihood of future high-impact disruptions, many industries have developed management reporting and analysis systems based on “near miss analysis.”

Near Miss Analysis

On March 24, 2004, Prince Charles of Britain and members of his staff took off from the Northolt RAF base in west London on their way to attend the funeral services for the Madrid terrorist bombing victims. At 8:30 A.M., at 11,500 feet above Newbury in Berkshire, the military HS146 of the Queen’s Flight came within less than 900 vertical feet and three horizontal miles of an Airbus A321 heading to Heathrow airport from Cork in the Republic of Ireland with 186 passengers aboard. Both pilots and air traffic controllers recognized the potential conflict and acted immediately to avoid a midair collision. The incident was reported as a “near miss” by both pilots, as well as by the air traffic controllers, to the British Civil Aviation Authority.

The aviation industry has long recognized the wisdom of learning from a mistake even when it does not cause an accident. It has established the Aviation Safety Reporting System (ASRS), which is used to collect and analyze confidential aviation incident reports that are submitted voluntarily. The purpose is to identify systemic or latent errors and hazards and to alert the industry to them. The ASRS receives more than 30,000 reports annually and issues alerts to the industry on a regular and as-needed basis. Most aviation experts agree that these efforts have resulted in an ever-increasing level of civilian airline safety.

Figure 3.1
The Near Miss Pyramid

A similar system of reporting and investigating near misses is used by the U.S. military and by almost every other air force in the world. The U.S. military requires pilots to report “any circumstance in flight where the degree of separation between two aircraft is considered by either pilot to have constituted a hazardous situation involving potential risk of collision.”21

Figure 3.122 depicts the notion that numerous unsafe conditions and insecure processes can lead to hundreds of minor incidents or close calls. When these are not addressed, the result is likely to be dozens of incidents involving some property damage, leading to several major incidents involving large property damage and minor injuries. And, if nothing is done to address these increasingly serious losses, it is likely that a major incident involving serious injuries and loss of life will follow at some point in time.

Researchers at the University of Pennsylvania23 have outlined a seven-step near miss management process that uses data from near misses to prevent large disruptions (a minimal illustration of such a process follows the list). The steps include

(i) identification of an incident,

(ii) disclosure and filing,

(iii) distribution of incident data,

(iv) root cause analysis,

(v) solution/improvement recommendation,

(vi) dissemination, and

(vii) follow-up.
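The sketch below shows one way such a process could be tracked in software. It is illustrative only; the record fields, class name, function name, and sample data are hypothetical and not part of the framework cited above.

```python
# A minimal sketch of a near-miss record moving through the seven steps listed above.
# All names (NearMissReport, next_step, the sample data) are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class NearMissReport:
    incident_id: str                                           # (i) identification
    description: str
    reported_on: date                                          # (ii) disclosure and filing
    distributed_to: List[str] = field(default_factory=list)    # (iii) distribution of incident data
    root_causes: List[str] = field(default_factory=list)       # (iv) root cause analysis
    recommendations: List[str] = field(default_factory=list)   # (v) solution/improvement recommendation
    disseminated: bool = False                                 # (vi) dissemination
    followed_up: bool = False                                  # (vii) follow-up

def next_step(report: NearMissReport) -> str:
    """Return the next outstanding step for a report, mirroring steps (i)-(vii)."""
    if not report.distributed_to:
        return "distribute incident data"
    if not report.root_causes:
        return "perform root cause analysis"
    if not report.recommendations:
        return "recommend solutions/improvements"
    if not report.disseminated:
        return "disseminate findings"
    if not report.followed_up:
        return "follow up on implementation"
    return "closed"

# Example: a freshly filed operating-room near miss still needs to be distributed.
report = NearMissReport("OR-2024-017", "wrong limb prepped for surgery", date.today())
print(next_step(report))   # -> "distribute incident data"
```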

Naturally, different industries will have different rules as to what incidents should be identified as near misses and reported. Airline pilots report near-midair collisions by judging the distance and trajectories of the aircraft involved. Hospital operating room personnel may report a near miss when a sponge is detected after an unresolved count, when the wrong limb is prepped for surgery, or when the wrong patient is transferred to the operating room. Process industry managers may report incidents when reaction temperatures are too high or when harmless gases are emitted inadvertently.

Reporting an incident is the step that starts the formal process. Naturally, managers of any near miss system should not interpret a declining number of reports over time as clear evidence of increased safety; it may simply reflect under-reporting. To encourage reporting, managers may institute a system of anonymous communications.

Distribution of incident data serves to alert people to the likelihood of a hazard even before it is investigated. It also serves to solicit broader information that can help in the subsequent analysis. The root cause analysis then focuses on both the direct and underlying factors that led to the incident. The next step is to develop mitigation measures and contingency plans.

Once the results of the analysis and its recommended set of actions are distributed, management at every enterprise obtaining the report has a business decision to make. The implicit essence of most near miss reports is an estimate of the likelihood of a major disruption. In fact, management may decide not to act when the chances of major disruption are judged too small and the cost of the mitigation too high.
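That trade-off can be made concrete as a simple expected-loss comparison. The sketch below is illustrative only; the probabilities, loss figures, and function name are invented for the example rather than taken from the text.

```python
# Illustrative only: compare the expected loss avoided by a mitigation measure
# with the cost of that measure. All numbers below are made up.
def worth_mitigating(p_disruption: float, loss_if_disrupted: float,
                     mitigation_cost: float, residual_fraction: float = 0.0) -> bool:
    """True if the expected loss avoided exceeds the cost of the mitigation."""
    expected_loss_now = p_disruption * loss_if_disrupted
    expected_loss_after = expected_loss_now * residual_fraction
    return (expected_loss_now - expected_loss_after) > mitigation_cost

# A 2 percent annual chance of a $50M disruption justifies spending up to about
# $1M per year on a measure that would eliminate the loss entirely.
print(worth_mitigating(0.02, 50_000_000, 800_000))    # True
print(worth_mitigating(0.02, 50_000_000, 1_500_000))  # False
```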

Only a Fool Learns from Experience

As the saying goes, the wise person learns from other people’s experience. There are several types of experiences from which organizations can learn. These include near misses that happen to them, accidents that happen to them, near misses that happen to others, and accidents that happen to others.

Many organizations learn from large disruptions that happen to them. (Sadly, as the Columbia accident proved, this is not always the case.) Frequently, organizations also institute internal systems of near miss tracking, analysis, and correction. Many chemical plants, transportation companies, mine operators, and other organizations that are involved in dangerous work stress such systems as part of their safety culture, and some of these systems operate industry-wide. In many industries, accident data are reported and collected by government or industry bodies that investigate and then disseminate the data and the conclusions to industry members.

The U.S. National Transportation Safety Board (NTSB) is an independent federal agency whose mission is to investigate accidents in the aviation, highway, marine, railroad, and pipeline modes of transportation. As of 2004, its aviation accident database covered 140,000 aviation accidents, with details on each one.24 The NTSB also publishes many studies with the results of its investigations and data analysis as well as recommended improvements to infrastructure, rolling stock, and practices.

In the same way, the U.S. Chemical Safety Board (CSB) is an independent federal agency whose mission is to prevent industrial chemical accidents. It investigates chemical incidents, determines root causes, and issues safety recommendations.25 The U.S. Environmental Protection Agency also requires the operators of more than 15,000 sites containing certain toxic and flammable substances to report their existence and each site accident under rule 112(r) promulgated under the Clean Air Act.26

All the CSB and EPA reports are publicly available. Similarly, European Union Directive 96/82/EC (known as the Seveso II Directive) specifies the processes for reporting major accident hazards involving dangerous substances. In addition, an agency of the U.S. Department of Health and Human Services manages an incident reporting system that specifically includes not only occupational injuries and illnesses but also near misses.

Studying accidents and near misses gives managers an idea of the likelihood of major disruptions to their own operations and conditions the organization to recognize dangerous situations as they develop. Examples include unusual weather, an incomplete complement of workers, tight schedules, small slip-ups, and other conditions that in the past were associated with accidents.

Disruption Likelihood

Disruptions can be divided into three categories to facilitate estimating their likelihood: natural disasters, accidents, and intentional attacks. These categories differ in the relative roles that human beings and random factors play in their cause. Consequently, the methods of estimating their likelihood also differ.

Natural Disasters

Because many types of natural disasters occur relatively frequently, statistical models can be used to estimate the likelihood of their occurrence and their magnitude. Insurance companies have well-developed models of the likelihood of earthquakes, floods, or lightning strikes for various areas of the United States as well as for other countries. Insurance premiums can even serve as a proxy for the likelihood of the relevant risk.

The U.S. Geological Survey (USGS) estimates that the areas most susceptible to earthquakes in the United States include the western United States, the New Madrid zone in Missouri, and a few isolated locations on the United States East Coast. The USGS publishes maps, such as figure 3.2, depicting the occurrence of earthquakes over time, which can be used to gauge their likelihood. The USGS also publishes detailed data of earthquake frequency for each state in the United States and other regions of the world.

The U.S. National Oceanic and Atmospheric Administration (NOAA) publishes statistics about severe weather. For example, the frequency of tornadoes in Oklahoma City is shown in figure 3.3.27 Companies located in Oklahoma City can use such figures to plan and time severe weather evacuation drills for February, just before the peak season of March through June.28
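As a small illustration of this kind of planning, the sketch below picks a drill month from monthly frequency data such as that shown in figure 3.3. The counts are invented, and the rule (drill the month before the above-average season begins) is one plausible heuristic, not a recommendation from the text.

```python
# Illustrative only: choose a drill month just before the peak tornado season,
# based on invented monthly tornado counts like those shown in figure 3.3.
monthly_tornado_counts = {
    "Jan": 1, "Feb": 2, "Mar": 9, "Apr": 14, "May": 21, "Jun": 12,
    "Jul": 4, "Aug": 3, "Sep": 4, "Oct": 5, "Nov": 2, "Dec": 1,
}
months = list(monthly_tornado_counts)
mean = sum(monthly_tornado_counts.values()) / 12
# Treat months with above-average counts as the "peak season"; drill the month before it starts.
peak_season = [m for m in months if monthly_tornado_counts[m] > mean]
drill_month = months[(months.index(peak_season[0]) - 1) % 12]
print(f"Peak season: {peak_season}; schedule evacuation drills in {drill_month}")
# -> Peak season: ['Mar', 'Apr', 'May', 'Jun']; schedule evacuation drills in Feb
```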

Figure 3.2
Earthquake Frequencies

Figure 3.4 depicts the time of day of tornadoes in Oklahoma City, indicating that they take place mostly in the afternoon and early evening hours.29 Again, by knowing the increased likelihood of tornadoes at these times, organizations can train the right work shift at a plant in emergency evacuation.

Such preparations proved life-saving when a tornado hit the GM plant in Oklahoma on May 8, 2003, at 5:30 P.M. None of the more than 1,000 employees who were at the plant was hurt because they all took shelter at the plant’s fortified safe room when the tornado sirens sounded at 5:00 P.M. The tornado hit during the most likely month and at the most likely time of day.

Climatological models define likely rainfall patterns, suggesting the probability of floods in wetter-than-expected regions or wildfires in drier-than-expected regions. A NOAA map of precipitation outlook is depicted in figure 3.5. Such data can guide long-term site selection decisions, hazard insurance coverage strategies, or employee training.

Figure 3.3
Tornado Timing

Figure 3.4
Oklahoma Tornados

Figure 3.5
Precipitation Forecast


The frequency and size of near misses or small disruptions can actually help predict the chance of a bigger natural disaster. Many natural (and man-made) phenomena follow statistical rank-size laws (called Power Law distributions) that relate the size of the phenomenon to the frequency of the disruption. For example, the Gutenberg-Richter Law stipulates that for every 100 earthquakes of magnitude 3 on the Richter scale, one can expect approximately 10 earthquakes of magnitude 4 (each roughly 10 times stronger than a magnitude 3 quake) and one earthquake of magnitude 5.

Power Law distributions are a mathematical version of the well-known 80/20 rule, the intuitive notion that 20 percent (or in general a small fraction) of the events cause 80 percent (or, in general, most) of the impact. Financial losses caused by earthquakes, hurricanes, floods, and even stock market crashes follow a Power Law distribution, as do forest fires, electricity blackouts, industrial accidents, and insurance claims. The size of cities, the popularity of weblog sites, and other socio-economic activities are also distributed according to similar rules.30

In the context of assessing the likelihood of low-probability/high-impact events, one can assess the odds of such large, infrequent disruptions based on the observed large number of smaller events. Such relationships are not precise enough to estimate either the timing or the magnitude of a future disruption, but they can be used to estimate the probability that it will happen during a future interval and its relative likelihood compared to other potential disruptions. Such statistical distributions are responsible for statements like “we are due for a big one” regarding earthquakes in California.
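A short sketch of this rank-frequency logic follows, using the Gutenberg-Richter form log10(N) = a - b*M. The a and b values, the assumption that counts are annual, and the assumption that events arrive independently (Poisson) are illustrative choices, not fitted data.

```python
# Illustrative Gutenberg-Richter sketch: log10(N) = a - b*M, with made-up a and b
# chosen so that 100 magnitude-3 events correspond to 10 of magnitude 4 and 1 of magnitude 5.
import math

def expected_count(magnitude: float, a: float = 5.0, b: float = 1.0) -> float:
    """Expected annual number of earthquakes at or above the given magnitude."""
    return 10 ** (a - b * magnitude)

# With b = 1, each unit increase in magnitude cuts the expected count tenfold.
for m in (3, 4, 5, 6):
    print(m, expected_count(m))        # 100.0, 10.0, 1.0, 0.1

# The same counts can be inverted to estimate the chance of at least one large event
# in a given interval, assuming events arrive independently (a Poisson process).
rate_m6 = expected_count(6.0)                        # expected M>=6 events per year
p_at_least_one_in_10_years = 1 - math.exp(-rate_m6 * 10)
print(round(p_at_least_one_in_10_years, 3))          # about 0.632
```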

Accidents

Although most safety literature is concerned with prevention, the first step in any safety process should be an assessment of the likelihood of an accident. Most analyses aimed at assessing such likelihood are based on variations of the near miss framework presented in the previous section.

One study31 suggested that for every 300 accidents with no injury, one can expect approximately 30 accidents involving minor injury and one major accident involving serious injury or loss of life. Based on 1.7 million accidents reported by 297 cooperating organizations, another study32 suggested that for every 600 accidents with no damage or injury there are likely to be 30 property damage accidents, 10 accidents with minor injuries, and one serious or disabling injury. Neither of these studies separates out accidents in organizations that instituted formal processes of learning from near misses, but the consistent pattern of many small accidents foreshadowing larger ones suggests an approximate way to assess the likelihood of large accidents.
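As a rough back-of-the-envelope use of these ratios, the sketch below scales the 600:30:10:1 proportions quoted above to a site's own count of no-damage incidents. The ratios are approximate industry-wide averages and the sample count is invented, so the result is a crude likelihood indicator rather than a prediction for any particular facility.

```python
# Scale the approximate 600:30:10:1 ratios cited above to a site's own incident count.
# The 1,200 no-damage incidents used in the example are invented.
RATIOS = {
    "no damage or injury": 600,
    "property damage": 30,
    "minor injury": 10,
    "serious or disabling injury": 1,
}

def project_from_incidents(no_damage_count: int) -> dict:
    """Project expected counts at each severity level from the no-damage count."""
    scale = no_damage_count / RATIOS["no damage or injury"]
    return {severity: round(count * scale, 1) for severity, count in RATIOS.items()}

# A plant logging 1,200 no-damage incidents a year would, by these average ratios,
# expect roughly 60 property-damage events, 20 minor injuries, and 2 serious injuries.
print(project_from_incidents(1200))
```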

To attack safety problems at their root, companies dealing with hazardous conditions have been working to reduce the number of incidents, which should reduce the accident rate and eliminate severe accidents. These companies have implemented process safety management (PSM)33 systems, including audit programs that verify compliance and the implementation of safe procedures. In many cases, such processes have dramatically reduced the number of incidents. For example, figure 3.6 depicts the marked reduction of incidents at DuPont as a function of the audit effectiveness that was part of its PSM.34

Figure 3.6
PSM Audit Scores and PSM Incidents

Intentional Disruptions

Whereas natural disasters follow statistical Power Law curves and the likelihood of large accidents can be estimated from small mishaps, intentional disruptions follow a different logic. Intentional disruptions constitute adaptable threats in which the perpetrators seek both to ensure the success of the attack and to maximize the damage. Consequently, “hardening” one potential target against a given mode of attack may increase the likelihood that another target will be attacked or there will be a different type of attack. It also means that such attacks are likely to take place at the worst time and in the worst place—when the organization is most unprepared and vulnerable.

Because of their high frequency, labor strikes provide many examples in which a smart adversary will inflict damage using methods and timing that the company may not anticipate or that are designed to inflict maximum economic damage. In the summer of 2002, for example, the International Longshore and Warehouse Union (ILWU) staged a work slowdown at the Pacific Coast ports. To maximize the effect of its actions, the union timed the slowdown for October, planning to choke the ports just as the volume of shipments from Southeast Asia increased before the holiday shopping season in the United States.

Taking a page from the same book, Britain’s Transport and General Workers’ Union started making preparations in August 2004 for a strike against the country’s biggest ports group, Associated British Ports.35 The plan was to cripple the ports in late September and October, just as the run-up to Christmas began.

The Canadian Customs Excise Union also chose its timing carefully. In 2004, it planned for its customs agents in Quebec to begin checking all vehicles driving into the United States as a protest in a four-year salary dispute. The action was intended to double wait times for travelers, since vehicles are typically inspected only once, by American customs. To maximize effectiveness, the union chose July 31 to start the practice—on Canada’s Civic Holiday long weekend. Many travelers were expected to visit the United States, ensuring particularly long lines and maximum disruption.

When Chapter 1199 of the Service Employees International Union planned to exert pressure on the Women and Infants Hospital in Rhode Island, it did not simply plan a strike. Knowing that the hospital was well prepared to use replacement workers and management personnel, it adjusted its tactics. In a September 1998 strike notice to its members, the union leadership said it intended to strike for one day. Chapter 1199 and other unions had used a series of one-day strikes elsewhere. “It is a crippling kind of tactic, because you can gear up for a strike, but it is hard to gear up, and step down, and gear up, and step down,” explained May Kernan, the hospital’s vice president of marketing communications.36

In 2003, Greenpeace wanted to torpedo efforts to test genetically modified wheat in the eastern German state of Thuringia. Greenpeace did not, however, resort to demonstrations or other activities that Syngenta International AG, the Swiss agribusiness company, was expecting. Instead, Greenpeace sabotaged the site by sowing organic wheat throughout the test field, ruining the trials because it would be impossible to tell the genetically modified wheat from the conventional wheat.37

On November 28, 1995, French workers participated in their second nationwide strike in five days to protest austerity measures proposed by the government of Prime Minister Alain Juppé. In Paris, 85 bus drivers employed by the Parisian transportation authority, RATP, decided to create a disruption in support of the general strike. They knew exactly what to do. Their 85 buses blocked the main RATP garage, and within hours the entire bus and subway system throughout Paris ground to a halt.38

After the United States imposed tariffs on steel imports in March 2002, the World Trade Organization ruled that the tariffs violated international trade rules. The WTO decision gave the European Union and several other countries the right to impose retaliatory tariffs on billions of dollars’ worth of American exports. Rather than retaliate with steel tariffs of its own, the EU decided to hit the Bush administration where the tariffs would hurt the most. It published a list of products targeted for tariffs that included citrus fruit, textiles, motorcycles, farm machinery, shoes, and other products. The common denominator was that these products were all made primarily in political “battleground states” that the Bush administration would need to win in the November 2004 U.S. presidential elections.39

All of these examples demonstrate the non-random, adaptive nature of purposeful disruptions. Terrorism, of course, is the ultimate form of intentional attack. The Madrid bombers did not blow up an airliner or attack an airport, because after 9/11 airports around the world had enhanced their security measures. Instead, they struck an undefended target: commuter trains in the heart of Madrid. The March 2004 attack took place at the height of the rush hour, when the packed trains ensured maximum carnage. Labor actions and political maneuvering, of course, have nothing to do with terrorism; what managers must remember is that intentional disruptions will strike at the least defended place at the most inconvenient time, and they have to be ready for this.

Vulnerabilities of the Information System

As supply chains interconnect across the globe, companies are becoming increasingly dependent on reliable information and communications. Design specifications and modifications, material orders and change orders, shipping notices and deviation alerts, and payments and rebates all move electronically across the globe at the speed of light. Furthermore, as companies automate the control of many physical processes, they become more vulnerable to computer hackers and saboteurs.40

Sophos Plc., a British computer security company, finds more than 1,000 new computer virus variants each month using a global network of monitoring stations. Many of these viruses are harmless, but not all are. Some viruses are designed to cripple a network by flooding the attacked web server with messages. So-called “distributed denial of service” attacks spread from computer to computer by sending messages to random web addresses or via e-mail, using each infected computer’s e-mail address book to send the virus as an attachment to other computers. Particularly destructive past viruses include the 1999 Melissa, the 2000 Love Bug, the 2001 Code Red and Code Red II, and the 2003 Slammer worm.

Although many of these attacks are not targeted at a single company, modern connected and automated systems create new vulnerabilities, as the following example illustrates.

Vitek Boden of Brisbane, Australia, was unhappy at the beginning of 2000. His application for employment with the council managing the water system for the small town of Maroochy Shire had been declined. Using a two-way radio, a basic telemetry system, and a laptop computer, Boden hacked into the council’s control system. In March 2000, he released close to one million liters of raw sewage into local streets, waterways, rivers, parks, and a Hyatt Regency tourist center. The cleanup took more than a week. The stench and the environmental damage lingered long afterward. While not a life-threatening attack, Boden’s assault demonstrated the vulnerability of automated systems to cyber-intrusions.

Assessing Intentional Threats

Historical data are of limited use when trying to forecast a new intentional threat, because such threats adapt to defensive measures. For example, although the United States strengthened its airport security after the 9/11 attack, it is still unprepared for other types of attacks, which may now be more likely precisely because of the increased security at airports.

To help assess the types of attack an adversary might attempt and to estimate their likelihood, organizations ranging from Intel Corporation to the U.S. Navy to airport managers use “Red Team” exercises to refine and advance their understanding of intentional threats. In such exercises, a group of experts is tasked with thinking like an enemy, exploring the vulnerability of an organization, and simulating various attacks.

For example, in November 2000, the U.S. Department of Energy and the Utah Olympic Public Safety Command conducted “Black Ice,” a simulation designed to understand the vulnerabilities of crucial infrastructure components when a loss of infrastructure is compounded by a cyber attack. The exercise began with a fictitious ice storm damaging power and transmission lines, degrading the ability to generate and deliver power in the region. This was then exacerbated by a simulated cyber attack on the Supervisory Control and Data Acquisition (SCADA) system, which controls the power grid in the area. The simulation exposed the vulnerability of Salt Lake City even to partial performance degradation of its natural gas, water, and communications systems.

Similarly, in July 2002, Gartner Research and the U.S. Naval War College conducted “Digital Pearl Harbor,” a simulation designed to explore the potential of a cyber attack on critical infrastructure components. The experiment exposed the need for a central coordination role by government in the event of such an attack. It also discovered that isolated infrastructures proved much more difficult to breach.41 Such “Red Team” simulations help the defending (“blue”) team to assess its vulnerability and discover unforeseen dependencies. The simulation also “socializes” organizations that use it to think in terms of uncertainty, flexible response, responsibilities, and lines of authority, thus creating a more adaptive and resilient culture.

Summing the Likelihoods

Large-scale disruptions rarely take place without any warning. The likelihood of random disruptions can be assessed from actual, commonly available data regarding frequency of earthquakes, floods, hurricanes, or lightning strikes. In addition, such phenomena seem to follow statistical laws that allow inference of the likelihood of low-probability/high-impact disruptions from the occurrence of frequent small events.

Although accidents involve human factors, they also seem to follow similar relationships. A large number of small incidents may foreshadow more significant accidents. This is the basis for the near miss analysis. Learning from the small incidents can help organizations correct the conditions that lead to accidents, thereby lowering the likelihood of large disruptions.

The likelihood of intentional threats is more difficult to assess because these attacks adapt to defensive measures. To estimate modes of attack and likelihood, managers have to “think like the enemy,” using other organizations’ incident experience and simulated “red team” attacks.

Recognizing that many components have to fail for a system to experience a large disruption is the rationale behind the concept of layered defense described in chapter 8. The basic concept is that if one element of the system is breached, the system would not fail because other elements are capable of deterring an attack or avoiding an accident.

Estimating the probability of failure of various supply chain elements requires information that is rarely available to managers. But as investigation boards and legal proceedings have revealed, in many cases the relevant data are on record but are never funneled to the right place or analyzed to extract the information they contain. Thoughtful and carefully designed continuous collection of common data, feeding a frequent process of review and analysis, can point to impending disruptions and enhance the process of estimating disruption likelihoods.

As things generally stand today, many companies have no formal process for anticipating disruptions and estimating their likelihood, and those that do rely on managers’ subjective estimates. Such estimates are usually relative—ranking possible disruptions in terms of their likelihood vis-à-vis other potential disruptions. Such rankings are useful in constructing vulnerability maps (such as figure 3.2) to help prioritize management’s attention. They cannot, however, replace a rigorous process of threat likelihood analysis.

The limited ability to estimate the likelihood of specific disruptions means that significant management attention should be focused on general redundancy and flexibility measures. Such capabilities can help a company to recover from many potential disruptions, even those that cannot be imagined specifically.
