
9. Detecting Disruptions

Published on Apr 04, 2020

Baxter International faced a major crisis when its kidney dialysis filters became implicated in the deaths of patients. For some time, however, Baxter could not find anything wrong with the filters and therefore was not sure if the company had a significant problem on its hands.

Detecting a disruption means distinguishing a true problem from the sometimes considerable variations of normal day-to-day business. This is one of the most important tenets of reducing the likelihood of disruptions—the ability to identify outliers in the normal pattern of events.

The speed of detection may be crucial in avoiding disasters. The “nightmare scenario” among security professionals is not a nuclear holocaust or a dirty bomb. Instead, it is an attack in which the target does not realize it is under attack (or does not realize the severity or magnitude of the attack) until it is too late. And the result can be devastating.

The influenza pandemic of 1918 killed 40 million people worldwide;1 28 percent of the U.S. population was infected and 675,000 died,2 and this was before the age of globalization and jet travel. Not realizing the magnitude and severity of the disease, Baltimore’s health commissioner, when asked on October 4, 1918, about closing schools and banning public gatherings, told the Baltimore Sun, “Drastic measures . . . only excite people, throw them into a nervous state and lower their resistance to disease.” Only after thousands of deaths did Baltimore, like most other U.S. cities, slowly come around to sterner measures, closing schools and outlawing public gatherings.3

Detecting Dialysis Deaths

Elderly kidney-dialysis patients sometimes die. So, in mid-August 2001, when four elderly patients became seriously ill and died shortly after dialysis treatment at a Madrid hospital, the deaths did not make the news. In accordance with standard rules, however, the hospital sent a notice to the maker of the $15 dialysis filters used in the treatment process. That maker was U.S.-based Baxter International, which had recently acquired Althin Medical AB, the Swedish maker of the dialysis products.

This seemingly random event was followed a week later by another incident, made more ominous because of its timing. Six more dialysis patients died in Valencia, Spain. Regional health officials investigated and discovered that all six Valencians had received their treatments from systems using the same single lot number of Baxter dialysis filters. Although the media published the news of the connection, other events, notably 9/11, soon overshadowed the story.

Even though Baxter did not make headlines, the company was neither oblivious nor apathetic to those first deaths in Spain. As a precaution, Baxter recalled the two lots associated with the deaths, halted distribution of that line of filters, conducted an internal investigation, and commissioned independent tests from TUV Product Service, a widely used international certification and inspection service. On October 9, after weeks of testing, TUV and Baxter announced that neither could find anything wrong with the filters, local water supplies, or equipment. “We do not see any connection between the dialysis and the deaths,” said Frank Pitzer of TUV.4

Even as Baxter and TUV were announcing that the filters were safe, more people were dying. On October 13, Alan Heller, the recently named head of Baxter’s renal division, was at a conference in San Francisco. There, he picked up a voice mail that told of terrible events being reported on TV in Croatia.5 In an average week, three to six dialysis patients die in Croatia from the inevitable complications of kidney failure. However, between the 8th and 13th of October, 23 Croatian kidney patients in eight different centers had died. Twenty-one had used a Baxter-made dialysis filter. The event made world headlines and prompted an emergency meeting of the Croatian cabinet.

Baxter immediately dispatched a team of medical specialists to Croatia to spearhead the investigation.6 “It is nearly impossible at this early point in time to identify with any certainty the causes of these patient deaths since there are many variables involved in these incidents,” said Dr. Jose Divino, the leader of Baxter’s investigative team.7 Baxter spokeswoman Patty O’Haer told the Reuters news agency that “data from hospitals show that there was more than one common element to all the patients, including what we call disposables—needles, dialysis and the solution used in the process.”8 Nor were the deaths identical; some patients died during dialysis and others after; some had heart attacks, while others suffered strokes or developed breathing problems.

Despite the unclear evidence, “I knew that there was too much there to be a coincidence,” Heller said.9 Baxter immediately recalled the filters and launched a full-scale investigation into the cause of the problem. More than two dozen staffers were assigned to scrutinize the filters for evidence of defects or tampering.

At first, Baxter could find nothing wrong. Then an engineer at Baxter’s Swedish filter-making plant noticed an anomalous bubble in one filter. The bubble seemed to be from the perfluorocarbon liquid used on some filters during the manufacturing process. Although the liquid was inert and nontoxic, it vaporized at a low temperature. Baxter investigators theorized that this could create small gas bubbles in the patient’s blood that then precipitated the deaths. After tests on rabbits replicated the symptoms, the company announced, on November 2, that it believed that this liquid was the cause of the disaster.

To this day, Baxter does not know exactly why any of the perfluorocarbon liquid was left inside the filters. The processes Baxter used at the acquired Swedish plant had been used for nearly a year and were used by other manufacturers. Perhaps the ultimate irony is that the offending liquid was used specifically for quality-control processes.

In all, more than 50 deaths were connected with the perfluorocarbon bubble. Although the majority of victims were from Spain and Croatia, other patients in Italy, Germany, Taiwan, Colombia, and the United States also succumbed to the effects of the faulty filters. Baxter discontinued manufacturing the problematic product line. Some 360 workers in Sweden and Florida lost their jobs. Baxter’s $100 million acquisition of Althin in 2000 resulted in $189 million in damage to the company’s bottom line in 2001 to cover the costs created by the defective filters.

Detecting Disruption in Global Organizations

Baxter, like many global businesses, faces numerous challenges in detecting problems associated with its products. The first basic challenge is to realize that there is a disruption. Some rate of product failure can be expected simply as a result of random causes, and it may be difficult to ascertain that there is a systemic problem. The second challenge is to identify the cause. Products may fail for numerous reasons, many of them involving exogenous factors or improper use. In addition, most products are built from parts provided by many suppliers; finding the root cause of a failure requires detection across multiple company lines, each of which has a stake in making sure that its parts are not at fault. In many cases, finding the root cause is the only way to ascertain that there is a problem. These challenges both delay the detection of the problem and create further confusion during the disruption. For most business disruptions, detection delays cost money. In Baxter’s case, the delay in detecting the problem with the filters cost lives.

How Do Organizations Detect Disruptions?

The challenge Baxter faced in its detection efforts was the difficulty in spotting a pattern that would indicate an aberration in daily variations of common random events. Although dialysis prolongs the life of kidney-failure victims, it does not cure the disease. Approximately 10 to 20 percent of dialysis patients succumb each year to the severity of their medical condition. An estimated 218 dialysis patients are expected to die every day during 2005 in the United States alone.10 Of course, this does not mean that exactly 218 people will die every day.11 The daily number will fluctuate just because of the laws of probability and a spike in the number of deaths may simply be one of those fluctuations. In a region where the average number of daily deaths is low, the variability will be high. For example, in a region where the average is, say, two deaths per week, one can expect seven weeks a year with no deaths at all and approximately 17 weeks a year with three or more deaths.12 Spotting a problem in this context takes more than simply looking at higher-than-average death rates.
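The figures in the two-deaths-per-week example are consistent with treating weekly deaths as a Poisson random variable. The Poisson model is an assumption here (the text appeals only to "the laws of probability"), and the function names are illustrative. A minimal sketch:

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events under a Poisson law with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def expected_weeks(mean_per_week: float, weeks_per_year: int = 52):
    """Return (weeks with zero deaths, weeks with three or more deaths) per year."""
    p_zero = poisson_pmf(0, mean_per_week)
    # P(3 or more) = 1 - P(0) - P(1) - P(2)
    p_three_or_more = 1.0 - sum(poisson_pmf(k, mean_per_week) for k in range(3))
    return weeks_per_year * p_zero, weeks_per_year * p_three_or_more

quiet_weeks, heavy_weeks = expected_weeks(2.0)
print(f"weeks with no deaths: {quiet_weeks:.1f}")             # 7.0
print(f"weeks with three or more deaths: {heavy_weeks:.1f}")  # 16.8
```

Under this assumption, roughly a third of all weeks show a count at least 50 percent above the average purely by chance, which is why a higher-than-average week is not, by itself, evidence of a problem.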

The process that many organizations use to distinguish between common highs and lows of any process (such as the temperature of an industrial boiler, the number of chips yielded per wafer, or terrorist “chatter”) and “outlier” events warranting special scrutiny is called statistical process control (SPC). With SPC, a company calculates upper and lower control limits—recording, for example, the normal range of variation of the number of daily and weekly deaths while the process is considered under control. Watching the process to identify unfavorable trends can help the company detect situations when the process is, or threatens to get, out of control. Thus, the numbers help a company, like Baxter, know when some rate of deaths is “too many to be a coincidence.”
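One common way to set such limits for event counts is a chart that places the limits three standard deviations from the baseline mean, with the standard deviation taken as the square root of the mean (a Poisson assumption). The sketch below follows that convention; the baseline numbers are invented for illustration and are not Baxter's or Croatia's actual data.

```python
import math
from statistics import mean

def control_limits(baseline_counts, sigmas=3.0):
    """Return (lower, center, upper) limits from counts observed while in control."""
    center = mean(baseline_counts)
    spread = sigmas * math.sqrt(center)  # Poisson assumption: variance equals mean
    return max(0.0, center - spread), center, center + spread

def out_of_control(count, limits):
    """True when a new count falls outside the control limits."""
    lower, _, upper = limits
    return count < lower or count > upper

# Hypothetical weekly death counts during a period considered normal.
baseline = [4, 3, 6, 5, 4, 3, 5, 6, 4, 5]
limits = control_limits(baseline)

print(limits)
print(out_of_control(6, limits))   # within normal variation
print(out_of_control(23, limits))  # far above the upper limit
```

With a baseline in the range of three to six deaths per week, a week with 23 deaths falls far outside the upper limit, while a week with six does not.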

Even when a problem is suspected, companies face a further challenge. Any given problem might have many potential root causes. Possible causes for the dialysis-related deaths included a statistical fluke (handled by SPC), contaminated water at the dialysis clinic, faulty Baxter dialysis equipment, contaminated needles, contaminated tubing, contaminated IV fluid, product tampering, physician error, or defective Baxter filters. Because the majority of these causes were outside of Baxter’s control, it was easy for the company to miss the true cause until it became more evident, in Croatia, that the filters probably played some role in the problem.

When Does an Organization “Know” Something?

With headquarters located in Deerfield, Illinois, with manufacturing and research facilities located in dozens of countries (including the acquired products division in Sweden), and with customers located throughout Spain and 99 other countries, Baxter faced the not-uncommon challenge of communication in a global enterprise. For example, the president of the Althin division learned of the problems in Croatia from a voice mail message that described the media firestorm developing in Croatia, rather than learning it through internal communications channels or from customers directly.

Detecting problems can take time because of the many ambiguities involved: Is there a problem? Where is it? And so on. Thus, the search for a cause is part of the detection process. The first set of independent tests commissioned by Baxter took four weeks to examine the cytotoxicity, intracutaneous reactivity, systemic toxicity, and hemolysis of the filters in accordance with ISO 10993 standards.13 These tests found nothing. The second investigation took more than two weeks to pinpoint the probable cause to be the perfluorocarbon liquid used at the Althin factory in Sweden.14

Many victims’ families and their lawyers argued that the later deaths, in Croatia and elsewhere, could have been prevented. But this view fails to acknowledge the complexity and difficulty of identifying the cause of sporadic phenomena in an intricate product and service supply chain. In Baxter’s case, only 10 percent of the filters were even susceptible to the fault and only a few dozen filters out of many millions produced actually had the fatal fault. It took weeks to discover the faulty filters, with the death toll steadily increasing. An epidemic such as SARS or a potential biological terrorist attack could take days or weeks before it is identified.

Because early detection helps halt the spread and minimize the consequences of such an event, the city of Boston instituted an early warning system during the anthrax attacks of 2001: Fourteen Harvard Vanguard treatment centers reported data every day on 250,000 patients, with a database management system looking daily for suspicious patterns of flu-like symptoms. In 2002, the U.S. Centers for Disease Control and Prevention (CDC) announced plans for a similar system against bio-terror attack, one that looks for signs of anthrax, smallpox, or other disease outbreak in the aches, pains, and sniffles of 20 million patients.15 The system is based on statistical process control (SPC) principles, looking for worrisome patterns such as geographical clusters of routine symptoms like respiratory infections and small rashes accompanied by fever, symptoms that may signal a bio-terror attack in progress.

One of the methods for early detection includes monitoring the daily purchasing of over-the-counter medications at pharmacies around the United States. The rationale is that people tend to self-medicate when they feel sick, before they visit a doctor’s office. Thus, a spike in aspirin and cold medication sales might be an early indication of a flu outbreak or bio-terrorism attack in progress.16 In addition, the CDC and the Department of Homeland Security (DHS) have invested heavily in systems to monitor the water and food supplies, as well as air quality, in order to identify pathogens quickly.17

Although it was the fear of bio-terrorism that prompted the CDC and the DHS to create the national monitoring systems mentioned above, the systems’ greatest value may prove to have nothing to do with terrorists. The need for such monitoring systems was demonstrated in 1993 when Milwaukee’s water supply was fouled by the cryptosporidium microbe that entered the water system from cattle lot runoff. More than 400,000 people became ill and 100 died, in part because it took days for disease trackers to realize an epidemic was under way. The CDC, however, was able to identify quickly and point to the cause of a rash of suspicious lung-related complaints in Queens, New York, in late 2001. The breathing problems of the area’s residents were caused by fumes from the burning American Airlines jetliner, flight 587, which crashed there on November 12.

Companies face yet another level of challenge: bridging the gulf between having the data about a disruption and “internalizing” these data. Internalizing the data means absorbing them and communicating them internally so that relevant parties know of the situation with enough clarity to be able to contemplate possible actions. Several examples illustrate how difficult it is to bridge this gap and how long it may take to transmit the message, thus readying the organization for a response.

During the Kobe earthquake, Texas Instruments (TI) in the United States knew of the event even before the Japanese Prime Minister did. The quake broke a trans-Pacific data link and set off alarms that instantly alerted managers in the company’s corporate data center in Dallas, Texas, and TI’s attempts to contact the Kobe facilities uncovered the cause within minutes.

In contrast, it took Japanese government officials four hours to decide that the event warranted disrupting the schedule of higher-ups and to route the news through the chain of command up to the prime minister. After the quake, some faulted the central government for its slow response to the events in Kobe; clearly, low-level Japanese officials (and the local government in Kobe) knew about the quake immediately. The delay in informing the prime minister demonstrates the difficulties in getting a large organization to the point that it internalizes the event and starts thinking about a response.

Similarly, in the case of 9/11, there is no single identifiable moment one can point to and say “this is when the U.S. government knew about the terrorist attacks.” Instead, all that can be said is that different parts of the government knew different facts at different times and only after accumulating enough facts did enough of the government know enough information about the events to take action. In fact, when told of the attack on the morning of 9/11, U.S. president George W. Bush did not immediately internalize the meaning of the news and famously kept reading stories to children for seven more minutes.

Even before the attack itself, and regardless of all the events that were its precursors (see chapter 3), the U.S. 9/11 Commission found that U.S. Federal Aviation Administration officials received 52 warnings prior to September 11, 2001, from their own security experts about potential al-Qaida attacks, including some that mentioned airline hijackings or suicide attacks. The report comments that aviation officials were “lulled into a false sense of security” and “intelligence that indicated a real and growing threat leading up to 9/11 did not stimulate significant increases in security procedures.”18 In other words, the FAA officials did not internalize the warnings.

The way most organizations are structured, information percolates up the chain of authority and commands percolate down to the people who implement them. This process takes time and is imperfect, but it works for non-emergency situations. In an emergency, people have to be empowered to bypass the normal structure of information. Unless the organization has created the requisite culture of distributed decision-making power (see chapter 15), there are numerous barriers to deviations from the normal process (such as actions without explicit authority), especially in the early hours when information is unclear.

Baxter did not expect to have problems with its filters. Regulations and painstaking procedures cover all phases of the design, development, and testing of medical devices. Manufacturing processes are, likewise, heavily regulated and monitored both by the company and by health authorities. Perfluorocarbon liquids are, in general, so inert and nontoxic that some of them have even been proposed as artificial blood substitutes. Yet the particular perfluorocarbon used by the Althin division of Baxter had never been explicitly tested for internal use by the supplier, 3M. Althin and Baxter never expected that liquid to get into patients; in theory, manufacturing processes ensured the removal of the test fluid.

Detecting and internalizing the unexpected is hardest because it often means questioning long-held assumptions about what is possible and moving information outside the normal channels. One of the keys to detection and fast response is the process of escalating knowledge, including the decisions regarding what to inform superiors about and when to do so. Some corporate cultures do this better than others.

Escalations and Warnings

As managers monitor ongoing processes, they continuously face the question of “common” vs. “special” aberrations. Although statistical process control charts can give an indication of an aberration, it typically takes a manager to decide the nature of the irregularity. Is it just a minor disturbance in the process monitored—be it the central processing unit workload, the boiler’s temperature, or the number of SUV rollovers—or does the abnormal reading indicate a real problem? Managers can either (i) do nothing and wait for additional data, (ii) investigate the situation, (iii) notify their superiors (“escalate” the notification), or (iv) take immediate action.

When deciding among the various courses of action, managers have to balance the chances and outcomes between two types of errors: If the problem is serious and they do not escalate or react, the delayed reaction may cause severe damage; if they alert their managers and trigger actions that prove unnecessary—because there was no real disruption—they will cost the company money (in dispatching technicians, activating emergency responses, or stopping ongoing industrial processes) and appear to have faulty judgment.
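One simplified way to frame this balance is to compare the expected loss of waiting with the expected loss of a false alarm. The sketch below is only an illustration of that framing: the probability and cost figures are hypothetical, the function name is invented, and in practice a manager rarely has such numbers with any precision.

```python
def should_escalate(p_real: float, cost_of_delay: float, cost_of_false_alarm: float) -> bool:
    """Escalate when the expected loss of waiting exceeds that of a false alarm.

    p_real              -- estimated probability that the aberration is a real problem
    cost_of_delay       -- damage caused by a delayed reaction if the problem is real
    cost_of_false_alarm -- cost of actions that prove unnecessary
    """
    expected_loss_if_wait = p_real * cost_of_delay
    expected_loss_if_escalate = (1.0 - p_real) * cost_of_false_alarm
    return expected_loss_if_wait > expected_loss_if_escalate

# Hypothetical numbers: a 5 percent chance of a problem that would cost
# $1,000,000 if ignored, against a $20,000 false alarm.
print(should_escalate(0.05, 1_000_000, 20_000))  # True: 50,000 > 19,000

# A 1 percent chance of a $100,000 problem does not justify the same alarm.
print(should_escalate(0.01, 100_000, 20_000))    # False: 1,000 < 19,800
```

The comparison shows why even an unlikely problem can warrant escalation when the damage from delay is large relative to the cost of a false alarm.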

Companies such as Baxter face an especially hard-edged tradeoff on detection sensitivity. If Baxter recalled its products every time a suspicious death occurred it would have to cease selling its products, but missing a defect in a medical device can cause preventable deaths. There are no hard-and-fast rules for deciding when to alert and escalate. The more empowered and informed field personnel are, the more likely they are to make the right call.

Some companies are turning to automated processes to help them detect disruptions quickly. Consider SVS Inc., which processes gift-card transactions for 170 large retailers, including Radio Shack, Barnes & Noble, and Zales Jewelers, as well as restaurant chains and gas stations. As the largest gift-card and stored-value transaction processor, SVS handles 450 million transactions every year; downtime is costly, and availability is critical to the livelihood of the business. “In our business, SVS absolutely requires their online systems to be available 24 × 7,” says Pat Guenthner, vice president of the Louisville, Kentucky, Data Center.19 Because of that, the company invested in a monitoring system that lets technicians scrutinize remotely any increase in CPU utilization as a result of looping or other problem processes. Using SPC, the system also monitors transaction failure rates to identify developing disruptions. A rule-based system lets SVS define when to alert technicians, when to alert managers, or when to send an e-mail that might be seen in the morning instead of an emergency page in the middle of the night.
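A rule-based escalation scheme of this kind can be sketched as an ordered list of rules, checked from most to least severe. Everything below is hypothetical: the thresholds, field names, and action labels are invented for illustration and do not describe the actual SVS system.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # predicate over one monitoring reading
    action: str                      # who gets notified, and how urgently

# Rules are checked in order; the first match wins.
RULES = [
    Rule("failure rate far above control limit",
         lambda r: r["failure_rate"] > 2 * r["upper_limit"], "alert_manager"),
    Rule("failure rate above control limit",
         lambda r: r["failure_rate"] > r["upper_limit"], "page_technician"),
    Rule("CPU utilization elevated",
         lambda r: r["cpu_utilization"] > 0.90, "email_for_morning"),
]

def route(reading: dict) -> str:
    """Return the action of the first matching rule, or 'no_action'."""
    for rule in RULES:
        if rule.matches(reading):
            return rule.action
    return "no_action"

print(route({"failure_rate": 0.05, "upper_limit": 0.02, "cpu_utilization": 0.50}))
print(route({"failure_rate": 0.03, "upper_limit": 0.02, "cpu_utilization": 0.50}))
print(route({"failure_rate": 0.01, "upper_limit": 0.02, "cpu_utilization": 0.95}))
```

Keeping the rules as data rather than as nested conditionals makes it easy to adjust who is alerted, and when, without touching the monitoring code itself.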

The Nokia and Ericsson case illustrates the value of fast detection and internalization of a disruption. Philips alerted both Nokia and Ericsson about the fire at its New Mexico plant at the same time. It was a relatively low-level alert with accompanying reassurances about fast recovery. Whereas Ericsson waited, Nokia acted immediately, alerting an internal troubleshooter of the potential problem and placing Philips on a “watch list.” Nokia’s heightened awareness allowed it to identify the severity of the disruption faster, leading it to take timely actions and lock up the resources for recovery.

Detection and Alarms

In many cases, disruptions are inevitable and warnings should be given immediately. In other cases, the proverbial writing is on the wall and the challenge is to internalize the impending disruption and take actions to mitigate the consequences.

When tornado sirens sounded at the GM plant in Oklahoma on May 8, 2003, the workers took shelter and none of the plant’s 3,000 workers was injured, despite extensive damage to a number of GM’s buildings. When Hibernia Bank learned that Hurricane Lili was bearing down on the Louisiana coast in October 2002, it implemented its “Hurricane Plan” emergency procedures. It prepositioned recovery teams, arranged for lodging for teams and emergency data center staff, and alerted its backup hot-site IT service provider.20 In these cases, there was no question that the danger was imminent.

The impending war between the United States and Iraq was no secret in the spring of 2003. Dow Corning internalized this news and realized that this event would disrupt future shipping in the Atlantic—not directly through acts of violence, but indirectly because the U.S. government would monopolize much of the trans-Atlantic transportation capacity as it ferried tens of thousands of troops and their weaponry, ammunition, and supplies. In anticipation, Dow Corning accelerated its own shipments and built up inventories that helped it weather the subsequent reduction of available shipping capacity.21

On the other hand, in September 2000, the government of Britain did not immediately grasp the impact of the ongoing fuel strikes. Truckers, angered by a new fuel tax, blockaded refineries and fuel depots, thereby creating shortages at filling stations. On the surface, this event—which unfolded nightly on British television screens—seemed to be simply an inconvenience for the driving public. What the government failed to realize fully was that fuel shortages affected food deliveries, too. The result was that the country came within four days of food rationing before the government woke up to intervene in the dispute. Prime Minister Blair warned the strikers that police and troops would be used to clear the blocked fuel depots.22

At their early stages, many disruptions may seem innocuous. Suppliers, business partners, customers, and even governments may release reassuring information. Realizing the magnitude of a large disruption at an early stage requires analytical capability to understand “what does it mean?” and a deep understanding of the system in which the business operates. It is this understanding of the system that enables managers to recognize the sometimes distant and unlikely relationships between the firm and various external threats and disruptions.

To improve early detection, vulnerable companies have instituted special monitoring devices and, in some cases, added new capabilities. For example, following the closure of its Louisville hub resulting from an unexpected snowstorm (see chapter 13), UPS built its own meteorology department. The department issues detailed forecasts regarding key airports where UPS operates—routinely besting the U.S. meteorological service in its forecast accuracy.

New technologies may help companies deal with fast-developing disruptions. In 2004, seismologists in Los Angeles were testing a warning system that could detect an earthquake at its epicenter and send an electronic alert to authorities, companies, and the public. Because the electronic warning signal travels at the speed of light, it reaches parts of the affected areas seconds before the shock waves. Authorities and companies can use these precious seconds to secure or shut down critical processes immediately before the shaking starts, the power goes out, and the pipes rupture.
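The size of that head start can be estimated with simple arithmetic: the time the damaging waves need to cover the distance, minus the time the system needs to detect the quake and issue the alert. The wave speed and detection delay below are assumed round numbers for illustration, not parameters of the Los Angeles system, and the transmission time of the electronic signal is treated as negligible.

```python
def warning_seconds(distance_km: float,
                    wave_speed_km_s: float = 3.5,    # assumed speed of the damaging waves
                    detection_delay_s: float = 5.0   # assumed time to detect and alert
                    ) -> float:
    """Seconds of warning at a site distance_km from the epicenter (never negative)."""
    travel_time_s = distance_km / wave_speed_km_s
    return max(0.0, travel_time_s - detection_delay_s)

print(warning_seconds(70.0))  # a site 70 km away: about 15 seconds
print(warning_seconds(10.0))  # a site 10 km away: no usable warning
```

Under these assumptions, sites close to the epicenter get no warning at all, while sites farther away gain enough seconds to trigger automated shutdowns.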

A tsunami detection system has been in place in the Pacific Ocean since 1948. It is based on signals from eight deep-ocean sensors mounted on buoys and about a hundred coastal monitors, all tuned to detect wave patterns characteristic of a tsunami. In the United States the National Weather Service operates a program called TsunamiReady, promoting emergency awareness, and coastal communities at risk have installed warning systems and disseminate information about evacuation procedures. The system is credited with saving hundreds of lives when Crescent City in Hawaii was evacuated before the tsunami generated by the 1964 Alaska earthquake reached the island. The government of Japan is spending $20 million per year on a completely automated system of tsunami warning based on additional sensors and automatic connection to media outlets for warning. Unfortunately, nations around the Indian Ocean have not installed such a system. The December 26, 2004, tsunami hit Indonesia, Thailand, India, Sri Lanka, Bangladesh, the Maldives, Myanmar, and even Somalia on the east coast of Africa, killing over 175,000 people and causing billions of dollars in damage. While Indonesia was hit within minutes of the quake, the tsunami took two hours to reach Sri Lanka, three hours to reach India, and six hours to reach Somalia. Yet no warning ever sounded.23

Monitoring systems can also provide past data to help catch near misses and developing patterns of disruption. For example, Amazon.com monitors seven variables in real time as it processes the ebb and flow of orders and packages to customers. Amazon gathers tracking data from transportation carriers such as UPS and FedEx, who deliver Amazon.com’s shipments to customers. Amazon.com knows, for example, that packages that get fewer than three scans are more likely to get lost or cause the customer to call Amazon.com about the shipment. The data not only helps Amazon.com detect problems; it also helps the company fine-tune which carrier it selects for which packages, thus improving its service.
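A check of this kind, flagging packages with fewer than three carrier scans, can be sketched in a few lines. The package identifiers and data layout are hypothetical and are not Amazon.com's actual tracking format.

```python
from collections import Counter

def packages_at_risk(shipped_ids, scan_events, min_scans=3):
    """Return shipped package ids that have fewer than min_scans carrier scans.

    shipped_ids -- ids of all packages handed to carriers
    scan_events -- one package id per scan reported by a carrier
    """
    scans_per_package = Counter(scan_events)  # missing ids count as zero scans
    return [pid for pid in shipped_ids if scans_per_package[pid] < min_scans]

shipped = ["PKG-1", "PKG-2", "PKG-3"]
scans = ["PKG-1", "PKG-1", "PKG-1", "PKG-2", "PKG-2"]

flagged = packages_at_risk(shipped, scans)
print(flagged)  # ['PKG-2', 'PKG-3']
```

Starting from the list of shipped packages, rather than from the scan events alone, ensures that a package that was never scanned at all is also flagged.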

Increased Detection Sensitivity as a Benefit for Business

A system of early detection put in place to avoid low-probability disruptions can, of course, routinely alert managers to smaller problems and day-to-day negative trends. Monitoring near misses can point to systemic process problems that are likely to manifest themselves as future disruptions.

Many of the initiatives of the Department of Homeland Security in the United States are focused on providing electronic monitoring of shipment integrity and shipment status. “Smart” container seals can show telltale signs of tampering, and radio frequency-based shipment identification (RFID) tags can help determine the content and location of shipments.24 Naturally, smart seals, along with closed-circuit cameras and theft-detection electronic tags in terminals and loading docks, can also help deter theft and vandalism.

The location technologies that let shippers and consignees know where shipments are (creating “visibility,” in the parlance of supply chain professionals) can reveal ominous tampering with shipments. Most of the time, however, shipment visibility applications are part of automatic event management systems used to alert consignees of late (or early) shipments and incorrect shipment content. Such shipment visibility systems can also be used to reroute shipments to areas where demand is unexpectedly high and away from areas where demand is lower than expected, or reroute shipments to cover late arrivals. Thus, shipment visibility systems designed for the primary purpose of detecting tampering can also be used to create flexibility in responding to day-to-day demand variations.

The Challenge

Is it anthrax or Sweet’n Low? Cyanide or almond paste? In the past, firefighters dealing with unknown substances leaking from overturned tankers or rail cars had to wait hours to find out. But now, with the possibility of a bio-terror attack, emergency crews might have only minutes to determine if they’re dealing with a terrorist attack, a college prank, or a media-induced panic.25 Dozens of companies, federal laboratories, and research institutions around the world are racing to develop detection methods for warnings against nuclear, biological, or chemical attacks. The aim is to develop “smoke detector”-type systems: accurate systems that are “always on,” sounding an alarm when they detect radiation, biological, or chemical hazards in shipments, buildings, water, or food supply.26

Such early detection can give firms and the government time to implement containment and recovery operations and to prepare customers and the population at large for the disruption. In many cases, however, companies invest in early detection systems because of the tangible day-to-day management and control functionality that such systems provide. With early detection, managers can still recover and expedite a late shipment, order another, or alert their customers to the problem. Thus, the same functionality that helps create a smooth, cost-efficient flow of goods and services also helps firms monitor their systems for disruptions.

Prepared or not, however, when disaster strikes, the first line of defense is typically redundancy: Having extra inventory, surplus capacity, or alternative supply sources can give a firm time to organize its response and recovery. This is the subject of the next chapter.

License
Copyright © 2005 Massachusetts Institute of Technology. (All rights reserved.)