The likelihood that a specific high-impact disruption will take place at a particular time at a given location is very small. However, the probability that something significant will happen somewhere, sometime during a year and disrupt a multinational company with dozens of facilities and many thousands of suppliers across the globe is not small at all. “I have 14,000 suppliers. I guarantee that with 14,000 suppliers, at least one of them is not performing well today,” said Tom Linton, chief procurement and supply chain officer at Flextronics.
One role of risk managers is to prepare options for crisis managers and their teams. An option represents the right, but not the obligation, to take a certain action. The type of option discussed here is not a financial option but rather a so-called real option or investment that gives crisis managers a tool or capability that they can use, at their choosing, in case of a disruption. To use a trivial example, when a firm invests in fire extinguishers in a building, it, in effect, acquires the option to use them in case of fire.
Building a resilient enterprise involves two broad categories of options: building redundancy and building flexibility of supply chain assets and processes. A related group of options, beyond the design of the supply chain itself, includes operational response capabilities: the specialized places, people, and processes needed to deploy redundant and flexible assets during a crisis. Companies such as Walmart, Cisco, and Intel develop response capabilities by creating a set of special plans, designated people, dedicated places, and specific processes to accelerate and coordinate an effective and efficient response.
The concept of a financial stock option provides a way to think about the value of preparing for disruption response. Real options are tangible assets that give the owner of the option the right, but not the obligation, to do something.1 For example, the extra inventory in a warehouse or the spare capacity in a factory is a real option stemming from supply chain design; the owner can use that spare stock or extra capacity during a supply disruption or a surge in demand. Similarly, crafting a well-documented business continuity plan to deal with a certain type of disruption allows crisis managers to activate that plan when the need arises.
Real options have two defining elements: a known upfront cost of creating the option (e.g., the capital cost of spare factory capacity or the cost of training and drilling) and an unknown payoff that is contingent on some uncertain future event (e.g., the value of business continuity if the redundant or flexible capacity is needed). The mathematics of real options weighs the uncertain benefits of being able to use the option in the future against the certain costs of creating it in the present. The model considers the statistical likelihood of the benefit over time and the cost of the money over the time period when the option is available. (Chapter 14 contrasts and compares real options vs. insurance coverage.)
The analytical methods to evaluate real options are beyond the scope of this book,2 but the general results are clear. If the likelihood of needing an option is high enough or the payoff from having the option, relative to its cost, is high enough, then the option will be worth the investment. Most crucially, the value of having an option increases as volatility increases, be it demand volatility or supply volatility. If the world has more disruptions, or if the company is more exposed to them, then real options such as spare capacity become more valuable. In contrast, if there were no disruptions or unexpected demand fluctuations, there would be no value to investing in optional capabilities that never get used.
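The volatility effect described above can be illustrated with a deliberately simplified sketch. It is not one of the analytical real-options methods the text refers to; it only compares the average payoff of spare capacity under two sets of equally likely demand scenarios that share the same mean. The function names, scenario values, margin, and discount rate are all hypothetical.

```python
def expected_option_payoff(demand_scenarios, base_capacity, spare_capacity, unit_margin):
    """Average margin earned by spare capacity over equally likely demand scenarios."""
    total = 0.0
    for demand in demand_scenarios:
        # Spare capacity only pays off when demand exceeds normal capacity.
        overflow = max(0, demand - base_capacity)
        total += unit_margin * min(spare_capacity, overflow)
    return total / len(demand_scenarios)


def option_net_value(upfront_cost, yearly_payoff, discount_rate, years):
    """Discounted expected payoffs minus the known upfront cost of the option."""
    discounted = sum(
        yearly_payoff / (1 + discount_rate) ** t for t in range(1, years + 1)
    )
    return discounted - upfront_cost


# Hypothetical scenarios: same average demand (100), different volatility.
calm = [100, 100, 100]
volatile = [80, 100, 120]

calm_payoff = expected_option_payoff(calm, 100, 20, 10.0)
volatile_payoff = expected_option_payoff(volatile, 100, 20, 10.0)

calm_value = option_net_value(50.0, calm_payoff, 0.10, 5)
volatile_value = option_net_value(50.0, volatile_payoff, 0.10, 5)
```

Under these assumed numbers the spare capacity is never used in the calm case, so its net value is simply the negative of its cost, while the volatile case earns a positive payoff in one scenario out of three.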
Naturally, good preparation involves a portfolio of options, including extra inventory, capacity, and sources of supply, or having the flexibility to change production schedules, input materials, and shipping lanes. Redundancy and flexibility extend into the supply chain. “You need to create fungible capacity among your network of suppliers so that you can absorb spikes in demand. You have to have a prequalified set of suppliers whom you can ramp up or down as needed,” said Martha Turner, vice president at management consultancy Booz & Co.3 In addition to creating options with ordinary supply chain assets, companies can also create specialized assets in the form of disruption management tools. Having options in place can increase an enterprise’s resilience by reducing the time to recovery, mitigating customer disruptions, and avoiding negative long-term consequences.
When Superstorm Sandy threatened the East Coast, AT&T sent in the COWs—which stands for “cell on wheels”—special truck trailers that can erect a high-capacity cellular network station anywhere, anytime. The self-contained trailers include multibeam antennas on a telescoping tower, a power generator, network equipment, and cooling equipment. Different COWs offer different functions, such as 5- and 18-beam cell towers, mobile command center, and microwave backhaul for relaying the combined traffic from the cellular signals to a distant high-capacity Internet connection.4 The company also has a fleet of smaller COLTs (cell on light truck) that use a satellite up-link to provide service.
If a natural disaster strikes, AT&T can send in these vehicles to immediately restore cellular service while the harder job of repairing downed cell towers and power lines takes place. “We designed the trailers so that the stuff in the central office is exactly the same equipment you have on the trailers that we pull around. We can pull these trailers anywhere. Put them in a parking lot and that parking lot becomes our central office,” said AT&T’s director of network disaster recovery Robert Desiato.5 In the case of Superstorm Sandy, the company had five days’ warning of the impending storm and could reposition its nationwide fleet of COWs to be ready to serve any areas hit by Sandy.
Yet COWs are not just for handling disruptions during a disaster. In 2012, AT&T sent nine COWs to Indianapolis for Super Bowl XLVI to help support the massive surge in cellular use that occurs when 85,000 fans, press, staff, and players inundate the stadium and surrounding area.6 The company deploys COWs to sporting events, festivals, large public gatherings, or any time the company expects a localized surge in demand for cell phone usage. An event like Superstorm Sandy creates both a supply disruption and a demand surge. The storm knocks out power and cell towers at the same time that call volumes increase two to four times over normal daily averages, according to Tim Harden, president, Supply Chain and Fleet Operations for AT&T Services, Inc. Whether it’s Superstorm Sandy or Super Bowl Sunday, these purpose-built assets provide AT&T with an option to deliver needed bandwidth where demand outstrips supply regardless of the reason.
Redundant amounts of standard supply chain assets such as inventory, production capacity, and multiple facilities offer an obvious option for crisis managers. Such assets require no additional expertise to create and no qualitative changes to operational processes. They only require the willingness to invest in creating and maintaining spare amounts of familiar assets, giving crisis managers the option of utilizing the extra assets to mitigate the effects of a disruption.
“During the hurricane period in Central America, we ensure we have more stocks in the region by applying a so-called hurricane factor in our safety stock levels,” said Frank Schaapveld, senior director supply chain Europe, Middle East, and Africa (EMEA) for medical-equipment maker Medtronic.7 Extra inventory of both finished products and parts can be utilized immediately after a disruption. Even if the inventory is not sufficient to cover the entire time-to-recovery, it allows crisis managers to “catch their breath” and organize for response, continuing operations while collecting data from suppliers, consulting with customers, and launching various recovery efforts.
All companies hold some inventory to cover both the average level of demand between regular cycles of manufacturing and shipments (cycle stock) and to handle routine fluctuations in supply and demand (safety stock). The mathematics of inventory management helps companies estimate the amount of cycle stock that balances the economics of production and transportation vs. the cost of holding inventory, as well as the amount of safety stock that is needed to provide a given level of customer service in the face of uncertain demand, as discussed in “The Story of Inventory” in chapter 1. In addition, work-in-process inventory is the stock held while the parts or products undergo some process, such as during shipment or conversion.
At its core, inventory provides a decoupling function between links in the supply chain. The traditional view of inventory management is that cycle stock allows each stage in the supply chain—such as ordering, production, or distribution—to operate at its own optimal rate, creating optimal order size, manufacturing batch size, and shipment size in accordance with the parameters of that activity. Safety stock allows for decoupling of activities from random variations. Thus, for example, distribution centers hold inventory in order to decouple the manufacturing process from random customer demands and manufacturing plants hold parts inventory to protect against variation in the inbound flow of parts. These operational safety stock sizing decisions generally assume a normal distribution of statistical variations and the ready possibility to reorder more product at any time.
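The normal-distribution assumption mentioned above leads to a common textbook sizing rule for operational safety stock. The sketch below uses that rule; it is not taken from this book, and it assumes independent daily demand with a constant standard deviation and a fixed replenishment lead time. The function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def safety_stock(service_level, daily_demand_std, lead_time_days):
    """Textbook safety stock: z * sigma_daily * sqrt(lead time in days).

    Assumes normally distributed, independent daily demand and a fixed
    lead time. service_level is the target probability of not stocking
    out during one replenishment cycle and must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf(service_level)  # standard normal quantile
    return z * daily_demand_std * sqrt(lead_time_days)


# Hypothetical figures: daily demand std of 20 units, 9-day lead time.
stock_95 = safety_stock(0.95, 20.0, 9)
stock_99 = safety_stock(0.99, 20.0, 9)
```

A rule of this kind covers routine variation only. It says nothing about rare, large disruptions, which fall outside the normal-distribution and ready-reorder assumptions.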
In addition, companies may keep extra inventory for mitigating larger disruptions that could have a significant effect on operations. Such extra inventory also fulfills a decoupling function: isolating the company’s customers from a disruption. Whether a company chooses to hold these extra inventories depends on the value-at-risk, the cost of holding enough inventory to cover the value-at-risk, the cost/possibility/timing of procuring alternate supplies in the event of a disruption, and the likelihood of a disruption. Holding inventory to protect against large-scale disruptions may not be cost-effective because a long-duration supplier failure would require a large amount of extra inventory. In addition, the low frequency of large-scale disruptions means that such extra inventory will have to be carried for a long time before its value is realized, if ever.
The Hershey Company is the largest chocolate manufacturer in North America. It operates two plants next to each other in Hershey, Pennsylvania, and thus may be vulnerable to a local disruption. To mitigate this risk, the company keeps six months’ inventory of milk chocolate in giant blocks (so as to minimize the amount of chocolate subject to oxidation) in a refrigerated warehouse.8 Most companies cannot afford to have six months’ worth of parts, raw materials, or finished products. Yet, long and complex supply chains may have many points where inventory accumulates and can be used during a disruption.
GM faced a potential disruption of catalytic converters when a maker of the ceramic honeycomb substrate inside the converter suffered severe yield problems. Some 40 percent of the delicate honeycombs were being scrapped, which meant that there wasn’t enough supply to cover demand. This capacity bottleneck was rooted deep in a multitier supply chain, in which a Tier 4 supplier made the substrate, Tier 3 suppliers coated it with the catalyst, Tier 2 suppliers installed it in a metal shell, and Tier 1 suppliers assembled it into a complete exhaust system.
GM averted a disruption to car production by accelerating the manufacturing and shipment processes along the three-month catalytic converter supply chain. It also identified inventory hidden in manufacturing work in process (WIP), in safety stock held at various upstream echelons, and in products in transit. Each tier along the chain had these hidden inventories and, in total, these inventories could provide weeks, even months, of supply during the crisis. The three-month cycle time from substrate to vehicle meant there was three months’ worth of catalytic converter inventory in various states scattered across the chain.
According to Fred Brown, GM’s director of assembly and stamping in the global supply chain organization, the company worked to accelerate each stage and each shipment, reducing the cycle time down to less than one month. Although it was more costly to run fast and lean, accelerating the cycle let GM access two months’ worth of converters that were in WIP inventories along the chain. This covered the production shortfall while the substrate maker fixed its yield problems. Once the yields improved, the substrate maker produced extra substrates to refill the buffers in the chain and return to the standard three-month cycle time.
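The arithmetic behind the GM example can be sketched as follows, assuming a steady flow in which each month of cycle time corresponds to roughly one month of supply sitting in the pipeline. The function names are hypothetical, and so is `shortfall_fraction`: the text reports a 40 percent scrap rate, not the size of the resulting supply gap, so using 0.4 here is purely illustrative.

```python
def months_released(standard_cycle_months, accelerated_cycle_months):
    """Months of supply freed from the pipeline by shortening the cycle time.

    Assumes steady flow, so pipeline inventory (in months of supply)
    equals the cycle time (in months).
    """
    return standard_cycle_months - accelerated_cycle_months


def months_of_cover(released_months, shortfall_fraction):
    """How long the released inventory can fill a supply gap.

    shortfall_fraction is the share of monthly demand that current
    production cannot meet (an assumed input, not a figure from the text).
    """
    return released_months / shortfall_fraction


released = months_released(3, 1)         # three-month cycle cut to one month
cover = months_of_cover(released, 0.4)   # illustrative 40 percent gap
```

With these assumed inputs, two months of released inventory would bridge a 40 percent shortfall for about five months, which is the window available for fixing the underlying yield problem.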
Similarly, other companies, even those with lean supply chains, have also found hidden inventories in long global supply chains that could be used to mitigate disruptions. Intel, for example, weathered the impact of the 2011 Japanese tsunami, in part because of these hidden inventories diffused across its otherwise lean supply chain.
Note, however, that disruptions that take place “downstream” in the supply chain, in other words at an OEM plant or at a Tier 1 supplier, will have fewer opportunities to find “hidden inventories” and to accelerate the flow of materials and parts. In the case described above, GM was able to take advantage of hidden inventories and accelerate the flow exactly because the disruption was four tiers deep in the supply chain, and three months removed.
Whereas cycle stock and traditional safety stock can be modest in volume in a lean supply chain, the extra inventory required to cover high-impact disruptions may be quite large if the time-to-recovery is long. In addition, as mentioned above, the rarity of such disruptions means that these large inventories have to be carried for a long time. Consequently, such redundancy is too costly in most situations.
Furthermore, some materials might be subject to hazmat (hazardous material) regulations that impose added costs or restrictions on large inventories. When Intel faced a potential disruption in high-grade hydrochloric acid supplies, it struggled to find a facility willing to store four months’ supply of the highly corrosive acid (see chapter 7). Finally, the inventoried items might have a limited lifespan. In some industries, such as fashion and high technology, goods become obsolete quickly—newer and better models become available, reducing the value of the items in stock. In other industries, such as chemicals, food, and pharmaceuticals, materials have shelf-life limits, noted Tim Hendry, Intel’s vice president and director of fab materials. Whether because of cost, regulatory edicts, market forces, or material properties, there are practical limits to the amount of inventory that many commercial enterprises can keep.
Perhaps the highest cost of extra inventory—a factor that the Toyota Production System (TPS) brought to light—is the cost of quality issues. Faster inventory turns in lean systems accelerate the detection and mitigation of product defects and lapses in quality. In a traditional first-in-first-out make-to-stock inventory management system, defective parts arising from a defective manufacturing process may go undetected until those parts work their way through the inventory “pile” and make it into production or sale. In contrast, faster turns mean faster detection, learning, and improvement, which are keys to the concept of kaizen (continuous improvement), which is part of the Toyota Production System. Thus, although added inventory can allow a company to keep satisfying customers’ demand after a disruption (at least for a time), it increases costs and introduces product quality risks.
Redundancy comes in many forms. Having two suppliers with different risk profiles introduces redundancy—giving the company the ability to use materials or parts from whichever supplier is not disrupted. However, in many cases additional suppliers come with additional costs because each supplier provides a smaller volume that offers less opportunity for economies of scale and amortization of fixed costs such as tooling, engineering, and contract management. When a disruption at a wheel parts supplier forced GM to source parts from multiple suppliers, costs climbed as each one was handling a small part of the business (see chapter 11). In addition, suppliers may be reluctant to share innovation with the OEM when they know that the OEM also works with their competitors. Finally, in high-tech, automotive, aerospace, chemicals, and other industries, the introduction of more than one plant making a certain material complicates engineering and quality assurance processes.
Cisco considers both dual manufacturing sites and the qualification of alternate sites when assessing the resilience of new suppliers associated with new products. This assessment, together with the supplier’s inventory levels, capacity reservations, manufacturing rights, and other mitigation measures, contributes directly to Cisco’s own Resiliency Index (see chapter 7). This index is a scoring mechanism used at Cisco for all new product introductions.9
In addition to inventory, companies might have redundancy in their own production facilities and production capacity. For example, Medtronic opened a second distribution center in Europe after a risk assessment revealed that having only one distribution center posed too much risk.10 A company might also build and use more than one factory site for the same product line rather than a single factory location. Naturally, the decision to operate with multiple sites has many implications beyond resilience—the addition of a plant or a warehouse will affect costs, taxes, local employment, corporate social responsibility exposure, and customer service levels.
Flexibility is a strategy for increasing the number of potential uses for a given asset. This applies equally to production lines that can be configured to manufacture several products, the use of retail stores for e-commerce, or the cross training of employees so they can be moved between tasks, as needed. Plants, for example, can be either specialized (one location can produce only one product from one set of parts) or flexible (each plant can produce many different products from many different sets of parts). Flexibility enables resilience. If one asset or supplier becomes disrupted, other flexible assets can be redeployed to produce, store, or move the product handled by the disrupted asset.
Flexibility and redundancy complement each other. Redundancy—in particular extra inventory—provides near-instantaneous coverage, but only for a finite duration. Flexibility can cover longer duration disruptions with a shift in asset deployment but may take time to implement. Thus, redundancy provides time for organizations to “fire up” their flexible assets by reconfiguring equipment, repurposing machinery, contacting alternate suppliers, reassigning personnel, shipping raw materials to the back-up facility, and so forth. Both redundancy and flexibility are means to reduce the white-space gap between the moment of disruption and the beginning of recovered production, service, and supply.
Flexibility ensures that when the chips are down, they are not down for long. For example, a 2003 tornado ripped the roof off of P&G’s Pringles chips plant in Jackson, Tennessee. The damage threatened the snack supplies of the company’s United States, Latin American, and Asian customers. Fortunately, P&G had a second plant in Mechelen, Belgium, that also could make Pringles. P&G ordered extra raw materials with an accelerated lead-time, boosted production in Belgium, and also tapped European inventories to help cover the shortfall in US production.
Flexing the remaining capacity did take some work, however. Only two packing lines in Belgium could run the Asian SKUs as a result of case-count differences (14 vs. 18 units per case). And a special quality assurance team had to be formed to oversee production of the unique flavors needed for Japanese products. Overall, the Belgium plant delivered 18.6 million cans of crispy chips, and P&G discovered some opportunities to improve its supply chain.11
Full flexibility (being able to make anything anywhere) is usually prohibitively expensive or infeasible. The natural and economical specialization of machine tools and labor implies that full flexibility is too costly. Yet research shows that a company can achieve extremely high levels of flexibility at the enterprise level with only a modest amount of flexibility at the plant level.12 The strategy requires that each plant be able to make just two different products, with this flexibility arranged in a special way.
For example, imagine having four factories, A, B, C, and D and four products, 1, 2, 3, and 4 with a particular pattern of flexibility: factory A can make products 1 and 2; factory B can make products 2 and 3, factory C can make products 3 and 4, and factory D can make products 4 and 1. If factory B gets disrupted, threatening deliveries of products 2 and 3, then both factories A and C can chip in to cover for the loss of factory B. Even if factory C is already running at capacity, it can still take over production of product 3 by shifting C’s responsibilities for product 4 to factory D. As long as some spare capacity exists somewhere in the network, production can be shifted. By creating a “daisy chain” of product assignments to factories, the company can literally shift production around the network and create a system that is flexible enough to make every product even if one facility is disrupted.
Flexibility requires some amount of system-wide redundancy. Assuming in the example above that each factory has the same capacity and that product demand is the same, this flexibility strategy would require each of the three remaining factories to boost production by about 33 percent on average to cover for the disrupted fourth factory. Of course, part of the benefit of such an arrangement is that the extra capacity is needed on average and not necessarily in each plant. As with the case of surge capacity at second-source suppliers, a company might use overtime, an additional shift, or might delay routine maintenance to boost production on a temporary basis.
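The four-factory daisy chain described above can be checked with a small feasibility sketch. It models the network as a flow problem (factories supply capacity, products absorb demand) and solves it with a generic Edmonds-Karp maximum-flow routine. That routine is a choice made here for illustration, not a method described in the text, and the demand of 90 units per product and the capacity figures are hypothetical.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity graph."""
    residual = {}
    for u, edges in capacity.items():
        for v, c in edges.items():
            forward = residual.setdefault(u, {})
            forward[v] = forward.get(v, 0) + c
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse edge
    flow = 0
    while True:
        # Breadth-first search for a path with remaining capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def can_meet_demand(capabilities, factory_capacity, demand, disrupted=None):
    """True if the surviving factories can cover all product demand."""
    graph = {"source": {}}
    for factory, products in capabilities.items():
        if factory == disrupted:
            continue
        cap = factory_capacity[factory]
        graph["source"][("factory", factory)] = cap
        graph[("factory", factory)] = {("product", p): cap for p in products}
    for product, units in demand.items():
        graph[("product", product)] = {"sink": units}
    return max_flow(graph, "source", "sink") == sum(demand.values())


chain = {"A": [1, 2], "B": [2, 3], "C": [3, 4], "D": [4, 1]}
dedicated = {"A": [1], "B": [2], "C": [3], "D": [4]}
demand = {1: 90, 2: 90, 3: 90, 4: 90}
surge = {f: 120 for f in "ABCD"}   # about 33 percent above normal output
tight = {f: 100 for f in "ABCD"}   # only about 11 percent above normal output

chain_ok = can_meet_demand(chain, surge, demand, disrupted="B")
dedicated_ok = can_meet_demand(dedicated, surge, demand, disrupted="B")
tight_ok = can_meet_demand(chain, tight, demand, disrupted="B")
```

Under these assumptions the chained network survives the loss of factory B once each remaining plant can surge by a third, whereas the dedicated network cannot make product 2 at all, and the chained network with too little surge capacity still falls short.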
Walmart’s wide selection and everyday low-price strategy depends on the smooth functioning of its distribution network. Walmart has 158 distribution centers (DCs) in the United States, with each DC serving about 90 to 100 stores within a 200-mile radius.13 If a distribution center goes down on account of weather, natural disaster, power outage, or other problems at the facility, then timely replenishment to that DC’s regional network of stores would stop. To prepare for these types of emergencies, each DC has two or three nearby DCs that are designated as back-ups. If one DC goes down, an emergency realignment of the service areas of the backup DCs fills in for the disabled DC.
This flexibility requires that Walmart have some spare capacity in each DC so that it can continue to serve its original region while also contributing to serving the region that has a disabled DC. Although a simplistic analysis would suggest that each of the two or three backup DCs would need between 33 and 50 percent extra capacity to make up for a 100 percent loss of another DC, the required redundant capacity is much smaller. First, each backup DC can increase its output through overtime and extra labor. More important, the DC coverage is dense. As the backup DCs devote a fraction of their capacity to serving the region of the missing DC, a shortage may develop in the regions the backup DCs normally serve, but those DCs also have backup DCs that can help make up the shortfall.
As part of its desire to provide faster and same-day service to its customers, Amazon embarked in 2011 on an ambitious program of building dozens of fulfillment centers around the United States. While intended to support its aggressive service level targets, the large number of facilities naturally allows the company to back up each one in case of a disruption, adding to Amazon’s resilience.
Risk pooling is a statistical phenomenon by which flexible systems have lower volatility risks than do specialized systems. For example, Dr. Pepper Snapple Group (DPS) built its Victorville, California, plant with flexible bottling lines that can each handle both cold- and hot-fill products, including carbonated soft drinks, energy drinks, teas, juices, and bottled water. Moreover, each line can handle different container sizes.14 The flexibility to make different products on the same equipment enables risk pooling—a reduction in overall risk or volatility by aggregating across multiple risks or volatilities.
Demand volatility of the individual bottle sizes and flavors is high owing to fickle consumer preferences, retailer promotions, seasonality, and weather. If each variety and size of beverage used a different bottling line, utilizations of each line would also be very volatile and many lines would be idle much of the time while others ran out of capacity. Yet, demand volatility for beverages overall is much lower—people drink something every day—so that each day the demand for a given flavor or type may be down while another one is up. On average, such random variations tend to cancel each other out and consequently the volatility of utilizing a flexible bottling line will be lower than the volatility of the demand for each flavor or size.
Risk pooling works especially well with product variants that are negatively correlated; if people buy more of a certain brand of breakfast cereal on a given day, chances are that sales of a competing brand will be lower. The total sales of breakfast cereals, however, will be less volatile from one day to the next than the sales of any given brand. This is also the reason why central holding of inventory in distribution centers reduces the overall levels of inventory required to serve retail stores subject to random demand variations. Even if purchases are independent, risk pooling reduces the amount of inventory required to provide a given service level. For independent and identical retail outlets, pooling reduces the required safety stock by a factor equal to the square root of the number of outlets served. (Of course, concentrating many assets in a single facility creates a different risk.)
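The square-root effect, and the added benefit of negative correlation, can be seen in a short sketch of how standard deviations combine. It assumes every pair of demand streams shares the same correlation coefficient, which is a simplification introduced here; the function name and the figures are hypothetical.

```python
from math import sqrt


def pooled_std(sigmas, correlation=0.0):
    """Standard deviation of total demand across several demand streams.

    Assumes the same pairwise correlation between every two streams.
    The variance of a sum is the sum of all pairwise covariances.
    """
    variance = 0.0
    for i, s_i in enumerate(sigmas):
        for j, s_j in enumerate(sigmas):
            variance += s_i * s_j * (1.0 if i == j else correlation)
    return sqrt(variance)


# Four identical outlets, each with a demand standard deviation of 10 units.
outlets = [10.0] * 4

separate = sum(outlets)                   # buffers sized outlet by outlet
independent = pooled_std(outlets, 0.0)    # pooled, independent demand
negative = pooled_std(outlets, -0.2)      # pooled, negatively correlated
lockstep = pooled_std(outlets, 1.0)       # pooled, perfectly correlated
```

Because safety stock is proportional to the standard deviation being buffered, the pooled figure for independent outlets is half the separate total when four outlets are combined. Negative correlation lowers it further, and perfectly correlated demand removes the benefit altogether.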
Risk pooling is primarily a strategy to reduce operational demand volatility. But it is also an element of flexibility that can be helpful during times of disruption. For example, if a major natural disaster creates a surge in demand for bottled water, or if an equipment failure at a flavoring supplier prevents making some products, Dr. Pepper Snapple can shift production quickly at modest cost.
Postponement is a manufacturing strategy and a supply chain architecture in which a particular intermediate product can be quickly customized into any one of many different finished goods. It involves a design of the product and the production process so that the point of differentiation is delayed as much as possible. Rather than hold extensive inventories of each finished product variant, the company holds inventory in an unfinished intermediate state and performs the last step (customization) when it gets a firm order or when demand is more certain. The benefits of postponement arise from the mathematical fact that the required safety stock inventory of the unfinished product is relatively low as a result of risk pooling—the inventory of unfinished product benefits from the averaging of the demand for all the variants. The customized product can then be made to order or very close to the selling season when demand projections are relatively accurate.
Companies that have used postponement strategies include HP (printers customized for different countries),15 Reebok (sports fan apparel customized to various team stars),16 many automotive manufacturers (using the same “platform” for several models), Bic (retail packaging postponement),17 paint manufacturers (cans of paint customized to different colors), and many others.
Although postponement is primarily a way to handle demand-side risks, it also can help during a supply disruption. Having inventory of semifinished products means that the finished product can be allocated to the most important customers in case of a shortage.
Range forecasting is also a way to manage volatility in the supply chain, especially on the demand side. “Everyone is worried about supply risk, but what is worse is demand risk,” said Charlotte Diener, senior vice president for global supply chain operations at On Semiconductor. On Semiconductor faced a serious problem when a customer forecasted high demand that never appeared. “We were ramping up our capabilities and our production, and our customer’s forecast dropped like a rock,” she said. “In one month, it looked like we had less than a week of product supply, and at the end, we had more than 40 weeks of supply.” Now, the company uses range forecasting to manage the risks.18
Rather than estimate a single number for the expected demand or the number of items to be manufactured, a range forecast includes two or more estimates spanning the likely range of values for the demand, such as a high, medium, and low value. Understanding the range lets a company estimate how much flexibility it really needs and enables it to manage the volatility with a robust plan. More important, it “socializes” the company to expect changes and react to them—be they demand volatility or supply disruptions.
Companies can also use range forecasts to establish contingencies with suppliers. For example, HP uses range forecasts to establish a flexible portfolio of contracts with suppliers for each new product. HP will contract for parts based on the “low” estimate (which HP is sure it can sell) using guaranteed-purchase contracts to get a low price for the guaranteed volume. For the medium-level of estimated demand—which may be the expected demand—the company uses flexible terms in the supply contract. That is, HP tells the supplier that the supplier will get orders for some quantity between the low and high estimates, with no guarantee that anything will actually be purchased. Such contracts involve specifications of optional increased quantities with the associated payment terms and sometimes capacity-option purchases. For the highest estimate, HP does not contract with any supplier. Instead, if demand is very high, the company will use existing suppliers and the spot market. Even if the spot market price is quite high, the high demand for the product will ensure that the volume of sales provides sufficient margins.19
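The three-tier arrangement described above can be sketched as a simple allocation rule. The sketch assumes that the guaranteed volume is bought regardless of demand, that flexible contracts cover demand above the low estimate up to some agreed ceiling, and that anything beyond the ceiling comes from the spot market. The ceiling parameter, the function name, and the quantities are hypothetical and are not HP's actual contract terms.

```python
def split_purchases(realized_demand, low_estimate, flexible_ceiling):
    """Split purchases into guaranteed, flexible, and spot-market volumes.

    low_estimate:     volume committed under guaranteed-purchase contracts
    flexible_ceiling: highest volume covered by flexible contract terms
                      (an assumed parameter for this sketch)
    """
    guaranteed = low_estimate  # bought even if demand falls short
    above_low = max(realized_demand - low_estimate, 0)
    flexible = min(above_low, flexible_ceiling - low_estimate)
    spot = max(realized_demand - flexible_ceiling, 0)
    return {"guaranteed": guaranteed, "flexible": flexible, "spot": spot}


weak = split_purchases(80, low_estimate=100, flexible_ceiling=140)
expected = split_purchases(120, low_estimate=100, flexible_ceiling=140)
strong = split_purchases(150, low_estimate=100, flexible_ceiling=140)
```

In the weak case the company still takes the guaranteed 100 units. In the strong case it exhausts the flexible tier and buys the last 10 units on the spot market, where a high price is tolerable because sales volume is high.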
Similarly, Jabil Circuit, Inc., uses flexible supply contracts that might specify a +25 percent capacity boost with a one-week notice, and a +100 percent boost with four weeks’ notice. The most important lever for increasing supply chain flexibility is supply assurance, sought by 72 percent of global executives in a 2011 survey.20 The survey found that companies with flexible supply chains worked with key suppliers to develop a preferential delivery schedule in case of capacity constraints. They also collaborated to have processes with a maximum of both upward and downward flexibility so that suppliers become more comfortable with fluctuations, which may be due to forecast errors, demand surges, or supply disruptions.
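The tiered terms in the Jabil example above amount to a step function of notice time. The sketch below assumes that capacity stays at the lower tier between one and four weeks of notice. The text gives only the two tiers, so that interpolation rule and the function name are assumptions.

```python
def available_capacity(base_capacity, weeks_since_notice):
    """Supplier capacity available a given number of weeks after notice.

    Tiers follow the example in the text: +25 percent with one week of
    notice and +100 percent with four weeks. Holding at +25 percent
    between weeks one and four is an assumption of this sketch.
    """
    if weeks_since_notice >= 4:
        return base_capacity * 2.0
    if weeks_since_notice >= 1:
        return base_capacity * 1.25
    return float(base_capacity)


# Hypothetical base capacity of 1,000 units per week.
ramp = [available_capacity(1000, week) for week in range(6)]
```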
General Dwight Eisenhower, supreme commander of Allied Forces in World War II and the 34th president of the United States, said, “In preparing for battle I have always found that plans are useless, but planning is indispensable.”21 Business continuity planning (BCP) is a process for preparing disruption responses. “The collective Intel response underscores the importance of creating and sustaining a well-prepared response and recovery capability. Speed is vital,” said Jackie Sturm, Intel’s vice president and general manager of global sourcing and procurement about Intel’s response to the 2008 earthquake in Sichuan near Intel’s Chengdu facility.
Since 2008, Cisco has created 14 supply chain incident management playbooks. The playbooks—Cisco’s term for BCP—cover relatively high-likelihood disruptive events. The company’s attitude toward the playbooks is, “if you get caught twice, then shame on you.”22 The playbooks vary across locations, depending on the types of disruptions typically experienced at those locations. For example, Texas has tornados whereas Thailand has monsoons and floods. According to research by Cranfield University’s Uta Jüttner, the top contingencies covered by BCP include loss of IT (91 percent of organizations surveyed have plans covering that), followed by fire (68%), loss of a site (62%), employee health and safety (52%), loss of suppliers (43%), terrorist damage (37%), and pressure group protest (22%).23
Cisco creates a new playbook by pulling together relevant elements of existing playbooks and using supply chain risk management (SCRM) analytics and know-how. After each incident, Cisco reviews its response and the responses of suppliers to collect the lessons and improve the playbook (or create a new one) for the future. After the Thai floods of 2011, for example, Cisco looked at how suppliers handled the floods (e.g., how they moved delicate test equipment to the second floor or built barriers around key buildings) and incorporated those tactics into its flood playbook.
Cisco’s playbooks list the types of questions vital to answer in a crisis, such as how many suppliers are in the region, what parts or products they make, how they could be affected, whether there are backups for the suppliers, and how to assess the actual impact on the ground. The playbooks also contain templates, checklists, and other materials to assist in managing and mitigating a disruption.
Similarly, Medtronic uses action-oriented BCP based on checklists. Each checklist item includes a task, its status, the people responsible, the timing of the task, and optional supporting documents.24 Medtronic’s planning process stresses the information, people, and actions required for recovery and continuity. The information flow elements of its BCP ensure the right people learn of the event as soon as possible via a mass notification system and that people working on continuity have the information they need to do their jobs.25
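The checklist structure described above maps naturally onto a small record type. The following Python sketch is a hypothetical illustration: the field names follow the five elements listed in the text, but the class name, the status values, and the helper function are assumptions rather than Medtronic's actual system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChecklistItem:
    """One line of an action-oriented continuity checklist (illustrative)."""
    task: str
    status: str = "open"  # assumed values: "open", "in progress", "done"
    responsible: List[str] = field(default_factory=list)
    timing: str = ""  # free-text deadline, e.g. "within 4 hours" (assumed)
    supporting_docs: List[str] = field(default_factory=list)  # optional


def open_items(checklist):
    """Return the items that still need attention."""
    return [item for item in checklist if item.status != "done"]
```

Keeping status and ownership on each item lets a crisis manager filter the list for unfinished tasks at any point during the response.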
The philosophy behind Medtronic’s business continuity plans is to enable the company to operate at a predetermined, minimum capability/service level and meet demand during a disaster.26 Because Medtronic’s medical device products are crucial to the health of patients, continuity of supply is essential. Each of Medtronic’s plans addresses a worst-case scenario because a large disaster requires as much planning as possible in order to accelerate the response. Lesser disasters can always use a subset of the plan. As imperfect as plans might be, planning helps companies think through and predetermine how they might react to disruption, who should be involved in the response effort, and what assets should be prepared ahead of time.
Medtronic also expects suppliers to create and maintain Medtronic-specific BCPs and to be able to show their BCPs to Medtronic on request. The company expects each supplier’s approach to business continuity to include a plan of action, a checklist of activities, communication plans, escalation procedures, and the organization of teams, roles, and responsibilities. Medtronic expects suppliers’ plans to address the recovery time needed for a variety of business interruptions, contact information for key locations, and a supply chain assessment of risks to equipment, material, supplied components, and labor.27 Similarly, the 2013 Business Continuity Insights Survey by consulting firm PwC found that 64 percent of respondents are involving critical suppliers in their organizations’ business continuity management programs.28
In the same way, Cisco expects suppliers to use BCP and asks them about specific continuity assets, such as backup generators (and fuel), fire protection and sprinkler systems, IT recovery strategies, and overall site recovery plans. Cisco’s goal with these supplier requirements is to help a supplier site better understand how to recover the site quickly. Cisco also asks for the supplier’s expected time-to-recovery. If Cisco finds gaps in a supplier’s BCP, it works with the supplier via Cisco’s supplier commodity managers.
The primary value of a BCP is realized only when people get to exercise it. Like Cisco, Juniper Networks created a series of business continuity plans to cover disruptions linked to facilities, locations, suppliers, and geopolitical events. As Steve Darendinger, vice president of worldwide procurement at Juniper Networks, described it, each BCP is encoded in a PowerPoint presentation that can be pushed to employees’ computers wherever they are. The plans also fit on a USB thumb drive for physical distribution if computer networks are not available. Juniper created online video training courses for its BCPs and by 2012 had trained about 75 percent of its worldwide operations staff in the use of the BCPs. Other companies, such as Medtronic, use third-party incident management software to distribute BCPs to stakeholders when they need them.
At Medtronic and other companies, business continuity plans are part of a larger effort of business continuity management (BCM). BCM is the overarching process of planning, disseminating, executing, and refining BCPs. In turn, BCM is a subset of enterprise risk management (ERM). BCM tends to focus only on operational risks such as disrupted supply, production, distribution, and service. ERM considers operational risks plus many other risks such as financial risks, regulatory risks, competition, customer disruptions, talent risks, product quality, intellectual property risks, compliance risks, corporate social responsibility risks, and others.29
Whereas BCM arises out of an organization’s internal motivation to maintain continuity, ERM has some external drivers for its adoption and structure. The US Sarbanes–Oxley Act of 2002 requires companies listed on US stock exchanges to manage risks,30 especially those risks associated with financial reporting and compliance. In particular, section 404 of the law requires a top-down risk assessment of financial reporting controls, risks of material misstatements, entity-level controls, and transaction controls.31 Although the law is focused on financial and accounting risks, Sarbanes–Oxley encouraged the adoption of ERM standards such as COSO32 and ISO 31000,33 which were then applied to other enterprise risks.
In his book Only the Paranoid Survive, Intel cofounder Andrew Grove wrote, “You need to plan the way a fire department plans. It cannot anticipate where the next fire will be, so it has to shape an energetic and efficient team that is capable of responding to the unanticipated.”34 Walmart, GM, Intel, and Cisco all stress the importance of preparing who should be on the team that handles a disruption. That means preparing contact lists, predefining likely roles, and training and drilling so that people can quickly step into their emergency roles. “We focus on the people we need for disaster recovery and this will be based on their experience,” said Frank Schaapveld, senior director supply chain Europe, Middle East, and Asia for Medtronic.35
Part of business continuity planning involves preparing people before trouble hits. As John O’Connor, Cisco’s senior director of supply chain transformation, said about risks that have unknown durations and impacts, “You need to work with response playbooks that identify critical infrastructures and stakeholders.”36 For example, within two days of the Japan earthquake in 2011, Cisco’s team members used their playbooks to structure and staff their “war room.” The playbooks define the key contacts related to various types of incidents and the structure of each function.
Because large-scale disruptions are rare, most disruption team members have other day jobs until something happens. Even in the largest companies, only a few staffers work full-time on disruption-related activities such as monitoring, risk analysis, planning, and the preparation and execution of drills and training exercises. For example, Cisco’s incident management team comprises nine full-time dedicated personnel out of a staff of almost 70,000.
If an event reaches what Cisco calls the “activate” threshold, the company pulls people from among a pre-identified list of 100 to 150 “volunteer firefighters” who are on call but have other jobs. The company builds a response team and orchestrates cross-functional mitigation of the events.37 This pre-identified list is similar to GM’s list of people tapped for the 2011 Japan crisis and later ones—the company learns from each disruption who has the breadth of skills and mental toughness for crisis response.
Similarly, Walmart has a small permanent staff plus pre-identified members of a set of emergency support functions that include emergency management, transportation and logistics, merchandising and replenishment, facilities, power, security, corporate affairs, and corporate giving. At Medtronic, the key people in the extended business continuity team align with three types of activities that tend to occur in a crisis: first response, salvage and recovery, and resumption of operations.38 The primary response personnel include facility-level emergency responders and corporate-level HR and communication people. The salvage and recovery team includes disaster-specific personnel in IT recovery, materials salvage, freight, distribution, and customer service. Medtronic’s resumption personnel are mainly affiliated with manufacturing operations and distribution center operations.
Following the 2008 Sichuan earthquake, Cisco realized it needed to include key external players like contract manufacturers and logistics partners in its incident management teams.39 Similarly, Mark Cooper, Walmart’s senior director of global emergency management, mentioned that the company’s emergency operations center has seats for key outsiders, such as the Red Cross and Salvation Army, who help Walmart understand the community’s needs.
Cisco and other companies also compile suppliers’ contact information, ensuring they know whom to call if something goes wrong. Yet these contact lists have one minor and one major problem. The minor problem is that turnover and promotions of workers at suppliers mean that names on contact lists go stale over time and a company must periodically refresh the list (typically, once every six months). The larger problem is that these contact lists are shallow—they include mostly Tier 1 suppliers but rarely go deeper. Yet a substantial portion of supply chain disruptions originate deeper than Tier 1.40 For example, data from the nonprofit supply chain auditing group, Sedex, shows that noncompliance incidents per audit are 18 percent higher at Tier 2 suppliers and 27 percent higher at Tier 3 suppliers compared to Tier 1 suppliers.41
In the heart of Walmart’s Bentonville, Arkansas, headquarters sits a mostly empty room. At first glance, it could be a break room with a dozen tables and about 50 chairs. Tabletop placards for different Walmart groups seem to hint at assigned seating for an upcoming luncheon. Yet the people and the luncheon never seem to come. Further scrutiny suggests a more utilitarian purpose, because every spot at the table has a tangle of power cords and computer connections at the ready. On the walls hang big-screen monitors and maps of Walmart’s operations. The room is Walmart’s Emergency Operations Center (EOC). In the corner, a few staffers monitor global news feeds, weather maps, earthquake reports, and the like. The EOC springs into action when a significant disruption requires a coordinated response. Thankfully quiet most of the time, the room is there when it is needed.
Near the EOC site is Walmart’s equivalent of a certified 911 emergency dispatch center. If there’s a problem in a Walmart store or facility, anytime and anywhere across the nation, store employees can call this 24 × 7 center. The center can also detect store situations through in-store sensors and remote cameras. Whether as a result of a call from store personnel or through automatic detection, the center can dispatch security, firefighters, medical help, or whatever else is needed. The call center handles 400,000 calls a year—similar to the call volume of the Los Angeles Fire Department. The vast majority of the calls are minor—a leaky roof or a false alarm on a smoke detector. Some of the calls are more serious, but localized—a fistfight in the parking lot or somebody slipped and fell. A few make bigger headlines, such as the attempted knife-point abduction of a toddler by a deranged person in June 2013.42 And some are larger-scale problems that require a more coordinated response, such as a regional power outage or a hurricane.
Whereas Walmart has one EOC for the entire company at its headquarters, Intel has an EOC at each of its large multibillion-dollar facilities located around the world. Each local Intel EOC has a satellite phone (in case all other communications links break down—see chapter 4) and a set of key response personnel (e.g., security, HR, and environmental health and safety). EOCs in earthquake regions have out-of-building capabilities (tents, portable generators, etc.). In addition to these local physical EOCs, Intel has a Corporate EOC (CEOC) that convenes virtually because Intel’s executives are not all based in one location. The CEOC includes experts in engineering, procurement, manufacturing, logistics, and even public relations, and it plays a major role in larger disruptions when Intel must shift resources and activities between sites.
“Katrina was a wake-up call to governments [to realize] that they couldn’t handle the response themselves,” said Lynne Kidder, senior vice president for public-private partnerships at US Business Executives for National Security.43 In the aftermath of Hurricane Katrina and Walmart’s efficient restocking of necessary supplies and rapid reopening of stores in Louisiana, US emergency management policy makers realized that the private sector could—and should—play a key role in humanitarian disaster response.44 Given the private sector’s dominant role in supplying food and other necessities during normal conditions, leveraging private sector supply chains during a catastrophe made sense. The use of private sector inventories and assets by the government, however, required some form of public-private partnership for joint coordination of disaster response efforts. “Every disaster is different and a dynamic situation. We need that ongoing dialogue,” said Tina Curry, assistant secretary of California’s emergency management agency’s Planning, Protection and Preparedness Division.45 Thus, the BEOC (Business EOC) was born.
A BEOC is the physical or virtual place associated with a local, regional, or federal government EOC that coordinates incident response with companies operating in the communities in the EOC’s jurisdiction (such as utilities, retailers, banks, and major employers). For example, the Louisiana BEOC seats 44 people, 40 of whom represent the business community, including trade associations, chambers of commerce, economic development councils, and critical infrastructure operators. “The idea is that every business in Louisiana is represented by someone in that room,” said Joseph Booth, executive director of the Stephenson Disaster Management Institute at Louisiana State University.46 Many states and larger metropolitan areas have BEOCs or a BEOC-like element in their disaster planning. The United States even has a National BEOC at FEMA’s (the Federal Emergency Management Agency) headquarters in Washington, DC.47
“We began to see that if the businesses don’t survive, the community wouldn’t bounce back,” said David Miller, administrator of the Iowa Homeland Security and Emergency Management Division. “And the businesses saw that if the community suffered, employees couldn’t come back to work. So it was in the interest of both to get the whole community back up and running.”48
The main role of any BEOC is to coordinate and manage a two-way flow of information. The BEOC improves the situational awareness for businesses in terms of the status of infrastructure and community needs. “We want to prevent the government from deploying food and water to areas where there’s already service in place,” said Andres Calderon, associate director of the Stephenson Disaster Management Institute. The BEOC also improves situational awareness for government in terms of private-sector resources such as which stores are open or closed so government aid can focus its efforts on under-served areas. The combination also improves the effectiveness of the response. For example, during Hurricane Gustav in 2008, public-private coordination helped prioritize the reopening of roads and restoration of power in Louisiana so that local restaurants could feed people in shelters. The restaurant-provided meals made good use of the local food supply and distribution resources, avoiding the need for special resources.49
Many companies have structured processes for responding to disasters, often called incident management or crisis management. Cisco, for example, uses a six-step incident management lifecycle: monitor, assess, activate, manage, resolve, and recover. The first key step in incident management is the decision whether the event is: irrelevant to the company; a minor, local event best handled by site personnel; or a major event requiring a larger regional or corporate-wide response. Incident management encompasses the decision to activate the EOC, call up the incident management team, and mobilize other corporate resources.
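The lifecycle and the initial triage decision can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the six phase names come from the text, but the function names, the two yes/no inputs, and the returned labels are hypothetical and are not Cisco's actual tooling.

```python
from enum import Enum


class Phase(Enum):
    """The six phases named in the text, in order."""
    MONITOR = 1
    ASSESS = 2
    ACTIVATE = 3
    MANAGE = 4
    RESOLVE = 5
    RECOVER = 6


def next_phase(phase):
    """Return the phase that follows, or None after RECOVER."""
    phases = list(Phase)  # Enum members iterate in definition order
    idx = phases.index(phase)
    return phases[idx + 1] if idx + 1 < len(phases) else None


def triage(affects_company, needs_wider_response):
    """Three-way first decision (labels are illustrative assumptions)."""
    if not affects_company:
        return "irrelevant"
    if needs_wider_response:
        return "regional or corporate response"
    return "site response"
```

Only the third triage outcome would lead to activating the EOC and calling up the incident management team.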
Cisco, Walmart, and Intel all use a small number of predefined levels of alerts of increasing escalation. Cisco has four levels of alert status designated L0 through L3.50 The lowest level of alert, L0, is an “FYI, we’re tracking this” message. L1 is for incidents in which an impact is expected but might be minor. L2 is for large incidents that have an estimated impact on the order of $100 million. L3 is for extreme incidents with an estimated impact on the order of $1 billion and greater.51
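As a rough illustration, the four levels can be written as a classification rule. The text gives only orders of magnitude, so the exact cut-offs below ($100 million and $1 billion treated as lower bounds) are assumptions made for this sketch, as are the function and parameter names.

```python
def alert_level(impact_expected, estimated_impact_usd=0.0):
    """Map an incident to an L0-L3 style alert level (illustrative only).

    The dollar cut-offs are assumed lower bounds; the source describes
    L2 and L3 only as "on the order of" $100 million and $1 billion.
    """
    if not impact_expected:
        return "L0"  # "FYI, we're tracking this"
    if estimated_impact_usd >= 1_000_000_000:
        return "L3"  # extreme incident
    if estimated_impact_usd >= 100_000_000:
        return "L2"  # large incident
    return "L1"      # impact expected but possibly minor
```

A real escalation policy would likely weigh factors beyond a single dollar estimate, such as customer exposure and expected duration.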
For the most part, these alerts are internal, although the company does communicate with key customers if the customer is concerned about a developing situation (as it did in anticipation of Hurricane Ike in 2008).52 During the assessment phase, the company will contact suppliers and logistics operators in the affected region to assess the status of inbound, conversion, outbound, and shipping operations. As the alert levels escalate, companies activate their crisis teams, use their EOCs, and engage senior management in the response.
For publicly held US companies, large-magnitude events may also trigger regulatory reporting of “material events” using the Securities and Exchange Commission’s (SEC) Form 8-K.53 For example, in 2011, Seagate Technology filed an 8-K report on the impacts of the Thailand floods.54 Although the SEC lists many specific events (e.g., changes in upper management) that require filing an 8-K report, the guidance for reporting events like supply chain disruptions is vague and contradictory. The SEC leaves assessment of materiality in the hands of the company55 but discourages “over disclosure” caused by a company’s fear of shareholders’ suits56 because investors might get lost in the disclosures.57 Some companies, such as Medtronic, have defined a series of materiality criteria with thresholds for key impact metrics such as lost revenue, percentage of customers affected, duration of new product launch delays, and extent of production outages as part of their risk-management efforts.58
Cisco allocates two members of its supply chain risk management (SCRM) team to monitor and assess the potential impacts and risks arising from global events. Cisco pulls in the supplementary staff during the “activate” phase, when the company builds a response team and orchestrates cross-functional mitigation of the events. For example, the trigger for activating the company’s SCIM (supply chain incident management) team is when an event might affect shipments and revenues in the next 24 hours. The team then carries out recovery activities to return supply chain operations to normal. Incident-related resource levels drop as the team manages the incident and resolves any disruptions.59
At Walmart, alert level 1 is a base level of activity—the steady drumbeat of minor problems that occur in any large company with thousands of facilities and hundreds of millions of customer visits each month. Alert level 2 is for somewhat larger problems, such as the threat of a tropical storm or minor hurricane. For level 2 events, Walmart puts the emergency support function team members on call and uses virtual meetings to coordinate action. Only if the incident reaches alert level 3 do people come to their preassigned stations at the EOC, staffing it 12 hours per day and holding multi-participant teleconferences at key times. Level 4 is reserved for the worst disasters, like Katrina—those disruptions that damage large areas, close many stores, and require more intensive recovery and humanitarian efforts. These disruptions involve not only a fully staffed EOC 24/7 but also key senior executives.
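The escalation ladder described above amounts to a lookup from alert level to response posture. The table below is a hypothetical Python encoding of the paragraph; the dictionary layout and wording are ours, and the zero staffing hours for levels 1 and 2 are an inference from the statement that people come to the EOC only at level 3.

```python
# Illustrative encoding of the four alert levels described in the text.
# "eoc_hours" is the number of hours per day the EOC is staffed by the
# response team; zero for levels 1 and 2 is an inference, not a quote.
RESPONSE_POSTURE = {
    1: {"eoc_hours": 0, "posture": "routine handling of minor problems"},
    2: {"eoc_hours": 0, "posture": "support team on call, virtual meetings"},
    3: {"eoc_hours": 12, "posture": "team at EOC stations, teleconferences"},
    4: {"eoc_hours": 24, "posture": "EOC fully staffed, senior executives engaged"},
}


def eoc_staffed_hours(level):
    """Return daily EOC staffing hours for an alert level (1 to 4)."""
    if level not in RESPONSE_POSTURE:
        raise ValueError("alert level must be 1, 2, 3, or 4")
    return RESPONSE_POSTURE[level]["eoc_hours"]
```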
Like Walmart, Intel also defines multiple alert levels, with level 1 being a minor local blip; level 2 being a modest interruption (e.g., a minor fire at a facility) that’s handled by the local facility EOC; and level 3 being a major event requiring CEOC attention. EOCs are a key part of a broader system, said Intel’s Jim Holko, program manager, corporate emergency management, “because we have the CEOC structure and because we have the emergency notification system and everybody is trained to do this, we typically get back on our feet really fast.”60
“Information and visibility are the backbones of incident response, and these tools have to be in place prior to the crisis,” said John O’Connor, Cisco’s senior director of supply chain transformation.61 Information and communications technologies allow for quick coordination of activities. For example, when the ash cloud from the Eyjafjallajökull volcano forced the closure of large swaths of European airspace, EU transportation ministers used video teleconferencing to discuss the issue.62 The rise of mobile broadband over cellular or Wi-Fi networks changes both the level of transparency and the degree of coordination possible during a disruption. Risk managers can now use mobile apps to plan for and mitigate disruptions.63 And BCP can be deployed to mobile devices using apps such as PLANet and Quantivate.64 Organizations can even run a virtual EOC via tools such as WebEOC.65
A related technology is cloud computing, in which third-party IT systems host key data or services on off-site computer networks that are independent of the company’s potentially disrupted facilities. Cloud-based shared document environments such as Dropbox, Google Docs, and TeamViewer enable connections to remote facilities, to suppliers, and to employees working from home during a disruption.66 Anyone can access these cloud resources from anywhere in the world. For example, Medtronic uses externally hosted business continuity and incident management systems.67 “If the location is impacted, then we have a plan which allows people to work from home. We have already changed our policy so that everyone has laptops instead of desktops and we have a system whereby people know they should stay at home, log in and wait for instructions,” said Medtronic’s Schaapveld.68
Although these technological options benefit from the natural flexibility and robustness of Internet-based networks, such systems still depend on functioning telecommunication links and electrical power. Whether the event was the Japanese earthquake in 2011 or Hurricane Sandy in 2012, many companies cited problems with communications resulting from power blackouts, damaged lines, or overloaded circuits. Walmart has predetermined fallback plans for replenishing stores that cannot communicate their needs to headquarters. Nonetheless, continuous innovations and high-volume manufacturing are bringing down the costs of information and communications technologies, and creating new resilient solutions—for example, a solar-powered Wi-Fi repeater costs only a couple hundred dollars.69
On Thursday, October 15, 2009, officials in China noticed an unusual increase in acute respiratory infections in Guangdong Province, Shanghai, and Beijing. By Wednesday, October 21, the Chinese health minister had declared an influenza epidemic in China. Two days later, when the health minister temporarily closed schools and universities in Guangdong Province, Cisco’s SCRM team assembled an SCIM (supply chain incident management) team and put it in standby mode, monitoring the situation and assessing risk to Cisco.70
As the flu worsened, the SCIM escalated from “standby” to “activated” on Sunday, October 25, and the SCRM team conducted a detailed risk assessment of the potential impact of the epidemic on Cisco and prepared mitigation options for discussion on Monday. That Monday, Cisco asked materials managers at select facilities to analyze whether their parts inventory levels were high enough to build up extra finished goods inventory. Cisco then asked certain facilities to ramp up production to maximum overtime levels to create a buffer in case of a plant closure. SCIM members also began contacting suppliers to determine the status of their operations and whether they had been affected by the swine flu outbreak.71
On Tuesday, October 27, the World Health Organization (WHO) raised alert levels to phase 4 on its six-phase pandemic alert system. The next day, the WHO upped the alert level to phase 5 as the flu spread. On Thursday, October 29, China’s president recommended that all businesses close for five days and told the public to stay home, in an attempt to stem the spread of the virus. Cisco decided to close its Guangzhou site until November 2; fortunately its risk assessment showed that it had enough inventories at that point in time to suspend production for that period without any revenue impact.
The SCRM team’s further investigations revealed that although there were serious flu outbreaks in Guangzhou, other cities had no outbreaks. As a result, the SCRM team shifted to monitoring the city-by-city situation rather than relying on the blanket WHO alert levels. Eventually, the authorities realized that this flu was not as deadly as they feared, and businesses, universities, and schools began to reopen. Cisco discontinued its inventory build-ahead efforts, and the SCIM team alert level dropped down to “monitor-only” activities. Although the flu continued to spread around the world, the sense of crisis abated, and Cisco did a postmortem of the event.
In reality, no flu outbreak occurred. The epidemic was all part of Cisco’s 2009 annual BCP drill.72 Cisco holds annual BCP drills to ensure that its plans are actionable and incorporate the lessons from past disruptions.73 For example, although this drill simulated a pandemic in China, the SCRM team realized that it needed to examine its business in Mexico more closely, given Mexico’s rising importance to the company. Cisco works with several dozen supplier and partner sites in Mexico and some of them are in regions prone to hurricanes. The drill prompted SCRM members to discuss proactive steps they could take to ensure continuity of supply if a hurricane hit that region.
Drills such as the ones Cisco undertakes also serve to train people in crisis management: who does what, what to expect, and how to respond. Drills test the efficacy and completeness of a BCP or a playbook as well as the readiness of disaster resources. Companies sometimes uncover significant gaps in their preparedness. When one company simulated an earthquake at headquarters, most of the event went as planned except for one crucial detail. Participants discovered that a key computer server that was essential to 100 percent of the sales of the company had no failsafe outside of the vulnerable headquarters building. Thus, the drill enabled the company to find a gap in its risk mitigation efforts and correct it before a real emergency occurred.
After Hurricane Katrina, Walmart’s CEO H. Lee Scott Jr. spoke about the value of all these preparations and planning—the development of options for managing disruptions. “We have an infrastructure that allows us to react,” he said.74 As with the example of P&G’s Folgers coffee plant (see chapter 4), Walmart began preparations many days before Katrina’s landfall. The company loaded 45 trucks full of critical supplies at its distribution center in Brookhaven, Mississippi, and waited for the winds to subside.75 It managed to reopen 66 percent of its stores in the affected area within 48 hours. Within one week, 93 percent of stores were reopened.76 Walmart earned high praise for its timely and effective response. “If the American government would have responded like Walmart has responded, we wouldn’t be in this crisis,” said Aaron F. Broussard, president of Jefferson Parish in the New Orleans suburbs, during a tearful “Meet the Press” interview.77