Ah, the good old days. Planning for disaster recovery, if it occurred at all, was one of the easier things an IT manager had to do.
You’d back up your mainframe to tape every night or over the weekend. If you were really conscientious, you’d send the tapes off-site and arrange for contingency processing at some other data center. Testing your recovery plan? You’d retrieve the tapes and see if you could read them.
Of course, things have gotten steadily more complicated over the years, with distributed and networked computers, n-tier computing, heterogeneous hardware and operating systems, virtualization, automated data feeds from external parties and more.
Adding to the confusion has been a steady change in the meaning of “disaster.” Ten years ago, a four-hour outage might not have even been noticed by users or customers; today, it could cost you your job.
As a result, it has become vastly more difficult to prepare and test disaster recovery plans, and increasingly unlikely that you will go to bed at night feeling 100% sure that all your IT assets are protected.
Companies are dealing with these challenges in various ways. Some are reaching out to external parties for help with disaster recovery planning and hot sites, to which computer processing can be moved quickly in an emergency. Others have pulled back from these arrangements, saying they can better handle the complexity of disaster recovery in-house. Still others are essentially redefining disaster recovery by substituting notions of “disaster avoidance.”
Jerry Grochow, CIO at MIT, illustrates the problem this way: “I once counted a dozen different boxes that had to be up for [an application] to work from end to end, and that’s not unusual. So you ask your SAP application programmer, ‘What’s necessary to recover your system?’ and you don’t necessarily get the full picture, because the programmer doesn’t realize that the authentication server needs to be running so someone can even log on, and it’s running in a different data center.”
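The gap Grochow describes, in which the programmer names only the boxes he knows about, is a dependency-tracing problem. Below is a minimal sketch in Python of how it might be expressed. The service names and the dependency map are hypothetical illustrations, not MIT's actual systems. The function walks the map to list everything that must be running before one application can be recovered.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration):
# each service lists the services it needs directly.
DEPENDS_ON = {
    "sap_app": ["app_db", "auth_server"],
    "app_db": ["storage_array"],
    "auth_server": ["directory_service"],
    "storage_array": [],
    "directory_service": [],
}


def recovery_set(service, depends_on):
    """Return every service that must be up for `service` to work, itself included."""
    seen = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        # Follow direct dependencies; indirect ones are reached on later passes.
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


full_picture = recovery_set("sap_app", DEPENDS_ON)
print(sorted(full_picture))
```

In this made-up map, the application owner would likely name the database but not the directory service behind the authentication server. The traversal surfaces both.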
Not only are an organization’s IT assets no longer all located in a cozy glass room with a raised floor, they may not even be under the control of the IT department. Grochow recalls an earlier job at a brokerage firm that got automated data feeds from 40 external suppliers, noting that some financial institutions have 100 such connections. “How to recover a major data processing application when you have that many feeds is extremely complicated,” he says.
Schneider National Inc. in Green Bay, Wis., at one time contracted with a service provider for a disaster recovery hot site but recently decided to set up its own second data center to serve as a recovery facility. “Ours is a very complex and highly integrated technology environment,” says Paul Mueller, vice president of technology services at the trucking company, which has 36 locations in North America. “As complexity has increased, so has the difficulty associated with hot-site recovery.”
It proved difficult to accurately replicate Schneider’s operating environment at the external facility, Mueller says, and so his semiannual disaster recovery tests were never completely satisfactory. “Invariably, we encountered issues when we executed those tests, such as tapes not being correct,” he says. “Our ability to restore was problematic based on the hardware configurations, operating system configurations and so on.”
Mueller says he is much more comfortable with his new arrangement, but it came at a stiff price. Schneider’s two data centers are connected with redundant fiber-optic cables, redundant telephone systems and dual mainframes backing each other up. “We have invested heavily based on the risk to the enterprise and to the supply chains that we help our customers manage,” he says. “But we felt this investment was absolutely the right way to go.”
And the investment was not just in facilities. With the help of a consultant, Mueller’s staff interviewed 70 business managers and a few key customers. The interviews gleaned estimates of the losses that would result from various types and durations of outages, as well as managers’ recovery-time goals.
“When you have that information consolidated into an assessment document and you get to see the aggregate impact to the business of losing your data center, it becomes a very compelling story,” Mueller says.
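The arithmetic behind such an assessment is straightforward, as the rough Python sketch below suggests. Every figure and business-unit name in it is invented for illustration; none comes from Schneider. It also assumes losses grow linearly with outage length, which real estimates may not.

```python
# Hypothetical interview results (all values are assumptions):
# (business unit, estimated loss per outage hour in dollars, recovery-time goal in hours)
ESTIMATES = [
    ("dispatch", 5000, 2),
    ("billing", 1000, 24),
    ("customer_portal", 2000, 4),
]


def aggregate_loss(estimates, outage_hours):
    """Sum the estimated losses across all units for an outage of the given length."""
    return sum(loss_per_hour * outage_hours for _, loss_per_hour, _ in estimates)


def missed_goals(estimates, outage_hours):
    """List the units whose recovery-time goal the outage would exceed."""
    return [unit for unit, _, goal in estimates if outage_hours > goal]


print(aggregate_loss(ESTIMATES, 8))
print(missed_goals(ESTIMATES, 8))
```

Seen one unit at a time, the numbers may look tolerable. The total is what makes the case for investment.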
Bob Dowd, CIO at Sonora Quest Laboratories LLC in Tempe, Ariz., says his company can’t afford a fully redundant hot site for disaster recovery, but he has taken other steps aimed at avoiding a disaster. Sonora Quest runs medical tests for 20,000 patients every night and gets the results to doctors by early the next morning, so it’s not hard to imagine the effect that a prolonged outage in its highly automated processes would have on the business. “We have hardened the computer room and built in all kinds of redundancy, so if one node fails, we have immediate fail-over to another node,” Dowd says. The Tempe data center has redundant disks, two network cores and no single points of failure. Plus, it does two backups a day, one to a server and another to tapes that are taken off-site.
Still, Dowd worries about the data center, which sits near the end of a runway at the Phoenix airport. He’d like the safety of a remote backup facility, and he has an idea for getting one on the cheap.
Part of the Tempe data center is devoted to serving as a test environment for the labs’ systems – effectively a scaled-down duplicate of the production environment. If that were moved to Sonora Quest’s lab in Tucson, Ariz., it could be used as a backup for Tempe, Dowd reasons. “We’d be using it to save the business, not necessarily doing upgrades,” he explains.
Rod Flory, CIO at Lennox International Inc. in Richardson, Texas, says the heating and cooling system company has been rolling out server virtualization software to increase the efficiency and flexibility of its servers. But that has complicated disaster recovery planning, he says.
“With VMware, we are changing our server platforms more frequently – not adding servers, but changing memory, the number of CPUs in them and so on,” Flory says. “So quarter to quarter, our environment looks different, and keeping up with that on the hot site is a challenge.”
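One way to picture the drift Flory describes is as a comparison of two inventories. The Python sketch below is illustrative only: the server names and specifications are hypothetical, and it does not use any VMware interface. It simply flags servers whose memory or CPU count differs between production and the hot site, or that are absent from the hot site altogether.

```python
# Hypothetical inventories (assumptions): server name -> (memory in GB, CPU count).
PRODUCTION = {"erp01": (64, 8), "web01": (16, 4), "db01": (128, 16)}
HOT_SITE = {"erp01": (32, 8), "web01": (16, 4)}


def find_drift(production, hot_site):
    """Return a note for each production server the hot site no longer matches."""
    drift = {}
    for name, spec in production.items():
        other = hot_site.get(name)
        if other is None:
            drift[name] = "missing at hot site"
        elif other != spec:
            drift[name] = f"production {spec}, hot site {other}"
    return drift


for server, note in find_drift(PRODUCTION, HOT_SITE).items():
    print(server, "->", note)
```

Run each quarter against current inventories, a check like this would show how far the recovery site has fallen behind.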
Flory says he tests his disaster recovery plan “religiously” once a year, and it’s not a trivial effort. “It’s a project,” he says. “I take five people and set them aside for a few weeks.”
The tests run smoothly enough, Flory says, but he’s considering involving a disaster recovery firm in a future test. “You look at situations like the bird flu. You are counting on five or six people who know how to execute the plan, but what if they are not available?” he says. “Can your plan be scripted well enough that you could hire a consulting group, give them the book and say, ‘Here, execute the plan’?”
And there’s another improvement Flory wants to make. Traditionally, Lennox’s systems have been centralized at company headquarters, but more recently, functions such as e-mail and computer-aided design have been pushed out to servers at manufacturing sites where there are no disaster recovery capabilities.
But including those remote sites in the centralized plan is not simple, because the sites don’t run standard systems. “We are dealing with a legacy of autonomous decision-making,” Flory says. “We may have Dell servers at one facility and IBM at the next. So you look at 15 to 20 major facilities, and you realize you don’t have a common architecture.”
He says Lennox will try to move the remote sites to a more common architecture – so the central data center can serve as a hot site for them – but that could take years.
Meanwhile, MIT is supplementing its two on-campus data centers with two additional leased facilities – one a few miles away and the other “many, many” miles away, says Grochow. But these will not be traditional disaster recovery sites. All four will be in use all the time, with each critical application running at two or more of them. The four centers in total will not have a great deal of excess capacity or redundant equipment, so they will not be prohibitively expensive, Grochow says.
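The rule at the heart of this arrangement, that every critical application runs at no fewer than two sites, can be checked mechanically. The Python sketch below shows one way that might look. The application names, site names and placements are hypothetical and do not describe MIT's real deployment.

```python
# Hypothetical placement (an assumption): application -> set of sites where it runs.
PLACEMENT = {
    "email": {"campus_a", "leased_near"},
    "student_records": {"campus_a", "campus_b", "leased_far"},
    "payroll": {"campus_b"},
}


def under_replicated(placement, min_sites=2):
    """Return the applications running at fewer than `min_sites` sites."""
    return sorted(app for app, sites in placement.items() if len(sites) < min_sites)


def survives_site_loss(placement, lost_site):
    """Return the applications still running somewhere if `lost_site` goes down."""
    return sorted(app for app, sites in placement.items() if sites - {lost_site})


print(under_replicated(PLACEMENT))
print(survives_site_loss(PLACEMENT, "campus_b"))
```

In this invented example, payroll runs at a single site, so losing that site takes it down while the other applications carry on.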
With this setup, the difficulty of testing a disaster recovery plan almost disappears. Because every site is running all the time, and because each critical application is running in more than one place, the plan is essentially tested every day, Grochow says. “The idea is to always be in a ‘fail-soft’ mode. If you have an architecture that allows certain things to be down, you are never completely out of business,” he explains. “But if your architecture has lots of single points of failure, you have to have a very detailed recovery plan.
“The concept of disaster recovery as we knew it is changing,” Grochow says. “I think we have gotten past the point where you can rely on a third party to provide hot-site recovery, because it has gotten too complicated.”