Backups - A True Story
So, this past Friday, one of our engineers wanted to load some new code onto our in-house server. This server handles our email system, customer and tech support systems, source control system, and backups.
It is a high-powered Dell server with a RAID array.
RAID is a cool feature that will, in our setup, write the same data to two hard drives simultaneously. This means that if one drive fails, the other will have the up-to-date information on it and the computer will continue to work as though nothing happened. It'll send our network guy an email saying one of the drives failed so he can replace it.
RAID is great for protecting your most important data against a drive failure.
Obviously the data we have on this server is important, which is why we chose to have a RAID array in it.
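If you like to see ideas in code, here is a toy Python model of what mirroring does. It is only a sketch: the class name MirroredPair and its methods are made up for illustration, and a real RAID controller works at a much lower level than this.

```python
class MirroredPair:
    """Toy model of a two-drive mirror. Hypothetical, for illustration only."""

    def __init__(self):
        # Each "drive" is just a dict mapping block number -> data.
        self.drives = [{}, {}]
        self.healthy = [True, True]

    def write(self, block, data):
        # Every write goes to all drives that are still healthy.
        for i, drive in enumerate(self.drives):
            if self.healthy[i]:
                drive[block] = data

    def fail(self, index):
        # Simulate a drive dying; it is no longer read from or written to.
        self.healthy[index] = False

    def read(self, block):
        # Read from the first drive that is still healthy.
        for i, drive in enumerate(self.drives):
            if self.healthy[i]:
                return drive[block]
        raise IOError("no healthy drive left")

    def is_degraded(self):
        # True once we are running on one drive: the moment someone should get an email.
        return not all(self.healthy)


pair = MirroredPair()
pair.write(0, "customer email")
pair.fail(0)
print(pair.read(0))        # still readable from the surviving drive
print(pair.is_degraded())  # the array is now running without a spare copy
```

Notice what the toy does not protect against: whatever gets written, good or bad, lands on every healthy drive.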
So, Friday, three unrelated things conspired to give our network administrator a long, sleepless weekend.
All of a sudden, when the engineer was about to install his "stuff", the server crashed. It wouldn't start up. (It wasn't his fault, by the way.)
Fortunately we have backup systems in place. All of our incoming email is automatically routed to another server in another location (another city, actually), so that didn't stop. But our local services, the tech support email system and source control, did get interrupted.
So, the network guy starts looking into the problem. He finds that one of the drives has failed. No problem, just put a new one in, start the computer, and you're back up and running. But why didn't he get an email?
Hmmmm... He then found that Windows was corrupted; the machine should have booted off the good drive but didn't. It seems that one of those dreaded Automatic Windows Updates was in progress when the crash occurred and it corrupted Windows. He tried a Windows restore, but of course, Windows couldn't be restored. It apparently was at a spot in the Automatic Update where it couldn't recover from an error and left our machine damaged and helpless, poor thing.
So, now we have a bad drive in the RAID array and a corrupted Windows on the other drive that won't boot.
What about the safety of the RAID array? Why can't we just put in the new drive and off we go? Why can't we do a system restore on Windows? Why can't we, why can't we, why can't we?
The whole point of RAID is that one drive can fail and the other(s) take over.
Well, it seems that we had a drive fail first, then Windows started installing whatever security updates it needed to pretend it was a secure operating system, it crashed, wiped out half of Windows, and couldn't find its way home again. One bad drive, one corrupted drive, one bad day.
So, these things got together to ensure that backups, redundancy, and money won't buy you a good night's sleep.
The probability of all this happening is quite small, but it did happen.
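To put a rough shape on "quite small", here is a back-of-the-envelope sketch in Python. Every number in it is invented for illustration; none of it was measured on our server, and we don't know how long ours actually ran on one drive. It assumes each day carries the same independent chance of something (say, an update) going badly wrong, and asks how likely that is to land while the array is running without its spare copy.

```python
def chance_of_overlap(p_bad_event_per_day, days_degraded):
    """Chance that at least one bad event lands while the array is degraded.

    Assumes each day is independent with the same probability (a simplification).
    """
    return 1 - (1 - p_bad_event_per_day) ** days_degraded


# Hypothetical figure: a 1-in-1000 chance per day of an update going badly wrong.
p = 0.001

quick_fix = chance_of_overlap(p, 1)    # the alert arrives, drive swapped within a day
unnoticed = chance_of_overlap(p, 90)   # no alert, failure sits unnoticed for 90 days

print(f"drive replaced within a day: {quick_fix:.4f}")
print(f"failure unnoticed for 90 days: {unnoticed:.4f}")
```

Under these made-up numbers the risk grows with every day the dead drive goes unnoticed, which is why the email that never arrived matters so much.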
The network guy got it all back up and running after buying two new drives, backing up about 100 GB of data (remember, this is the backup server that crashed :), reinstalling Windows, syncing the new drive with the old, and restoring a bunch of files. He fired it up. He then got a big hug and kiss, which was not welcomed, of course.
But from a business point of view, how safe are we now? We still have the same "stuff": a powerful Dell server and a "foolproof", "fault-proof" RAID array. Industry norms say this is a secure, fault-proof system. They say this couldn't happen. But it did... unrelated errors combined to defeat the un-defeatable.
What's next? How do we ensure this doesn't happen again? First thing that comes to mind is another computer to back up this one. Kind of a RAID array for computers. And what will conspire to defeat that un-defeatable bunch of hardware?
This type of system is called Fail-Over. When one computer fails, the other takes over. They both have the same data on their drives, and the same hardware. Sound familiar?
Boy, once you have a RAID failure, you start to not trust anything.
Seems the only thing that can "fix" this is to have backups of all of your important files on another machine and/or on different media.
In this case, we did. Ironically, the backups that were on the RAID array, and that were possibly lost, were all still sitting on the computers that the backups came from. Sort of a reverse backup. Since File "A" was backed up from Computer "B" onto the Dell, a copy was on both machines. If one machine went down, the other still had it. Hmmm, unexpected redundancy of backups.
Most of us think the desktop or laptop is going to fail, so we buy the backup server. In our case, the backup server failed, and the desktops and laptops came to its rescue. Hmm, life is feeling good again.
A backup machine and the source machine each, in essence, backing up each other. Cool, like that idea. But there must be another risk in there somewhere...
There is. The data needs to also be off site somewhere so if both machines go down, like in a fire, you still have access from another location.
We have that :) Our Dell would back up all of the office computers and development computers to its now semi-trustable RAID array of drives. It would then FTP the files to a backup server that is in the same location as our webserver, in some nuke-proof concrete bunker that requires retina scans and fingerprints to get into. The webserver also backs up its files (databases, website files) to the online backup server. It then FTPs that to the Dell.
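With copies living in that many places, it helps to check now and then that every file really does exist in more than one of them. Here is a minimal Python sketch of that check. It is not the tool we use: the function names are made up, and it assumes each backup location is reachable as an ordinary folder (a mounted share, for example) rather than over FTP.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            # Reads the whole file at once: fine for a sketch,
            # very large files would want chunked reads.
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def under_protected(source, copies, minimum=2):
    """Return source files that have fewer than `minimum` identical copies elsewhere."""
    wanted = file_digests(source)
    copy_digests = [file_digests(c) for c in copies]
    short = []
    for rel, digest in sorted(wanted.items()):
        # A copy only counts if its contents match, so stale copies are flagged too.
        matches = sum(1 for d in copy_digests if d.get(rel) == digest)
        if matches < minimum:
            short.append(rel)
    return short
```

Something like under_protected("projects", ["dell_backup/projects", "offsite/projects"]) (all three paths hypothetical) would list the files that are missing or out of date in at least one of the two backup folders.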
Where's the next risk? If both our office and the bunker that houses our webservers go up in flames... well, that's when we quit.
Moral of the story? No matter what you do, you are at risk of a hard drive crash. Spend whatever money is necessary to get you to the level of risk you are comfortable with. Learn from our bad practices and from our good practices. (I think ours are mostly good ones.) Then believe that there are only two types of people in the world (those of you who have heard this can skip to the end): those who have had a hard disk crash and those who will.
Back up your files when you save them. Put them on two different machines. Write them to DVD-RAM, have a backup machine, have an offsite storage location for DVDs. Have an offsite backup system in a bunker somewhere. Look into web-based backup systems. And give a RAID array a try; we're going to stick with it and hope Windows doesn't get in the way again.
Right!
