If your backup systems fail silently, would you notice?
Me: <while working on something else with a client> “So – how have your backups been doing recently?”
Client: “Umm…good…I think…?” insert <shy look> and <crickets chirping – loudly>
I cut my teeth in I.T. (as many of us did) on Backup Exec and tape drives. Ugh – right? Back in the day, it was common for clients to need to swap tapes each day. I would often show clients how to check the status of the previous night’s backup jobs when they swapped the tapes. That process gave folks a physical cue to check the backup status and make sure that the tape they were taking offsite actually had good, verified data on it.
Fast forward to now. Many modern backup systems are highly automated. Some require no physical attention at all because they write data to disk-based storage that then gets automatically replicated offsite. Many of them rely on email-based reporting. So, the old physical cue to check the backup status when you change the tape is simply gone. Don’t get me wrong. Newer backups that leverage virtualization with technology like VMware CBT (Changed Block Tracking) and deduplication are dramatically better than the old tape-based solutions. However, they introduce a new challenge. They are so good and so highly automated that they are easy for a busy I.T. pro to forget about.
What happens by default in many cases is that the person (or people) who monitors the backup systems gets emails from those systems reporting how the backup jobs are doing. Many organizations have multiple backup jobs configured that run at different times of the day. These email-based logs can get rather noisy and potentially confusing. So, what do some folks do? Well, they create an email rule to move all those noisy backup alerts into a folder…which they check…sometimes. Hear the crickets again?
If the backup system fails, often the emails simply stop. The system fails – silently. In the worst-case scenario, someone finds out that their backup solution failed silently, and that they don’t have recent backups, just when they need to do a restore. Yikes. Now we are potentially in RGE (Resume Generating Event) territory for that person, and in real trouble for the business they work for.
Before I go further, let me say that a lot of people do a great job monitoring their backup systems. However, we are all human and we all make mistakes and occasionally miss things. The goal of this post is not to give you hard working I.T. pros a hard time, but to help better protect you and the business you work for from the very real and very painful damage a data loss event can cause.
With that in mind, here are several potential solutions to this problem.
Low tech: Make a checklist, or do something that forces you to manually check the status of each of your backups each day. Yep – that’s right. The I.T. guy just suggested a checklist. I’m even good with you filling it out on paper. It is low tech – but it works. Do something that changes your behavior and forces you to check your backups – all of your backups – each day. That way when you go to do a restore, you can be confident your backup jobs have been running as planned.
A paper-based checklist does not seem like an awesome solution. So, here is a potentially better way. I’ve implemented this recently, and so far I am very happy with it.
High tech: Let your monitoring system keep track of your backup logs! Recently, I found a great component (a sensor, really) in my favorite monitoring software package, PRTG from Paessler. I use this tool to monitor my own infrastructure, as well as infrastructure in several client environments. PRTG is absolutely fantastic. I can’t say enough good things about it. If you are not using it, I’d strongly suggest you check it out. Paessler offers a free trial.
PRTG has an IMAP sensor that you can configure to connect to an email account (over IMAP) so that it can essentially read your backup system’s email reports for you, and actively alarm when something is not working correctly. The PRTG folks have a great write-up on the entire config process here: http://www.paessler.com/manuals/prtg/monitoring_backups.
So, if you implement PRTG and properly configure this monitoring, PRTG will actively alert you if a backup system fails silently. This alert will hopefully cause you to investigate and resolve the problem quickly.
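To make the idea concrete, here is a minimal sketch in Python of the kind of check an email-reading monitor performs. This is not PRTG’s implementation, just an illustration of the logic. The function name, the keywords, and the 26-hour window are all assumptions of mine, and the sketch assumes every report email carries a Date header with a time zone. The key point is the third outcome: if no recent report exists at all, that is treated as a problem rather than as good news.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical values: match these to what your backup product really sends.
SUCCESS_KEYWORD = "success"
FAILURE_KEYWORDS = ("fail", "error", "unsuccessful")
MAX_REPORT_AGE = timedelta(hours=26)  # nightly job plus some slack


def check_backup_reports(raw_messages, now, max_age=MAX_REPORT_AGE):
    """Return "ok", "failed", or "stale" for a list of raw report emails."""
    newest = None  # (sent_time, subject) of the most recent report seen
    for raw in raw_messages:
        msg = message_from_string(raw)
        sent = parsedate_to_datetime(msg["Date"])
        if newest is None or sent > newest[0]:
            newest = (sent, msg["Subject"] or "")

    # No report at all, or only old ones: the backup may have failed silently.
    if newest is None or now - newest[0] > max_age:
        return "stale"

    subject = newest[1].lower()
    # Check failure words first, because "unsuccessful" contains "success".
    if any(word in subject for word in FAILURE_KEYWORDS):
        return "failed"
    if SUCCESS_KEYWORD in subject:
        return "ok"
    # An unrecognized subject is treated as a failure, not as a pass.
    return "failed"


def make_report(subject, date):
    """Build a tiny raw email for demonstration purposes."""
    return f"Subject: {subject}\nDate: {date}\n\nJob details here.\n"


now = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)
good = make_report("Nightly backup: Success", "Mon, 06 Jan 2025 02:00:00 +0000")
bad = make_report("Nightly backup: Failed", "Mon, 06 Jan 2025 02:00:00 +0000")
old = make_report("Nightly backup: Success", "Fri, 03 Jan 2025 02:00:00 +0000")

print(check_backup_reports([old, good], now))  # ok
print(check_backup_reports([good, bad][::-1], now))  # newest of equal dates: first seen
print(check_backup_reports([old], now))  # stale
print(check_backup_reports([], now))  # stale
```

In a real setup the raw messages would come from the mailbox (Python’s standard imaplib module can fetch them), and a "stale" or "failed" result would trigger an alert instead of a print.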
Obviously, it is critical to properly configure this monitoring. You need to be crystal clear on what you are monitoring for. If you configure PRTG incorrectly, it could give you a false sense of security and make you think things are working well when in fact they are not. So, if you decide to implement this solution, I would suggest that you configure it first, then test it thoroughly to make sure that it properly alarms when a backup fails. You can simulate a failure in a variety of ways, for example by temporarily disabling a backup job so that no report email arrives.
If you need help implementing this, give me a call. I do consulting for a living, and I’d be happy to help you implement this solution. If you are an existing client and your environment is too small to justify a dedicated PRTG install, let me know. If we discuss it and it is appropriate, I’ll work with you to potentially use my PRTG implementation to monitor your backups for you.