Latest Publications

Advice for backing up your Macs

My wife asked me today if I could give a colleague some advice on how to back up a bunch of Macs. I’ll share my advice with you here. Over the past two decades I’ve used so many different backup systems and combinations of software and hardware that I can’t even count them all. So this raises the question: what do I do at home?

I use the Time Machine software built into Mac OS X Leopard (and newer). I use a locally connected USB 2.0 drive. A FireWire drive would also be good. Here is a drive that I like because it has lots of capacity, is reasonably affordable, is compact, and runs quietly.

Fantom GreenDrive Pro 2TB eSATA and USB 2.0 7200RPM 32MB External Hard Drive

Another that’s half the capacity, but cheaper:

Fantom GreenDrive 1TB USB 2.0 and eSATA External Hard Drive

Now, for home use a 2TB drive is probably enough for all your computers. At first I networked them all together and shared a single drive attached to one of my computers, so that all the backups were on the one big drive. I later decided that every computer should have its own drive for backups. Why? A few reasons:

  1. To conserve electricity. Backup snapshots should be taken and archived while you are using the computer. With a shared drive, the host computer may not respond over the network when it is asleep, depending on how it’s set up, so you need to keep that host machine powered up all the time, wasting electricity.
  2. Each computer does its backups when it gets used, and in the idle time before it falls asleep again. It works much better for me this way.
  3. Immediate restores. Having a local drive on each computer makes restoration super fast. It’s not like a network or tape backup where you need to wait for your data to transfer back on to your hard drive to begin using it.

It’s easy to set up Time Machine. Connect the drive, open “Time Machine Preferences” and select the drive.

I re-initialized mine using Disk Utility first so that it had a journaled Mac OS filesystem (Mac OS Extended, Journaled) on it instead of the default FAT formatting that comes from the factory.

One really nice thing about Time Machine is that you can easily revert to a prior point in time in the event you accidentally mess something up, get a virus, or whatever. It’s about the easiest tool I’ve ever used. It rotates backups hourly, daily, weekly, etc., and deletes old backups to make room for new ones. It’s totally automatic, whereas with other tools you need to set that all up yourself.
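To give a feel for what that rotation means, here is a minimal sketch of an hourly/daily/weekly thinning rule. This is not Apple's code. The function name `snapshots_to_keep` and the thresholds (every snapshot from the last 24 hours, one per day for 30 days, one per week after that) are my own assumptions for illustration.

```python
from datetime import datetime, timedelta

def snapshots_to_keep(snapshots, now):
    """Return the snapshots that survive a simple hourly/daily/weekly rotation.

    Assumed policy (illustrative only): keep every snapshot from the last
    24 hours, the newest snapshot per calendar day for the last 30 days,
    and the newest snapshot per ISO week for anything older.
    """
    keep = set()
    seen_days = set()
    seen_weeks = set()
    for ts in sorted(snapshots, reverse=True):  # walk from newest to oldest
        age = now - ts
        if age <= timedelta(hours=24):
            keep.add(ts)
        elif age <= timedelta(days=30):
            day = ts.date()
            if day not in seen_days:  # first hit is the newest of that day
                seen_days.add(day)
                keep.add(ts)
        else:
            week = tuple(ts.isocalendar())[:2]  # (ISO year, ISO week number)
            if week not in seen_weeks:  # first hit is the newest of that week
                seen_weeks.add(week)
                keep.add(ts)
    return sorted(keep)

# Usage: three days of hourly snapshots collapse to far fewer entries.
now = datetime(2024, 3, 31, 12, 0)
hourly = [now - timedelta(hours=i) for i in range(72)]
print(len(hourly), "snapshots taken,", len(snapshots_to_keep(hourly, now)), "kept")
```

Anything not in the returned list is what a tool with this policy would delete to make room for new backups.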

This sort of local backup does not help if your house or office gets burglarized or burns down because you lose both the primary and backup copy of the data.

Another option is to use JungleDisk to back your data up to the cloud. That has the advantage that you pay only for the storage you actually use, and the backups are off site, so if you have a theft or fire, you can still restore, potentially somewhere else. A disadvantage is that it requires adequate internet connectivity. Your upload speed needs to be fast enough to accommodate all of the data you produce within each backup interval. If your network is already constrained on available bandwidth, running backups over it could potentially aggravate matters. In short, if you have a big fat internet connection, then use JungleDisk.
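A quick back-of-the-envelope check helps when deciding whether your connection is fat enough. The sketch below is plain arithmetic and has nothing to do with JungleDisk's own software. The function names and the `usable_fraction` parameter (how much of your upload link you are willing to give to backups) are assumptions I made up for the example.

```python
def required_upload_mbps(changed_gb, interval_hours):
    """Average upload rate, in megabits per second, needed to send
    changed_gb gigabytes of new data within one backup interval."""
    megabits = changed_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    seconds = interval_hours * 3600
    return megabits / seconds

def keeps_up(changed_gb, interval_hours, upload_mbps, usable_fraction=0.5):
    """True if the share of the link reserved for backups can carry the load.

    usable_fraction is an assumed share of upload bandwidth for backups;
    the rest is left for everything else on the network.
    """
    needed = required_upload_mbps(changed_gb, interval_hours)
    return needed <= upload_mbps * usable_fraction

# Usage: 4.5 GB of changes per hourly interval on a 20 Mbit/s uplink.
print(required_upload_mbps(4.5, 1))
print(keeps_up(4.5, 1, upload_mbps=20))
```

If `keeps_up` comes back false, either lengthen the backup interval, back up less data, or stick with a local drive.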

Scale -> Complexity -> Reliability -> Support

Linux Magazine released an article today by Joe Brockmeier titled Rethinking Gmail: Reliability Matters. The article makes some good points, including the obvious one that, to some, email is a mission-critical application. I don’t dispute the points. I’d like to discuss why these systems fail to begin with, and how as an end user you can have realistic expectations for web scale systems.

First of all, running a “web scale” application means you have millions of end users. Running a system at that scale demands a certain level of complexity. A “cloud computing” system used to address “web scale” requirements drives complexity. The more complex a system is, the higher the risk that it will fail as a result of its own complexity. Therefore, web scale systems are more difficult to provide on a reliable basis than simpler systems.
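One way to put rough numbers on this is the textbook series model: if a request needs every component to be working, and the components fail independently, the overall availability is the product of the individual availabilities. Both of those conditions are simplifying assumptions of mine, as are the function names below. Real systems add redundancy, which this sketch does not model.

```python
from math import prod

def series_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures and no redundancy (illustrative only).
    """
    return prod(availabilities)

def downtime_hours_per_year(availability):
    """Expected hours of downtime in a 365-day year at a given availability."""
    return (1.0 - availability) * 365 * 24

# Usage: one "three nines" component versus ten of them chained together.
simple = series_availability([0.999])
complex_chain = series_availability([0.999] * 10)
print(round(downtime_hours_per_year(simple), 2), "hours/year for one component")
print(round(downtime_hours_per_year(complex_chain), 2), "hours/year for ten in series")
```

Under these assumptions, every extra required component can only pull the overall figure down, which is the point about complexity.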

The simple truth of the matter is that all systems fail at one time or another. No matter how well designed a system is, and how well you test it, eventually something will happen that you were not prepared for, and an outage will occur. System designers must be disciplined to plan for potential problems so they can be predicted and mitigated before they occur in production. However, it’s only a matter of time until an outage does occur. Anyone who tells you that you can have a perfect reliability record forever is a blathering idiot. Don’t be tempted to align your expectations based on what idiots say.

Can you design a system to be highly reliable? Of course. Can a complex system exhibit a reliability record that’s higher than a simple one? Of course. However, if the system is driven by software, and that software is complex, then it will contain human errors in a ratio proportional to its complexity. Simply put, the more code there is, the more chance it will contain bugs or design defects. Yes, these can be mitigated, but I maintain that this problem cannot be solved 100%, and that unsolved defects eventually lead to service outages.

Not convinced? In 1986 the Space Shuttle Challenger exploded. Why? Because the decision-making procedures were flawed. Human error ultimately resulted in the death of seven astronauts. Blame the problem on a mechanical failure of an o-ring? No. Flawed o-ring design and a bad decision-making process led to death. The same thing happens in computer networks. Even when the software or configurations are not flawed, human error can still lead to system outages. It happens all the time.

Ever heard of a service provider offering a 100% uptime guarantee? You think that means they are going to be up 100% of the time. No, it does not. It means that you will get a discount on your next bill if the system is not up 100% of the time. In severe cases it may give you the option to terminate your service contract. That’s it, plain and simple. If you look long and hard at these guarantees, you will see that the penalties never compensate you for the actual damage of the service being unavailable. It’s a marketing tactic.

As an end user of web scale systems, set some realistic expectations for yourself. The system will break sometimes. I’m sure that your service providers will do everything they reasonably can to avoid outages. In his article, Brockmeier makes a good point that for free services there’s no simple way to extend you a discount. That does not mean that they care any less about uptime. They care. The bottom line is that ALL large-scale systems have an imperfect reliability record. Compare Gmail’s reliability record with your own internal corporate email systems. Your reliability is higher? You lie! Measure it, and be honest.
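Measuring it honestly does not take much. Here is a minimal sketch that turns a log of outage start and end times into an availability percentage for a reporting period. The function name and the sample outage are made up for illustration, and the sketch assumes the outages do not overlap and fall entirely inside the period.

```python
from datetime import datetime, timedelta

def availability_percent(period_start, period_end, outages):
    """Percentage of the period during which the service was up.

    outages is a list of (start, end) datetime pairs. They are assumed to be
    non-overlapping and to lie entirely within the reporting period.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")
    down = sum((end - start).total_seconds() for start, end in outages)
    return 100.0 * (total - down) / total

# Usage: a 30-day period with a single hypothetical 43 minute 12 second outage.
start = datetime(2024, 4, 1)
end = start + timedelta(days=30)
outage_start = datetime(2024, 4, 10, 9, 0)
outages = [(outage_start, outage_start + timedelta(minutes=43, seconds=12))]
print(round(availability_percent(start, end, outages), 3))
```

Be sure to log every outage, including the short ones and the ones nobody complained about, or the number will flatter you.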

So now that we are being honest, and expect that sometimes systems will fail, I’d like to make my main point. When systems do fail, keeping customers satisfied is about how you respond to the problem, and how you commit to fixing it so that it won’t keep happening. To do this well, here are some guidelines:

1) No Excuses. Customers don’t want to hear about how this problem is not your fault, or how you never expected this. Simply accept responsibility. Be sincere and humble, and commit to taking care of the problem.

2) Communicate. Focusing all your energy on the solution and ignoring the suffering subscriber base during an outage is a mistake. Take enough time to get your facts together, verify them, and use them to keep your subscribers well informed during an outage. If you notice a significant outage before your customers do, find a way to tell them before they notice. They will appreciate your proactive notification.

3) Analyze and Correct. Once service is restored, scrutinize the problem’s root cause, and find a way to prevent a recurrence of the problem.

I could keep listing more and more things here, but these three are the most important to remember.

In conclusion, I agree 100% with Brockmeier’s article, but there is more to the story. Reliability does matter. But in addition, realistic expectations matter just as much.

Coding in the Cloud

I have been writing a 10-part series on the Rackspace Cloud Blog. I’ll be keeping a running list of the posts here as they are published.

Rule 1 – Cache is Your Friend

Rule 2 – Don’t write to the database in real time

Rule 3 – Use a “Stateless” design whenever possible

Rule 4 – Avoid Unnecessary External Dependencies

Rule 5 – CMS Plugins

Rule 6 – HTTP Includes

Rule 7 – Coming Soon

Rule 8 – Coming Later

Rule 9 – Coming Later

Rule 10 – Coming Later

Yep, if you follow all 10 of the rules, you’ll probably have a really good cloud-based web app.