What you should do before your cloud goes down

analysis
Dec 14, 2021 | 5 mins
Cloud Computing

When Amazon Web Services recently had a case of network hiccups, we all found out how miserable it is to have all our IT eggs in one cloud basket. We can do better. Here’s how.

Boy, was that fun or what? When Amazon Web Services’ (AWS’) US-EAST-1 Region started having fits with its application programming interfaces (APIs) on Dec. 7, we found out just how much we all depend on AWS. (Even people who’d never heard of it knew something was wrong when their Disney+ and Netflix shows didn’t appear on TV, the new Roomba robot vacuum cleaner stopped cleaning, and those “smart” lights didn’t work—because they all rely on AWS.)

While that was annoying, it was much worse for the many companies who depend on AWS for their IT operations. Or for those who discovered that while they’d never given AWS one thin dime, many of the services they did pay for—Asana, Smartsheet, Trello, and Slack, to name a few—were built atop AWS.

Ouch.

On Friday, Amazon said in a blog post on its site that “unexpected behavior” triggered the hours-long outage.

“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network,” the company wrote in the post. As a result, devices connected to AWS’ network became overloaded.

So, what can you do? Well, for one thing, stop relying on so many Internet of Things (IoT) devices. Your dishwasher, holiday lights, refrigerator, and toothbrush really don’t need to depend on the cloud. More seriously, though, you can drop any thought of having your IT department go back to running all your own servers. Look at where your business was back when that made sense and where it is now.

During last week’s outage, a sysadmin friend of mine had to deal with a company CEO who was having hysterics. The CEO wanted to get back all the company’s data, several hundred terabytes’ worth, and get some application running again right away.

But no matter what the boss wants, you can’t always magically make things work, especially when they’re outside of your control. Nor, if you sit down with your CFO and go over the numbers, is it likely you can shift all that data and all those applications back to your company. There’s a reason you moved to the cloud, usually because it costs less to run things there than locally.

Maybe you really could do it cheaper with your own server room. If so, good for you! But before you pull the trigger, look at what your downtime was like before the cloud. I’d bet you’ll find you were actually down and out more often when you ran things yourself.

So, should you “solve” your cloud problem by moving to a multi-cloud setup? That might work, but doing it properly requires at least two public cloud providers and possibly your own data center. That gets really, really expensive.

And if what you really want is a safety net for failures like the AWS one, sorry—multi-clouds simply won’t work. As Lydia Leong, Gartner Distinguished VP Analyst, put it: “Multi-cloud failover requires that you maintain full portability between two providers, which is a massive burden on your application developers. The basic compute runtime (whether VMs or containers) is not the problem, so OpenShift, Anthos, or other ‘I can move my containers’ solutions won’t really help you. The problem is all the differentiators—the different network architectures and features, the different storage capabilities, the proprietary PaaS capabilities, the wildly different security capabilities, etc.”

Enough of the bad news. Here’s what Leong and I think can work for keeping your business up and running even when your primary cloud is down and out.

  1. Run your active applications across at least two, and preferably three, Availability Zones (AZs) within each region that you use. Yes, three is much harder to do than two, but it’s still a heck of a lot easier than trying to build a multi-cloud failover solution.
  2. Run your active applications across at least two, and preferably three, regions. Again, two is much easier than three, but if your mission-critical application is truly mission-critical, it may be worth the trouble. Can’t do that? Then see if you can at least afford fast and fully automated regional failover.
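
To make those two rules a little more concrete, here is a minimal Python sketch of how you might audit where your applications run and decide which region should serve traffic. It is an illustration only, not anything AWS ships: the `Instance` record, the `audit_placement` and `pick_active_region` helpers, and the sample zone names are all hypothetical, and a real setup would pull this information from your provider's inventory and health checks rather than a hard-coded list.

```python
from dataclasses import dataclass

# Thresholds taken from the advice above: "at least two, and preferably three".
MIN_ZONES_PER_REGION = 2
MIN_REGIONS = 2


@dataclass(frozen=True)
class Instance:
    """One running copy of an application (hypothetical record, not an AWS type)."""
    region: str
    zone: str
    healthy: bool = True


def audit_placement(instances):
    """Return a list of human-readable gaps in the AZ/region spread."""
    zones_by_region = {}
    for inst in instances:
        zones_by_region.setdefault(inst.region, set()).add(inst.zone)

    problems = []
    if len(zones_by_region) < MIN_REGIONS:
        problems.append(
            f"only {len(zones_by_region)} region(s) in use, want {MIN_REGIONS}+"
        )
    for region, zones in sorted(zones_by_region.items()):
        if len(zones) < MIN_ZONES_PER_REGION:
            problems.append(
                f"{region}: only {len(zones)} zone(s), want {MIN_ZONES_PER_REGION}+"
            )
    return problems


def pick_active_region(instances, preferred_order):
    """Return the first preferred region with a healthy instance, or None."""
    healthy_regions = {inst.region for inst in instances if inst.healthy}
    for region in preferred_order:
        if region in healthy_regions:
            return region
    return None


# Illustrative fleet: the primary region is down, the backup has a single zone.
fleet = [
    Instance("us-east-1", "us-east-1a", healthy=False),
    Instance("us-east-1", "us-east-1b", healthy=False),
    Instance("us-west-2", "us-west-2a"),
]
print(audit_placement(fleet))
print(pick_active_region(fleet, ["us-east-1", "us-west-2"]))
```

With this sample fleet, the audit flags that the backup region runs in only one zone, and the failover helper skips the unhealthy primary and picks the backup. The hard part in practice is not this selection logic but keeping data and configuration in the second region current enough that switching over actually works.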

Let’s face it. The cloud is here to stay. Since that’s the case, and clouds will continue to go down, it only makes sense to use the best tools they give us to protect us from their inevitable failures.

We’ll still have days when everything goes to hell in a handbasket, but at least there will be fewer of them.
