Anyone calling an IT support line is likely to be asked a simple question very early in the conversation:
“Have you restarted your system?”
This question has evolved into one of the top catchphrases of IT support, right up there with “Is it plugged in?” and “Do you see any lights blinking?” From the support side of the call, you can feel the user rolling their eyes in disgust when you ask for the restart. We agree… performing a regular reboot of a computer should be unnecessary. You are absolutely right to think that a well-designed computer should be able to run indefinitely. There’s just one small problem.
Most computers just don’t work that way.
Oh, I can go into the technical details. We can talk about how some applications and operating systems don’t properly release memory, or how software defects can cause applications to fail intermittently. Also, and I’m not trying to be mean here, but users have a habit of installing a lot of unnecessary stuff. Certain operating systems (say, Linux) have a reputation for being less susceptible to unplanned downtime than others (for instance, Microsoft Windows). But at the end of the day, when faced with a system exhibiting anomalous performance, users and administrators alike must face a harsh reality in this world of always-up cloud services.
Without a reboot, systems fail.
An article that truly captures the essence of this fact caught my eye this week. The FAA released an airworthiness directive for the Boeing 787 Dreamliner. This directive resulted from the determination that generator control units operating continuously for 248 days will cause all AC power on the aircraft to fail. Since the Boeing 787 is a fly-by-wire aircraft, the loss of AC power could cause the complete loss of control of the aircraft. To avoid this condition, the FAA directive specifies a maintenance procedure. This procedure requires an electrical power deactivation of the aircraft at intervals not to exceed 120 days.
Yes, you read that correctly. The Boeing 787 Dreamliner requires a reboot every four months.
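Why 248 days? The directive as summarized above doesn’t spell out the arithmetic, but a commonly cited explanation is a counter overflow. The sketch below assumes a signed 32-bit counter that ticks 100 times per second; both of those numbers are my assumptions for illustration, not details taken from this post, and the function names are made up.

```python
# Assumption (not from the FAA summary above): the uptime counter is a
# signed 32-bit integer incremented once every 10 ms (100 ticks per second).
TICKS_PER_SECOND = 100
INT32_MAX = 2**31 - 1
SECONDS_PER_DAY = 86_400


def days_until_overflow(ticks_per_second=TICKS_PER_SECOND, max_value=INT32_MAX):
    """Days of continuous uptime before the counter passes max_value."""
    return (max_value + 1) / ticks_per_second / SECONDS_PER_DAY


def wrap_int32(value):
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    print(round(days_until_overflow(), 2))  # 248.55
    # One tick past the maximum, the counter flips to a large negative value.
    print(wrap_int32(INT32_MAX + 1))        # -2147483648
```

Under those assumptions the counter runs out of room after roughly 248.55 days, which lines up with the figure in the directive. A power cycle resets the counter to zero, which is why a 120-day interval leaves plenty of margin.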
Now, this isn’t to say that all computers must restart every few weeks. This past week, we held a decommissioning ceremony for a device that didn’t have a second of downtime over the last seven years. This little HP 4/8 SAN switch was one of the first components of our CMA Cloud infrastructure. Operating a device continuously over this amount of time is quite a feat. Anything that can go wrong typically will. We were no exception. Just off the top of my head:
· Ice storms
· Power loss for no good reason (this happens more often than you think)
· Water leaks due to rain
· Water drips due to condensation
· Air conditioning failure
· UPS failure
· Generator failure
And the list goes on. All in all, this little workhorse persevered through nine disk enclosures, two backup generators, 17 battery backups, 30 servers and 13 different engineers. When we bought it, I remember being worried that it wasn’t going to be worth the investment. I can now see it was worth every penny.