Benefits and Hindrances of Regular Server Reboots

Posted 2019-04-07 07:03

In my years of working on multiple teams, I've met several infrastructure managers who instituted a policy of weekly server reboots. As a developer, I was always against the policy - it seems to be a hack to work around software bugs and hardware instabilities instead of correcting them.

What are people's opinions on the policy, and what are its positive and negative points?

8 answers
Rolldiameter
#2 · 2019-04-07 07:24

Answering my own question: One of the benefits that I see from the policy is when it is applied to a server cluster, and the processes are failed over from one node to another. That way all nodes are constantly tested for the correct software install.
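
A minimal sketch of that rolling pattern, in Python. The `placement` mapping and the `reboot` callback are hypothetical placeholders I made up for illustration, not any real cluster manager's API; a real setup would call whatever failover tooling the cluster provides.

```python
from typing import Callable, Dict, List


def rolling_reboot(
    placement: Dict[str, List[str]],
    reboot: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Reboot every node in turn, failing its services over first.

    placement maps node name -> services currently running there.
    reboot(node) is a caller-supplied hook (hypothetical) that returns
    True once the node is back up and healthy.
    """
    nodes = list(placement)
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to fail over")
    for i, node in enumerate(nodes):
        target = nodes[(i + 1) % len(nodes)]
        # Fail over: move everything off the node before touching it.
        placement[target].extend(placement[node])
        placement[node] = []
        if not reboot(node):
            # Stop here so the remaining nodes keep serving.
            raise RuntimeError(f"{node} did not come back after reboot")
    return placement
```

Because each node has to accept the previous node's services before it is rebooted itself, a broken install on any node shows up during the planned window rather than during a real outage.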

劫难
#3 · 2019-04-07 07:26

Obviously if the source of a problem cannot be fixed in a timely fashion, it has to be worked around. Scheduling a reboot to fix it is an easy way out to save the business if that works.

Sure, it mentally hurts and shouldn't be needed and it would be best to work against such a solution, especially if one's in control of the problematic software or in a position to bitch-slap the producers for a fix or simply replace it. But if not..?

I remember doing it for the servers in a Citrix farm. In the end they were rebooted every night by a half-complicated script that waited for users to log off, locked logins to specific servers and then rebooted the free ones. The reason was an old 16-bit 4GL client application that we simply couldn't get rid of, which tended to severely degrade overall user responsiveness after a few days of uptime.
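
The original script is long gone, so this is only a rough sketch of the drain-then-reboot logic, in Python. The `Server` class and `nightly_pass` function are made-up names, and setting `rebooted = True` stands in for whatever command actually restarts the machine. The caller is assumed to pass in only the servers chosen for draining, so the rest of the farm keeps accepting logins.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    active_sessions: int
    logins_enabled: bool = True
    rebooted: bool = False


def nightly_pass(draining: List[Server]) -> List[str]:
    """Run one pass over the servers selected for draining.

    Returns the names of the servers rebooted during this pass.
    """
    done = []
    for s in draining:
        if s.rebooted:
            continue
        # Lock logins so the server can only drain, never refill.
        s.logins_enabled = False
        if s.active_sessions == 0:
            s.rebooted = True        # stand-in for the real reboot command
            s.logins_enabled = True  # back in the pool after the reboot
            done.append(s.name)
    return done
```

The pass is meant to be repeated (say, every few minutes through the night) until every server in the batch has been rebooted.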

I agree, though, that mostly it seems to come from not being smart enough to figure out the cause and fix it - not everyone is as well-versed in maintenance or as motivated as we'd like.

The star"
#4 · 2019-04-07 07:36

If you reboot your servers occasionally, you can be sure they will come back up. Though weekly sounds like a serious overkill, I have seen this problem on Linux machines with long uptimes.

Someone didn't bother to set up a critical service to start automatically on boot. Or the order of services coming up is wrong. Or someone upgraded libraries, added/removed software, etc. and the executable no longer works (it was started up with the old libraries, and continued using them; now it gets a dynamic linker error). Or it turns out service A depends on service B and service B depends on service A (oops).
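
The circular-dependency case can be caught without a reboot if the dependencies are written down somewhere. A small sketch in Python, assuming you can express them as a plain mapping (the service names and the `boot_order` helper are invented for illustration; it uses the standard library's `graphlib`, available from Python 3.9):

```python
from graphlib import CycleError, TopologicalSorter


def boot_order(depends_on):
    """Return a start order where every service follows its dependencies.

    depends_on maps service -> set of services that must start first.
    Raises ValueError if the dependencies contain a cycle.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # exc.args[1] is the list of nodes forming the cycle.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc
```

Calling `boot_order({"A": {"B"}, "B": {"A"}})` raises `ValueError`, which is the "oops" above found on paper. It only checks what you declared, though; it does not replace actually rebooting the box.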

At some point, when you least want to, you will take a reboot. The colo will drop the power on you; the server's power supplies will fail; someone will pull the cord/hit the reset button on the wrong server; etc. Now, when you can least afford downtime, your bloody server won't come back up.

Just like software, system configurations need testing. How often you need to do this testing depends on how your boxes are administered.

来,给爷笑一个
#5 · 2019-04-07 07:37

Our servers are all Linux servers at work, and we don't ever reboot and haven't had any problems. I agree that it's a hack at best, and I also think it probably has something to do with the first response people used to always give when supporting Windows issues: "Have you rebooted your computer?"

Now as to why it might be beneficial, you may have applications that get into a weird state or that have memory leaks that a restart would resolve.

A big negative to me is that you've got to schedule weekly downtime for the servers. For some that's not an issue, and for others that's a huge issue.

forever°为你锁心
#6 · 2019-04-07 07:42

Apologies for dusting off an old thread.

I think everyone's missing the point, especially the die-hard 'reboot? I'd rather sell my commodore!' *nix admins.

The point is that a weekly window should be SCHEDULED. Doesn't mean it has to be used, in fact the preference is that it isn't used as it's inevitably at some forsaken hour of the morning.

But if it's there, you can use it.

Personally, I think a quarterly reboot is a very good idea - it can give you a heads up on problems (hardware and software) and, as the most forward-thinking of the other posters pointed out, makes you aware of changes that prevent a smooth startup and only become apparent after a reboot. That beats having the situation arise after a 4hr power cut, when taking another 2hrs to bring your box up becomes really quite embarrassing....

There are other upsides..

  • It gets the management used to reboots, and you have their confidence when you actually do need a reboot (e.g. physically moving it). If you never reboot a box, your manager's gonna be pretty darned nervous when you say it needs rebooting after 4yrs and no downtime.

  • You yourself get used to reboots, and know what can and does go wrong when it's offline.

  • You KNOW how long reboots take, so when it's coming back up and takes 10mins longer than usual, you're straight into the logs.

  • If you get knocked down by a bus tomorrow, there's CURRENT (not 4yr old) documentation on what happens when a reboot occurs (assuming you're a nice admin and write things down)

  • A 30-minute reboot per quarter fits well within a 99.9% uptime SLA.

  • Finally it clears out the proverbial cobwebs.
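
The SLA arithmetic is easy to check. A quick sketch in Python, assuming a quarter of roughly 91 days and an SLA measured over the whole quarter (both are my assumptions; real SLAs often define the measurement period differently and may exclude planned maintenance):

```python
def downtime_budget_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime allowed by an uptime SLA over a period."""
    return (1 - sla_percent / 100) * period_days * 24 * 60


quarter_days = 91  # assumption: a quarter is roughly 91 days
budget = downtime_budget_minutes(99.9, quarter_days)
print(f"99.9% over a quarter allows about {budget:.0f} minutes")
print("one 30-minute reboot fits:", 30 <= budget)
print("thirteen weekly 30-minute reboots fit:", 13 * 30 <= budget)
```

Under those assumptions the budget is about 131 minutes per quarter, so one 30-minute reboot fits comfortably, while actually using a 30-minute window every week would not unless planned maintenance is excluded.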

To answer some points AGAINST regular rebooting..

  • The one about covering up a bad driver\memory leak etc is hilarious. How do you know it's a memory leak\bad driver unless you reboot the server? Not only that but what if you don't manage to fix it in your planned downtime? If you have a weekly scheduled window it's no problem! You just try again next week....

  • Notification system - if you have a planned window, you can set a planned exception. If your software\script doesn't do this, then I suggest modern software\better script writing.

  • As for the planned exception window hiding problems that 'happen to occur during the planned exception window' that's just laughable. Your other server stats will show this issue up very quickly if you review them at all.
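
For the notification point, a planned exception can be as simple as the following Python sketch. The Sunday 03:00-04:00 window and the function names are hypothetical, not taken from any particular monitoring product:

```python
from datetime import datetime, time


def in_weekly_window(now: datetime, weekday: int, start: time, end: time) -> bool:
    """True if `now` falls inside a same-day weekly window (start <= t < end)."""
    return now.weekday() == weekday and start <= now.time() < end


def should_alert(now: datetime, host_down: bool) -> bool:
    # Hypothetical window: Sundays 03:00-04:00 (weekday() == 6 is Sunday).
    planned = in_weekly_window(now, 6, time(3, 0), time(4, 0))
    return host_down and not planned
```

Anything that is still down once the window closes alerts as normal, which is why the window doesn't hide real problems for long.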

Of course a blanket policy is not recommended, and you should have criteria for exceptions (e.g. disk space over a certain size etc)

Having said that, the bottom line is that just because your server shouldn't need to be rebooted, it's incredibly naive to conclude that you should never reboot it....

Edit:

I'm not sure I made this clear enough, but rebooting should NOT be used for plastering over a problem. The window should be weekly so that you have repeated attempts at RESOLVING the issue, not 'living with it'.

Rebooting as a method of dealing with a problem on a server is poor sysadmin practice. Nothing is learnt and it wastes people's valuable time and (rightly) lowers the management's opinion of you.

My point is

  • It is difficult to ensure you resolve a problem without an accepted, scheduled, weekly maintenance window in place.
  • With a weekly window you have an ongoing opportunity to sort things out properly, and avoid the situation where you have half-a-dozen jerry-rigged workarounds on as many different servers.
孤傲高冷的网名
#7 · 2019-04-07 07:42

It is a hack really but it might be the most efficient hack. It is an 80:20 type problem where you can solve 80% of the problem with 20% of the effort. If you can survive the downtime or the downtime costs you less than actually fixing the root cause then this is a good solution. I personally don't like it but that is only because it isn't a clean solution.
