"Hello, IT, did you try turning it off and on again?"
I see this happening over and over again. A network service goes down for some reason. It is critical, it has to run. The person on the rota restarts it. What's wrong with that picture?
Unlike desktop support, IT operations can't afford to just "try turning it off and on again". That is not how proper maintenance is done. Be it a junior sysadmin or a regular developer, some people seem to think that high availability is about bringing the service back up as fast as possible when it fails. Well, not really.
First of all, you don't have an HA setup if you have downtime when one component is faulty. It shouldn't matter to you that something went down, since even before you notice it, a standby element will have taken over all the functions of the failed one. Automatically. Without any human interaction. It seems obvious, I know, but still there are those who seem to think otherwise. We have plenty of tools for building a proper HA setup. They existed even before the "DevOps" hype started, but they are more and more in the spotlight now that automation is in such demand. Ironically, the old-school sysadmins seem to know better how to use them than the new generation of multi-purpose developers, the self-proclaimed "DevOps engineers". These are the people I most often see using the "restart as fast as possible" approach.
A daemon is a piece of software that is supposed to run in the background and do its job continuously; once started, a daemon should stay alive until it's stopped manually. If it stops for any other reason, a sysadmin's first and foremost duty is to find out why it went down. Only when the cause of the failure is known should the sysadmin bring the daemon back up. I often hear "it didn't happen before", "it's the first time", "let's hope it won't happen again" and other excuses of that kind.
Well, that's a rather naive way of thinking, and really, an experienced sysadmin shouldn't be saying things like that.
Nothing's flawless. There's no such thing as perfect software, so you should always be prepared for the daemon to fail. It shouldn't, but you have to have a plan for when it does. Restarting it and hoping for the best is not really a plan. It happened once, so it will surely happen again, but what if there's no one around to bring it back up? Of course, you can use helpers like daemontools or perpd, but that's the same kind of approach.
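To make "find out why first" a bit more concrete, here is a minimal sketch of a helper that snapshots some evidence before anyone touches the dead daemon. The function name, the arguments and the choice of what to collect are my assumptions for illustration, not a standard tool; adapt them to whatever your daemon actually logs.

```shell
#!/bin/sh
# Hypothetical helper: snapshot evidence about a dead daemon BEFORE restarting it.
# Usage: collect_evidence <service-name> <log-file> <output-dir>
# Prints the directory the evidence was written to.
collect_evidence() {
    name=$1
    log=$2
    out=$3/$name-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$out" || return 1

    # Last lines the daemon logged before it died (if the log is readable).
    if [ -r "$log" ]; then
        tail -n 200 "$log" > "$out/last-log-lines.txt"
    fi

    # Kernel messages often explain OOM kills and segfaults.
    # dmesg may be restricted or missing; in that case the file is just empty.
    dmesg 2>/dev/null | tail -n 100 > "$out/dmesg.txt"

    # A full disk is a classic reason for a daemon to give up.
    df -h > "$out/df.txt" 2>&1

    echo "$out"
}
```

Something like `collect_evidence rsyslogd /var/log/messages /var/tmp/incidents` (paths are examples only) leaves a directory you can look at after the standby has taken over, instead of losing the state to a hurried restart.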
Let me give you an example. Recently I've been working on a new centralized logging system design built on rsyslogd. I tried to keep the configuration as simple as possible, but it was still quite complex. It was easy to make a mistake, and I did. More than once. On one of the servers rsyslog was running under the control of perp. If you don't know it, perp is basically a
while :; do program; done loop; it'll restart your service whenever it goes down, for whatever reason. It can be useful, and I'll give an example later, but in this particular case it was doing more harm than good. So we have a monitoring system that checks whether the daemons, rsyslog amongst them, are alive. And perp restarts rsyslog whenever it dies. And I made a mistake in the config file. And it was causing rsyslog to fail to initialize properly and abort during startup. You get my point? Monitoring didn't catch that, since the service was running - for a second or so at a time, but continuously restarted. You might say that the check algorithm was wrong and that it should have used perpls. Yes, you'd be right. But it wasn't doing that; the person setting perpd up hadn't thought about it. Ultimately, though, the error was in thinking that it should work like this in the first place.
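One way a check could have caught the restart loop, without depending on perp at all, is to look at how long the process has been alive rather than whether it exists. Below is a rough Nagios-style sketch (exit code 0 for OK, 2 for CRITICAL). The function names and the age threshold are my own, and it assumes a Linux box where ps supports the etimes field (procps-ng); it is not what our monitoring actually ran.

```shell
#!/bin/sh
# Nagios-style check: a daemon counts as healthy only if its oldest process
# has been alive for at least a minimum number of seconds.

# Pure decision logic, kept separate so it can be tested without a real daemon.
# $1 = process age in seconds (empty if no process was found), $2 = minimum age
classify_age() {
    age=$1
    min=$2
    if [ -z "$age" ]; then
        echo "CRITICAL: not running"
        return 2
    fi
    if [ "$age" -lt "$min" ]; then
        echo "CRITICAL: running for only ${age}s, possible restart loop"
        return 2
    fi
    echo "OK: running for ${age}s"
    return 0
}

# $1 = exact process name, $2 = minimum age in seconds
check_daemon_age() {
    # pgrep -o picks the oldest matching process.
    pid=$(pgrep -o -x "$1") || pid=
    age=
    if [ -n "$pid" ]; then
        # etimes = elapsed time since start, in seconds (assumes procps-ng).
        age=$(ps -o etimes= -p "$pid" | tr -d ' ')
    fi
    classify_age "$age" "$2"
}
```

A call like `check_daemon_age rsyslogd 120` would go critical for a daemon that never survives more than a second. The price is a short false alarm after every legitimate restart, which I'd happily pay.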
The point is that when daemons fail, they fail for a reason. Restarting them without removing that reason will only make them fail again. "Doing the same thing over and over again expecting different results is madness" - a line often attributed to Albert Einstein.
Not so long ago I witnessed a chat between a dev and an op about restarting docker containers. Docker was running on a production host with some important app containerized, but the app had failed. Apparently the docker daemon hadn't gone down and the container had the
--restart=always policy enabled, but with no effect. So the dev asked the op to run the container under perp. So essentially what he wanted was to put a
while :; do program; done loop inside another
while :; do program; done loop.
The proper way of handling cases like this, especially when docker is involved, would be to launch two containers and set up, let's say, keepalived between them. Then, if one fails, the other takes over and the sysadmin on call can
docker inspect the failed one or do any other analysis to understand why it went down in the first place. That being said, this particular situation only strengthened my belief that docker is still a bit too young a technology to be put into production.
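For the keepalived variant, the piece you'd have to write yourself is the health check that keepalived runs (via a vrrp_script block) to decide whether this node should keep the service address. A minimal sketch could look like the one below; the function name and the container name are hypothetical, and the only docker call used is docker inspect with a format string. Note what it does not do: it never restarts anything, it only reports.

```shell
#!/bin/sh
# Health check meant to be run by keepalived: exit status 0 = healthy,
# anything else = let the other node take over.
# $1 = container name (hypothetical, e.g. "app")
container_running() {
    # docker inspect fails if the container does not exist at all.
    state=$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null) || return 1
    # .State.Running is printed as "true" or "false".
    [ "$state" = "true" ]
}
```

A check script would simply end with `container_running app || exit 1`. The failed container stays where it is, untouched, waiting for someone to look at it.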
There is a place for automatic restarts. The reason perpd was introduced was to replace cron running php scripts every minute - loading all the includes into php memory was quite heavy, and it made no sense to do that just to check that the previous invocation of the script was still running and then exit. So perpd makes things a bit lighter on the server by restarting the process as soon as it exits. Of course, a different approach could be taken and the script could be written as a proper daemon waiting for new events to show up (in this particular case inotify or incron could be used to notify the daemon about new files waiting for processing), but we have what we have, and perpd seems to be quite a reasonable piece of software for what it does in this case.
When it comes to critical network services, we want them to be reliable. For that, we need to understand what makes them fail so that we can fix the root causes of such problems. Without that, we might as well do away with on-call rotas and handle every incident with nagios' event handlers or software like perp.