You know the ELK stack, right? Elasticsearch + Logstash + Kibana is now probably a standard tool among system administrators and developers. But if you install ELK and think you're done with your centralized logging system - you're wrong.
Recently I picked up a new idiom - "different strokes for different folks". While I didn't necessarily agree with how it was used then, I think I can use it now. I'm a great fan of simplicity and I love the Unix philosophy of one tool doing one job well. Of course I also love awk and Perl scripting, but I very often use standard system utilities like grep, sed, tr, wc and so on to perform simple, one-off tasks. When I need to check something in the logs, I'm more likely to log onto the server and just grep them. In contrast, some of my colleagues prefer to open Kibana in their browsers, and some like Graylog2 more (given a choice between the two, I would also go with the latter). And that's really it - "different strokes for different folks".
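To make that grep-on-the-server habit concrete: for a quick, one-off question like "how many 500s did we serve, and to whom?", a short pipeline is usually all I need. The sample access log below is made up for illustration:

```shell
# Create a tiny, made-up access log so the commands below have something to chew on.
printf '%s\n' \
  '10.0.0.1 - - [12/Mar/2016:10:00:01 +0000] "GET / HTTP/1.1" 200 512' \
  '10.0.0.2 - - [12/Mar/2016:10:00:02 +0000] "GET /api HTTP/1.1" 500 43' \
  '10.0.0.1 - - [12/Mar/2016:10:00:03 +0000] "POST /login HTTP/1.1" 500 87' \
  > /tmp/access.log

# How many 500s? (status is surrounded by spaces in this log format)
grep -c ' 500 ' /tmp/access.log
# prints 2

# Which clients produced them? ($9 is the status field here)
awk '$9 == 500 { print $1 }' /tmp/access.log | sort | uniq -c | sort -rn
```

No indexing, no browser, no waiting - which is exactly the appeal when the question is simple.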
Systems like the ELK stack are really useful mostly for visualising the data hidden in your log files and correlating it. You can do a lot of fancy stuff, but you also have to be willing to pay the price.
Recently I've been working on a new centralized logging system for a bunch of servers generating a pretty large volume of logs - most of them coming from web servers and PHP applications. The need arose to upgrade it, since it was getting slower and slower, and sometimes it even caused applications to process data at a highly unacceptable rate. I identified a few bottlenecks, but the biggest one was our central logging server. When I started working on this I was a fairly new addition to the team and didn't really know how exactly the logs were handled. What was really weird to me was that none of the logs were actually stored on the servers generating them - they were only forwarded to the central server, and forwarded in an unreliable fashion. And on the central server some of them were discarded based on facility and priority. Enough said - it wasn't a perfect setup.
What is really important to me, especially being the command-line-loving type that I am, is having access to raw log entries. But it's not only my preference that matters here. When using software like the ELK stack you usually want flexibility and speed. It's quite common to index many fields to speed up queries - and to split the messages into many fields in the first place. This requires a lot of processing power, uses a lot of I/O and takes up much more storage than raw log files. Storage is cheap, I know. But still, it's cheaper to have less than more. And I/O isn't that cheap.
My point is that just as "different folks" prefer "different strokes", the same goes for different tasks. There is the kind that benefits from visualisation and the ability to correlate information coming from various sources, and there is the kind that benefits from running command-line tools and simple scripts against text files. Our fancy charts, graphs and maps are often used to display metrics as they come in; they operate on a live stream of logs more often than on archived ones. But sometimes you also have to go back to quite old logs to debug a problem or just track something down. It's really convenient to have a few days' worth of your logs in ELK, but it's also important to have access to older ones, and you probably don't want your Elasticsearch indices bloated.
When designing a centralized logging system you have to take a few factors into account:
- you may need to debug an issue quickly, and logging to a remote server is always delayed
- your central logging server may become a SPoF, crash, have its storage corrupted, etc.
- your clients may start generating more messages than the server can handle
- you may want to use tools that operate on raw text files
- you may want to, or be forced to, keep more logs than your ELK (or whatever) can store in an indexed form
- you probably don't want to end up locked into one particular aggregation and correlation solution
These are the lessons I learned, and that's why, while building the new system, I decided to:
- store the logs on each generating server as well - even if only for a short while, they are always available on local storage
- containerize the ELK stack in order to be able to deploy a new node quickly, should the need for clustering arise
- use rsyslog's RELP1 as a fast and reliable transport protocol
- apart from sending logs to Logstash, also store them in raw format on the central server's storage, in a neat directory/filename structure
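As a rough sketch of the last two points: rsyslog's omrelp/imrelp modules handle the RELP transport, and a dynamic file template on the server gives you the raw per-host directory layout. The hostname, port and paths below are placeholders, not the actual production values:

```
# client side: forward everything over RELP (target/port are examples)
module(load="omrelp")
action(type="omrelp" target="logs.example.com" port="2514")

# server side: accept RELP and write raw files per host and program
module(load="imrelp")
input(type="imrelp" port="2514")
template(name="RawPerHost" type="string"
         string="/var/log/remote/%hostname%/%programname%.log")
action(type="omfile" dynaFile="RawPerHost")
```

The server can, of course, also feed the same stream to Logstash with a second action - storing raw files and indexing are not mutually exclusive.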
So the main message I wanted to convey in this article is that you should really think about what you expect from your central logging system. But you should also take into account that in many cases raw log files are irreplaceable. You should keep them even if you don't see the need - it's cheap, and they may come in handy when things go wrong with more complex solutions.
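Keeping raw files cheap mostly comes down to compressing and eventually expiring them, which logrotate does well. A minimal sketch, assuming your centrally stored raw logs land under a path like /var/log/remote (an assumption, adjust to your layout):

```
# hypothetical path for centrally stored raw logs
/var/log/remote/*/*.log {
    daily
    rotate 365
    compress
    delaycompress
    missingok
    notifempty
}
```

A year of compressed text logs is usually a rounding error on modern storage, and zgrep works on the archives just as grep does on the live files.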
1. Reliable Event Logging Protocol ↩
If you like this article please consider sharing it with your friends.