Elephant-watching

2016-01-10T20:38:55 GMT

Proper monitoring is the key to keeping your system in good health, and it's especially important in the database field. No matter whether you have a busy system processing a lot of transactions each second or you just want to keep tabs on a data warehouse handling only a few heavy queries, you want to graph most of the metrics in order to see how they're trending. There are many solutions for doing that, but let me tell you a secret - in this endeavour Grafana will be your best friend.

This article was published by compose.io as a part of the Write Stuff programme. Unlike other articles on my blog, this one is licensed under the terms of CC BY-NC-SA 4.0.

The goal

Usually when talking about monitoring we really mean monitoring and alerting - the two are a bit different, but complementary, and there's very little point in doing one without the other. We can monitor events and metrics. When an event occurs we may decide to alert on it, but usually it's better to act beforehand.

Probably the best way to keep track of your metrics' trends is to visualise them with graphs. And that's where Grafana comes in. Unlike one of my favourite systems - Munin - Grafana allows the creation of dynamic graphs and lets you choose a time period in a very flexible fashion.

For busy OLTP systems it's important to have a live overview of what's happening, and the biggest pain with Munin was the need to wait (by default 5 minutes) for the graphs to be updated. Thanks to Grafana (and a few other parts of the pipeline discussed below) this is no longer an issue.

But if you want to view your metrics over quite a long period of time, you may experience a slowdown, since most of the work of generating graphs is done in your browser. But have no fear, making Grafana generate a static PNG file is just a few clicks away!

One thing to remember though - your dashboard won't be complete if you include only database-related graphs in it. To have a proper monitoring setup for your precious RDBMS, you should also have a clear view of system-level metrics like CPU usage (including I/O waits), system load, memory usage and of course disk-related statistics like I/O activity (both IOPS and latency), swap activity and usage, and disk space utilisation. You shouldn't forget about network activity either; sometimes it's also useful to see hardware-level metrics like interrupts graphed over a period of time, though these may be less important if you're running your database in a virtualised environment, which happens more often than most DBAs would like.

Another question worth answering before you start to work on your monitoring dash is what is most important to you in your setup or, to express the same idea in a slightly different way, what breaks most often for you ;-) When you look at your dashboard, you want to know at first glance when something is wrong. That's why you need to think about the placement of your graphs, so that the most important ones will be clearly visible.

The pipeline

Grafana is just one part of a modern graphing pipeline - it is responsible for visualising metrics and putting the graphs together into dashboards, and it also provides a nice WebUI simplifying those tasks.

But to be able to create graphs you first need data points. Usually a full-stack graphing solution built around Grafana also includes Graphite for transporting and storing metrics and CollectD for gathering them.

While Graphite is quite a complex piece of software consisting of a few autonomous daemons, CollectD is very simple and fast. This allows almost continuous data point collection, with very small gaps of, say, 5 seconds between them. That's a huge difference from the default Munin configuration mentioned above.
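For instance, the collection interval is a single global option in collectd.conf; a minimal sketch (assuming the stock config layout - the packaged default interval is usually 10 seconds):

    # /etc/collectd/collectd.conf (fragment)
    # Gather every configured metric every 5 seconds
    Interval 5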

A scalable setup of those components might look like this:

                                             <==>  collectd
                       <==>  carbon-relay 1  <==>  collectd
whisper  <==>  carbon                     
                       <==>  carbon-relay n  <==>  collectd
                                             <==>  collectd 

Each monitored host runs collectd, which connects to one of the carbon-relays (a part of the Graphite suite); each relay forwards the received data to a central server, where carbon (another part of the Graphite suite) uses whisper (yet another one) to write the collected metrics to files in an organised structure. It's also possible to have multiple instances of carbon and whisper running to provide even better scalability, HA and backups.
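On the collectd side, pointing a monitored host at one of the relays usually comes down to the write_graphite plugin; a minimal sketch, with the relay hostname as a placeholder:

    # /etc/collectd/collectd.conf (fragment) - ship metrics to a carbon-relay
    LoadPlugin write_graphite

    <Plugin write_graphite>
      <Node "relay">
        Host "carbon-relay-1.example.com"   # placeholder - point at your relay
        Port "2003"                         # plaintext carbon protocol
        Protocol "tcp"
        Prefix "collectd."                  # metrics land under collectd.*
        StoreRates true                     # send rates instead of raw counters
      </Node>
    </Plugin>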

I won't go into details about setting up the full pipeline since that's outside the scope of this article. Suffice it to say it's a job for sysadmins, and DBAs shouldn't really be concerned with anything other than configuring CollectD and creating graphs and dashboards. The rest of this article will cover just that.

The practice

Some time ago I learned about a new "sport" - birdwatching. Well, I'd like to coin the notion of elephantwatching now ;-) Actually, most of the kudos for this should go to Sebastian "tokkee" Harl, since it was he who used the phrase "Watch your elephants" in his presentation given at PGConf.EU 2012 in Prague.

So the goal is to have a comprehensive database statistics dashboard presenting the most important metrics in a clear and visible manner, one that communicates any problems in an obvious way. To achieve this goal we have to combine a few of CollectD's plugins. First, we're going to need to configure the basic ones like "cpu", "memory", "disk" or "df". This is really simple; in some cases - like "cpu" for example - all you need to do is LoadPlugin it. And again, this can and should be handled by your friendly sysadmin colleagues.
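A rough sketch of what that collectd.conf fragment might look like; the device name and mount point are just examples, so adjust them to your own storage layout:

    # /etc/collectd/collectd.conf (fragment) - basic system-level plugins
    LoadPlugin cpu
    LoadPlugin load
    LoadPlugin memory
    LoadPlugin swap
    LoadPlugin disk
    LoadPlugin df

    <Plugin disk>
      Disk "sda"              # example device
      IgnoreSelected false    # collect only the disks listed above
    </Plugin>

    <Plugin df>
      MountPoint "/var/lib/postgresql"   # example: the filesystem holding the data directory
      IgnoreSelected false
    </Plugin>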

But there are also some plugins that we can bend to fit our specific, database-oriented needs. Sebastian, in his presentation, gives a great example with the "processes" plugin - we can use it to group processes and graph things like memory used by PostgreSQL's autovacuum workers, I/O utilised by the background writer, CPU time used by the statistics collector and so on. This can complement system-level metrics, like general CPU or I/O activity, in a very interesting way.
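A hedged sketch of such a grouping - the ProcessMatch regexes below are only illustrative, since the exact process titles differ between PostgreSQL versions and setups:

    # /etc/collectd/collectd.conf (fragment) - group PostgreSQL processes by role
    LoadPlugin processes

    <Plugin processes>
      # ProcessMatch <group name> <regex matched against the process title>
      ProcessMatch "pg_autovacuum" "postgres: autovacuum"
      ProcessMatch "pg_bgwriter"   "postgres: .*writer process"
      ProcessMatch "pg_stats"      "postgres: stats collector"
    </Plugin>

Each group then gets its own set of metrics (RSS, CPU time, disk octets and so on), which is exactly what's needed for the autovacuum, background writer and statistics collector graphs mentioned above.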

And then there's the postgresql CollectD plugin, with a nice set of default metrics including TPS, queries and I/O, available per cluster as well as per table.

Using the predefined metrics can be fine, but with more active databases we usually also want to collect some custom ones from the totally awesome PostgreSQL statistics collector. This is of course possible with this plugin - thanks to the Query block you can define a custom Statement, fetch its results and store them as metrics. One important thing to remember though - use the proper type, e.g. counter or gauge; some of the metrics, like the number of scans or tuples fetched, are always increasing, while others, like the number of connections, are momentary values. You may want to take a look at types.db and grep it for pg_.
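To make this a bit more concrete, here is a rough sketch of one gauge-style and one counter-style custom query; the database name, credentials and queries are placeholders, and the pg_scan type is just one example - grep your types.db to confirm which pg_ types your collectd build ships:

    # /etc/collectd/collectd.conf (fragment) - postgresql plugin with custom queries
    LoadPlugin postgresql

    <Plugin postgresql>
      # Number of connections: a momentary value, hence a gauge
      <Query backends>
        Statement "SELECT count(*) AS count FROM pg_stat_activity;"
        <Result>
          Type "gauge"
          InstancePrefix "backends"
          ValuesFrom "count"
        </Result>
      </Query>

      # Sequential scans per table: ever-increasing, hence a counter-like type
      <Query seq_scans>
        Statement "SELECT relname, seq_scan FROM pg_stat_user_tables;"
        <Result>
          Type "pg_scan"
          InstancePrefix "seq"
          InstancesFrom "relname"
          ValuesFrom "seq_scan"
        </Result>
      </Query>

      <Database mydb>           # placeholder database name
        Host "localhost"
        User "collectd"         # placeholder credentials
        Password "secret"
        Query backends
        Query seq_scans
      </Database>
    </Plugin>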

When you're done, your dashboard might look somewhat like this:

[screenshot: PostgreSQL dashboard in Grafana]

As you can see, Grafana gives you a lot of freedom in visualising your data; you can choose between different graph types (line, area, stacked, etc.), you can present a single statistic as a number, and you can also show a graph in its background.

Presenting single stats can be very useful in making your dashboard obvious when it comes to spotting problems. In my example you can see a single stat showing the average cache hit ratio at the top of the image; it's green, but it'll turn red should the ratio drop below a defined threshold. The same can be done with other metrics like CPU usage, I/O wait, swap activity and so on.

Another of Grafana's great advantages over Munin is its powerful query and transformation language. It's very easy to combine metrics together, correlate them, graph an average or a sum, and so on.

Let's take a look at how I created a CPU usage graph:

[screenshot: CPU usage graph definition in Grafana]

As mentioned above, CollectD sends metrics to Graphite, and Whisper then stores them in a hierarchical structure. To make use of a metric in Grafana, you first have to find it in this structure. But it wouldn't be reasonable to create a new dashboard for every monitored server, right? That's why Grafana lets you use variables.

If you take a look at the screenshot, you'll notice I defined 5 variables - each in a simple manner, as a part of the CollectD "path", so system_role for example is defined as collectd.* and environment as collectd.${system_role}.*. Selectors for each variable are available at the top of your dashboard, and this way your dash simply becomes reusable.
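A sketch of what those definitions can look like; only system_role and environment come from the description above, the remaining names are hypothetical stand-ins for whatever your own metric path actually contains:

    # Grafana template variables, each defined as a Graphite query
    system_role : collectd.*
    environment : collectd.${system_role}.*
    datacenter  : collectd.${system_role}.${environment}.*                                (hypothetical)
    cluster     : collectd.${system_role}.${environment}.${datacenter}.*                  (hypothetical)
    server      : collectd.${system_role}.${environment}.${datacenter}.${cluster}.*       (hypothetical)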

In this particular graph there are two data series defined: A - named (using the alias() function) user+system, and B - wait. As you can see, I'm using the summarize() function - this way the graph is smoother than it would be if all the data points were presented. Actually, I should have used avg in summarize() instead of sum; I only realised that after capturing the screenshot. You can also see that for series A I use sumSeriesWithWildcards() - this allows me to combine CPU time spent in userspace and in kernel space - the function sums series across the 8th node (0-based), so {user,system} in this case. Apart from that I'm averaging over all the system CPUs (nodes 6 and 7) and, in the end, over all the servers selected (you can select more than one).
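Put together, the target for series A might then look roughly like the sketch below. Treat it purely as an illustration: the variable names other than system_role and environment are the hypothetical ones from above, the exact metric path and node positions depend on your hostname scheme and write_graphite settings, and it already uses 'avg' in summarize() as noted:

    alias(
      summarize(
        averageSeriesWithWildcards(
          sumSeriesWithWildcards(
            collectd.${system_role}.${environment}.${datacenter}.${cluster}.${server}.cpu.*.{user,system},
            8),
          5, 6, 7),
        '1min', 'avg'),
      'user+system')

Reading from the inside out: sumSeriesWithWildcards() collapses node 8 (user plus system time), averageSeriesWithWildcards() averages over the per-CPU nodes and the selected servers, summarize() buckets the result into one-minute averages, and alias() names the series.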

This is quite a simple use case, but there are many more functions to choose from.

The conclusion

I'd like to end this article with a reminder that graphing and alerting are two elements of monitoring that should always go together.

I hope this little tutorial will help you set up your Grafana dash for PostgreSQL, but please do remember that a good alerting solution should also be in place.

Also, there are many other interesting projects out there to help with PostgreSQL monitoring, to name only a few:


If you like this article, please consider sharing it with your friends.