Personal Workflow Blog


Sunday, 24 February 2013

When having good relationships with package maintainers can also be a curse

I advise every user to use only the packaged version of Munin. Here's a short article explaining why I'm reluctant to ask users to use the official tarball directly.

I became the upstream of Munin a while ago now. As such, I'm in contact with package maintainers. They take the official releases and cram them into their distribution of choice[1].

I have to admit that the epic upstream-vs-packagers war stories found throughout the web are very far from the truth here. The maintainers are a charm to work with: often challenging and demanding, but always because of a real need. And that's quite a good thing, as I'm still a rookie in terms of open source project management, so I'm quite grateful when they gently point out my mistakes[2].

Yet, this nice team comes with a price. Since we mostly hang out on IRC together, there is way more inter-distro communication than for most software. But I'm the sole owner of the tarball "distro".

And as I don't like building everything from source, I obviously use a distro myself. There, since the packaging is so nicely done, I don't feel like taking the hassle of using my own tarball to test it. I just build a package for my distro out of the release code.

That's also a curse: although I test the code, I only seldom test the packaging. This means I cannot really advise anyone to use the tarball, or the raw git code, since even I don't do it.

But that said, I still think I'm the luckiest upstream around. Thanks, guys!

Notes

[1] Be it Linux-based like Gentoo or Red Hat, BSD-based like FreeBSD or OpenBSD, or even multi-kernel like Debian

[2] Defaulting to CGI graphing was a move that was way too premature, end-user-wise. So thanks to them, it defaults to cron again

Friday, 1 February 2013

Avoid those milli-hits in Munin

A recurring question on IRC is: "why do I have 500 million hits/s in my graph?"

It turns out that they are really seeing 500 m hits/s, and that in the metric system a lower-case m means milli, not mega. This scaling is done automatically by RRD when it formats values for the graph.

To avoid this, just set graph_scale no in the plugin's config section.
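For example, a plugin's config output could look like this (graph and field names are illustrative):

```
graph_title Apache hits
graph_vlabel hits / s
graph_scale no
hits.label hits
```

With graph_scale no, RRD prints the raw number (e.g. 0.5) instead of applying a metric prefix (500m).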

Thursday, 12 July 2012

Waiting for Munin 2.0 - Break the 5-minute barrier!

Every monitoring software has a polling rate. It is usually 5 minutes, because that's the sweet spot between frequent updates and low overhead.

Munin is no different in that respect: its data-fetching routines have to be launched every 5 minutes, otherwise you'll face data loss. And this 5-minute period is deeply ingrained in the code, so changing it is possible, but very tedious and error-prone.

But sometimes we need a much finer sampling rate. Sampling every 10 seconds lets us track fast-changing metrics that would otherwise be averaged out. Yet changing the whole polling process to cope with a 10 s period is very hard on the hardware, since now every update has to finish within those 10 seconds.

This triggered an extension in the plugin protocol, commonly known as supersampling.

Supersampling

Overview

The basic idea is that fine precision should be reserved for selected plugins only. It also cannot be triggered from the master, since the overhead would be far too big.

So we just let the plugin sample the values itself, at whatever rate it deems adequate. Then, on each polling round, the master fetches all the samples accumulated since the last poll.

This enables various constructions, mostly around streaming plugins, to achieve highly detailed sampling with very little overhead.

Notes

This protocol is currently completely transparent to munin-node, which means it can be used even with older (1.x) nodes. Only a 2.0 master is required.

Protocol details

The protocol itself is derived from the spoolfetch extension.

Config

A new directive, update_rate, is used. It enables the master to create the RRD with an adequate step.

Omitting it leads RRD to average the supersampled values down to the default 5-minute rate, which means data loss.
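For example, a plugin sampling every 10 seconds would announce it in its config section (the other directives here are illustrative):

```
graph_title Disk latency (10 s resolution)
update_rate 10
latency.label latency
```

The master then creates the RRD with a 10-second step instead of the default 300.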

Notes

The heartbeat is always 2 steps long, so failing to send all the samples results in unknown values, as expected.

The RRD file size stays the same in the default config, as all the RRAs are sized proportionally to update_rate. This means you keep as many data points as with the default, but for a shorter time: with a 10-second step instead of the 5-minute default, the same number of rows spans 1/30th of the time.

Fetch

When spoolfetching, the epoch is also sent in front of the value. Supersampling is then just a matter of sending multiple epoch/value lines with monotonically increasing epochs. Note that since the epoch is an integer value for rrdtool, the smallest granularity is 1 second. For the time being, the protocol itself also mandates integers, but one can easily imagine that, with another database as a backend, an extension could be hacked together.
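A supersampled fetch response therefore carries several timestamped lines per field instead of a single value, like this (field name and numbers are illustrative):

```
load.value 1341913260:0.25
load.value 1341913270:0.31
load.value 1341913280:0.27
```

The master feeds each epoch:value pair straight into rrdtool, so no sample is lost between two polling rounds.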

Compatibility with 1.4

On older 1.4 masters, only the last sampled value makes it into the RRD.

Sample implementation

The canonical sample implementation is multicpu1sec, a contrib plugin on GitHub. It is also a so-called streaming plugin.

Streaming plugins

When called, these plugins fork a background process that streams the output of a system tool into a spool file. In multicpu1sec, that tool is mpstat with a period of 1 second.
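The shape of such a plugin can be sketched as follows. This is not the real multicpu1sec code: the spool path and field name are made up, and a cheap loadavg read stands in for streaming mpstat.

```shell
#!/bin/sh
# Sketch of a streaming Munin plugin. The real multicpu1sec streams
# "mpstat -P ALL 1" instead of this demo sampler; spool path is an assumption.
SPOOL="${SPOOL:-/tmp/demo1sec.spool}"

take_sample() {
    # append one "field.value epoch:value" line to the spool
    epoch=$(date +%s)
    value=$(cut -d' ' -f1 /proc/loadavg 2>/dev/null || echo 0)
    printf 'load.value %s:%s\n' "$epoch" "$value" >> "$SPOOL"
}

case "$1" in
    acquire)
        # the forked background streamer: one sample per second
        while :; do take_sample; sleep 1; done
        ;;
    config)
        printf 'graph_title Demo 1 s load\n'
        printf 'update_rate 1\n'
        printf 'load.label load\n'
        ;;
    *)
        # fetch: hand all spooled samples to the master, then truncate
        cat "$SPOOL" 2>/dev/null
        : > "$SPOOL"
        ;;
esac
```

On fetch, the master receives every line accumulated since the last poll, which is exactly the multiple epoch/value form described above.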

Undersampling

Some plugins are on the opposite side of the spectrum, as they only need a lower precision.

It makes sense when:

  • data should be kept for a very long time
  • data is very expensive to generate and doesn't vary fast

Monday, 20 June 2011

Enhance RRD I/O performance in Munin 1.4 and Scale

As with most RRD-based monitoring software (Cacti, Ganglia, ...), Munin is quite difficult to scale.

The bad part is that updating lots of small RRD files looks like pure random I/O to the OS, as stated in their documentation.

The good part is that we are not alone, and the RRD developers tackled the issue with rrdcached. It spools the updates and flushes them to disk in batches, or whenever an RRD read command (such as graphing) needs them. That's why it scales well with CGI graphing; otherwise, munin-graph reads every RRD and therefore forces a flush of the whole cache.

And the icing on the cake is that, although it is only fully integrated into Munin 2.0, you can use it right away in the 1.4.x series.

You only need to define the environment variable RRDCACHED_ADDRESS while running the scripts accessing the RRDs.

Then you have to take munin-graph out of munin-cron and run it on its own schedule, usually only every hour or so, so that data accumulates in rrdcached before graphing flushes it all to disk.
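A possible cron setup on a 1.4 master could look like this (the file paths and socket address are illustrative and depend on your distro):

```
# /etc/cron.d/munin (illustrative; adjust paths to your installation)
RRDCACHED_ADDRESS=unix:/var/run/rrdcached.sock

# update, limits and html every 5 minutes, without graphing
*/5 * * * *  munin  /usr/share/munin/munin-update && /usr/share/munin/munin-limits && /usr/share/munin/munin-html

# graph only once an hour, letting rrdcached batch the writes in between
7 * * * *    munin  /usr/share/munin/munin-graph
```

Since vixie-cron exports plain VAR=value lines to the jobs, every Munin script in this file sees RRDCACHED_ADDRESS and talks to the daemon instead of hitting the disk directly.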

Updating to 2.0 is also an option to get real CGI support (CGI graphing exists in 1.4, but its performance is nowhere near decent).

Thursday, 16 June 2011

Autovivification in Perl : Great Idea but also Huge Trap - Another Leaky Abstraction...

Autovivification is one of Perl's really great design successes.

It all comes down to this: you don't need to worry about existence before dereferencing something.

That means that to set a nested hash value, you only need to write:

$h->{foo}{bar} = "value";

And that will work out of the box. Perl will happily create all the data-structure for you.

So, now a little coding test, what does the following code output ?

my $a;

if ($a->{foo}{bar}) {
   print "Found foo/bar\n";
}

if ($a->{foo}) {
   print "Found foo\n";
}

Naively, it shouldn't output anything, right?

Not so fast. Upon a careful read of Perl will happily create all the data-structure for you, we can put the emphasis on one word, create: Perl will happily create all the data-structure for you.

That would be just perfect, except that Perl creates the structure whenever it needs it, even if it is only for reading.

And now you understand the catch : a read operation can result in a write one.

As Uncle Ben (from Spider-Man) said[1]: With Great Power Comes Great Responsibility.

Dagfinn Ilmari Mannsåker showed me a nice autovivification module on CPAN that fixes this behavior and enables fine tuning of the process.
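Without installing anything, a plain core-Perl workaround is to short-circuit before the nested dereference ever happens, as in this small sketch:

```perl
use strict;
use warnings;

my $a;

# Short-circuiting stops Perl before it ever dereferences undef,
# so no intermediate hash springs into existence on this read.
if ($a && $a->{foo} && $a->{foo}{bar}) {
    print "Found foo/bar\n";
}

print defined $a ? "autovivified\n" : "still undef\n";    # prints "still undef"
```

The CPAN module is nicer, though, since `no autovivification;` fixes every read in a lexical scope instead of requiring a guard at each call site.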

I really think that the fact that creation also happens when merely querying a value is a real bug in Perl itself, or at least a bug in the design of the feature.

Notes

[1] Voltaire, Franklin D. Roosevelt and others said something very similar, but they are not as geeky.
