Personal Workflow Blog


Tag - performance


Monday, 2 December 2013

Experimenting with a C munin node

Core plugins are designed for simplicity...

As I wrote earlier, Helmut rewrote some core plugins in C. It was mainly done with efficiency in mind.

Since those plugins only parse a single /proc file, there is no need to endure the many forks inherent in even trivial shell programming. It also acknowledges the fact that the measuring system should be as light as possible.

Munin plugins are highly driven towards simplicity, so having shell plugins is quite logical. They serve as educational samples for users writing their own, while being quite easy to code and debug for the developers. Since their impact on current systems is very small, there is not much incentive to change.
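For illustration, here is a minimal sketch of such a shell plugin (the field name and title are made up, but the "config" / fetch protocol is the real one). Even this trivial one-value plugin forks a subshell and a cut on every run:

#! /bin/sh
# Minimal munin plugin sketch: called with "config" it describes
# the graph; called without arguments it prints the current value.

if [ "$1" = "config" ]; then
    echo "graph_title Load average"
    echo "load.label load"
    exit 0
fi

echo "load.value $(cut -d' ' -f1 /proc/loadavg)"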

... but efficiency is coming!

Nonetheless, monitored systems are now becoming quite small, mostly thanks to embedded systems like the Raspberry Pi. This means that the available processing power is much lower than on normal nodes[1].

The compiled C approach for plugins therefore has a new rationale.

Notes

[1] Usually datacenter nodes are at the high end of the spectrum rather than the low end.

Monday, 20 June 2011

Enhance RRD I/O performance in Munin 1.4 and Scale

As with most RRD-based monitoring software (Cacti, Ganglia, ...), Munin is quite difficult to scale.

The bad part is that updating lots of small RRD files looks like pure random I/O to the OS, as stated in their documentation.

The good part is that we are not alone: the RRD developers tackled the issue with rrdcached. It spools the updates and flushes them to disk in batches, or when they are needed by an RRD read command such as graphing. That's why it scales well when using CGI graphing. Otherwise, munin-graph will read every RRD and therefore force a flush of the whole cache.

And the icing on the cake is that, although it is only fully integrated into Munin 2.0, you can use it right away in the 1.4.x series.

You only need to define the environment variable RRDCACHED_ADDRESS while running the scripts that access the RRDs.

Then you have to remove the munin-graph part from munin-cron and run it on its own schedule, usually only every hour or so, so that data accumulates in rrdcached before graphing flushes it all to disk.
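A minimal sketch of the whole setup (the socket path, journal directory, and script locations are assumptions; adjust them to your layout):

# Start rrdcached with a journal, listening on a Unix socket.
rrdcached -l unix:/var/run/rrdcached.sock \
          -j /var/lib/rrdcached/journal -w 1800 -z 1800

# In the munin user's crontab: point all RRD access at the daemon.
RRDCACHED_ADDRESS=unix:/var/run/rrdcached.sock
# munin-cron, edited to no longer call munin-graph, every 5 minutes:
*/5 * * * *   /usr/bin/munin-cron
# graphing on its own, only once an hour:
0 * * * *     /usr/share/munin/munin-graph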

Updating to 2.0 is also an option, for real CGI support. (CGI does exist in 1.4, but its performance is nowhere near decent.)

Saturday, 26 June 2010

Waiting for Munin 2.0 - Performance - Asynchronous updates

munin-update is the fragile link in the munin architecture. A missed execution means that some data is lost.

The problem: updates are synchronous

In Munin 1.x, updates are synchronous: the value of each service[1] is the one that munin-update retrieves on each scheduled run.

The issue is that munin-update has to ask every service on every node for their values. Since the values are only computed when asked, munin-update has to wait quite some time for every value.

This very simple design enables munin to have the simplest possible plugins: they are completely stateless. While this is one of munin's great strengths, it deals a severe blow to scalability: more plugins per node obviously means slower retrieval.

Evolving Solutions

1.4: Parallel Fetching

1.4 addresses some of these scalability issues by implementing parallel fetching. It takes into account that most of the execution time of munin-update is spent waiting for replies. In 1.4, munin-update can query up to max_processes nodes in parallel.
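The setting lives in munin.conf on the master; the value here is only an example:

# /etc/munin/munin.conf
max_processes 16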

Now the I/O part is becoming the next limiting factor, since updating many RRDs in parallel amounts to random I/O access for the OS underlying the munin-master. Serializing & grouping the updates will be possible with the new RRDp interface from rrdtool version 1.4 and on-demand graphing. Tomas Zvala even offered a patch for 1.4 RRDp on the ML. It is very promising, but doesn't address the root defect in this design: a hard dependence on regular munin-update runs.

2.0: Stateful plugins

2.0 provides a way for plugins to be stateful. They may schedule their polling themselves; then, when munin-update runs, it only collects already-computed values. This way a missed run isn't as dramatic as in the 1.x series, since data isn't lost. Data collection is also much faster because the real computing is done ahead of time.
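As a hedged illustration of the idea (this is not munin's actual mechanism; the command, field, and path are made up), the expensive work moves into a cron job and the plugin merely replays its result:

#! /bin/sh
# A cron job does the expensive measurement at the service's own
# rate, e.g.:
#   */5 * * * *  expensive_measurement > /var/lib/munin/state/myplugin.value
# The plugin that munin-update calls then just replays that
# precomputed value, so the fetch is nearly instant.
echo "myvalue.value $(cat /var/lib/munin/state/myplugin.value)"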

2.0: Asynchronous proxy node

But changing plugins to be stateful and self-polled is difficult and tedious. It even works against one of the real strengths of munin: having simple & stateless plugins.

To address this concern, an experimental proxy node was created. For 2.0 it takes the form of a couple of processes: munin-async-server and munin-async-client.

The proxy node in detail (munin-async)

Overview

These 2 processes form an asynchronous proxy between munin-update and munin-node. This avoids the need to change the plugins or upgrade munin-node on all nodes.

munin-async-server should be installed on the same host as the proxied munin-node, in order to avoid any network issues. It is the process that regularly polls munin-node. The I/O issue that munin-update faces does not exist here, since munin-async stores all the values by simply appending them to a text file, without any further processing. This file is later read and processed by munin-update on the client side.
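The spirit of the server side, as a rough sketch (this is not the actual munin-async code; the spool path is made up and the real spool format differs):

#! /bin/sh
# Poll one plugin through the munin-node protocol (default port
# 4949), prefix each line with a timestamp, and append everything
# to a flat spool file: appending is sequential, hence cheap, I/O.
printf 'fetch load\nquit\n' \
    | nc localhost 4949 \
    | sed "s/^/$(date +%s) /" >> /var/lib/munin-async/load.spool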

Specific update rates

Having one proxy per node enables polling each of the services there at its own specific update rate.

To achieve this, munin-async-server forks into multiple processes, one for each proxied service. This way each service is completely isolated from the others: it can have its own update rate, it is safe from slowdowns in other plugins, and the information gathering is even completely parallelized.

SSH transport

munin-async-client uses the new native SSH transport of 2.0, which permits a very simple installation of the async proxy.
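On the master, the node address then simply becomes an ssh:// URL. A sketch, where the hostnames and the remote command path are assumptions that depend on your installation:

# /etc/munin/munin.conf on the master
[proxied-node.example.com]
    address ssh://munin@proxied-node.example.com/usr/bin/munin-async-client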

Notes

[1] In 1.2 a service is the same as a plugin, but since 1.4 and the introduction of multigraph, one plugin can provide multiple services.

Monday, 21 June 2010

CGI on steroids with FastCGI, but on a CGI-only server - The FastCGI wrapper

FastCGI is really CGI on steroids

FastCGI is a very common way to increase the performance of a CGI installation. It is based on the fact that the startup of CGI scripts is usually slow, whereas the response itself is quite fast.

So if you have a persistent process, you only pay the startup cost once, and you then experience a real speedup.

FastCGI vs mod_perl (or mod_python, ...)

Once a big fan of mod_perl, I have since converted to FastCGI. mod_perl was for a long time the answer for speeding up Perl CGI scripts. It has a very good track record of stability and has real hooks deep into Apache's request processing.

FastCGI focuses on a different feature set, one that is more relevant today than mod_perl's[1]:

  • It is much simpler to install and configure, especially when running multiple applications.
  • It can connect to a distant application server (running as a different UID, chrooted, or even on a remote host).
  • It can mix scripting languages without any need to compile additional Apache modules.
  • It can be used with several webservers, even closed-source ones: FastCGI is a protocol, not an API.

But steroids do have some side effects

CGI issues

One downside is that your CGI script has to be adapted to FastCGI, and to the fact that the script doesn't exit at the end of each request.

In the real world that's quite easy. Every language commonly used for CGI offers CGI-wrapper libraries that work in a FastCGI context as well as in a plain CGI one.

Webserver issues

Another issue can also come from the webserver. Since CGI is dead simple to implement, even the micro-webserver thttpd implements it.

FastCGI, on the other hand, is a little more difficult to implement, since the webserver needs a container that monitors and calls the FastCGI-enabled script.

A standalone FastCGI container

Fortunately, the FastCGI team provided us with a ready-to-use container and a very simple client that acts as a plain CGI script but proxies requests to the full-blown container.

Since the plain CGI part is a very small native executable, its overhead is negligible compared to the reply time, let alone the startup time of the whole script.

Its installation is also quite straightforward. I just installed the libfcgi package on Debian: it provides /usr/bin/cgi-fcgi.

I created a simple CGI wrapper for my previous munin benchmarking needs:

#! /bin/sh

# Plain CGI entry point: cgi-fcgi forwards each request to the
# FastCGI application behind the Unix socket, starting
# munin-cgi-graph there first if it isn't running yet.
exec /usr/bin/cgi-fcgi -connect /tmp/munin-cgi.sock \
     /usr/lib/cgi-bin/munin-cgi-graph

Notes

[1] Who really needs deep Apache hooks?

Wednesday, 31 March 2010

API Design: Avoid hidden costs of simple features

Programmers are usually like water: they always take the path of least resistance.

Let's see how to use this fact to predict how an API will be used when you design it.

Initial API

Consider this very simple DB API that consumes a connected ResultSet and presents a disconnected version of it.

class DisconnectedResultSet {
        public DisconnectedResultSet(ResultSet rs);
        public boolean next();
        public Object getObject(int col_idx);
}

Its usage is quite easy:

while (drs.next()) {
        int col_idx = 1;
        drs.getObject(col_idx++); // Do something w/ 1st col
        drs.getObject(col_idx++); // Do something w/ 2nd col
        //...
}

Just a little evolution...

Since the DisconnectedResultSet is disconnected, we can imagine that it should implement a rewind() method in order to be used several times without running the initial query again. We now have an updated class:

class DisconnectedResultSet {
        public DisconnectedResultSet(ResultSet rs);
        public boolean next();
        public Object getObject(int col_idx);
        public void rewind(); // Be able to rewind it
}

And its classic usage:

while (drs.next()) {
        // do stuff...
}
// ...
drs.rewind();
while (drs.next()) {
        // do something else with the same data...
}
// ...
drs.rewind();
while (drs.next()) {
        // do something else with the same data...
}
// ...

A new need comes

A new need arises: check whether the DisconnectedResultSet is empty, in order to avoid sending headers for an empty result.

The usual way is to send them once while iterating:

boolean is_headers_sent = false;
while (drs.next()) {
        if (! is_headers_sent) { 
                send_headers(); 
                is_headers_sent = true;
        }
        // do something else with the same data...
}

But since there is a nice rewind() method just waiting to be used, the code might become:

if (drs.next()) {
        send_headers(); 
}
drs.rewind();
while (drs.next()) {
        // do something else with the same data...
}

Now this code isn't generic anymore: it cannot accommodate a plain connected ResultSet, which has no cheap way to rewind.

So, as John Carmack said:

The cost of adding a feature isn't just the time it takes to code it. The cost also includes the addition of an obstacle to future expansion.

That's really true when you design APIs, since their purpose is to last long and to be extended.

So, think twice when you propose an extension "just in case".

The little evolution, revisited...

To solve this case, don't propose a rewind() method, but offer a duplicate() one instead. It provides the same functionality, just in a new object.
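The class might then look like this (a sketch; duplicate() hands back a fresh, independent cursor over the same data):

class DisconnectedResultSet {
        public DisconnectedResultSet(ResultSet rs);
        public boolean next();
        public Object getObject(int col_idx);
        public DisconnectedResultSet duplicate(); // same data, fresh cursor
}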

The usage will be almost the same, as shown below, but since it feels more performance-sensitive, it won't be used as lightly: the boolean is_headers_sent pattern now has a better chance of being used.

while (drs.next()) {
        // do stuff...
}
// ...
drs = drs.duplicate();
while (drs.next()) {
        // do something else with the same data...
}
// ...
drs = drs.duplicate();
while (drs.next()) {
        // do something else with the same data...
}
// ...

It's another example showing that immutable objects are the way to go, but for a different reason this time.

Note: Just finished my March 2010 article, even on time... I'm still trying to keep at least a one-article-per-month blogging rate. So far so good for 2010, still 9 months to go!