Waiting for Munin 2.0 - Performance - Asynchronous updates
munin-update is the fragile link in the munin architecture. A missed execution means that some data is lost.
The problem: updates are synchronous
In Munin 1.x, updates are synchronous: the value of each service[1] is the one that munin-update retrieves on each scheduled run.
The issue is that munin-update has to ask every service on every node for its values. Since the values are only computed when asked, munin-update has to wait quite some time for every value.
This very simple design enables munin to have the simplest possible plugins: they are completely stateless. While this is one of munin's great strengths, it deals a severe blow to scalability: more plugins per node obviously means slower retrieval.
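To make the waiting concrete, here is a minimal Python sketch of one such synchronous run, speaking the plain-text munin-node protocol (port 4949, the list and fetch commands). It is illustrative only: no timeouts, no error handling.

    import socket

    def fetch_all(host, port=4949):
        # One synchronous munin-update pass against a single node.
        sock = socket.create_connection((host, port))
        f = sock.makefile("rw")
        f.readline()                          # banner: "# munin node at <host>"
        f.write("list\n"); f.flush()
        services = f.readline().split()       # one line of service names
        values = {}
        for service in services:
            f.write("fetch %s\n" % service); f.flush()
            reply = []
            while (line := f.readline().strip()) != ".":   # "." ends a reply
                reply.append(line)            # e.g. "load.value 0.42"
            values[service] = reply           # computed only now, so we waited
        f.write("quit\n"); f.flush()
        sock.close()
        return values

Every fetch blocks until the plugin has finished computing, and the fetches run strictly one after another.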
Evolving Solutions
1.4: Parallel Fetching
1.4 addresses some of these scalability issues by implementing parallel fetching. It takes into account that most of the execution time of munin-update is spent waiting for replies. In 1.4, munin-update can ask up to max_processes nodes in parallel.
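In sketch form, reusing the fetch_all helper from the example above; a thread pool stands in for munin-update's workers, capped like the max_processes directive in munin.conf:

    from concurrent.futures import ThreadPoolExecutor

    def update_all(nodes, max_processes=16):
        # Poll up to max_processes nodes at once: the time spent waiting
        # for replies now overlaps instead of adding up node after node.
        with ThreadPoolExecutor(max_workers=max_processes) as pool:
            return dict(zip(nodes, pool.map(fetch_all, nodes)))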
Now the I/O part becomes the next limiting factor, since updating many RRDs in parallel amounts to random I/O access for the underlying munin-master OS. Serializing and grouping the updates will be possible with the new RRDp interface from rrdtool 1.4 and with on-demand graphing. Tomas Zvala even offered a patch for 1.4 RRDp on the ML. It is very promising, but it doesn't address the root defect of this design: a hard dependency on regular munin-update runs.
2.0: Stateful plugins
2.0 provides a way for plugins to be stateful. They can schedule their polling themselves, and when munin-update runs, simply emit the already-computed values. This way a missed run isn't as dramatic as in the 1.x series, since data isn't lost. Data collection is also much faster, because the real computing is done ahead of time.
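Here is a minimal sketch of such a stateful plugin, assuming the timestamped field.value <epoch>:<value> output form; the cache path and the split between a cron-driven poll() and the fetch-time answer are illustrative, not the actual 2.0 API:

    import os, time

    CACHE = "/var/lib/munin/plugin-state/load.cache"   # hypothetical path

    def poll():
        # Runs on the plugin's own schedule (e.g. every minute from cron):
        # do the expensive work now, append one timestamped sample.
        load1 = os.getloadavg()[0]
        with open(CACHE, "a") as f:
            f.write("load.value %d:%.2f\n" % (time.time(), load1))

    def fetch():
        # Runs when munin-update asks: just replay the stored samples.
        # A missed run loses nothing; the samples simply wait here.
        with open(CACHE) as f:
            print(f.read(), end="")
        open(CACHE, "w").close()               # truncate after the handoff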
2.0: Asynchronous proxy node
But changing plugins to be stateful and self-polled is difficult and tedious. It even works against one of munin's real strengths: having simple, stateless plugins.
To address this concern, an experimental proxy node has been created. For 2.0 it takes the form of a couple of processes: munin-async-server and munin-async-client.
The proxy node in detail (munin-async)
Overview
These two processes form an asynchronous proxy between munin-update and munin-node. This avoids the need to change the plugins or to upgrade munin-node on all nodes.
munin-async-server should be installed on the same host as the proxied munin-node in order to avoid any network issues. It is the process that regularly polls munin-node. The I/O issue of munin-update doesn't exist here, since munin-async stores all the values by simply appending them to a text file, without any further processing. This file is later read by the client's munin-update and processed there.
Specific update rates
Having one proxy per node makes it possible to poll each service there at its own specific update rate.
To achieve this, munin-async-server forks into multiple processes, one for each proxied service. This way each service is completely isolated from the others: it can have its own update rate, it is safe from other plugins' slowdowns, and the information gathering is even completely parallelized.
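In sketch form (again reusing fetch_all; a real implementation would fetch only the one service per child rather than discarding the rest):

    import os, time

    def serve(services, spooldir="/var/spool/munin-async"):
        # One child process per proxied service: each child has its own
        # update rate, and a hanging plugin can only stall its own poller.
        for service, interval in services.items():
            if os.fork() == 0:                         # in the child
                while True:
                    reply = fetch_all("localhost").get(service, [])
                    with open("%s/%s.spool" % (spooldir, service), "a") as f:
                        f.write("timestamp %d\n" % int(time.time()))
                        f.write("\n".join(reply) + "\n")
                    time.sleep(interval)               # this service's rate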
SSH transport
munin-async-client uses the new native SSH transport of 2.0, which makes installing the async proxy very simple.
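On the master side this amounts to pointing the node entry at an ssh:// address instead of a TCP one. A hypothetical munin.conf entry could look like this (the exact path and options may differ in the released 2.0):

    [db.example.com]
        address ssh://munin@db.example.com/usr/bin/munin-async-client --spoolfetch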
Notes
[1] In 1.2 a service is the same thing as a plugin, but since 1.4 and the introduction of multigraph, one plugin can provide multiple services.