→ http://munin-monitoring.org/wiki/Documentation
A plugin is best tested under the same conditions as when it is triggered by munin-node; this can be simulated by running munin-run plugin as root. Any second parameter given to this command (such as config) is passed on to the plugin as a parameter.
When a plugin is invoked with the config parameter, it must describe itself by returning a set of fields:
$ sudo munin-run load config
graph_title Load average
graph_args --base 1000 -l 0
graph_vlabel load
graph_scale no
graph_category system
load.label load
graph_info The load average of the machine describes how many processes are in the run-queue (scheduled to run "immediately").
load.info 5 minute load average
The various available fields are described by the “configuration protocol” specification available on the Munin website.
→ http://munin-monitoring.org/wiki/protocol-config
When invoked without a parameter, the plugin simply returns the last measured values; for instance, executing sudo munin-run load could return load.value 0.12.
Finally, when a plugin is invoked with the autoconf parameter, it should return “yes” (and a 0 exit status) or “no” (with a 1 exit status) according to whether the plugin should be enabled on this host.
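The three invocation modes described above can be sketched as a small Python script. This is only an illustration, not the actual load plugin shipped with Munin: the function names (config, fetch, autoconf, run) are hypothetical, and the choice of the 5 minute average is an assumption based on the load.info line of the sample output.

```python
import os


def config():
    # Self-description fields, mirroring the sample "munin-run load config" output.
    return "\n".join([
        "graph_title Load average",
        "graph_args --base 1000 -l 0",
        "graph_vlabel load",
        "graph_scale no",
        "graph_category system",
        "load.label load",
        "load.info 5 minute load average",
    ])


def fetch():
    # os.getloadavg() returns the 1, 5 and 15 minute averages (Unix only).
    five_minutes = os.getloadavg()[1]
    return "load.value %.2f" % five_minutes


def autoconf():
    # "yes" with status 0 when the plugin can work here, "no" with status 1 otherwise.
    if hasattr(os, "getloadavg"):
        return "yes", 0
    return "no", 1


def run(args):
    """Dispatch on the optional first parameter; return (output, exit status)."""
    mode = args[0] if args else "fetch"
    if mode == "config":
        return config(), 0
    if mode == "autoconf":
        return autoconf()
    return fetch(), 0
```

To turn the sketch into an executable plugin, the script would end by calling run(sys.argv[1:]), printing the returned text and exiting with the returned status.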
12.4.1.2. Configuring the Grapher
The “grapher” is simply the computer that aggregates the data and generates the corresponding graphs. The required software is in the munin package. The standard configuration runs munin-cron (once every 5 minutes), which gathers data from all the hosts listed in /etc/munin/munin.conf (only the local host is listed by default), saves the historical data in RRD files (Round Robin Database, a file format designed to store data varying in time) stored under /var/lib/munin/ and generates an HTML page with the graphs in /var/cache/munin/www/.
All monitored machines must therefore be listed in the /etc/munin/munin.conf configuration file. Each machine is listed as a full section with a name matching the machine and at least an address entry giving the corresponding IP address.
[ftp.falcot.com]
address 192.168.0.12
use_node_name yes
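When many machines must be listed, such sections can be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper named render_host_section and an inventory kept in a Python dictionary; it only produces the three lines shown above and nothing more elaborate.

```python
def render_host_section(name, address, use_node_name=True):
    """Return the text of one munin.conf host section (hypothetical helper)."""
    lines = ["[%s]" % name, "address %s" % address]
    if use_node_name:
        lines.append("use_node_name yes")
    return "\n".join(lines)


# Inventory of monitored machines: host name mapped to IP address.
hosts = {"ftp.falcot.com": "192.168.0.12"}

# Sections are separated by a blank line for readability.
print("\n\n".join(render_host_section(n, a) for n, a in hosts.items()))
```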
Sections can be more complex, and describe extra graphs that could be created by combining data coming from several machines. The samples provided in the configuration file are good starting points for customization.
The last step is to publish the generated pages; this involves configuring a web server so that the contents of /var/cache/munin/www/ are made available on a website. Access to this website will often be restricted, using either an authentication mechanism or IP-based access control. See Section 11.2, “Web Server (HTTP)” for the relevant details.
12.4.2. Setting Up Nagios
Unlike Munin, Nagios does not necessarily require installing anything on the monitored hosts; most of the time, Nagios is used to check the availability of network services. For instance, Nagios can connect to a web server and check that a given web page can be obtained within a given time.
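The idea of such a check can be sketched in a few lines of Python. This is not the check_http plugin shipped in nagios-plugins, only an illustration: the function names and the default thresholds are assumptions, while the exit statuses 0, 1 and 2 follow the usual Nagios plugin convention for OK, WARNING and CRITICAL.

```python
import time
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit statuses
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(elapsed, warn_after, crit_after):
    """Map a response time in seconds to a status (thresholds are assumptions)."""
    if elapsed >= crit_after:
        return CRITICAL
    if elapsed >= warn_after:
        return WARNING
    return OK


def check_http(url, warn_after=1.0, crit_after=5.0):
    """Fetch url and return (status, one-line message)."""
    start = time.monotonic()
    try:
        # Give up once the critical threshold is reached.
        with urllib.request.urlopen(url, timeout=crit_after) as response:
            response.read()
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return CRITICAL, "HTTP CRITICAL - %s" % exc
    elapsed = time.monotonic() - start
    status = classify(elapsed, warn_after, crit_after)
    return status, "HTTP %s - %.3fs response time" % (LABELS[status], elapsed)
```

A real plugin would print the message and exit with the status, which is how Nagios learns the outcome of the check.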
12.4.2.1. Installing
The first step in setting up Nagios is to install the nagios3, nagios-plugins and nagios3-doc packages. Installing the packages configures the web interface and creates a first nagiosadmin user (for which it asks for a password). Adding other users is a simple matter of inserting them in the /etc/nagios3/htpasswd.users file with Apache's htpasswd command. If no Debconf question was displayed during installation, dpkg-reconfigure nagios3-cgi can be used to define the nagiosadmin password.
Pointing a browser at http://server/nagios3/ displays the web interface; in particular, note that Nagios already monitors some parameters of the machine where it runs. However, some interactive features such as adding comments to a host do not work. These features are disabled in the default configuration for Nagios, which is very restrictive for security reasons.
As documented in /usr/share/doc/nagios3/README.Debian, enabling some features involves editing /etc/nagios3/nagios.cfg and setting its check_external_commands parameter to “1”. We also need to set up write permissions for the directory used by Nagios, with commands such as the following:
# /etc/init.d/nagios3 stop
[...]
# dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
# dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
# /etc/init.d/nagios3 start
[...]
12.4.2.2. Configuring
The Nagios web interface is rather nice, but it does not allow configuration, nor can it be used to add monitored hosts and services. The whole configuration is managed via files referenced in the central configuration file, /etc/nagios3/nagios.cfg.
It is best to understand a few Nagios concepts before diving into these files. The configuration lists objects of the following types:
a host is a machine to be monitored;
a hostgroup is a set of hosts that should be grouped together for display, or to factor some common configuration elements;
a service is a testable element related to a host or a host group. It will most often be a check for a network service, but it can also involve checking that some parameters are within an acceptable range (for instance, free disk space or processor load);
a servicegroup is a set of services that should be grouped together for display;
a contact is a person who can receive alerts;
a contactgroup is a set of such contacts;
a timeperiod is a range of time during which some services have to be checked;
a command is the command line invoked to check a given service.
According to its type, each object has a number of properties that can be customized. A full list would be too long to include, but the most important properties are the relations between the objects.
A service uses a command to check the state of a feature on a host (or a hostgroup) within a timeperiod. In case of a problem, Nagios sends an alert to all members of the contactgroup linked to the service. Each member is sent the alert according to the channel described in the matching contact object.
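These relations can be pictured as a small data model. The Python sketch below is purely illustrative and is not how Nagios stores its objects: the class layout, the contact names and the e-mail addresses are hypothetical, and e-mail is assumed as the notification channel.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    email: str  # channel through which this contact is alerted (assumed: e-mail)


@dataclass
class ContactGroup:
    name: str
    members: list = field(default_factory=list)


@dataclass
class Service:
    description: str
    host: str            # host (or hostgroup) the check applies to
    check_command: str   # command used to test the feature
    check_period: str    # timeperiod during which the check runs
    contact_group: ContactGroup


def recipients(service):
    """Return the addresses alerted when the service has a problem."""
    return [contact.email for contact in service.contact_group.members]


admins = ContactGroup("admins", [
    Contact("alice", "alice@example.com"),  # hypothetical contacts
    Contact("bob", "bob@example.com"),
])
ftp = Service("FTP", "ftp.falcot.com", "check_ftp", "24x7", admins)
print(recipients(ftp))
```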
An inheritance system allows easy sharing of a set of properties across many objects without duplicating information. Moreover, the initial configuration includes a number of standard objects; in many cases, defining new hosts, services and contacts is a simple matter of deriving from the provided generic objects. The files in /etc/nagios3/conf.d/ are a good source of information on how they work.
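The principle of this inheritance can be sketched in Python. In Nagios configuration files an object names its template with the use directive; the resolver below is a simplified model that assumes a single parent per object (Nagios itself also accepts several), and the alice contact and its e-mail address are hypothetical.

```python
def resolve(name, objects):
    """Return the properties of an object merged with those of its ancestors."""
    obj = objects[name]
    parent = obj.get("use")
    # Start from the inherited properties, then let the object override them.
    merged = resolve(parent, objects) if parent else {}
    merged.update({key: value for key, value in obj.items() if key != "use"})
    return merged


objects = {
    "generic-contact": {"service_notification_period": "24x7"},
    "alice": {"use": "generic-contact", "email": "alice@example.com"},
}

print(resolve("alice", objects))
```

Properties defined directly on an object take precedence over inherited ones, which is what makes a generic object a convenient place for defaults.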
The Falcot Corp administrators use the following configuration:
Example 12.3. /etc/nagios3/conf.d/falcot.cfg file
define contact{
name generic-contact
service_notification_period 24x7