check_oomkiller

Current Version: 2.0
Last Release Date: 2010-09-07
Compatible With: Nagios 3.x
Hits: 99677

Files:
  check_oomkiller.pl.txt - check_oomkiller client plugin
  check_oomkiller.c.txt - check_oomkiller SUID wrapper

LINUX ONLY - Check for OOM-Killer (out-of-memory killer) Activity
During extreme memory shortages, the Linux kernel kills "big memory" processes in self-defense. This plugin checks for such activity and returns a critical status if any has occurred since the previous check.

This plugin was written on and for RHEL4 using Nagios and NRPE, and may need to be tweaked for other distros.

IMPORTANT - This check requires read-only access to the system messages file, /var/log/messages, which by default is not readable by unprivileged accounts. You must therefore either make /var/log/messages readable to the nagios user account (DON'T DO THIS), or write a small compiled wrapper program around the script and install the wrapper as a root-owned SUID executable (DO THIS!).

Installation Instructions

1) Put the Perl script and C program in the nagios/libexec directory on the client system that will be checked.
2) Edit the C program if needed, changing the REAL_PATH definition for your environment.
3) Compile the C program and install it as an SUID application (chown root first, then chmod 4555, since changing ownership can clear the SUID bit).
4) Use the following plugin definition in the nrpe.cfg configuration file on the client system. As with the C program, edit the path if needed for your environment.

   command[check_oomkiller]=/usr/local/nagios/libexec/check_oomkiller

5) Use the following service check definition on the Nagios server to perform the check on monitored systems.

define service{
   use                   generic-service
   host_name             possible-oom-killer-victim
   service_description   OOM Killer
   check_command         check_nrpe60!check_oomkiller
   max_check_attempts    1
}

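Because each run of the OOM-Killer check resets the current status, the service check definition on the Nagios server MUST contain "max_check_attempts 1". With a higher value, Nagios rechecks the service before notifying; the recheck finds no new activity and returns OK, so the service never reaches a hard CRITICAL state. If you don't do this you will NEVER be notified.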
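
The reset behaviour can be illustrated with a minimal Python sketch. This is not the plugin's Perl code: the function name check_since, the byte-offset bookkeeping, and the match pattern are all assumptions made for illustration, and the kernel's exact message wording varies between versions.

```python
import re

# Assumed pattern: kernel lines such as
# "Out of Memory: Killed process 1234 (httpd)."
OOM_RE = re.compile(
    r"out of memory: kill(?:ed)? process (\d+) \(([^)]*)\)", re.IGNORECASE
)

def check_since(log_text, offset):
    """Scan log_text from offset; return (oom_hits, new_offset)."""
    if offset > len(log_text):
        # The log was rotated or truncated, so start from the beginning.
        offset = 0
    hits = OOM_RE.findall(log_text[offset:])
    return hits, len(log_text)

log = (
    "Sep  7 10:00:01 host kernel: Out of Memory: Killed process 4242 (java).\n"
    "Sep  7 10:00:02 host sshd[99]: session opened\n"
)

first, offset = check_since(log, 0)        # first attempt sees the kill
second, offset = check_since(log, offset)  # an immediate retry sees nothing new
print(first)
print(second)
```

In this sketch the first call reports the killed process and the second call reports nothing, which is why a retry before notification would hide the event.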

Also notice that I am using a custom check_command called check_nrpe60. The only difference between check_nrpe60 and the standard check_nrpe is the addition of a 60-second timeout (see below). This is necessary because on systems with large /var/log/messages files (or busy systems with few CPU cycles to spare) the standard NRPE check on the server can time out before the plugin has completed on the client.

define command{
   command_name    check_nrpe60
   command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c $ARG1$
}

The plugin returns a warning status if it can't perform its task, and a critical status if any OOM-Killer activity has taken place. If the status is critical, it also returns extended status information detailing the PIDs and users affected.
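
The mapping from check result to Nagios status could look roughly like the following Python sketch. It is an assumption-laden illustration, not the plugin's actual logic: the function name evaluate, the status text, and the match pattern are hypothetical, and it reports only the PID and process name because the way the plugin resolves affected users is not shown here. The numeric exit codes are the standard Nagios plugin values.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumed match pattern; kernel wording differs between versions.
OOM_RE = re.compile(
    r"out of memory: kill(?:ed)? process (\d+) \(([^)]*)\)", re.IGNORECASE
)

def evaluate(path):
    """Return (exit_code, status_text) for the log file at path."""
    try:
        with open(path, "r", errors="replace") as handle:
            hits = OOM_RE.findall(handle.read())
    except OSError as exc:
        # The check could not do its job, so report WARNING.
        return WARNING, "WARNING - cannot read %s: %s" % (path, exc.strerror)
    if hits:
        detail = ", ".join("pid %s (%s)" % (pid, name) for pid, name in hits)
        return CRITICAL, "CRITICAL - OOM-Killer activity: " + detail
    return OK, "OK - no OOM-Killer activity"
```

A caller would print the status text and exit with the returned code, for example by passing the result of evaluate("/var/log/messages") to print and sys.exit.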

And finally, it is worth noting that on a properly tuned system this activity will probably not occur. We discovered it "by accident" when a physical server was converted to a virtual machine and given less memory than it had previously. When we identified and applied the correct memory tuning parameter (vm.lower_zone_protection, set in /etc/sysctl.conf in this case), the OOM-Killer activity ceased.

Reviews (1)
I've got the compiled C wrapper working on the command line as ./check_oomkiller, but when I run it through NRPE as ./check_nrpe -H 127.0.0.1 -c check_oomkiller I continue to get "NRPE: Unable to read output". I am using NRPE 3.0.1. I have also updated my nrpe.cfg file with the matching command and restarted nrpe.