LINUX ONLY - Check for OOM-Killer (out of memory killer) Activity
Current Version
2.0
Last Release Date
2010-09-07
Owner
John Chivian
Compatible With
check_oomkiller client plugin
check_oomkiller suid wrapper
The Linux OS will assassinate "big memory" processes during extreme memory shortages as a self-defense measure. This plugin checks for such activity and returns a critical status if any has occurred since the previous check.<br> <br>
This plugin was written on and for RHEL4 using Nagios and NRPE and may need to be tweaked for other distros.<br> <br>
IMPORTANT - This check requires read-only access to the system messages file <b>/var/log/messages</b>, which is not available to unprivileged accounts by default. For this reason you must either make /var/log/messages readable to the nagios user account (DON'T DO THIS), or write a small compiled wrapper program around the script and install the wrapper as a root-owned SUID executable (DO THIS!)<br> <br>
Installation Instructions<br> <br>
1) Put the Perl script and C program in the nagios/libexec directory on the client system that will be checked.<br>
2) Edit the C program if needed, changing the REAL_PATH definition for your environment.<br>
3) Compile the C program and install it as an SUID application (chown root first, then chmod 4555, because on Linux changing the owner clears the SUID bit).<br>
4) Use the following plugin definition in the nrpe.cfg configuration file on the client system. As with the C program, edit the path if needed for your environment.<br> <br>
<pre> command[check_oomkiller]=/usr/local/nagios/libexec/check_oomkiller</pre> <br>
5) Use the following service check definition on the Nagios server to perform the check on monitored systems.<br> <br>
<pre>define service{
    use                     generic-service
    host_name               possible-oom-killer-victim
    service_description     OOM Killer
    check_command           check_nrpe60!check_oomkiller
    max_check_attempts      1
    }</pre> <br>
Because each run of the OOM-Killer check resets the current status, the service check definition on the Nagios server MUST contain "<b>max_check_attempts 1</b>". <i>If you don't do this you will NEVER be notified.</i><br> <br>
Also notice that I am using a custom check_command called <b>check_nrpe60</b>. 
The only difference between check_nrpe60 and the standard check_nrpe is the addition of a 60-second timeout specification (see below). This is necessary because on systems with large /var/log/messages files (or busy systems with few CPU cycles to spare) the standard NRPE check on the server can time out before the plugin has actually completed on the client.<br> <br>
<pre>define command{
    command_name    check_nrpe60
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ <b>-t 60</b> -c $ARG1$
    }</pre> <br>
The plugin returns a warning status if it can't perform its task, and a critical status if any OOM-killer activity has taken place. If the status is critical, it also returns extended status information detailing the PIDs and users affected.<br> <br>
And finally, it is worth noting that on a properly tuned system this activity will probably not occur. We discovered it "by accident" when a physical server was converted to a virtual machine and not given the same amount of memory it had previously. When we identified and applied the correct memory tuning parameter (<b>vm.lower_zone_protection</b>, set in <b>/etc/sysctl.conf</b> in this case) the OOM-Killer activity ceased.
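For readers who want to see the shape of the check, here is a minimal sketch of the behaviour described above: scan the messages file from where the previous check stopped, return warning if the file can't be read, and return critical with the affected PIDs if the OOM killer has fired. This is not the plugin's actual code (the real plugin is a Perl script). The function names, the byte-offset bookkeeping, and the log line pattern are all assumptions made for illustration; the kernel's wording differs between versions, and this sketch reports process names rather than users.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# ASSUMPTION: OOM-killer lines look roughly like
#   "... kernel: Out of Memory: Killed process 1234 (httpd)."
# The exact wording varies between kernel versions; adjust as needed.
OOM_RE = re.compile(
    r"out of memory: kill(?:ed)? process (\d+) \(([^)]+)\)", re.IGNORECASE
)


def find_oom_kills(lines):
    """Return a list of (pid, process_name), one per OOM-killer line."""
    kills = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


def status_for(kills):
    """Map the kills found to a (status code, status text) pair."""
    if not kills:
        return OK, "OOM OK - no OOM-killer activity found"
    detail = ", ".join("pid %d (%s)" % kill for kill in kills)
    return CRITICAL, "OOM CRITICAL - killed: " + detail


def check(path, offset=0):
    """Scan path from byte offset; return (status, text, new_offset).

    Hypothetical bookkeeping: the caller stores new_offset somewhere and
    passes it back on the next run, so each kill is reported only once.
    """
    try:
        with open(path, "rb") as handle:
            handle.seek(0, 2)          # jump to end of file to learn its size
            if offset > handle.tell():
                offset = 0             # file shrank (rotated): start over
            handle.seek(offset)
            data = handle.read()
            new_offset = handle.tell()
    except OSError as exc:
        # Cannot perform the task: warning, as the description states.
        return WARNING, "OOM WARNING - cannot read %s: %s" % (path, exc), offset
    lines = data.decode("utf-8", errors="replace").splitlines()
    status, text = status_for(find_oom_kills(lines))
    return status, text, new_offset
```

A real plugin would print the status text and exit with the status code. Because a kill is only reported on the run that first sees it, the next run returns OK again, which is why the service definition needs max_check_attempts 1.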
I've got the compiled C wrapper working on the command line as ./check_oomkiller, but when I run it through NRPE as ./check_nrpe -H 127.0.0.1 -c check_oomkiller I keep getting "NRPE: Unable to read output". I'm using NRPE 3.0.1. I have also updated my nrpe.cfg file with the matching command and restarted nrpe.