check_pbsnodes

Current Version

0.2

Last Release Date

2011-05-26

Compatible With

Nagios 2.x
Nagios 3.x

Owner

jonmills

License

GPL

Files:

File	Description
check_pbsnodes	check_pbsnodes

A Nagios plugin that calls the 'showq' command to test for the presence of crashed nodes in a high performance computing cluster that uses Moab/Maui and Torque for job scheduling and queuing.

Example:
./check_pbsnodes -w 1 -c 2

This would send Nagios a warning if one node was unresponsive. If two nodes were down,
it would send Nagios a critical message. In addition, the plugin reports the names of the crashed nodes, along with the job IDs and the users who own them.
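
The threshold behaviour described above can be sketched in Python. This is only an illustration, not the plugin's actual code: the function name evaluate, the node name node07, and the message wording are assumptions, and the "greater than or equal" comparison is inferred from the -w 1 -c 2 example. The exit codes 0, 1, and 2 are the standard Nagios values for OK, WARNING, and CRITICAL.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def evaluate(down_nodes, warn, crit):
    """Map a list of crashed node names to a Nagios exit code and message.

    Hypothetical helper: assumes a threshold is reached when the number of
    crashed nodes is greater than or equal to it.
    """
    count = len(down_nodes)
    if count >= crit:
        code = CRITICAL
    elif count >= warn:
        code = WARNING
    else:
        code = OK
    detail = ", ".join(down_nodes) if down_nodes else "none"
    message = f"{LABELS[code]} - {count} node(s) down: {detail}"
    return code, message


# Example with the thresholds from the command line above (-w 1 -c 2);
# "node07" is a made-up node name.
code, message = evaluate(["node07"], warn=1, crit=2)
print(message)
```

A real plugin would print the message on a single line and then exit with the returned code, for example via sys.exit(code), so that Nagios can read the status.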

FULL DESCRIPTION:

This plugin tests for the presence of crashed nodes in a high performance
computing cluster. In such clusters, it is not uncommon for load on compute nodes
to reach very high levels. Under such load, many parts of the system may bog down
and become unresponsive. For example, SSH logins may no longer work, and polling
via Ganglia or Cacti may cease. And yet, this does not mean that the compute node
has crashed or is no longer doing the work assigned to it by the cluster scheduler.

Under such circumstances, the only way to know if a node is really down is if a job's
remaining time goes negative. Torque runs at a higher scheduling priority than the jobs
it starts, so it is always guaranteed a processor time slice. If a job exceeds its walltime
and Torque is able to get a slice, it will kill the job. If it can't, it's because the node
has crashed, and showq will show a negative time in the REMAINING column.

Therefore, this plugin is designed to be run on the Cluster Service Node, where it calls
the showq command, parses the output, and searches for negative values in the REMAINING
column. When it finds them, it reports the problem using correct Nagios syntax and
includes the crashed node names in the output string. It needs to be called from a remote
plugin executor such as NRPE, or MRPE if using Mathias Kettner's Check_MK.
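
As a rough illustration of the parsing step, the following Python sketch scans showq-style text for jobs with a negative value in the REMAINING column. It is not the plugin's own code. The sample text, job IDs, and user names are invented, and the column layout (JOBID, USERNAME, STATE, PROCS, REMAINING, STARTTIME) is an assumption about what showq prints on a given site. Mapping an overdue job back to the node names it runs on is not covered here.

```python
# Invented sample resembling the active-jobs section of showq output.
SAMPLE_SHOWQ = """\
active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

1001               alice       Running     8    10:23:11  Mon May 23 09:12:01
1002               bob         Running    16   -00:42:10  Sun May 22 01:00:00
1003               carol       Running     4 -1:03:00:00  Sat May 21 01:00:00

3 active jobs
"""


def find_overdue_jobs(showq_text):
    """Return (job_id, user) pairs whose REMAINING value is negative."""
    overdue = []
    remaining_col = None  # index of the REMAINING column, once a header is seen
    for line in showq_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "JOBID":
            # Header row: locate REMAINING, or disable matching for sections
            # (such as queued jobs) that have no such column.
            if "REMAINING" in fields:
                remaining_col = fields.index("REMAINING")
            else:
                remaining_col = None
            continue
        if remaining_col is None or len(fields) <= remaining_col:
            continue
        if fields[remaining_col].startswith("-"):
            overdue.append((fields[0], fields[1]))
    return overdue


print(find_overdue_jobs(SAMPLE_SHOWQ))
```

In practice the text would come from running showq on the Cluster Service Node, for instance through subprocess.run, and not from a hard-coded sample.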
