check_pbsnodes

Current Version: 0.2
Last Release Date: 2011-05-26
Compatible With:
  • Nagios 2.x
  • Nagios 3.x
License: GPL
Hits: 95141
Files:
  • check_pbsnodes (description: check_pbsnodes)
A Nagios plugin that calls the 'showq' command to detect crashed nodes in a high-performance computing cluster that uses Moab/Maui and Torque for job scheduling and queuing.

Example:
./check_pbsnodes -w 1 -c 2

This warns Nagios if one node is unresponsive. If two nodes are down, it
sends Nagios a critical message. In addition, the plugin reports the names of the crashed nodes, along with the job IDs and the users who own those jobs.
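
The threshold behaviour in the example can be sketched in Python as follows. This is only an illustration, not the plugin's actual code: the function name nagios_status is hypothetical, and the inclusive comparison (a count equal to the threshold triggers the state) is an assumption inferred from the "-w 1 -c 2" example above. The exit codes 0, 1 and 2 are the standard Nagios plugin return codes for OK, WARNING and CRITICAL.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def nagios_status(down_count, warn, crit):
    """Map a count of unresponsive nodes to a Nagios exit code and label.

    Hypothetical helper: thresholds are assumed to be inclusive, so
    down_count == warn already yields WARNING.
    """
    if down_count >= crit:
        return CRITICAL, "CRITICAL"
    if down_count >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


if __name__ == "__main__":
    # Mirrors ./check_pbsnodes -w 1 -c 2
    for count in (0, 1, 2):
        code, label = nagios_status(count, warn=1, crit=2)
        print(f"{count} node(s) down: {label} (exit code {code})")
```

A real plugin would print a single status line and then exit with the returned code so that Nagios can pick up the state.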
FULL DESCRIPTION:

This plugin tests for the presence of crashed nodes in a high performance
computing cluster. In such clusters, it is not uncommon for load to reach very
high levels on compute nodes. Under such load, many parts of the system may
bog down and become unresponsive. For example, SSH logins may no longer work,
and polling via Ganglia or Cacti may cease. And yet, this does not mean that the compute
node has crashed or is no longer doing the work assigned to it by the cluster scheduler.

Under such circumstances, the only reliable sign that a node is really down is a job
whose remaining time goes negative. Torque runs at a higher scheduling priority than the
jobs it runs, so it is always guaranteed a processor time slice. If walltime is exceeded
and Torque is able to get a slice, it will kill the job. If it can't, then it's because the
node has crashed, and showq will show negative time in the (time) REMAINING column.

Therefore, this plugin is designed to be run on the Cluster Service Node, calling the
showq command, parsing the output, and searching for values in the REMAINING column
that are negative numbers. When it finds them, it should report the problem using
correct Nagios syntax, and include the crashed node names in the output string. It needs
to be called from a remote plugin executor such as NRPE, or MRPE if using Mathias Kettner's
Check_MK.
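
The parsing step described above might look roughly like the following Python sketch. It is not the plugin's source. The sample showq text, its column layout, the regular expression and the function names are all assumptions made for illustration; real showq output varies between Moab and Maui versions. Looking up node names for each overdue job would need an additional scheduler query, which is left out here.

```python
import re

# Hypothetical showq excerpt. The column layout is an assumption;
# adjust the pattern below to match the output of your scheduler.
SAMPLE_SHOWQ = """\
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

1001               alice       Running     8    23:14:05  Mon May 23 10:00:00
1002               bob         Running    16   -00:42:10  Sun May 22 08:15:00
1003               carol       Running     4 -1:02:03:04  Fri May 20 09:30:00
"""

# Job id, user, state, processor count, then the REMAINING field,
# which may carry a leading minus sign.
JOB_LINE = re.compile(
    r"^(?P<job>\S+)\s+(?P<user>\S+)\s+(?P<state>\S+)\s+"
    r"(?P<procs>\d+)\s+(?P<remaining>-?[\d:]+)\s"
)


def find_overdue_jobs(showq_text):
    """Return (job_id, user, remaining) for rows with negative REMAINING."""
    overdue = []
    for line in showq_text.splitlines():
        match = JOB_LINE.match(line)
        if match and match.group("remaining").startswith("-"):
            overdue.append(
                (match.group("job"), match.group("user"), match.group("remaining"))
            )
    return overdue


def format_status_line(overdue):
    """Build a one-line, Nagios-style status message (format is illustrative)."""
    if not overdue:
        return "OK - no jobs with negative remaining time"
    details = ", ".join(f"{job} ({user})" for job, user, _ in overdue)
    return f"PROBLEM - {len(overdue)} job(s) past walltime: {details}"


if __name__ == "__main__":
    print(format_status_line(find_overdue_jobs(SAMPLE_SHOWQ)))
```

In a deployed check, the sample text would be replaced by the captured output of the showq command, and the state label and exit code would be chosen by comparing the number of affected nodes against the -w and -c thresholds.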