Search Exchange
Search All Sites
Nagios Live Webinars
Let our experts show you how Nagios can help your organization.Login
Directory Tree
check_vmware_esx.pl
0.9.19
2014-08-21
- Nagios 1.x
- Nagios 2.x
- Nagios 3.x
- Nagios 4.x
- Nagios XI
GPL
71629
File | Description |
---|---|
check_vmware_esx_0.9.19.tgz | The code |
HISTORY | History |
README | Something to read |
check_vmware_esx.pl
===================
check_vmware_esx.pl - a fork of check_vmware_api.pl
General
-------
Why a fork? According to my personal unterstanding Nagios, Icingia etc. are tools for
a) Monitoring and alarming. That means checking values against thresholds (internal or handed over)
b) Collecting performance data. These data, collected with the checks, like network traffic, cpu usage or so should be
interpretable without a lot of other data.
Athough check_vmware_api.pl is a great plugin it suffers from various things.
a) It acts as a monitoring plugin for Nagios etc.
b) It acts a more comfortable commandline interfacescript.
c) It collects all a lot of historical data to have all informations in one interface.
While a) is ok b) and c) needs to be discussed. b) was necessary when you had only the Windows GUI and working on Linux
meant "No interface". this is obsolete now with the new webgui.
c) will be better used by using the webgui because historical data (in most situations) means adjusted data. Most of these
collected data is not feasible for alerting but for analysing performance gaps.
So as a conclusion collecting historic performance data collected by a monitored system should not be done using Nagios,
pnp4nagios etc.. It should be interpreted with the approriate admin tools of the relevant system. For vmware it means use
the (web)client for this and not Nagios. Same for performance counters not self explaining.
Example:
Monitoring swapped memory of a vmware guest system seems to makes sense. But on the second look it doesn't because on Nagios
you do not have the surrounding conditions in one view like
- the number of the running guest systems on the vmware server.
- the swap every guest system needs
- the total space allocated for all systems
- swap/memory usage of the hostcheck_vmware_esx.pl
- and a lot more
So monitoring memory of a host makes sense but the same for the guest via vmtools makes only a limited sense.
But this were only a few problems. Furthermore we had
- misleading descriptions
- things monitored for hosts but not for vmware servers
- a lot of absolutely unnesseary performance data (who needs a performane graph for uptime?)
- unnessessary output (CPU usage in Mhz for example)
- and a lot more.
This plugin is old and big and cluttered like the room of my little son. So it was time for some house cleaning.
I try to clean up the code of every routine, change things and will try to ease maintenance of the code.
The plugin is not really ready but working. Due to the mass
of options the help module needs work.
See history for changes
One last notice for technical issues. For better maintenance (and partly improved runtime)
I have decided to modularize the plugin. It makes patching a lot easier. Modules which must be there every time are
included with use, the others are include using require at runtime. This ensures that only
that part of code is loaded which is needed.
History:
========
- 22 Mar 2012 M.Fuerstenau
- Started with actual version 0.5.0
- Impelemented check_esx3new2.diff from Simon (simeg / simmerl)
- Reimplemented the changes of Markus Obstmayer for the actual version
- Comments within the code inform you about the changes
- It may happen that controllers has been found which are not active.
Therefor around line 2300 the following was added:
# M.Fuerstenau - additional to avoid using inactive controllers --------------
elsif (uc($dev->status) eq "3")
{
$status = 0;
}
#----------------------------------------------------------------------------
else
{
$state = 3;
}
$actual_state = Nagios::Plugin::Functions::max_state($actual_state, $status);
}
$perfdata = $perfdata . " adapters=" . $count . ";"$perf_thresholds . ";;";
# M.Fuerstenau - changed the output a little bit
$output .= $count . " of " . @{$storage->storageDeviceInfo->hostBusAdapter} . " defined/possible adapters online, ";
- 30 Mar 2012 M.Fuerstenau
- added --ignore_unknown. This maps 3 to 0. Why? You have for example several host adapters. Some are reported as
unknown by the plugin because they are not used or have not the capability to reports something senseful.
- 02 Apr 2012 M.Fuerstenau
- _info (Adapter/LUN/Path). Removed perfdata. To count such items and display as perfdata doesn't make sense
- Changed PATH to MPATH and help from "path - list logical unit paths" to "mpath - list logical unit multipath info" because
it is NOT an information about a path - it is an information about multipathing.
- 08 Jan 2013 M.Fuerstenau
- Removed installation informations for the perl SDK from VMware. This informations are part of the SDK and have nothing to do
with this plugin.
- Replaced global variables with my variables. Instead of "define every variable on the fly as needed it is a good practice
to define variables at the beginning and place a comment within it. It gives you better readability.
- 22 Jan 2013 M.Fuerstenau
- Merged with the actual version from op5. Therfore all changes done to the op5 version:
- 2012-05-28 Kostyantyn Hushchyn
Rename check_esx3 to check_vmware_api(Fixed issue #3745)
- 2012-05-28 Kostyantyn Hushchyn
Minor cosmetic changes
- 2012-05-29 Kostyantyn Hushchyn
Clear cluster failover perfdata units, as it describes count of possible failures
to tolerate, so can't be mesured in MB
- 2012-05-30 Kostyantyn Hushchyn
Minor help message changes
- 2012-05-30 Kostyantyn Hushchyn
Implemented timeshift for cluster checks, which could fix data retrievel issues. Small refactoring.
- 2012-05-31 Kostyantyn Hushchyn
Remove dependency on inteval value for cluster checks, which allows to run commands that doesn't require historical intervals
- 2012-05-31 Kostyantyn Hushchyn
Remove unnecessary/unimplemented function which caused cluster effectivecpu subcheck to fail
- 2012-06-01 Kostyantyn Hushchyn
Hide NIC status output for net check in case of empty perf data result(Fixed issue #5450)
- 2012-06-07 Kostyantyn Hushchyn
Fixed manipulation with undefined values, which caused perl interpreter warnings output
- 2012-06-08 Kostyantyn Hushchyn
Moved out global variables from perfdata functions. Added '-M' max sample number argument, which specify maximum data count to retrive.
- 2012-06-08 Kostyantyn Hushchyn
Added help text for Cluster checks.
- 2012-06-08 Kostyantyn Hushchyn
Increment version number Kostyantyn Hushchyn
- 2012-06-11 Kostyantyn Hushchyn
Reimplemented csv parser to process all values in sequence. Now all required functionality for max sample number argument are present in the plugin.
- 2012-06-13 Kostyantyn Hushchyn
Fixed cluster failover perf counter output.
- 2012-06-22 Kostyantyn Hushchyn
Added help message for literal values in interval argument.
- 2012-06-22 Kostyantyn Hushchyn
Added nicknames for intervals(-i argument), which helps to provide correct values in case you can not find them in GUI.
Supported values are: r - realtime interval, h - historical interval at position , starting from 0.
- 2012-07-02 Kostyantyn Hushchyn
Reimplemented Datastore checking in Datacenter using different approach(Might fix issue #5712)
- 2012-07-06 Kostyantyn Hushchyn
Fixed Datacenter runtime check Kostyantyn Hushchyn
- 2012-07-06 Kostyantyn Hushchyn
Fixed Datastore checking in Datacenter(Might fix issue #5712)
- 2012-07-09 Kostyantyn Hushchyn
Added help info for Host runtime 'sensor' subcheck
- 2012-07-09 Kostyantyn Hushchyn
Added Host runtime subcheck to threshold sensor data
- 2012-07-09 Kostyantyn Hushchyn
Fixed Host temperature subcheck causing perl interpreter messages output
- 2012-07-10 Kostyantyn Hushchyn
Added listall option to output all available sensors. Sensor name now trieted as regexp, so result will be outputed for the first match.
- 2012-07-26 Fixed issue which prevents plugin... v2.8.8 v2.8.8-beta1 Kostyantyn Hushchyn
Fixed issue which prevents plugin from executing under EPN(Fixed issue #5796)
- 2012-09-03 Kostyantyn Hushchyn
Implemented plugin timeout(returns 3).
- 2012-09-05 Kostyantyn Hushchyn
Added storage refresh functionality in case when it's present(Fixed issue #5787)
- 2012-09-21 Kostyantyn Hushchyn
Added check for dead pathes, which generates 2 in case when at least one is present(Fixed issue #5811)
- 2012-09-25 Kostyantyn Hushchyn
Changed comparison logic in storage path check
- 2012-09-26 Kostyantyn Hushchyn
Fixed 'Global symbol normalizedPathState requires explicit package name'
- 2012-10-02 Kostyantyn Hushchyn
Changed timeshift argument type to integer, so that non number values will be treated as invalid.
- 2012-10-02 Kostyantyn Hushchyn
Changed to a conditional datastore refresh(Reduce overhead of solution suggested in issue #5787)
- 2012-10-05 Kostyantyn Hushchyn
Updated description so now almost all options are documented, though somewhere should be documented arguments like timeshift(-T),
max samples(-M) and interval(-i) (Solve ticket #5950)
- 31 Jan 2013 M.Fuerstenau
- Replaced most die with a normal if statement and an exit.
- 1 Feb 2013 M.Fuerstenau
- Replaced unless with if. unless was only used eight times in the program. In all other statements we had an if statement
with the appropriate negotiation for the statement.
- 5 Feb 2013 M.Fuerstenau
- Replaced all add_perfdata statements with simple concatenated variable $perfdata
- 6 Feb 2013 M.Fuerstenau
- Corrected bug. Name of subroutine was sub check_percantage but thist was a typo.
- 7 Feb 2013 M.Fuerstenau
- Replaced $percc and $percw with $crit_is_percent and $warn_is_percent. This was just cosmetic for better readability.
- Removed check_percentage(). It was replaced by two one liners directly in the code. Easier to read.
- The only codeblocks using check_percentage() were the blocks checking warning and critical. But unfortunately the
plausability check was not sufficient. Now it is tested that no other values than numbers and the % sign can be
submitted. It is also checked that in case of percent the values are in a valid level between 0 and 100
- 12 Feb 2013 M.Fuerstenau
- Replaced literals like CRITICAL with numerical values. Easier to type and anyone developing plugins should be
safe with the use
- Replaced $state with $actual_state and $res with $state. More for cosmetical issues but the state is returned
to Nagios.
- check_against_threshold from Nagios::Plugin replaced with a little own subroutine check_against_threshold.
- Nagios::Plugin::Functions::max_state replaced with own routine check_state
- 14 Feb 2013 M.Fuerstenau
- Replaced hash %STATUS_TEXT from Nagios::Plugin::Functions with own hash %status2.
- 15 Feb 2013 M.Fuerstenau
- Own help (print_help()) and usage (print_usage()) function.
- Nagios::plugin kicked finally out.
- Mo more global variables.
- 25 Feb 2013 M.Fuerstenau
- $quickstats instead of $quickStats for better readability.
- 5 Mar 2013 M.Fuerstenau
- Removed return_cluster_DRS_recommendations() because for daily use this was more of an exotical feature
- Removed --quickstats for host_cpu_info and dc_cpu_info because quickstats is not a valid option here.
- 6 Mar 2013 M.Fuerstenau
- Replaced -o listitems with --listitems
- 8 Mar 2013 M.Fuerstenau
- --usedspace replaces -o used. $usedflag has been replaced by $usedflag.
- --listvms replaces -o listvm. $outputlist has been replaced by $listvms.
- --alertonly replaces -o brief. $briefflag has been replaced by $alertonly.
- --blacklistregexp replaces -o blacklistregexp. $blackregexpflag has been replaced by $blacklistregexp.
- --isregexp replaces -o regexp. $regexpflag has been replaced by $isregexp.
- 9 Mar 2013 M.Fuerstenau
- Main selection is now transfered to a subrouting main_select because after
a successfull if statement the rest can be skipped leaving the subroutine
with return
- 19 Mar 2013 M.Fuerstenau
- Reformatted and cleaned up a lot of code. Variable definitions are now at the beginning of each
subroutine instead of defining them "on the fly" as needed with "my". Especially using "my" for
definition in a loop is not goog coding style
- 21 Mar 2013 M.Fuerstenau
- --listvms removed as extra switch. Ballooning or swapping VMs will always be listed.
- Changed subselect list(vm) to listvm for better readability. listvm was accepted before (equal to list)
but not mentioned in the help. To have list or listvm for the same is a little bit exotic. Fixed this inconsistency.
- 22 Mar 2013 M.Fuerstenau
- Removed timeshift, interval and maxsamples. If needed use original program from op5.
- 25 Mar 2013 M.Fuerstenau
- Removed $defperfargs because no values will be handled over. Only performance check that needed another which
needed another sampling invel was cluster. This is now fix with 3000.
- 11 Apr 2013 M.Fuerstenau
- Rewritten and cleaned subroutine host_mem_info. Removed $value1 - $value5. Stepwise completion of $output makes
this unsophisticated construct obsolete.
- 16 Apr 2013 M.Fuerstenau
- Stripped down vm_cpu_info. Monitoring CPU usage in Mhz makes no sense under normal circumstances
Mhz is no valid unit for performance data according to the plugin developer guide. I have never found
a reason to monitor wait time or ready time in a normal alerting evironment. This data has some interest
for performance analysis. But this can be done better with the vmware tools.
- Rewritten and cleaned subroutine vm_mem_info. Removed $value1 - $value5. Stepwise completion of $output makes
this unsophisticated construct obsolete.
- 24 Apr 2013 M.Fuerstenau
- Because there is a lot of different performance counters for memory in vmware we ave changed something to be
more specific.
- Enhenced explanations in help.
- Changed swap to swapUSED in host_mem_info().
- Changed usageMB to CONSUMED in host_mem_info(). Same for variables.
- Removed overall in host_mem_info(). After reading the documentation carefully the addition of consumed.average + overhead.average
seems a little bit senseless because consumed.average includes overhead.average.
- Changed usageMB to CONSUMED in vm_mem_info(). Same for variables.
- Removed swapIN and swapOUT in vm_mem_info(). Not so sensefull for Nagios alerting because it is hard to find
valid thresholds
- Removed swap in vm_mem_info(). From the vmware documentation:
"Current amount of guest physical memory swapped out to the virtual machine's swap file by the VMkernel. Swapped
memory stays on disk until the virtual machine needs it. This statistic refers to VMkernel swapping and not
to guest OS swapping. swapped = swapin + swapout"
This is more an issue of performance tuning rather than alerting. It is not swapping inside the virtual machine.
it is not possible to do any alerting here because (especially with vmotion) you have no thresholds.
- Removed OVERHEAD in vm_mem_info(). From the vmware documentation:
"Amount of machine memory used by the VMkernel to run the virtual machine."
So using this we have a useless information about a virtual machine because we have no valid context and we
have no valid thresholds. More important is overhead for the host system. And if we are running in problems here
we have to look which machine must be moved to another host.
- As a result of this overall in vm_mem_info() makes no sense.
- 25 Apr 2013 M.Fuerstenau
- Removed swap in vm_mem_info(). From vmware documentation:
"Amount of guest physical memory that is currently reclaimed from the virtual machine through ballooning.
This is the amount of guest physical memory that has been allocated and pinned by the balloon driver."
So here we have again data which makes no sense used alone. You need the context for interpreting them
and there are no thresholds for alerting.
- 29 Apr 2013 M.Fuerstenau
- Renamed $esx to $esx_server. This is only for cosmetics and better reading of the code.
- Reimplmented subselect ready in vm_cpu_info and implemented it new in host_cpu_info.
From the vmware documentation:
"Percentage of time that the virtual machine was ready, but could not get scheduled
to run on the physical CPU. CPU ready time is dependent on the number of virtual
machines on the host and their CPU loads."
High or growing ready time can be a hint CPU bottlenecks (host and guest system)
- Reimplmented subselect wait in vm_cpu_info and implemented it new in host_cpu_info.
From the vmware documentation:
"CPU time spent in wait state. The wait total includes time spent the CPU Idle, CPU Swap Wait,
and CPU I/O Wait states. "
High or growing wait time can be a hint I/O bottlenecks (host and guest system)
- 30 Apr 2013 M.Fuerstenau
- Removed subroutines return_dc_performance_values, dc_cpu_info, dc_mem_info, dc_net_info and dc_disk_io_info.
Monitored entity was view type HostSystem. This means, that the CPU of the data center server is monitored.
The data center server (vcenter) is either a physical MS Windows server (which can be monitored better
directly with SNMP and/or NSClient++) or the new Linux based appliance which is a virtual machine and
can be monitored as any virtual machine. The OS (Linux) on that virtual machine can be monitored like
any standard Linux.
- 5 May 2013 M.Fuerstenau
- Revised the code of dc_list_vm_volumes_info()
- 9 May 2013 M.Fuerstenau
- Revised the code of host_net_info(). The function was devided in two parts (like others):
- subselects
- else which included all.
So most of the code existed twice. One for each subselect and nearly the same for all together.
The else block was removed and in case no subselect was defined we defined all as $subselect.
With the variable set to all we can decide wether to leave the function after a subselect section
has been processed or stay and enhance $output and $perfdata. So the code is more clear and
has nearly half the lines of code left.
- Removed KBps as unit in performance data. This unit is not specified in the plugin developer
guide. Performance data is now just a number without a unit. Adding the unit has to be done
in the graphing tool (like pnp4nagios).
- Removed the number of NICs as performance data. A little bit senseless to have those data here.
- 10 May 2013 M.Fuerstenau
- Revised the code of vm_net_info(). Same changes as for host_net_info() exept the NIC section.
This is not available for VMs.
- 14 May 2013 M.Fuerstenau
- Replaced $command and $subselect with $select and $subselect. Therfore also the options --command
--subselect changed to --select and --subselect. This has been done to become it more clear.
In fact these items where no commands (or subselects). It were selections from the amount of
performance counters available in vmware.
- 15 May 2013 M.Fuerstenau
- Kicked out all (I hope so) code for processing historic data from generic_performance_values().
generic_performance_values() is called by return_host_performance_values(), return_host_vmware_performance_values()
and return_cluster_performance_values() (return_cluster_performance_values() must be rewritten now).
The code length of generic_performance_values() was reduced to one third by doing this.
- 6 Jun 2013 M.Fuerstenau
- Substituted commandline option for select -l with -S. Therefore -S can't be used as option for the sessionfile
Only --sessionfile is accepted nor the name of the sessionfile.
- Corrected some bugs in check_against_threshold()
- Ensured that in case of thresholds critical must be greater than warning.
- 11 Jun 2013 M.Fuerstenau
- Changed select option for datastore from vmfs to volumes because we will have volumes on nfs AND vmfs.
- Changed output for datastore check to use the option --multiline. This will add a (unset -> default) for
every line of output. If set it will use HTML tag
.
The option --multiline sets a
tag instead of . This must be filtered out
before using the output in notifications like an email. sed will do the job:
sed 's/<[Bb][Rr]>/& /g' | sed 's/<[^<>]*>//g'
Example:
# 'notify-by-email' command definition
define command{
command_name notify-by-email
command_line /usr/bin/printf "%b" "Message from Nagios: Notification Type: $NOTIFICATIONTYPE$ Service: $SERVICEDESC$ Host: $HOSTNAME$ Hostalias: $HOSTALIAS$ Address: $HOSTADDRESS$ State: $SERVICESTATE$ Date/Time: $SHORTDATETIME$ Additional Info: $SERVICEOUTPUT$ $LONGSERVICEOUTPUT$" | sed 's/<[Bb][Rr]>/& /g' | sed 's/<[^<>]*>//g' | /bin/mail -s "** $NOTIFICATIONTYPE$ alert - $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}
- 13 Jun 2013 M.Fuerstenau
- Replaced a previous change because it was wrong done:
- --listvms replaced by subselect listvms
- 14 Jun 2013 M.Fuerstenau
- Some minor corrections like a doubled chop() datastore_volumes_info()
- Added volume type to datastore_volumes_info(). So you can see whether the volume is vmfs (local or SAN) or NFS.
- variables like $subselect or $blacklist are global there is no need to handle them over to subroutines like
($result, $output) = vm_cpu_info($vmname, local_uc($subselect)) . For $subselect we have now one uppercase
(around line 580) instead of having one with each call in the main selection.
- Later on I renamed local_uc to local_lc because I recognized that in cases the subselect is a volume name
upper cases won't work.
- replaced last -o $addopts (only for the name of a sensor) with --sensorname
- 18 Jun 2013 M.Fuerstenau
- Rewritten and cleaned subroutine host_disk_io_info(). Removed $value1 - $value7. Stepwise completion of $output makes
this unsophisticated construct obsolete.
- Removed use of performance thresholds in performance data when used disk io without subselect because threshold
can only be used for on item not for all. Therefore they weren't checked in that section. Senseless.
- Changed the output. Opposite to vm_disk_io_info() most vlues in host_disk_io_info() are not transfer rates
but latency in milliseconds. The output is now clearly understandable.
- Added subselect read. Average number of kilobytes read from the disk each second. Rate at which data is read
from each LUN on the host.read rate = # blocksRead per second x blockSize.
- Added subselect write. Average number of kilobytes written to disk each second. Rate at which data is written
to each LUN on the host.write rate = # blocksRead per second x blockSize
- Added subselect usage. Aggregated disk I/O rate. For hosts, this metric includes the rates for all virtual
machines running on the host.
- 21 Jun 2013 M.Fuerstenau
- Rewritten and cleaned subroutine vm_disk_io_info(). Removed $value1 - $valuen. Stepwise completion of $output makes
this unsophisticated construct obsolete.
- Removed use of performance thresholds in performance data when used disk io without subselect because threshold
can only be used for on item not for all. Therefore they weren't checked in that section. Senseless.
- 24 Jun 2013 M.Fuerstenau
- Changed all .= (for example $output .= $xxx.....) to = $var... (for example $output = $output . $xxx...). .= is shorter
but the longer form of notification is better readable. The probability of overlooking the dot (especially for older eyes
like mine) is smaller.
- 07 Aug 2013 M.Fuerstenau
- Changed "eval { require VMware::VIRuntime };" to "use VMware::VIRuntime;". The eval construct
made no sense. If the module isn't available the program will crash with a compile error.
- Removed own subroutine format_uptime() only used by host_uptime_info(). The complete work of this function
was done converting seconds to days, hours etc.. Instead of the we use the perl module Time::Duration.
So instead of
$output = "uptime=" . format_uptime($value);
we simply use
$output = "uptime=" . duration_exact($value);
- Removed perfdata from host_uptime_info(). Perfdata for uptime seems senseless. Same for threshold.
- Started modularization of the plugin. The reason is that it is much more easier to
patch modules than to patch a large file.
- Variables used in that functions which are defined on the top level
with "my" must now be defined with "our".
BEWARE! Using "our" with unknown modules can lead to curious results if
in this functions are variables with the same name. But in this
case it is no risk because the modules are not generic. We have only
broken the plugin in handy pieces.
- Made an seperate modules:
- help.pm -> print_help()
- process_perfdata.pm -> get_key_metrices()
-> generic_performance_values()
-> return_host_performance_values()
-> return_host_vmware_performance_values()
-> return_cluster_performance_values()
-> return_host_temporary_vc_4_1_network_performance_values()
- host_cpu_info.pm -> host_cpu_info()
- host_mem_info.pm -> host_mem_info()
- host_net_info.pm -> host_net_info()
- host_disk_io_info.pm -> host_disk_io_info()
- datastore_volumes_info.pm -> datastore_volumes_info()
- host_list_vm_volumes_info.pm -> host_list_vm_volumes_info()
- host_runtime_info.pm -> host_runtime_info()
- host_service_info.pm -> host_service_info()
- host_storage_info.pm -> host_storage_info()
- host_uptime_info.pm -> host_uptime_info()
- 13 Aug 2013 M.Fuerstenau
- Moved host_device_info to host_mounted_media_info. Opposite to it's name
and the description this function wasn't designed to list all devices
on a host. It was designed to show host cds/dvds mounted to one or more
virtual machines. This is important for monitoring because a virtual machine
with a mount cd or dvd drive can not be moved to another host.
- Made an seperate modules:
- host_mounted_media_info.pm -> host_mounted_media_info()
- 19 Aug 2013 M.Fuerstenau
- Added SOAP check from Simon Meggle, Consol. Slightly modified to fit.
- Added isblacklisted and isnotwhitelisted from Simon Meggle, Consol. . Same as above.
Following subroutines or modules are affected:
- datastore_volumes_info.pm
- host_runtime_info.pm
- Enhanced host_mounted_media_info.pm
- Added check for host floppy
- Added isblacklisted and isnotwhitelisted
- Added $multiline
- 21 Aug 2013 M.Fuerstenau
- Reformatted and cleaned up host_runtime_info().
- A lot of bugs in it.
- 17 Aug 2013 M.Fuerstenau
- Minor bug fix.
- $subselect was always converted to lower case characters.
This is correct exect $subselect contains a name (e.g. volumes). Volume names
can contain upper and lower letters. Fixed.
- datastore_volumes_info.pm had my ($datastore, $subselect) = @_; as second line
This was incorrect because "global" variables (defined as our in the main program)
are not handled over via function call. (Yes - may be handling over maybe more ok
in the sense of structured programming. But really - does handling over and giving back
a variable makes the code so much clearer? More a kind of philosophy :-)) )
- 3 Oct 2013 M.Fuerstenau
- host_runtime_info().
- Filtered out the sensor type "software components". Makes no sense for alerting.
- 01 Nov 2013 M.Fuerstenau version 0.8.10
- removed return_host_temporary_vc_4_1_network_performance_values() from
process_perfdata.pm. Not needed in ESX version 5 and above.
Affected subroutine:
host_net_info().
- Bug fixed in generic_performance_values(). Unfortunaltely I had moved
my @values = () to main function. Therefore instead of containing a
new array reference with each run the new references were added to the array
but only the first one was processed. Thanks to Timo Weber for discovering this bug.
- 20 Nov 2013 M.Fuerstenau version 0.8.11
- check_state(). Bugfix. Logical error. Complete rewrite.
- host_net_info()
- Minor bugfix. Added exit 2 in case of unallowed thresholds
- Simplified the output. Instead of doing it for every subselect (or not
if subselect=all for selecting all info) we have a new helper variable
called $true_sub_sel.
0 means not a true subselect.
1 means a true subselect (including all)
- host_runtime_info().
- Filtered out the sensor type "software components". Makes no sense for alerting.
- The complete else tree (no subselect) was scrap due to the fact that the return code
was always ok. There would never be any alarm. Kicked out.
- The else tree was replaced by a $subselect=all as in host_net_info().
- Same for output.
- subselect listvms.
- The (OK) in the output was a hardcoded. Replaced by the deliverd value (UP). (in my oipinion senseless)
- Removed perfdata. The number of virtual machines as perfdata doesn't
make so much sense.
- Rearranged the order of subselects. Must be the same as the statements
in the kicked out else sequence to get the same order in output
- connection state.
From the docs:
connected Connected to the server. For ESX Server, this is always
the setting.
disconnected The user has explicitly taken the host down. VirtualCenter
does not expect to receive heartbeats from the host. The
next time a heartbeat is received, the host is moved to
the connected state again and an event is logged.
notResponding VirtualCenter is not receiving heartbeats from the server
. The state automatically changes to connected once
heartbeats are received again. This state is typically
used to trigger an alarm on the host.
In the past the presset of returncode was 2. It was set to 0 in case of a
connected. But disconnected doesn' mean a critical error. It means main-
tenance or somesting like that. Therefore we now return a 1 for warning.
notResonding will cause a 2 for critical.
- Kicked out maintenance info in runtime summary and as subselect. In the beginning of
the function is a check for maintenance. In the original program in this case
the program will be left with a die which caused a red alert in Nagios. Now an info
is displayed and a return code of 1 (warning) is deliverd because a maintenance is
regular work. Although monitoring should take notice of it. Therefore a warning.
Therefor the maintenence check in the else tree was scrap. It was never reached.
- listvms
In case of no VMs the plugin returned a critical. But this is not correct. No VMs
on a host is not an error. It is simply what it says: No VMs.
- Replaced --listitems and --listall with --listsensors because there were two options
for the same use.
- Later on I decided to kick out the complete sensorname construction. To monitor a seperate sensor
by name (a string which ist different for each sensor) seems not to make so much sense. To monitor
sensors by type gave no usefull information (exept temperature which is still monitored. All usefull
informations are in the health section. Better to implement some out here if needed.
renamed -s temperature to -s temp (because I am lazy).
- Changed output of --listsensors. | is the seperation symbol for performance data. So it should not be
used in output strings.
- Same for the error output of sensors in case of a problem.
- subselect=health. Filter out sensors which have not valid data. Often a sensor is
reckognizedby vmware but has not the ability to report something senseful. In this
case an unknown is reported and the message "Cannot report on the current health state of
the element". This it can be skipped.
- Bugfix for vm_net_info(). It worked perfectly for -H but produced bullshit with the -D option.
With -D in PerfQuerySpec->new(...) intervalId and maxSample must be set. Default was 20 for intervalId
and 1 for maxSample
- datastore_volumes_info()
- New option --gigabyte
- Fixed bug for treating name as regexp (--isregexp).
- Fixed bug in perfdata (%:MB). Is now MB or GB.
- If no single volume is selected warning/critical threshold can be given
only in percent.
- Fixed bug in regexp for blacklists and whitelists and replace --blacklistregexp and --whitelistregexp with
--isregexp. Used in
- host_mounted_media_info()
- datastore_volumes_info()
- host_runtime_info()
All subroutines revised later will include it automatically
- Fixed bug in regexp for blacklists and whitelists and replace --blacklistregexp and --whitelistregexp with
- Opposite to the op5 original blacklists and whitelists can now contain true reular expressions.
- help.pm rewritten. It was too much output. Now the user has several choices:
-h|--help= The complete help for all.
-h|--help= Help for datacenter/vcenter checks.
-h|--help= Help for vmware host checks.
-h|--help= Help for virtual machines checks.
-h|--help= Help for cluster checks.
- host_service_info() rewritten.
- No longer a list of services via subselect.
- Instead full blacklist/whitelist support
- With --isregexp blacklist/whitelist items are interpreted as regular expressions
- Changed switch of blacklist from -x to -B
- Changed switch of whitelist from -y to -W
- main_select(). Something for the scatterbrained of us. Replaced string compare (eq,ne etc.) with
a regexp pattern matching. So it is possible to type in services or Services instead of service.
- 29 Nov 2013 M.Fuerstenau version 0.8.13
- In all functions checking host for maintenance we had the following contruction:
if (uc($host_view->get_property('runtime.inMaintenanceMode')) eq "TRUE")
This is quite stupid. runtime.inMaintenanceMode is xsd:boolean wich means true or false (in
lower cas letters. So a uc() make no sense. Removed
- Bypass SSL certificate validation - thanks Robert Karliczek for the hint
- Added --sslport if any port other port than 443 is wanted. 443 is used as default.
- Rewritten authorization part. Sessionfiles are working now.
- Updated README.
- 03 Dec 2013 M.Fuerstenau version 0.8.14
- host_runtime_info()
- Fixed minor bug. In case of an unknown with sensors the unknown was always mapped to
a warning. This wasn't sensefull under all circumstances. Now it is only mapped to warning
if --ignoreunknow is not set.
- README
- Updated installation notes
- Optional pathname for sessionfile
- 10 Dec 2013 M.Fuerstenau version 0.8.15
- datastore_volumes_info()
- Changed "For all volumes" to "For all selected volumes". This should fit all. The subroutine datastore_volumes_info
is called from several subroutines and to make a difference here for a single volume or more volumes will cause
a lot of unnecessary work just for cosmetics.
- Fixed typo in error message.
- Modified error messages
- Removed plausibility check whether critical must be greater than warning. In case of freespace for example it must
be the other way round. The plausibility check was nice but too complicated for all the different conditions.
- check_against_threshold() - Cleaned up and partially rewritten.
- 12 Dec 2013 M.Fuerstenau version 0.8.16
- datastore_volumes_info()
- Some small fixes
- Changed the output for OK from "For all selected volumes" to "OK for all seleted volumes."
- Same for error plus an error counter for the alarms
- Parameter usedspace was ignored on volume checks because a local variable defined with my instead of a more global
one. Changed it from my to our fixed it. Thanks to dgoetz for reporting (and fixing) this.
- Capacity is now displayed and deliverd as perfdata
- Main selection - GetOptions. Unce upon a time I had kicked out $timeout unintended. Fixed now. Thanks to
Andreas Daubner for the hint.
- 12 Dec 2013 M.Fuerstenau version 0.8.17
- host_net_info()
- Changed output
- Added total number of NICs
- 25 Dec 2013 M.Fuerstenau version 0.9.0
- help()
- Removed -v/--verbose. The code was not debugging the plugin but not for working with it.
- host_storage_info()
- Removed optional switch for adaptermodel. Displaying the adapter model is default now.
- Removed upper case conversion for status. The status was converted (for examplele online to ONLINE) and
compared with a upper case string like "ONLINE". Sensefull like a second asshole.
- Added working blacklist/whitelist.
- Blacklist: blacklisted adapters will not be displayed.
- Whitelist: only whitelisted adapters will be displayed.
- Removed perfdata for the number of hostbusadapters. These perfdata was absolutely senseless.
- Status for hostbusadapters was not checked correctly. The check was only done for online and unknown but NOT(!!)
for offline and unbound.
- LUN states were not correct. UNKNOWN is not a valid state. Not all states different from unknown are
supposed to be critical. From the docs:
degraded One or more paths to the LUN are down, but I/O is still possible. Further
path failures may result in lost connectivity.
error The LUN is dead and/or not reachable.
lostCommunication No more paths are available to the LUN.
off The LUN is off.
ok The LUN is on and available.
quiesced The LUN is inactive.
timeout All Paths have been down for the timeout condition determined by a
user-configurable host advanced option.
unknownState The LUN state is unknown.
- Removed number of LUNs as perfdata. Senseless (again).
- In the original selection for the displayed LUN the displayName was used first, then the deviceName and
the last one was the canonical name. Unfortunately in the GUI SCSI ID, canonical name an runtime name is
displayed. So using the freely configurable DisplayName is senseless. The device name is formed from the
path (/vmfs/devices/disks/) followed by the canonical name. So it is either senseless.
- The real numeric LUN number wasn'nt display. Fixed. Output is now LUN, canonical name, everything from
the display name not equal canonical name and status.
- Complete rewrite of the paths part. Only the state of the multipath was checked but not the state of the
paths. So a multipath can be "Active" which is ok but the second line is dead. So if the active path becomes
dead the failover won't work.There must be an alarm for a standby path too. It is now grouped in the output.
- Multiline support for this.
- 03 Jan 20143 M.Fuerstenau version 0.9.1
- check_vmware_esx.pl
Added new flag --ignore_warning. This will map a warning to ok.
- host_runtime_info() - some minor changes
- Changed state to powerstate. In the original version the power state was mapped:
poweredOn => UP
poweredOff => DOWN
suspended => SUSPENDED
This suggested a machine state but it is only a powerstate. All other than UP caused a critical. But this is
not true. A power off machine can also be ok. But to be sure that it is noticed we will have a warning for
powerd off and suspended
- Perfdata changed.
- vm_up -> vm_powerdon
- New: vm_poweroff
- New: vm_suspended
- vm_runtime_info() -> vm_runtime_info.pm
- Removed a lot of unnecessary variables and hashes. Rewritten a lot.
- Connection state. Only "connected" was checked. All other caused a critical error without a usefull
message.
This wa a little bit incomplete. Corrected. States delivered from VMware are connected, disconnected,
inaccessible, invalid and orphaned.
- Removed cpu. VirtualMachineRuntimeInfo maxCpuUsage (in Mhz) doesn't make so much sense for monitoring/alerting.
See VMware docs for further information for this performance counter.
- Removed mem. VirtualMachineRuntimeInfo maxMemoryUsage doesn't make so much sense for monitoring/alerting.
See VMware docs for further information for this performance counter.
- Changed state to powerstate. In the original version the power state was mapped:
poweredOn => UP
poweredOff => DOWN
suspended => SUSPENDED
This suggested a machine state but it is only a powerstate. All other than UP caused a critical. But this is
not true. A power off machine can also be ok. But to be sure that it is noticed we will have a warning for
powerd off and suspended
- Changed guest to gueststate. This is more descriptive.
Removed mapping in guest state. Mapping was "running" => "Running", "notrunning" => "Not running",
"shuttingdown" => "Shutting down", "resetting" => "Resetting", "standby" => "Standby", "unknown" => "Unknown".
This was not necessary from a technical point of view. The original messages are clearly understandable.
- The guest states were not interpreted correctly. In check_vmware_api.pl all states different from running
caused a "Critical" error. But this is nonsense. A planned shutted down machine is not an error. It's daily
business. But the operator should probably have a notice of that. So it causing a "Warning".
The states are (from the docs):
running -> Guest is running normally. (returns 0)
shuttingdown -> Guest has a pending shutdown command. (returns 1)
resetting -> Guest has a pending reset command. (returns 1)
standby -> Guest has a pending standby command. (returns 1)
notrunning -> Guest is not running. (returns 1)
unknown -> Guest information is not available. (returns 3)
- Rewritten subselect tools. VirtualMachineToolsStatus was deprecated. As of vSphere API 4.0
VirtualMachineToolsVersionStatus and VirtualMachineToolsRunningStatus
has to be used. So a great part of this subselect was not working.
- vm_disk_io_info()
- Minor bug in output and perfdata corrected. I/O is not in MB but in MB/s. Some
of the counters were in MB.
- Corrected help. The original on was nonsense.
- Changed all values to KB/s because so it is equal to host disk I/O and so it
it is deleverd from the API.
- help()
- Some corrections.
- host_disk_io_info()
- added total_latency.
- 08 Jan 20143 M.Fuerstenau version 0.9.2
- help()
- Some small bug fixes.
- vm_disk_io_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- host_disk_io_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- Bug fix. Usage was given without subselect but missing as subselect. Not
detected earlier due to the duplicate code.
- host_cpu_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- Added usage as subselect.
- vm_cpu_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- Added usage as subselect.
- host_mem_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- swapused
- I swapused is a subselect there should be enhanced information about
the virtual machines and should be available. If this won't work
nothing will happen. In the past this caused a critical error which
is nonsense here.
- memctl
- Same as swapused
- vm_mem_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- Added vmmemctl.average (memctl) to monitor balloning.
- 16 Jan 2014 M.Fuerstenau version 0.9.3
- All modules
- Corrected typo at the end. common instead of commen
- host_storage_info()
- Removed ignored counter for whitelisted items. A typical copy and paste
b...shit.
- vm_runtime_info()
- issues
- Some bugs with the output. Corrected.
- tools
- Minor bug fixed. Previously used variable was not removed.
- host_runtime_info()
- issues.
- Some bugs with the output. Corrected.
- listvms
- output now sorted by powerstate (suspended, poweredoff, powerdon)
- Corrected some minor bugs
- dc_list_vm_volumes_info()
- Removed handing over of unnecessary parameters
- dc_runtime_info() -> dc_runtime_info.pm
- Code cleaned up and reformated
- listvms
- output now sorted by powerstate (suspended, poweredoff, powerdon)
- Added working blacklist/whitelist with the ability to use regular
expressions
- Added --alertonly here
- Added --multiline here
- listhosts
- %host_state_strings was mostly nonsense. The mapped poser states from
for virtual machines were used. Hash removed. Using now the orginal
power states from the system (from the docs):
- poweredOff -> The host was specifically powered off by the user
through VirtualCenter. This state is not a certain
state, because after VirtualCenter issues the command
to power off the host, the host might crash, or kill
all the processes but fail to power off.
- poweredOn -> The host is powered on
- standBy -> The host was specifically put in standby mode, either
explicitly by the user, or automatically by DPM. This
state is not a cetain state, because after VirtualCenter
issues the command to put the host in stand-by state,
the host might crash, or kill all the processes but fail
to power off.
- unknown -> If the host is disconnected, or notResponding, we can
not possibly have knowledge of its power state. Hence,
the host is marked as unknown.
- Added working blacklist/whitelist with the ability to use regular
expressions
- Added --alertonly here
- Added --multiline here
- listcluster
- Removed senseless perf data
- More detailed check than before
- Added working blacklist/whitelist with the ability to use regular
expressions
- Added --alertonly here
- Added --multiline here
- status
- Rewritten and reformatted
- tools
- Rewritten and reformatted
- Improved more detailed output.
expressions
- Added --alertonly here
- Added --multiline here
- 24 Jan 2014 M.Fuerstenau version 0.9.4
- Merged pull request from Sven Nierlein
- Modified hel to work with Thruk
- Added Makefile. This is optional. Calling it generates a single file
from all the modules. Maybe it is a little bit slower than the modules.
The readon for modules was speed and better maintenance.
- host_runtime_info()
- Added quotes in perfdata for temp.
- Enhanced README. Explained the differences un host_storage_info() between
the original one and this one
- host_net_info()
- Minor bugfix in output. Corrected typo.
- 29 Jan 2014 M.Fuerstenau version 0.9.5
- host_runtime_info()
- Minor bug. Corrected quotes in perfdata for temp.
- vm_net_info()
- Quotes in perfdata
- Removed VM name from output
- vm_mem_info()
- Quotes in perfdata
- vm_disk_io_info()
- Quotes in perfdata
- vm_cpu_info()
- Quotes in perfdata
- host_net_info()
- Quotes in perfdata
- host_mem_info()
- Quotes in perfdata
- host_disk_io_info()
- Quotes in perfdata
- host_cpu_info()
- Quotes in perfdata
- dc_runtime_info()
- Quotes in perfdata
- datastore_volumes_info()
- Quotes in perfdata
- 04 Feb 2014 M.Fuerstenau version 0.9.6
- host_storage_info()
- New switch --standbyok for storage systems where a standby multipath is ok
and not a warning.
- 06 Feb 2014 M.Fuerstenau version 0.9.7
- Bugfixes/Enhancements
- In some cases it might happen that no performance counters are delivered
by VMware. Especially if the version is old (4.x, 3.x). Under these
circumstances an undef was returned by the routines from process_perfdata.pm
and not handled correctly in the calling subroutines. Fixed.
Affected subroutines:
- host_cpu_info()
- vm_net_info()
- vm_mem_info()
- vm_net_info()
- vm_disk_io_info()
- vm_cpu_info()
- host_net_info()
- host_mem_info()
- host_disk_io_info()
- vm_net_info()
- Rewritten to the same structure as similar modules
- host_net_info()
- Rewritten to the same structure as similar modules
- 24 Feb 2014 M.Fuerstenau version 0.9.8
- Corrected a type in the help()
- Moved the block for constructing the full path of the sessionfile downward to the authentication
stuff to have all in one place.
- Authentication:
- To reduce amounts of login/logout events in the vShpere logfiles or a lot of open sessions using
sessionfiles the login part has been rewritten. Using session files is now the default. Only one
session file per host or vCenter is used as default
The sessionfile name is automatically set to the vSphere host or the vCenter (IP or name - whatever
is used in the check).
Multiple sessions are possible using different session file names. To form different session file
names the default name is enhenced by the value you set with --sessionfile.
NOTICE! All checks using the same session are serialized. So a lot of checks using only one session
can cause timeouts. In this case you should enhence the number of sessions by using --sessionfile
in the command definition and define the value in the service definition command as an extra argument
so it can be used in the command definition as $ARGn$.
- --sessionfile is now optional and only used to enhance the sessionfile name to have multiple sessions.
- If a session logs in it sets a lock file (sessionfilename_locked).
- The lock file is been set when the session starts and removed at the end of the plugin run.
- A newly started check looks for the lock file and waits until it is no longer there. So here we
have a serialization now. It will not hang forever due to the alarm routine.
- Fixed bug "Can't call method "unset_logout_on_disconnect"". I mixed object orientated code and classical
code. (Thanks copy & paste for this bug)
- $timeout set to 40 seconds instead of 30 to have a little longer waiting before automatic cancelling
the check to prevent unwanted cancelling due to longer waiting caused by serialization.
- 25 Feb 2014 M.Fuerstenau version 0.9.9
- Bugfix and improvement for "lost" lock files. In case of a Nagios reload (or kill -HUP) Nagios is restarted with the
same PID as before. Unfortunately Nagios sends a SIGINT or SIGTERM to the plugins. This causes the plugin
to terminate without removing the lockfile.
- So we have to catch several signals now
- SIGINT and SIGTERM. One of this will be send from Nagios
- SIGALRM. Caused by alarm(). Now with output usable in Nagios.
- Instead of generating an empty file as lock file we write the process identifier of the running plugin
process into the lock file. If a session crashes for some reason an a lock file is left we are in a
situation where signal processing won't help. But here the next run of the plugin reads the PID and checks
for the process. If there is no process anymore it will remove the lock file and create a new one.
Thanks to Simon Meggle, Consol, for the idea.
- Removed "die" for opening the authfile or the session lock file with an unless construct. The plugin will
report an "understandable" message to the monitor instead of causing an internal error code.
- vm_cpu_info() and host_cpu_info()
- Removed threshold for ready and wait. Therefore thresholds are no possible
without subselect.
- 26 Feb 2014 M.Fuerstenau version 0.9.10
- Bugfixes.
- Corrected typo in dc_runtime_info() line 660.
- Corrected typo in help().
- Corrected bug in datastore_volumes_info(). Giving absolute thresholds
for a single volume was not possible. Fixed.
- Removed print_usage(). Due to mass of parameters it is not possible to display
a short usage message. Instead of that the output of the help is included
in the package as a file.
- Updated default timeout to 90 secs. to avoid timeouts.
- Before accessing the session file (and lock file) we have a random sleep up
7 secs.. This is to avoid a concurrent access in case of a monitor restart
or a "Schedule a check of all services on this host"
- In case of a locked session file the wait loop is not fix to 1 sec any more.
Instead of this it uses a random period up to 5 sec.. So we minimize the risc
of concurrent access.
- 7 Mar 2014 M.Fuerstenau version 0.9.11
- Updated README
- Section for removing HTML tags was reworked
- Added blacklist to host_net_info() so that interfaces with -S net can
be blacklisted.
- 11 Mar 2014 M.Fuerstenau version 0.9.12
- Changed sleep() to usleep() and using now microseconds instead of seconds
- So before accessing the session file (and lock file) we now have a random sleep
up to 1500 milliseconds (default - see $ms_ts). This is to avoid a concurrent access
in case of a monitor restart or a "Schedule a check of all services on this host"
but takes much less time while having much more alternatives.
- In case of a locked session file the wait loop is not fix to a random period up to
5 sec. any more. Instead of this it uses also $ms_ts which means a max of 1.5 secs.
instead of 5.
- 3 Apr 2014 M.Fuerstenau version 0.9.13
- --trace= was not working. Fixed. Small typo.
- Removed comment sign in front of unlink around. It was there due to some
some tests and I had forgotten to remove it.
- datastore_volumes_info(). Some bugs corrected.
- Wrong percent calculation
- Wrong processing of thresholds for usedspace
- Wrong processing for thresholds which are not percent
- If threshold is in percent it is calculated in MB/GB for perfdata
because mixing percent and MB/GB doesn't make sense.
- 5 Apr 2014 M.Fuerstenau version 0.9.14
- host_runtime_info()
- Fixed some bugs with issues ignored and whitelist. Some counters were calculated
wrong
- New plausibility check to ensured that blacklist and whitelist can not be used together.
- 29 Apr 2014 M.Fuerstenau version 0.9.15
- host_mem_info(), vm_mem_info(), host_cpu_info() and vm_cpu_info().
- Sometimes it may happen on Vmware 5.5 (not seen when testing with Update 1) that getting
the perfdata for cpu and/or memory will result in an empty construct because one
or more values are not delivered. In this case we have a fallback and and every value
- dc_runtime_info()
- New option --poweredonly to list only machines which are powered on
- 29 Apr 2014 M.Fuerstenau version 0.9.15
- host_mem_info(), vm_mem_info(), host_cpu_info() and vm_cpu_info().
- Sometimes it may happen on Vmware 5.5 (not seen when testing with Update 1) that getting
the perfdata for cpu and/or memory will result in an empty construct because one
or more values are not delivered. In this case we have a fallback and and every value
- dc_runtime_info()
- New option --poweredonly to list only machines which are powered on
- 20 May 2014 M.Fuerstenau version 0.9.16
- New option --nosession.
- This was implemented for 2 reasons.
- First when testing from the commandline using this switch to avoid
waiting and timeouts while the monitor system is checking the the same host.
This is the important reason.
- Second is that some people don't like sessionfiles and prefer full logs as
it was in the past. Good ol' times.
- host_runtime_info()
- added --nostoragestatus to -S runtime -s health to avoid a double alarm
when also doing a check with -S runtime -s storagehealth for the same
host.
- dc_runtime_info()
- changed
if (($subselect eq "listcluster") || ($subselect eq "all"))
to
if (($subselect =~ m/listcluster.*$/) || ($subselect eq "all"))
This is to avoid unnecessary typos because it covers listcluster and listclusters ;-)
- datastore_volumes_info()
- Heavily reworked lot of the logical structure. There were too much changes
changes after changes which lead to bugs. Now it is cleaned up.
- cluster_list_vm_volumes_info()
- Seperate module now
- cluster_cpu_info()
- Seperate module now but still not working.
- 1 Jul 2014 M.Fuerstenau version 0.9.16a
- Unfortunately published some modules containing debugging outpu. Fixed.
- host_disk_io_info.pm
- process_perfdata.pm
- vm_disk_io_info.pm
- 20 Jul 2014 M.Fuerstenau version 0.9.17
- Removing the last multiline character (n or
) was moved
from several subroutines to the main exit in check_vmware_esx.pl.
This was based this was implemented based on a proposal of Dietmar Eberth
Affected subroutines:
- vm_runtime_info()
- host_storage_info()
- host_runtime_info()
- dc_runtime_info()
-datastore_volumes_info()
- Fixed a bug on line 139 and 172. Thanks for fixing it to Dietmar Eberth.
- Instead of
if ( $state >= 0 )
{
$alertcnt++;
}
it must be:
if ( $alertcnt > 0 )
{
$alertcnt++;
}
- Fixed a bug on line 139 and 172. Thanks for fixing it to Dietmar Eberth.
- If only one volume is selected we have a better output now. Also thanks
to Dietmar Eberth.
- 21 Jul 2014 M.Fuerstenau version 0.9.17a
- Bugfix line 139 and 172 (now 140 and 173. It must be
if ( $actual_state > 0 )
instead of
if ( $alertcnt > 0 )
- 21 Jul 2014 M.Fuerstenau version 0.9.17a
- Bugfix line 139 and 172 (now 140 and 173). It must be
if ( $actual_state > 0 )
instead of
if ( $alertcnt > 0 )
- 25 Jul 2014 M.Fuerstenau version 0.9.18
- New option --perf_free_space for checking volumes. It must be used
with --usedspace. In versions prior to 0.9.18 perfdata was always
deliverd as free space even if --usedspace was selected. From 0.9.18
on when using --usedspace perfdata is recorded as used space. To prevent
old perfdata use this option.
- Cluster - removed checks for CPU and MEM
- Both checks were senseless for alarming because there are no thresholds.
A cluster or resource group is a group of Vmware hosts. Not a logical
construct taking parts of the hosts in a resource group in a manner
that several clusters are using the same hosts. So the amount of CPU
and memory of all hosts is the CPU and memory of the cluster. Monitoring
this makes no sense because there are no thresholds for alerting. For
example 50% CPU usage of a cluster can be one host with 90%, and two
with 30% each. With an average of 50% everything seems to be ok but one
machine has definetely a problem. Same for memory.
- 21 Aug 2014 M.Fuerstenau version 0.9.19
- host_runtime_info()
- Some minor corrections in output.
- host_storage_info()
- Some corrections in output for LUNs. Using code in output was a
really stupid idea because the code (like ok,error-lostCommunication or
whatever is valid there) was interpreted as non existing HTML code.
- Some corrections in output for multipath/paths.
- Bugfix. Due to a wrong placed curly bracked the output was doubled. Fixed.
- host_runtime_info()
- Small bugfix. It may happen within the heath check that some values are not set
by VMware/hardware. In this case we have an
"[Use of uninitialized value in concatenation (.) or string ..."
To avoid this we check the values of the hash with each loop an in case a value
is not set we replace it whit the string "Unknown".
===================
check_vmware_esx.pl - a fork of check_vmware_api.pl
General
-------
Why a fork? According to my personal unterstanding Nagios, Icingia etc. are tools for
a) Monitoring and alarming. That means checking values against thresholds (internal or handed over)
b) Collecting performance data. These data, collected with the checks, like network traffic, cpu usage or so should be
interpretable without a lot of other data.
Athough check_vmware_api.pl is a great plugin it suffers from various things.
a) It acts as a monitoring plugin for Nagios etc.
b) It acts a more comfortable commandline interfacescript.
c) It collects all a lot of historical data to have all informations in one interface.
While a) is ok b) and c) needs to be discussed. b) was necessary when you had only the Windows GUI and working on Linux
meant "No interface". this is obsolete now with the new webgui.
c) will be better used by using the webgui because historical data (in most situations) means adjusted data. Most of these
collected data is not feasible for alerting but for analysing performance gaps.
So as a conclusion collecting historic performance data collected by a monitored system should not be done using Nagios,
pnp4nagios etc.. It should be interpreted with the approriate admin tools of the relevant system. For vmware it means use
the (web)client for this and not Nagios. Same for performance counters not self explaining.
Example:
Monitoring swapped memory of a vmware guest system seems to makes sense. But on the second look it doesn't because on Nagios
you do not have the surrounding conditions in one view like
- the number of the running guest systems on the vmware server.
- the swap every guest system needs
- the total space allocated for all systems
- swap/memory usage of the hostcheck_vmware_esx.pl
- and a lot more
So monitoring memory of a host makes sense but the same for the guest via vmtools makes only a limited sense.
But this were only a few problems. Furthermore we had
- misleading descriptions
- things monitored for hosts but not for vmware servers
- a lot of absolutely unnesseary performance data (who needs a performane graph for uptime?)
- unnessessary output (CPU usage in Mhz for example)
- and a lot more.
This plugin is old and big and cluttered like the room of my little son. So it was time for some house cleaning.
I try to clean up the code of every routine, change things and will try to ease maintenance of the code.
The plugin is not really ready but working. Due to the mass
of options the help module needs work.
See history for changes
One last notice for technical issues. For better maintenance (and partly improved runtime)
I have decided to modularize the plugin. It makes patching a lot easier. Modules which must be there every time are
included with use, the others are include using require at runtime. This ensures that only
that part of code is loaded which is needed.
History:
========
- 22 Mar 2012 M.Fuerstenau
- Started with actual version 0.5.0
- Impelemented check_esx3new2.diff from Simon (simeg / simmerl)
- Reimplemented the changes of Markus Obstmayer for the actual version
- Comments within the code inform you about the changes
- It may happen that controllers has been found which are not active.
Therefor around line 2300 the following was added:
# M.Fuerstenau - additional to avoid using inactive controllers --------------
elsif (uc($dev->status) eq "3")
{
$status = 0;
}
#----------------------------------------------------------------------------
else
{
$state = 3;
}
$actual_state = Nagios::Plugin::Functions::max_state($actual_state, $status);
}
$perfdata = $perfdata . " adapters=" . $count . ";"$perf_thresholds . ";;";
# M.Fuerstenau - changed the output a little bit
$output .= $count . " of " . @{$storage->storageDeviceInfo->hostBusAdapter} . " defined/possible adapters online, ";
- 30 Mar 2012 M.Fuerstenau
- added --ignore_unknown. This maps 3 to 0. Why? You have for example several host adapters. Some are reported as
unknown by the plugin because they are not used or have not the capability to reports something senseful.
- 02 Apr 2012 M.Fuerstenau
- _info (Adapter/LUN/Path). Removed perfdata. To count such items and display as perfdata doesn't make sense
- Changed PATH to MPATH and help from "path - list logical unit paths" to "mpath - list logical unit multipath info" because
it is NOT an information about a path - it is an information about multipathing.
- 08 Jan 2013 M.Fuerstenau
- Removed installation informations for the perl SDK from VMware. This informations are part of the SDK and have nothing to do
with this plugin.
- Replaced global variables with my variables. Instead of "define every variable on the fly as needed it is a good practice
to define variables at the beginning and place a comment within it. It gives you better readability.
- 22 Jan 2013 M.Fuerstenau
- Merged with the actual version from op5. Therfore all changes done to the op5 version:
- 2012-05-28 Kostyantyn Hushchyn
Rename check_esx3 to check_vmware_api(Fixed issue #3745)
- 2012-05-28 Kostyantyn Hushchyn
Minor cosmetic changes
- 2012-05-29 Kostyantyn Hushchyn
Clear cluster failover perfdata units, as it describes count of possible failures
to tolerate, so can't be mesured in MB
- 2012-05-30 Kostyantyn Hushchyn
Minor help message changes
- 2012-05-30 Kostyantyn Hushchyn
Implemented timeshift for cluster checks, which could fix data retrievel issues. Small refactoring.
- 2012-05-31 Kostyantyn Hushchyn
Remove dependency on inteval value for cluster checks, which allows to run commands that doesn't require historical intervals
- 2012-05-31 Kostyantyn Hushchyn
Remove unnecessary/unimplemented function which caused cluster effectivecpu subcheck to fail
- 2012-06-01 Kostyantyn Hushchyn
Hide NIC status output for net check in case of empty perf data result(Fixed issue #5450)
- 2012-06-07 Kostyantyn Hushchyn
Fixed manipulation with undefined values, which caused perl interpreter warnings output
- 2012-06-08 Kostyantyn Hushchyn
Moved out global variables from perfdata functions. Added '-M' max sample number argument, which specify maximum data count to retrive.
- 2012-06-08 Kostyantyn Hushchyn
Added help text for Cluster checks.
- 2012-06-08 Kostyantyn Hushchyn
Increment version number Kostyantyn Hushchyn
- 2012-06-11 Kostyantyn Hushchyn
Reimplemented csv parser to process all values in sequence. Now all required functionality for max sample number argument are present in the plugin.
- 2012-06-13 Kostyantyn Hushchyn
Fixed cluster failover perf counter output.
- 2012-06-22 Kostyantyn Hushchyn
Added help message for literal values in interval argument.
- 2012-06-22 Kostyantyn Hushchyn
Added nicknames for intervals(-i argument), which helps to provide correct values in case you can not find them in GUI.
Supported values are: r - realtime interval, h
- 2012-07-02 Kostyantyn Hushchyn
Reimplemented Datastore checking in Datacenter using different approach(Might fix issue #5712)
- 2012-07-06 Kostyantyn Hushchyn
Fixed Datacenter runtime check Kostyantyn Hushchyn
- 2012-07-06 Kostyantyn Hushchyn
Fixed Datastore checking in Datacenter(Might fix issue #5712)
- 2012-07-09 Kostyantyn Hushchyn
Added help info for Host runtime 'sensor' subcheck
- 2012-07-09 Kostyantyn Hushchyn
Added Host runtime subcheck to threshold sensor data
- 2012-07-09 Kostyantyn Hushchyn
Fixed Host temperature subcheck causing perl interpreter messages output
- 2012-07-10 Kostyantyn Hushchyn
Added listall option to output all available sensors. Sensor name now trieted as regexp, so result will be outputed for the first match.
- 2012-07-26 Fixed issue which prevents plugin... v2.8.8 v2.8.8-beta1 Kostyantyn Hushchyn
Fixed issue which prevents plugin from executing under EPN(Fixed issue #5796)
- 2012-09-03 Kostyantyn Hushchyn
Implemented plugin timeout(returns 3).
- 2012-09-05 Kostyantyn Hushchyn
Added storage refresh functionality in case when it's present(Fixed issue #5787)
- 2012-09-21 Kostyantyn Hushchyn
Added check for dead pathes, which generates 2 in case when at least one is present(Fixed issue #5811)
- 2012-09-25 Kostyantyn Hushchyn
Changed comparison logic in storage path check
- 2012-09-26 Kostyantyn Hushchyn
Fixed 'Global symbol normalizedPathState requires explicit package name'
- 2012-10-02 Kostyantyn Hushchyn
Changed timeshift argument type to integer, so that non number values will be treated as invalid.
- 2012-10-02 Kostyantyn Hushchyn
Changed to a conditional datastore refresh(Reduce overhead of solution suggested in issue #5787)
- 2012-10-05 Kostyantyn Hushchyn
Updated description so now almost all options are documented, though somewhere should be documented arguments like timeshift(-T),
max samples(-M) and interval(-i) (Solve ticket #5950)
- 31 Jan 2013 M.Fuerstenau
- Replaced most die with a normal if statement and an exit.
- 1 Feb 2013 M.Fuerstenau
- Replaced unless with if. unless was only used eight times in the program. In all other statements we had an if statement
with the appropriate negotiation for the statement.
- 5 Feb 2013 M.Fuerstenau
- Replaced all add_perfdata statements with simple concatenated variable $perfdata
- 6 Feb 2013 M.Fuerstenau
- Corrected bug. Name of subroutine was sub check_percantage but thist was a typo.
- 7 Feb 2013 M.Fuerstenau
- Replaced $percc and $percw with $crit_is_percent and $warn_is_percent. This was just cosmetic for better readability.
- Removed check_percentage(). It was replaced by two one liners directly in the code. Easier to read.
- The only codeblocks using check_percentage() were the blocks checking warning and critical. But unfortunately the
plausability check was not sufficient. Now it is tested that no other values than numbers and the % sign can be
submitted. It is also checked that in case of percent the values are in a valid level between 0 and 100
- 12 Feb 2013 M.Fuerstenau
- Replaced literals like CRITICAL with numerical values. Easier to type and anyone developing plugins should be
safe with the use
- Replaced $state with $actual_state and $res with $state. More for cosmetical issues but the state is returned
to Nagios.
- check_against_threshold from Nagios::Plugin replaced with a little own subroutine check_against_threshold.
- Nagios::Plugin::Functions::max_state replaced with own routine check_state
- 14 Feb 2013 M.Fuerstenau
- Replaced hash %STATUS_TEXT from Nagios::Plugin::Functions with own hash %status2.
- 15 Feb 2013 M.Fuerstenau
- Own help (print_help()) and usage (print_usage()) function.
- Nagios::plugin kicked finally out.
- Mo more global variables.
- 25 Feb 2013 M.Fuerstenau
- $quickstats instead of $quickStats for better readability.
- 5 Mar 2013 M.Fuerstenau
- Removed return_cluster_DRS_recommendations() because for daily use this was more of an exotical feature
- Removed --quickstats for host_cpu_info and dc_cpu_info because quickstats is not a valid option here.
- 6 Mar 2013 M.Fuerstenau
- Replaced -o listitems with --listitems
- 8 Mar 2013 M.Fuerstenau
- --usedspace replaces -o used. $usedflag has been replaced by $usedflag.
- --listvms replaces -o listvm. $outputlist has been replaced by $listvms.
- --alertonly replaces -o brief. $briefflag has been replaced by $alertonly.
- --blacklistregexp replaces -o blacklistregexp. $blackregexpflag has been replaced by $blacklistregexp.
- --isregexp replaces -o regexp. $regexpflag has been replaced by $isregexp.
- 9 Mar 2013 M.Fuerstenau
- Main selection is now transfered to a subrouting main_select because after
a successfull if statement the rest can be skipped leaving the subroutine
with return
- 19 Mar 2013 M.Fuerstenau
- Reformatted and cleaned up a lot of code. Variable definitions are now at the beginning of each
subroutine instead of defining them "on the fly" as needed with "my". Especially using "my" for
definition in a loop is not goog coding style
- 21 Mar 2013 M.Fuerstenau
- --listvms removed as extra switch. Ballooning or swapping VMs will always be listed.
- Changed subselect list(vm) to listvm for better readability. listvm was accepted before (equal to list)
but not mentioned in the help. To have list or listvm for the same is a little bit exotic. Fixed this inconsistency.
- 22 Mar 2013 M.Fuerstenau
- Removed timeshift, interval and maxsamples. If needed use original program from op5.
- 25 Mar 2013 M.Fuerstenau
- Removed $defperfargs because no values will be handled over. Only performance check that needed another which
needed another sampling invel was cluster. This is now fix with 3000.
- 11 Apr 2013 M.Fuerstenau
- Rewritten and cleaned subroutine host_mem_info. Removed $value1 - $value5. Stepwise completion of $output makes
this unsophisticated construct obsolete.
- 16 Apr 2013 M.Fuerstenau
- Stripped down vm_cpu_info. Monitoring CPU usage in Mhz makes no sense under normal circumstances
Mhz is no valid unit for performance data according to the plugin developer guide. I have never found
a reason to monitor wait time or ready time in a normal alerting evironment. This data has some interest
for performance analysis. But this can be done better with the vmware tools.
- Rewritten and cleaned subroutine vm_mem_info. Removed $value1 - $value5. Stepwise completion of $output makes
this unsophisticated construct obsolete.
- 24 Apr 2013 M.Fuerstenau
- Because there is a lot of different performance counters for memory in vmware we ave changed something to be
more specific.
- Enhenced explanations in help.
- Changed swap to swapUSED in host_mem_info().
- Changed usageMB to CONSUMED in host_mem_info(). Same for variables.
- Removed overall in host_mem_info(). After reading the documentation carefully the addition of consumed.average + overhead.average
seems a little bit senseless because consumed.average includes overhead.average.
- Changed usageMB to CONSUMED in vm_mem_info(). Same for variables.
- Removed swapIN and swapOUT in vm_mem_info(). Not so sensefull for Nagios alerting because it is hard to find
valid thresholds
- Removed swap in vm_mem_info(). From the vmware documentation:
"Current amount of guest physical memory swapped out to the virtual machine's swap file by the VMkernel. Swapped
memory stays on disk until the virtual machine needs it. This statistic refers to VMkernel swapping and not
to guest OS swapping. swapped = swapin + swapout"
This is more an issue of performance tuning rather than alerting. It is not swapping inside the virtual machine.
it is not possible to do any alerting here because (especially with vmotion) you have no thresholds.
- Removed OVERHEAD in vm_mem_info(). From the vmware documentation:
"Amount of machine memory used by the VMkernel to run the virtual machine."
So using this we have a useless information about a virtual machine because we have no valid context and we
have no valid thresholds. More important is overhead for the host system. And if we are running in problems here
we have to look which machine must be moved to another host.
- As a result of this overall in vm_mem_info() makes no sense.
- 25 Apr 2013 M.Fuerstenau
- Removed swap in vm_mem_info(). From vmware documentation:
"Amount of guest physical memory that is currently reclaimed from the virtual machine through ballooning.
This is the amount of guest physical memory that has been allocated and pinned by the balloon driver."
So here we have again data which makes no sense used alone. You need the context for interpreting them
and there are no thresholds for alerting.
- 29 Apr 2013 M.Fuerstenau
- Renamed $esx to $esx_server. This is only for cosmetics and better reading of the code.
- Reimplmented subselect ready in vm_cpu_info and implemented it new in host_cpu_info.
From the vmware documentation:
"Percentage of time that the virtual machine was ready, but could not get scheduled
to run on the physical CPU. CPU ready time is dependent on the number of virtual
machines on the host and their CPU loads."
High or growing ready time can be a hint CPU bottlenecks (host and guest system)
- Reimplmented subselect wait in vm_cpu_info and implemented it new in host_cpu_info.
From the vmware documentation:
"CPU time spent in wait state. The wait total includes time spent the CPU Idle, CPU Swap Wait,
and CPU I/O Wait states. "
High or growing wait time can be a hint I/O bottlenecks (host and guest system)
- 30 Apr 2013 M.Fuerstenau
- Removed subroutines return_dc_performance_values, dc_cpu_info, dc_mem_info, dc_net_info and dc_disk_io_info.
Monitored entity was view type HostSystem. This means, that the CPU of the data center server is monitored.
The data center server (vcenter) is either a physical MS Windows server (which can be monitored better
directly with SNMP and/or NSClient++) or the new Linux based appliance which is a virtual machine and
can be monitored as any virtual machine. The OS (Linux) on that virtual machine can be monitored like
any standard Linux.
- 5 May 2013 M.Fuerstenau
- Revised the code of dc_list_vm_volumes_info()
- 9 May 2013 M.Fuerstenau
- Revised the code of host_net_info(). The function was devided in two parts (like others):
- subselects
- else which included all.
So most of the code existed twice. One for each subselect and nearly the same for all together.
The else block was removed and in case no subselect was defined we defined all as $subselect.
With the variable set to all we can decide wether to leave the function after a subselect section
has been processed or stay and enhance $output and $perfdata. So the code is more clear and
has nearly half the lines of code left.
- Removed KBps as unit in performance data. This unit is not specified in the plugin developer
guide. Performance data is now just a number without a unit. Adding the unit has to be done
in the graphing tool (like pnp4nagios).
- Removed the number of NICs as performance data. A little bit senseless to have those data here.
- 10 May 2013 M.Fuerstenau
- Revised the code of vm_net_info(). Same changes as for host_net_info() exept the NIC section.
This is not available for VMs.
- 14 May 2013 M.Fuerstenau
- Replaced $command and $subselect with $select and $subselect. Therfore also the options --command
--subselect changed to --select and --subselect. This has been done to become it more clear.
In fact these items where no commands (or subselects). It were selections from the amount of
performance counters available in vmware.
- 15 May 2013 M.Fuerstenau
- Kicked out all (I hope so) code for processing historic data from generic_performance_values().
generic_performance_values() is called by return_host_performance_values(), return_host_vmware_performance_values()
and return_cluster_performance_values() (return_cluster_performance_values() must be rewritten now).
The code length of generic_performance_values() was reduced to one third by doing this.
- 6 Jun 2013 M.Fuerstenau
- Substituted commandline option for select -l with -S. Therefore -S can't be used as option for the sessionfile
Only --sessionfile is accepted nor the name of the sessionfile.
- Corrected some bugs in check_against_threshold()
- Ensured that in case of thresholds critical must be greater than warning.
- 11 Jun 2013 M.Fuerstenau
- Changed select option for datastore from vmfs to volumes because we will have volumes on nfs AND vmfs.
- Changed output for datastore check to use the option --multiline. This will add a (unset -> default) for
every line of output. If set it will use HTML tag
.
The option --multiline sets a
tag instead of . This must be filtered out
before using the output in notifications like an email. sed will do the job:
sed 's/<[Bb][Rr]>/& /g' | sed 's/<[^<>]*>//g'
Example:
# 'notify-by-email' command definition
define command{
command_name notify-by-email
command_line /usr/bin/printf "%b" "Message from Nagios: Notification Type: $NOTIFICATIONTYPE$ Service: $SERVICEDESC$ Host: $HOSTNAME$ Hostalias: $HOSTALIAS$ Address: $HOSTADDRESS$ State: $SERVICESTATE$ Date/Time: $SHORTDATETIME$ Additional Info: $SERVICEOUTPUT$ $LONGSERVICEOUTPUT$" | sed 's/<[Bb][Rr]>/& /g' | sed 's/<[^<>]*>//g' | /bin/mail -s "** $NOTIFICATIONTYPE$ alert - $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}
- 13 Jun 2013 M.Fuerstenau
- Replaced a previous change because it was wrong done:
- --listvms replaced by subselect listvms
- 14 Jun 2013 M.Fuerstenau
- Some minor corrections like a doubled chop() datastore_volumes_info()
- Added volume type to datastore_volumes_info(). So you can see whether the volume is vmfs (local or SAN) or NFS.
- variables like $subselect or $blacklist are global there is no need to handle them over to subroutines like
($result, $output) = vm_cpu_info($vmname, local_uc($subselect)) . For $subselect we have now one uppercase
(around line 580) instead of having one with each call in the main selection.
- Later on I renamed local_uc to local_lc because I recognized that in cases the subselect is a volume name
upper cases won't work.
- replaced last -o $addopts (only for the name of a sensor) with --sensorname
- 18 Jun 2013 M.Fuerstenau
- Rewritten and cleaned subroutine host_disk_io_info(). Removed $value1 - $value7. Stepwise completion of $output makes
this unsophisticated construct obsolete.
- Removed use of performance thresholds in performance data when used disk io without subselect because threshold
can only be used for on item not for all. Therefore they weren't checked in that section. Senseless.
- Changed the output. Opposite to vm_disk_io_info() most vlues in host_disk_io_info() are not transfer rates
but latency in milliseconds. The output is now clearly understandable.
- Added subselect read. Average number of kilobytes read from the disk each second. Rate at which data is read
from each LUN on the host.read rate = # blocksRead per second x blockSize.
- Added subselect write. Average number of kilobytes written to disk each second. Rate at which data is written
to each LUN on the host.write rate = # blocksRead per second x blockSize
- Added subselect usage. Aggregated disk I/O rate. For hosts, this metric includes the rates for all virtual
machines running on the host.
- 21 Jun 2013 M.Fuerstenau
- Rewritten and cleaned subroutine vm_disk_io_info(). Removed $value1 - $valuen. Stepwise completion of $output makes
this unsophisticated construct obsolete.
- Removed use of performance thresholds in performance data when used disk io without subselect because threshold
can only be used for on item not for all. Therefore they weren't checked in that section. Senseless.
- 24 Jun 2013 M.Fuerstenau
- Changed all .= (for example $output .= $xxx.....) to = $var... (for example $output = $output . $xxx...). .= is shorter
but the longer form of notification is better readable. The probability of overlooking the dot (especially for older eyes
like mine) is smaller.
- 07 Aug 2013 M.Fuerstenau
- Changed "eval { require VMware::VIRuntime };" to "use VMware::VIRuntime;". The eval construct
made no sense. If the module isn't available the program will crash with a compile error.
- Removed own subroutine format_uptime() only used by host_uptime_info(). The complete work of this function
was done converting seconds to days, hours etc.. Instead of the we use the perl module Time::Duration.
So instead of
$output = "uptime=" . format_uptime($value);
we simply use
$output = "uptime=" . duration_exact($value);
- Removed perfdata from host_uptime_info(). Perfdata for uptime seems senseless. Same for threshold.
- Started modularization of the plugin. The reason is that it is much more easier to
patch modules than to patch a large file.
- Variables used in that functions which are defined on the top level
with "my" must now be defined with "our".
BEWARE! Using "our" with unknown modules can lead to curious results if
in this functions are variables with the same name. But in this
case it is no risk because the modules are not generic. We have only
broken the plugin in handy pieces.
- Made an seperate modules:
- help.pm -> print_help()
- process_perfdata.pm -> get_key_metrices()
-> generic_performance_values()
-> return_host_performance_values()
-> return_host_vmware_performance_values()
-> return_cluster_performance_values()
-> return_host_temporary_vc_4_1_network_performance_values()
- host_cpu_info.pm -> host_cpu_info()
- host_mem_info.pm -> host_mem_info()
- host_net_info.pm -> host_net_info()
- host_disk_io_info.pm -> host_disk_io_info()
- datastore_volumes_info.pm -> datastore_volumes_info()
- host_list_vm_volumes_info.pm -> host_list_vm_volumes_info()
- host_runtime_info.pm -> host_runtime_info()
- host_service_info.pm -> host_service_info()
- host_storage_info.pm -> host_storage_info()
- host_uptime_info.pm -> host_uptime_info()
- 13 Aug 2013 M.Fuerstenau
- Moved host_device_info to host_mounted_media_info. Opposite to it's name
and the description this function wasn't designed to list all devices
on a host. It was designed to show host cds/dvds mounted to one or more
virtual machines. This is important for monitoring because a virtual machine
with a mount cd or dvd drive can not be moved to another host.
- Made an seperate modules:
- host_mounted_media_info.pm -> host_mounted_media_info()
- 19 Aug 2013 M.Fuerstenau
- Added SOAP check from Simon Meggle, Consol. Slightly modified to fit.
- Added isblacklisted and isnotwhitelisted from Simon Meggle, Consol. . Same as above.
Following subroutines or modules are affected:
- datastore_volumes_info.pm
- host_runtime_info.pm
- Enhanced host_mounted_media_info.pm
- Added check for host floppy
- Added isblacklisted and isnotwhitelisted
- Added $multiline
- 21 Aug 2013 M.Fuerstenau
- Reformatted and cleaned up host_runtime_info().
- A lot of bugs in it.
- 17 Aug 2013 M.Fuerstenau
- Minor bug fix.
- $subselect was always converted to lower case characters.
This is correct exect $subselect contains a name (e.g. volumes). Volume names
can contain upper and lower letters. Fixed.
- datastore_volumes_info.pm had my ($datastore, $subselect) = @_; as second line
This was incorrect because "global" variables (defined as our in the main program)
are not handled over via function call. (Yes - may be handling over maybe more ok
in the sense of structured programming. But really - does handling over and giving back
a variable makes the code so much clearer? More a kind of philosophy :-)) )
- 3 Oct 2013 M.Fuerstenau
- host_runtime_info().
- Filtered out the sensor type "software components". Makes no sense for alerting.
- 01 Nov 2013 M.Fuerstenau version 0.8.10
- removed return_host_temporary_vc_4_1_network_performance_values() from
process_perfdata.pm. Not needed in ESX version 5 and above.
Affected subroutine:
host_net_info().
- Bug fixed in generic_performance_values(). Unfortunaltely I had moved
my @values = () to main function. Therefore instead of containing a
new array reference with each run the new references were added to the array
but only the first one was processed. Thanks to Timo Weber for discovering this bug.
- 20 Nov 2013 M.Fuerstenau version 0.8.11
- check_state(). Bugfix. Logical error. Complete rewrite.
- host_net_info()
- Minor bugfix. Added exit 2 in case of unallowed thresholds
- Simplified the output. Instead of doing it for every subselect (or not
if subselect=all for selecting all info) we have a new helper variable
called $true_sub_sel.
0 means not a true subselect.
1 means a true subselect (including all)
- host_runtime_info().
- Filtered out the sensor type "software components". Makes no sense for alerting.
- The complete else tree (no subselect) was scrap due to the fact that the return code
was always ok. There would never be any alarm. Kicked out.
- The else tree was replaced by a $subselect=all as in host_net_info().
- Same for output.
- subselect listvms.
- The (OK) in the output was a hardcoded. Replaced by the deliverd value (UP). (in my oipinion senseless)
- Removed perfdata. The number of virtual machines as perfdata doesn't
make so much sense.
- Rearranged the order of subselects. Must be the same as the statements
in the kicked out else sequence to get the same order in output
- connection state.
From the docs:
connected Connected to the server. For ESX Server, this is always
the setting.
disconnected The user has explicitly taken the host down. VirtualCenter
does not expect to receive heartbeats from the host. The
next time a heartbeat is received, the host is moved to
the connected state again and an event is logged.
notResponding VirtualCenter is not receiving heartbeats from the server
. The state automatically changes to connected once
heartbeats are received again. This state is typically
used to trigger an alarm on the host.
In the past the presset of returncode was 2. It was set to 0 in case of a
connected. But disconnected doesn' mean a critical error. It means main-
tenance or somesting like that. Therefore we now return a 1 for warning.
notResonding will cause a 2 for critical.
- Kicked out maintenance info in runtime summary and as subselect. In the beginning of
the function is a check for maintenance. In the original program in this case
the program will be left with a die which caused a red alert in Nagios. Now an info
is displayed and a return code of 1 (warning) is deliverd because a maintenance is
regular work. Although monitoring should take notice of it. Therefore a warning.
Therefor the maintenence check in the else tree was scrap. It was never reached.
- listvms
In case of no VMs the plugin returned a critical. But this is not correct. No VMs
on a host is not an error. It is simply what it says: No VMs.
- Replaced --listitems and --listall with --listsensors because there were two options
for the same use.
- Later on I decided to kick out the complete sensorname construction. To monitor a seperate sensor
by name (a string which ist different for each sensor) seems not to make so much sense. To monitor
sensors by type gave no usefull information (exept temperature which is still monitored. All usefull
informations are in the health section. Better to implement some out here if needed.
renamed -s temperature to -s temp (because I am lazy).
- Changed output of --listsensors. | is the seperation symbol for performance data. So it should not be
used in output strings.
- Same for the error output of sensors in case of a problem.
- subselect=health. Filter out sensors which have not valid data. Often a sensor is
reckognizedby vmware but has not the ability to report something senseful. In this
case an unknown is reported and the message "Cannot report on the current health state of
the element". This it can be skipped.
- Bugfix for vm_net_info(). It worked perfectly for -H but produced bullshit with the -D option.
With -D in PerfQuerySpec->new(...) intervalId and maxSample must be set. Default was 20 for intervalId
and 1 for maxSample
- datastore_volumes_info()
- New option --gigabyte
- Fixed bug for treating name as regexp (--isregexp).
- Fixed bug in perfdata (%:MB). Is now MB or GB.
- If no single volume is selected warning/critical threshold can be given
only in percent.
- Fixed bug in regexp for blacklists and whitelists and replace --blacklistregexp and --whitelistregexp with
--isregexp. Used in
- host_mounted_media_info()
- datastore_volumes_info()
- host_runtime_info()
All subroutines revised later will include it automatically
- Fixed bug in regexp for blacklists and whitelists and replace --blacklistregexp and --whitelistregexp with
- Opposite to the op5 original blacklists and whitelists can now contain true reular expressions.
- help.pm rewritten. It was too much output. Now the user has several choices:
-h|--help=
-h|--help=
-h|--help=
-h|--help=
-h|--help=
- host_service_info() rewritten.
- No longer a list of services via subselect.
- Instead full blacklist/whitelist support
- With --isregexp blacklist/whitelist items are interpreted as regular expressions
- Changed switch of blacklist from -x to -B
- Changed switch of whitelist from -y to -W
- main_select(). Something for the scatterbrained of us. Replaced string compare (eq,ne etc.) with
a regexp pattern matching. So it is possible to type in services or Services instead of service.
- 29 Nov 2013 M.Fuerstenau version 0.8.13
- In all functions checking host for maintenance we had the following contruction:
if (uc($host_view->get_property('runtime.inMaintenanceMode')) eq "TRUE")
This is quite stupid. runtime.inMaintenanceMode is xsd:boolean wich means true or false (in
lower cas letters. So a uc() make no sense. Removed
- Bypass SSL certificate validation - thanks Robert Karliczek for the hint
- Added --sslport if any port other port than 443 is wanted. 443 is used as default.
- Rewritten authorization part. Sessionfiles are working now.
- Updated README.
- 03 Dec 2013 M.Fuerstenau version 0.8.14
- host_runtime_info()
- Fixed minor bug. In case of an unknown with sensors the unknown was always mapped to
a warning. This wasn't sensefull under all circumstances. Now it is only mapped to warning
if --ignoreunknow is not set.
- README
- Updated installation notes
- Optional pathname for sessionfile
- 10 Dec 2013 M.Fuerstenau version 0.8.15
- datastore_volumes_info()
- Changed "For all volumes" to "For all selected volumes". This should fit all. The subroutine datastore_volumes_info
is called from several subroutines and to make a difference here for a single volume or more volumes will cause
a lot of unnecessary work just for cosmetics.
- Fixed typo in error message.
- Modified error messages
- Removed plausibility check whether critical must be greater than warning. In case of freespace for example it must
be the other way round. The plausibility check was nice but too complicated for all the different conditions.
- check_against_threshold() - Cleaned up and partially rewritten.
- 12 Dec 2013 M.Fuerstenau version 0.8.16
- datastore_volumes_info()
- Some small fixes
- Changed the output for OK from "For all selected volumes" to "OK for all seleted volumes."
- Same for error plus an error counter for the alarms
- Parameter usedspace was ignored on volume checks because a local variable defined with my instead of a more global
one. Changed it from my to our fixed it. Thanks to dgoetz for reporting (and fixing) this.
- Capacity is now displayed and deliverd as perfdata
- Main selection - GetOptions. Unce upon a time I had kicked out $timeout unintended. Fixed now. Thanks to
Andreas Daubner for the hint.
- 12 Dec 2013 M.Fuerstenau version 0.8.17
- host_net_info()
- Changed output
- Added total number of NICs
- 25 Dec 2013 M.Fuerstenau version 0.9.0
- help()
- Removed -v/--verbose. The code was not debugging the plugin but not for working with it.
- host_storage_info()
- Removed optional switch for adaptermodel. Displaying the adapter model is default now.
- Removed upper case conversion for status. The status was converted (for examplele online to ONLINE) and
compared with a upper case string like "ONLINE". Sensefull like a second asshole.
- Added working blacklist/whitelist.
- Blacklist: blacklisted adapters will not be displayed.
- Whitelist: only whitelisted adapters will be displayed.
- Removed perfdata for the number of hostbusadapters. These perfdata was absolutely senseless.
- Status for hostbusadapters was not checked correctly. The check was only done for online and unknown but NOT(!!)
for offline and unbound.
- LUN states were not correct. UNKNOWN is not a valid state. Not all states different from unknown are
supposed to be critical. From the docs:
degraded One or more paths to the LUN are down, but I/O is still possible. Further
path failures may result in lost connectivity.
error The LUN is dead and/or not reachable.
lostCommunication No more paths are available to the LUN.
off The LUN is off.
ok The LUN is on and available.
quiesced The LUN is inactive.
timeout All Paths have been down for the timeout condition determined by a
user-configurable host advanced option.
unknownState The LUN state is unknown.
- Removed number of LUNs as perfdata. Senseless (again).
- In the original selection for the displayed LUN the displayName was used first, then the deviceName and
the last one was the canonical name. Unfortunately in the GUI SCSI ID, canonical name an runtime name is
displayed. So using the freely configurable DisplayName is senseless. The device name is formed from the
path (/vmfs/devices/disks/) followed by the canonical name. So it is either senseless.
- The real numeric LUN number wasn'nt display. Fixed. Output is now LUN, canonical name, everything from
the display name not equal canonical name and status.
- Complete rewrite of the paths part. Only the state of the multipath was checked but not the state of the
paths. So a multipath can be "Active" which is ok but the second line is dead. So if the active path becomes
dead the failover won't work.There must be an alarm for a standby path too. It is now grouped in the output.
- Multiline support for this.
- 03 Jan 20143 M.Fuerstenau version 0.9.1
- check_vmware_esx.pl
Added new flag --ignore_warning. This will map a warning to ok.
- host_runtime_info() - some minor changes
- Changed state to powerstate. In the original version the power state was mapped:
poweredOn => UP
poweredOff => DOWN
suspended => SUSPENDED
This suggested a machine state but it is only a powerstate. All other than UP caused a critical. But this is
not true. A power off machine can also be ok. But to be sure that it is noticed we will have a warning for
powerd off and suspended
- Perfdata changed.
- vm_up -> vm_powerdon
- New: vm_poweroff
- New: vm_suspended
- vm_runtime_info() -> vm_runtime_info.pm
- Removed a lot of unnecessary variables and hashes. Rewritten a lot.
- Connection state. Only "connected" was checked. All other caused a critical error without a usefull
message.
This wa a little bit incomplete. Corrected. States delivered from VMware are connected, disconnected,
inaccessible, invalid and orphaned.
- Removed cpu. VirtualMachineRuntimeInfo maxCpuUsage (in Mhz) doesn't make so much sense for monitoring/alerting.
See VMware docs for further information for this performance counter.
- Removed mem. VirtualMachineRuntimeInfo maxMemoryUsage doesn't make so much sense for monitoring/alerting.
See VMware docs for further information for this performance counter.
- Changed state to powerstate. In the original version the power state was mapped:
poweredOn => UP
poweredOff => DOWN
suspended => SUSPENDED
This suggested a machine state but it is only a powerstate. All other than UP caused a critical. But this is
not true. A power off machine can also be ok. But to be sure that it is noticed we will have a warning for
powerd off and suspended
- Changed guest to gueststate. This is more descriptive.
Removed mapping in guest state. Mapping was "running" => "Running", "notrunning" => "Not running",
"shuttingdown" => "Shutting down", "resetting" => "Resetting", "standby" => "Standby", "unknown" => "Unknown".
This was not necessary from a technical point of view. The original messages are clearly understandable.
- The guest states were not interpreted correctly. In check_vmware_api.pl all states different from running
caused a "Critical" error. But this is nonsense. A planned shutted down machine is not an error. It's daily
business. But the operator should probably have a notice of that. So it causing a "Warning".
The states are (from the docs):
running -> Guest is running normally. (returns 0)
shuttingdown -> Guest has a pending shutdown command. (returns 1)
resetting -> Guest has a pending reset command. (returns 1)
standby -> Guest has a pending standby command. (returns 1)
notrunning -> Guest is not running. (returns 1)
unknown -> Guest information is not available. (returns 3)
- Rewritten subselect tools. VirtualMachineToolsStatus was deprecated. As of vSphere API 4.0
VirtualMachineToolsVersionStatus and VirtualMachineToolsRunningStatus
has to be used. So a great part of this subselect was not working.
- vm_disk_io_info()
- Minor bug in output and perfdata corrected. I/O is not in MB but in MB/s. Some
of the counters were in MB.
- Corrected help. The original on was nonsense.
- Changed all values to KB/s because so it is equal to host disk I/O and so it
it is deleverd from the API.
- help()
- Some corrections.
- host_disk_io_info()
- added total_latency.
- 08 Jan 20143 M.Fuerstenau version 0.9.2
- help()
- Some small bug fixes.
- vm_disk_io_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- host_disk_io_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- Bug fix. Usage was given without subselect but missing as subselect. Not
detected earlier due to the duplicate code.
- host_cpu_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- Added usage as subselect.
- vm_cpu_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- Added usage as subselect.
- host_mem_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- swapused
- I swapused is a subselect there should be enhanced information about
the virtual machines and should be available. If this won't work
nothing will happen. In the past this caused a critical error which
is nonsense here.
- memctl
- Same as swapused
- vm_mem_info()
- Removed duplicated code. (if subselect ..... else ....)
The code was 90% identical.
- Added vmmemctl.average (memctl) to monitor balloning.
- 16 Jan 2014 M.Fuerstenau version 0.9.3
- All modules
- Corrected typo at the end. common instead of commen
- host_storage_info()
- Removed ignored counter for whitelisted items. A typical copy and paste
b...shit.
- vm_runtime_info()
- issues
- Some bugs with the output. Corrected.
- tools
- Minor bug fixed. Previously used variable was not removed.
- host_runtime_info()
- issues.
- Some bugs with the output. Corrected.
- listvms
- output now sorted by powerstate (suspended, poweredoff, powerdon)
- Corrected some minor bugs
- dc_list_vm_volumes_info()
- Removed handing over of unnecessary parameters
- dc_runtime_info() -> dc_runtime_info.pm
- Code cleaned up and reformated
- listvms
- output now sorted by powerstate (suspended, poweredoff, powerdon)
- Added working blacklist/whitelist with the ability to use regular
expressions
- Added --alertonly here
- Added --multiline here
- listhosts
- %host_state_strings was mostly nonsense. The mapped poser states from
for virtual machines were used. Hash removed. Using now the orginal
power states from the system (from the docs):
- poweredOff -> The host was specifically powered off by the user
through VirtualCenter. This state is not a certain
state, because after VirtualCenter issues the command
to power off the host, the host might crash, or kill
all the processes but fail to power off.
- poweredOn -> The host is powered on
- standBy -> The host was specifically put in standby mode, either
explicitly by the user, or automatically by DPM. This
state is not a cetain state, because after VirtualCenter
issues the command to put the host in stand-by state,
the host might crash, or kill all the processes but fail
to power off.
- unknown -> If the host is disconnected, or notResponding, we can
not possibly have knowledge of its power state. Hence,
the host is marked as unknown.
- Added working blacklist/whitelist with the ability to use regular
expressions
- Added --alertonly here
- Added --multiline here
- listcluster
- Removed senseless perf data
- More detailed check than before
- Added working blacklist/whitelist with the ability to use regular
expressions
- Added --alertonly here
- Added --multiline here
- status
- Rewritten and reformatted
- tools
- Rewritten and reformatted
- Improved more detailed output.
expressions
- Added --alertonly here
- Added --multiline here
- 24 Jan 2014 M.Fuerstenau version 0.9.4
- Merged pull request from Sven Nierlein
- Modified hel to work with Thruk
- Added Makefile. This is optional. Calling it generates a single file
from all the modules. Maybe it is a little bit slower than the modules.
The readon for modules was speed and better maintenance.
- host_runtime_info()
- Added quotes in perfdata for temp.
- Enhanced README. Explained the differences un host_storage_info() between
the original one and this one
- host_net_info()
- Minor bugfix in output. Corrected typo.
- 29 Jan 2014 M.Fuerstenau version 0.9.5
- host_runtime_info()
- Minor bug. Corrected quotes in perfdata for temp.
- vm_net_info()
- Quotes in perfdata
- Removed VM name from output
- vm_mem_info()
- Quotes in perfdata
- vm_disk_io_info()
- Quotes in perfdata
- vm_cpu_info()
- Quotes in perfdata
- host_net_info()
- Quotes in perfdata
- host_mem_info()
- Quotes in perfdata
- host_disk_io_info()
- Quotes in perfdata
- host_cpu_info()
- Quotes in perfdata
- dc_runtime_info()
- Quotes in perfdata
- datastore_volumes_info()
- Quotes in perfdata
- 04 Feb 2014 M.Fuerstenau version 0.9.6
- host_storage_info()
- New switch --standbyok for storage systems where a standby multipath is ok
and not a warning.
- 06 Feb 2014 M.Fuerstenau version 0.9.7
- Bugfixes/Enhancements
- In some cases it might happen that no performance counters are delivered
by VMware. Especially if the version is old (4.x, 3.x). Under these
circumstances an undef was returned by the routines from process_perfdata.pm
and not handled correctly in the calling subroutines. Fixed.
Affected subroutines:
- host_cpu_info()
- vm_net_info()
- vm_mem_info()
- vm_net_info()
- vm_disk_io_info()
- vm_cpu_info()
- host_net_info()
- host_mem_info()
- host_disk_io_info()
- vm_net_info()
- Rewritten to the same structure as similar modules
- host_net_info()
- Rewritten to the same structure as similar modules
- 24 Feb 2014 M.Fuerstenau version 0.9.8
- Corrected a type in the help()
- Moved the block for constructing the full path of the sessionfile downward to the authentication
stuff to have all in one place.
- Authentication:
- To reduce amounts of login/logout events in the vShpere logfiles or a lot of open sessions using
sessionfiles the login part has been rewritten. Using session files is now the default. Only one
session file per host or vCenter is used as default
The sessionfile name is automatically set to the vSphere host or the vCenter (IP or name - whatever
is used in the check).
Multiple sessions are possible using different session file names. To form different session file
names the default name is enhenced by the value you set with --sessionfile.
NOTICE! All checks using the same session are serialized. So a lot of checks using only one session
can cause timeouts. In this case you should enhence the number of sessions by using --sessionfile
in the command definition and define the value in the service definition command as an extra argument
so it can be used in the command definition as $ARGn$.
- --sessionfile is now optional and only used to enhance the sessionfile name to have multiple sessions.
- If a session logs in it sets a lock file (sessionfilename_locked).
- The lock file is been set when the session starts and removed at the end of the plugin run.
- A newly started check looks for the lock file and waits until it is no longer there. So here we
have a serialization now. It will not hang forever due to the alarm routine.
- Fixed bug "Can't call method "unset_logout_on_disconnect"". I mixed object orientated code and classical
code. (Thanks copy & paste for this bug)
- $timeout set to 40 seconds instead of 30 to have a little longer waiting before automatic cancelling
the check to prevent unwanted cancelling due to longer waiting caused by serialization.
- 25 Feb 2014 M.Fuerstenau version 0.9.9
- Bugfix and improvement for "lost" lock files. In case of a Nagios reload (or kill -HUP) Nagios is restarted with the
same PID as before. Unfortunately Nagios sends a SIGINT or SIGTERM to the plugins. This causes the plugin
to terminate without removing the lockfile.
- So we have to catch several signals now
- SIGINT and SIGTERM. One of this will be send from Nagios
- SIGALRM. Caused by alarm(). Now with output usable in Nagios.
- Instead of generating an empty file as lock file we write the process identifier of the running plugin
process into the lock file. If a session crashes for some reason an a lock file is left we are in a
situation where signal processing won't help. But here the next run of the plugin reads the PID and checks
for the process. If there is no process anymore it will remove the lock file and create a new one.
Thanks to Simon Meggle, Consol, for the idea.
- Removed "die" for opening the authfile or the session lock file with an unless construct. The plugin will
report an "understandable" message to the monitor instead of causing an internal error code.
- vm_cpu_info() and host_cpu_info()
- Removed threshold for ready and wait. Therefore thresholds are no possible
without subselect.
- 26 Feb 2014 M.Fuerstenau version 0.9.10
- Bugfixes.
- Corrected typo in dc_runtime_info() line 660.
- Corrected typo in help().
- Corrected bug in datastore_volumes_info(). Giving absolute thresholds
for a single volume was not possible. Fixed.
- Removed print_usage(). Due to mass of parameters it is not possible to display
a short usage message. Instead of that the output of the help is included
in the package as a file.
- Updated default timeout to 90 secs. to avoid timeouts.
- Before accessing the session file (and lock file) we have a random sleep up
7 secs.. This is to avoid a concurrent access in case of a monitor restart
or a "Schedule a check of all services on this host"
- In case of a locked session file the wait loop is not fix to 1 sec any more.
Instead of this it uses a random period up to 5 sec.. So we minimize the risc
of concurrent access.
- 7 Mar 2014 M.Fuerstenau version 0.9.11
- Updated README
- Section for removing HTML tags was reworked
- Added blacklist to host_net_info() so that interfaces with -S net can
be blacklisted.
- 11 Mar 2014 M.Fuerstenau version 0.9.12
- Changed sleep() to usleep() and using now microseconds instead of seconds
- So before accessing the session file (and lock file) we now have a random sleep
up to 1500 milliseconds (default - see $ms_ts). This is to avoid a concurrent access
in case of a monitor restart or a "Schedule a check of all services on this host"
but takes much less time while having much more alternatives.
- In case of a locked session file the wait loop is not fix to a random period up to
5 sec. any more. Instead of this it uses also $ms_ts which means a max of 1.5 secs.
instead of 5.
- 3 Apr 2014 M.Fuerstenau version 0.9.13
- --trace=
- Removed comment sign in front of unlink around. It was there due to some
some tests and I had forgotten to remove it.
- datastore_volumes_info(). Some bugs corrected.
- Wrong percent calculation
- Wrong processing of thresholds for usedspace
- Wrong processing for thresholds which are not percent
- If threshold is in percent it is calculated in MB/GB for perfdata
because mixing percent and MB/GB doesn't make sense.
- 5 Apr 2014 M.Fuerstenau version 0.9.14
- host_runtime_info()
- Fixed some bugs with issues ignored and whitelist. Some counters were calculated
wrong
- New plausibility check to ensured that blacklist and whitelist can not be used together.
- 29 Apr 2014 M.Fuerstenau version 0.9.15
- host_mem_info(), vm_mem_info(), host_cpu_info() and vm_cpu_info().
- Sometimes it may happen on Vmware 5.5 (not seen when testing with Update 1) that getting
the perfdata for cpu and/or memory will result in an empty construct because one
or more values are not delivered. In this case we have a fallback and and every value
- dc_runtime_info()
- New option --poweredonly to list only machines which are powered on
- 29 Apr 2014 M.Fuerstenau version 0.9.15
- host_mem_info(), vm_mem_info(), host_cpu_info() and vm_cpu_info().
- Sometimes it may happen on Vmware 5.5 (not seen when testing with Update 1) that getting
the perfdata for cpu and/or memory will result in an empty construct because one
or more values are not delivered. In this case we have a fallback and and every value
- dc_runtime_info()
- New option --poweredonly to list only machines which are powered on
- 20 May 2014 M.Fuerstenau version 0.9.16
- New option --nosession.
- This was implemented for 2 reasons.
- First when testing from the commandline using this switch to avoid
waiting and timeouts while the monitor system is checking the the same host.
This is the important reason.
- Second is that some people don't like sessionfiles and prefer full logs as
it was in the past. Good ol' times.
- host_runtime_info()
- added --nostoragestatus to -S runtime -s health to avoid a double alarm
when also doing a check with -S runtime -s storagehealth for the same
host.
- dc_runtime_info()
- changed
if (($subselect eq "listcluster") || ($subselect eq "all"))
to
if (($subselect =~ m/listcluster.*$/) || ($subselect eq "all"))
This is to avoid unnecessary typos because it covers listcluster and listclusters ;-)
- datastore_volumes_info()
- Heavily reworked lot of the logical structure. There were too much changes
changes after changes which lead to bugs. Now it is cleaned up.
- cluster_list_vm_volumes_info()
- Seperate module now
- cluster_cpu_info()
- Seperate module now but still not working.
- 1 Jul 2014 M.Fuerstenau version 0.9.16a
- Unfortunately published some modules containing debugging outpu. Fixed.
- host_disk_io_info.pm
- process_perfdata.pm
- vm_disk_io_info.pm
- 20 Jul 2014 M.Fuerstenau version 0.9.17
- Removing the last multiline character (n or
) was moved
from several subroutines to the main exit in check_vmware_esx.pl.
This was based this was implemented based on a proposal of Dietmar Eberth
Affected subroutines:
- vm_runtime_info()
- host_storage_info()
- host_runtime_info()
- dc_runtime_info()
-datastore_volumes_info()
- Fixed a bug on line 139 and 172. Thanks for fixing it to Dietmar Eberth.
- Instead of
if ( $state >= 0 )
{
$alertcnt++;
}
it must be:
if ( $alertcnt > 0 )
{
$alertcnt++;
}
- Fixed a bug on line 139 and 172. Thanks for fixing it to Dietmar Eberth.
- If only one volume is selected we have a better output now. Also thanks
to Dietmar Eberth.
- 21 Jul 2014 M.Fuerstenau version 0.9.17a
- Bugfix line 139 and 172 (now 140 and 173. It must be
if ( $actual_state > 0 )
instead of
if ( $alertcnt > 0 )
- 21 Jul 2014 M.Fuerstenau version 0.9.17a
- Bugfix line 139 and 172 (now 140 and 173). It must be
if ( $actual_state > 0 )
instead of
if ( $alertcnt > 0 )
- 25 Jul 2014 M.Fuerstenau version 0.9.18
- New option --perf_free_space for checking volumes. It must be used
with --usedspace. In versions prior to 0.9.18 perfdata was always
deliverd as free space even if --usedspace was selected. From 0.9.18
on when using --usedspace perfdata is recorded as used space. To prevent
old perfdata use this option.
- Cluster - removed checks for CPU and MEM
- Both checks were senseless for alarming because there are no thresholds.
A cluster or resource group is a group of Vmware hosts. Not a logical
construct taking parts of the hosts in a resource group in a manner
that several clusters are using the same hosts. So the amount of CPU
and memory of all hosts is the CPU and memory of the cluster. Monitoring
this makes no sense because there are no thresholds for alerting. For
example 50% CPU usage of a cluster can be one host with 90%, and two
with 30% each. With an average of 50% everything seems to be ok but one
machine has definetely a problem. Same for memory.
- 21 Aug 2014 M.Fuerstenau version 0.9.19
- host_runtime_info()
- Some minor corrections in output.
- host_storage_info()
- Some corrections in output for LUNs. Using code in output was a
really stupid idea because the code (like ok,error-lostCommunication or
whatever is valid there) was interpreted as non existing HTML code.
- Some corrections in output for multipath/paths.
- Bugfix. Due to a wrong placed curly bracked the output was doubled. Fixed.
- host_runtime_info()
- Small bugfix. It may happen within the heath check that some values are not set
by VMware/hardware. In this case we have an
"[Use of uninitialized value in concatenation (.) or string ..."
To avoid this we check the values of the hash with each loop an in case a value
is not set we replace it whit the string "Unknown".
Reviews (7)
byalex3105, November 1, 2018
Good morning,
I have Centos 7 and Nagios Core 4.4.1, I am presented with this error, you can guide me to what can be done
[root@localhost check_vmware_esx_0.9.19]# ./check_vmware_esx.pl
Can't locate Time/Duration.pm in @INC (@INC contains: /root/perl5/lib/perl5/5.16.3/x86_64-linux-thread-multi /root/perl5/lib/perl5/5.16.3 /root/perl5/lib/perl5/x86_64-linux-thread-multi /root/perl5/lib/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at ./check_vmware_esx.pl line 1154.
BEGIN failed--compilation aborted at ./check_vmware_esx.pl line 1154.
[root@localhost check_vmware_esx_0.9.19]#
I have Centos 7 and Nagios Core 4.4.1, I am presented with this error, you can guide me to what can be done
[root@localhost check_vmware_esx_0.9.19]# ./check_vmware_esx.pl
Can't locate Time/Duration.pm in @INC (@INC contains: /root/perl5/lib/perl5/5.16.3/x86_64-linux-thread-multi /root/perl5/lib/perl5/5.16.3 /root/perl5/lib/perl5/x86_64-linux-thread-multi /root/perl5/lib/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at ./check_vmware_esx.pl line 1154.
BEGIN failed--compilation aborted at ./check_vmware_esx.pl line 1154.
[root@localhost check_vmware_esx_0.9.19]#
If you are running ESX 6.5, please change line #7790:
-$url2connect = "https://" . $url2connect . "/sdk/webService";
+$url2connect = "https://" . $url2connect . "/sdk/vim.wsdl";
-$url2connect = "https://" . $url2connect . "/sdk/webService";
+$url2connect = "https://" . $url2connect . "/sdk/vim.wsdl";
[root@CCNAGIOS libexec]# ./check_vmware_esx.pl -H XXX.XXX.XX.XX -u nagios -u XXXXX -S mem -s usage
Provide either Password/Username or Auth file or Session file
Provide either Password/Username or Auth file or Session file
bynetexpertise, October 17, 2016
The script works nicely but I get a lot of timeouts on a stable LAN connection (service reports as down 2/3 times a day, after 3 failures in a row).
The supervised server runs on ESXi 5.1 and I use the session file
The supervised server runs on ESXi 5.1 and I use the session file
bymridion, July 3, 2014
If you run with sessions to lessen the login/out messages in ESX and run in console the file will be owned by root and they nagios check cannot run...
Errors with "Return code of 13 is out of bounds"
- may have run as root and nagios user cant write to session file in /yourdirectory/var/nagios_plugin_cache/
chown nagios:nagios /yourdirectory/var/nagios_plugin_cache/*
To ensure it won't happen again you could also set up default filesystem acls for that directory
Errors with "Return code of 13 is out of bounds"
- may have run as root and nagios user cant write to session file in /yourdirectory/var/nagios_plugin_cache/
chown nagios:nagios /yourdirectory/var/nagios_plugin_cache/*
To ensure it won't happen again you could also set up default filesystem acls for that directory
byR.Wolff, May 6, 2014
Hello,
I want to test the script, because the old one was very great but Need a lot of cpu power.
But now I get the following message:
"...Undefined subroutine &Vim::unset_logout_on_disconnect called at ./check_vmware_esx.pl line 1596.
End Disconnect
..."
Could you help me?
Regards,
Rolf
I want to test the script, because the old one was very great but Need a lot of cpu power.
But now I get the following message:
"...Undefined subroutine &Vim::unset_logout_on_disconnect called at ./check_vmware_esx.pl line 1596.
End Disconnect
..."
Could you help me?
Regards,
Rolf
Owner's reply
Please open an issue in GitHub for this.
Martin
byPradyumna, January 29, 2014
Working fine but vCenter and Esxi logs are growing
every check create a Login event and a logout event in the event table of the vcenter/Esxi.
We get a lot of check (2 check per VM every 5min). That is generating millions of lines in the vcenter/Esxi database which became very slow. We already reduce the event&task conservation time but not enough.
Does anyone know a way to limite the number of login to vcenter done by the check ?
Regards,
Pradyumna
every check create a Login event and a logout event in the event table of the vcenter/Esxi.
We get a lot of check (2 check per VM every 5min). That is generating millions of lines in the vcenter/Esxi database which became very slow. We already reduce the event&task conservation time but not enough.
Does anyone know a way to limite the number of login to vcenter done by the check ?
Regards,
Pradyumna
Owner's reply
For issues like this better open an issue on GitHub. Well - you are right. Lot of logins. This is caused by the way the VMware Perl SDK is used. If a check starts you have to connect to VMware. Thats a login. Every time. So a lot of checks means a lot of logins. If you use a session file (which must be unique) you hav a persistent session per check. So less login but a lot of open sessions.
But this problem is imported from the original one )check_vmware_api.pl/check_esx3.pl). I am working on a solution but first I have to clean up the rest of the code.
Regards
Nartin