The Standard

Health Check

A wide variety of monitoring tools is available, and they provide a large amount of information. The question is: why do we need a health check in addition to all these measured values? And what is a health check?

The answer: at times, a quick and general overview of specific IT-elements or bundles of IT-elements is necessary. In such a case the administrator does not need a batch of measured values; he requires something like a red and green light for predefined areas.

Here is a short list of typical uses of a health check:

  • Run the health check to provide an overview of the IT-environment in case of an incident where the root cause is not clear
  • Run the health check to provide an overview of the IT-environment in order to identify the effects of an incident on other IT-elements
  • Run the health check before a change to prepare a snapshot of the status
  • Run the health check after the change to obtain a quick overview of the modifications

In brief, a health check may help the administrator to generate a quick overview which serves as the basis for a positive service status or for further investigations.
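The before-and-after use of a health check can be sketched in a few lines of shell. This is only a minimal sketch under stated assumptions: the function name changed_checks and the snapshot format (one result per line in a plain text file) are hypothetical and not taken from any particular tool.

```sh
#!/bin/sh
# Hypothetical helper: print the result lines that differ after a change.
# $1 = snapshot file taken before the change
# $2 = snapshot file taken after the change
changed_checks() {
    # diff marks lines that exist only in the second file with "> "
    diff "$1" "$2" | sed -n 's/^> //p'
}
```

An empty output would mean that the change did not alter any recorded status; any printed line is a starting point for further investigation.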

The following chapters will show the diverse health check application areas.

It is of paramount importance to monitor the IT-environment in a thorough and complete way, measuring and tracking the status of its equipment, its connections and the processes that run on top of them. Appropriate software tools should be made available to provide regular information on the status of the IT-infrastructure, and alerts should be raised every time an anomaly, a degradation of a service, or an outage or failure of any IT-infrastructure component is detected.

The main targets of such a monitoring system are as follows:

  • Provision of all required information and logs in the event of complex incidents, which, after further processing, facilitates determining the root cause of the incident
  • In more sophisticated versions, the monitoring software may be able to self-diagnose the root cause of the incident, and in some cases even to self-correct it
  • Assurance that the IT-environment is fully functional before high-risk changes to the infrastructure are implemented, as well as verification of the full infrastructure functionality after the change has been completed
  • Provision of regular and automated control of the infrastructure health status, reporting long-term variations or sudden changes in the critical monitored elements and raising proactive alerts that facilitate intervention on the infrastructure before these variations result in a service outage

The health of an IT-infrastructure depends on a multiplicity of variables, owing to the different hardware platforms and software components which constitute it. It is therefore important to define a complete view of the infrastructure, representing it in abstract terms and dividing it into different layers and different domains.

This view will enable the establishment of a hierarchy and a structure in the information provided by the health check software, with correlations and dependencies between components. This is very important for being able to properly analyze and comprehend logs, identify root causes and implement corrective actions.

Moreover, it will be possible to provide simple indicators (for instance, a semaphore-like, color-coded metering system) of the risk status of the IT-infrastructure, both in its entirety and in its constituent domains. This is done by properly weighing the status of the different components which, in a tree-like hierarchical fashion, constitute the relevant measurement points of a given infrastructure domain.
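As an illustration of such an indicator, the following sketch rolls the colors of child components up into one color for their parent domain. It assumes the simplest possible weighting, in which the worst child status wins; the function name aggregate_status and the color names are hypothetical, and a real implementation may weigh components differently.

```sh
#!/bin/sh
# Hypothetical roll-up: the parent takes the worst color of its children.
# Arguments: any number of child colors (green, yellow or red)
aggregate_status() {
    worst="green"
    for status in "$@"; do
        case "$status" in
            red)
                worst="red"
                ;;
            yellow)
                # yellow only overrides green, never red
                if [ "$worst" != "red" ]; then
                    worst="yellow"
                fi
                ;;
        esac
    done
    printf '%s\n' "$worst"
}
```

Applied level by level from the leaves to the root of the hierarchy, the same function yields one color per domain and one for the infrastructure as a whole.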

In addition, each monitored entity requires that some thresholds be associated with it, typically indicating different degrees in the health of the entity. It is customary to provide at least two thresholds for each limit: a degrade level, above which the entity should be considered at risk although services are not yet impacted, and a fail level, above which services will be impacted.
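The two-threshold scheme can be sketched as a small shell function. The name classify_value and the status words are assumptions for illustration, and the sketch is limited to integer values.

```sh
#!/bin/sh
# Hypothetical classifier for one measured value.
# $1 = measured value, $2 = degrade level, $3 = fail level (all integers)
classify_value() {
    if [ "$1" -gt "$3" ]; then
        echo "FAIL"        # above the fail level: services are impacted
    elif [ "$1" -gt "$2" ]; then
        echo "DEGRADED"    # above the degrade level: at risk, no impact yet
    else
        echo "OK"
    fi
}
```

For example, with a degrade level of 80 and a fail level of 95, a file system usage of 85 would be classified as DEGRADED.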

Advanced health check software should enable users to modify the list of relevant entities to monitor, reducing the scope to what is deemed necessary, and to vary the thresholds related to each monitored entity. In advanced software, it may also be possible to define a set of actions in response to each alarm trigger, with the aim of self-correcting the detected problem in the infrastructure.

A final crucial point: different software interfaces may be required depending on the “North-Bound” software systems collecting the information from each domain (for instance, in the networking industry some customers are used to employing SNMP to monitor their networking products, while others use NETCONF, and the like).

Health checks are primarily focused on technologies and deal with the operational status of hardware and software components of the IT-infrastructure. To be completely effective they should be tied into other business processes (for instance, service ticket creation). The focus of this section of the document is on the technical part of the health checks.

There are some cornerstones for developing and implementing an effective health check library. Some of these cornerstones are outlined in the following list:

  • General guidelines for, e.g., usage of variable names, output format, versioning, … (need to be standardized)
  • How to interact with the interfaces to, e.g., monitoring tools, configuration databases, … (needs to be standardized)
  • Responsibilities for releases and life-cycle management of the health checks and their usage (need to be defined)
  • Automation and reuse of results from existing monitoring tools are key; keeping additional deployment to a minimum is crucial here

The basis for health checks of any kind is a strong collaboration with configuration management and the interfaces to the related databases. Just as in the automation area, there are mutual dependencies on both sides, which we need to keep an eye on.

The easiest, yet very effective, way to apply health checks is to use simple scripts. Most vendors provide specific health check tools that show detailed information on the status of each of their components. It is rather simple to develop further scripts that bundle or connect the vendor-specific ones.

Keeping health check scripts simple has two major benefits:

1) Flexibility:

  • a. scripts can be developed and assembled very easily
  • b. running the scripts and reading the results need no specific skills
  • c. no specific tools are necessary

An example is a list of scripts (applicable to a file server environment) from which the appropriate ones can be chosen as needed, depending on the encountered issue.

2) Homogeneity:

  • a. The scripts can be executed by all operators in the same way
  • b. Output for all health checks can look similar
  • c. Scripts can be merged in a flexible way and adapted to the needs of each specific infrastructure implementation
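One way to obtain similar-looking output from all health checks is a shared reporting function that every script calls. The sketch below is an assumption for illustration: the function name report, the variable REPORT_WIDTH and the dotted layout are hypothetical choices, not a prescribed format.

```sh
#!/bin/sh
# Hypothetical shared formatter: "NAME: message ....... [ STATUS ]"
# $1 = check name, $2 = message, $3 = status (for example OK or Warning)
report() {
    width=${REPORT_WIDTH:-82}    # column at which the status field starts
    line="$1: $2 "
    pad=""
    i=${#line}
    # fill the gap between message and status with dots
    while [ "$i" -lt "$width" ]; do
        pad="$pad."
        i=$((i + 1))
    done
    printf '%s%s [ %s ]\n' "$line" "$pad" "$3"
}
```

Because every script prints through the same function, operators see the status in the same column regardless of which vendor-specific check produced it.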

More advanced software versions include graphical user interfaces (GUIs) and visual diagrams, which can greatly improve the human readability of any alarm or event report.

As stated, most software and hardware vendors provide specific tools for status checks of their products and services, with a large amount of detailed and helpful information. Depending on the complexity of the activity to be monitored, different operational procedures should be defined. Regularly monitoring the state of an IT infrastructure is in fact very different from the monitoring called for when changes are introduced, and the latter in turn depends on the scope of the change itself. We can identify three different levels of complexity and related risk:

  1. Health checks for regular maintenance tasks, such as code upgrades
  2. Detailed performance analyses required in the event of major changes, for instance, hardware refreshing, changes in the components of data center fabrics, and more
  3. Major transitions such as data center transitions, which normally require architectural and engineering considerations

Regular code upgrades, for instance of storage devices or network devices, are among the most frequently executed maintenance activities in IT-infrastructures. As code upgrades are necessary for all components of IT-infrastructures, in large-scale IT-environments such procedures run monthly, sometimes weekly.

This leads to a high potential for human error or implementation issues, which in the worst case can cause an incident. Such issues are rare, but given the great number of code upgrades occurring in the infrastructure and the increasing focus on high-availability systems, even this simple operation demands great attention as well as special quality assurance procedures. A high level of standardization and automation is crucial for avoiding issues, because such relatively simple activities occur frequently and therefore carry a high cumulative risk potential.

All code upgrades should start with a standardized process for analyzing the current situation. A good example of a complete workflow is as follows:

  1. Vendors run a health check on their fabrics
  2. With the output of the health check, the new features needed for each technology are identified and the correct target software for the upgrade is selected
  3. Execution plans are established
  4. Known dependencies and the prerequisites of connected components are identified and checked
  5. Verification, back-out procedures and communication plans are developed

One important element for success is that each vendor provides the known dependencies of its components on those of other vendors. While each vendor often controls all the variables within its own technological domain, the impact on other technologies is the most difficult to define upfront. Without this information it is hard to adapt the test procedures for the implementation properly and to reduce the risk of unexpected side effects.

The question is: what makes this procedure a Zero Outage procedure?

In the event of high inter-domain dependencies, it is important to develop a standard and comprehensive runbook procedure, which will be checked by both parties. The cooperation between domains in establishing the runbook for the change will eventually, and in time, lead to standardized runbooks, which will serve as the basis for the automation of all change procedures.

Once a common change procedure is available, its verification and its associated health checks will also be delivered jointly.

In highly critical and complex cases, more advanced cooperation models such as premium support agreements (sometimes with on-site support teams) should be planned. Such models should include support for maintenance activities as well as training schemes for the operational team.

For complex changes, it is recommended that a “soaking-time” period be included in the procedure, in which the change is monitored with special care for some time after the implementation. In this phase, enhanced monitoring (for instance, having log files checked by the vendor's engineering team) should take place, in order to be able to react at short notice, or in some cases even in advance, to side effects occurring some time after the implementation of the change.

Major projects such as data center moves and new data centers are needed with increasing regularity, and these complex projects require a highly detailed and complex architectural design. Owing to the complexity and diversity of IT infrastructures, most vendors deliver standardized services, which in these cases include professional engineering service support. These services provide considerably more information than is usually expected from a health check.

Experience tells us that such a service is also helpful for checking the infrastructure periodically, e.g., every 2-3 years, in order to detect potential for optimization and quality improvement.

Several helpful procedures and tools for all types of health checks are already available. However, most of these health checks are developed for specific IT components or for the products of one specific vendor.

What we want to achieve with this initiative is a best-practice implementation in which the specific methods of the different vendors and the different domains can be connected to generate a common E2E view, with a standardized and comparable output of the health checks, to provide a vendor-independent approach suitable for different IT infrastructures.

Revision History
Revision 1.5 - Released 2016-07-12 Peter Mustermann
Review and update to latest script version.
Revision 1.4 - Released 2015-12-23 Peter Mustermann
Review and update to latest script version.
Revision 1.3 - Released 2015-06-03 Peter Mustermann
Review and update to latest script version.
Revision 1.2 - Released 2014-02-18 Peter Mustermann
Review and update to latest script version.
Revision 1.1 - Released 2012-12-21 Peter Mustermann
Update to latest script version.
Revision 1.0 2012-12-20 Peter Mustermann
Spin-off from TCC documentation.

Abstract

This book describes the tests performed by the Health Check Script and the different options of the script.

Intended Readership

Anyone who is responsible for Solaris systems, such as Solaris administrators.

Assumed Knowledge

Fundamental Solaris knowledge and experience are required. Knowledge of the standard processes at T-Systems is also useful.

Preparations

You need root access on the systems on which you want to run this script.

Further Information

See also the documentation of the server health check, e.g., ../Solaris_Server_Health_Check.html, and the documentation of the TCC, e.g., ../Solaris_technical_clearance_check.html.

Typographical Conventions

References and short machine inputs and outputs are presented according to the following table:

Meaning | Representation (example)
Reference to another place in this document | Typographical Conventions
Reference to another techlib document | techlib home page
Internet reference | https://www.zero-outage.com
User input | pwd
Prompt and user input | prompt> pwd
Single machine output | /home/user

Large-scale machine outputs (screens):

drwxr-xr-x 3 user users 4096 Jun 18 12:00 Desktop
drwx------ 2 user users 4096 Jun 18 12:00 Mail
drwxr-xr-x 2 user users 4096 Jun 18 12:00 bin

Program listings:

#!/bin/sh
ls -l

Admonitions:

Tip
A tip.
Note
A remark.
Important
An important notice.
Caution
An important notice, requiring special attention.
Warning
A warning must be respected under all circumstances.

 

The Solaris Health Check Script was developed to give a quick overview of whether some critical checkpoints are in a healthy state. In default mode it performs the tests introduced in the document, e.g., …/Solaris/Solaris_Server_Health_Check.html, in the health check section. Many additional tests are included in the script and can be selected with the options described in this document. Currently 49 tests are implemented.

  • health_check_solaris.sh
    Current script version: 5.2
    Perform essential checks on a system (health check)

See the following chapter for detailed information about this script.

Overview

Called without parameters, this script performs a health check of essential settings on a system. If the script finds differences between the system and the expected settings, these are logged to syslog. Enhanced modes for deeper investigation are also available.

Description

The script is divided into five parts. The base part consists of the essential tests. The minimum tests include the essential tests and add some additional ones. These are in turn extended by the recommended settings, which include all available tests. The fourth mode is a collection of tests that are available in a similar form for all operating systems within T-Systems; it consists of the contents of several lists, one for each area into which the common tests are grouped. Finally, there is a list of tests intended to run as part of a daily health check. The lists are defined at the beginning of the script:

# lists for the areas
checklist_bootenv="BOOT RL MIRR"
checklist_system="PKG ZFS ZPOOLx SVM METADB VXVM DUMP PW GRP NTP NTPx SID"
checklist_cpumem="SWAP LOAD PAGE PROC SCAT IOWAIT"
checklist_network="GW GWx DNS LDAP LDAPSEARCH IF IF6 NET NET6 MTU"
checklist_storage="NFS FC MPXIO VXDMP ISCSI"
checklist_virtualization="VSW"
checklist_general="FMA CPU"
checklist_platform="SVC"

# lists for the five parts
checklist_essential="FMA CPU IF IF6 ZFS ZPOOLx SVM METADB VXVM MIRR DUMP BOOT RL GW SVC MTU \
SID SWAP LOAD PROC"
checklist_minimum="${checklist_essential} NTP NET NET6 LDAP DNS FC ISCSI MPXIO VXDMP GWx \
NFS ZFSx SVMx METADBx VXVMx VSW PAGE IOWAIT"
checklist_recommended="${checklist_minimum} NTPx NETx NET6x LDAPSEARCH LDAPx FCx SOCOWOsan \
SOCOWOnfs SOCOWOnfsx SOCOWOiscsi SCAT PKG PW GRP VSCSI ALIGN"
checklist_common="${checklist_bootenv} ${checklist_system} ${checklist_cpumem} \
${checklist_network} ${checklist_storage} ${checklist_virtualization} ${checklist_general} \
${checklist_platform}"
checklist_daily="BOOT RL SID SWAP LOAD PAGE PROC IOWAIT GW NFS FMA CPU SVC"


Note
In local zones some of these tests are not possible; the script will skip them (depending on the options, either by removing them from the list or by printing dummy output for the appropriate test).

All checklists can be reduced by an exclude list (see below in this chapter). The following subsections give a short description of the tests in the three main parts. The “common” list is nearly the “recommended” list, excluding most of the extended tests (those with an “x” suffix) and the SoCoWo tests.

Even in quiet mode the script prints a management summary to the screen (one line per area, see above). It also prints the number of tests skipped (only if the exclude list is populated), the number of tests executed, and the number of tests with findings (if there are any).
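How an exclude list might reduce a checklist can be sketched as follows. The script's own exclude mechanism is not shown in this document, so the function name filter_checklist and its calling convention are purely hypothetical; only the space-separated list format is taken from the definitions above.

```sh
#!/bin/sh
# Hypothetical filter: remove excluded test names from a checklist.
# $1 = space-separated checklist, $2 = space-separated exclude list
filter_checklist() {
    result=""
    for test_name in $1; do
        skip=0
        for excluded in $2; do
            if [ "$test_name" = "$excluded" ]; then
                skip=1
            fi
        done
        if [ "$skip" -eq 0 ]; then
            # append with a separating space unless result is still empty
            result="${result:+$result }$test_name"
        fi
    done
    printf '%s\n' "$result"
}
```

The order of the remaining tests is preserved, so a reduced list can be processed exactly like the original one.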

This script is supported and tested on the following environments:

  • Solaris 10 and up (older versions might work partly)
  • VxVM & VxFS 5.0 to 6.0 (older and newer versions work most likely)
  • SPARC and i386 are supported

Essential Tests

The essential tests include all tests that must be met at all times. If one of these tests finds an error, this indicates a critical issue for the system. This part consists of these tests:

  • FMA = the command “/usr/sbin/fmadm” shows no errors, so no known error in system
  • CPU = the command “/usr/sbin/psrinfo” shows every cpu in online state, so no broken cpus
  • IF = the command “/usr/sbin/ifconfig -a inet” shows every ipv4 interface in running state, so no failed network interface
  • IF6 = same as IF but for ipv6
  • ZFS = the command “/usr/sbin/zpool list -H -o health” on the root pool shows online state, means the state of the root pool is good
  • ZPOOLx = the command “/usr/sbin/zpool list -H -o cap” is executed for every active zfs pool; the usage must be below 80%, otherwise it will be reported (potential performance gap)
  • SVM = the command “/usr/sbin/metastat -q | /usr/bin/grep 'State:'” only returns lines that contain a positive state, which means all meta devices are in good condition
  • METADB = the command “/usr/sbin/metadb | /usr/bin/grep dsk | /usr/bin/grep -v a” ends in an empty output, which means all devices in meta database are active
  • VXVM = the command “/usr/sbin/vxprint -g bootdg -vp -F %admin_state” shows everything in active state, means all devices in boot diskgroup are okay
  • MIRR = this test checks if the root filesystem is mirrored, so it’s based on at least two disks (skipped if the system is a guest logical domain with unmirrored virtual disk)
  • DUMP = check for enabled savecore with command “/usr/sbin/dumpadm” and whether the dump directory has free disk space of at least twice the size of the kernel
  • BOOT = on SPARC architecture check if all disks of the root filesystem have a valid boot block installed; on X86 architecture test if all disks of the root filesystem have GRUB installed
  • RL = the default runlevel is 3 in /etc/inittab or the default milestone is set to “all” (or empty that defaults to “all”)
  • GW = try to ping the default gateway; if this is unsuccessful, it is checked whether the default gateway is in the arp cache
  • SVC = all services shown by /usr/bin/svcs except some special services and legacy start scripts must be “online”
  • MTU = the MTU on all interfaces must be 1500 or 9000: /usr/sbin/ifconfig -a | /usr/bin/grep index | /usr/bin/grep -v ":.:" | /usr/bin/grep -v lo0 | /usr/bin/grep -vi "mtu 1500" | /usr/bin/grep -vi "mtu 9000"
  • SID = checks if the sid in /etc/epmf/tsi_system_info.cfg exists, starts with “S” or “SGER” and has 8 digits
  • SWAP = check if we have at least 5% free swap (with commands /usr/sbin/swap and /usr/sbin/prtconf, then calculate the value)
  • LOAD = the average load in 15 minutes (taken from /usr/bin/uptime) must be below the number of hardware threads (taken from /usr/sbin/psrinfo)
  • PROC = check if a “df”, “ls”, “cd” or defunct process has been running for more than 3 hours

It’s possible to exclude essential tests in essential mode. See details in the Exception section later in this chapter.
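To illustrate how such a test can be built from standard command output, the following sketch reproduces the idea of the LOAD test. It is not the code of health_check_solaris.sh: the function names load15_from_uptime and load_ok are hypothetical, and the parsing assumes an uptime line that ends in "load average: a, b, c".

```sh
#!/bin/sh
# Hypothetical parser: read one line of uptime output on stdin and
# print the third (15-minute) load average.
load15_from_uptime() {
    sed 's/.*load average[s]*: *//' | awk -F', *' '{ print $3 }'
}

# Hypothetical check: succeed if the load is below the thread count.
# $1 = 15-minute load average, $2 = number of hardware threads
load_ok() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load < max) }'
}
```

On a live system the two pieces could be combined as `load_ok "$(/usr/bin/uptime | load15_from_uptime)" "$threads"`, where the value of `threads` would have to be derived from /usr/sbin/psrinfo.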


Note
Because the essential tests in default mode log to the system log, an incident will be opened for every discrepancy.

Caution
Every negative finding of the essential tests should be investigated and fixed as soon as possible because it affects the availability of the system.

Tip
Detailed information about the essential tests is available in the document .../Solaris/Solaris_Server_Health_Check.html in the health check section.

Minimum Tests

The minimum tests add some tests that should succeed on every system, but don’t have the same criticality as the essential tests. These tests are added in this part of the script:

  • NTP = the output of the command “/usr/sbin/ntpq -p” shows one active entry (marked with a star)
  • NET = the command “/usr/bin/netstat -in -f inet” has zero values in columns 6 and 8, which means no errors on any ipv4 device
  • NET6 = same as NET but for ipv6
  • LDAP = all LDAP servers shown by command “/usr/lib/ldap/ldap_cachemgr -g” are up, so no delay for LDAP connections
  • DNS = the command “/usr/sbin/nslookup <nameserver> <nameserver>” for all entries in /etc/resolv.conf results in a positive answer, which means all DNS servers are reachable and functional
  • FC = the output of the command “/usr/sbin/cfgadm -a” shows no state “failing” or “unusable” in column 5, so every active fibre channel connection has a clean state
  • ISCSI = the output of the command “/usr/sbin/iscsiadm list target | grep 'Connections:'” shows a value greater than zero for all targets, which means the targets are connected and usable
  • MPXIO = all disks connected via MPxIO have at least two paths (every disk in a fabric must have two online paths in the output of “/usr/sbin/luxadm display <disk>”)
  • VXDMP = all disks connected via VxDMP have at least two paths (the output of “/usr/sbin/vxdmpadm getdmpnode enclosure=<enclosure> redundancy=2” must be empty for all fabric enclosures)
  • GWx = try to ping or find in arp cache all target servers for static routes (based on files “/etc/default/route” and “/etc/inet/static_routes”)
  • NFS = checks whether all nfs servers used by nfs mounts on the system respond to /usr/bin/rpcinfo
  • ZFSx = same as test ZFS from the essential tests, but for all active zfs pools
  • SVMx = same as test SVM, but for all active meta sets
  • METADBx = checks if we have at least three active replicas on at least two different disks
  • VXVMx = same as test VXVM, but for all active disk groups
  • VSW = if there are logical domains active, this test checks if all active interfaces have a virtual switch, and if all logical domains have a virtual network interface connected to each of these virtual switches
  • PAGE = kernel value “freemem” must be higher than kernel value “lotsfree”; if not, the system is paging, which might be bad for performance
  • IOWAIT = check the blocked i/o threads for 5 seconds; we shouldn’t see 5 or more over this interval, as this may indicate an i/o bottleneck

Under normal circumstances all tests should be positive, that is, no discrepancy is found. Some tests, for example NTP, are classified as requisite, so if the service is not active this is a fault.
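The idea behind the NTP test can be sketched as a filter over the peer list. This is a minimal sketch rather than the script's actual code; the function name ntp_synced is hypothetical, and it relies only on the fact that ntpq marks the selected peer with a star in the first column.

```sh
#!/bin/sh
# Hypothetical check: read "ntpq -p" output on stdin and succeed only
# if exactly one peer line starts with a star (the active time source).
ntp_synced() {
    awk '/^\*/ { active++ } END { exit !(active == 1) }'
}
```

On a live system it would be called as `/usr/sbin/ntpq -p | ntp_synced`, with the exit status deciding whether a finding is reported.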


Note
The ISCSI test will not work completely on ancient Solaris 10 releases because it's designed for the current implementation.

Important
The NFS test may take some time due to timeouts if there are unresponsive nfs servers.

Recommended Settings

The recommended settings include all tests. If no test finds an issue, the system meets the settings recommended for a standard Solaris instance. These tests are only included in this part of the script:

  • NTPx = the active server from the output of the command “/usr/sbin/ntpq -p” has a stratum below three, means good quality of time signal
  • NETx = the command “/usr/bin/netstat -in -f inet” has zero values in column 9, which means no collisions on any ipv4 device
  • NET6x = same as NETx but for ipv6
  • LDAPSEARCH = execute the command “/usr/bin/ldapsearch -b '' -s base -h <server> -p 636 -Z -P /var/ldap/ '(objectclass=top)' vendorVersion” to all online LDAP servers (without SSL for Solaris 9 or older)
  • LDAPx = the command “/usr/lib/ldap/ldap_cachemgr -g” shows at least three LDAP servers connected over at least two different networks, required for best redundancy
  • FCx = the output of the command “/usr/sbin/cfgadm -al -o show_SCSI_LUN” shows no state “failing” or “unusable” in column 5, so every connected fibre channel disk has a clean state
  • SOCOWOsan = the recommended settings for san devices are set in /etc/system (“sd_max_throttle” matches 8 and “sd_io_time” matches 120)
  • SOCOWOnfs = the recommended settings for nfs mounts are followed for every active nfs mount (no version 2 or below mounts and some other recommended settings taken from the Storage Connectivity Guide)
  • SOCOWOnfsx = the recommended settings for the nfs server are set in /etc/default/nfs or the service nfs/server (no version 2 or below shares, Solaris 10 should have NFSv4 disabled and the delegation feature must always be disabled)
  • SOCOWOiscsi = the recommended settings for iSCSI devices are followed for every iSCSI device (“Time To Retain” and “Time To Wait” should match 180 seconds)
  • SCAT = the command “/opt/SUNWscat/bin/scat --sanity_checks” is executed and checked for warnings (see, e.g., …/Solaris_Server_Health_Check.html for details about Solaris Crash Analysis Tool)
  • PKG = performs a test of package installation accuracy with “/usr/sbin/pkgchk -a” and reports missing files
  • PW = executes /usr/sbin/pwck and reports issues found
  • GRP = executes /usr/sbin/grpck and reports issues found
  • VSCSI = checks if virtual disks for logical domains are based on a local mirrored filesystem or volume (nfs filesystems are skipped in this test, in case of disk devices it’s checked if the corresponding guest logical domain has at least two disks)
  • ALIGN = for all disks with a sector size of 512 bytes, it is checked for all slices whether the start sector is divisible by 16 without remainder
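The arithmetic behind the ALIGN test is small enough to show directly. With 512-byte sectors, a start sector divisible by 16 means the slice begins on an 8 KiB boundary (16 × 512 bytes). The function name slice_aligned is hypothetical, and reading the start sector of each slice from the disk label is left out of this sketch.

```sh
#!/bin/sh
# Hypothetical check: succeed if a slice start sector is a multiple of 16.
# $1 = start sector of the slice (integer, 512-byte sectors assumed)
slice_aligned() {
    [ $(( $1 % 16 )) -eq 0 ]
}
```

For example, a slice starting at sector 2048 passes, while one starting at sector 34 does not.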

Some of the tests may report errors that need to be interpreted. For example, the NETx test may report some collisions on an interface, but if this interface is tuned to full duplex during the boot process, this is acceptable. Unlike the system_storage_info_sol.sh script, the tests for SoCoWo compliance do not support custom settings. This is intentional, so that the exceptions defined for that script become visible and can be adjusted.


Note
The SoCoWo iSCSI test will not work completely on ancient Solaris 10 releases because it's designed for the current implementation.

Common Tests

This executes a list of tests that are required to fulfill the requirements given by the categories in the management summary.

Daily Tests

This executes a list of tests that are intended to be checked daily by an automation tool.

Options

If the script is started without any option, the essential tests are executed in quiet mode with reporting to syslog. This is the same as calling the script with option “-d”, which is equal to “-e -q -s -y”. The following additional options are implemented:

  • -f = full test mode, execute all tests also in non-global zones
  • -h = help
  • -l = shows the list of checks for selected mode
  • -n = normal output, shows only the found problems (DEFAULT)
  • -p = skip header and footer, prints only test results
  • -q = quiet mode, no output except summary
  • -s = report to syslog
  • -v = verbose output, shows all output, not only the found problems
  • -w = without color (default is to determine if possible)
  • -x = no output, reports summary as special CSV, selects daily tests
  • -y = assume yes for all questions
  • -z = no output, reports summary as CSV
  • -V = print version number
  • -a = select an area of tests
  • -c = common tests in verbose mode
  • -e = essential tests (DEFAULT)
  • -i = daily tests
  • -m = minimum requirements
  • -r = recommended settings
  • -t = select a special test

With options “-a” and “-t” the script requires additional arguments. The areas are the categories from the management summary (bootenv, system, cpumem, network, storage, virtualization, general and platform); the tests are visible in the test lists above, or use “-r -l” to see all tests.

The normal output will only show the tests that have findings, the verbose output will also show the tests with no findings, and the quiet output only prints a summary.
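Option handling of this kind is commonly written with the POSIX getopts builtin. The sketch below covers only a subset of the options listed above and is not the script's actual parser; the function name parse_options and the variables mode, output, syslog and assume_yes are assumptions for illustration.

```sh
#!/bin/sh
# Hypothetical parser for a subset of the documented options.
parse_options() {
    mode="essential"; output="normal"; syslog=0; assume_yes=0
    # without any option the documented default is "-e -q -s -y"
    if [ "$#" -eq 0 ]; then
        set -- -e -q -s -y
    fi
    OPTIND=1
    while getopts "ceimrnqvsy" opt "$@"; do
        case "$opt" in
            c) mode="common"; output="verbose" ;;  # common mode is verbose
            e) mode="essential" ;;
            i) mode="daily" ;;
            m) mode="minimum" ;;
            r) mode="recommended" ;;
            n) output="normal" ;;
            q) output="quiet" ;;
            v) output="verbose" ;;
            s) syslog=1 ;;
            y) assume_yes=1 ;;
            *) return 1 ;;                         # unknown option
        esac
    done
    printf 'mode=%s output=%s syslog=%s yes=%s\n' \
        "$mode" "$output" "$syslog" "$assume_yes"
}
```

Calling `parse_options -c -s` would print `mode=common output=verbose syslog=1 yes=0`. Options that take arguments, such as “-a” and “-t”, would need a trailing colon in the getopts string and are omitted here.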

Syslog Reporting

Even if you select syslog reporting, not all findings result in a syslog event. Only clear warnings are logged to syslog and therefore passed on to system monitoring. Currently this happens for the results of the following tests:

  • BOOT
  • RL
  • MIRR
  • ZFS
  • SVM
  • METADB
  • VXVM
  • DUMP
  • NTP
  • SWAP
  • LOAD
  • DNS
  • LDAP
  • IF
  • NET
  • NFS
  • FC
  • ISCSI
  • FMA
  • CPU

The priority is “audit.crit” and the tag is “SOLMIN”, so the incident in ServiceCenter will have priority 3.
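The reporting rule above can be sketched with the standard logger command. This is an assumption about how such reporting could look, not the script's own code: the function name log_finding and the variables SYSLOG_TESTS and LOGGER are hypothetical, while the priority, the tag and the list of tests are taken from this section.

```sh
#!/bin/sh
# Tests whose findings are forwarded to syslog (list from this section)
SYSLOG_TESTS="BOOT RL MIRR ZFS SVM METADB VXVM DUMP NTP SWAP LOAD DNS LDAP IF NET NFS FC ISCSI FMA CPU"
# Command used for logging; can be overridden for a dry run
LOGGER=${LOGGER:-logger}

# Hypothetical reporter: forward a finding only for the listed tests.
# $1 = test name, $2 = finding text
log_finding() {
    case " $SYSLOG_TESTS " in
        *" $1 "*)
            "$LOGGER" -p audit.crit -t SOLMIN "$1: $2"
            ;;
        *)
            return 0    # findings of other tests are not sent to syslog
            ;;
    esac
}
```

Setting `LOGGER=echo` before calling the function prints the logger arguments instead of writing to syslog, which is convenient for trying the rule without opening incidents.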

Output

If executed in common mode with syslog reporting you see an output similar to this one (common mode defaults to verbose output):

root@xxxxxx> ./health_check_solaris.sh -c -s
>>> Health Check Script Solaris Version hc-5.2 <<<

You have to read the output and take actions to the statements that are not ok.
Type "Yes" if you want to continue: Yes

Number of tests to be executed in mode (common) ---------------------------------- [ 40 ]
BOOT: bootblock on xxxxxxxx xxxxxxxx okay ........................................ [ OK ]
RL: Milestone is (all) ........................................................... [ OK ]
MIRR: root filesystem in (rpool) mirrored with ZFS ............................... [ OK ]
PKG: there are some missing files ................................................ [ Warning ]
/opt/SUNWscat/bin/sendusage
ZFS: root pool (rpool) okay ...................................................... [ OK ]
ZPOOLx: usage of all ZFS pools okay .............................................. [ OK ]
SVM: not used for root filesystem ................................................ [ OK ]
METADB: meta database not in use ................................................. [ OK ]
VXVM: not used for root filesystem ............................................... [ OK ]
DUMP: Savecore configuration not okay ............................................ [ Warning ]
Savecore is disabled
PW: all entries match specification .............................................. [ OK ]
GRP: all entries match specification ............................................. [ OK ]
NTP: time synchronized ........................................................... [ OK ]
NTPx: time server quality is good ................................................ [ OK ]
SID: okay (SGERxxxxxxxx) ......................................................... [ OK ]
SWAP: okay, 17% = 2949 MB free swap .............................................. [ OK ]
LOAD: okay (1, max is 8) ......................................................... [ OK ]
PAGE: no paging, free memory is 482 MB ........................................... [ OK ]
PROC: no long running process .................................................... [ OK ]
SCAT: sanity check okay .......................................................... [ OK ]
IOWAIT: okay, 0 blocked i/o threads in 5 seconds ................................. [ OK ]
GW: default gateway xxx.xxx.xxx.xxx is reachable ................................. [ OK ]
GWx: all additional gateways are reachable ....................................... [ OK ]
DNS: all servers okay ............................................................ [ OK ]
LDAP: all servers okay ........................................................... [ OK ]
LDAPSEARCH: all servers okay ..................................................... [ OK ]
IF: all ipv4 interfaces okay ..................................................... [ OK ]
IF6: no active ipv6 interfaces ................................................... [ OK ]
NET: some ipv4 interfaces with reasonable error rate ............................. [ OK ]
NET6: no active ipv6 interfaces .................................................. [ OK ]
MTU: on all interfaces okay ...................................................... [ OK ]
NFS: all nfs servers reachable ................................................... [ OK ]
FC: all targets are active ....................................................... [ OK ]
MPXIO: all fc disks have at least two paths ...................................... [ OK ]
VXDMP: not active ................................................................ [ OK ]
ISCSI: not in use ................................................................ [ OK ]
VSW: not installed ............................................................... [ OK ]
FMA: no errors seen .............................................................. [ OK ]
CPU: all cpu okay ................................................................ [ OK ]
SVC: no errors seen .............................................................. [ OK ]
Number of tests with findings ---------------------------------------------------- [ 2 ]
Management Summary for amm1s010 on 2016_07_12:
==============================================================================================
Boot parameters .................................................................. [ OK ]
System Config .................................................................... [ Warning ]
CPU/Memory ....................................................................... [ OK ]
Network .......................................................................... [ OK ]
Storage .......................................................................... [ OK ]
Virtualization ................................................................... [ OK ]
General .......................................................................... [ OK ]
Platform Specific ................................................................ [ OK ]
==============================================================================================

In this case one file is missing and saving of crash dumps is disabled. Because the “PKG” test is not in the list of tests reported to syslog, only one message is written to the system log, where system monitoring picks it up:

Jul 12 16:15:14 xxxxxxx SOLMIN: [ID 702911 audit.crit] /etc/epmf/healthcheck.info => DUMP:
Savecore configuration not okay
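
To see which tests have been reported to syslog, the test name can be pulled out of each “SOLMIN” message. The following is only a sketch: the helper name solmin_failed_tests is hypothetical, it assumes the message layout shown above (test name after “=>”, followed by a colon), and the log file that receives these messages depends on the local syslog configuration.

```shell
# solmin_failed_tests (hypothetical helper): read syslog lines on stdin and
# print the test name of every message tagged SOLMIN.
# Assumes the layout "... SOLMIN: [ID ...] <logfile> => <TEST>: ..." shown above.
solmin_failed_tests() {
    sed -n 's/.* SOLMIN: .* => \([A-Za-z0-9]*\):.*/\1/p'
}

# Possible usage (the log file location is an assumption):
#   solmin_failed_tests < /var/adm/messages | sort -u
```

Lines without the SOLMIN tag are ignored, so the whole system log can be fed through the function.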


Tip
The missing file reported by the PKG test is not critical. It is a known bug after updating SUNWscat and can be repaired with lower priority.

The errors are also logged to the file /etc/epmf/healthcheck.info:

Logfile of warnings from hc-5.2 for amm1s010 on 2016_07_12:

PKG: there are some missing files [ Warning ]
/opt/SUNWscat/bin/sendusage
DUMP: Savecore configuration not okay [ Warning ]
Savecore is disabled


Note
If no error was found the file only contains the header.

The exit code of the script is the number of issues found, so an exit code of zero means that no anomaly was found.
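
Because the exit code carries the number of findings, a wrapper can branch on it without parsing the output. A minimal sketch, assuming a POSIX shell; classify_findings is a hypothetical helper and not part of the health check script:

```shell
# classify_findings (hypothetical helper): turn the exit code of the
# health check, i.e. the number of findings, into a one-line verdict.
classify_findings() {
    findings=$1
    if [ "$findings" -eq 0 ]; then
        echo "healthy"
    else
        echo "findings: $findings"
    fi
}

# Possible usage on a host where the script is installed:
#   ./health_check_solaris.sh -r -s -z > /dev/null
#   classify_findings $?
```

Keep in mind that this only applies to the output modes in which the exit code reflects the number of findings.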

CSV Output

If the option “-z” for CSV output is selected, only the status part of the management summary, together with the host name, is printed as a list. This makes it possible to run the script over a range of servers and load the CSV output into a spreadsheet for analysis. The output for the machine above in CSV mode (the recommended test list is used here because the common test list defaults to verbose output):

root@xxxxxxxx> ./health_check_solaris.sh -r -s -z
xxxxxxxx,OK,Warning,OK,OK,OK,OK,OK,OK

Syslog messages and the exit code are the same as in normal output mode.
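
When CSV lines from several servers are collected, the areas that need attention can be listed with a small filter. This is a sketch under one assumption: the eight status columns follow the order of the management summary shown earlier (Boot parameters through Platform Specific). The function name report_csv_findings is hypothetical.

```shell
# report_csv_findings (hypothetical helper): read "-z" CSV lines on stdin and
# print one line per area whose status is neither OK nor not available.
# Assumption: columns 2..9 are in the same order as the management summary.
report_csv_findings() {
    awk -F, '
    BEGIN {
        n = split("Boot parameters,System Config,CPU/Memory,Network,Storage,Virtualization,General,Platform Specific", area, ",")
    }
    # Only handle lines with a host name plus one status per area.
    NF == n + 1 {
        for (i = 1; i <= n; i++) {
            status = $(i + 1)
            if (status != "OK" && status != "NA" && status != "N/A")
                printf "%s: %s = %s\n", $1, area[i], status
        }
    }'
}

# Possible usage with a file of collected CSV lines (file name is an example):
#   report_csv_findings < all_hosts.csv
```

Areas reported as not available are skipped here because they indicate excluded tests rather than findings; drop that condition if they should be listed as well.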

The script implements a second CSV output format, enabled with the “-x” option, which also selects the “daily” list of tests. The output format differs slightly from the previous one:

root@xxxxxxxx > ./health_check_solaris.sh -x
SGER xxxxxxxx;amm1s010;OK;OK;OK;OK;OK;NA;OK;OK


Note
The healthy state of some areas differs because a different selection of tests is active here.

Syslog messages (if selected) are also the same as in normal output mode. Regardless of the number of findings, the exit code is always zero in this mode.

Exceptions

An optional exceptions file, “/etc/epmf/health_check_exceptions”, can be used to exclude tests. To exclude a test, add its name to this file, one test per line. The list of test names is printed when the script is run with option “-l”.
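
Maintaining the exceptions file by hand is straightforward, but a small helper avoids duplicate entries. The sketch below assumes a POSIX shell and a POSIX grep; add_exception is a hypothetical function, and “PKG” is used only as an example name, so check the output of option “-l” for the exact spelling of the test names.

```shell
# add_exception FILE TEST (hypothetical helper): append TEST to the exceptions
# file FILE, one test per line, unless it is already listed.
add_exception() {
    file=$1
    test_name=$2
    # -x matches whole lines only, -F treats the name as a fixed string.
    if [ -f "$file" ] && grep -qxF "$test_name" "$file"; then
        return 0
    fi
    printf '%s\n' "$test_name" >> "$file"
}

# Possible usage (run as root):
#   add_exception /etc/epmf/health_check_exceptions PKG
```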


Important
If you exclude all tests for a category of the management summary, that category gets the status "N/A" or "NA", meaning "not available". The same happens if you select only a range of tests or areas that does not cover all categories.

Availability

The installation of the health check script is included in every system profile. Therefore, a recent version of the script is available on every Solaris system installed with the T-Systems default deployment. The location of the script differs between Solaris 10 and 11:

  • Solaris 10: file /opt/TSYStools/bin/health_check included in package TSYStools
  • Solaris 11: file /usr/local/sbin/healthcheck included in package tsys/tools/healthcheck

The Solaris 11 package will be updated with every general update. The update of the Solaris 10 package needs manual intervention.
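
A wrapper that has to run on both releases can choose the location from the release string. A minimal sketch, relying on the fact that `uname -r` reports 5.10 on Solaris 10 and 5.11 on Solaris 11; healthcheck_path is a hypothetical helper name.

```shell
# healthcheck_path RELEASE (hypothetical helper): print the location of the
# health check script for a release string as reported by `uname -r`.
healthcheck_path() {
    case "$1" in
        5.10) echo "/opt/TSYStools/bin/health_check" ;;
        5.11) echo "/usr/local/sbin/healthcheck" ;;
        *)    echo "unsupported release: $1" >&2
              return 1 ;;
    esac
}

# Possible usage: list the available tests on the current host.
#   hc=$(healthcheck_path "$(uname -r)") && "$hc" -l
```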

Appendix 1. Document Control

Section “Metadata of Document” is required for each document. All other sections are required for “controlled documents” only.

Metadata of Document

Title of Document: Solaris Health Check Script
Scope: Company Global IT Operations
Kind of Document:
Privacy: Internal
Publishing Organization: UNIX operational org
Owner of Document:
SiberSafe Role:
Publisher: Max Mustermann
Author/Approver: Max Mustermann
Coordinator: Max Mustermann
Writer: Max Mustermann
Process:
Language: EN
Review Interval: 1 year
Preserve Period: 1 year

Joint Underwriting (optional)

Agreement by signing.

Scope    Name    Data    Signum

Proof of Releases

By signing, this document is released as a mandatory instruction for employees. Preceding versions are invalid from its release date on.

        Coordinated (Author)    Formally checked (Editor)    Released (Owner)
Name
Date

Valid Documents

File Name    Remarks




Distribution

File Name    Position / Company



