Topic: The Role and How To's of Core File Generation

This document gives Notes administrators and Notes support analysts information on the Domino/Notes (Notes) UNIX fault architecture, with a focus on getting the best possible information from a crash. By covering the rules of core file generation, it should help you avoid configuration pitfalls that prevent the desired result when you are trying to resolve a fatal software problem.

What is a Crash?

A server crash is an event that generates a Notes Panic or Fatal Error. To determine whether a server event qualifies as a crash, run nsd diagnostics (with no arguments) immediately after a server becomes non-responsive, then search the output file for the words Fatal, Panic or Segmentation.

If the output does not contain one of these words, the event should not be considered a crash (for purposes of this discussion). If it does, the next question is: did the server drop core, or can it be made to drop core?
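
The keyword check described above can be sketched as a small shell function. The function name is_crash and the example path are hypothetical; this document does not say where nsd writes its output file, so substitute the file that nsd actually produced.

```sh
# Sketch: decide whether an nsd output file indicates a crash.
# is_crash is an illustrative name, not part of nsd.
is_crash() {
    # grep -E matches any of the three keywords; -q suppresses output
    # so only the exit status is used.
    grep -E -q 'Fatal|Panic|Segmentation' "$1"
}

# Example use (the path is an assumption):
# if is_crash /tmp/nsd_output.log; then echo "crash"; else echo "not a crash"; fi
```

The match is case sensitive on purpose, because the text above names the words exactly as Fatal, Panic and Segmentation.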

Note: Only nsd versions 2.0 and above can evaluate a core file.

Summary of Terms

Fatal (as in Fatal Error) - a crash which occurs when the Notes server violates an operating system (OS) construct because of OS or Notes-related coding errors. The most common Fatal Errors are Type a and Type b, which are hexadecimal for UNIX signals 10 (bus error) and 11 (segmentation violation).

Panic - a crash in which Notes violates its own internal checks, as a result of OS or Notes-related coding issues. A common violation area is a handle that is invalid, null or out of range.

Segmentation - This actually means that a core file was evaluated and the Notes server most likely dropped core. To determine why this might not be true, refer to the Other Factors section below.
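
The hexadecimal types in the Fatal entry can be converted to decimal signal numbers with shell arithmetic. This is a minimal sketch; the function name fatal_type_to_signal is hypothetical, and the signal names in the comments are the ones given in the definitions above.

```sh
# Sketch: convert the hexadecimal type from a Fatal Error message to a
# decimal UNIX signal number. fatal_type_to_signal is an illustrative name.
fatal_type_to_signal() {
    # Shell arithmetic accepts a 0x prefix for hexadecimal constants.
    echo $((0x$1))
}

fatal_type_to_signal a   # prints 10 (bus error, per the definition above)
fatal_type_to_signal b   # prints 11 (segmentation violation)
```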

To Core or Not to Core

First, you must establish what a crash is and how to determine (from nsd) the result of the crash. The following questions should be considered by those who want to understand the behavior of a Notes server:

1. When Notes crashes, does it drop a core?
2. When Notes crashes, does the thread in violation exit?
Caveat: If a core is created, the entire process exits. Therefore, the second question is only meaningful if a core is not generated.

The answer to the first question is somewhat complex: it is both yes and no, because whether Notes drops core depends on environment variables and other factors. The following sections describe what should happen as a result of environment variables in different releases of the Notes server code.

Customers who are instructed to make the Notes server drop core are doing so to get the most information possible on the fault of the Notes server.

Environment Variables

(behavior of Notes releases prior to release 4.13_b.8 and all 4.5x releases prior to Domino release 4.51):

If the environment variable "STX9" is set, Notes drops the core and, consequently, exits. If the variable is not set, Notes spins until a debugger is attached or the thread is killed.

NOTE: With release 4.13_b.8, release 4.14 beta and release 4.5, the behavior is exactly opposite. That is, the default is to drop the core. In order to spin, the "STX9" variable must be set.

In all releases after 4.14 beta and after 4.51, the default is to spin until a debugger is attached or the thread is killed. If the environment variable Debug_Enable_Core is set, the core file is generated. This means that in release 4.14 and above (for 4.1x) and in release 4.52 and above (for 4.5x), the presence or absence of Debug_Enable_Core alone determines whether the server drops core, which makes the debugging process more straightforward.

NOTE: The variable STX9 is no longer used after releases 4.13 and 4.51.
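
For sh or ksh, setting the variable might look like the sketch below. The upper-case spelling and the value 1 follow the summary table in this document. Where the lines should live (for example, the startup file of the account that runs the Notes server) is an assumption, not something this document specifies.

```sh
# Sketch for sh/ksh: request a core file on a fault
# (release 4.14 and above, release 4.52 and above).
# The spelling DEBUG_ENABLE_CORE=1 is taken from the summary table;
# placing these lines in the server account's startup file is an assumption.
DEBUG_ENABLE_CORE=1
export DEBUG_ENABLE_CORE
```

The variable must be exported before the server is started, otherwise the server processes do not inherit it.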

The summary table below shows how this behavior changed across releases.

Summary Table
Version of Notes        Default Behavior   Environment Variable   Result
4.5, 4.13 and below     Spin               STX9=1                 Core
4.13_b.8, 4.5a, 4.51    Core               STX9=1                 Spin
4.14, 4.52 and above    Spin               DEBUG_ENABLE_CORE=1    Core
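
The summary table can also be read as a lookup, sketched here as a shell function. The group labels (legacy, transitional, current) and the function name fault_behavior are hypothetical shorthand for the three table rows; they are not Notes terminology.

```sh
# Sketch: look up the expected fault behavior from the summary table.
# Group labels are hypothetical shorthand for the table rows:
#   legacy       = 4.5, 4.13 and below
#   transitional = 4.13_b.8, 4.5a, 4.51
#   current      = 4.14, 4.52 and above
# The second argument is "set" if the row's variable (STX9 or
# DEBUG_ENABLE_CORE) is set to 1, and anything else otherwise.
fault_behavior() {
    case "$1:$2" in
        legacy:set)        echo core ;;
        legacy:*)          echo spin ;;
        transitional:set)  echo spin ;;
        transitional:*)    echo core ;;
        current:set)       echo core ;;
        current:*)         echo spin ;;
        *)                 echo unknown; return 1 ;;
    esac
}

fault_behavior current set     # prints core
fault_behavior transitional    # prints core (the default for that row)
```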

Other Factors

The following factors can either help or prevent the Notes server from dumping core in a customer engagement.

Core File Size

If the OS is not configured with a sufficiently large core size limit, the core generated by the Notes server may be truncated. A truncated core file generally adds no value to customer problem-solving. The system information section of a complete nsd dump shows the resource limits, as in the sample below. Ensure that both the Soft and Hard Limits for core files are set to unlimited.

For AIX, this should be done in all new customer engagements, as the default core size is set to 2048 blocks on the user level, regardless of the hard limit setting. This means that a dumped core file is at most 1 MB, so a Notes server core would be truncated if the core size setting was not changed. Depending on the shell, the limit (csh) or ulimit (sh, ksh) command can be used to increase the core dump size. This needs to be incorporated into the shell startup file (.profile, .cshrc or .kshrc) for the account running the Notes server.
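
The arithmetic behind the 1 MB figure, and a way to inspect the limits of the current sh or ksh session, can be sketched as follows. The 512-byte block size is an assumption; it is the size implied by "2048 blocks" equalling 1 MB. The function name blocks_to_bytes is hypothetical.

```sh
# Sketch: convert a core size limit in blocks to bytes, assuming
# 512-byte blocks (the size implied by "2048 blocks = 1 MB" above).
blocks_to_bytes() {
    echo $(( $1 * 512 ))
}

blocks_to_bytes 2048    # prints 1048576, that is 1 MB

# Show the soft and hard core limits of the current sh/ksh session.
echo "soft core limit: $(ulimit -S -c)"
echo "hard core limit: $(ulimit -H -c)"
```

Run the two ulimit lines as the account that starts the Notes server, since limits are inherited per login session.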

In the sample case below, the soft limit for core is too small and a truncated core will be generated. The nsd script version 2.38 and above includes checks that tell the nsd dump analyzers that the core is truncated, and to increase the core size.

Sample Resource Limits from an nsd dump.

Resource Limits:

Soft/Current Limits:
================
time(seconds) unlimited
file(blocks) 2048
data(kbytes) 2097148
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 64
memory(kbytes) unlimited

Hard Limits:
==========
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 2097148
stack(kbytes) 2097148
coredump(blocks) unlimited
nofiles(descriptors) 1024
memory(kbytes) unlimited
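
The coredump entries in a Resource Limits section like the one above can be checked mechanically. This is a minimal sketch that assumes the section headers and the "coredump(blocks)" line format shown in the sample; the format may differ between nsd versions, and the function name check_core_limits is hypothetical.

```sh
# Sketch: report any coredump limit in an nsd dump that is not "unlimited".
# check_core_limits is an illustrative name; the line format is taken from
# the sample above and may differ between nsd versions.
check_core_limits() {
    awk '
        BEGIN                    { bad = 0 }
        /^Soft\/Current Limits:/ { kind = "soft" }
        /^Hard Limits:/          { kind = "hard" }
        $1 == "coredump(blocks)" && $2 != "unlimited" {
            printf "%s coredump limit is %s blocks\n", kind, $2
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example use (the file name is an assumption):
# check_core_limits nsd_output.log || echo "increase the core size limit"
```

The function exits with a non-zero status when at least one coredump limit is restricted, so it can be used directly in an if statement.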

Sample setting of core dump size to unlimited:

csh       limit coredumpsize unlimited
ksh, sh   ulimit -c unlimited

CleanupScriptPath

The CleanupScriptPath notes.ini parameter should not affect the generation of a core file per se. It is generally used in tandem with Debug_Enable_Core=1. However, because the script named by CleanupScriptPath can do anything a shell script can, including killing the Notes server processes, it may prevent the Notes server from dropping a core file. Core files are generated after the CleanupScriptPath script returns.
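
A cleanup script that only collects diagnostics, and therefore leaves the faulting process alone, might be sketched as follows. NSD_CMD, OUT_DIR and run_cleanup are hypothetical names, and capturing the standard output of nsd is an assumption; nsd may also write its own output file.

```sh
# Sketch of a cleanup script body that only collects diagnostics.
# NSD_CMD, OUT_DIR and run_cleanup are hypothetical names; adjust them
# to the local installation.
NSD_CMD=${NSD_CMD:-nsd}
OUT_DIR=${OUT_DIR:-/tmp}

run_cleanup() {
    # Run nsd with no arguments and keep whatever it prints.
    # No server processes are killed here, so the faulting process can
    # still reach the point where the core file is generated.
    "$NSD_CMD" > "$OUT_DIR/nsd.$$.out" 2>&1
}

# The script named by CleanupScriptPath would then simply call:
# run_cleanup
```

Keeping the script short matters because, as noted above, the core file is generated only after the script returns.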

FaultRecovery

FaultRecovery is an environment variable. It is used to restart the Notes server, in case a Notes Panic or Fatal Error occurs. If it is set, the Notes server restarts before reaching the point in the code where a core file is generated. As a result, environments using this variable are choosing not to drop core.

KillProcess=1

The KillProcess notes.ini parameter is for partitioned servers. It is enabled by default when a partitioned server is set up. It is not related to core dropping, but both mechanisms are triggered by a crash. When there is a crash and KillProcess is set, all Notes processes are killed and their IPC resources are removed. The normal mechanism then determines whether a core is to be dropped. The problem is that, with KillProcess set, the crashing process is in some cases killed before it has the opportunity to generate the core. Release 4.52 should resolve this anomaly.

Kill -11

This is a way to force a core to be generated, should the software not be configured to drop core. Check the console log for the Panic or Fatal Error message to determine the process ID of the crashed process. You can then send the signal to this individual process only.
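
Sending the signal can be sketched as a small function with basic safety checks. The function name force_core is hypothetical, and the process ID must be taken from the console log as described above. Whether a usable core file actually appears still depends on the core size limits discussed under Core File Size.

```sh
# Sketch: send signal 11 (segmentation violation) to one process so
# that it dumps core. force_core is an illustrative name.
force_core() {
    pid=$1
    # Accept only a plain decimal process ID.
    case "$pid" in
        ''|*[!0-9]*) echo "usage: force_core PID" >&2; return 2 ;;
    esac
    # kill -0 sends no signal; it only checks that the process exists
    # and that this account is allowed to signal it.
    kill -0 "$pid" 2>/dev/null || { echo "no such process: $pid" >&2; return 1; }
    kill -11 "$pid"
}

# Example use (12345 stands for the process ID found in the console log):
# force_core 12345
```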

Segmentation

Just because Segmentation appears in the nsd output does not necessarily mean that the correct crash is being reported. The nsd diagnostic has many checks; one is time detection.

For example, if a core is a day old, nsd reports this in its error log. However, if you use the CleanupScriptPath notes.ini parameter to run nsd on a fault, nsd may run while the Panic or Fatal Error is still in progress. The core file is generated only after the CleanupScriptPath script has executed, so if a Notes server faults more than once a day, the core that nsd evaluates may be off by one: it may belong to the previous fault.
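
One way to guard against evaluating a stale core is to compare the core file's modification time with a file written during the fault under investigation, such as the nsd output saved by the cleanup script. This is a minimal sketch; the function name core_is_newer and the example file names are hypothetical.

```sh
# Sketch: succeed only if the core file is newer than a reference file,
# for example the nsd output saved for the fault under investigation.
# core_is_newer is an illustrative name.
core_is_newer() {
    core=$1
    ref=$2
    # find prints the core path only if the core was modified after
    # the reference file; an empty result means the core is older.
    [ -n "$(find "$core" -newer "$ref" -print 2>/dev/null)" ]
}

# Example use (both paths are assumptions):
# core_is_newer ./core /tmp/nsd.12345.out || echo "core predates this fault"
```

Because the core is generated after the cleanup script returns, a core that belongs to the current fault should be newer than the output that script saved.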

NOTE: Core files are best evaluated by nsd on the machine where they were created. The nsd diagnostic builds the output of calls in the core from both the version of Notes code and OS libraries installed. This speeds the evaluation of the core file from a stack level.

Since there is a risk that nsd will hang on a process, as a result of debugger anomalies, it may be wise to create a user account to run nsd against any core files created. All that needs to be present is a notes.ini file. This way, nsd can be rerun against a core file if the proper information was not captured when the server crashed, without attaching to any live server processes. You can also run nsd -noinfo to reduce the output to the crashing stack.

Summary

As you can see, there are many possible outcomes when a Notes server crashes. A crash is defined as an event on a running server for which nsd reports a Panic, Fatal Error or Segmentation fault. A core file is not generated automatically; whether one is generated depends on the user's environment. Starting with release 4.14 and release 4.52, there are several environmental toggles that can affect or skew the desired server behavior on a fault.