System does not clear RAID controller cache during shutdown


Details
On systems with ESX 4.0 or ESXi 4.0 installed, if the RAID controller battery backup unit is completely discharged after a shutdown, or if a locally attached disk is removed and not returned to the system, data corruption might occur because the RAID controller cache is not cleared while shutting down the server.

Note: The typical battery discharge period is approximately 48 hours.

You may experience these symptoms, depending on what is not flushed from the RAID cache:

* Failure to boot
* Loss of customized configuration
* Data loss
* A message during the power-on self-test (POST) after rebooting an ESX 4.0 or ESXi 4.0 system, indicating that the cache is cleared (or being flushed). This indicates that the write-back cache was not flushed to disk during the previous system shutdown.

Solution
This issue may occur when all of these conditions are true:

* RAID controllers use these drivers (the attached script helps detect the affected drivers):
o megaraid2
o megaraid_sas
o aacraid

* The cache policy in the controllers is set to write-back cache

Note: To check if the cache policy is set to write-back cache, boot the host machine and check the controller BIOS. For more information, refer to the documentation for the specific RAID controller.

* The RAID controller battery backup unit is completely discharged
* SAS or SCSI storage is directly attached to the server
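
The driver condition above can be spot-checked by hand with a small filter. The following is only a sketch and is not the attached chkdrvr.pl script: affected_drivers is a hypothetical helper, and it assumes that the module listing it reads puts the module name in the first column (with or without a trailing .o). It covers the driver condition only; the cache policy, battery state, and local storage conditions still need to be checked separately.

```shell
#!/bin/sh
# Hypothetical helper (not part of chkdrvr.pl): reads a module listing on
# stdin and prints the name of every driver this article lists as affected.
# Assumption: the module name is the first whitespace-separated field,
# optionally followed by ".o".
affected_drivers() {
    awk '{
        name = $1
        sub(/\.o$/, "", name)          # tolerate names such as "aacraid.o"
        if (name == "megaraid2" || name == "megaraid_sas" || name == "aacraid")
            print name
    }'
}

# Possible use on the ESX 4.0 Service Console (assumes vmkload_mod -l prints
# one loaded module per line):
#   vmkload_mod -l | affected_drivers
```

Empty output suggests that none of the three drivers is loaded. Any name printed means the remaining conditions are worth checking.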


Check if this article applies to your system

To check if the RAID controller is using any of the affected drivers with local storage attached, use the attached script. chkdrvr.pl is a Perl script for use in:

* vSphere Management Assistant (vMA) 4.0 virtual appliance that is configured to manage ESX 4.0 and/or ESXi 4.0 hosts
* ESX 4.0 Service Console
* vSphere Command-Line Interface (vCLI) 4.0 (on Windows and Linux)

Note: The script does not work in the ESXi 4.0 TechSupportMode shell.

Using the script with the ESX 4.0 Service Console
To use the script with the ESX 4.0 Service Console:

1. Download the attached script file.
2. Transfer the file to a VMFS volume accessible to the ESX 4.0 hosts to be tested. If no shared storage with a VMFS volume is available, transfer the file to each ESX 4.0 host's local disk, for example to the /root or /tmp directory.
3. On the ESX Service Console, expand the file with the command:

# cd /vmfs/volumes/ (or to the location where you stored the file)
# tar zxvf chkdrvr.tgz

4. Run the script:

# ./chkdrvr.pl

The output indicates if this article is applicable to your host. If it is applicable, proceed to the "Action to take" section. If it is not applicable, do not proceed further.

Using the script with vMA 4.0
Notes:

* For details about installing the vMA 4.0 Virtual Appliance, see the vMA 4.0 Release Notes.
* For details about configuring and using the vMA 4.0 Virtual Appliance, see the vMA 4.0 Guide.

To use the script with vMA 4.0:

1. Download the attached script file.
2. Transfer it to the vMA 4.0 Appliance.
3. Expand the file with the command:

# tar zxvf chkdrvr.tgz

To use the script with vCenter Server as the managed target in vMA 4.0 Virtual Appliance:

1. Log on to the vMA 4.0 Virtual Appliance as vi-admin.
2. Register the vCenter server as the managed target with the command:

# sudo vifp addserver <vCenter Server>

3. Enter your vCenter's user name and password.

If the authentication is successful, your prompt shows the current context is now the vCenter host name or IP address. For example:

[vi-admin@vma4-mk ~][vcenter04.acme.com]$

If you receive an error and your prompt does not look like the example, do not proceed until you verify that you are using the correct credentials for the vCenter Server you attempted to register.

4. Run the script:

# ./chkdrvr.pl --vihost <vihost>

where <vihost> is the ESX/ESXi host name as it appears in the vCenter inventory.

The output indicates if this article is applicable to your host. If it is applicable, proceed to the "Action to take" section. If it is not applicable, do not proceed further.

5. Repeat for each ESX/ESXi host managed via vCenter 4.0 to which you are currently connected.
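
Repeating the check for every host in the inventory can be scripted. The following is a minimal sketch, assuming the vMA session already has the vCenter Server as its current target and that chkdrvr.pl is in the current directory. check_vihosts and the CHKDRVR variable are hypothetical names introduced for illustration, and the host names in the usage comment are placeholders.

```shell
#!/bin/sh
# Path to the script attached to this article; override CHKDRVR if it is
# stored elsewhere. (CHKDRVR is a name made up for this sketch.)
CHKDRVR="${CHKDRVR:-./chkdrvr.pl}"

# check_vihosts HOST...
# Runs the check once per host, passing each name with --vihost. Names must
# match the host names shown in the vCenter inventory.
check_vihosts() {
    for vihost in "$@"; do
        echo "=== $vihost ==="         # label each host's output
        "$CHKDRVR" --vihost "$vihost"
    done
}

# Example call with placeholder host names:
#   check_vihosts esx01.acme.com esxi02.acme.com
```

The sketch only labels the output for each host. It does not interpret the result, so the output for each host still needs to be read as described in step 4.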

To use the script with ESX/ESXi 4.0 hosts as the managed targets in vMA 4.0 Virtual Appliance:

1. Log on to the vMA 4.0 Virtual Appliance as vi-admin.
2. Register the ESX/ESXi 4.0 hosts as the managed targets so that you can use the FastPass facility provided by vMA 4.0. Run the command:

# sudo vifp addserver <ESX/ESXi host>

Enter your ESX/ESXi 4.0 host's user name (with root privilege) and password.

3. Repeat the previous step for each ESX/ESXi 4.0 host you manage from this vMA 4.0 Virtual Appliance.
4. To verify the list of registered hosts, run the command:

# vifp listservers

5. Change the context to the first server that you want to check for this issue. Run the command:

# vifpinit <ESX/ESXi host>

The prompt shows the current context is now the ESX/ESXi host name. For example:

[vi-admin@vma4-mk ~][esxi02.acme.com]$

6. Run the script without any arguments:

# ./chkdrvr.pl

Note: If you do not use the FastPass facility, use the following syntax for running the script on vMA 4.0:

# ./chkdrvr.pl --server <server> --username <username> --password <password>

7. Repeat the previous 2 steps for each ESX/ESXi host to be checked.
8. The output indicates if this article is applicable to your host. If it is applicable, proceed to the "Action to take" section. If it is not applicable, do not proceed further.
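
When FastPass is not used, the same check can be looped over a list of hosts with the --server, --username, and --password options from the note above. The following is a sketch under stated assumptions: check_hosts_from_file, the CHKDRVR variable, and the host list file are hypothetical, and the same credentials are assumed to be valid on every host in the list.

```shell
#!/bin/sh
# Path to the script attached to this article; override CHKDRVR if needed.
# (CHKDRVR is a name made up for this sketch.)
CHKDRVR="${CHKDRVR:-./chkdrvr.pl}"

# check_hosts_from_file FILE USERNAME PASSWORD
# FILE holds one ESX/ESXi host name per line. Blank lines and lines that
# start with "#" are skipped.
check_hosts_from_file() {
    hostfile="$1"; username="$2"; password="$3"
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r host || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        echo "=== $host ==="
        # </dev/null stops the script from reading the rest of the host list.
        "$CHKDRVR" --server "$host" --username "$username" \
                   --password "$password" </dev/null
    done < "$hostfile"
}

# Example call with placeholder values:
#   check_hosts_from_file hosts.txt root 'password'
```

A password passed on the command line is visible to other users in the process list while the script runs, so this approach suits a trusted management system only.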

Using the script on vCLI 4.0

Note: On Linux vCLI, using passwords with special characters requires escape characters. For example, type Pa\\\$\\\$w0rd instead of Pa$$w0rd (3 backslashes before each special character). This is not required when using vCLI on Windows.

To use the script on vCLI 4.0:

1. Download the attached script file to the system where vCLI 4.0 is installed.
2. Expand the file.
* On Linux, use the command:

# tar zxvf chkdrvr.tgz

* On Windows, use a tool such as WinZip or WinRAR, then move the expanded file to:

%ProgramFiles%\VMware\VMware vSphere CLI\bin

Notes:
o Change to the above directory before proceeding.
o When you extract the file, you may need to rename it to chkdrvr.pl.

3. Run the script.
* On Linux, run the command:

./chkdrvr.pl --server <server> --username <username> --password <password>

* On Windows, run the command:

chkdrvr.pl --server <server> --username <username> --password <password>

4. The output indicates if this article is applicable to your host. If it is applicable, proceed to the "Action to take" section. If it is not applicable, do not proceed further.

Action to take
No immediate action is required. However, if you plan on shutting down the system for an extended period of time, follow this procedure to prevent this issue from affecting it.

1. Reboot the ESX/ESXi host.
2. During the power-on self-test (POST), press the hot key for Boot Device Order or its equivalent. This allows the RAID controller's BIOS to load and flush the cache if needed. The system then pauses and displays the list of boot devices.
3. Power off the system using the power switch.