Updated 9/16/2013 to incorporate excellent suggestions of commenter Greg Smith.
The most common fix is to increase the default timeout settings in the guest VM, and sometimes on the host machine as well, because the root cause is usually that the SAN took longer than the default timeout to respond. This is usually because the SAN was involved in a failover, which can take more than 60, or even more than 120, seconds in some cases. I generally recommend setting the timeout to at least 180 seconds, though I'm also perfectly happy with 300 seconds or more. I only really have an issue with anything under 120 seconds.
This is in keeping with industry standards, I might add - VMware sets it to 180, NetApp has long requested it be 190, and so on. I don't actually like how the timeouts and such are handled, and I especially do not like that in many scenarios the timeout is a global value applying to both the SAN-provided storage and any local spinning disk (which never needs or wants a timeout value this long), but them's the breaks I'm afraid.
Of course, the usual follow-up question from anyone told this is, "Ok, so where do I do that?" and then you're off to Google, and it can be annoying. Enough so that I decided to compile the answers all in one place, and add some scripts and such to simplify it (and to be usable from automated deployment tools, for instance). So, here you are.
Windows 2000, 2003, 2008, Vista, & Windows 7
Open the registry editor (regedit) and navigate to:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk
Once there, look for 'TimeOutValue'. If it exists, edit it, and if it does not exist, right-click and choose 'Edit/Add Value' and create it. The type is REG_DWORD, and the value should be set in decimal to the timeout in seconds that you desire (so, I suggest, 180).
This change is permanent as far as I know, as well as global, so that's all you've got to do. I am not sure whether a reboot is needed for it to take effect, but you should probably reboot to be sure.
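If you would rather script this than click through regedit, the same value can be expressed as a .reg file. This is only a sketch of the manual steps above, using the 180 seconds I suggest (b4 in hexadecimal); the file name you save it under is up to you.

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]
"TimeOutValue"=dword:000000b4
```

Importing the file (by double-clicking it, or with reg import from an elevated prompt) should create or overwrite TimeOutValue, so check the result in regedit afterwards.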
Linux (2.6+ non-udev)
So the 'easy' but far from elegant solution is to go in and force the timeout to be higher on every block device that needs it. This is done on 2.6 and later kernels (which provide /sys) by echo'ing the time in seconds you want to /sys/block/<device>/device/timeout, substituting the device name for <device>. So, for example, if the main disk (sda) was being offered up from the VM host and originated on a SAN and you wanted to make it time out after 180 seconds, you'd do:
echo 180 > /sys/block/sda/device/timeout
The problem with this is that this isn't permanent, and will only survive until the system is rebooted. The quick and dirty answer to this is to add a command to do this into something like /etc/rc.local or create a full-blown init script that does it (be sure you add the command above the 'exit 0' that often ends the default rc.local file). For completeness, here's a simple script you can call from rc.local (put the contents below into a file, chmod +x it, and then call it from rc.local), that may or may not work for you out of the box (be sure to edit DISKS to be a list of the disks you care about):
#!/bin/sh
# nex7.blogspot.com - VM Disk Timeouts - simple script for non-udev 2.6+ kernels
# - edit DISKS to be a list of disks you want to increase the timeout on to TIMEOUT_V
DISKS="sda sdb sdc"
TIMEOUT_V=180
for DISK in $DISKS; do
    echo $TIMEOUT_V > /sys/block/$DISK/device/timeout
done
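If a disk in the list does not exist (or is a virtio disk with no timeout file), the simple loop above just prints a shell error. Here is a slightly more defensive sketch of the same idea. The function name set_disk_timeouts and the SYSFS_ROOT variable are my own inventions for illustration; SYSFS_ROOT only exists so you can point the function at a scratch directory while trying it out.

```shell
#!/bin/sh
# Hypothetical override for testing; on a real system leave it at /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# set_disk_timeouts TIMEOUT DISK...
# Writes TIMEOUT (seconds) to each disk's timeout file, skipping disks
# that have no writable timeout file. Returns non-zero if any were skipped.
set_disk_timeouts() {
    timeout=$1
    shift
    rc=0
    for disk in "$@"; do
        attr="$SYSFS_ROOT/block/$disk/device/timeout"
        if [ -w "$attr" ]; then
            echo "$timeout" > "$attr"
        else
            echo "skipping $disk: $attr missing or not writable" >&2
            rc=1
        fi
    done
    return $rc
}

# Example call (run as root), e.g. from rc.local:
#   set_disk_timeouts 180 sda sdb sdc
```

The non-zero return for skipped disks is a design choice: it lets a deployment tool notice that the list of disks no longer matches the machine.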
Or, read on for the better way to do it if you have a fairly modern and mainstream distribution.
Linux (2.6+ with udev)
The slightly more complex but a bit more elegant method that I see, and that I wish the various major Linux distributions would adopt directly into their base releases, is something like what the VMware Tools does when installed on a supported Linux distribution. You can see their own explanation at this link.
The issue with this today is that not only is this only added if you install the VMware Tools, but the line it adds to the udev rules only affects disks exposed using VMware, which will not help you if you are using Xen, KVM, VirtualBox, and so on. So, something a bit more agnostic is called for. In building this little blog post, and coming upon this issue (admittedly for the umpteenth time), I decided to go ahead and finally do something about it.
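To give a feel for what a vendor-specific rule of this kind looks like, here is a rough sketch that writes one into a rules directory. It is not the contents of the VMware Tools file, nor of any other real rules file; the function name write_timeout_rule and the file name 99-example-disk-timeout.rules are made up for illustration, the vendor and model strings are whatever you pass in, and it assumes a udev recent enough to support assigning to ATTR{...}.

```shell
#!/bin/sh
# write_timeout_rule RULES_DIR VENDOR_GLOB MODEL_GLOB TIMEOUT
# Writes a single illustrative udev rule that raises the timeout on SCSI
# devices whose vendor and model attributes match the given globs.
write_timeout_rule() {
    rules_dir=$1
    vendor=$2
    model=$3
    timeout=$4
    rule_file="$rules_dir/99-example-disk-timeout.rules"   # hypothetical name
    cat > "$rule_file" <<EOF
# Illustrative only: raise the SCSI timeout for matching virtual disks.
ACTION=="add", SUBSYSTEM=="scsi", ATTR{vendor}=="$vendor", ATTR{model}=="$model", ATTR{timeout}="$timeout"
EOF
    chmod 644 "$rule_file"
}

# Example call (run as root), assuming VMware-style vendor/model strings:
#   write_timeout_rule /etc/udev/rules.d 'VMware*' 'Virtual disk*' 180
```

Because the rule only matches on vendor and model, a disk presented by a different hypervisor simply never matches, which is exactly the limitation described above.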
My investigations so far have concluded there is no danger to 'bad' or unmatched rules in a udev rules file (at worst, you get a warning in syslog on boot from udev complaining about the lines it doesn't like, but it still parses the other rules fine). Thus, a simple single rules file put into /etc/udev/rules.d/ that contains rules for all possible OS and all possible exposed disks from a variety of virtualization hosts seems like the easiest way to go, so I give you this link. You can run the below command directly (as root) to install on most distributions (be sure /etc/udev/rules.d is where they go):
wget -O /etc/udev/rules.d/99-virt-disk-timeouts.rules http://www.nex7.com/files/99-virt-scsi-udev.rules && chmod 644 /etc/udev/rules.d/99-virt-disk-timeouts.rules
After putting it in /etc/udev/rules.d, just reboot. You can verify it is working with this one-liner (you're looking for at least some entries that say '180'; if there are none, either it isn't working or you don't have any disks the rules match against):
for file in `find /sys/devices -iname timeout`; do (echo $file && cat $file); done
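The find above also turns up timeout files that have nothing to do with disks. If you only want the per-disk values, something like the following sketch may be easier to read. The function name show_disk_timeouts and the SYSFS_ROOT variable are hypothetical; SYSFS_ROOT is only there so the function can be pointed at a scratch directory.

```shell
#!/bin/sh
# Hypothetical override for testing; on a real system leave it at /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# show_disk_timeouts
# Prints "<disk> <timeout>" for every block device that exposes a
# device/timeout file. Disks without one (virtio, for example) are
# silently left out.
show_disk_timeouts() {
    for attr in "$SYSFS_ROOT"/block/*/device/timeout; do
        [ -r "$attr" ] || continue
        disk=${attr#"$SYSFS_ROOT"/block/}   # strip the leading path
        disk=${disk%%/*}                    # keep only the device name
        printf '%s %s\n' "$disk" "$(cat "$attr")"
    done
}

# Example call:
#   show_disk_timeouts
```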
And that's it. I've tested the file on CentOS 6.3 on top of KVM, Ubuntu 12.04 on top of KVM, and the VMware ones on a variety of OS's and versions. As far as I know, the list of presently supported virtualization platforms and guest OS's of this file are:
VMware 5+ (disks offered up via scsi)
KVM 1.0+ (disks offered up via ide or scsi - virtio doesn't expose timeout at guest level)
XenServer 5+ (disks offered up via scsi)
RHEL 5+ / CentOS 5+
If you run into any problems with this file, please let me know.
FreeBSD 9
There are two variables that appear to be of note - and common wisdom seems to jump between which one to tweak. I'll err on the side of timeout over retry here, but that may not be the best option in all situations. To modify it, and it is a global variable as far as I can tell, you need to modify 'kern.cam.da.default_timeout' and change it from its default of 60 to 180. To modify it permanently, edit your /etc/sysctl.conf and add a line like this:
kern.cam.da.default_timeout=180
If you're curious, the other variable mentioned online is 'kern.cam.da.retry_count', but I am less sure if the advice about it is fair or true.
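For automated deployments, editing /etc/sysctl.conf by hand is not much fun, so here is a small sketch of a helper that sets a key in a sysctl.conf-style file without leaving duplicate entries behind. The function name set_sysctl_conf is my own; nothing like it ships with FreeBSD as far as I know.

```shell
#!/bin/sh
# set_sysctl_conf FILE KEY VALUE
# Ensures FILE contains exactly one "KEY=VALUE" line: any existing
# assignment of KEY (with or without spaces around '=') is dropped and
# the new one is appended. Other lines are left untouched.
set_sysctl_conf() {
    conf=$1
    key=$2
    value=$3
    # Escape the dots in the key so they match literally in the regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    tmp=$(mktemp) || return 1
    if [ -f "$conf" ]; then
        grep -v "^[[:space:]]*${pattern}[[:space:]]*=" "$conf" > "$tmp"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Example call (run as root):
#   set_sysctl_conf /etc/sysctl.conf kern.cam.da.default_timeout 180
```

To change the running system without a reboot, running sysctl kern.cam.da.default_timeout=180 as root should do it, though I have not checked that on every release.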
NexentaStor (and other OpenSolaris-based derivatives)
So the easy way is to modify the sd timeout value. Unfortunately in OpenSolaris today, this value can only be set in /etc/system for all drives, with no config file method of setting it on a per-disk basis that I am aware of. To modify it globally, add this line to your /etc/system file and reboot:
This is dangerous if there are any disks exposed to your VM that are not coming from a SAN and such, since this is a global value (much like the Windows one). There does exist a method of modifying the live value used by the kernel on a per-disk basis using mdb, but building this into a script to run on boot and when disks change I've decided not to try to tackle at this time. If you want more info, check out Alisdair's post on the issue, found here.