Updated 9/16/2013 to incorporate excellent suggestions of commenter Greg Smith.
Updated 5/13/2014 to incorporate on-the-job learning.
The most common fix is to increase the default disk timeout in the guest VM, and sometimes on the host machine as well, since the root cause is usually that the SAN took longer than the default timeout to respond. This is usually because the SAN was involved in a failover, which can take more than 60 seconds, or in some cases even more than 120. I generally recommend setting the timeout to at least 300 seconds, though I'm also perfectly happy with 600 seconds or more. I only really have an issue with anything under 180 seconds or so.
This is in keeping with industry standards, I might add - VMware sets it to 180, NetApp has long requested it be 180, and so on. I don't actually like how the timeouts and such are handled, and I especially do not like that in many scenarios the timeout is a global value applying to both the SAN-provided storage and any local spinning disk (which never needs or wants a timeout value this long), but them's the breaks I'm afraid.
Of course, the usual follow-up question from anyone told this is, "Ok, so where do I do that?" and then you're off to Google, and it can be annoying. Enough so that I decided to compile them all in one place, and add some scripts and such to simplify it (so they can be included in automated deployment tools, for instance). So, here you are.
Windows 2000, 2003, 2008, Vista, & Windows 7
Open the registry editor (regedit) and navigate to:
HKEY_LOCAL_MACHINE / System / CurrentControlSet / Services / Disk
Once there, look for 'TimeOutValue'. If it exists, edit it, and if it does not exist, right-click and choose 'Edit/Add Value' and create it. The type is REG_DWORD, and the value should be set in decimal to the timeout in seconds that you desire (so, I suggest, 300).
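For automated deployments, the same setting can be captured as a registry import file rather than edited by hand. The fragment below is only a sketch in the standard .reg file format: it sets TimeOutValue to 300 seconds (0x12c in hexadecimal, since .reg files write DWORD values in hex). The file name you save it under is up to you; it can then be imported by double-clicking it or with regedit /s.

```reg
Windows Registry Editor Version 5.00

; Disk timeout in seconds: 0x0000012c = 300 decimal
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]
"TimeOutValue"=dword:0000012c
```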
After that, if you're using the Microsoft iSCSI Initiator in the OS instead of being passed in the disk from a hypervisor, you should also modify the timeout value in the iSCSI initiator. On 2008, Vista, and Windows 7, navigate to:
HKEY_LOCAL_MACHINE / System / CurrentControlSet / Control / Class / {4D36E97B-E325-<HostID>
Under this key you'll find a number of subkeys named 0001, 0002 and so on. Expand each subkey until you find the one subkey that has another subkey called 'Parameters'. Within that Parameters subkey is the value you want, MaxRequestHoldTime. Modify it to 300 (decimal). There is another setting in here, LinkDownTime, that you would set instead if you're planning to use iSCSI MPIO on the Windows OS, but there are also other things to set for that, and they are beyond the scope of this post for now.
These changes are permanent as far as I know, as well as global, so that's all you've got to do. I don't know whether a reboot is needed for them to take effect; you probably should reboot to be sure.
Linux (2.6+ non-udev)
So the 'easy' but far from elegant solution is to go in and force the timeout to be higher on every block device that needs it. On 2.6 and later kernels (which provide /sys) this is done by echoing the time in seconds you want into /sys/block/<device>/device/timeout, substituting the device name for <device>. So, for example, if the main disk (sda) was being offered up from the VM host and originated on a SAN, and you wanted it to time out after 300 seconds, you'd do:
echo 300 > /sys/block/sda/device/timeout
The problem with this is that this isn't permanent, and will only survive until the system is rebooted. The quick and dirty answer to this is to add a command to do this into something like /etc/rc.local or create a full-blown init script that does it (be sure you add the command above the 'exit 0' that often ends the default rc.local file). For completeness, here's a simple script you can call from rc.local (put the contents below into a file, chmod +x it, and then call it from rc.local), that may or may not work for you out of the box (be sure to edit DISKS to be a list of the disks you care about):
#!/bin/bash
#
# nex7.blogspot.com - VM Disk Timeouts - simple script for non-udev 2.6+ kernels
# - edit DISKS to be a list of disks you want to increase the timeout on to TIMEOUT_V
TIMEOUT_V=300
DISKS="sda sdb sdc"
for DISK in $DISKS; do
    # write the new timeout (in seconds) to this disk's sysfs timeout file
    echo "$TIMEOUT_V" > "/sys/block/$DISK/device/timeout"
done
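If you would rather not maintain the DISKS list by hand, a variant of the script above can discover the disks itself. This is only a sketch: the function name set_disk_timeouts and its optional second argument (an alternate sysfs directory, handy for trying it out safely) are my own inventions, and it assumes every sd* device on the system should get the longer timeout, which is not true if you also have local disks.

```shell
#!/bin/bash
# Sketch: raise the timeout on every sd* disk found under a sysfs block directory.
# set_disk_timeouts and its arguments are illustrative names, not an existing tool.
set_disk_timeouts() {
    local timeout="$1"                    # timeout in seconds, e.g. 300
    local sysfs_block="${2:-/sys/block}"  # override only for testing
    local f
    for f in "$sysfs_block"/sd*/device/timeout; do
        # skip unmatched globs and files we cannot write to
        [ -w "$f" ] || continue
        echo "$timeout" > "$f"
    done
}
```

Calling set_disk_timeouts 300 from rc.local (as root) would then apply the value to whatever sd* disks exist at boot.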
Or, read on for the better way to do it if you have a fairly modern and mainstream distribution.
Linux (2.6+ with udev)
The slightly more complex but more elegant method, and one that I wish the major Linux distributions would adopt directly into their base releases, is something like what VMware Tools does when installed on a supported Linux distribution. You can see their own explanation at this link. The issue with this today is that not only is the rule only added if you install VMware Tools, but the line it adds to the udev rules only affects disks exposed by VMware. That will not help you if you are using Xen, KVM, VirtualBox, and so on, so something a bit more agnostic is called for. In building this little blog post, and coming upon this issue (admittedly for the umpteenth time), I decided to go ahead and finally do something about it.
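To give a feel for what a rule of this kind looks like, here is a rough sketch that writes a single example rule into a directory of your choosing. It is not the contents of any vendor's rules file or of the file offered in this post: the function name, the output file name, and the choice to match SCSI devices whose vendor string starts with QEMU are all assumptions for illustration, and it assumes a udev recent enough to understand ATTR{} matches.

```shell
#!/bin/bash
# Sketch only: writes ONE example udev rule. The function name, the rules file
# name and the QEMU vendor match are assumptions made for illustration.
write_example_timeout_rule() {
    local rules_dir="$1"   # e.g. /etc/udev/rules.d
    local timeout="$2"     # timeout in seconds, e.g. 300
    local rules_file="$rules_dir/99-example-disk-timeout.rules"

    # %p is udev's substitution for the device's path below /sys
    cat > "$rules_file" <<EOF
ACTION=="add", SUBSYSTEM=="scsi", ATTR{vendor}=="QEMU*", RUN+="/bin/sh -c 'echo ${timeout} > /sys%p/timeout'"
EOF
    chmod 644 "$rules_file"
}
```

Pointing it at a scratch directory first (write_example_timeout_rule /tmp 300) lets you inspect the generated rule before putting anything under /etc/udev/rules.d.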
My investigations so far have concluded there is no danger to 'bad' or unmatched rules in a udev rules file (at worst, you get a warning in syslog on boot from udev complaining about the lines it doesn't like, but it still parses the other rules fine). Thus, a simple single rules file put into /etc/udev/rules.d/ that contains rules for all possible OS and all possible exposed disks from a variety of virtualization hosts seems like the easiest way to go, so I give you this link. You can run the below command directly (as root) to install on most distributions (be sure /etc/udev/rules.d is where they go):
wget http://www.nex7.com/files/99-virt-scsi-udev.rules; mv 99-virt-scsi-udev.rules /etc/udev/rules.d/; chmod 644 /etc/udev/rules.d/99-virt-scsi-udev.rules
After putting it in /etc/udev/rules.d, just reboot. You can verify it is working with this one-liner (you're looking for results that have at least some entries that say '300'; if there are none, either it isn't working or you don't have any disks the rules match against):
for file in `find /sys/devices -iname timeout`; do (echo $file && cat $file); done
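If you want something you can drop into a monitoring or deployment check, the one-liner above can be extended to flag any timeout below a chosen minimum. This is a sketch under my own naming: report_low_timeouts and its optional second argument (the directory to search, useful for testing) are not part of any existing tool.

```shell
#!/bin/bash
# Sketch: print every timeout file under a sysfs tree and flag values below a minimum.
# report_low_timeouts is an illustrative name; it returns 1 if any value is too low.
report_low_timeouts() {
    local min="$1"                    # minimum acceptable timeout in seconds
    local root="${2:-/sys/devices}"   # override only for testing
    local f value status=0
    for f in $(find "$root" -name timeout -type f 2>/dev/null); do
        value="$(cat "$f" 2>/dev/null)" || continue
        # ignore anything that is not a plain non-negative integer
        case "$value" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$value" -lt "$min" ]; then
            echo "LOW $value $f"
            status=1
        else
            echo "OK $value $f"
        fi
    done
    return "$status"
}
```

For example, report_low_timeouts 180 prints one line per timeout file and exits non-zero if any disk is still below 180 seconds.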
And that's it. I've tested the file on CentOS 6.3 on top of KVM, Ubuntu 12.04 on top of KVM, and the VMware ones on a variety of OS's and versions. As far as I know, the list of presently supported virtualization platforms and guest OS's of this file are:
Hosts
VMware 5+ (disks offered up via scsi)
KVM 1.0+ (disks offered up via ide or scsi - virtio doesn't expose timeout at guest level)
XenServer 5+ (disks offered up via scsi)
Guests
RHEL 5+ / CentOS 5+
Ubuntu 10+
If you run into any problems with this file, please let me know.
FreeBSD 9
There are two variables that appear to be of note - and common wisdom seems to jump between which one to tweak. I'll err on the side of timeout over retry here, but that may not be the best option in all situations. The setting to change, which is a global variable as far as I can tell, is 'kern.cam.da.default_timeout'; raise it from its default of 60 to 300. To modify it permanently, edit your /etc/sysctl.conf and add a line like this:
kern.cam.da.default_timeout = 300
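For scripted installs, the sysctl.conf edit can be made repeatable so that running it twice does not leave duplicate lines. The helper below is a sketch: persist_sysctl is a hypothetical name of my own, and its optional third argument (the file to edit) exists so you can try it on a scratch file first.

```shell
#!/bin/sh
# Sketch: set key=value in a sysctl.conf-style file, replacing any earlier line
# for the same key. persist_sysctl is an illustrative name, not a system tool.
persist_sysctl() {
    local key="$1" value="$2" conf="${3:-/etc/sysctl.conf}"
    local tmp
    tmp="$(mktemp)" || return 1
    # keep every line except existing assignments of this key
    # (dots in the key are treated as regex wildcards, which is harmless here)
    if [ -f "$conf" ]; then
        grep -v "^[[:space:]]*${key}[[:space:]]*=" "$conf" > "$tmp" || true
    fi
    echo "${key}=${value}" >> "$tmp"
    # write back in place so the file keeps its owner and permissions
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

Running persist_sysctl kern.cam.da.default_timeout 300 as root updates /etc/sysctl.conf for the next boot; I would expect sysctl kern.cam.da.default_timeout=300 to apply it to the running system as well, but treat that as an assumption and check it on your release.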
If you're curious, the other variable mentioned online is 'kern.cam.da.retry_count', but I am less sure if the advice about it is fair or true.
NexentaStor (and other OpenSolaris-based derivatives)
So the easy way is to modify the sd timeout value. Unfortunately in OpenSolaris today, this value can only be set in /etc/system for all drives, with no config file method of setting it on a per-disk basis that I am aware of. To modify it globally, add this line to your /etc/system file and reboot:
set sd:sd_io_time=300
This is dangerous if there are any disks exposed to your VM that are not coming from a SAN and such, since this is a global value (much like the Windows one). There does exist a method of modifying the live value used by the kernel on a per-disk basis using mdb, but building this into a script to run on boot and when disks change I've decided not to try to tackle at this time. If you want more info, check out Alisdair's post on the issue, found here.
In your "Linux (2.6+ with udev)" section, there's a small change I would make to the "find" pipeline that looks for devices without correct timeouts. First it's useful to show both the full filename and the timeout there, which takes two small changes:
$ for file in `find /sys -iname timeout`; do (echo $file && cat $file); done
/sys/devices/pci0000:00/0000:00:1f.1/host1/target1:0:0/1:0:0:0/timeout
30
/sys/devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/timeout
30
/sys/class/firmware/timeout
60
And if you look at the output from this system I found, it turns out there's this firmware timeout on there too. That doesn't seem as important to tune as the disk timeouts. What I settled on then to validate the disk timeouts are being set correctly was this pipeline, which only navigates /sys/devices where the disks are at. Here's sample output from a tuned VM install:
$ for file in `find /sys/devices -iname timeout`; do (echo $file && cat $file); done
/sys/devices/pci0000:00/0000:00:1f.1/host1/target1:0:0/1:0:0:0/timeout
180
/sys/devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/timeout
180
Good catch. Suggestion incorporated.
should be:
wget http://www.nex7.com/files/99-virt-scsi-udev.rules
mv 99-virt-scsi-udev.rules /etc/udev/rules.d/
chmod 644 /etc/udev/rules.d/99-virt-scsi-udev.rules
Awesome article. Thanks a lot for sharing...
One uncommon question:
Do you maybe know, how to configure this disk timeout parameter for an OS X Guest VM? I've tried it already with the one from FreeBSD, but unfortunately OS X doesn't recognize it.
Any feedback appreciated!
Thanks - Bojan
This has been the most helpful article about this problem.
I found this while Googling about the problem I was having on Linux VMs.
I find interesting that you have a suggested fix for Windows VMs. I've never seen this problem on my Windows VMs (2008 R2 and 7). In fact I've had my datastore offline for nearly an hour and all my Windows VMs recovered gracefully.
I personally set this to 3600 seconds, because if there is a datastore issue, fixing it in 3 minutes is unlikely. Under an hour is to be expected.
And what about XenServer? XenServer block devices (xvd*) don't have any timeout parameters. We had a very ugly crash with XenServer due to NFS storage timeouts (not enough free space on the ZFS storage). We subsequently tested all versions from XenServer 6.2 to 6.5 SP1, NFS mount parameters (timeo, hard/soft), and different guest OSs and kernels (Ubuntu, CentOS), but without any positive results. All Linux guests in XenServer crash immediately (<1s) when the NFS server generates a long I/O response.
Thanks for the post and scripts Andrew!
I am running KVM hypervisors with RHEL/Oracle Linux and Windows guests, and all guests are utilizing virtIO drivers/disks. So, since KVM does not expose timeout values to guests, what would my solution be if Nexenta is taking more than 60 seconds to failover?
Do I only need to adjust timeout values for all the block devices on the hypervisors? If I adjust /sys/block/sda/device/timeout to 600 on the hypervisor, does this mean my virtIO VM will effectively have a timeout setting of 600 seconds?
I see that my RHEL VM's don't have a timeout file under /sys/block/, but my Windows VM's still have the registry key. Is this registry key ignored when Windows uses a "Red Hat VirtIO SCSI Disk Device" (that is the description under Device Manager)?
Thanks!
These are very good questions. Too bad there were no responses. Let me know if you were able to get these answers. thanks.
wget http://www.nex7.com/files/99-virt-scsi-udev.rules
link is not working