<h1>Nex7's Blog</h1>
Random storage industry dude's blog, mostly about ZFS and related technology.<br />
<br />
<h2>Need a job? (2014-02-28)</h2>
So we're seriously in need of some good people (well, we're always interested in good people). If you or someone you know is a Linux or FreeBSD or Solaris or illumos person and would be interested in a support role, shoot me a resume at <a href="mailto:andrew.galloway@nexenta.com">andrew.galloway@nexenta.com</a>.<br />
<br />
Prior ZFS and/or illumos/Solaris experience is a plus, but it is not required. Some of our best people came over with a purely Linux background, others had Solaris experience but had never touched ZFS, etc. What is necessary is:<br />
<br />
<ul>
<li>an excellent work ethic (the job is likely telecommute, depending on your location)</li>
<li>prior experience and no problem crunching cases in a support queue & handling customer-facing phone calls/GTM's</li>
<li>grace under pressure</li>
<li>experience dealing directly with customers, including during stressful periods for them</li>
<li>a genuine desire to help people with their issues</li>
<li>excellent communication skills & ability to work well with a global team</li>
<li>quick learner & self learner (training and skill transfer from teammates is of course available/provided, but sometimes you need to find an answer now and you need to be comfortable doing that)</li>
<li>ability to deal with an occasionally flexible schedule</li>
<li>ok with very occasional travel requirements (generally to Nexenta offices/events)</li>
<li>console (bash) skills (comfortable with shells and common GNU tools [grep/awk/sed/etc]) needed - sysadmin and/or devops experience desired</li>
<li>skill at one or more common scripting languages (Python, Perl, Ruby, bash, etc) strongly desired</li>
<li>ability to at least partially read & understand C code desirable</li>
</ul>
<div>
Looking for L2 and L3 (we have plenty of open positions).</div>
<h2>Lots of L2ARC not a good idea, for now (2013-10-09)</h2>
Before I get into it, let me say that the majority of the information below, and the analysis of it, comes from one of my partners in crime (coworker at Nexenta), Kirill (<a href="https://twitter.com/kdavyd" target="_blank">@kdavyd</a>). All credit for figuring out what was causing the long export times, for reproducing the issue internally (it was initially observed at a customer) once we had cobbled together enough equipment, and for the analysis to date goes to him. I'm just the messenger here.<br />
<br />
Here's the short version, in the form of the output from a test machine (emphasis is mine):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">root@gold:/volumes# time zpool export l2test</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><b>real 38m57.023s</b></span><br />
<span style="font-family: Courier New, Courier, monospace;">user 0m0.003s</span><br />
<span style="font-family: Courier New, Courier, monospace;">sys 15m16.519s</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
That was a test pool with a couple of TB of L2ARC, filled up, with an average block size on pool of 4K (a worst case scenario) -- and then simply exported, as shown. If you are presently using a pool with a significant amount of L2ARC (hundreds of GBs, or TBs), <b>(update:) and you have multiple (3+) cache vdevs</b>, and you ever need to export that pool within a time limit before importing it again to maintain service availability, you should read on. Otherwise this doesn't really affect you. In either case, it should be fixable long-term, so it's probably not a panic-worthy piece of information for most people. :)<br />
<br />
If you want the workaround, by the way, it is to shoot the box in the head -- panic it, power it off hard, etc, to export the pool. This is obviously very dangerous if you do not have a proper setup with an actually utilized ZIL (preferably through nice log devices) and all your important data arriving synchronously, or if you have other pools on the system you don't want to lose access to as well. Unfortunately if either of those is the case, the only other workaround would be to remove the cache vdevs from the pool, wait out the eviction stuff, and only then export the pool. Better than just exporting it and waiting, since you'd be online while you removed the cache vdevs, but fairly time consuming still.<br />
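<br />
If you do go the cache-vdev-removal route, a minimal sketch looks like this (the pool and device names are hypothetical - substitute your own, and expect the removals to take a while as the L2ARC headers are evicted):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">zpool status l2test                       # note the devices listed under 'cache'</span><br />
<span style="font-family: Courier New, Courier, monospace;">zpool remove l2test c2t0d0 c2t1d0 c2t2d0  # remove each cache device</span><br />
<span style="font-family: Courier New, Courier, monospace;">zpool export l2test                       # export once the removals are done</span><br />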
<br />
Here's your culprit, where we sit for potentially minutes, or even hours if you had a sufficiently 'bad' environment for this issue:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">PC: _resume_from_idle+0xf1 CMD: zpool export l2test</span><br />
<span style="font-family: Courier New, Courier, monospace;"> stack pointer for thread ffffff4b94b44180: ffffff022cdd9820</span><br />
<span style="font-family: Courier New, Courier, monospace;"> [ ffffff022cdd9820 _resume_from_idle+0xf1() ]</span><br />
<span style="font-family: Courier New, Courier, monospace;"> swtch+0x145()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> turnstile_block+0x760()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> mutex_vector_enter+0x261()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> _l2arc_evict+0x8d()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> l2arc_evict+0xb0()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> l2arc_remove_vdev+0x9b()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> spa_l2cache_drop+0x65()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> spa_unload+0xa6()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> spa_export_common+0x1d9()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> spa_export+0x2f()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> zfs_ioc_pool_export+0x41()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> zfsdev_ioctl+0x15e()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> cdev_ioctl+0x45()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> spec_ioctl+0x5a()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> fop_ioctl+0x7b()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ioctl+0x18e()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> dtrace_systrace_syscall32+0x11a()</span><br />
<span style="font-family: Courier New, Courier, monospace;"> _sys_sysenter_post_swapgs+0x149()</span><br />
<br />
Analysis is ongoing, but the problem appears to be made exponentially worse by how much L2ARC you have, how big the individual L2ARC devices are, the clock speed of your CPU (as this is a single-CPU-bound task), and your average block size (as that determines how many l2hdr entries you have).<br />
<br />
This should be relatively easy to fix, in theory -- ARC evictions were already made asynchronous, but it looks like this wasn't done for L2ARC as well. If it is, this should be less of an issue.<br />
<br />
<b>Update 10/10/13:</b> <br />
Well, it appears there is code for an async l2arc eviction, and it's in play:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">root@gold:/volumes# echo zfs_l2arc_async_evict::print | mdb -k</span><br />
<span style="font-family: Courier New, Courier, monospace;">0x1 (B_TRUE)</span><br />
<br />
So the investigation continues as to why this is happening anyway.<br />
<br />
<b>Update 10/10/13:</b><br />
Kirill continues to investigate, and believes he's pinned the problem down to the async L2ARC eviction not always actually being asynchronous. In his testing, much of it is done asynchronously, but the minute more than X evictions are already in flight, the next one gets done synchronously - and that limit is tied to the number of cache vdevs you have. You cannot export until all the data they referenced is cleared, so the export won't finish until the only remaining tasks are ones already in process (and asynchronous - otherwise it has to wait them out). This explains why the export did complete before the eviction was entirely through -- the remaining tasks were asynchronous and there were no more queued behind them. Seemingly there is a 'maxalloc' being set on taskq_create by this line:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"> arc_flush_taskq = taskq_create("arc_flush_tq",</span><br />
<span style="font-family: Courier New, Courier, monospace;"> max_ncpus, minclsyspri, 1, 4, TASKQ_DYNAMIC);</span><br />
<br />
The '4' is the culprit here. In Kirill's words: "Probably at the beginning of export, we are flushing ARC as well, so 2 threads are already busy, leaving us with only 2 for L2ARC. It is evicting asynchronously, just not all drives, because taskq isn’t big enough, since someone probably didn’t consider that there may be more than 2 L2ARC’s on a system :)", then adds, "Also explains why I couldn’t reproduce it with a single 1TB L2ARC - looks like you need at least three."<br />
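<br />
For reference, the illumos prototype is taskq_create(name, nthreads, pri, minalloc, maxalloc, flags), so the '4' above is maxalloc. A purely hypothetical sketch of a fix - not a statement of what will actually ship - would simply be to raise that value, for example:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">    /* hypothetical sketch only: allow more than two concurrent L2ARC evictions */</span><br />
<span style="font-family: Courier New, Courier, monospace;">    arc_flush_taskq = taskq_create("arc_flush_tq",</span><br />
<span style="font-family: Courier New, Courier, monospace;">        max_ncpus, minclsyspri, 1, 2 * max_ncpus, TASKQ_DYNAMIC);</span><br />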
<br />
So the major takeaway here is that the originally observed correlation with the total size of the L2ARC and the size of the individual L2ARC vdevs only matters if you hit this 'bug' in the first place. To hit it, the size of your L2ARC and its vdevs isn't what's important - it is basically the number of them; seemingly, in most situations you'll need at least 3 cache vdevs in the pool to ever run into this.<br />
<br />
<b>Update 10/14/13:</b><br />
So apparently the arc_flush_taskq and the asynchronous ARC & L2ARC eviction code are all only in NexentaStor at the moment, having not (yet? I cannot publicly state our intent with this code; I truly have not been informed internally of what it is) been pushed back to illumos. Fortunately this means that on Nexenta 3.x machines, you'll generally only hit this long export time problem if you have more than 2 L2ARC devices in the pool you're trying to export, and the amount of ARC related to the pool shouldn't affect export times, either. Long-term we'll of course move to fix this issue, likely by increasing the limit above 4, but only after more thorough analysis internally.<br />
<br />
Unfortunately it seems to mean that if you're not running Nexenta 3.x, and are on an older version or an alternative illumos derivative, you may very well be susceptible to a long export time even with just a lot of small-block data in ARC, and with any number of L2ARC devices, including just one. I do not have a box of sufficient size with a non-Nexenta illumos distro on it at the moment to do any testing, I'm afraid.<br />
<br />
<h2>Cables Matter (2013-04-04)</h2>
No, I don't mean brand, or color, or even connector-type zealotry... but when sizing solutions and working with customers and partners on new builds, I often find that either no thought goes into the SAS cabling, or only a little bit does. I find this distressing. Let me tell you why.<br />
Often, either no thought is put into it at all (a common mistake), or the only thought put into it is cable redundancy, sometimes paired with thoughts towards JBOD and/or HBA redundancy (an even more common, if more forgivable, mistake). All of these are, of course, important (how important depends on the use case). I'm debating a blog post on sizing of solutions, and on redundancy, so I'll save talk of that for later. This is just a short (by my standards) post to explain something I often see completely overlooked: throughput - and just how much of it you <b>actually </b>have compared to what you <b>think</b> you have.<br />
<br />
See, anyone sizing a solution involving JBOD's often does give some thought to throughput. Most people understand that each 'mini-SAS' cable (SFF-8086/8087/8088 connector styles) carries 4 separate SAS paths, and most understand that if your entire solution is SAS-1, you'll get 3 Gbit/s out of each path, and if it is SAS-2, you'll get 6 Gbit/s out of each path. My first word of advice is to treat this much like many network administrators treat network connections - pretend you only get 80%. For ease of remembrance, I just generally pretend that, at best, a mini-SAS cable can do 2 GByte/s, and I'm rarely disappointed.<br />
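<br />
To show where that 2 GByte/s rule of thumb comes from (assuming SAS-2 lanes and their 8b/10b encoding):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># 4 lanes per mini-SAS cable * 6 Gbit/s (SAS-2) = 24 Gbit/s raw line rate</span><br />
<span style="font-family: Courier New, Courier, monospace;"># 8b/10b encoding: 6 Gbit/s line rate ~= 600 MByte/s usable per lane</span><br />
<span style="font-family: Courier New, Courier, monospace;"># 4 * 600 MByte/s = 2400 MByte/s, * ~80% real-world ~= 2 GByte/s per cable</span><br />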
<br />
The thing I often see people forget, however, is that what's coming into your SAN/NAS is <b>not</b> what's going down to the drives. Let's take the easiest use-case to understand (and the generally worst one to deal with) - mirrors (RAID1/10). If 200 MB/s of data is coming into the SAN, all of it unique, how much is going to the drives if they're in a 2-disk mirror vdev pool? Answer: <b>more</b> than 400 MB/s (why more than double? Easy. ZFS maintains metadata about each block, and that also has to go down on the disks). Suddenly that 2 GB/s SAS cable is only actually capable of sending <i>less </i>than 1 GB/s of unique data downstream.<br />
<br />
Ironically while ZFS far (far, far, far) prefers mirror pools for IOPS-heavy use cases, it has a significant downstream impact on throughput potential, especially if your build isn't taking this doubling into account in the design. Conversely, raidz1|2|3 vdevs lose much less - the only additional data that has to go down is the parity, which even in a raidz3 vdev is still less ballooning of data going down than mirrors, by quite a bit. So for raw throughput where the SAS cabling could become the bottleneck, raidz is a clear winner in terms of efficiency.<br />
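<br />
A quick worked comparison of what the same incoming stream turns into on the back end (ignoring metadata, which adds a bit more in both cases; the raidz width here is just an illustrative assumption):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># 200 MB/s of unique writes coming in:</span><br />
<span style="font-family: Courier New, Courier, monospace;">#   2-way mirror vdevs:           200 * 2        = 400+ MB/s down the SAS cable</span><br />
<span style="font-family: Courier New, Courier, monospace;">#   raidz2, 8 data + 2 parity:    200 * (10 / 8) = 250  MB/s down the SAS cable</span><br />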
<br />
It isn't all bad news, though - once you understand this bottleneck, you'll appreciate ZFS' built-in compression even more than you probably already did, because that compression happens <i>before </i>the data goes down to the disks, potentially having quite an impact on how much usable data can get down the paths per second. And while I almost always steer people away from it, if your use-case does benefit strongly from deduplication, that also takes effect beforehand, massively reducing writes to disk if dedupe ratio is high.<br />
<br />
So in the end, my advice when building solutions utilizing any sort of SAS expansion is to bear in mind not just how much performance you want to get out of the pool (a number you often know), but how much that <b>actually means </b>in terms of data going to the drives, and whether your cabling can even carry it all. I am seeing more and more boxes with multiple 10 Gbit NIC's go out with single-SAS-cable bottlenecks that will very likely make it impossible to fully utilize the incoming network bandwidth in a throughput situation, because even if the back-end disks could support it, the SAS cabling in between simply can't. That is OK if you're hoping for most of that network bandwidth to be ARC-served reads -- but if you're expecting it to come to or from disks, remember this advice.<br />
<br />
<h2>ZFS Intent Log (2013-04-02)</h2>
[edited 11/22/2013 to modify formula]<br />
<br />
The ZFS Intent Log gets a lot of attention, and unfortunately the information posted about it on various forums and blogs is often misinformed, or makes assumptions about the reader's knowledge level that, if incorrect, can lead to danger. Since my ZIL page on the old site is gone now, let me try to reconstruct the knowledge a bit in this post. I'm hesitant to post this - I've written the below and... it is long. I tend to get a bit wordy, but it is also a subject with a lot of information to consider. Grab a drink and take your time here, and since this is on Blogger now, comments are open so you can ask questions.<br />
<br />
If you don't want to read through this entire post, and you are worried about losing in-flight data due to things like a power loss event on the ZFS box, follow these rules:<br />
<ol>
<li>Get a dedicated log device - it should be a very low-latency device, such as a STEC ZeusRAM or an SLC SSD, but even a <b>high quality </b>MLC SSD is better than leaving log traffic on the data vdevs (which is where it'll go without log devices in the pool). It should be at least a little larger than this formula, if you want to prevent any possible chance of overrunning the size of your slog: (maximum possible incoming write traffic in GB/s * seconds between transaction group commits * 3) - see the worked sizing example after this list. Make it much larger if it's an SSD, and much much larger if it's an MLC SSD - the extra size will help with longevity. Oh, and seconds between transaction group commits is the ZFS tunable zfs_txg_timeout. The default in older distributions is 30 seconds, newer is 5, with even newer probably going to 10. It is worth noting that if you rarely if ever have heavy write workloads, you may not have to size it as large -- it is very preferable from a performance perspective that you not be regularly filling the slog, but if you do so only rarely it's no big deal. So if your <b>average</b> writes in [txg_timeout * 3] seconds total only 1 GB, then you probably only need 1 GB of log space, and just understand that when you do rarely overfill it there will be a performance impact for a short period of time while the heavy write load continues. [edited 11/22/2013 - also, as a note, this logic only applies on ZFS versions that still use the older write code -- newer versions will have the new write mechanics and I will update this again with info on that when I have it]</li>
<li>(optional but strongly preferred) Get a second dedicated log device (of the exact same type as the first), and when creating the log vdev, specify it as a mirror of the two. This will protect you from nasty edge cases.</li>
<li>Disable 'writeback cache' on every LU you create from a zvol that holds data you don't want to lose in-flight transactions for.</li>
<li>Set sync=always on the pool itself, and do not override the setting on any dataset you care about data integrity on (but feel free TO override the setting to sync=disabled on datasets where you <b>know</b> loss of in-transit data will be unimportant, easily recoverable, and/or not worth the cost associated with making it safe; thus freeing up I/O on your log devices to handle actually important incoming data).</li>
</ol>
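As a worked example of the sizing formula in rule 1 (the numbers here are purely illustrative assumptions):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># Front end: a single 10 GbE link ~= 1.25 GB/s maximum incoming writes</span><br />
<span style="font-family: Courier New, Courier, monospace;"># zfs_txg_timeout = 5 seconds</span><br />
<span style="font-family: Courier New, Courier, monospace;">#   1.25 GB/s * 5 s * 3 = ~19 GB minimum slog size</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Round up generously for an SSD (and more again for MLC) - e.g. carve out</span><br />
<span style="font-family: Courier New, Courier, monospace;"># 50-100 GB so the drive has plenty of spare area for wear leveling.</span><br />
<br />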
<div>
Alright, on with the words.</div>
<div>
It is important to, first and foremost, clear up a common misconception I see about the ZIL. It is not a write cache. There is no caching of any kind going on in the ZIL. The ZIL's purpose is not to provide you with a write cache. The ZIL's purpose is to protect you from data loss. It is necessary because the actual ZFS write cache, which is not the ZIL, is handled by system RAM, and RAM is volatile.</div>
<br />
ZFS absolutely caches writes (usually) - incoming writes are held in RAM and, with a few notable exceptions, only written to disk during transaction group commits, which happen every N seconds. However, that isn't the ZIL. The ZIL is invoked when the incoming write meets certain requirements (most notably, something has tagged it as being a synchronous request), and overrides the 'put in RAM and respond to client that data is written' normal flow of asynchronous data in ZFS to be 'put in RAM, then put on stable media, and only once it is on stable media respond to client that data is written'.<br />
<br />
One of the most common performance problems people run into with ZFS comes from not understanding ZIL mechanics. On every distribution I'm aware of, the default ZFS setup is that the ZIL is enabled -- and if there are no dedicated log devices configured on a pool, the ZIL will use a small portion of the data drives themselves to handle the log traffic. That workload - single-queue-depth, random, synchronous writes with cache flushes - is something spinning disks are terrible at. This leads not only to a noticeable performance problem for clients on writes, it also has a very disruptive effect on the spinning media's ability to handle normal read requests and normal transaction group commits.<br />
<br />
It is just all around a less than stellar situation to be in, and one that any ZFS appliance doing any significant traffic load is going to end up getting bit by (home users often do not - I run a number of boxes at home off a ZFS device with no dedicated log, and it is fine - I simply do not usually do enough I/O for it to be an issue).<br />
<br />
So, enter the 'log' vdev type in ZFS. You can specify multiple 'log' virtual devices on a ZFS pool, containing one or more physical devices, just like a data vdev - you can even mirror them (and that's often a good idea). When ZFS sees that an incoming write to a pool is going to a pool with a log device, and that the rules surrounding usage of the ZIL are triggered and the write needs to go into the ZIL, ZFS will use these log virtual devices in a round-robin fashion to handle that write, as opposed to the normal data vdevs.<br />
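<br />
For example, adding a mirrored pair of log devices to an existing pool looks like this (the pool and device names are hypothetical):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">zpool add tank log mirror c4t0d0 c4t1d0</span><br />
<span style="font-family: Courier New, Courier, monospace;">zpool status tank    # the new devices show up under a separate 'logs' section</span><br />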
<br />
This has a double win for performance. First, you've just offloaded the ZIL traffic from the data vdevs. Second, your write response times (write latency) to clients will drop considerably not only because you're no longer using media that is being contended by multiple workflows, but because any sane person uses an SSD or non-volatile RAM-based device for log vdevs.<br />
<br />
As a minor third benefit, by the way, you might see an additional overall improvement because the lower latency allows for more incoming writes, which itself has two potential performance benefits: first, if the data being written happens to hit the same block multiple times within a single transaction group commit (txg), the txg only has to write the final state of the block to spinning media instead of all the intermediary updates; and second, the increased ability to send sync writes at the pool may mean better utilization of existing pool assets than was possible before (the pool might have been starved for writes, even though the client could send more, because the client was waiting on a response before sending more, and the pool was slow to send that response because write latency was too high). However, these two benefits are very reliant on the specific environment and workload.<br />
<br />
So, to summarize so far, you've got a ZFS Intent Log that is very similar to the log a SQL database uses, write once and forget (unless something bad happens), and you've got an in-RAM write cache and transaction group commits that handle actually writing to the data vdevs (and by the by, the txg commits are sequential, so all your random write traffic that came in between commits is sequential when it hits disk). The write cache is volatile as its in RAM, so the ZIL is in place to store the synchronous writes on stable media to restore from if things go south.<br />
<br />
If you've gotten this far, you may have noticed I've kept hedging and saying 'synchronous' and such. This is important. ZFS makes some assumptions about data retention based on how the incoming writes look - assumptions many typical users just don't realize it is making, and are often bitten quite hard because of it. <b>I have seen thousands of ZFS boxes that are in danger of data loss.</b><br />
<br />
The reason is that they are unaware that their clients are not sending data in a manner that triggers the ZIL, and as such, the incoming writes are only going into RAM, where they sit until the next txg commit - some day, when the box inevitably has an issue resulting in power loss, they're going to lose data. The severity of this data loss is directly tied to the workload they're putting on the server. The at-risk environments I see are very commonly using things like iSCSI to provide virtual hard disks to VM's, and that is one of the worst environments in which to lose a couple of seconds of write data, as that write data is potentially critically important metadata for a filesystem sitting on top of a zvol - metadata that, when lost, corrupts the whole thing.<br />
<br />
So first, let's talk about what gets you into the ZIL today. This is pretty complicated, because there are essentially a number of ways ZFS can handle an incoming write. Note first of all that, as far as I'm aware, all incoming writes will be stored in RAM while the transaction group is open or committing to disk (I haven't been able to fully verify this yet), even when they're immediately put on the final data vdevs (thus, a read of this data should come from RAM). Aside from that, however, any of the following could happen:<br />
<ol>
<li>Write the data immediately to the log (ZIL) and respond to client OK. Data will be written from RAM to disk during next txg commit, normally. Data in log is only for restoration if power is lost.</li>
<li>Write the data immediately to the data vdevs and store a pointer to the new block in the log (ZIL) then respond to client OK. Pointer to data block in log is used only on recovery if power is lost. On txg commit, just update metadata to point at the already-written new block (the data block itself won't be rewritten on txg commit, merely actually made part of the pool; prior to that, it's not actually referenced by the ZFS pool aside from the pointer in the ZIL).</li>
<li>Write the data immediately to the data vdevs - nothing is written to the log device as this is a full write complete with metadata update, etc - then respond to client OK.</li>
</ol>
<div>
What can lead to these 3 types of workflow is a combination of a number of variables and the characteristics of both the incoming write and the total open transaction group. Suffice it to say, these variables are important:</div>
<div>
<ul>
<li>logbias setting on the dataset being written to</li>
<li>zfs_immediate_write_sz</li>
<li>zil_slog_limit</li>
<li>Existence of a log device (the method ZFS uses to handle writes takes into account whether the ZIL is on a log device or not - it has major effects on the choice of mode used to deal with incoming data)</li>
<li><b>The incoming data has been, in one way or another, tagged as synchronous.</b></li>
</ul>
<div>
That last bold bullet point is key. If the incoming data is considered asynchronous, none of the above matters: the data will be stored solely in RAM, with no ZIL mechanics in play, and written to disk only as part of the upcoming transaction group commit. So. What can make you synchronous? Any of the following.</div>
</div>
<div>
<ol>
<li>The dataset being written into is sync=always. The incoming block could even be specifically called ASYNC in some way, and it won't matter, ZFS is going to treat it as a synchronous write.</li>
<li>The dataset being written into is sync=standard and the block was written with a form of SYNC flag set on it.</li>
</ol>
<div>
The sync=standard setting is the default, and important data should be sent SYNC, right? So, surely all your important data is already being set with one of the above, right? <b>Wrong.</b> Different protocols specify sync or honor (or don't) client sync requests in different ways. Different layers in the stack between the client and the ZFS pool may alter a request to be sync or to disregard a sync request. And of course, ZFS itself may choose to interpret the incoming write as sync or async disregarding client semantics.</div>
</div>
<div>
<ul>
<li>NFS - out of the box, most NFS traffic should be properly set as sync; specifying 'sync' (or your OS's equivalent) on the mount command will guarantee this, while specifying 'async' will likely ruin it and lead to most or all of the traffic from that mount not utilizing the ZIL</li>
<li>CIFS/SMB - somewhat dependent on client - check with it to see what its defaults are</li>
<li>iSCSI - the default is async, and very dangerously, some intermediary layers commonly found in an iSCSI setup will disregard sync requests from clients - notably some hypervisor layers: where the hypervisor handles the iSCSI connection and the VM only sees the disk as presented by the hypervisor, the VM may be requesting O_SYNC internally, but the hypervisor disregards that based on its settings, and the request is sent to ZFS without sync set</li>
<li>Local box - this is to say, you're doing tasks directly on the box running the zpool - usually this is going to be asynchronous unless the application has intentionally requested sync writes (some things will, depending on settings, like *SQL databases for example). Generally speaking, however, it will be asynchronous from a client perspective.</li>
</ul>
<div>
If you've got data you want to be sure is being treated as sync, how you guarantee this depends on whether you care about granularity. If you want every last bit of data being written to be sync (as you very often do when you have a dedicated log device, and even more so when the clients are, say, virtual machines using the storage as their primary disks), make sure all your datasets have sync unset (eg: inherited from the parent) and set sync=always on the pool itself. This is a quick and easy way that should guarantee data integrity.</div>
</div>
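<div>
In practice that looks something like the following (the pool and dataset names are hypothetical):</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">zfs set sync=always tank             # children inherit unless overridden</span><br />
<span style="font-family: Courier New, Courier, monospace;">zfs set sync=disabled tank/scratch   # opt out only where the data truly doesn't matter</span><br />
<span style="font-family: Courier New, Courier, monospace;">zfs get -r sync tank                 # verify what each dataset will actually do</span><br /></div>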
<div>
<br /></div>
<div>
It may seem counter-intuitive, but sometimes, data integrity is trumped by the cost of delivering it. Labs, non-production use-cases, and so on are obvious, but even other times, it is perhaps not important enough to warrant the ongoing performance cost, not to mention the up-front cost of hardware to support it. </div>
<div>
<br /></div>
<div>
Take, for instance, the aforementioned virtual machine host use-case. The VM's in question may be important, but a good backup system may be in place, the services they offer may be unlikely to be severely impacted by the loss of a few minutes of data, or they may be services that essentially do not change, meaning a restore of the prior day's backup would work just as well. </div>
<div>
<br /></div>
<div>
If the restoration process only takes an hour, the VM can afford to be offline for longer than that, and the restored VM would provide sufficient service (even having lost some amount of recent data), then the costs involved in delivering fully ZIL-backed ZFS storage underneath the VM may be higher than they are worth. </div>
<div>
<br /></div>
<div>
The only time having a ZIL matters is if the ZFS server itself loses power, and once it has been restored, the only data lost would be in-flight data (so, at most, a couple of transaction group commits' worth of seconds of data). In most file server situations, the only impact will be that recently updated or in-the-process-of-being-updated files may be affected. In situations where the storage is hosting things like virtual hard disks for VM's, the filesystems on top of those virtual disks (be they zvols or files within an NFS share) may experience some level of loss. </div>
<div>
<br /></div>
<div>
Depending on the filesystem sitting on top of those zvols or vhd files, and what was in transit at the time, the damage may be negligible. I've seen VM's come back up without a single warning; when they do complain, the very common scenario is that the filesystem merely needs to be fsck'd or chkdsk'd, and the data lost is zero or not noticeably important (the last few seconds of a log file, for instance).</div>
<div>
<br /></div>
<div>
I'm not suggesting that data integrity is unimportant - but it is worth looking at the overall environment before deciding that the storage in question truly requires ZIL mechanics to keep from losing a few seconds of data. In many environments, it doesn't. Also remember that in such environments, you don't have to go all or none - if you set sync=always on the datasets that matter, and intentionally set sync=disabled on the datasets where it does not, a single pool can fulfill both sorts of situations. ZFS itself should (barring serious hardware problems) never have a problem; whether the data in a dataset was ZIL-backed or not, ZFS itself is, due to its atomic nature, always fine after the power is restored - by design, it cannot require a 'fsck'.<br />
<br />
In closing, I'd also like to make another point - if you use a log device, and properly configure the pool or your clients to send all important data in such a manner that it makes use of the ZIL, and ZFS' own built-in integrity saves you from almost any disaster.. why would you need backups? Answer: <b>because you need backups!</b> Pools CAN be corrupted beyond reasonable recovery (there are a few very gifted ZFS experts out there willing to help you, but their time is precious, and your data may not be worth enough to afford their rates), and perhaps more importantly, the data on the pool can be destroyed in oh so many ways, some of which are flat out unrecoverable.<br />
<br />
Accidental rm -rf or intentionally even? Hacker? Exploit? Client goes nuts and spews bad data and you didn't notice and didn't have a snapshot pre-crazy (or, even if you did, no easy way to recover from it due to environment)? SAN itself explodes? Is melted? Is shot by Gatling gun? Controller goes nuts and spews bad data at disks for hours while you're on vacation?<br />
<br />
It is a simple fact of IT sanity that a comprehensive backup strategy (one that handles not only backing up the data, but making it quick and easy to restore as well) is a necessity for any production infrastructure you put up. Since this is a fact, and you are going to do it or rue the day you chose otherwise, you should probably remember that because you have it, you might not actually need a log device nor even ZIL mechanics, at least on some of your datasets (and every dataset you set sync=disabled on frees up a bit more ZIL IOPS for the datasets that do need it). Carefully weigh the risk and potential damage caused by loss of in-flight data, as well as the time to restore and how critical the service is, before determining if ZIL mechanics are necessary.</div>
<h2>(IPMP vs LACP) vs MPIO (2013-03-30)</h2>
If you're running an illumos or Solaris-based distribution for your ZFS needs, especially in a production environment, you may find yourself wanting to aggregate multiple network interfaces either for performance, redundancy, or both. With Solaris, your choices are not limited to standard LACP.<br />
<br />
So first, in case you're not aware, LACP is a link aggregation technology well supported by most operating systems and switches. It is sometimes called bonding, NIC teaming, and so on. You can get a pretty thorough write-up on it from <a href="http://en.wikipedia.org/wiki/Link_aggregation" target="_blank">Wikipedia</a>.<br />
<br />
IPMP is a Solaris technology that is similar to LACP but superior in a number of ways, yet most non-Solaris admins are generally unaware of its existence. Due to the rise of ZFS even within otherwise Linux-only environments, I often see administrators setting up and running with LACP when IPMP would have been a better fit, but they were simply unaware they had an option. I'm not going to wax on about IPMP or its virtues - a quick Google search will find you plenty of information. In a nutshell, it is different from LACP (running at the IP layer instead of the MAC layer), can actually be run in conjunction with LACP, and has some benefits and some drawbacks compared to an LACP aggregate.<br />
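<br />
For the curious, here is roughly what a basic IPMP group looks like on a newer illumos release using ipadm (the interface names and address are hypothetical; older releases configure this through /etc/hostname.* files and ifconfig instead):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">ipadm create-ip net0</span><br />
<span style="font-family: Courier New, Courier, monospace;">ipadm create-ip net1</span><br />
<span style="font-family: Courier New, Courier, monospace;">ipadm create-ipmp ipmp0</span><br />
<span style="font-family: Courier New, Courier, monospace;">ipadm add-ipmp -i net0 -i net1 ipmp0</span><br />
<span style="font-family: Courier New, Courier, monospace;">ipadm create-addr -T static -a 192.168.10.50/24 ipmp0/v4</span><br />
<span style="font-family: Courier New, Courier, monospace;">ipmpstat -g    # check the group's state</span><br />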
<br />
No, what I want to take a moment to do is add this blog to the long list of sources that will explain something - LACP and IPMP and similar technologies increase the number of lanes on your highway - but each individual client generally only still has the 1 car, and the speed limit remains the same. Using LACP to aggregate 2, 4, or more NIC's together will improve your <b>aggregate</b> throughput speeds, but will not increase the speed of any individual stream of data past that of a single NIC within the aggregate.<br />
<br />
Neither technology should be looked at as a means of improving the speed at which one client can hit one server - for instance, if your client has a 10Gbit NIC and your server has 4x1Gbit NIC's, bonding all four 1 Gbit NIC's together with either technology will not then allow the client to send data at 4 Gbit/s - it will still only go at 1 Gbit/s. Often, even if you enable bonding on both client and server, single-transfer throughput will remain one NIC's worth (but multiple transfers <b>may</b>, depending on settings, be capable of going down other links in the aggregate, thus allowing multiple link-speed transfers at once).<br />
<br />
With that out of the way, what I often run into as well is client sites where they've set up ZFS appliances for use mostly or even entirely with iSCSI clients, and then used LACP or IPMP (or both). This is a mistake. The default iSCSI initiators for both Linux and Windows clients support iSCSI MPIO, a technology that will provide you with most of the benefits of LACP or IPMP (namely, failover and aggregation of multiple interfaces), and add to that the actual ability to increase the speed of single transfers beyond that of a single interface.<br />
<br />
iSCSI MPIO does require support on the server side as well, and often a specific setup to allow it. If you are using NexentaStor or a similar OS, rather than rewrite things I've already written, I'll merely <a href="http://info.nexenta.com/rs/nexenta/images/solution_guide_nexentastor_iscsi_multipath_configuration.pdf" target="_blank">link you to a Solutions guide I already wrote</a> (if you're not running Nexenta, both the client and server side advice translate, so long as it is COMSTAR you're using on the server side). If you're running Linux, I don't currently have an answer for you (I've avoided the ZFS on Linux project to date, as I'm busy and am waiting for it to reach a 1.0 state, since I tend to distrust anything whose own maintainer feels it isn't ready to carry a 1.0 moniker), but I suspect Google can assist you, it is Linux after all. If you're running FreeBSD, I believe istgt supports MPIO, and merely requires you set up the Portal Group with more than one interface to allow it. That's second-hand information, I'm afraid, as my own home setup has a mere single port on it; if/when I can acquire hardware to change that, I'll do a post with exact configuration and testing results.<br />
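<br />
As one rough sketch of the COMSTAR side on illumos (the solutions guide above covers the details - the addresses here are hypothetical, one per dedicated storage NIC):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">itadm create-tpg storage-tpg 192.168.20.10:3260 192.168.21.10:3260</span><br />
<span style="font-family: Courier New, Courier, monospace;">itadm create-target -t storage-tpg</span><br />
<span style="font-family: Courier New, Courier, monospace;">itadm list-target -v    # confirm the target is bound to the portal group</span><br />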
<br />
I highly recommend investigating MPIO for iSCSI in lieu of even turning on LACP or IPMP, if your setup is 100% iSCSI. If you've also got NFS/CIFS/etc in there, most of the file-level protocols don't currently support any form of MPIO, so network link aggregation is still a requirement - in that case I'd only caution that, if some percentage of your traffic is iSCSI, you configure the aggregation in such a way that MPIO can still work.<br />
<br />
<h2>ZVOL Used Space (2013-03-26)</h2>
One of the more common complaints I hear about with ZFS is when clients are using zvols to offer up iSCSI or FC block targets, formatting filesystems on them, and then using them. Well, I don't get complaints about that, so much... what I get are complaints about how, over time, the 'used' space as visible in ZFS is completely out of whack with the used space as visible in the filesystem on the client.<br />
The reason for this is pretty simple - ZFS has absolutely no idea about your filesystem, and most filesystems delete files by updating a File Allocation Table or some such construct, not by going out and wiping out the blocks that had made up those files. Without some sort of semantics to understand what has happened, ZFS can't know that some of the blocks in your zvol are no longer being referenced by any file on the filesystem on top.<br />
<br />
Enter SCSI Unmap. Support for this has been in COMSTAR for more than a year now, and the whole idea is that if your OS supports it, it can send down a SCSI UNMAP command that we can use to realize the blocks can be freed in the underlying disk. Except there's a problem. Very few operating systems/filesystems currently support this fully. Even when they do, they seem to only do so for normal operations, and often don't send down anything for metadata updates and such, so over time, you still get to a point where your filesystem on your client reads 39% full, and your zvol reads 100% full.<br />
<br />
So, what's an admin to do? Well, that depends on the operating system you're dealing with. Also, all of the below <b>absolutely rely on the fact that you have compress=on set </b>on the zvol. Also, all of these are manual methods of clearing up the discrepancy between the zvol and the filesystem. None are perfect (there will pretty much always be some level of difference between the 'used' space on the zvol and the filesystem on top of it), but all of these should get you much closer to parity. If you want them to be automated, I leave that as a task for the reader. Please note that all of these utilities tend to cause a ton of throughput traffic as they do what they're doing - and can take hours and hours to complete depending on available bandwidth. Running one every night is probably impossible.<br />
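<br />
The reason compression matters here: with compression enabled, the runs of zeroes written by the tools below collapse to effectively nothing on the zvol, which is what lets the space actually come back. Something along these lines (the zvol name is hypothetical):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">zfs set compress=on tank/vm-disk01</span><br />
<span style="font-family: Courier New, Courier, monospace;">zfs get compressratio,used,referenced tank/vm-disk01   # re-check after a zero-fill run</span><br />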
<br />
I also hope I don't have to mention that snapshots will sort of muck up this whole idea. You'll clear up space on the current dataset, but snapshots will just keep referring to the prior blocks until they're destroyed. If you have some sort of automatic snapshot taking/destroying mechanism (like AutoSnap in NexentaStor) then the effects on the overall zpool 'used space' will not be felt until the snapshots have rolled out, replaced by new ones. If you take permanent or semi-permanent snapshots by yourself, you'll have to remember that your zpool won't regain any of the space from this activity until you've killed all the snapshots made prior to running one of the below utilities.<br />
<br />
<h3>
Windows</h3>
<br />
The free answer is 'sdelete', provided by Microsoft. Link here: <a href="http://technet.microsoft.com/en-us/sysinternals/bb897443.aspx">http://technet.microsoft.com/en-us/sysinternals/bb897443.aspx</a><br />
<br />
Basically snag the 'sdelete' tool, and as an Administrator, run it with "sdelete -z C:" (replacing C: with whatever disk is the one with a zvol ultimately at the back) in a Command Prompt window.<br />
<br />
<h3>
Linux</h3>
<br />
The main two I see used are "secure-delete" and "zerofree". On Ubuntu, both are available. With secure-delete, you're looking for the 'sfill' utility, and you'll want to read the man page; I believe there's a way to reduce it to basically just zero-filling. However, I use 'zerofree', which is very simple to use - install it, and just run 'zerofree <filesystem>' and it'll do the rest.<br />
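<br />
A minimal zerofree run looks something like this (the device and mountpoint are hypothetical; note that zerofree expects the ext2/3/4 filesystem to be unmounted or mounted read-only while it works):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">umount /mnt/data            # or: mount -o remount,ro /mnt/data</span><br />
<span style="font-family: Courier New, Courier, monospace;">zerofree /dev/sdb1</span><br />
<span style="font-family: Courier New, Courier, monospace;">mount /dev/sdb1 /mnt/data</span><br />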
<br />
<h3>
Mac OSX</h3>
<br />
Use the built-in functionality. Check this link: <a href="http://support.apple.com/kb/ht3680">http://support.apple.com/kb/ht3680</a><br />
<br />
<h2>Nexenta - SMTP Server Settings (2013-03-25)</h2>
A common problem for users of NexentaStor, especially home users and people doing a trial evaluation of the software, is that, at least as of 3.1.3.5, it requires a real SMTP server somewhere else in order to send email. I'm often told no email was set up because no SMTP server or account was available (at home, this is often just permanently true, and in the enterprise this is often true during eval phases). This is bad - NexentaStor sends all sorts of alerts via email that it does not display anywhere else. Your appliance may be warning you of something, and you'll have no idea. There is a workaround, however, if the appliance can get to the internet and you have GMail (or can be bothered to set up an account)!<br />
Simply set the server address to <b>smtp.gmail.com</b> and the username to the full email address of the account on Gmail, and of course set the password. Choose SSL as the authentication method, and be sure the 'From E-Mail Address' is set to the email address of the Gmail user (you don't technically have to - but Gmail will just use the actual from anyway, not what you set here, and I worry if you set it otherwise Google might eventually claim you're trying to spam through their service). And voila, you're all set. Here's a screenshot of the settings in the initial NMV setup wizard, just for completeness.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-DuLyVG-IpG4/UVDCWzF458I/AAAAAAAAAC8/3tAoLv79Qmg/s1600/Capture-SMTP-Settings.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="433" src="http://4.bp.blogspot.com/-DuLyVG-IpG4/UVDCWzF458I/AAAAAAAAAC8/3tAoLv79Qmg/s640/Capture-SMTP-Settings.JPG" width="640" /></a></div>
<br />
<h2>Reservation & Ref Reservation - An Explanation (Attempt) (2013-03-23)</h2>
So in this article I'm going to try to explain and answer a lot of the questions I get, and misconceptions I see, in terms of ZFS and space utilization of a pool. Not sure how well I'm going to do here - I've never found a holy grail way of explaining these that everyone understands; it seems like one way works with one person, and a different way is necessary for the next guy, but let's give it a shot.<br />
<br />
ZFS has two provided methods of determining used and free space. The first is at a non-granular pool level, and the second is at a dataset level. There are some very big differences in how these two viewpoints behave. Pool-wide statistics are provided by the 'zpool' command, as evidenced by this example:<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">root@myhost:/# zpool list</span><br />
<span style="font-family: Courier New, Courier, monospace;">NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT</span><br />
<span style="font-family: Courier New, Courier, monospace;">example-pool 3.94G 121K 3.94G 0% 1.00x ONLINE -</span><br />
<div>
<br /></div>
<div>
Of note here are 'SIZE', 'ALLOC', and 'FREE'. You'll notice my brand new, completely empty 'example-pool' is 3.94 GB in size, and has allocated 121 KB of space (metadata and such). Now let's create a filesystem.</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">root@myhost:/# zfs create example-pool/noquota-noreserve-filesystem</span></div>
</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">root@myhost:/# zpool list</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">example-pool 3.94G 160K 3.94G 0% 1.00x ONLINE -</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">syspool 15.9G 3.15G 12.7G 19% 1.00x ONLINE -</span></div>
</div>
<div>
<br /></div>
<div>
Now we've got one empty filesystem on our pool - and as some may expect, a few extra KB of allocated space (again, metadata and such). So let's see what the second viewpoint has to say - the 'zfs' command. Let's concentrate on just the important stuff for now, with a command like this:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">root@myhost:/# zfs list -o name,used,avail,refer,usedds,usedchild</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">NAME USED AVAIL REFER USEDDS USEDCHILD</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">example-pool 160K 3.88G 32K 32K 128K</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">example-pool/noquota-noreserve-filesystem 31K 3.88G 31K 31K 0</span></div>
</div>
<div>
<br /></div>
<div>
I apologize for all the fields - but believe it or not, they're all informative (eventually). As you can see, in addition to seeing the filesystem I created a moment ago, we also see the pool itself. Or do we? Do not be fooled, gentle reader, for the 'example-pool' entry you see in 'zfs list' is not exactly equivalent to the 'example-pool' you'll see in 'zpool list'. Where 'zpool list' is showing you the pools and pool-level statistics, 'zfs list' is only showing you datasets and dataset statistics. That's right, your pool is also a filesystem dataset (indeed, all filesystems created under it are children of it).</div>
<div>
<br /></div>
<div>
This is important to note, because it means that 'example-pool' in 'zfs list' is going to be factoring in the used space of all the filesystems on the pool. You would think this would be pretty much the same as 'zpool list', then, right? And indeed, in the above example, both 'zpool list' and 'zfs list' are showing a used (alloc) space of 160 KB. Where this will bite you is simple: 'zpool list' does not take into account anything but actual, physically allocated blocks of data on the disks -- while 'zfs list' takes not only those into account, but reservations as well. So what are reservations? Well, to answer that, let's first look at what ZFS' manpage has to say about them (and there are two of them, not one):</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> reservation=size | none</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> The minimum amount of space guaranteed to a dataset and its descendents. When the amount of space used is below this value, the dataset is treated as if it were taking up the </span><span style="font-family: 'Courier New', Courier, monospace;">amount of space specified by its reservation. Reservations are accounted for in the parent datasets' space used, and count against the parent datasets' quotas and reservations.</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> This property can also be referred to by its shortened column name, reserv.</span></div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> refreservation=size | none</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> The minimum amount of space guaranteed to a dataset, not including its descendents. When the amount of space used is below this value, the dataset is treated as if it were taking </span><span style="font-family: 'Courier New', Courier, monospace;">up the amount of space specified by refreservation. The refreservation reservation is accounted for in the parent datasets' space used, and counts against the parent datasets'</span><span style="font-family: 'Courier New', Courier, monospace;"> quotas and reservations.</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> If refreservation is set, a snapshot is only allowed if there is enough free pool space outside of this reservation to accommodate the current number of "referenced" bytes in the</span><span style="font-family: 'Courier New', Courier, monospace;"> dataset.</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> This property can also be referred to by its shortened column name, refreserv.</span></div>
</div>
<div>
<br /></div>
<div>
Now, some people gloss over these definitions and don't notice that 'reservation' and 'refreservation' do not actually have identical first paragraphs. The key difference can be found in the very first sentence, where 'reservation' says "and its descendents" and 'refreservation' says "not including its descendents". This is key. A hint at how important this distinction is comes in the form of the extra paragraph you find in the refreservation description, but we'll get back to that in a minute. First, an initial, simple example of both. Remember our pool? Let's create another dataset, but this time, specify a reservation (as opposed to a refreservation).</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/# zfs create -o reserv=500M example-pool/noquota-reserv-filesystem</span></div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/# zfs list -o name,used,avail,refer,usedds,usedchild</span></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NAME USED AVAIL REFER USEDDS USEDCHILD</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool 500M 3.39G 33K 33K 500M</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-noreserve-filesystem 31K 3.39G 31K 31K 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-reserv-filesystem 31K 3.88G 31K 31K 0</span></div>
</div>
<div>
<br /></div>
<div>
Now, the keen-eyed amongst you may have noticed that while this new filesystem looks pretty much like the other one we created, suddenly the root filesystem (example-pool) is reporting USED of 500M! What? But the new filesystem isn't using 500 MB! In a panic, we check 'zpool list' to verify our assumption:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">root@myhost:/# zpool list</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">example-pool 3.94G 394K 3.94G 0% 1.00x ONLINE -</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">syspool 15.9G 3.15G 12.7G 19% 1.00x ONLINE -</span></div>
</div>
<div>
<br /></div>
<div>
Whew. We're right - we're not suddenly using 500 MB of disk space. So why is 'zfs list' claiming we are? Are we? The short answer is -- no, you're not. And yes, you are. See, a reservation is us telling ZFS that the filesystem has 500 MB reserved for the use of that filesystem (and its children, since it's a reservation), and only for that filesystem and its children. This doesn't actually write 500 MB of data to the drives... but from the perspective of any other filesystem on that pool, it may as well have. Notice how our first filesystem has an available space of 3.39 GB (as does 'example-pool')? Yet our new filesystem has 3.88 GB? That's 500 MB of difference (basically). </div>
<div>
<br /></div>
<div>
The reason for this new discrepancy is simple - by reserving 500 MB, we've told ZFS that our new filesystem may use all of the pool and has 500 MB reserved for its use, and that any other filesystem (including our first one) cannot use 500 MB of the pool. Now, this reservation is just that -- a reservation up to that amount. It isn't 'above and beyond' what we're actually using. If I put 50 MB of data into that second filesystem, 550 MB won't be unavailable to other filesystems -- it will still be only 500. Let's test that assertion:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/# cd /volumes/example-pool/noquota-reserv-filesystem/</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/volumes/example-pool/noquota-reserv-filesystem# dd if=/dev/zero of=50MB-testfile bs=1M count=50</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">50+0 records in</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">50+0 records out</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">52428800 bytes (52 MB) copied, 0.218335 seconds, 240 MB/s</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/volumes/example-pool/noquota-reserv-filesystem# zfs list -o name,used,avail,refer,usedds,usedchild</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NAME USED AVAIL REFER USEDDS USEDCHILD</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool 500M 3.39G 33K 33K 500M</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-noreserve-filesystem 31K 3.39G 31K 31K 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-reserv-filesystem 50.0M 3.83G 50.0M 50.0M 0</span></div>
</div>
<div>
<br /></div>
<div>
See? Despite adding 50 MB to the new filesystem, the old filesystem (and root filesystem) still show 500 MB used and 3.39 GB available. They won't change, in fact, until we've put 501 MB or more into that second filesystem (at which point, the reservation is almost pointless, unless we delete things and go back below 500 MB used in the filesystem).</div>
<div>
<br /></div>
<div>
So with our current layout, what if we add in a snapshot? What would that do?</div>
<div>
<br /></div>
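<div>
(For completeness, the snapshot referenced below was taken with a plain 'zfs snapshot', along these lines:)</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/volumes/example-pool/noquota-reserv-filesystem# zfs snapshot example-pool/noquota-reserv-filesystem@first-snap</span></div>
<div>
<br /></div>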
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/volumes/example-pool/noquota-reserv-filesystem# zfs list -t all -o name,used,avail,refer,usedds,usedchild</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NAME USED AVAIL REFER USEDDS USEDCHILD</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool 500M 3.39G 33K 33K 500M</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-noreserve-filesystem 31K 3.39G 31K 31K 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-reserv-filesystem 50.0M 3.83G 50.0M 50.0M 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-reserv-filesystem@first-snap 0 - 50.0M - -</span></div>
</div>
<div>
<br /></div>
<div>
As expected - nothing, really. Still 500 MB used and 3.39 GB available to the pool. I'm not trying to explain how snapshots work in this post, only how reservations can affect them - so let's move on. Time for refreservations! Let's make a new filesystem, and see what we get:</div>
<div>
<br /></div>
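<div>
(For the record, the new filesystem below was created with a 2 GB refreservation, along these lines:)</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/# zfs create -o refreserv=2G example-pool/noquota-refreserv-filesystem</span></div>
<div>
<br /></div>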
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/# zfs list -t all -o name,used,avail,refer,usedds,usedchild</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NAME USED AVAIL REFER USEDDS USEDCHILD</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool 2.49G 1.39G 35K 35K 2.49G</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-noreserve-filesystem 31K 1.39G 31K 31K 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-refreserv-filesystem 2G 3.39G 31K 31K 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-reserv-filesystem 50.0M 1.83G 50.0M 50.0M 0</span></div>
<div>
<span style="font-size: x-small;"><span style="font-family: Courier New, Courier, monospace;">example-pool/noquota-reserv-filesystem@first-snap 0 - 50.0M - </span><span style="font-family: Courier New, Courier, monospace;"> -</span></span></div>
</div>
<div>
<br /></div>
<div>
So this is interesting - when we made a filesystem with a 500 MB reservation, it ate up 500 MB from the root filesystem availability - but the 'USED' on the filesystem was nil until we added a file. Yet now we've made a new filesystem with a 2 GB refreservation, and the 'USED' on the actual filesystem is 2 GB! Once again in a panic, we look at 'zpool list':</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">root@myhost:/# zpool list</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">example-pool 3.94G 50.5M 3.89G 1% 1.00x ONLINE -</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">syspool 15.9G 3.15G 12.7G 19% 1.00x ONLINE -</span></div>
</div>
<div>
<br /></div>
<div>
Nope -- we're still only using 50 MB, from that one file we made. Whew. Ok. So does it act different? If we make a 1.5 GB file inside there, is it going to show 3.5 GB used? Let's see:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/# cd /volumes/example-pool/noquota-refreserv-filesystem/</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/volumes/example-pool/noquota-refreserv-filesystem# dd if=/dev/zero of=1.5Gtest bs=1M count=1536</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">1536+0 records in</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">1536+0 records out</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">1610612736 bytes (1.6 GB) copied, 16.6138 seconds, 96.9 MB/s</span></div>
</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/volumes/example-pool/noquota-refreserv-filesystem# zfs list -t all -o name,used,avail,refer,usedds,usedchild</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">NAME USED AVAIL REFER USEDDS USEDCHILD</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool 2.49G 1.39G 35K 35K 2.49G</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-noreserve-filesystem 31K 1.39G 31K 31K 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-refreserv-filesystem 2G 1.89G 1.50G 1.50G 0</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">example-pool/noquota-reserv-filesystem 50.0M 1.83G 50.0M 50.0M 0</span></div>
<div>
<span style="font-size: x-small;"><span style="font-family: Courier New, Courier, monospace;">example-pool/noquota-reserv-filesystem@first-snap 0 - 50.0M - </span><span style="font-family: Courier New, Courier, monospace;">-</span></span></div>
</div>
<div>
<br /></div>
<div>
Hrm, nope. Looks like just as with a reservation, the 'refreserv's 2 GB isn't 'above and beyond' actual usage. As you can see, USEDDS (used dataset) and REFER have gone up to 1.5 GB on that new filesystem, but the USED remains at 2 GB. So unlike 'reserv', 'refreserv' is visually displaying the refreservation in USED, but in terms of functionality, it still is just a reservation, right? Well, mostly - but the reason for the aesthetic difference and the gotchya to 'reserv' vs 'refreserv' now comes into play as I take a snapshot of this new filesystem. Let's see what happens:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">root@myhost:/volumes/example-pool/noquota-refreserv-filesystem# zfs snapshot example-pool/noquota-refreserv-filesystem@second-snap</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">cannot create snapshot 'example-pool/noquota-refreserv-filesystem@second-snap': out of space</span></div>
</div>
<div>
<br /></div>
<div>
What?! Out of space?? No I'm not! Quick, check 'zpool list'!</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">root@myhost:/volumes/example-pool/noquota-refreserv-filesystem# zpool list</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">example-pool 3.94G 1.55G 2.39G 39% 1.00x ONLINE -</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">syspool 15.9G 3.14G 12.7G 19% 1.00x ONLINE -</span></div>
</div>
<div>
<br /></div>
<div>
See?! I have 2.39 GB free! Heck, even looking at the 'zfs list' we did a moment ago, I have 1.39 GB available in the root 'example-pool'! Why can't I take a snapshot? If this was a reservation, instead of a refreservation, you would be able to take that snapshot. However, this is a refreservation. It is a reservation on the dataset NOT including its children. It is a reservation stating that absolutely under no circumstances will there not be 2 GB of space available on 'example-pool' for this filesystem, ever, not even because of snapshots on this filesystem.</div>
<div>
<br /></div>
<div>
See, I have 1.5 GB of referenced data (REFER) in 'example-pool/noquota-refreserv-filesystem', and 1.39 GB of available space in 'example-pool' right now, and 1.5 is greater than 1.39 -- and the second I make a snapshot of this filesystem, all the existing referred data becomes referenced from the child, so I need 1.5 GB of free space outside of my 2 GB reservation, because I must still be able to meet my 2 GB commitment. Since I don't have it, the snapshot attempt fails. </div>
<div>
<br /></div>
<div>
Now, let's be clear here - when you take this snapshot, you're not adding 1.5 GB of new data to the pool. Right away, all that happens is that the snapshot begins referencing the existing 1.5 GB file - there's no new space usage. But the whole point of a 'refreservation' is to protect the filesystem the refreservation is on from running out of space (up to the refreservation) for ANY REASON, so it has to assume that you might want to delete that 1.5 GB file and make a new one, all while that snapshot is still there retaining a link to all the blocks of that original 1.5 GB file.</div>
<div>
<br /></div>
<div>
I've seen many users hit this 'no space' message, or even worse, have just barely sufficient space in their pool to take that snapshot and then watch all their other datasets quickly start running out of space, even though their pool may have had tons of space left from a physical perspective. A proper understanding of refreserv would have saved them a lot of headache.</div>
<div>
<br /></div>
<div>
So given how potentially dangerous 'refreserv' can be from an administrative perspective, why does ZFS have it, and not just 'reserv'? There are probably a few reasons, but let me give you one (which I believe to be the main one) - zvols. </div>
<div>
<br /></div>
<div>
The most common use of a zvol is as a backend 'device' shared up over iSCSI to some client that formats it with a filesystem and treats it like any other hard drive. That client has a few expectations about that drive -- it expects that the entirety of that drive will be available to it, and that if it thinks it is 40% full, it isn't going to suddenly get an 'out of space' error when it goes to write to it. Most clients really don't behave well when their disk suddenly starts claiming it's out of space when it shouldn't. So 'refreservation' is provided by ZFS to the admin as a method of making guarantees about available space above and beyond 'reservation'.</div>
<div>
<br /></div>
<div>
See, with a simple 'reservation' (or with nothing at all), a zvol of 50 GB could eat up 50 GB and then, because of snapshots, eat up even more. If I were doing something crazy on the zvol that completely changed it every day, and was keeping 7 days of snapshots, I'd realistically end up with over 350 GB of used space on my pool between the base dataset and the snapshots. What if my pool only had 200 GB of total space? I'd be out of luck -- and I may very well end up sending 'out of space' errors to my client. But if that same zvol had a 'refreserv' of 50 GB set, then it is my snapshot creation attempts that would fail, and at no time would my zvol ever be at risk of running out of its 50 GB of needed space.</div>
<div>
<br /></div>
<div>
By default, when you create a zvol in ZFS, if you don't specify 'sparse' (with the -s command line option), ZFS will make it a 'thick provisioned' zvol. All that actually means is that ZFS sets a 'refreservation' equivalent to the 'volsize' you specified for the zvol. Create one some time, and take a look for yourself. The fact is, if you're not planning on making snapshots, refreservations are a very sane way of not only guaranteeing space to your clients, but of easily keeping yourself from overprovisioning your pool. If you want to skip them, don't set a refreservation -- but be warned; if you do so, the onus is now on you as the administrator to keep a close eye on pool utilization and take action before it gets too full. </div>
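<div>
<br /></div>
<div>
For example, a quick check along these lines (dataset name illustrative) should show the automatic refreservation on a thick-provisioned zvol - creating the same zvol with -s would instead leave 'refreservation' at none:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/# zfs create -V 1G example-pool/thick-zvol</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">root@myhost:/# zfs get volsize,refreservation example-pool/thick-zvol</span></div>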
<div>
<br /></div>
<div>
So there you have it - 1000's of words to try to explain the difference between a 'reservation' and a 'refreservation', for your reading pleasure.</div>
Anonymoushttp://www.blogger.com/profile/11742425228318624524noreply@blogger.com7tag:blogger.com,1999:blog-6025886526053221932.post-4100812422935366762013-03-23T14:57:00.000-07:002014-05-13T08:38:59.077-07:00VM Disk TimeoutsA pretty common issue to run into when using some SAN back-ends for virtual machines is that the VM's end up crashing, BSOD'ing, or (most commonly) remounting their "disks" read-only when there's a hiccup or failover in the storage system, often resulting in a need to reboot to restore functionality.<br />
<a name='more'></a><br />
<span style="font-family: Courier New, Courier, monospace;">Updated 9/16/2013 to incorporate excellent suggestions of commenter Greg Smith.<br />Updated 5/13/2014 to incorporate on-the-job learning.</span><br />
<br />
The most common fix is typically to increase the default timeout settings in the guest VM, and sometimes in the host machine as well, as the root cause is usually that the SAN took longer than the default timeout to respond. This is usually because the SAN was involved in a failover, which can take > 60, or even > 120 seconds in some cases. I generally recommend setting it to at least 300 seconds, though I'm personally also perfectly happy with 600 seconds or more. I only really have an issue with anything under 180 seconds or so.<br />
<br />
This is in keeping with industry standards, I might add - VMware sets it to 180, NetApp has long requested it be 180, and so on. I don't actually like how the timeouts and such are handled, and I especially do not like that in many scenarios the timeout is a global value applying to both the SAN-provided storage and any local spinning disk (which never needs or wants a timeout value this long), but them's the breaks, I'm afraid.<br />
<br />
Of course, the usual follow-up question from anyone told this is, "Ok, so where do I do that?" and then you're off to Google, and it can be annoying. Enough so that I decided to compile them all in one place, and add some scripts and such to simplify it (and be included in automated deployment tools, for instance). So, here you are.<br />
<br />
<h3>
Windows 2000, 2003, 2008, Vista, & Windows 7</h3>
Open the registry editor (regedit) and navigate to:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">HKEY_LOCAL_MACHINE / System / CurrentControlSet / Services / Disk</span></blockquote>
<br />
Once there, look for 'TimeOutValue'. If it exists, edit it, and if it does not exist, right-click and choose 'Edit/Add Value' and create it. The type is REG_DWORD, and the value should be set in decimal to the timeout in seconds that you desire (so, I suggest, 300).<br />
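If you'd rather script this (say, for automated deployment), something along these lines from an elevated command prompt should set the same value - I haven't validated it against every Windows version listed above, so double-check the result in regedit afterwards:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">reg add "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue /t REG_DWORD /d 300 /f</span></blockquote>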
<br />
After that, if you're using the Microsoft iSCSI Initiator in the OS instead of being passed in the disk from a hypervisor, you should also modify the timeout value in the iSCSI initiator. On 2008, Vista, and Windows 7, navigate to:<br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace;">HKEY_LOCAL_MACHINE / System / CurrentControlSet / Control / Class / {4D36E97B-E325-11CE-BFC1-08002BE10318}</span></blockquote>
Under this key you'll find a number of subkeys named 0001, 0002 and so on. Expand each subkey until you find the one subkey that has another subkey called 'Parameters'. Within that Parameters subkey is the key you want, MaxRequestHoldTime. Modify it to 300 (decimal). There is another setting in here, LinkDownTime, that you would set instead if you're planning to use iSCSI MPIO on the Windows OS, but there's also other things to set for that and beyond the scope of this post for now.<br />
<br />
These changes are permanent as far as I know, as well as global, so that's all you've got to do. I am unaware if you need to reboot for it to take effect, but you probably should, to be sure.<br />
<br />
<h3>
Linux (2.6+ non-udev)</h3>
So the 'easy' but far from elegant solution is to go in and force the timeout to be higher on every block device you need to do so on. This is done on 2.6+ kernels by echo'ing the time in seconds you want into /sys/block/<device>/device/timeout, substituting the device name for <device>. So, for example, if the main disk (sda) was being offered up from the VM host and originated on a SAN and you wanted to make it time out after 300 seconds, you'd do:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">echo 300 > /sys/block/sda/device/timeout</span></blockquote>
<br />
The problem with this is that this isn't permanent, and will only survive until the system is rebooted. The quick and dirty answer to this is to add a command to do this into something like /etc/rc.local or create a full-blown init script that does it (be sure you add the command above the 'exit 0' that often ends the default rc.local file). For completeness, here's a simple script you can call from rc.local (put the contents below into a file, chmod +x it, and then call it from rc.local), that may or may not work for you out of the box (be sure to edit DISKS to be a list of the disks you care about):<br />
<br />
<hr />
<span style="font-family: Courier New, Courier, monospace;">#!/bin/bash</span><br />
<span style="font-family: Courier New, Courier, monospace;">#</span><br />
<span style="font-family: Courier New, Courier, monospace;"># nex7.blogspot.com - VM Disk Timeouts - simple script for non-udev 2.6+ kernels</span><br />
<span style="font-family: Courier New, Courier, monospace;"># - edit DISKS to be a list of disks you want to increase the timeout on to TIMEOUT_V</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">TIMEOUT_V=300</span><br />
<span style="font-family: Courier New, Courier, monospace;">DISKS="sda sdb sdc"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">for DISK in $DISKS; do</span><br />
<span style="font-family: Courier New, Courier, monospace;"> echo $TIMEOUT_V > /sys/block/$DISK/device/timeout</span><br />
<span style="font-family: Courier New, Courier, monospace;">done</span><br />
<hr />
<br />
Or, read on for the better way to do it if you have a fairly modern and mainstream distribution.<br />
<br />
<h3>
<span style="font-family: inherit;">Linux (2.6+ with udev)</span></h3>
The slightly more complex but a bit more elegant method that I see, and that I wish the various major Linux distributions would adopt directly into their base releases, is something like what the VMware Tools does when installed on a supported Linux distribution. You can see their own explanation <a href="http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465" target="_blank">at this link</a>.<br />
<br />
The issue with this today is that not only is it only added if you install the VMware Tools, but the line it adds to the udev rules only affects disks exposed by VMware -- something that will not help you if you are using Xen or KVM or VirtualBox and so on. So, something a bit more agnostic is called for. In building this little blog post, and coming upon this issue (admittedly for the umpteenth time), I decided to go ahead and finally do something about it.<br />
<br />
My investigations so far have concluded there is no danger to 'bad' or unmatched rules in a udev rules file (at worst, you get a warning in syslog on boot from udev complaining about the lines it doesn't like, but it still parses the other rules fine). Thus, a simple single rules file put into /etc/udev/rules.d/ that contains rules for all possible OS and all possible exposed disks from a variety of virtualization hosts seems like the easiest way to go, so <a href="http://www.nex7.com/files/99-virt-scsi-udev.rules" target="_blank">I give you this link</a>. You can run the below command directly (as root) to install on most distributions (be sure /etc/udev/rules.d is where they go):<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">wget http://www.nex7.com/files/99-virt-scsi-udev.rules; mv 99-virt-scsi-udev.rules /etc/udev/rules.d/; chmod 644 /etc/udev/rules.d/99-virt-scsi-udev.rules</span></blockquote>
<br />
After putting it in /etc/udev/rules.d, just reboot. You can verify it is working with this one-liner (you're looking for results that have at least some entries that say '300'; if you don't see any, it either isn't working or you don't have any disks the rules match against):<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">for file in `find /sys/devices -iname timeout`; do (echo $file && cat $file); done</span></blockquote>
<br />
And that's it. I've tested the file on CentOS 6.3 on top of KVM, Ubuntu 12.04 on top of KVM, and the VMware ones on a variety of OS's and versions. As far as I know, the list of presently supported virtualization platforms and guest OS's of this file are:<br />
<br />
Hosts<br />
<br />
VMware 5+ (disks offered up via scsi)<br />
KVM 1.0+ (disks offered up via ide or scsi - virtio doesn't expose timeout at guest level)<br />
XenServer 5+ (disks offered up via scsi)<br />
<br />
Guests<br />
<br />
RHEL 5+ / CentOS 5+<br />
Ubuntu 10+<br />
<br />
If you run into any problems with this file, please let me know.<br />
<br />
<h3>
FreeBSD 9</h3>
There are two variables that appear to be of note - and common wisdom seems to jump between which one to tweak. I'll err on the side of timeout over retry here, but that may not be the best option in all situations. The one to modify - and it is a global variable as far as I can tell - is 'kern.cam.da.default_timeout'; change it from its default of 60 to 300. To make the change permanent, edit your /etc/sysctl.conf and add a line like this:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">kern.cam.da.default_timeout = 300</span></blockquote>
<br />
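To apply it live as well, without waiting for a reboot, you should be able to set the same sysctl directly:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">sysctl kern.cam.da.default_timeout=300</span></blockquote>
<br />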
If you're curious, the other variable mentioned online is 'kern.cam.da.retry_count', but I am less sure if the advice about it is fair or true.<br />
<br />
<h3>
NexentaStor (and other OpenSolaris-based derivatives)</h3>
So the easy way is to modify the sd timeout value. Unfortunately in OpenSolaris today, this value can only be set in /etc/system for all drives, with no config file method of setting it on a per-disk basis that I am aware of. To modify it globally, add this line to your /etc/system file and reboot:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">set sd:sd_io_time=300</span></blockquote>
<br />
This is dangerous if there are any disks exposed to your VM that are not coming from a SAN and such, since this is a global value (much like the Windows one). There does exist a method of modifying the live value used by the kernel on a per-disk basis using mdb, but building this into a script to run on boot and when disks change I've decided not to try to tackle at this time. If you want more info, check out Alisdair's post on the issue, <a href="http://blogs.everycity.co.uk/alasdair/2011/05/adjusting-drive-timeouts-with-mdb-on-solaris-or-openindiana/" target="_blank">found here</a>.Anonymoushttp://www.blogger.com/profile/11742425228318624524noreply@blogger.com32tag:blogger.com,1999:blog-6025886526053221932.post-20344124413136306912013-03-21T02:57:00.000-07:002013-09-16T09:39:42.387-07:00ZFS: Read Me 1st<h2>
Things Nobody Told You About ZFS</h2>
<div>
Yes, it's back. You may also notice it is now hosted on my Blogger page - just don't have time to deal with self-hosting at the moment, but I've made sure the old URL redirects here.</div>
<div>
<br /></div>
<div>
So, without further ado...<br />
<br /></div>
<div>
<a name='more'></a></div>
<h3>
Foreword</h3>
<div>
I will be updating this article over time, so check back now and then.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">Latest update 9/12/2013 - Hot Spare, 4K Sector and ARC/L2ARC sections edited, note on ZFS Destroy section, minor edit to Compression section.</span><br />
<br />
There are a couple of things about ZFS itself that are often skipped over or missed by users/administrators. Many deploy home or business production systems without even being aware of these gotchya's and architectural issues. Don't be one of those people!<br />
<br />
I do not want you to read this and think "ugh, forget ZFS". Every other filesystem I'm aware of has many and more issues than ZFS - going another route than ZFS because of perceived or actual issues with ZFS is like jumping into the hungry shark tank with a bleeding leg wound, instead of the goldfish tank, because the goldfish tank smelled a little fishy! Not a smart move.<br />
<br />
ZFS is one of the most powerful, flexible, and robust filesystems (and I use that word loosely, as ZFS is much more than just a filesystem, incorporating many elements of what is traditionally called a volume manager as well) available today. On top of that it's open source and free (as in beer) in some cases, so there's a lot there to love.<br />
<br />
However, like every other man-made creation ever dreamed up, it has its own share of caveats, gotchya's, hidden "features" and so on. The sorts of things that an administrator should be aware of before they lead to a 3 AM phone call! Due to its relative newness in the world (as compared to venerable filesystems like NTFS, ext2/3/4, and so on), and its very different architecture, yet very similar nomenclature, certain things can be ignored or assumed by potential adopters of ZFS that can lead to costly issues and lots of stress later.<br />
<br />
I make various statements in here that might be difficult to understand or that you disagree with - and often without wholly explaining why I've directed the way I have. I will endeavor to produce articles explaining them and update this blog with links to them, as time allows. In the interim, please understand that I've been on literally 1000's of large ZFS deployments in the last 2+ years, often called in when they were broken, and much of what I say is backed up by quite a bit of experience. This article is also often used, cited, reviewed, and so on by many of my fellow ZFS support personnel, so it gets around and mistakes in it get back to me eventually. I can be wrong - but especially if you're new to ZFS, you're going to be better served not assuming I am. :)<br />
<br /></div>
<h3>
<u>
1. Virtual Devices Determine IOPS</u></h3>
<div>
IOPS (I/O per second) are mostly a factor of the number of virtual devices (vdevs) in a zpool. They are <b>not</b> a factor of the raw number of disks in the zpool. This is probably the single most important thing to realize and understand, and is commonly not. </div>
<div>
<br /></div>
<div>
ZFS stripes writes across <b>vdevs</b> (not individual disks). A vdev is typically IOPS bound to the speed of the slowest disk within it. So if you have one vdev of 100 disks, your zpool's raw IOPS potential is effectively only a single disk, not 100. There's a couple of caveats on here (such as the difference between write and read IOPS, etc), but if you just put as a rule of thumb in your head that a zpool's raw IOPS potential is equivalent to the single slowest disk in each vdev in the zpool, you won't end up surprised or disappointed.<br />
<br /></div>
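To make that concrete, here's a rough sketch (device names purely illustrative) of two 6-disk pools - the first has the IOPS potential of roughly one disk, the second roughly three disks:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"># one 6-disk raidz2 vdev - IOPS potential of roughly a single disk<br />zpool create tank1 raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0<br /><br /># three 2-disk mirror vdevs - IOPS potential of roughly three disks<br />zpool create tank2 mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0</span></blockquote>
<br />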
<h3>
<u>
2. Deduplication Is Not Free</u></h3>
<div>
Another common misunderstanding is that ZFS deduplication, since its inclusion, is a nice, free feature you can enable to hopefully gain space savings on your ZFS filesystems/zvols/zpools. Nothing could be farther from the truth. Unlike a number of other deduplication implementations, ZFS deduplication is on-the-fly as data is read and written. This creates a number of architectural challenges that the ZFS team had to conquer, and the methods by which this was achieved lead to a significant and sometimes unexpectedly high RAM requirement.<br />
<br />
Every block of data in a dedup'ed filesystem can end up having an entry in a database known as the DDT (DeDupe Table). DDT entries need RAM. It is not uncommon for DDT's to grow to sizes larger than available RAM on zpools that aren't even that large (couple of TB's). If the hits against the DDT aren't being serviced primarily from RAM or fast SSD, performance quickly drops to abysmal levels. Because enabling/disabling deduplication within ZFS doesn't actually do anything to data already on disk, do <strong>not</strong> enable deduplication without a full understanding of its requirements and architecture first. You will be hard-pressed to get rid of it later.<br />
<br />
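If you're considering dedup anyway, you can get a rough feel for the cost up front - 'zdb -S' simulates dedup against an existing pool and prints a would-be DDT histogram, and 'zdb -DD' reports the real DDT statistics on a pool that already has dedup'ed data (pool name illustrative; each unique block costs on the order of a few hundred bytes of RAM, so those block counts translate fairly directly into a memory requirement):<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"># simulate dedup on pool 'tank' and print a would-be DDT histogram<br />zdb -S tank<br /><br /># show actual DDT statistics for a pool already using dedup<br />zdb -DD tank</span></blockquote>
<br />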
<h3>
<u>
3. Snapshots Are Not Backups</u></h3>
This is critically important to understand. ZFS has redundancy levels from mirrors and raidz. It has checksums and scrubs to help catch bit rot. It has snapshots to take lightweight point-in-time captures of data to let you roll back or grab older versions of files. It has all of these things to help protect your data. And one 'zfs destroy' by a disgruntled employee, one fire in your datacenter, one random chance of bad luck that causes a whole backplane, JBOD, or a number of disks to die at once, one faulty HBA, one hacker, one virus, etc, etc, etc -- and poof, your pool is gone. I've seen it. Lots of times. <b><u>MAKE BACKUPS.</u></b><br />
<b><u><br /></u></b></div>
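What a real backup looks like will vary, but as a minimal sketch (host and pool names illustrative), a recursive snapshot plus 'zfs send' to a second machine is a good start - and unlike snapshots alone, it puts your data on a second set of hardware:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">zfs snapshot -r tank@backup-20130321<br />zfs send -R tank@backup-20130321 | ssh backuphost zfs receive -Fdu backuppool</span></blockquote>
<br />
For anything beyond a toy setup you'd script incremental sends (zfs send -i) on a schedule, but the principle is the same.<br />
<br />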
<h3>
<u>
4. ZFS Destroy Can Be Painful</u></h3>
<div>
(9/12/2013) A few illumos-based OSes are now shipping ZFS with the "async destroy" feature. That has a significant mitigating impact on the text below - ZFS destroys still have to do the work, but they do it in the background, in a manner far less damaging to performance and stability. However, not all shipping OSes have this code in them yet (for instance, NexentaStor 3.x does not). If your ZFS has feature flag support, it might have async destroy; if it is still using the old 'zpool version' method, it probably doesn't.<br />
<br />
Something often glossed over or not discussed about ZFS is how it presently handles destroy tasks. This is specific to the "zfs destroy" command, be it used on a zvol, filesystem, clone or snapshot. This does not apply to deleting files within a ZFS filesystem (unless that file is very large - for instance, if a single file is all that a whole filesystem contains) or on the filesystem formatted onto a zvol, etc. It also does not apply to "zpool destroy". ZFS destroy tasks are potential <b>downtime causers</b>, when not properly understood and treated with the respect they deserve. Many a SAN has suffered impacted performance or full service outages due to a "zfs destroy" in the middle of the day on just a couple of terabytes (no big deal, right?) of data. The truth is a "zfs destroy" is going to go touch many of the metadata blocks related to the object(s) being destroyed. Depending on the block size of the destroy target(s), the number of metadata blocks that have to be touched can quickly reach into the millions, even the hundreds of millions.<br />
<br />
If a destroy needs to touch 100 million blocks, and the zpool's IOPS potential is 10,000, how long will that zfs destroy take? Somewhere around 2 1/2 hours! That's a good scenario - ask any long-time ZFS support person or administrator and they'll tell you horror stories about day long, even week long "zfs destroy" commands. There's eventual work that can be done to make this less painful (a major one is in the works right now) and there's a few things that can be done to mitigate it, but at the end of the day, always check the actual used disk size of something you're about to destroy and potentially hold off on that destroy if it's significant. How big is too big? That is a factor of block size, pool IOPS potential, extenuating circumstances (current I/O workload of the pool, deduplication on or off, a few other things).<br />
<br /></div>
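A simple habit that helps here: before destroying anything, look at how much space it actually consumes, snapshots included (dataset name illustrative):<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"># 'zfs list -o space' breaks USED down into snapshots, the dataset itself, refreservation and children<br />zfs list -o space tank/dataset-to-destroy</span></blockquote>
<br />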
<h3>
<u>
5. RAID Cards vs HBA's</u></h3>
<div>
ZFS provides RAID, and does so with a number of improvements over most traditional hardware RAID card solutions. ZFS uses block-level logic for things like rebuilds, it has far better handling of disk loss & return due to the ability to rebuild only what was missed instead of rebuilding the entire disk, it has access to more powerful processors than the RAID card and far more RAM as well, it does checksumming and auto-correction based on it, etc. Many of these features are gone or useless if the disks provided to ZFS are, in fact, RAID LUN's from a RAID card, or even RAID0 single-disk entities offered up. </div>
<div>
<br /></div>
<div>
If your RAID card doesn't support a true "JBOD" (sometimes referred to as "passthrough") mode, don't use it if you can avoid it. Creating single-disk RAID0's (sometimes called "virtual drives") and then letting ZFS create a pool out of those is better than creating RAID sets on the RAID card itself and offering those to ZFS, but only about 50% better, and still 50% worse than JBOD mode or a real HBA. Use a real HBA - don't use RAID cards.<br />
<br /></div>
<h3>
<u>
6. SATA vs SAS</u></h3>
<div>
This has been a long-standing argument in the ZFS world. Simple fact is, the majority of ZFS storage appliances, most of the consultants and experts you'll talk to, and the majority of enterprise installations of ZFS are using SAS disks. To be clear, "nearline" SAS (7200 RPM SAS) is <b>fine</b>, but what will often get you in trouble is the use of SATA (including enterprise-grade) disks behind bad interposers (which is most of them) and SAS expanders (which almost every JBOD is going to be utilizing).</div>
<div>
<br /></div>
<div>
Plan to purchase SAS disks if you're deploying a 'production' ZFS box. In any decent-sized deployment, they're not going to have much of a price delta over equivalent SATA disks. The only exception to this rule is home and very small business use-cases -- and for more on that, I'll try to wax on about it in a post later.<br />
<br /></div>
<h3>
<u>
7. Compression Is Good (Even When It Isn't)</u></h3>
<div>
It is the very rare dataset or use-case that I run into these days where compress=on (lzjb) doesn't make sense. It is on by default on most ZFS appliances, and that is my recommendation. Turn it on, and don't worry about it. Even if you discover that your compression ratio is nearly 0% - it still isn't hurting you enough to turn it off, generally speaking. Other compression algorithms such as gzip are another matter entirely, and in almost all cases should be strongly avoided. I do see environments using gzip for datasets they truly do not care about performance on (long-term archival, etc). In my experience if that is the case, go with gzip-9, as the performance difference between gzip-1 and gzip-9 is minimal (when then compared to lzjb or off). You're going to get the pain, so you may as well go for the best compression ratio.<br />
<br /></div>
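Turning it on and checking what it's actually buying you is trivial - just remember compression only applies to data written after it is enabled:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"># 'on' maps to lzjb on most current implementations<br />zfs set compression=on tank<br /><br /># see what you're getting back, per dataset<br />zfs get -r compressratio tank</span></blockquote>
<br />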
<h3>
<u>
8. RAIDZ - Even/Odd Disk Counts</u></h3>
<div>
Try (and not very hard) to keep the number of data disks in a raidz vdev to an even number. This means if it's raidz1, the total number of disks in the vdev would be an odd number. If it is raidz2, an even number, and if it is raidz3, an odd number again. Breaking this rule has very little repercussion, however, so you should break it if your pool layout would be nicer by doing so (like to match things up on JBOD's, etc).<br />
<br /></div>
<h3>
<u>
9. Pool Design Rules</u></h3>
<div>
I've got a variety of simple rules I tell people to follow when building zpools:</div>
<div>
<ul>
<li>Do not use raidz1 for disks 1TB or greater in size.</li>
<li>For raidz1, do not use less than 3 disks, nor more than 7 disks in each vdev (and again, they should be under 1 TB in size, preferably under 750 GB in size) (5 is a typical average).</li>
<li>For raidz2, do not use less than 6 disks, nor more than 10 disks in each vdev (8 is a typical average).</li>
<li>For raidz3, do not use less than 7 disks, nor more than 15 disks in each vdev (13 & 15 are typical average).</li>
<li>Mirrors trump raidz almost every time. Far higher IOPS potential from a mirror pool than any raidz pool, given equal number of drives. Only downside is redundancy - raidz2/3 are safer, but much slower. Only way that doesn't trade off performance for safety is 3-way mirrors, but it sacrifices a ton of space (but I have seen customers do this - if your environment demands it, the cost may be worth it).</li>
<li>For >= 3TB size disks, 3-way mirrors begin to become more and more compelling.</li>
<li>Never mix disk sizes (within a few %, of course) or speeds (RPM) within a single vdev.</li>
<li>Never mix disk sizes (within a few %, of course) or speeds (RPM) within a zpool, except for l2arc & zil devices.</li>
<li>Never mix redundancy types for data vdevs in a zpool (no raidz1 vdev and 2 raidz2 vdevs, for example)</li>
<li>Never mix disk counts on data vdevs within a zpool (if the first data vdev is 6 disks, all data vdevs should be 6 disks).</li>
<li>If you have multiple JBOD's, try to spread each vdev out so that the minimum number of disks are in each JBOD. If you do this with enough JBOD's for your chosen redundancy level, you can even end up with no SPOF (Single Point of Failure) in the form of JBOD, and if the JBOD's themselves are spread out amongst sufficient HBA's, you can even remove HBA's as a SPOF.</li>
</ul>
<div>
If you keep these in mind when building your pool, you shouldn't end up with something tragic.</div>
</div>
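<div>
As a quick sketch of a layout that follows these rules (device names purely illustrative, with c1* and c2* standing in for disks in two different JBODs), here's a mirror pool where each vdev has one disk in each JBOD - lose an entire JBOD, or the HBA in front of it, and every vdev still has one working side:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">zpool create tank \<br />mirror c1t0d0 c2t0d0 \<br />mirror c1t1d0 c2t1d0 \<br />mirror c1t2d0 c2t2d0 \<br />mirror c1t3d0 c2t3d0</span></blockquote>
</div>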
<div>
<br /></div>
<h3>
<u>
10. 4KB Sector Disks</u></h3>
<div>
(9/12/2013) The likelihood of this being an issue for you is presently very up in the air, very dependent on OS choice at the moment. There are more 4K disks out there, including some SSD's, and still some that are lying and claiming 512. However, there is also work being done to hard-code in recognition of these disks in illumos and so on. My blog post on here talking about my home BSD-based ZFS SAN has instructions on how to manually force recognition of 4K sector disks if they're not reporting on BSD, but it is not as easy on illumos derivatives as they do not have 'geom'. All I can suggest at the moment is Googling about zfs and "ashift" and your chosen OS and OS version -- not only does that vary the answer, but I myself am not spending any real time keeping track, so all I can suggest is do your own homework right now. I also do not recommend mixing -- if your pool started off with one sector size, keep it that way if you grow it or replace any drives. Do not mix/match.<br />
<br />
There are a number of in-the-wild devices that are 4KB sector size instead of the old 512-byte sector size. ZFS handles this just fine <b>if it knows the disk is 4K sector size. </b>The problem is a number of these devices are lying to the OS about their sector size, claiming it is 512-byte (in order to be compatible with ancient Operating Systems like Windows 95); this will cause <b>significant </b>performance issues if not dealt with at zpool creation time.<br />
<br /></div>
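If you want to see what an existing pool actually ended up with, the ashift value is recorded in the pool configuration - on illumos-type systems something like the following should show it (9 means 512-byte sectors, 12 means 4K):<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">zdb -C tank | grep ashift</span></blockquote>
<br />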
<h3>
<u>
11. ZFS Has No "Restripe"</u></h3>
<div>
If you're familiar with traditional RAID arrays, then the term "restripe" is probably in your vocabulary. Many people in this boat are surprised to hear that ZFS has no equivalent function at all. The method by which ZFS delivers data to the pool has a long-term equivalent to this functionality, but not an up-front way nor a command that can be run to kick off such a thing. </div>
<div>
<br /></div>
<div>
The most obvious task where this shows up is when you add a vdev to an existing zpool. You could be forgiven for expecting that the existing data in the pool would slide over and all your vdevs would end up at roughly equal used size (rebalancing is another term for this), since that's what a traditional RAID array would do. ZFS? It won't. That data balancing will only come as an indirect result of rewrites. If you only ever read from your pool, it'll never happen. Bear this in mind when designing your environment and making initial purchases. It is almost never a good idea, performance-wise, to start off with a handful of disks if within a year or two you expect to grow that pool to a significantly larger size, adding in small numbers of disks every X weeks/months.<br />
<br /></div>
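You can watch this imbalance for yourself after adding a vdev - 'zpool iostat -v' shows allocated and free space per vdev, and a freshly added vdev will sit nearly empty until rewrites slowly even things out:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">zpool iostat -v tank</span></blockquote>
<br />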
<h3>
<u>
12. Hot Spares</u></h3>
<div>
Don't use them. Pretty much ever. Warm spares make sense in some environments. Hot spares almost never make sense. Very often it makes more sense to include the disks in the pool and increase redundancy level because of it, than it does to leave them out and have a lower redundancy level.<br />
<br />
For a bit of clarification, the main reasoning behind this has to do with the present method hot spares are handled by ZFS & Solaris FMA and so on - the whole environment involved in identifying a failed drive and choosing to replace it is far too simplistic to be useful in many situations. For instance, if you create a pool that is designed to have no SPOF in terms of JBOD's and HBA's, and even go so far as to put hot spares in each JBOD, the code presently in illumos (9/12/2013) has nothing in it to understand you did this, and it's going to be sheer chance if a disk dies and it picks the hot spare in the same JBOD to resilver to. It is more likely it just picks the first hot spare in the spares list, which is probably in a different JBOD, and now your pool has a SPOF.<br />
<br />
Further, it isn't intelligent enough to understand things like catastrophic loss -- say you again have a pool setup where the HBA's and JBOD's are set up for no SPOF, and you lose an HBA and the JBOD connected to it - you had 40 drives in mirrors, and now you are only seeing half of each mirror -- but you also have a few hot spares in that JBOD, say 2. Now, obviously, picking 2 random mirrors and starting to resilver them from the hot spares still visible is silly - you lost a whole JBOD, all your mirrors have gone to single drive, and the only logical solution is getting the other JBOD back on (or if it somehow went nuts, a whole new JBOD full of drives and attach them to the existing mirrors). Resilvering 2 of your 20 mirror vdevs to hot spares in the still-visible JBOD is just a waste of time at best, and dangerous at worst, and it's GOING to do it.<br />
<br />
What I tend to tell customers when the hot spare discussion comes up is actually to start with a question. The multi-part question is this: how many hours could possibly pass before your team is able to remotely login to the SAN after receiving an alert that there's been a disk loss event, and how many hours could possibly pass before your team is able to physically arrive to replace a disk after receiving an alert that there's been a disk loss event?<br />
<br />
The idea, of course, is to determine if hot spares are seemingly required, or if warm spares would do, or if cold spares are acceptable. Here's the ruleset in my head that I use after they tell me the answers to that question (and obviously, this is just my opinion on the numbers to use):<br />
<br />
<ul>
<li>Under 24 hours for remote access, but physical access or lack of disks could mean physical replacement takes longer</li>
<ul>
<li>Warm spares</li>
</ul>
<li>Under 24 hours for remote access, and physical access with replacement disks is available by that point as well</li>
<ul>
<li>Pool is 2-way mirror or raidz1 vdevs</li>
<ul>
<li>Warm spares</li>
</ul>
<li>Pool is >2-way mirror or raidz2-3 vdevs</li>
<ul>
<li>Cold spares</li>
</ul>
</ul>
<li>Over 24 hours for remote or physical access</li>
<ul>
<li>Hot spares start to become a potential risk worth taking, but a serious discussion about best practices and risks has to be had - often, if it's 48-72 hours as the timeline, warm or cold spares may still make sense depending on pool layout; > 72 hours to replace is generally where hot spares become something of a requirement to cover those situations where they help, but at that point a discussion needs to be had about why the customer environment has a > 72 hour window where a replacement disk isn't available</li>
</ul>
</ul>
I'd have to make one huge bullet list to try to cover every possible contingency here - each customer is unique, but these are some general guidelines. Remember, it takes a significant amount of time to resilver a disk, so adding in X amount of additional hours is not adding a lot of risk, <i>especially</i> for 3-way or higher mirrors and raidz2-3 vdevs, which can already handle multiple failures.<br />
<br /></div>
<h3>
<u>
13. ZFS Is Not A Clustered Filesystem</u></h3>
<div>
I don't know where this got started, but at some point, something must have been said that has led some people to believe ZFS is or has clustered filesystem features. It does not. ZFS lives on a single set of disks in a single system at a time, period. Various HA technologies have been developed to "seamlessly" move the pool from one machine to another in case of hardware issues, but they move the pool - they don't offer up the storage from multiple heads at once. There is no present (9/12/2013) method of "clustered ZFS" where the same pool is offering up datasets from multiple physical machines. I'm aware of no work to change this.<br />
<br /></div>
<h3>
<u>
14. To ZIL, Or Not To ZIL</u></h3>
<div>
This is a common question - do I need a ZIL (ZFS Intent Log)? So, first of all, this is the wrong question. In almost every storage system you'll ever build utilizing ZFS, you will need and will have a ZIL. The first thing to explain is that there is a difference between the ZIL and a ZIL (referred to as a log or slog) device. It is very common for people to call a log device a "ZIL" device, but this is wrong - there is a reason ZFS' own documentation always refers to the ZIL as the ZIL, and a log device as a log device. Not having a log device does not mean you do not have a ZIL!<br />
<br />
So with that explained, the real question is, do you need to direct those writes to a separate device from the pool data disks or not? In general, you do if one or more of the intended use-cases of the storage server are very write latency sensitive, or if the total combined IOPS requirement of the clients is approaching say 30% of the raw pool IOPS potential of the zpool. In such scenarios, the addition of a log vdev can have an immediate and noticeable positive performance impact. If neither of those is true, it is likely you can just skip a log device and be perfectly happy. Most home systems, for example, have no need of a log device and won't miss not having it. Many small office environments using ZFS as a simple file store will also not require it. Larger enterprises or latency-sensitive storage will generally require fast log devices.<br />
<br /></div>
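If you do determine you need one, adding a log vdev is a one-liner (device names illustrative) - and mirroring your log devices is generally wise, since on older pool versions losing an unmirrored log device could be very painful:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">zpool add tank log mirror c3t0d0 c3t1d0</span></blockquote>
<br />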
<h3>
<u>
15. ARC and L2ARC</u></h3>
<div>
(9/12/2013) There are presently issues related to memory handling and the ARC that have me <i>strongly </i>suggesting you physically limit RAM in any ZFS-based SAN to 128 GB. Go to > 128 GB at your own peril (it might work fine for you, or might cause you some serious headaches). Once resolved, I will remove this note.<br />
<br />
One of ZFS' strongest performance features is its intelligent caching mechanisms. The primary cache, stored in RAM, is the ARC (Adaptive Replacement Cache). The secondary cache, typically stored on fast media like SSD's, is the L2ARC (second level ARC). Basic rule of thumb in almost all scenarios is don't worry about L2ARC, and instead just put as much RAM into the system as you can, within financial realities. ZFS loves RAM, and it will use it - there is a point of diminishing returns depending on how big the total working set size really is for your dataset(s), but in almost all cases, more RAM is good. If your use-case does lend itself to a situation where RAM will be insufficient and L2ARC is going to end up being necessary, there are rules about how much addressable L2ARC one can have based on how much ARC (RAM) one has.<br />
<br /></div>
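For reference, adding an L2ARC device is just adding a 'cache' vdev, and on illumos-based systems you can keep an eye on the ARC itself via kstats (device name illustrative):<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">zpool add tank cache c3t2d0<br />kstat -p zfs:0:arcstats:size</span></blockquote>
<br />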
<h3>
<u>
16. Just Because You Can, Doesn't Mean You Should</u></h3>
<div>
ZFS has very few limits - and what limits it has are typically measured in megazillions, and are thus unreachable with modern hardware. Does that mean you should create a single pool made up of 5,000 hard disks? In almost every scenario, the answer is <strong>no</strong>. The fact that ZFS is so flexible and has so few limits means, if anything, that proper design is more important than in legacy storage systems. It is a truism that in most environments that need lots of storage space, it is likely more efficient and architecturally sound to find a smaller-than-total break point and design systems to meet that size, then build more than one of them to meet your total space requirements. There is almost never a time when this is not true.<br />
<br />
It is very rare for a company to need 1 PB of space in one filesystem, even if it does need 1 PB in total space. Find a logical separation and build to meet it, not go crazy and try to build a single 1 PB zpool. ZFS may let you, but various hardware constraints will inevitably doom this attempt or create an environment that works, but could have worked far better at the same or even lower cost.<br />
<br />
Learn from Google, Facebook, Amazon, Yahoo and every other company with a huge server deployment -- they learned to scale out, with lots of smaller systems, because scaling up with giant systems not only becomes astronomically expensive, it quickly ends up being a negative ROI versus scaling out.<br />
<br /></div>
<h3>
<u>
17. Crap In, Crap Out</u></h3>
<div>
ZFS is only as good as the hardware it is put on. Even ZFS can corrupt your data or lose it, if placed on inferior components. Examples of things you don't want to do if you want to keep your data intact include using non-ECC RAM, using non-enterprise disks, using SATA disks behind SAS expanders, using non-enterprise class motherboards, using a RAID card (especially one without a battery), putting the server in a poor environment for a server to be in, etc.</div>
Anonymoushttp://www.blogger.com/profile/11742425228318624524noreply@blogger.com585tag:blogger.com,1999:blog-6025886526053221932.post-84490222755025614792012-09-22T23:57:00.001-07:002013-03-25T15:09:59.264-07:00Sub-$1K Followup - Learning and TuningSo as of a few hours ago, I've finally migrated everything off the old home SAN and onto the new one. I'll be updating that post a bit later with the promised pictures, but I must say, I am neither a photographer nor a superior case wiring guy.<br />
<div>
<a name='more'></a><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-qVHBUOlrXgs/UF6qgGfF-ZI/AAAAAAAAACI/_jswcPq8WQo/s1600/zpool-zfs-list.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="508" src="http://1.bp.blogspot.com/-qVHBUOlrXgs/UF6qgGfF-ZI/AAAAAAAAACI/_jswcPq8WQo/s640/zpool-zfs-list.JPG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zpool status and zfs list (minus all the VM LUN's/shares)</td></tr>
</tbody></table>
<h3>
Performance</h3>
So just so we are all on the same page, this is a home SAN used by a handful of desktops, laptops, and servers (some of which are running a few VM's). Overall performance requirements are not onerous, and certainly nowhere near an enterprise. That said, the little guy can run.<br />
<br />
Various iozone and bonnie benchmarks run locally lead me to believe the 4-disk, 2-mirror-vdev pool 'home-0' is capable of some 500 MB/s sequential large-block throughput best-case, with a random 8K write IOPS potential of around 250 and a random 8K read IOPS potential of around 500 - which is in keeping with what one may expect from a pool of this configuration. The throughput is more than sufficient to overload the single Gbit NIC port, and in fact could overload a few more were I inclined to throw in a card. I'll paste one iozone output - this was write/rewrite & read/re-read on a 1 GB file x 4 threads @ 128K with O_DIRECT on (even still, reads are clearly through ARC).<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"> File size set to 1048576 KB</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Record Size 128 KB</span><br />
<span style="font-family: Courier New, Courier, monospace;"> O_DIRECT feature enabled</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Command line used: iozone -i 0 -i 1 -t 4 -s 1G -r 128k -I</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Output is in Kbytes/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Time Resolution = 0.000001 seconds.</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Processor cache size set to 1024 Kbytes.</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Processor cache line size set to 32 bytes.</span><br />
<span style="font-family: Courier New, Courier, monospace;"> File stride size set to 17 * record size.</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Throughput test with 4 processes</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Each process writes a 1048576 Kbyte file in 128 Kbyte records</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> Children see throughput for 4 initial writers = 4013508.50 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Parent sees throughput for 4 initial writers = 133536.91 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Min throughput per process = 942170.62 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Max throughput per process = 1084839.00 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Avg throughput per process = 1003377.12 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Min xfer = 910720.00 KB</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> Children see throughput for 4 rewriters = 4426626.19 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Parent sees throughput for 4 rewriters = 187908.93 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Min throughput per process = 1016124.94 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Max throughput per process = 1168724.50 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Avg throughput per process = 1106656.55 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Min xfer = 913792.00 KB</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> Children see throughput for 4 readers = 9211610.25 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Parent sees throughput for 4 readers = 9182744.11 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Min throughput per process = 2298095.25 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Max throughput per process = 2309781.00 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Avg throughput per process = 2302902.56 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Min xfer = 1043328.00 KB</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> Children see throughput for 4 re-readers = 11549440.75 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Parent sees throughput for 4 re-readers = 11489037.68 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Min throughput per process = 2084150.50 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Max throughput per process = 3724946.00 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Avg throughput per process = 2887360.19 KB/sec</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Min xfer = 590848.00 KB</span><br />
<br />
To achieve this, I have the following minimal tunables for ZFS in my loader.conf:<br />
<br />
<div>
<span style="font-family: Courier New, Courier, monospace;">### ZFS</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"># still debating best settings here - very dependent on type of disk used</span><br />
<span style="font-family: Courier New, Courier, monospace;"># in ZFS, also scary when you </span><span style="font-family: 'Courier New', Courier, monospace;">don't have homogeneous disks in a pool - </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># fortunately I do at home. What you put here, like on Solaris,</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"># will have an impact on throughput vs latency - lower = better latency, </span><br />
<span style="font-family: Courier New, Courier, monospace;"># higher = more throughput but worse </span><span style="font-family: 'Courier New', Courier, monospace;">latency - 4/8 seem like a decent </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"># middle ground for 7200 RPM devices with NCQ</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">vfs.zfs.vdev.min_pending=4</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">vfs.zfs.vdev.max_pending=8</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"># 5/1 is too low, and 30/5 is too high - best wisdom at the moment for a </span><br />
<span style="font-family: Courier New, Courier, monospace;"># good default is 10/5</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">vfs.zfs.txg.timeout="10"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">vfs.zfs.txg.synctime_ms="5000"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"># it is the rare workload that benefits from this more than it hurts</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">vfs.zfs.prefetch_disable="1"</span></div>
<h3>
iSCSI</h3>
What's more important is how it performs to clients. My main box is (gasp) a Windows 7 desktop. I looked around a bit and found 'ATTO', which I recall a customer once using to benchmark disks on Windows. I can't speak to its overall quality, but it returned the results I'd expect with Direct I/O checked either on or off.<br />
<br />
I'm only really interested in the transfer sizes I ever expect to use, so 8K to 128K, and I maxed out the 'length' ATTO could go to, 2 GB - which, I might add, is still well within the range of my ARC, but the Direct I/O checkbox in ATTO works quite well and seems to force real reads. As you can see in the first picture, the iSCSI disk (F:) performs quite admirably considering this is between two standard Realtek NIC's on desktop motherboards, going through a sub-$200 home 1 Gbit switch, all running at the standard 1500 MTU.<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-qgb_qKGVlx0/UF6gdESCnMI/AAAAAAAAABQ/ZuKniathvJ0/s1600/atto-8-128-2G-direct.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="640" src="http://3.bp.blogspot.com/-qgb_qKGVlx0/UF6gdESCnMI/AAAAAAAAABQ/ZuKniathvJ0/s640/atto-8-128-2G-direct.JPG" width="500" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">8 KB to 128 KB - 2 GB file - Direct I/O - 4 queue</td></tr>
</tbody></table>
<div>
For giggles, in the next picture you'll see the same test performed with the Direct I/O checkbox cleared. Look at me go! 4.1 GB/s read performance - across a 1 Gbit link! Obviously a little broken. ATTO seemed to give up on the 128 K read (I also ran a test with the full range of transfer sizes up to 8 M, and it completely crapped out on the 8 M one as well, reporting nothing for read performance).<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: left;"><tbody>
<tr><td style="text-align: center;"><br />
<a href="http://1.bp.blogspot.com/-TrWKMeGyCjw/UF6geh917nI/AAAAAAAAABY/2P7XSUPWZmo/s1600/atto-8-128-2G-async.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="640" src="http://1.bp.blogspot.com/-TrWKMeGyCjw/UF6geh917nI/AAAAAAAAABY/2P7XSUPWZmo/s640/atto-8-128-2G-async.JPG" width="499" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">8 KB to 128 KB - 2 GB file - no Direct I/O - 4 queue</td></tr>
</tbody></table>
For iSCSI, I opted for the istgt daemon, not the older offering. My istgt.conf file is below in its uncommented entirety. Note that my NodeBase and TargetName are set the way they are to mimic the settings from the old SAN, thereby removing any need to touch the iSCSI setup on my client machines. I've left out most of the LUN's, but kept two to show how they are defined in this file:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">[Global]</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Comment "Global section"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> NodeBase "iqn.1986-03.com.sun:02"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> PidFile /var/run/istgt.pid</span><br />
<span style="font-family: Courier New, Courier, monospace;"> AuthFile /usr/local/etc/istgt/auth.conf</span><br />
<span style="font-family: Courier New, Courier, monospace;"> LogFacility "local7"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Timeout 30</span><br />
<span style="font-family: Courier New, Courier, monospace;"> NopInInterval 20</span><br />
<span style="font-family: Courier New, Courier, monospace;"> DiscoveryAuthMethod Auto</span><br />
<span style="font-family: Courier New, Courier, monospace;"> MaxSessions 32</span><br />
<span style="font-family: Courier New, Courier, monospace;"> MaxConnections 8</span><br />
<span style="font-family: Courier New, Courier, monospace;"> MaxR2T 32</span><br />
<span style="font-family: Courier New, Courier, monospace;"> MaxOutstandingR2T 16</span><br />
<span style="font-family: Courier New, Courier, monospace;"> DefaultTime2Wait 2</span><br />
<span style="font-family: Courier New, Courier, monospace;"> DefaultTime2Retain 60</span><br />
<span style="font-family: Courier New, Courier, monospace;"> FirstBurstLength 262144</span><br />
<span style="font-family: Courier New, Courier, monospace;"> MaxBurstLength 1048576</span><br />
<span style="font-family: Courier New, Courier, monospace;"> MaxRecvDataSegmentLength 262144</span><br />
<span style="font-family: Courier New, Courier, monospace;"> InitialR2T Yes</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ImmediateData Yes</span><br />
<span style="font-family: Courier New, Courier, monospace;"> DataPDUInOrder Yes</span><br />
<span style="font-family: Courier New, Courier, monospace;"> DataSequenceInOrder Yes</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ErrorRecoveryLevel 0</span><br />
<span style="font-family: Courier New, Courier, monospace;">[UnitControl]</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Comment "Internal Logical Unit Controller"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> AuthMethod CHAP Mutual</span><br />
<span style="font-family: Courier New, Courier, monospace;"> AuthGroup AuthGroup10000</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Portal UC1 127.0.0.1:3261</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Netmask 127.0.0.1</span><br />
<span style="font-family: Courier New, Courier, monospace;">[PortalGroup1]</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Comment "T1 portal"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Portal DA1 192.168.2.15:3260</span><br />
<span style="font-family: Courier New, Courier, monospace;">[InitiatorGroup1]</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Comment "Initiator Group1"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> InitiatorName "ALL"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Netmask 192.168.2.0/24</span><br />
<span style="font-family: Courier New, Courier, monospace;">[LogicalUnit1]</span><br />
<span style="font-family: Courier New, Courier, monospace;"> TargetName iqn.1986-03.com.sun:02:homesan</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Mapping PortalGroup1 InitiatorGroup1</span><br />
<span style="font-family: Courier New, Courier, monospace;"> AuthMethod Auto</span><br />
<span style="font-family: Courier New, Courier, monospace;"> AuthGroup AuthGroup1</span><br />
<span style="font-family: Courier New, Courier, monospace;"> UnitType Disk</span><br />
<span style="font-family: Courier New, Courier, monospace;"> LUN0 Storage /dev/zvol/home-0/ag_disk1 Auto</span><br />
<span style="font-family: Courier New, Courier, monospace;"> LUN1 Storage /dev/zvol/home-0/steam_games Auto</span><br />
<h3>
CIFS / Samba</h3>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;" width="350"><a href="http://4.bp.blogspot.com/-D9c1aY-1DUo/UF6k6iusHKI/AAAAAAAAABo/Sw8BJcMFPL4/s1600/cifs-writing-to-SSD.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="http://4.bp.blogspot.com/-D9c1aY-1DUo/UF6k6iusHKI/AAAAAAAAABo/Sw8BJcMFPL4/s320/cifs-writing-to-SSD.jpg" width="320" /></a></td><td style="text-align: center;">What about CIFS? Yes, I make some use of it (sigh). What can I say, I'm lazy. I'm not sure how to 'benchmark' CIFS, apparently it's not something often done. I found a 1+ GB .iso file and copied it to my OS SSD and then back to a new place on a share, and captured these two shots. To the left, you have me copying the ISO from Z: (a CIFS network mapped drive) to My Documents (my OS SSD). The speed range was pretty consistent, in the 80-90 MB/s area.</td></tr>
<tr><td class="tr-caption" style="text-align: center;">From CIFS Share to OS SSD</td><td></td></tr>
</tbody></table>
<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;">To the right you see the same, but in reverse -- copying the file from the OS SSD to a folder on the Z: network share. Again, performance in the 80-90 MB/s. It never dipped on either transfer. I tried various other files and the speed range was a bit erratic at times (as CIFS is wont to do), with the lowest I saw being 68 MB/s and the highest being 192 MB/s (hah).</td><td style="text-align: center;" width="350"><a href="http://4.bp.blogspot.com/-XuDCSeSorAk/UF6k60jiiLI/AAAAAAAAABw/SP2-2mEKMdI/s1600/cifs-writing-to-share.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="http://4.bp.blogspot.com/-XuDCSeSorAk/UF6k60jiiLI/AAAAAAAAABw/SP2-2mEKMdI/s320/cifs-writing-to-share.jpg" width="320" /></a></td></tr>
<tr><td></td><td class="tr-caption" style="text-align: center;">From OS SSD to CIFS Share</td></tr>
</tbody></table>
<br />
<br />
<br />
Part of reaching this level of performance with CIFS on a FreeBSD Samba install was the tuning, however. I went with Samba version 3.6, and my smb.conf entries for ZFS/AIO can be found below. Please bear in mind that I use CIFS at home for unimportant data that is snapshotted regularly, that I can afford to lose a few minutes of work, that I'm impatient and don't want to wait forever for a file copy, and that at the moment the home SAN has no SSD for slog. These settings are very likely not safe for in-transit data if power is lost (mostly due to aio write behind = yes):</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">[global]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> ...</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span><span style="font-family: 'Courier New', Courier, monospace;"> socket options = IPTOS_LOWDELAY TCP_NODELAY SO_RCVBUF=131072 SO_SNDBUF=131072</span></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> use sendfile = no</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> min receivefile size = 16384</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> aio read size = 16384</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> aio write size = 16384</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> aio write behind = yes</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> ...</span></div>
<div>
<br />
I also have the following related changes to my /boot/loader.conf file:<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">aio_load="YES"</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;">net.inet.tcp.sendbuf_max=16777216</span><br />
<span style="font-family: Courier New, Courier, monospace;">net.inet.tcp.recvbuf_max=16777216</span><br />
<span style="font-family: Courier New, Courier, monospace;">net.inet.tcp.sendspace=65536</span><br />
<span style="font-family: Courier New, Courier, monospace;">net.inet.tcp.recvspace=131072</span><br />
<div>
<br /></div>
<br />
<div>
And to my /etc/rc.conf file:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">smbd_enable="YES"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">nmbd_enable="YES"</span></div>
</div>
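<br />
A quick sanity check after a reboot - this is just a sketch, and the rc script name assumes the samba36 port installs it as 'samba' - is to confirm the aio module actually loaded and that the daemons came up and are listening:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># aio.ko should be listed</span><br />
<span style="font-family: Courier New, Courier, monospace;">kldstat | grep aio</span><br />
<span style="font-family: Courier New, Courier, monospace;"># smbd/nmbd should report running and be bound to their ports</span><br />
<span style="font-family: Courier New, Courier, monospace;">service samba status</span><br />
<span style="font-family: Courier New, Courier, monospace;">sockstat -4 -l | grep smbd</span><br />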
<br />
<h3>
FreeBSD</h3>
On the whole I'm impressed, but I am finding a few things lacking. One example, though it is of no consequence to me: as others have pointed out, if I had been doing FC target work on my Solaris SAN, I would have been unable to try this out, since FreeBSD lacks the kind of solid FC target capability that COMSTAR provides. The istgt package seems to do iSCSI great, but it is limited to iSCSI.<br />
<br />
There were also various personal learning curves. For a while, istgt would fail to start at boot, and it turned out that while I had zfs_load="YES" in my /boot/loader.conf file, I did not have zfs_enable="YES" in my /etc/rc.conf file. This is sneaky, because ZFS itself worked just fine (I could type 'zfs list' right after boot and there were my datasets), but istgt wouldn't work until I created or renamed a zvol. FreeBSD simply wasn't creating the /dev/zvol device links istgt was looking for until something touched a zvol; once I added zfs_enable="YES" to /etc/rc.conf, the links were instantiated at boot time just fine.<br />
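<br />
For reference, the two lines in question, plus the check I now do after a reboot (the pool and path are obviously mine - substitute your own):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># /boot/loader.conf</span><br />
<span style="font-family: Courier New, Courier, monospace;">zfs_load="YES"</span><br />
<span style="font-family: Courier New, Courier, monospace;"># /etc/rc.conf</span><br />
<span style="font-family: Courier New, Courier, monospace;">zfs_enable="YES"</span><br />
<span style="font-family: Courier New, Courier, monospace;"># after boot, the zvol device nodes istgt needs should already exist</span><br />
<span style="font-family: Courier New, Courier, monospace;">ls /dev/zvol/home-0/</span><br />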
<br />
I find the ports system flexible, powerful, and foreign. I'm probably committing a FreeBSD admin faux pas by installing the new 'pkg' system, but both before and after installing it I have freely used 'make install' in /usr/ports when necessary (for example, to compile Samba 3.6 with the options I wanted).<br />
<br />
I do enjoy learning new things, though, so I'm quite pleased with the time spent and the results so far. My next steps will be:<br />
<br />
<ul>
<li>creating a robust automatic snapshot script/daemon</li>
<li>seeing if I can't dust off my old Python curses skills and build a simple one screen health monitor with various SAN-specific highlights to leave running on the LCD monitor plugged into the box</li>
<li>going downstairs right now and finding the Crown and a Coke - pictures on other post can wait</li>
</ul>
<div>
<br />
Ciao!</div>
</div>
</div>
<div>
<br /></div>
Anonymoushttp://www.blogger.com/profile/11742425228318624524noreply@blogger.com9tag:blogger.com,1999:blog-6025886526053221932.post-39641064713665059092012-09-19T21:13:00.001-07:002013-03-26T07:02:10.070-07:00Sub-$1000 8 TB Home SAN? Can I do it?An 8+ TB home SAN in under $1000? Can I do it?<br />
(I'll update with pictures when I can - because everybody likes homebrew SAN pornography).<br />
<a name='more'></a><br />
It has only been 5 years and 2 months since I registered this blog and promptly did nothing with it. Why not suddenly start using it, I ask myself. Good place for personal stuff, and this is about my home SAN/NAS, so why not? I'll probably cross-post this one at Nex7.com, as well as any other ZFS-related ones, but I'll also post other non-ZFS and even non-computer stuff up here (like the plans to move to Washington state and set up a self-sufficient rural living situation - no, I'm not kidding!).<br />
<br />
Before I answer the question posed by the title, let me explain my requirements. I have an existing Nexenta-running SAN with 2 TB of usable space and a whole 4 GB of RAM. It is built on an old dual-Opteron SuperMicro motherboard, and just locating 2 GB of RAM to add to the 2 GB it came with cost me nearly $300! Oh, and the hardware is dying.<br />
<br />
Still, it served well for a few years, and at this point it is indispensable, as I have multiple iSCSI, CIFS, and NFS shares hanging off of it in live use on my other systems, including my main day-to-day desktop (which actually lives off that SAN; the only disk in it is an 80 GB SSD for the OS). So, with it being critical, the motherboard starting to flake out, and components in it over 6 years old, it was time to upgrade!<br />
<br />
The solution needed to be:<br />
<ul>
<li>Cost-effective (sub-$1000).</li>
<li>Provide a minimum of 8 TB of raw disk space (required).</li>
<li>Separate OS from data (required).</li>
<li>Provide ECC RAM (required - makes things much more difficult).</li>
<li>Provide at least 8 GB of RAM (preference for 16 GB or more - my ARC hit rate on the old box is usually over 99%, but at peak times can drift down quite a bit, and the plan is to put the wife on it as well, and a few new boxes/VM's I have planned).</li>
<li>Support ZFS, either via an illumos derivative or FreeBSD (preference on supporting both).</li>
<li>Be expandable (preferred but optional).</li>
</ul>
<div>
Essentially, I wanted a "pseudo-enterprise-grade" SAN (via ZFS and ECC RAM) on a home-user budget, with sufficient space and RAM to handle the day-to-day performance requirements of a serious power user or two, plus a few home servers.<br />
<br /></div>
<div>
<b>
So, was I able to do it?</b><br />
<br /></div>
Short answer: almost! I slipped by $110.86 ($50.87 for you, who learn from my mistakes).<br />
<br />
Long answer:<br />
<br />
The most time-consuming part of this was the ECC RAM requirement. This is pretty critical - running a ZFS storage server without ECC RAM just rubs me the wrong way. Sure, RAM bit flips are rare, but they do happen, and if I'm going to spend all this effort checksumming all my bytes on disk, I may as well protect my RAM as well. :)<br />
<br />
Tied to the necessity for ECC RAM was motherboard choice - I clearly wanted a good motherboard, but at the same time, a 'budget' build can't spend $450 on a mid-grade server board! Keeping my search to desktop or low-end server motherboards led to a fairly time-consuming process of elimination: finding one with the right features, a decent collection of PCI-e slots for expansion, a sufficient RAM maximum, ECC support, and so on.<br />
<br />
I quickly found that ECC RAM on any Intel-based platform adds hundreds of dollars over your average home system, since it requires a server-grade motherboard and a server-class Intel CPU. AMD to the rescue - certain AMD chipset-based motherboards, notably many of the ones from ASUS, support ECC RAM, and do so at a combined mobo/CPU cost literally a third or less that of a comparable Intel solution. It took more than a little digging to find a currently available motherboard that someone else had already purchased and stated unequivocally supported ECC RAM, but I was able to find a few (the Gigabyte equivalent to the motherboard I got from ASUS is also verified to have it, per a user on [H]ardOCP).<br />
<br />
I also spent quite a while agonizing over cases and such, trying to find one that provided sufficient cooling in a home environment for at least 8 disks (I knew I only needed 4 to hit my space requirements for the next few years, but I wanted at least double the slots for future-proofing). There were a couple of good options; the one I've listed below was the winner for me, but feel free to experiment. I will say I have no complaints about the case, other than that one of the 4-slot HDD bays came with a screw that could not be removed through any reasonable means and cannot be put back in now (but since the HDD's are actually secured via screws in the bottom and the slide 'tray' locks in pretty securely without the screw, I didn't bother to RMA).<br />
<br />
<span style="font-size: large;">Shopping list</span>, with links and prices on NewEgg as of 9/10/2012 when it was all ordered:<br />
<br />
<table>
<tbody>
<tr>
<th align="left">Component</th><th align="left">Qty.</th><th align="left">Price (sum)</th><th align="left">Description</th><th align="left">Newegg Link</th>
</tr>
<tr>
<td>PSU</td>
<td>1</td>
<td>$89.99</td>
<td>SeaSonic M12II 620 Bronze 620W ATX</td>
<td><a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16817151095">Link</a></td>
</tr>
<tr>
<td>MBD</td>
<td>1</td>
<td>$154.99</td>
<td>ASUS M5A99FX PRO R2.0</td>
<td><a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16813131851">Link</a></td>
</tr>
<tr>
<td>CPU</td>
<td>1</td>
<td>$129.99</td>
<td>AMD FX-6100 Zambezi 3.3 GHz Hex-Core</td>
<td><a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819103962">Link</a></td>
</tr>
<tr>
<td>RAM</td>
<td>4</td>
<td>$28.99 <span style="font-size: xx-small;">($115.96)</span></td>
<td>Kingston 4GB 240-Pin DDR3 ECC Unbuffered</td>
<td><a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16820139077">Link</a></td>
</tr>
<tr>
<td>OS SSD</td>
<td>1</td>
<td>$49.99</td>
<td>OCZ Vertex Plus R2 VTXPLR2-25SAT2-60GB</td>
<td><a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16820227811">Link</a></td>
</tr>
<tr>
<td>Data HDD</td>
<td>4</td>
<td>$99.99 <span style="font-size: xx-small;">($399.96)</span></td>
<td>Seagate Barracuda STBD2000101 2TB 7200 RPM</td>
<td><a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16822148910">Link</a></td>
</tr>
<tr>
<td>Case</td>
<td>1</td>
<td>$109.99</td>
<td>Fractal Design Arc Midi Black</td>
<td><a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16811352007">Link</a></td>
</tr>
<tr>
<td colspan="2">Total</td>
<td>$1,050.87</td>
<td></td>
</tr>
</tbody></table>
<br />
<span style="font-size: large;">Compatibility notes</span> - short list:<br />
<br />
<table>
<tbody>
<tr>
<th align="left">OS</th><th align="left">NET</th><th align="left">Disk</th><th align="left">USB 3.0</th><th align="left">Link</th>
</tr>
<tr>
<td>FreeBSD 9.0</td>
<td>X*</td>
<td>✓</td>
<td>✓</td>
<td><a href="http://www.freebsd.org/releases/9.0R/announce.html">Link</a></td>
</tr>
<tr>
<td>FreeBSD 9.1</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>N/A</td>
</tr>
<tr><td>NexentaStor Enterprise 3.1.3</td>
<td>✓ (gani driver)</td>
<td>✓</td>
<td>2.0 only</td>
<td><a href="http://nexenta.com/corp/nexentastor-download">Link</a></td>
</tr>
<tr>
<td>NexentaStor Enterprise 4.0<br />
(not publicly avail, yet)</td>
<td>✓ (re driver)</td>
<td>✓</td>
<td>2.0 only</td>
<td>N/A</td>
</tr>
<tr>
<td>Illumian 1.0</td>
<td>X**</td>
<td>✓</td>
<td>2.0 only</td>
<td><a href="http://illumian.org/download.htm">Link</a></td>
</tr>
</tbody>
</table>
<br />
<span style="font-size: x-small;">* FreeBSD 9.0 - you can snag drivers either pre-compiled or build them into the kernel after install very easily. Google for 'FreeBSD 9 Realtek 8111F' and everything you need should be on the first page.</span><br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;">** I wasn't able to tell - when I finally burned a copy of illumian I already had BSD installed, so I just looked at the installer, and unlike the installer for NexentaStor, I saw no instance of a gani0 or re0 in ifconfig; this could be that it doesn't have the necessary support, or it could be illumian installer doesn't turn up a network device until later in the process than I was willing to go. I'll err on the side of caution and say no. :)</span><br />
<br />
<span style="font-size: large;">Compatibility notes</span> - long list:<br />
<br />
The drives above are 4K-sector drives that report a 512-byte sector size. On FreeBSD, the following is required to make them work properly (basically: create 4K gnop devices on top of them, create the pool on those, export the pool, destroy the gnop devices, import again, and verify ashift is still 12):<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# ls -lha /dev/ada*</span><br />
<span style="font-family: Courier New, Courier, monospace;">crw-r----- 1 root operator 0, 101 Sep 19 01:00 /dev/ada0 <-- SSD</span><br />
<span style="font-family: Courier New, Courier, monospace;">crw-r----- 1 root operator 0, 108 Sep 19 01:00 /dev/ada0p1 -</span><br />
<span style="font-family: Courier New, Courier, monospace;">crw-r----- 1 root operator 0, 110 Sep 18 20:00 /dev/ada0p2 -</span><br />
<span style="font-family: Courier New, Courier, monospace;">crw-r----- 1 root operator 0, 112 Sep 18 20:00 /dev/ada0p3 -</span><br />
<span style="font-family: Courier New, Courier, monospace;">crw-r----- 1 root operator 0, 114 Sep 19 01:00 /dev/ada1 <-- data HDD</span><br />
<span style="font-family: Courier New, Courier, monospace;">crw-r----- 1 root operator 0, 116 Sep 19 01:00 /dev/ada2 <-- data HDD</span><br />
<span style="font-family: Courier New, Courier, monospace;">crw-r----- 1 root operator 0, 122 Sep 19 01:00 /dev/ada3 <-- data HDD</span><br />
<span style="font-family: Courier New, Courier, monospace;">crw-r----- 1 root operator 0, 124 Sep 19 01:00 /dev/ada4 <-- data HDD</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# gnop create -S 4096 /dev/ada1</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# gnop create -S 4096 /dev/ada2</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# gnop create -S 4096 /dev/ada3</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# gnop create -S 4096 /dev/ada4</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# zpool create home-0 mirror /dev/ada1.nop /dev/ada2.nop mirror /dev/ada3.nop /dev/ada4.nop</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# zpool list</span><br />
<span style="font-family: Courier New, Courier, monospace;">NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT</span><br />
<span style="font-family: Courier New, Courier, monospace;">home-0 3.62T 452K 3.62T 0% 1.00x ONLINE -</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# zdb -C home-0 | grep ashift</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ashift: 12</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ashift: 12</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# zpool export home-0</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# gnop destroy /dev/ada1.nop /dev/ada2.nop /dev/ada3.nop /dev/ada4.nop</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# zpool import home-0</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# zdb -C home-0 | grep ashift</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ashift: 12</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ashift: 12</span><br />
<span style="font-family: Courier New, Courier, monospace;">bsdsan# zpool status</span><br />
<span style="font-family: Courier New, Courier, monospace;"> pool: home-0</span><br />
<span style="font-family: Courier New, Courier, monospace;"> state: ONLINE</span><br />
<span style="font-family: Courier New, Courier, monospace;"> scan: none requested</span><br />
<span style="font-family: Courier New, Courier, monospace;">config:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> NAME STATE READ WRITE CKSUM</span><br />
<span style="font-family: Courier New, Courier, monospace;"> home-0 ONLINE 0 0 0</span><br />
<span style="font-family: Courier New, Courier, monospace;"> mirror-0 ONLINE 0 0 0</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ada1 ONLINE 0 0 0</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ada2 ONLINE 0 0 0</span><br />
<span style="font-family: Courier New, Courier, monospace;"> mirror-1 ONLINE 0 0 0</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ada3 ONLINE 0 0 0</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ada4 ONLINE 0 0 0</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">errors: No known data errors</span><br />
<br />
To be honest, I'm not sure of the best way to deal with the 512/4K issue on the illumos derivatives. My usual advice in a professional capacity to people building corporate SAN's is to steer clear of 512-emulated 4K drives altogether, so I've never dug too far into it - and I fear answers that involve custom binaries! As the goal of this build was not only to replace an aging home storage box but also to familiarize myself with ZFS+FreeBSD (since I get plenty of Nexenta at work), I didn't bother to research it (maybe I will later), as FreeBSD made it so painless to deal with (nice).<br />
<br />
Also, the motherboard I chose has a Realtek 8111F chipset for the NIC. You need to either compile the driver after installation or snag a pre-compiled one, OR just wait for FreeBSD 9.1 - FreeBSD 9.0 has no built-in support for it. It is not hard to do: Google 'FreeBSD 9 Realtek 8111F' and the first page has all you need.<br />
<br />
<span style="font-size: large;">Afterword:</span><br />
<br />
Those paying attention may note that's not quite the total I quoted earlier. No, the difference isn't tax. There were a few extra components I took the liberty of ordering that were not strictly necessary (and some went unused), like Arctic Cooling MX-4 to replace the terrible thermal pad on the OEM heatsink (which I did use) and a few extra cables (which ended up not being necessary).<br />
<br />
Those really paying attention may notice the case holds 8 drives and I only put in 4 (the OS SSD fits in a small SSD slot up in the 5.25" area, leaving a full 4 unused 3.5" internal HDD slots). This is because the on-board motherboard SATA controllers only handle a total of 7 drives: 1 OS and 6 data. There is a way to get to a full 8, and to put all 8 onto a solid non-motherboard SATA controller, but it involves more cost and some extra legwork. If you're interested, look into this card: <a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157">http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157</a>.<br />
<br />
Specifically, that card flashed with LSI IT firmware (which it will take); it is then probably the cheapest PCI-e SATA/SAS controller you can get that is really enterprise grade and works fine on Solaris derivatives as well as FreeBSD. You'll also need 2 mini-SAS to SATA FORWARD breakout cables (don't get reverse breakout cables). That's the route I intend to take when I need to double the spindle count in this little home SAN, but for now 8 raw TB is more than sufficient for my needs, so I opted to skip it (it adds about $200 between the card and cables).<br />
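<br />
When that day comes, the pool expansion itself is the easy part. Assuming the four new disks show up as da0 through da3 behind the HBA (hypothetical device names - yours will differ) and that they are also 512-emulated 4K drives needing the same gnop treatment, the sketch looks like:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"># same 4K trick as before, but for the new devices</span><br />
<span style="font-family: Courier New, Courier, monospace;">gnop create -S 4096 /dev/da0</span><br />
<span style="font-family: Courier New, Courier, monospace;">gnop create -S 4096 /dev/da1</span><br />
<span style="font-family: Courier New, Courier, monospace;">gnop create -S 4096 /dev/da2</span><br />
<span style="font-family: Courier New, Courier, monospace;">gnop create -S 4096 /dev/da3</span><br />
<span style="font-family: Courier New, Courier, monospace;"># grow the pool with two more mirrors</span><br />
<span style="font-family: Courier New, Courier, monospace;">zpool add home-0 mirror /dev/da0.nop /dev/da1.nop mirror /dev/da2.nop /dev/da3.nop</span><br />
<span style="font-family: Courier New, Courier, monospace;"># export, drop the gnop devices, re-import, and re-check ashift as above</span><br />
<span style="font-family: Courier New, Courier, monospace;">zpool export home-0 && gnop destroy /dev/da0.nop /dev/da1.nop /dev/da2.nop /dev/da3.nop && zpool import home-0</span><br />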
<br />
Still, add that in and it knocks the build up to about $1250 for an 8 TB raw SAN that can be expanded to 16 TB raw for another $400, so $1650 for 16 raw TB (interesting that it comes out to roughly $100/TB). As long as you stick to one breakout cable per mini-SAS port (so 8 disks total), SATA is no problem on OpenSolaris derivatives (reportedly less of an issue on BSD), and if you go SAS, well, the case becomes your limiter (you could probably cram 2-3 more drives up in the 5.25" area, and possibly more if you wanted to get crazy). Obviously 3 TB disks are rapidly dropping in price, too, so soon this same build could be had at 12 TB raw on 4 disks or 24 TB raw on 8 disks - and of course, 4 TB drives take that to 16/32.<br />
<br />
I'm still setting it up after testing out all the OS's above - now it is time to get serious, so we'll see if I get motivated to post what I find out about migrating from a Solaris+ZFS solution to a BSD+ZFS solution (I can already see a few gotchas, like the differences between COMSTAR & istgt).Anonymoushttp://www.blogger.com/profile/11742425228318624524noreply@blogger.com6