Thursday, March 21, 2013

ZFS: Read Me 1st

Things Nobody Told You About ZFS

Yes, it's back. You may also notice it is now hosted on my Blogger page - just don't have time to deal with self-hosting at the moment, but I've made sure the old URL redirects here.

So, without further ado...

Foreword

Article last updated 4/17/2013 - I will be updating this article over time, so check back now and then.

There are a couple of things about ZFS itself that are often skipped over or missed by users/administrators. Many deploy home or business production systems without even being aware of these gotchas and architectural issues. Don't be one of those people!

I do not want you to read this and think "ugh, forget ZFS". Every other filesystem I'm aware of has as many issues as ZFS, or more - going another route than ZFS because of perceived or actual issues with ZFS is like jumping into the hungry shark tank with a bleeding leg wound, instead of the goldfish tank, because the goldfish tank smelled a little fishy! Not a smart move.

ZFS is one of the most powerful, flexible, and robust filesystems (and I use that word loosely, as ZFS is much more than just a filesystem, incorporating many elements of what is traditionally called a volume manager as well) available today. On top of that it's open source and free (as in beer) in some cases, so there's a lot there to love.

However, like every other man-made creation ever dreamed up, it has its own share of caveats, gotchas, hidden "features" and so on. The sorts of things that an administrator should be aware of before they lead to a 3 AM phone call! Due to its relative newness in the world (as compared to venerable filesystems like NTFS, ext2/3/4, and so on), and its very different architecture, yet very similar nomenclature, certain things can be ignored or assumed by potential adopters of ZFS that can lead to costly issues and lots of stress later.

I make various statements in here that might be difficult to understand or that you disagree with - and often without wholly explaining why I've directed the way I have. I will endeavor to produce articles explaining them and update this blog with links to them, as time allows. In the interim, please understand that I've been on literally 1000's of large ZFS deployments in the last 2+ years, often called in when they were broken, and much of what I say is backed up by quite a bit of experience. This article is also often used, cited, reviewed, and so on by many of my fellow ZFS support personnel, so it gets around and mistakes in it get back to me eventually. I can be wrong - but especially if you're new to ZFS, you're going to be better served not assuming I am. :)

1. Virtual Devices Determine IOPS

IOPS (I/O operations per second) are mostly a factor of the number of virtual devices (vdevs) in a zpool. They are not a factor of the raw number of disks in the zpool. This is probably the single most important thing to realize and understand, and it commonly is not.

ZFS stripes writes across vdevs (not individual disks). A vdev is typically IOPS bound to the speed of the slowest disk within it. So if you have one vdev of 100 disks, your zpool's raw IOPS potential is effectively only that of a single disk, not 100. There are a couple of caveats here (such as the difference between write and read IOPS, etc), but if you just keep as a rule of thumb in your head that a zpool's raw IOPS potential is equivalent to the sum of the single slowest disk in each vdev in the zpool, you won't end up surprised or disappointed.
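To make that rule of thumb concrete, here is a minimal Python sketch. The function name and the per-disk IOPS figures are made up for illustration, and it deliberately ignores the read/write caveats mentioned above.

```python
def pool_iops_estimate(vdevs):
    """Rule-of-thumb raw IOPS for a pool: the slowest disk of each vdev, summed.

    `vdevs` is a list of vdevs, each given as a list of per-disk IOPS figures.
    """
    if not vdevs or any(len(disks) == 0 for disks in vdevs):
        raise ValueError("need at least one vdev, and every vdev needs a disk")
    return sum(min(disks) for disks in vdevs)


# Hypothetical figure: roughly 100 IOPS per spinning disk.
one_wide_vdev = [[100] * 100]      # 1 vdev containing all 100 disks
many_mirrors = [[100, 100]] * 50   # the same 100 disks as 50 two-way mirrors

print(pool_iops_estimate(one_wide_vdev))  # 100
print(pool_iops_estimate(many_mirrors))   # 5000
```

Same 100 disks, a 50x difference in raw IOPS potential, purely from how they are grouped into vdevs.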

2. Deduplication Is Not Free

Another common misunderstanding is that ZFS deduplication, since its inclusion, is a nice, free feature you can enable to hopefully gain space savings on your ZFS filesystems/zvols/zpools. Nothing could be farther from the truth. Unlike a number of other deduplication implementations, ZFS deduplication is on-the-fly as data is read and written. This creates a number of architectural challenges that the ZFS team had to conquer, and the methods by which this was achieved lead to a significant and sometimes unexpectedly high RAM requirement.

Every block of data in a dedup'ed filesystem can end up having an entry in a database known as the DDT (DeDupe Table). DDT entries need RAM. It is not uncommon for DDT's to grow to sizes larger than available RAM on zpools that aren't even that large (couple of TB's). If the hits against the DDT aren't being serviced primarily from RAM or fast SSD, performance quickly drops to abysmal levels. Because enabling/disabling deduplication within ZFS doesn't actually do anything to data already on disk, do not enable deduplication without a full understanding of its requirements and architecture first. You will be hard-pressed to get rid of it later.
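For a feel of how quickly the DDT can outgrow RAM, here is a back-of-the-envelope sketch in Python. The 320 bytes per entry is an assumption on my part (it is a commonly quoted approximate figure, not an exact one, and it varies by platform and version), and the function name is made up for this example.

```python
# Assumption: approximate in-RAM cost of one DDT entry. Treat as a ballpark.
DDT_ENTRY_BYTES = 320

TIB = 1024 ** 4
GIB = 1024 ** 3


def ddt_ram_estimate_bytes(unique_data_bytes, avg_block_bytes,
                           entry_bytes=DDT_ENTRY_BYTES):
    """Rough RAM needed to keep the whole dedup table in memory."""
    if avg_block_bytes <= 0:
        raise ValueError("average block size must be positive")
    # One DDT entry per unique block (ceiling division).
    entries = -(-unique_data_bytes // avg_block_bytes)
    return entries * entry_bytes


for block_kib in (128, 8):
    ram = ddt_ram_estimate_bytes(2 * TIB, block_kib * 1024)
    print(f"2 TiB of unique data at {block_kib} KiB blocks: ~{ram / GIB:.1f} GiB of DDT")
```

Under those assumptions, 2 TiB of unique data needs about 5 GiB of DDT at 128 KiB blocks and about 80 GiB at 8 KiB blocks, which is why small-block zvols are where dedup hurts most.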

3. Snapshots Are Not Backups

This is critically important to understand. ZFS has redundancy levels from mirrors and raidz. It has checksums and scrubs to help catch bit rot. It has snapshots to take lightweight point-in-time captures of data to let you roll back or grab older versions of files. It has all of these things to help protect your data. And one 'zfs destroy' by a disgruntled employee, one fire in your datacenter, one random chance of bad luck that causes a whole backplane, JBOD, or a number of disks to die at once, one faulty HBA, etc, etc, etc -- and poof, your pool is gone. I've seen it. Lots of times. MAKE BACKUPS.

4. ZFS Destroy Can Be Painful

Something often glossed over or not discussed about ZFS is how it presently handles destroy tasks. This is specific to the "zfs destroy" command, be it used on a zvol, filesystem, clone or snapshot. This does not apply to deleting files within a ZFS filesystem (unless that file is very large - for instance, if a single file is all that a whole filesystem contains) or on the filesystem formatted onto a zvol, etc. It also does not apply to "zpool destroy". ZFS destroy tasks can cause downtime when not properly understood and treated with the respect they deserve. Many a SAN has suffered impacted performance or full service outages due to a "zfs destroy" in the middle of the day on just a couple of terabytes (no big deal, right?) of data. The truth is a "zfs destroy" is going to go touch many of the metadata blocks related to the object(s) being destroyed. Depending on the block size of the destroy target(s), the number of metadata blocks that have to be touched can quickly reach into the millions, even the hundreds of millions.

If a destroy needs to touch 100 million blocks, and the zpool's IOPS potential is 10,000, how long will that zfs destroy take? Somewhere around 2 3/4 hours! That's a good scenario - ask any long-time ZFS support person or administrator and they'll tell you horror stories about day long, even week long "zfs destroy" commands. There is work that can eventually be done to make this less painful (a major one is in the works right now) and there are a few things that can be done to mitigate it, but at the end of the day, always check the actual used disk size of something you're about to destroy and potentially hold off on that destroy if it's significant. How big is too big? That is a factor of block size, pool IOPS potential, and extenuating circumstances (current I/O workload of the pool, deduplication on or off, a few other things).
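The arithmetic behind that estimate is simple enough to sketch in Python. The function names are mine, and the model is deliberately naive: it assumes one I/O per block touched and a pool doing nothing else, so treat the result as a floor rather than a prediction.

```python
def blocks_for(used_bytes, block_bytes):
    """Number of blocks a dataset of `used_bytes` occupies (ceiling division)."""
    return -(-used_bytes // block_bytes)


def destroy_time_hours(blocks_to_touch, pool_iops):
    """Naive destroy duration: one I/O per block, pool otherwise idle."""
    if pool_iops <= 0:
        raise ValueError("pool IOPS must be positive")
    return blocks_to_touch / pool_iops / 3600


# The example from the text: 100 million blocks at 10,000 IOPS.
print(round(destroy_time_hours(100_000_000, 10_000), 2))  # 2.78

# Block size matters: the same 2 TiB is far more blocks at 8 KiB than at 128 KiB.
TIB = 1024 ** 4
print(blocks_for(2 * TIB, 128 * 1024))  # 16777216
print(blocks_for(2 * TIB, 8 * 1024))    # 268435456
```

That is 10,000 seconds, a little under three hours, before accounting for any competing workload or deduplication.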

5. RAID Cards vs HBA's

ZFS provides RAID, and does so with a number of improvements over most traditional hardware RAID card solutions. ZFS uses block-level logic for things like rebuilds, it has far better handling of disk loss & return due to the ability to rebuild only what was missed instead of rebuilding the entire disk, it has access to more powerful processors than the RAID card and far more RAM as well, it does checksumming and auto-correction based on it, etc. Many of these features are gone or useless if the disks provided to ZFS are, in fact, RAID LUN's from a RAID card, or even RAID0 single-disk entities offered up. 

If your RAID card doesn't support a true "JBOD" (sometimes referred to as "passthrough") mode, don't use it if you can avoid it. Creating single-disk RAID0's (sometimes called "virtual drives") and then letting ZFS create a pool out of those is better than creating RAID sets on the RAID card itself and offering those to ZFS, but only about 50% better, and still 50% worse than JBOD mode or a real HBA. Use a real HBA - don't use RAID cards.

6. SATA vs SAS

This has been a long-standing argument in the ZFS world. Simple fact is, the majority of ZFS storage appliances, most of the consultants and experts you'll talk to, and the majority of enterprise installations of ZFS are using SAS disks. To be clear, "nearline" SAS (7200 RPM SAS) is fine, but what will often get you in trouble is the use of SATA (including enterprise-grade) disks behind bad interposers (which is most of them) and SAS expanders (which almost every JBOD is going to be utilizing).

Plan to purchase SAS disks if you're deploying a 'production' ZFS box. In any decent-sized deployment, they're not going to have much of a price delta over equivalent SATA disks. The only exception to this rule is home and very small business use-cases -- and for more on that, I'll try to wax on about it in a post later.

7. Compression Is Good (Even When It Isn't)

It is the very rare dataset or use-case that I run into these days where compress=on (lzjb) doesn't make sense. It is on by default on most ZFS appliances, and that is my recommendation. Turn it on, and don't worry about it. Even if you discover that your compression ratio is nearly 0% - it still isn't hurting you enough to turn it off, generally speaking. Other compression algorithms such as gzip are another matter entirely, and in almost all cases should be strongly avoided.

8. RAIDZ - Even/Odd Disk Counts

Try (and not very hard) to keep the number of data disks in a raidz vdev to an even number. This means if it's raidz1, the total number of disks in the vdev would be an odd number. If it is raidz2, an even number, and if it is raidz3, an odd number again. Breaking this rule has very little repercussion, however, so you should do so if your pool layout would be nicer by doing so.
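The even/odd relationship falls out of subtracting the parity disks from the total. A tiny Python sketch, with helper names invented for this example:

```python
# Parity disks per raidz level.
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def data_disks(level, total_disks):
    """Disks left for data once parity is subtracted from the vdev width."""
    parity = PARITY[level]
    if total_disks <= parity:
        raise ValueError(f"{level} needs more than {parity} disks")
    return total_disks - parity


def has_even_data_disks(level, total_disks):
    return data_disks(level, total_disks) % 2 == 0


print(has_even_data_disks("raidz1", 5))  # True  (4 data + 1 parity)
print(has_even_data_disks("raidz2", 6))  # True  (4 data + 2 parity)
print(has_even_data_disks("raidz3", 7))  # True  (4 data + 3 parity)
print(has_even_data_disks("raidz2", 7))  # False (5 data + 2 parity)
```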

9. Pool Design Rules

I've got a variety of simple rules I tell people to follow when building zpools:
  • Do not use raidz1 for disks 1TB or greater in size.
  • For raidz1, do not use fewer than 3 disks, nor more than 7 disks in each vdev.
  • For raidz2, do not use fewer than 5 disks, nor more than 10 disks in each vdev.
  • For raidz3, do not use fewer than 7 disks, nor more than 15 disks in each vdev.
  • Mirrors trump raidz almost every time. Far higher IOPS potential from a mirror pool than any raidz pool, given equal number of drives.
  • For 3TB+ size disks, 3-way mirrors begin to become more and more compelling.
  • Never mix disk sizes (within a few %, of course) or speeds (RPM) within a single vdev.
  • Never mix disk sizes (within a few %, of course) or speeds (RPM) within a zpool, except for l2arc & zil devices.
  • Never mix redundancy types for data vdevs in a zpool.
  • Never mix disk counts on data vdevs within a zpool (if the first data vdev is 6 disks, all data vdevs should be 6 disks).
  • If you have multiple JBOD's, try to spread each vdev out so that the minimum number of disks are in each JBOD. If you do this with enough JBOD's for your chosen redundancy level, you can even end up with no SPOF (Single Point of Failure) in the form of JBOD, and if the JBOD's themselves are spread out amongst sufficient HBA's, you can even remove HBA's as a SPOF.
If you keep these in mind when building your pool, you shouldn't end up with something tragic.
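Several of the rules above are mechanical enough to check in code before you ever create a pool. The following Python sketch is only an illustration: the dictionary layout, function name, and warning strings are all made up for this example, and it covers just the width limits, the raidz1 size rule, and the two "never mix" rules for redundancy type and disk count.

```python
# Minimum and maximum disks per vdev, taken from the list above.
WIDTH_LIMITS = {"raidz1": (3, 7), "raidz2": (5, 10), "raidz3": (7, 15)}


def check_pool(vdevs):
    """Return a list of warnings for a proposed pool layout.

    Each vdev is a dict with hypothetical keys:
    'type' (e.g. 'mirror', 'raidz2'), 'disks' (count), 'disk_tb' (size in TB).
    """
    warnings = []
    for i, vdev in enumerate(vdevs):
        kind, width, size_tb = vdev["type"], vdev["disks"], vdev["disk_tb"]
        if kind == "raidz1" and size_tb >= 1:
            warnings.append(f"vdev {i}: raidz1 with disks of 1TB or greater")
        if kind in WIDTH_LIMITS:
            low, high = WIDTH_LIMITS[kind]
            if not low <= width <= high:
                warnings.append(f"vdev {i}: {kind} width {width} outside {low}-{high}")
    if len({vdev["type"] for vdev in vdevs}) > 1:
        warnings.append("pool mixes redundancy types")
    if len({vdev["disks"] for vdev in vdevs}) > 1:
        warnings.append("pool mixes vdev disk counts")
    return warnings


good = [{"type": "raidz2", "disks": 6, "disk_tb": 4}] * 3
bad = [
    {"type": "raidz1", "disks": 8, "disk_tb": 2},
    {"type": "mirror", "disks": 2, "disk_tb": 2},
]

print(check_pool(good))  # []
for warning in check_pool(bad):
    print(warning)
```

The second layout trips four rules at once: raidz1 on large disks, a raidz1 vdev that is too wide, mixed redundancy types, and mixed vdev widths.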

10. 4KB Sector Disks

There are a number of in-the-wild devices that are 4KB sector size instead of the old 512-byte sector size. ZFS handles this just fine if it knows the disk is 4K sector size. The problem is a number of these devices are lying to the OS about their sector size, claiming it is 512-byte (in order to be compatible with ancient Operating Systems like Windows 95); this will cause significant performance issues if not dealt with at zpool creation time.
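For context, ZFS records the sector size it believes a vdev has as a power-of-two exponent called ashift, fixed when the vdev is created. The text above doesn't name it, so take this Python sketch as my own illustration of why a disk lying about its sector size matters; the function name is made up.

```python
def ashift_for(sector_bytes):
    """Exponent such that 2 ** ashift == sector size (sector must be a power of two)."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


print(ashift_for(512))   # 9
print(ashift_for(4096))  # 12

# A 4K disk that reports 512 bytes ends up treated as ashift 9, so ZFS may
# issue writes smaller than, or misaligned to, the real 4096-byte sectors.
reported, actual = 512, 4096
print(ashift_for(reported) < ashift_for(actual))  # True
```

Because the value is set per vdev at creation, it has to be right up front; it is not something you correct on a live vdev afterwards.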

11. ZFS Has No "Restripe"

If you're familiar with traditional RAID arrays, then the term "restripe" is probably in your vocabulary. Many people in this boat are surprised to hear that ZFS has no equivalent function at all. The method by which ZFS delivers data to the pool has a long-term equivalent to this functionality, but not an up-front way nor a command that can be run to kick off such a thing. 

The most obvious task where this shows up is when you add a vdev to an existing zpool. You could be forgiven for expecting that the existing data in the pool would slide over and all your vdevs would end up of roughly equal used size, since that's what a traditional RAID array would do. ZFS? It won't. That data balancing will only come as an indirect result of rewrites. If you only ever read from your pool, it'll never happen.

12. Hot Spares

Don't use them. Pretty much ever. Warm spares make sense in some environments. Hot spares almost never make sense. Very often it makes more sense to include the disks in the pool and increase redundancy level because of it, than it does to leave them out and have a lower redundancy level.

13. ZFS Is Not A Clustered Filesystem

I don't know where this got started, but at some point, something must have been said that has led some people to believe ZFS is or has clustered filesystem features. It does not. ZFS lives on a single set of disks in a single system at a time, period. Various HA technologies have been developed to seamlessly move the pool from one machine to another in case of hardware issues, but they move the pool - they don't offer up the storage from multiple heads at once.

14. To ZIL, Or Not To ZIL

This is a common question - do I need a ZIL (ZFS Intent Log)? So, first of all, this is the wrong question. In almost every storage system you'll ever build utilizing ZFS, you will need and will have a ZIL. The first thing to explain is that there is a difference between the ZIL and a ZIL (referred to as a log or slog) device. It is very common for people to call a log device a "ZIL" device, but this is wrong - there is a reason ZFS' own documentation always refers to the ZIL as the ZIL, and a log device as a log device. Not having a log device does not mean you do not have a ZIL!

So with that explained, the real question is, do you need to direct those writes to a separate device from the pool data disks or not? In general, you do if one or more of the intended use-cases of the storage server are very write latency sensitive, or if the total combined IOPS requirement of the clients is approaching say 30% of the raw pool IOPS potential of the zpool. In such scenarios, the addition of a log vdev can have an immediate and noticeable positive performance impact. If neither of those is true, it is likely you can just skip a log device and be perfectly happy. Most home systems, for example, have no need of a log device.
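That decision can be written down as a one-line heuristic. This Python sketch just encodes the two conditions from the paragraph above; the function name is invented, and the 30% threshold is the rough figure from the text rather than a hard limit.

```python
def wants_log_device(latency_sensitive, client_iops, raw_pool_iops, threshold=0.30):
    """Heuristic: add a log device if writes are latency sensitive, or if
    client IOPS demand approaches `threshold` of the pool's raw IOPS potential."""
    if raw_pool_iops <= 0:
        raise ValueError("raw pool IOPS must be positive")
    return latency_sensitive or client_iops >= threshold * raw_pool_iops


# Hypothetical pool with a raw IOPS potential of 5,000.
print(wants_log_device(False, 500, 5000))   # False: light load, no latency concern
print(wants_log_device(False, 2000, 5000))  # True: 40% of raw pool IOPS
print(wants_log_device(True, 100, 5000))    # True: latency-sensitive use-case
```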

15. ARC and L2ARC

One of ZFS' strongest performance features is its intelligent caching mechanisms. The primary cache, stored in RAM, is the ARC (Adaptive Replacement Cache). The secondary cache, typically stored on fast media like SSD's, is the L2ARC (second level ARC). Basic rule of thumb in almost all scenarios is don't worry about L2ARC, and instead just put as much RAM into the system as you can, within financial realities. ZFS loves RAM, and it will use it - there is a point of diminishing returns depending on how big the total working set size really is for your dataset(s), but in almost all cases, more RAM is good. If your use-case does lend itself to a situation where RAM will be insufficient and L2ARC is going to end up being necessary, there are rules about how much addressable L2ARC one can have based on how much ARC (RAM) one has.

16. Just Because You Can, Doesn't Mean You Should

ZFS has very few limits - and what limits it has are typically measured in zillions, and are thus unreachable with modern hardware. Does that mean you should create a single pool made up of 5,000 hard disks? In almost every scenario, the answer is no. The fact that ZFS is so flexible and has so few limits means, if anything, that proper design is more important than in legacy storage systems. It is a truism that in most environments that need lots of storage space, it is likely more efficient and architecturally sound to find a smaller-than-total break point and design systems to meet that size, then build more than one of them to meet your total space requirements. There is almost never a time when this is not true.

It is very rare for a company to need 1 PB of space in one filesystem, even if it does need 1 PB in total space. Find a logical separation and build to meet it, rather than going crazy and trying to build a single 1 PB zpool. ZFS may let you, but various hardware constraints will inevitably doom this attempt or create an environment that works, but could have worked far better at the same or even lower cost.

Learn from Google, Facebook, Amazon, Yahoo and every other company with a huge server deployment -- they learned to scale out, with lots of smaller systems, because scaling up with giant systems not only becomes astronomically expensive, it quickly ends up being a negative ROI versus scaling out.

17. Crap In, Crap Out

ZFS is only as good as the hardware it is put on. Even ZFS can corrupt your data or lose it, if placed on inferior components. Examples of things you don't want to do if you want to keep your data intact include using non-ECC RAM, using non-enterprise disks, using SATA disks behind SAS expanders, using non-enterprise class motherboards, using a RAID card (especially one without a battery), putting the server in a poor environment for a server to be in, etc.

12 comments:

  1. This is great! When you have a slog, how do you decide pool spindle count to maximize the use of the slog. I have always used mirrors, but my math says to take advantage of a high performance slog, that I would want lots of spindles.

    My slogs do 900MB/sec, therefore don't I want a pool that does 900MB/sec, which is 20+ vdevs.

  2. That answer is really pretty specific on the workload of the pool itself. Much of the time, the slog devices are there to speed up the pool by offloading the ZIL traffic - and as an added benefit, reducing write latency from a client perspective.

    I almost always am looking at slog devices from an IOPS perspective first and foremost, and a throughput potential as a distant or even non-existent second (depends on the environment). Often a pool that can do 2.4 GB/s in a large-block sequential workload can't do anywhere near that at 4K random read/write request sizes (indeed, that's some 620,000 IOPS) -- and the client is doing exactly those, so suddenly all the interest is in IOPS and little time is spent worrying about throughput.

    In a pure throughput workload, things can and should be a bit different. And in ZFS, they are. For instance, ZFS has built-in mechanics for negating normal ZIL workflow if the incoming data is a large-block streaming workload. It can opt to send the data straight to the disk, bypassing any slog device (well, bypassing the ZIL entirely, really, and thus the slog device). There could be a whole post at some point on the varying conditions and how ZFS deals with each, I think. You've got 'logbias' on datasets (good writeup here: https://blogs.oracle.com/roch/entry/synchronous_write_bias_property ). And even on latency, there's some code to deal with limits, I believe. Take a look at: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zil.c#893 , or the ZFS On Linux guys (dechamps, specifically) has a pretty good write-up on this at https://github.com/zfsonlinux/zfs/issues/1012 .

  3. I liked the oracle article the best, thanks for the feedback. My scenario is different than theirs however. My specific workload is a VDI implementation with 80/20 r/w bias. I cannot seem to get a diskpool to get to performance levels to match the hardware I think. I have 22 spindle 10K mirrored pools with a ram based slog. The slog is rated at 90K iops and 900MB/sec.

    Wouldn't zpool iostat show under ideal conditions 22 * 50MB/s = 1100 MB/sec or near there? Best I can get is 300 MB/sec. I am just trying to explain the gap. Zpool iostat shows peaks of 42K iops which is great, but never very high MB/sec. When the system is not busy, I would think that a file copy would reach the speed of the slog at least or at least double what the readings are at 300MB/sec. Nobody seems to use zpool iostat for performance data. iostat seems to be the tool of choice, but I don't have that data compiled over time like I do for zpool iostat.


    So if I took my 22 spindle 10k mirrored pool to a 44 spindle mirrored pool, which is a little bigger than what oracle pushed at spec.org here: http://www.spec.org/sfs2008/results/res2012q2/sfs2008-20120402-00208.html , I should see my numbers go up closer to the limits of my slog, right?


  4. So, 'rated at' and 'capable of' are always two different things. However, more importantly, 'capable of when used as a ZFS log device' is a whole new ballgame.

    Manufacturers tend to provide numbers that show them in the most favorable light -- and even third-party analysis websites focus on typical use-cases: database, file transfer, those sorts of things.

    ZIL log device traffic is something every device fears - synchronous single-thread I/O. Your device may be capable of 90,000 IOPS @ 4K block size with 8, 16, 32, or more threads.. and anywhere from 4 to 64 threads is likely what both they and third-party websites run tests at -- but what can it do at 1 thread, at the average block size of your pool datasets? Because that's what a log device will be asked to do. :)

    As for SPECfs - I tend to, well, ignore that benchmark entirely. What it is testing isn't particularly real-world applicable, especially since vendors tend to game the system. For instance, you mention 44 spindle mirrors - no, in that test, the Oracle system had *280* drives, which they split up into 4 pools, each containing 4 filesystems, which were then tested in aggregate I believe. I also believe the data amount tested was significantly less than the pool size, and various other tunings were likely done as well. This picture gives some idea as to how big that system was: http://www.spec.org/sfs2008/results/res2012q2/sfs2008-20120402-00208.7420cluster.jpg

    Even pretending you had the specific tunings, and ignoring for a moment it's not particularly fair to just 'divide down' to get an idea for what a smaller system could do, doing so puts your 22 spindle 10K mirrored pool at about 14K iops, on the same benchmark.

    I generally want to see both iostat and zpool iostat; they're very different in what they're reporting, as they're reporting on different layers. Sometimes the combination of both gives hints that one or the other would not alone provide.

    I suspect with a 'VDI' implementation you're probably running 4-32K block size, and at that, I'd be happy with a peak of 42K iops out of 22 10K disks.. indeed, that's way past what you should realistically expect out of the drives, most of that 42K is coming out of ARC and an 80%+ read workload. Were I just gut feeling, I'd suspect you to get much less at times.

    This sort of performance work is time-consuming and involves a ton of variables. However, it is important to note that the log device is not some sort of write cache -- that's your RAM. The log device's job is to take the ZIL workload off the data pool. The performance benefit of that is purely in that the pool devices now have all those I/O they were spending on ZIL back. If there's any further benefit, it's just the luck of the draw that the incoming writes were 'redundant' (they were writing some % of the same blocks multiple times within a txg, allowing ZFS to effectively ignore all but the last write that comes in when it lays it out on the spinning media). The pain that spinning disks feel from ZIL traffic cannot be overstated. However, the streaming small-block performance of the spinning media minus the serious pain of interjecting the random read that gets past the ARC is, at the end of the day, the actual write performance the pool is capable of -- not what the log device can do at all.

    In super streaming workloads, sometimes, the log devices end up being the bottleneck. However, in almost all VM/VDI deployments I've seen, the log device is not your bottleneck - your drives are. :)

  5. Therefore, is going from a 22 disk mirror to a 44 disk mirror bad? How many vdevs are too many vdevs? The spec test, which I get was tuned, 280 disks, 2 controllers, leaves 140 disks per controller. 4 slogs, mean 4 mirrored pools, therefore they used 35 spindles. But you say that lots of spindles is bad.

    The slog I have is a STEC ZeusRAM. I discovered them in the Nexenta setup from VMworld 2011 (I have a diagram of it also), which is what I have been trying to replicate ever since. Since I have 100 of these 10K drives and JBODS to go with them, I am trying to figure out how to get the best out of them for a VDI deployment. So far I have only tried 22 spindles and I was thinking 44 would be better.

    Lots of $$ in equipment and consultants plus gobs and gobs of wasted time still has me scratching my head.

  6. No no, more spindles is usually better up to a point. I don't start to worry about spindle counts until it is up into the 100's. However, remember the Oracle box got the 200K+ IOPS only from 280 spindles - at 44, you're at a small fraction of that.

    Your box will perform twice as well as your 22-disk mirror system does, assuming no part in the system hits a bottleneck (which is going to happen if you've insufficient RAM, CPU, network, etc), and it is properly tuned(!). I would not expect, on a properly tuned system, in an 8-32K average block size VDI-workload, for 44 drives in mirror pool to be able to outperform a single STEC ZeusRAM (eg: I wouldn't expect it to be your bottleneck, from an IOPS perspective).

    I would expect the ZeusRAM to bottleneck you on a throughput test - or if your average blocksize is 32K or greater (getting ever more likely up to 128K). Its IOPS potential is not 90,000 at 4K, nor at 8K, 32K, or 128K (each of which is worse than the previous), because ZIL traffic is single-threaded, unlike most benchmarks you'd cite when saying how fast a device is.

    I love ZeusRAM, and I recommend them on every VM/VDI deployment I'm involved with and commend you on their use; but while they are in fact the very best device you could possibly use, it is not like they can't limit you, they are not of unlimited power. Still, again, if you're at 16K or under average block size, I'd suspect your pool (22 or 44 drives) to run out of IOPS, first. What block size are you using? What protocol (iSCSI, NFS)?

    Is this a NexentaStor-licensed system, or a home grown (and if so, what O/S & version)? That will matter in terms of where you can go for performance tuning assistance - because it needs some, unless you've already started down that path? I'm unaware of a single ZFS-capable O/S whose default tuneables for ZFS will well suit a high-IOPS VDI workload. The spinning disks are very likely underutilized.

  7. I am running Solaris 11 because the Nexenta resellers I reached out to were too busy to get back with me I guess because they never did. So I just started buying what made sense to me. If any of you out there are reading this.. look at what you missed. Sorry I wasn't interesting enough! I have 100 SAS 10K spindles, 2 Stec's, 2 DDRDrives, 2 256GB w/10Gbe Servers. Tried a 60 disk pool but someone told me it was too big, so now I have 22. Your nugget of vdev's are for I/O was worth the price of admission. I learned this, you can never have enough RAM, ever. All in all it's been such a letdown because of the $$ spent and the results achieved.

  8. Sorry they never got back to you. Doubly so since that precludes the option of contacting Nexenta to do a performance tuning engagement. :(

    Also sorry the performance has seemed underwhelming - this is one of the current problems with ZFS go-it-on-your-own, is that there's just such a dearth of good information out there on sizing, tuning, performance gotchas, etc - and the out of box ZFS experience at scale is quite bad. What information does exist is often in mailing lists, hidden amongst a lot of other, bad advice. I'm hoping to try to fix that as best I can with blog entries on here, but time I have to spend on this is erratic, and some of these topics are nearly impossible to address fully in a few paragraphs on a blog post, I'm afraid.

    60 disk is most assuredly not 'too big'. Average Nexenta deployment these days I'd say is probably around 96 disks per pool, or somewhere thereabouts. If you don't mind people poking around on the box via SSH (and it is in a place where that's possible), email me (nexseven@gmail.com) to work out login details, and I can try to find some off time to take a peek at it.

  9. I dropped you a note last weekend, but maybe you're on spring break like I have been. I was thinking of just adding another jbod of 24 disks to the existing pool, creating new zfs datasets, then copy the existing data to them to spread it around the new disks. Go from 22 to 44 spindles. The whole operation should only take a few hours. Currently when I do zpool iostat I see maybe 1-2k ops/s with a high of 4k. What I don't like is the time to clone VM's, the max MB/s I get is around 500-550 and doubling the spindles would double that .. correct?

    Also.. how many minutes/seconds should a RSF1 with 96 disks take to fail over? I am curious what I would expect.

  10. Oops, email lost in the clutter. I've responded.

    It would very likely double your IOPS count, but not potentially double your throughput count, since there's more bottleneck concerns to consider there. I assume you're using NFS -- you might (and it IS beta, so bear that in mind) be interested in this: http://nexentastor.org/boards/13/topics/9315 - we're in beta on the NFS VAAI plugin. I say that because you mentioned tasks like VM cloning and such, and NFS VAAI support could have a serious impact on certain VM image manipulation tasks in VMware when backed by NexentaStor. Possibly worth looking at (though again -- beta, probably not good for production, yet).

    The goal of RSF-1 is to fail over in the shortest safe time possible. I've seen failovers take under 20 seconds. That said, I've also seen them take over 4 minutes (which isn't bad when you put it in context -- at my last job, my Sun 7410 took *15 minutes* to fail over). There's a number of factors involved. Number of disks is one, number of datasets (zvols & filesystems) is another. In general I recommend people expect 60-120 seconds, which is why I have the blog post up on VM Timeouts and suggest at least 180 second timeout values everywhere (personally I use higher than even that, as I see no reason to go read-only when I know the SAN will come back *some day*).

  11. What about Zpool fragmentation? That seems to be another issue with ZFS that you don't see much discussion about. As your pools get older, they tend to get slower and slower because of the fragmentation, and in the case of a root filesystem on a zpool, that can even mean that you can't create a new swap device or dump device because there is no contiguous space left. Zpools really need a defrag utility. Today the only solution is to create a new pool and migrate all your data to it.

    A related issue is that there are no tools to even easily check the pool fragmentation. Locally, we estimate the fragmentation based on the output of "zdb -mm", but even that falls down when you have zpools that are using an "alternate root" (for example in a zone in a cluster). "zpool list" sees those pools fine, but zdb does not.

    Are you aware of any work being done on solutions to those issues?
