Thursday, April 4, 2013

Cables Matter

No, I don't mean brand, or color, or even connector-type zealotry. But when sizing solutions and working with customers and partners on new builds, I often find that little or no thought goes into the SAS cabling. I find this distressing. Let me tell you why.

Either no thought is put into it (a common mistake), or the only thought goes to cable redundancy, sometimes paired with JBOD and/or HBA redundancy (an even more common, if more forgivable, mistake). All of these are, of course, important (how important depends on the use case). I'm debating a blog post on sizing of solutions, and on redundancy, so I'll save talk of that for later. This is just a short (by my standards) post to explain something I often see completely overlooked: throughput, and just how much of it you actually have compared to what you think you have.

Anyone sizing a solution involving JBODs often does give some thought to throughput. Most people understand that each 'mini-SAS' cable (SFF-8086/8087/8088 connector style) carries 4 separate SAS paths, and most understand that if your entire solution is SAS, you'll get 3 Gbit/s out of each path, and if it is SAS-2, you'll get 6 Gbit/s out of each path. My first word of advice is to treat this much like many network administrators treat network connections: pretend you only get 80%. To keep it easy to remember, I just generally pretend that at best, a SAS-2 mini-SAS cable can do 2 GByte/s, and I'm rarely disappointed.
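A minimal sketch of that rule of thumb in Python. It assumes the 8b/10b line coding used by SAS and SAS-2 links (10 bits on the wire per 8 bits of payload) and applies the 80% planning factor on top of it; stacking the two that way is an assumption made here, as are all the names in the code.

```python
# Rough planning figure for one mini-SAS cable (4 lanes).
LANES_PER_CABLE = 4
ENCODING_EFFICIENCY = 8 / 10  # assumption: 8b/10b line coding (SAS, SAS-2)
PLANNING_FACTOR = 0.8         # the "pretend you only get 80%" rule of thumb


def cable_gbytes_per_s(lane_gbit_per_s, planning_factor=PLANNING_FACTOR):
    """Return a conservative GByte/s estimate for a 4-lane mini-SAS cable."""
    payload_gbit = LANES_PER_CABLE * lane_gbit_per_s * ENCODING_EFFICIENCY
    return payload_gbit / 8 * planning_factor  # 8 bits per byte


if __name__ == "__main__":
    for name, rate in (("SAS (3 Gbit/s)", 3.0), ("SAS-2 (6 Gbit/s)", 6.0)):
        print(f"{name}: ~{cable_gbytes_per_s(rate):.2f} GByte/s per cable")
```

Under these assumptions a SAS-2 cable comes out at roughly 1.92 GByte/s, which is where a round "2 GByte/s at best" lands, and a 3 Gbit/s cable at roughly half that.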

The thing I often see people forget, however, is that what's coming into your SAN/NAS is not what's going down to the drives. Let's take the easiest use case to understand (and generally the worst one to deal with): mirrors (RAID1/10). If 200 MB/s of data is coming into the SAN, all of it unique, how much is going to the drives if the pool is built from 2-disk mirror vdevs? Answer: more than 400 MB/s (why more than double? Easy. ZFS maintains metadata about each block, and that also has to go down to the disks). Suddenly that 2 GB/s SAS cable is only actually capable of sending less than 1 GB/s of unique data downstream.
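The mirror arithmetic can be sketched as below. The function names are hypothetical, and `metadata_overhead` is a placeholder fraction: the real metadata cost depends on the pool and workload, so it defaults to zero and the results are best read as lower bounds on disk-side traffic.

```python
def disk_side_mb_per_s(incoming_mb_per_s, mirror_copies=2, metadata_overhead=0.0):
    """Traffic sent down to the drives for a given rate of unique incoming data.

    metadata_overhead is an illustrative fraction (e.g. 0.02 for 2%),
    not a measured ZFS figure.
    """
    return incoming_mb_per_s * mirror_copies * (1 + metadata_overhead)


def max_unique_mb_per_s(cable_mb_per_s, mirror_copies=2, metadata_overhead=0.0):
    """Unique data rate that fits through a cable of the given capacity."""
    return cable_mb_per_s / (mirror_copies * (1 + metadata_overhead))


if __name__ == "__main__":
    print(disk_side_mb_per_s(200))    # 2-way mirror, metadata ignored
    print(max_unique_mb_per_s(2000))  # 2 GB/s cable, 2-way mirror
```

With the metadata term left at zero, 200 MB/s in becomes 400 MB/s down, and a 2 GB/s cable carries 1 GB/s of unique data; any non-zero overhead pushes the first number up and the second down.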

Ironically, while ZFS far (far, far, far) prefers mirror pools for IOPS-heavy use cases, mirroring has a significant downstream impact on throughput potential, especially if your build isn't taking this doubling into account in the design. Conversely, raidz1|2|3 vdevs lose much less. The only additional data that has to go down is the parity, which even in a raidz3 vdev balloons the data going down far less than mirrors do. So for raw throughput where the SAS cabling could become the bottleneck, raidz is a clear winner in terms of efficiency.
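To put rough numbers on the comparison, here is a simplified sketch. It assumes full-width stripe writes and ignores metadata, padding and small-block effects; the vdev widths are illustrative choices, not recommendations.

```python
def mirror_write_factor(copies):
    """Bytes written to disk per byte of unique data for an N-way mirror."""
    return float(copies)


def raidz_write_factor(disks_in_vdev, parity):
    """Bytes written per byte of unique data for a raidz vdev.

    Simplification: every write is a full-width stripe, so the factor is
    (data disks + parity disks) / data disks.
    """
    data_disks = disks_in_vdev - parity
    if data_disks < 1:
        raise ValueError("vdev needs at least one data disk")
    return disks_in_vdev / data_disks


if __name__ == "__main__":
    layouts = {
        "2-way mirror": mirror_write_factor(2),
        "raidz1, 5 disks": raidz_write_factor(5, 1),
        "raidz2, 8 disks": raidz_write_factor(8, 2),
        "raidz3, 11 disks": raidz_write_factor(11, 3),
    }
    for name, factor in layouts.items():
        print(f"{name}: {factor:.2f}x")
```

In this model the raidz layouts above stay between 1.25x and about 1.4x against the mirror's 2x. A raidz3 vdev only reaches the mirror's factor when it is as narrow as 6 disks, which is not how raidz3 is usually deployed.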

It isn't all bad news, though. Once you understand this bottleneck, you'll appreciate ZFS' built-in compression even more than you probably already did, because that compression happens before the data goes down to the disks, potentially having quite an impact on how much usable data can get down the paths per second. And while I almost always steer people away from it, if your use case does benefit strongly from deduplication, that also takes effect beforehand, massively reducing writes to disk if the dedupe ratio is high.
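A rough way to fold compression and dedupe into the same estimate is to divide by their ratios. Treating both as simple divisors is a simplification assumed here, and the ratios used are hypothetical; real ones depend entirely on the data.

```python
def disk_write_rate(incoming_mb_per_s, write_factor,
                    compress_ratio=1.0, dedup_ratio=1.0):
    """Estimate MB/s sent to the drives after compression and dedupe.

    write_factor   -- redundancy multiplier (e.g. 2.0 for a 2-way mirror)
    compress_ratio -- logical size / compressed size (1.0 = incompressible)
    dedup_ratio    -- logical blocks / unique blocks (1.0 = nothing shared)
    """
    if compress_ratio < 1.0 or dedup_ratio < 1.0:
        raise ValueError("ratios below 1.0 are not meaningful here")
    return incoming_mb_per_s * write_factor / (compress_ratio * dedup_ratio)


if __name__ == "__main__":
    # 200 MB/s into a 2-way mirror pool, with and without 2x compression.
    print(disk_write_rate(200, 2.0))
    print(disk_write_rate(200, 2.0, compress_ratio=2.0))
```

In this example, data that compresses 2:1 cancels out the mirror doubling: the drives see about as many bytes as arrived over the network.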

So in the end, my advice when building solutions utilizing any sort of SAS expansion is to bear in mind not just how much performance you want to get out of the pool (a number you often know), but how much that actually means in terms of data going to drives, and whether your cabling can even carry it all. I am seeing more and more boxes with multiple 10 Gbit NICs go out where a single SAS cable is the bottleneck, which will very likely make it impossible to fully utilize the incoming network bandwidth in a throughput situation, because even if the back-end disks could support it, the SAS cabling in between simply can't. That is OK if you're hoping for most of that network bandwidth to be ARC-served reads, but if you're expecting it to come to or from disks, remember this advice.
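As a quick sanity check for a planned build, the comparison can be sketched like this. The function and its parameters are hypothetical, network figures are raw line rate, and the 2 GByte/s per cable default is the planning number used earlier in this post.

```python
def sas_vs_network(nic_count, nic_gbit_per_s, sas_cables, write_factor,
                   cable_gbyte_per_s=2.0):
    """Compare incoming network bandwidth with what the SAS cabling can absorb.

    Returns (network GByte/s, unique GByte/s the cabling can carry,
    True if the cabling is the bottleneck for disk-bound writes).
    """
    network = nic_count * nic_gbit_per_s / 8  # raw line rate, 8 bits per byte
    sas_unique = sas_cables * cable_gbyte_per_s / write_factor
    return network, sas_unique, sas_unique < network


if __name__ == "__main__":
    # Two 10 Gbit NICs in front of a single cable feeding 2-way mirrors.
    network, sas_unique, bottleneck = sas_vs_network(2, 10, 1, 2.0)
    print(f"network in: {network} GByte/s")
    print(f"SAS can carry: {sas_unique} GByte/s of unique data")
    print("cabling is the bottleneck" if bottleneck else "cabling keeps up")
```

With these inputs the network can deliver 2.5 GByte/s while a single cable in front of mirrors carries about 1 GByte/s of unique data, so the cabling limits disk-bound writes well before the NICs do.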


  1. Do you mind if I ask;
    Is this a problem mainly for large JBOD systems, with multiple cascading units? I definitely see your point regarding wide mirrors/stripes vs RAIDZs

    Roughly how many 7200rpm drives can one write to over a SFF 3Gbps SAS/SATA cable?

    I have a little 15+spare OmniOS server, running 5 stripes of 3-way mirrors, and was going to move it from a mix of internal connections and a 8 disk, 2xSFF cable JBOD to a 16 disk JBOD with just the two SFF connections.

    How do you think a mixed 3/6Gbps JBOD would perform? Or would that depend on the HBA and JBOD backplanes.

    Awesome blog. Thanks


  2. Generally for larger systems, yes.

    How many drives can write over a mini-SAS cable depends on a couple of factors. Throughput-wise, the obvious answer is however many drives it takes to max out the 4 x 3 Gbps bandwidth of the cable (so, flat out, it could be as few as just 5 or 6 disks). Realistically, though, the question is better phrased as 'how many drives does it take to max out throughput at a given IOPS number'. If you're doing large-block sequential I/O, it'll take far fewer disks to max out throughput than if it's small-block, high-IOPS stuff.

    So for instance, a full 16-disk JBOD plugged in with two mini-SAS cables running at 3 Gbps is theoretically capable of around 2.4 GB/s of throughput (or 4.8 if it's all 6 Gbps gear), but realistically, under reasonable load, I'd not expect 16 drives to be trying to consume or send out more than 800 MB/s or so.

    Mixing 3/6 Gbps is a no-no, almost always, as most systems just end up dropping everything to 3 Gbps if even a single device plugged into it is a 3 Gbps max device.
