Saturday, March 23, 2013

Reservation & Ref Reservation - An Explanation (Attempt)

So in this article I'm going to try to answer a lot of the questions I get, and clear up the misconceptions I see, about ZFS and the space utilization of a pool. I'm not sure how well I'm going to do here - I've never found a holy grail way of explaining this that everyone understands. One approach works for one person, and a different one is necessary for the next guy, but let's give it a shot.

ZFS provides two methods of determining used and free space. The first is at a non-granular, pool-wide level, and the second is at a dataset level. There are some very big differences in how these two viewpoints behave. Pool-wide statistics are provided by the 'zpool' command, as evidenced by this example:


root@myhost:/# zpool list
NAME           SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
example-pool  3.94G   121K  3.94G     0%  1.00x  ONLINE  -

Of note here are 'SIZE', 'ALLOC', and 'FREE'. You'll notice my brand new, completely empty 'example-pool' is 3.94 GB in size, and has allocated 121 KB of space (metadata and such). Now let's create a filesystem.

root@myhost:/# zfs create example-pool/noquota-noreserve-filesystem
root@myhost:/# zpool list
NAME           SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
example-pool  3.94G   160K  3.94G     0%  1.00x  ONLINE  -
syspool       15.9G  3.15G  12.7G    19%  1.00x  ONLINE  -

Now we've got one empty filesystem on our pool - and, as some may expect, a few extra KB of allocated space (again, metadata and such). So let's see what the second viewpoint - the 'zfs' command - has to say. Let's concentrate just on the important stuff for now, with a command like this:

root@myhost:/# zfs list -o name,used,avail,refer,usedds,usedchild
NAME                                        USED  AVAIL  REFER  USEDDS  USEDCHILD
example-pool                                160K  3.88G    32K     32K       128K
example-pool/noquota-noreserve-filesystem    31K  3.88G    31K     31K          0

I apologize for all the fields - but believe it or not, they're all informative (eventually). As you can see, in addition to seeing the filesystem I created a moment ago, we also see the pool itself. Or do we? Do not be fooled, gentle reader, for the 'example-pool' entry you see in 'zfs list' is not exactly equivalent to the 'example-pool' you'll see in 'zpool list'. Where 'zpool list' is showing you the pools and pool-level statistics, 'zfs list' is only showing you datasets and dataset statistics. That's right, your pool is also a filesystem dataset (indeed, all filesystems created under it are children of it).

This is important to note, because it means that 'example-pool' in 'zfs list' is going to be factoring in the used space of all the filesystems on the pool. You would think this would be pretty much the same as 'zpool list', then, right? And indeed, in the above example, both 'zpool list' and 'zfs list' are showing a used (alloc) space of 160 KB. Where this will bite you is simple: 'zpool list' does not take into account anything but actual, physically allocated blocks of data on the disks -- while 'zfs list' takes not only those into account, but reservations as well. So what are reservations? Well, to answer that, let's first look at what ZFS' manpage has to say about them (and there are two of them, not one):

       reservation=size | none

           The  minimum  amount  of  space guaranteed to a dataset and its descendents. When the amount of space used is below this value, the dataset is treated as if it were taking up the amount of space specified by its reservation. Reservations are accounted for in the parent datasets' space used, and count against the parent datasets' quotas and reservations.

           This property can also be referred to by its shortened column name, reserv.

       refreservation=size | none

           The minimum amount of space guaranteed to a dataset, not including its descendents. When the amount of space used is below this value, the dataset is treated as if it were taking up  the  amount  of space specified by refreservation. The refreservation reservation is accounted for in the parent datasets' space used, and counts against the parent datasets' quotas and reservations.

           If refreservation is set, a snapshot is only allowed if there is enough free pool space outside of this reservation to accommodate the current number of "referenced" bytes in the dataset.

           This property can also be referred to by its shortened column name, refreserv.

Now, some people gloss over these definitions and don't notice that 'reservation' and 'refreservation' do not actually have identical first paragraphs. The key difference can be found in the very first sentence, where 'reservation' says "and its descendents" and 'refreservation' says "not including its descendents". This is key. A hint at how important this distinction is comes in the form of the extra paragraph you find in the 'refreservation' description, but we'll get back to that in a minute. First, an initial, simple example of both. Remember our pool? Let's create another dataset, but this time, specify a reservation (as opposed to a refreservation).

root@myhost:/# zfs create -o reserv=500M example-pool/noquota-reserv-filesystem
root@myhost:/# zfs list -o name,used,avail,refer,usedds,usedchild
NAME                                        USED  AVAIL  REFER  USEDDS  USEDCHILD
example-pool                                500M  3.39G    33K     33K       500M
example-pool/noquota-noreserve-filesystem    31K  3.39G    31K     31K          0
example-pool/noquota-reserv-filesystem       31K  3.88G    31K     31K          0

Now, the keen-eyed amongst you may have noticed that while this new filesystem looks pretty much like the other one we created, suddenly the root filesystem (example-pool) is reporting USED of 500M! What? But the new filesystem isn't using 500 MB! In a panic, we check 'zpool list' to verify our assumption:

root@myhost:/# zpool list
NAME           SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
example-pool  3.94G   394K  3.94G     0%  1.00x  ONLINE  -
syspool       15.9G  3.15G  12.7G    19%  1.00x  ONLINE  -

Whew. We're right - we're not suddenly using 500 MB of disk space. So why is 'zfs list' claiming we are? Are we? The short answer is -- no, you're not. And yes, you are. See, a reservation is us telling ZFS that the filesystem has 500 MB reserved for its use (and that of its children, since it's a reservation), and only for that filesystem and its children. This doesn't actually write 500 MB of data to the drive... but from the perspective of any other filesystem on that pool, it may as well have. Notice how our first filesystem has an available space of 3.39 GB (as does 'example-pool'), yet our new filesystem has 3.88 GB? That's 500 MB of difference (basically).

The reason for this discrepancy is simple - by reserving 500 MB, we've told ZFS that our new filesystem may use all of the pool, has 500 MB guaranteed for its use, and that any other filesystem (including our first one) cannot use that 500 MB of the pool. Now, this reservation is just that -- a reservation up to that amount. It isn't 'above and beyond' what we're actually using. If I put 50 MB of data into that second filesystem, 550 MB won't be unavailable to other filesystems -- it will still be only 500 MB. Let's test that assertion:

root@myhost:/# cd /volumes/example-pool/noquota-reserv-filesystem/
root@myhost:/volumes/example-pool/noquota-reserv-filesystem# dd if=/dev/zero of=50MB-testfile bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 0.218335 seconds, 240 MB/s
root@myhost:/volumes/example-pool/noquota-reserv-filesystem# zfs list -o name,used,avail,refer,usedds,usedchild
NAME                                        USED  AVAIL  REFER  USEDDS  USEDCHILD
example-pool                                500M  3.39G    33K     33K       500M
example-pool/noquota-noreserve-filesystem    31K  3.39G    31K     31K          0
example-pool/noquota-reserv-filesystem     50.0M  3.83G  50.0M   50.0M          0

See? Despite adding 50 MB to the new filesystem, the old filesystem (and root filesystem) still show 500 MB used and 3.39 GB available. They won't change, in fact, until we've put 501 MB or more into that second filesystem (at which point, the reservation is almost pointless, unless we delete things and go back below 500 MB used in the filesystem).
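The accounting at work here can be sketched with a little arithmetic. The following is a toy model of the behavior shown in the transcripts above, not ZFS's actual code: the parent is charged the larger of a child's actual usage and its reservation, and every other filesystem's AVAIL shrinks by the unused portion of that reservation.

```python
# Toy model of plain 'reservation' accounting, matching the behavior
# in the transcripts above. Illustrative only -- not ZFS's real code.

MB = 1024 * 1024

def charged_to_parent(child_used, reservation):
    """A child costs its parent the larger of its actual usage
    and its reservation."""
    return max(child_used, reservation)

def sibling_avail(pool_free, child_used, reservation):
    """Other filesystems lose the still-unused portion of the
    reservation from their available space."""
    return pool_free - max(reservation - child_used, 0)

# 50 MB of data inside a 500 MB reservation: the parent is still
# charged the full 500 MB, just as 'example-pool' shows USED of 500M.
print(charged_to_parent(50 * MB, 500 * MB) // MB)   # 500

# Only once usage passes 500 MB does the charge start to grow.
print(charged_to_parent(501 * MB, 500 * MB) // MB)  # 501

# With 4000 MB free and a 500 MB reservation holding only 50 MB of
# data, siblings see 450 MB less available space.
print(sibling_avail(4000 * MB, 50 * MB, 500 * MB) // MB)  # 3550
```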

So with our current layout, what if we add in a snapshot? What would that do?

root@myhost:/volumes/example-pool/noquota-reserv-filesystem# zfs list -t all -o name,used,avail,refer,usedds,usedchild
NAME                                                USED  AVAIL  REFER  USEDDS  USEDCHILD
example-pool                                        500M  3.39G    33K     33K       500M
example-pool/noquota-noreserve-filesystem            31K  3.39G    31K     31K          0
example-pool/noquota-reserv-filesystem             50.0M  3.83G  50.0M   50.0M          0
example-pool/noquota-reserv-filesystem@first-snap      0      -  50.0M       -          -

As expected - nothing, really. Still 500 MB used and 3.39 GB available on the pool. I'm not trying to explain how snapshots work in this post, only how reservations can affect them - so let's move on. Time for refreservations! Let's make a new filesystem, this time with a 2 GB refreservation, and see what we get:

root@myhost:/# zfs create -o refreserv=2G example-pool/noquota-refreserv-filesystem
root@myhost:/# zfs list -t all -o name,used,avail,refer,usedds,usedchild
NAME                                                USED  AVAIL  REFER  USEDDS  USEDCHILD
example-pool                                       2.49G  1.39G    35K     35K      2.49G
example-pool/noquota-noreserve-filesystem            31K  1.39G    31K     31K          0
example-pool/noquota-refreserv-filesystem             2G  3.39G    31K     31K          0
example-pool/noquota-reserv-filesystem             50.0M  1.83G  50.0M   50.0M          0
example-pool/noquota-reserv-filesystem@first-snap      0      -  50.0M       -          -

So this is interesting - when we made a filesystem with a 500 MB reservation, it ate up 500 MB of the root filesystem's availability, but the 'USED' on the filesystem itself was nil until we added a file. Yet now we've made a new filesystem with a 2 GB refreservation, and the 'USED' on the actual filesystem is 2 GB! Once again in a panic, we look at 'zpool list':

root@myhost:/# zpool list
NAME           SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
example-pool  3.94G  50.5M  3.89G     1%  1.00x  ONLINE  -
syspool       15.9G  3.15G  12.7G    19%  1.00x  ONLINE  -

Nope -- we're still only using 50 MB, from that one file we made. Whew. OK. So does it act differently? If we make a 1.5 GB file in there, is it going to show 3.5 GB used? Let's see:

root@myhost:/# cd /volumes/example-pool/noquota-refreserv-filesystem/
root@myhost:/volumes/example-pool/noquota-refreserv-filesystem# dd if=/dev/zero of=1.5Gtest bs=1M count=1536
1536+0 records in
1536+0 records out
1610612736 bytes (1.6 GB) copied, 16.6138 seconds, 96.9 MB/s
root@myhost:/volumes/example-pool/noquota-refreserv-filesystem# zfs list -t all -o name,used,avail,refer,usedds,usedchild
NAME                                                USED  AVAIL  REFER  USEDDS  USEDCHILD
example-pool                                       2.49G  1.39G    35K     35K      2.49G
example-pool/noquota-noreserve-filesystem            31K  1.39G    31K     31K          0
example-pool/noquota-refreserv-filesystem             2G  1.89G  1.50G   1.50G          0
example-pool/noquota-reserv-filesystem             50.0M  1.83G  50.0M   50.0M          0
example-pool/noquota-reserv-filesystem@first-snap      0      -  50.0M       -          -

Hrm, nope. Just as with a reservation, the 'refreserv's 2 GB isn't 'above and beyond' actual usage. As you can see, USEDDS (used by dataset) and REFER have gone up to 1.5 GB on that new filesystem, but the USED remains at 2 GB. So unlike 'reserv', 'refreserv' visually displays the refreservation in USED, but in terms of functionality, it's still just a reservation, right? Well, mostly - but the reason for the cosmetic difference, and the gotcha of 'reserv' vs 'refreserv', now comes into play as I take a snapshot of this new filesystem. Let's see what happens:
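The USED figure can be sketched the same way. As a toy model (an illustration of the behavior, not ZFS internals): the dataset's USED is its referenced data plus whatever part of the refreservation is still unused - the part ZFS reports in the 'usedbyrefreservation' property.

```python
# Toy model of why a 'refreservation' appears in the dataset's own
# USED column. Illustrative only -- not ZFS's real code.

GB = 1024 ** 3

def used_with_refreserv(referenced, refreservation):
    """USED = actual referenced data plus the still-unused part of
    the refreservation (which ZFS reports as 'usedbyrefreservation')."""
    return referenced + max(refreservation - referenced, 0)

# Freshly created with a 2 GB refreservation: USED is already 2 GB.
print(used_with_refreserv(0, 2 * GB) // GB)            # 2

# After writing 1.5 GB: still 2 GB (1.5 GB data + 0.5 GB held back).
print(used_with_refreserv(3 * GB // 2, 2 * GB) // GB)  # 2
```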

root@myhost:/volumes/example-pool/noquota-refreserv-filesystem# zfs snapshot example-pool/noquota-refreserv-filesystem@second-snap
cannot create snapshot 'example-pool/noquota-refreserv-filesystem@second-snap': out of space

What?! Out of space?? No I'm not! Quick, check 'zpool list'!

root@myhost:/volumes/example-pool/noquota-refreserv-filesystem# zpool list
NAME           SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
example-pool  3.94G  1.55G  2.39G    39%  1.00x  ONLINE  -
syspool       15.9G  3.14G  12.7G    19%  1.00x  ONLINE  -

See?! I have 2.39 GB free! Heck, even looking at the 'zfs list' we did a moment ago, I have 1.39 GB available in the root 'example-pool'! Why can't I take a snapshot? If this were a reservation instead of a refreservation, you would be able to take that snapshot. However, this is a refreservation. It is a reservation on the dataset NOT including its children. It is a guarantee that, under absolutely no circumstances - not even because of snapshots on this filesystem - will there ever be less than 2 GB of space available on 'example-pool' for this filesystem.

See, I have 1.5 GB of referenced data (REFER) in 'example-pool/noquota-refreserv-filesystem', and 1.39 GB of available space in 'example-pool' right now - and 1.5 is greater than 1.39. The second I make a snapshot of this filesystem, all the existing data becomes referenced by the snapshot as well, so I need 1.5 GB of free space outside of my 2 GB reservation, because I must still be able to meet my 2 GB commitment. Since I don't have it, the snapshot attempt fails.
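That check can be written out as a one-liner. The following is a sketch of the rule the manpage excerpt states (not ZFS's actual implementation): with a refreservation set, a snapshot is allowed only if the pool has enough free space outside the reservation to cover everything currently referenced.

```python
# Sketch of the snapshot admission check described in the
# refreservation manpage excerpt above. Illustrative only.

GB = 1024 ** 3

def snapshot_allowed(free_outside_reservation, referenced):
    """With a refreservation set, the currently referenced bytes must
    fit in free space outside the reservation, since the snapshot
    could pin all of them while the refreservation is rewritten."""
    return free_outside_reservation >= referenced

# 1.39 GB free vs. 1.5 GB referenced: refused, as in the transcript.
print(snapshot_allowed(int(1.39 * GB), int(1.5 * GB)))  # False
```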

Now, let's be clear here - when you take this snapshot, you're not adding 1.5 GB of new data to the pool. All that happens right away is that the snapshot begins referencing the existing 1.5 GB file - there's no new space usage. But the whole point of a 'refreservation' is to protect the filesystem it is set on from running out of space (up to the refreservation) for ANY REASON, so ZFS has to assume that you might delete that 1.5 GB file and write a new one, all while the snapshot is still there retaining a link to all the blocks of the original file.

I've seen many users hit this 'no space' message - or even worse, have just barely enough space in their pool to take that snapshot, only to watch all their other datasets quickly start running out of space, even though the pool may have had tons of space left from a physical perspective. A proper understanding of 'refreserv' would have saved them a lot of headache.

So given how potentially dangerous 'refreserv' can be from an administrative perspective, why does ZFS have it, and not just 'reserv'? There are probably a few reasons, but let me give you one (which I believe to be the main one) - zvols. 

The most common use of a zvol is as a backend 'device' shared up over iSCSI to some client that formats it with a filesystem and treats it like any other hard drive. That client has a few expectations about that drive -- it expects that the entirety of the drive will be available to it, and that if it thinks it is 40% full, it isn't suddenly going to get an 'out of space' error when it goes to write. Most clients really don't behave well when their disk starts claiming it's out of space when it shouldn't be. So 'refreservation' is provided by ZFS to the admin as a method of making guarantees about available space above and beyond what 'reservation' offers.

See, with a simple 'reservation' (or with nothing at all), a zvol of 50 GB could eat up 50 GB and then, because of snapshots, eat up even more. If I were doing something crazy on the zvol that completely changed it every day, and was keeping 7 days of snapshots, I'd realistically end up with over 350 GB of used space on my pool between the base dataset and the snapshots. What if my pool only had 200 GB of total space? I'd be out of luck -- and I might very well end up sending 'out of space' errors to my client. But if that same zvol had a 'refreserv' of 50 GB, then it is my snapshot creation attempts that would fail, and at no time would my zvol ever be at risk of running out of its 50 GB of needed space.
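The arithmetic in that scenario is worth writing down. A toy worst-case estimate (illustrative only, assuming the zvol is completely rewritten between snapshots):

```python
# Worst-case space for a fully-churned zvol with snapshots: each
# snapshot can pin one full copy of the volume, plus the live copy.
# Back-of-the-envelope arithmetic only.

def worst_case_gb(volsize_gb, snapshots_kept):
    return volsize_gb * (snapshots_kept + 1)

# A 50 GB zvol with 7 daily snapshots and total daily churn:
print(worst_case_gb(50, 7))  # 400 -- comfortably over 350 GB
```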

By default, when you create a zvol in ZFS, if you don't specify sparse (with the -s command line option), ZFS will make it a 'thick provisioned' zvol. All that actually means is that ZFS sets a 'refreservation' equivalent to the 'volsize' you specified for the zvol. Create one some time and take a look for yourself. The fact is, if you're not planning on making snapshots, refreservations are a very sane way of not only guaranteeing space to your clients, but of easily keeping yourself from overprovisioning your pool. If you want to skip them, don't set a refreservation -- but be warned: if you do so, the onus is now on you as the administrator to keep a close eye on pool utilization and take action before it gets too full.
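The thick-vs-sparse default described above boils down to a simple rule; here's a sketch of it (illustrative only -- run 'zfs get refreservation' against a real zvol to see it for yourself):

```python
# Sketch of the thick-vs-sparse provisioning rule described above:
# a normal ('thick') zvol gets a refreservation equal to volsize,
# while a sparse (-s) zvol gets none. Illustrative, not ZFS source.

def default_refreservation(volsize_gb, sparse=False):
    return None if sparse else volsize_gb

print(default_refreservation(50))               # 50
print(default_refreservation(50, sparse=True))  # None
```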

So there you have it - thousands of words to try to explain the difference between a 'reservation' and a 'refreservation', for your reading pleasure.
