Nex7's Blog: ZFS Intent Log

Tuesday, April 2, 2013

ZFS Intent Log

[edited 11/22/2013 to modify formula]

The ZFS Intent Log gets a lot of attention, and unfortunately often the information being posted on various forums and blogs and so on is misinformed or makes assumptions about the knowledge level of the reader that if incorrect can lead to danger. Since my ZIL page on the old site is gone now, let me try to reconstruct the knowledge a bit in this post. I'm hesitant to post this - I've written the below and.. it is long. I tend to get a bit wordy, but it is also a subject with a lot of information to consider. Grab a drink and take your time here, and since this is on Blogger now, comments are open so you can ask questions.

If you don't want to read through this entire post, and you are worried about losing in-flight data due to things like a power loss event on the ZFS box, follow these rules:

Get a dedicated log device - it should be a very low-latency device, such as a STEC ZeusRAM or an SLC SSD, but even a high quality MLC SSD is better than leaving log traffic on the data vdevs (which is where they'll go without log devices in the pool). It should be at least a little larger than this formula, if you want to prevent any possible chance of overrunning the size of your slog: (maximum possible incoming write traffic in GB * seconds between transaction group commits * 3). Make it much larger if its an SSD, and much much larger if its an MLC SSD - the size will help with longevity. Oh, and seconds between transaction group commits is the ZFS tunable zfs_txg_timeout. Default in older distributions is 30 seconds, newer is 5, with even newer probably going to 10. It is worth noting that if your rarely if ever have heavy write workloads, you may not have to size it as large -- it is very preferably from a performance perspective that you not be regularly filling the slog, but if you do it rarely it's no big deal. So if your average writes in [txg_timeout * 3] is only 1 GB, then you probably only need 1 GB of log space, and just understand when you rarely overfill it there will be a performance impact for a short period of time while the heavy write load continues. [edited 11/22/2013 - also, as a note, this logic only applies on ZFS versions that still use the older write code -- newer versions will have the new write mechanics and I will update this again with info on that when I have it]
(optional but strongly preferred) Get a second dedicated log device (of the exact same type as the first), and when creating the log vdev, specify it as a mirror of the two. This will protect you from nasty edge cases.
Disable 'writeback cache' on every LU you create from a zvol, that has data you don't want to lose in-flight transactions for.
Set sync=always on the pool itself, and do not override the setting on any dataset you care about data integrity on (but feel free TO override the setting to sync=disabled on datasets where you know loss of in-transit data will be unimportant, easily recoverable, and/or not worth the cost associated with making it safe; thus freeing up I/O on your log devices to handle actually important incoming data).

Alright, on with the words.

It is important to, first and foremost, clear up a common misconception I see about the ZIL. It is not a write cache. There is no caching of any kind going on in the ZIL. The ZIL's purpose is not to provide you with a write cache. The ZIL's purpose is to protect you from data loss. It is necessary because the actual ZFS write cache, which is not the ZIL, is handled by system RAM, and RAM is volatile.

ZFS absolutely caches writes (usually) - incoming writes are held in RAM and, with a few notable exceptions, only written to disk during transaction group commits, which happen every N seconds. However, that isn't the ZIL. The ZIL is invoked when the incoming write meets certain requirements (most notably, something has tagged it as being a synchronous request), and overrides the 'put in RAM and respond to client that data is written' normal flow of asynchronous data in ZFS to be 'put in RAM, then put on stable media, and only once it is on stable media respond to client that data is written'.

One of the most common performance problems people run into with ZFS is not understanding ZIL mechanics. This comes about because, on every distribution I'm aware of, the default ZFS setup is that the ZIL is enabled -- and if there are no dedicated log devices configured on a pool, the ZIL will use a small portion of the data drives themselves to handle the log traffic. This workload is terrible on spinning media - it is single queue depth random sync write with cache flush - something spinning disks are terrible at. This leads to not only a noticeable performance problem for clients on writes, it has a very disruptive effect on the spinning media's ability to handle normal read requests and normal transaction group commits.

It is just all around a less than stellar situation to be in, and one that any ZFS appliance doing any significant traffic load is going to end up getting bit by (home users often do not - I run a number of boxes at home off a ZFS device with no dedicated log, and it is fine - I simply do not usually do enough I/O for it to be an issue).

So, enter the 'log' vdev type in ZFS. You can specify multiple 'log' virtual devices on a ZFS pool, containing one or more physical devices, just like a data vdev - you can even mirror them (and that's often a good idea). When ZFS sees that an incoming write to a pool is going to a pool with a log device, and that the rules surrounding usage of the ZIL are triggered and the write needs to go into the ZIL, ZFS will use these log virtual devices in a round-robin fashion to handle that write, as opposed to the normal data vdevs.

This has a double win for performance. First, you've just offloaded the ZIL traffic from the data vdevs. Second, your write response times (write latency) to clients will drop considerably not only because you're no longer using media that is being contended by multiple workflows, but because any sane person uses an SSD or non-volatile RAM-based device for log vdevs.

As a minor third benefit, by the way, you might see an additional overall improvement because the lower latency allows for more incoming writes, which has itself two potential performance improvements: one, it means that if the data being written happens to be such that it is writing to the same block multiple times within a single transaction group commit (txg), the txg only has to write the final state of the block to spinning media instead of all the intermediary updates; and second, the increased ability to send sync writes at the pool may mean better utilization of existing pool assets than was possible before (the pool might have been starved for writes, even though the client could send more, as the client was waiting on response before sending more and the pool was slow to send that response because the latency for writes was too high). However, these two benefits are very reliant on the specific environment and workload.

So, to summarize so far, you've got a ZFS Intent Log that is very similar to the log a SQL database uses, write once and forget (unless something bad happens), and you've got an in-RAM write cache and transaction group commits that handle actually writing to the data vdevs (and by the by, the txg commits are sequential, so all your random write traffic that came in between commits is sequential when it hits disk). The write cache is volatile as its in RAM, so the ZIL is in place to store the synchronous writes on stable media to restore from if things go south.

If you've gotten this far, you may have noticed I've kept hedging and saying 'synchronous' and such. This is important. ZFS is making some assumptions about data retention based on how the incoming writes look that many typical users just don't realize it is doing, and are often bitten quite hard because of it. I have seen thousands of ZFS boxes that are in danger of data loss.

The reason is that they are unaware that their clients are not sending data in a manner that triggers the ZIL, and as such, the incoming writes are only going into RAM, where they sit until the next txg commit - some day, when the box inevitably has an issue resulting in power loss, they're going to lose data. The severity of this data loss is directly tied to the workload they're putting on the server. It is extremely common for those environments I see where they're in danger to be utilizing things like iSCSI to provide virtual hard disks to VM's, and this is one of the worst environments to lose a couple seconds of write data in, as that write data is potentially critically important metadata for a filesystem sitting on top of a zvol, that when lost, corrupts the whole thing.

So first, let's talk about what gets you into the ZIL, today. This is pretty complicated, because there's essentially a number of ways that ZFS can handle an incoming write. Note first of all that as far as I'm aware, all incoming writes will be stored in RAM while the transaction group is open or committing to disk (I haven't been able to fully verify this yet), even when they're instantly put on final data vdev (thus, a read on this data should come from RAM). Aside from that, however, any of the following could happen:

Write the data immediately to the log (ZIL) and respond to client OK. Data will be written from RAM to disk during next txg commit, normally. Data in log is only for restoration if power is lost.
Write the data immediately to the data vdevs and store a pointer to the new block in the log (ZIL) then respond to client OK. Pointer to data block in log is used only on recovery if power is lost. On txg commit, just update metadata to point at the already-written new block (the data block itself won't be rewritten on txg commit, merely actually made part of the pool; prior to that, it's not actually referenced by the ZFS pool aside from the pointer in the ZIL).
Write the data immediately to the data vdevs - nothing is written to the log device as this is a full write complete with metadata update, etc - then respond to client OK.

What can lead to these 3 types of workflow is a combination of a number of variables and the characteristics of both the incoming write and the total open transaction group. Sufficed to say, these variables are important:

logbias setting on the dataset being written to
zfs_immediate_write_sz
zil_slog_limit
Existence of a log device (method ZFS uses to handle writes will take into account rather the ZIL is on a log or not - it has major effects on the choice of mode used to deal with incoming data)
The incoming data has been, in one way or another, tagged as synchronous.

That last bold bullet point is key. None of all of the above stuff matters, and the incoming data will be stored solely in RAM, no ZIL mechanics in play, and written only to disk as part of the upcoming transaction group commit, if the incoming data is considered asynchronous. So. What can make you synchronous? Any of the following.

The dataset being written into is sync=always. The incoming block could even be specifically called ASYNC in some way, and it won't matter, ZFS is going to treat it as a synchronous write.
The dataset being written into is sync=standard and the block was written with a form of SYNC flag set on it.

The sync=standard setting is the default, and important data should be sent SYNC, right? So, surely all your important data is already being set with one of the above, right? Wrong. Different protocols specify sync or honor (or don't) client sync requests in different ways. Different layers in the stack between the client and the ZFS pool may alter a request to be sync or to disregard a sync request. And of course, ZFS itself may choose to interpret the incoming write as sync or async disregarding client semantics.

NFS - out of the box, most NFS traffic should be properly set as sync; specifying 'sync' (or the OS equivalent on your platform) on the mount command will guarantee this, specifying 'async' will likely ruin this and lead to most or all of the traffic from that mount not utilize the ZIL
CIFS/SMB - somewhat dependent on client - check with it to see what its defaults are
iSCSI - default is async, and very dangerously, some intermediary layers commonly found in an iSCSI setup will disregard sync requests from clients - notably, some hypervisor intermediary layers, where the hypervisor is responsible for iSCSI and the VM only sees the disk as presented by the hypervisor may be requesting O_SYNC inside the VM, but the hypervisor is disregarding that based on settings, and the request is sent to ZFS without sync set
Local box - this is to say, you're doing tasks directly on the box running the zpool - usually this is going to be asynchronous unless the application has intentionally requested sync writes (some things will, depending on settings, like *SQL databases for example). Generally speaking, however, it will be asynchronous from a client perspective.

If you've got data you want to be sure is being set properly to sync, how you guarantee this is a factor of rather you care about granularity. If you want every last bit of data being written to be sync (as you very often do when you have a dedicated log device, and even more so when the clients are, say, virtual machines using the storage as their primary disks), make sure all your datasets have sync unset (eg: being inherited from parent) and set sync=always on the pool itself. This is a quick and easy way that should guarantee data integrity.

It may seem counter-intuitive, but sometimes, data integrity is trumped by the cost of delivering it. Labs, non-production use-cases, and so on are obvious, but even other times, it is perhaps not important enough to warrant the ongoing performance cost, not to mention the up-front cost of hardware to support it.

Take, for instance, the aforementioned virtual machine host use-case. The VM's in question may be important, but a good backup system may be in place, the services they offer unlikely to be severely impacted by the loss of a few minutes of data, or even services that essentially do not change, meaning a restore of a backup from the prior day would work just as well.

If the restoration process only takes an hour, and the time the VM can be offline before it is important is longer, and the VM, once restored, would be of sufficient service (even having lost some amount of recent data), then the costs involved in delivering fully ZIL-backed-up ZFS storage underneath the VM may be higher than they are worth.

The only time having a ZIL matters is if the ZFS server itself loses power, and once it has been restored, the only data lost would be in-transit data (so, at best, 1 or 2 transaction group commits worth of seconds of data). In most file server situations, you won't have any issue other than recently updated or in-the-process-of-being-updated files would be affected at all. In situations where the storage is hosting things like virtual hard disks for VM's, the filesystems on top of those virtual disks (be they zvols or files within an NFS share) may experience some level of loss.

Depending on the filesystem sitting on top of those zvols or vhd files, and what was in transit at the time, the damage may be negligible. I've seen VM's come back up without a single warning, and when they do, the very common scenario is that the filesystem merely complains and needs to be fsck'd or chkdsk's, and the data lost is zero or not noticeably important (last few seconds of a log file, for instance).

I'm not suggesting that data integrity is unimportant - but it is worth looking at the overall environment before deciding that the storage in question truly requires ZIL mechanics to keep from losing a few seconds of data. In many environments, it doesn't. Also remember that in such environments, you don't have to go all or none - if you set sync=always on the datasets that matter, and intentionally set sync=disabled on datasets where it does not, a single pool can fulfill both sorts of situations. ZFS itself should (barring serious hardware problems) never have a problem; rather the data in the dataset was ZIL-backed or not, ZFS itself is, due to its atomic nature, always fine after the power is restored - it cannot by its design require a 'fsck'.

In closing, I'd also like to make another point - if you use a log device, and properly configure the pool or your clients to send all important data in such a manner that it makes use of the ZIL, and ZFS' own built-in integrity saves you from almost any disaster.. why would you need backups? Answer: because you need backups! Pools CAN be corrupted beyond reasonable recovery (there are a few very gifted ZFS experts out there willing to help you, but their time is precious, and your data may not be worth enough to afford their rates), and perhaps more importantly, the data on the pool can be destroyed in oh so many ways, some of which are flat out unrecoverable.

Accidental rm -rf or intentionally even? Hacker? Exploit? Client goes nuts and spews bad data and you didn't notice and didn't have a snapshot pre-crazy (or, even if you did, no easy way to recover from it due to environment)? SAN itself explodes? Is melted? Is shot by Gatling gun? Controller goes nuts and spews bad data at disks for hours while you're on vacation?

It is a simple fact of IT sanity that a comprehensive backup strategy (one that handles not only backing up the data, but making it quick and easy to restore as well) is a necessity for any production infrastructure you put up. Since this is a fact, and you are going to do it or rue the day you chose otherwise, you should probably remember that because you have it, you might not actually need a log device nor even ZIL mechanics, at least on some of your datasets (and every dataset you set sync=disabled on is another dataset with a bit more ZIL IOPS available for it to use instead). Carefully weigh risk and potential damage caused by loss of in-flight data as well as time to restore and how critical the service is before determining if ZIL mechanics are necessary.