Wednesday, October 9, 2013

Lots of L2ARC not a good idea, for now

Before I get into it, let me say that the majority of the information below and analysis of it comes from one of my partners in crime (coworker at Nexenta), Kirill (@kdavyd). All credit for figuring out what was causing the long export times, and reproducing the issue internally (it was initially observed at a customer) after we cobbled together enough equipment, and analysis to date, goes to him. I'm just the messenger, here.

Here's the short version, in the form of the output from a test machine (emphasis is mine):

root@gold:/volumes# time zpool export l2test

real    38m57.023s
user    0m0.003s
sys     15m16.519s

That was a test pool with a couple of TB of L2ARC, filled up, with an average block size on the pool of 4K (a worst-case scenario) -- and then simply exported, as shown. If you are presently using a pool with a significant amount of L2ARC (hundreds of GB, or TBs), (update:) and you have multiple (3+) cache vdevs, and you ever need to export that pool within a time limit so you can import it again and maintain service availability, you should read on. Otherwise this doesn't really affect you. In either case, it should be fixable long-term, so it's probably not a panic-worthy piece of information for most people. :)

If you want the workaround, by the way, it is to shoot the box in the head -- panic it, power it off hard, etc, to export the pool. This is obviously very dangerous if you do not have a proper setup with an actually utilized ZIL (preferably through nice log devices) and all your important data arriving synchronously, or if you have other pools on the system you don't want to lose access to as well. Unfortunately if either of those is the case, the only other workaround would be to remove the cache vdevs from the pool, wait out the eviction stuff, and only then export the pool. Better than just exporting it and waiting, since you'd be online while you removed the cache vdevs, but fairly time consuming still.

Here's your culprit, where we sit for potentially minutes, or even hours if you had a sufficiently 'bad' environment for this issue:

PC: _resume_from_idle+0xf1    CMD: zpool export l2test
  stack pointer for thread ffffff4b94b44180: ffffff022cdd9820
  [ ffffff022cdd9820 _resume_from_idle+0xf1() ]

Analysis is ongoing, but it appears to get much worse the more L2ARC you have, the bigger the individual L2ARC devices are, the slower the clock speed of your CPU (as this is a single-CPU-bound task), and the smaller your average block size (as that affects how many l2hdr entries you have).
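
To put the block-size factor in perspective, here is a rough Python sketch of how many l2hdr entries a full L2ARC implies. It assumes one header per cached block and ignores compression and metadata. The 2 TiB figure is an assumed reading of "a couple of TB", and l2hdr_count is a made-up helper name, not anything in ZFS.

```python
# Rough estimate only: assumes one l2hdr entry per cached block.
TIB = 1024 ** 4
KIB = 1024


def l2hdr_count(l2arc_bytes: int, avg_block_bytes: int) -> int:
    """Approximate number of L2ARC headers for a completely full cache."""
    return l2arc_bytes // avg_block_bytes


if __name__ == "__main__":
    l2arc_size = 2 * TIB  # assumption: "a couple of TB" taken as 2 TiB
    for block_kib in (4, 8, 128):
        n = l2hdr_count(l2arc_size, block_kib * KIB)
        print(f"{block_kib:>3}K average block size: {n:,} headers")
```

With 4K blocks that works out to roughly half a billion entries to walk at export time, versus under 17 million at 128K, which is why small blocks are the worst case.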

This should be relatively easy to fix, in theory -- ARC evictions were already made asynchronous, but it looks like this wasn't done for L2ARC as well. If it is, this should be less of an issue.

Update 10/10/13:
Well, it appears there is code for an async l2arc eviction, and it's in play:

root@gold:/volumes# echo zfs_l2arc_async_evict::print | mdb -k
0x1 (B_TRUE)

So the investigation continues as to why this is happening anyway.

Update 10/10/13:
Kirill continues to investigate, and believes he's pinned the problem down to the async l2arc eviction not always actually being asynchronous. In his testing, much of it is done asynchronously, but the minute there are more than X evictions going, the next one gets done synchronously. This is tied to the number of cache vdevs you have. And you cannot export until all the data they referenced is cleared, so the export won't finish until the only remaining tasks are in process (and are asynchronous, otherwise it'll have to wait them out). This explains why the export did complete before this was all actually through -- ongoing tasks were asynchronous and there were no more outstanding. Seemingly there is a 'maxalloc' being set on taskq_create by this line:

 arc_flush_taskq = taskq_create("arc_flush_tq",
     max_ncpus, minclsyspri, 1, 4, TASKQ_DYNAMIC);

The '4' is the culprit here. In Kirill's words: "Probably at the beginning of export, we are flushing ARC as well, so 2 threads are already busy, leaving us with only 2 for L2ARC. It is evicting asynchronously, just not all drives, because taskq isn’t big enough, since someone probably didn’t consider that there may be more than 2 L2ARC’s on a system :)", then adds, "Also explains why I couldn’t reproduce it with a single 1TB L2ARC - looks like you need at least three."
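
Kirill's explanation can be modeled with a toy Python sketch. This is not how the kernel taskq is implemented. It simply assumes, per the description above, that the queue has 4 slots, that ARC flushing already occupies 2 of them, and that any cache vdev eviction that cannot get a slot runs synchronously and holds up the export. plan_evictions and its parameters are hypothetical names.

```python
# Toy model of the hypothesis above, not the real taskq logic.
from typing import Tuple


def plan_evictions(cache_vdevs: int, maxalloc: int = 4, busy: int = 2) -> Tuple[int, int]:
    """Return (async_count, sync_count) for a given number of cache vdevs.

    maxalloc: assumed slot limit (the '4' passed to taskq_create).
    busy:     slots assumed to be already taken by ARC flushing.
    """
    free_slots = max(maxalloc - busy, 0)
    async_count = min(cache_vdevs, free_slots)
    sync_count = cache_vdevs - async_count  # these would block the export
    return async_count, sync_count


if __name__ == "__main__":
    for n in (1, 2, 3, 8):
        a, s = plan_evictions(n)
        print(f"{n} cache vdev(s): {a} async, {s} sync")
```

Under these assumptions, one or two cache vdevs are evicted entirely asynchronously, while the third is the first to run synchronously, which lines up with the "at least three" observation.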

So the major takeaway here is that the original correlation of export time with the size of the L2ARC and the size of the individual L2ARC vdevs only matters if you hit this 'bug' in the first place. Whether you hit it doesn't depend on the size of your L2ARC or its vdevs; it is basically the number of them. Seemingly, in most situations you'll need at least 3 cache vdevs in the pool to ever run into this.

Update 10/14/13:
So apparently the arc_flush_taskq and the asynchronous ARC & L2ARC eviction code are all only in NexentaStor at the moment, having not (yet? I cannot publicly state our intent with this code, I have truly not been informed internally of what it is) been pushed back to illumos. Fortunately this means that on Nexenta 3.x machines, you'll only hit this long export time problem if you have more than 2 L2ARC devices in a pool you're trying to export, in general, and also the amount of ARC related to the pool shouldn't affect export times, either. Long-term we'll of course move to fix this issue, likely by increasing the limit above 4, but only after more thorough analysis internally.

Unfortunately it seems to mean that if you're not running Nexenta 3.x, and are on an older version or using an alternative illumos-derivative, you may very well be susceptible to a long export time even with just a lot of small block data in ARC, as well as any number of L2ARC devices including just 1. I do not have a box of sufficient size with a non-Nexenta illumos distro on it at the moment to do any testing, I'm afraid.


  1. Thanks for the post. We are definitely hitting this bug with 3 SSD drives for L2ARC. Our failover time is typically about 20-25 minutes to export the pool, with 8TB in use (pool size of 30TB), and swing it over to the other head node. It looks like I'll be removing one of the L2ARC drives before we manually swing over (upgrade, etc.) for now.

  2. Scott, if you have SLOG devices in your pools and all your datasets have sync set properly (zvols too) then it is perfectly safe to panic one node via uadmin or reboot -p.

  3. I've been dealing with this problem for a couple years now. I run OmniOS so I'm not benefiting from the work Nexenta has done on this. My systems are all HA with RSF-1.

    My procedure for doing maintenance is to first set sync=always on the pool I need to export. I then panic the system and the other head takes over the pool. When I'm finished I set sync=standard back on the pool. This has worked many times for me without issue.

    One thing I would like to point out is that removing the cache SSDs from the pool is not helpful here. As soon as you execute "zpool remove tank {cache_ssd}" you get into a blocking situation that can also last a ridiculous amount of time. On one pool I have with 8 400GB SSDs, the first remove took 12 minutes, and it blocked I/O the entire time. That was 12 minutes of downtime I was not planning on dealing with.

    Because of this issue, I prefer to build my high performance systems with more RAM and less L2ARC.

    Hopefully Nexenta will release their work on this to Illumos soon.
