Wednesday, October 9, 2013

Lots of L2ARC not a good idea, for now

Before I get into it, let me say that the majority of the information and analysis below comes from one of my partners in crime (coworker at Nexenta), Kirill (@kdavyd). All credit goes to him for figuring out what was causing the long export times, for reproducing the issue internally (it was initially observed at a customer) after we cobbled together enough equipment, and for the analysis to date. I'm just the messenger here.

Here's the short version, in the form of the output from a test machine (emphasis is mine):

root@gold:/volumes# time zpool export l2test

real    38m57.023s
user    0m0.003s
sys     15m16.519s

That was a test pool with a couple of TB of L2ARC, filled up, with an average block size on the pool of 4K (a worst-case scenario) -- and then simply exported, as shown. If you are presently using a pool with a significant amount of L2ARC (hundreds of GB or TB), (update:) have multiple (3+) cache vdevs, and ever need to export that pool within a time limit before importing it again to maintain service availability, you should read on. Otherwise this doesn't really affect you. In either case, it should be fixable long-term, so it's probably not a panic-worthy piece of information for most people. :)

If you want the workaround, by the way, it is to shoot the box in the head -- panic it, power it off hard, etc. -- to export the pool. This is obviously very dangerous if you do not have a proper setup with an actually utilized ZIL (preferably through nice log devices) and all your important data arriving synchronously, or if you have other pools on the system you don't want to lose access to as well. Unfortunately, if either of those is the case, the only other workaround is to remove the cache vdevs from the pool, wait out the eviction, and only then export the pool. That's better than just exporting and waiting, since the pool stays imported while you remove the cache vdevs, but it is still fairly time-consuming.
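To sketch that second workaround as commands (pool and device names here are hypothetical, and timings will vary wildly with how much L2ARC data has to be evicted):

```shell
# Remove the cache vdevs first, wait for the eviction to drain,
# and only then export. Pool/device names are hypothetical.
zpool remove tank c2t0d0 c2t1d0 c2t2d0
zpool status tank      # repeat until the cache vdevs no longer appear
zpool export tank
```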

Here's your culprit, where we sit for potentially minutes, or even hours in a sufficiently 'bad' environment:

PC: _resume_from_idle+0xf1    CMD: zpool export l2test
  stack pointer for thread ffffff4b94b44180: ffffff022cdd9820
  [ ffffff022cdd9820 _resume_from_idle+0xf1() ]

Analysis is ongoing, but the problem appears to get dramatically worse with the total amount of L2ARC, the size of the individual L2ARC devices, the clock speed of your CPU (this is a single-CPU-bound task, so slower clocks hurt), and a smaller average block size (as that increases how many l2hdr entries you have).

This should be relatively easy to fix, in theory -- ARC evictions were already made asynchronous, but it looks like this wasn't done for the L2ARC as well. If it were, this should be much less of an issue.

Update 10/10/13:
Well, it appears there is code for an async l2arc eviction, and it's in play:

root@gold:/volumes# echo zfs_l2arc_async_evict::print | mdb -k
0x1 (B_TRUE)

So the investigation continues as to why this is happening anyway.

Update 10/10/13:
Kirill continues to investigate, and believes he's pinned the problem down to the async l2arc eviction not always actually being asynchronous. In his testing, much of it is done asynchronously, but the minute more than X evictions are in flight, the next one gets done synchronously -- and whether you exceed that limit is tied to the number of cache vdevs you have. You cannot export until all the data they referenced is cleared, so the export won't finish until the only remaining tasks are already in progress (and asynchronous; otherwise it has to wait them out). This explains why the export did complete before the eviction was actually through -- the ongoing tasks were asynchronous and there were no more outstanding. Seemingly a 'maxalloc' limit is being set on taskq_create by this line:

 arc_flush_taskq = taskq_create("arc_flush_tq",
     max_ncpus, minclsyspri, 1, 4, TASKQ_DYNAMIC);

The '4' is the culprit here. In Kirill's words: "Probably at the beginning of export, we are flushing ARC as well, so 2 threads are already busy, leaving us with only 2 for L2ARC. It is evicting asynchronously, just not all drives, because taskq isn’t big enough, since someone probably didn’t consider that there may be more than 2 L2ARC’s on a system :)", then adds, "Also explains why I couldn’t reproduce it with a single 1TB L2ARC - looks like you need at least three."

So the major takeaway here is that the originally observed correlation with the total size of the L2ARC and the size of the individual L2ARC vdevs only matters once you hit this 'bug' in the first place. Whether you hit it isn't determined by size at all, but essentially by the number of cache vdevs -- seemingly, in most situations you need at least 3 cache vdevs in the pool to ever run into this.

Update 10/14/13:
So apparently the arc_flush_taskq and the asynchronous ARC & L2ARC eviction code are all only in NexentaStor at the moment, having not (yet? I cannot publicly state our intent with this code; I have truly not been informed internally of what it is) been pushed back to illumos. Fortunately this means that on Nexenta 3.x machines, you'll generally only hit this long export time problem if you have more than 2 L2ARC devices in a pool you're trying to export, and the amount of ARC data related to the pool shouldn't affect export times either. Long-term we'll of course move to fix this issue, likely by raising the limit above 4, but only after more thorough analysis internally.

Unfortunately it also seems to mean that if you're not running Nexenta 3.x, and are on an older version or an alternative illumos derivative, you may very well be susceptible to long export times even with just a lot of small-block data in the ARC, and with any number of L2ARC devices, including just one. I'm afraid I don't currently have a box of sufficient size running a non-Nexenta illumos distro to do any testing.


  1. Thanks for the post. We are definitely hitting this bug with 3 SSD drives for L2ARC. Our failover time is typically about 20-25 minutes to export the pool (8TB used of a 30TB pool) and swing it over to the other head node. It looks like I'll be removing one of the L2ARC drives before we manually swing over (upgrades, etc.) for now.

  2. Scott, if you have SLOG devices in your pools and all your datasets have sync set properly (zvols too) then it is perfectly safe to panic one node via uadmin or reboot -p.

  3. I've been dealing with this problem for a couple years now. I run OmniOS so I'm not benefiting from the work Nexenta has done on this. My systems are all HA with RSF-1.

    My procedure for doing maintenance is to first set sync=always on the pool I need to export. I then panic the system and the other head takes over the pool. When I'm finished I set sync=standard back on the pool. This has worked many times for me without issue.

    One thing I would like to point out is that removing the cache SSDs from the pool is not helpful here. As soon as you execute "zpool remove tank {cache_ssd}" you get into a blocking situation that can also last a ridiculous amount of time. On one pool I have with 8 400GB SSDs, the first remove took 12 minutes and blocked I/O the entire time. That was 12 minutes of downtime I was not planning on dealing with.

    Because of this issue, I prefer to build my high performance systems with more RAM and less L2ARC.

    Hopefully Nexenta will release their work on this to Illumos soon.
