Nex7's Blog

Friday, February 28, 2014

Need a job?

So we're seriously in need of some good people (well, we're always interested in good people). If you or someone you know is a Linux or FreeBSD or Solaris or illumos person and would be interested in a support role, shoot me a resume at andrew.galloway@nexenta.com.

Prior ZFS and/or illumos/Solaris experience is a plus, but it is not required. Some of our best people came over with a purely Linux background, others had Solaris experience but had never touched ZFS, etc. What is necessary is:

an excellent work ethic (the job is likely telecommute, depending on your location)
prior experience and no problem crunching cases in a support queue & handling customer-facing phone calls/GTM's
grace under pressure
experience dealing directly with customers, including during stressful periods for them
a genuine desire to help people with their issues
excellent communication skills & ability to work well with a global team
quick learner & self learner (training and skill transfer from teammates is of course available/provided, but sometimes you need to find an answer now and you need to be comfortable doing that)
ability to deal with an occasionally flexible schedule
ok with very occasional travel requirements (generally to Nexenta offices/events)
console (bash) skills (comfortable with shells and common GNU tools [grep/awk/sed/etc]) needed - sysadmin and/or devops experience desired
skill at one or more common scripting languages (Python, Perl, Ruby, bash, etc) strongly desired
ability to at least partially read & understand C code desirable

Looking for L2 and L3 (we have plenty of open positions).

Wednesday, October 9, 2013

Lots of L2ARC not a good idea, for now

Before I get into it, let me say that the majority of the information below and analysis of it comes from one of my partners in crime (coworker at Nexenta), Kirill (@kdavyd). All credit for figuring out what was causing the long export times, and reproducing the issue internally (it was initially observed at a customer) after we cobbled together enough equipment, and analysis to date, goes to him. I'm just the messenger, here.

Here's the short version, in the form of the output from a test machine (emphasis is mine):

root@gold:/volumes# time zpool export l2test

real 38m57.023s
user 0m0.003s
sys 15m16.519s

That was a test pool with a couple of TB of L2ARC, filled up, with an average block size on pool of 4K (a worst case scenario) -- and then simply exported, as shown. If you are presently using a pool with a significant amount of L2ARC (100's of GB's or TB's), (update:) and you have multiple (3+) cache vdevs, and have any desire to ever export your pool that has a time limit attached to it before you need to be importing it again to maintain service availability, you should read on. Otherwise this doesn't really affect you. In either case, it should be fixable, long-term, so probably not a panic-worthy piece of information for most people. :)

Cables Matter

No, I don't mean brand, or color, or even connector-type zealotry.. but when sizing solutions and working with customers and partners on new builds, I often find that either no thought goes into the SAS cabling, or only a little bit does. I find this distressing. Let me tell you why.

ZFS Intent Log

[edited 11/22/2013 to modify formula]

The ZFS Intent Log gets a lot of attention, and unfortunately often the information being posted on various forums and blogs and so on is misinformed or makes assumptions about the knowledge level of the reader that if incorrect can lead to danger. Since my ZIL page on the old site is gone now, let me try to reconstruct the knowledge a bit in this post. I'm hesitant to post this - I've written the below and.. it is long. I tend to get a bit wordy, but it is also a subject with a lot of information to consider. Grab a drink and take your time here, and since this is on Blogger now, comments are open so you can ask questions.

If you don't want to read through this entire post, and you are worried about losing in-flight data due to things like a power loss event on the ZFS box, follow these rules:

Get a dedicated log device - it should be a very low-latency device, such as a STEC ZeusRAM or an SLC SSD, but even a high quality MLC SSD is better than leaving log traffic on the data vdevs (which is where they'll go without log devices in the pool). It should be at least a little larger than this formula, if you want to prevent any possible chance of overrunning the size of your slog: (maximum possible incoming write traffic in GB * seconds between transaction group commits * 3). Make it much larger if its an SSD, and much much larger if its an MLC SSD - the size will help with longevity. Oh, and seconds between transaction group commits is the ZFS tunable zfs_txg_timeout. Default in older distributions is 30 seconds, newer is 5, with even newer probably going to 10. It is worth noting that if your rarely if ever have heavy write workloads, you may not have to size it as large -- it is very preferably from a performance perspective that you not be regularly filling the slog, but if you do it rarely it's no big deal. So if your average writes in [txg_timeout * 3] is only 1 GB, then you probably only need 1 GB of log space, and just understand when you rarely overfill it there will be a performance impact for a short period of time while the heavy write load continues. [edited 11/22/2013 - also, as a note, this logic only applies on ZFS versions that still use the older write code -- newer versions will have the new write mechanics and I will update this again with info on that when I have it]
(optional but strongly preferred) Get a second dedicated log device (of the exact same type as the first), and when creating the log vdev, specify it as a mirror of the two. This will protect you from nasty edge cases.
Disable 'writeback cache' on every LU you create from a zvol, that has data you don't want to lose in-flight transactions for.
Set sync=always on the pool itself, and do not override the setting on any dataset you care about data integrity on (but feel free TO override the setting to sync=disabled on datasets where you know loss of in-transit data will be unimportant, easily recoverable, and/or not worth the cost associated with making it safe; thus freeing up I/O on your log devices to handle actually important incoming data).

Alright, on with the words.

(IPMP vs LACP) vs MPIO

If you're running an illumos or Solaris-based distribution for your ZFS needs, especially in a production environment, you may find yourself wanting to aggregate multiple network interfaces either for performance, redundancy, or both. With Solaris, your choices are not limited to standard LACP.