Goodbye Areca, Hello LSI
While it’s not clear exactly who is causing what, what is clear is that the arcmsr driver tries to dereference a NULL pointer, either because the adapter screws up or because the driver screws up somewhere. The result is a Solaris kernel fault pointing at the arcmsr driver, and apparently an adapter lockup. It’s not 100% clear what triggers the condition: it could be the driver not handling some buffer appropriately, or the card sending an error the driver doesn’t handle. It’s pretty likely, though, that the issue sits entirely inside the arcmsr driver and the Areca hardware.

One thing we did discover, and it means we HAVE to replace the Areca hardware: in JBOD mode (which is how we use it, since we’re relying on Solaris’ ZFS for its superset of the card’s RAID functionality), any disk failure seizes the whole card until the failure clears, or maybe until some apparently long timer expires. SATA and SAS have Ethernet-like link failure detection; you know within milliseconds when a cable is pulled. Yet the Arecas in JBOD mode seem unable to handle hot-swap of any type, or failures of any type. When we asked Areca to address it, all we received were vague “you must have a failing drive” answers, which for a RAID card is a bad answer. Even in JBOD mode the controller should signal and propagate the error; Solaris would handle that condition.
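If you want to see which driver a fault like this is getting pinned on, the system log is the first place to look. Here’s a minimal sketch in Python; the log path and the strings it filters for are my assumptions, not anything specific to this setup. It just pulls out the arcmsr and panic lines so you can see what the fault points at:

    #!/usr/bin/env python
    # Sketch: pull arcmsr warnings and panic lines out of the Solaris
    # system log to see which driver a fault is being attributed to.
    # Assumes the default log path; adjust LOG for rotated copies.

    import re

    LOG = "/var/adm/messages"
    INTERESTING = re.compile(r"arcmsr|panic|bad trap", re.IGNORECASE)

    with open(LOG) as messages:
        for line in messages:
            if INTERESTING.search(line):
                print(line.rstrip())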
Then there’s the boot selection. All the logical drives show up in the boot list, and either the list fills up or the Arecas only list drives on the first controller. Either way, that’s a problem if you want to be able to boot from an alternate drive on a second controller.
So, Saturday, I get to back up all the user data. Blow the whole damn thing away. And start over. *sigh*
I know I posted about this before, and we’d been avoiding the whole rebuild; turns out I’m going to have to do it anyway. Argh. The biggest reason is that the LSI cards apparently use an on-disk metadata format. They’re kinda quiet about it all, so I’m not sure how big it is or where it sits on the disk. I’m betting it’s the last N megs of the drive: 8, 16, 20, something like that. Most people won’t notice, but if you’re migrating existing disks, the slightly smaller usable capacity behind the new controller becomes noticeable.
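If you want to gauge how much the controller eats, the quickest check is to compare the capacity the drive’s label advertises against what the OS actually sees behind the card. Here’s a minimal sketch in Python; the device path is a made-up example, and it assumes the raw device reports its size via a seek to the end (if it doesn’t on your system, prtvtoc will give you the accessible sector count instead):

    #!/usr/bin/env python
    # Sketch: report how many bytes a disk exposes behind the controller,
    # so it can be compared against the drive's labeled capacity and the
    # difference (the controller's on-disk metadata) estimated.
    # The device node below is a placeholder; substitute your own.

    import os

    DEV = "/dev/rdsk/c1t0d0p0"          # hypothetical device path

    fd = os.open(DEV, os.O_RDONLY)
    try:
        exposed = os.lseek(fd, 0, os.SEEK_END)   # bytes visible to the OS
    finally:
        os.close(fd)

    print("%s exposes %d bytes (%.2f GiB)" % (DEV, exposed, exposed / 2.0**30))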