bioctl "intermitently" reports RAID 1 array as degraded
Theodore Wynnychenko
2018-04-14 18:03:22 UTC
Hello

I am trying to understand what I may be missing (I have been noticing this issue
for a year or so).

I have a machine running -current that is set up with two SSDs.

Each SSD is fdisk'ed with a single OpenBSD partition:

# fdisk sd0
Disk: sd0       geometry: 19457/255/63 [312581808 Sectors]
Offset: 0       Signature: 0xAA55
            Starting         Ending         LBA Info:
 #: id      C   H   S -      C   H   S [       start:        size ]
-------------------------------------------------------------------------------
 0: 00      0   0   0 -      0   0   0 [           0:           0 ] unused
 1: 00      0   0   0 -      0   0   0 [           0:           0 ] unused
 2: 00      0   0   0 -      0   0   0 [           0:           0 ] unused
*3: A6      0   1   2 -  19456 254  63 [          64:   312576641 ] OpenBSD

The disklabel on each disk has an "a" 4.2BSD partition, a "b" swap partition,
and an "m" RAID partition:

# disklabel sd0
# /dev/rsd0c:
type: SCSI
disk: SCSI disk
label: INTEL SSDSA2BW16
duid: 43d094716532e926
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 19457
total sectors: 312581808
boundstart: 64
boundend: 312576705
drivedata: 0

16 partitions:
#                size           offset  fstype [fsize bsize   cpg]
  a:          2104448               64  4.2BSD   2048 16384     1 # /
  b:         18860313          2104512    swap                    # none
  c:        312581808                0  unused
  m:        291611880         20964825    RAID
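
For reference, the RAID 1 volume was assembled from the two "m" partitions;
the creation command would have been along these lines (treat the exact
invocation as approximate):

# bioctl -c 1 -l /dev/sd0m,/dev/sd1m softraid0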

Most of the time, everything is fine:

# bioctl -i sd2
Volume      Status               Size Device
softraid0 0 Online       149305012224 sd2     RAID1
          0 Online       149305012224 0:0.0   noencl <sd0m>
          1 Online       149305012224 0:1.0   noencl <sd1m>


But every once in a while (after a couple of weeks, sometimes a couple of
months), the array suddenly reports as degraded.

However, other than the notice that the array is degraded and a mirror is
offline, I can find nothing in any log, nor any change in the dmesg, to
suggest what may have happened.

I have changed the drive cables. I have swapped out the SSDs.

But it still happens every so often.

When the array is degraded, I can still run fdisk and disklabel against the
"offline" disk without a problem. I can rebuild the degraded array onto the
"offline" disk (# bioctl -R /dev/sd1m sd2); the rebuild completes without a
problem, and the array is then stable for weeks or months until, randomly, it
happens again.
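
When it next degrades, I plan to capture state before rebuilding, so there is
something to compare against afterwards; roughly like this (the file name and
the sd1m chunk are just examples, since which disk drops out varies):

# bioctl -i sd2 > /root/degraded.$(date +%Y%m%d)   # record volume/chunk state
# dmesg >> /root/degraded.$(date +%Y%m%d)          # append the kernel buffer
# bioctl -R /dev/sd1m sd2                          # rebuild onto the offline chunk
# bioctl sd2                                       # status column shows rebuild progress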

Is there anything I should be looking at, or looking for, to help figure out
what the issue is?

As I said, I have already swapped out the hardware at least once. If it is a
hardware issue, I can keep swapping out hardware, but at this point the
probability seems really low (though obviously not zero) that multiple drives
would have the same intermittent problem.

I would appreciate any advice on how to track down the problem the next time
it happens.
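
In the meantime, I am thinking of running something like the following from
cron every few minutes, so the moment of degradation gets captured along with
some context (a rough sketch; the device name, log path, and what gets
snapshotted are just my choices):

#!/bin/sh
# Poll the softraid volume and snapshot state as soon as it is no
# longer "Online", so there is something to inspect afterwards.
LOG=/var/log/softraid-watch.log
VOL=sd2

# The third field of the volume line (second line of bioctl output)
# is the status: Online, Degraded, Rebuild, etc.
status=$(bioctl "$VOL" | awk 'NR == 2 { print $3 }')

if [ "$status" != "Online" ]; then
	{
		echo "==== $(date) volume $VOL status: $status ===="
		bioctl -i "$VOL"     # per-chunk detail: which disk went offline
		dmesg | tail -n 50   # any kernel messages around the event
	} >> "$LOG"
fi

with a crontab entry along the lines of:

*/5 * * * * /bin/sh /root/softraid-watch.sh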

Thanks
Ted
