Looking for thoughts/opinions
I have a 5-disc raidz1 array. The drives are accumulating CKSUM errors, fairly evenly distributed across the discs. I’ve been lazy and let this progress to the point where there are permanent errors in files.
# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 748K in 06:17:19 with 1 errors on Sun Jul 14 06:41:22 2024
config:
        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD13YBW  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD13YE4  ONLINE       0     0     7
            ata-ST8000VN004-2M2101_WSD1454G  ONLINE       0     0     8
            ata-ST8000VN004-2M2101_WSD1454W  ONLINE       0     0     6
            ata-ST8000VN004-2M2101_WSD14563  ONLINE       0     0     7
errors: Permanent errors have been detected in the following files:
        /you/do/not/need/this/level of detail.txt
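If it helps to track whether the counts keep growing between scrubs, the CKSUM column can be pulled out of the status text. This is only a rough sketch: the function name `cksum_errors` is made up for illustration, and it assumes plain device rows with exactly five columns (rows with trailing notes, or counts printed in abbreviated form, would be skipped).

```shell
# Sketch: list devices with a non-zero CKSUM count from `zpool status` text.
# Reads the status text on stdin, so it can also be run on saved output.
cksum_errors() {
    awk '
        # Device rows have five fields: NAME STATE READ WRITE CKSUM.
        NF == 5 && $1 != "NAME" && $5 ~ /^[0-9]+$/ && $5 > 0 {
            printf "%s %s\n", $1, $5
            total += $5
        }
        END { printf "total %d\n", total }
    '
}

# Usage (pool name "tank" taken from the post):
#   zpool status tank | cksum_errors
```

Run against the output above, this would list the five drives and a total of 34.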
I’ve done some research and believe (hope) that the cause of these errors is the “domestic” onboard SATA controllers I’m using, and I have ordered an LSI SAS3008 9300-8i HBA as an upgrade.
I know I can fix the permanent error by deleting the affected file, restoring it, and then running a scrub. But I’m torn: should I scrub now and risk stressing the array more on the crappy SATA controllers, or wait until I get the new HBA (in a few weeks, thanks to free, cheap, slow shipping)?
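For reference, the delete, restore, scrub sequence could look roughly like the sketch below. The helper names `run` and `repair_file`, the `DRY_RUN` switch, and the backup path are all hypothetical; by default it only prints the commands, so nothing touches the pool until `DRY_RUN=0` is set.

```shell
# Sketch: delete a corrupted file, restore it, then start a scrub.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

repair_file() {
    pool=$1      # pool name, e.g. tank
    file=$2      # path of the file listed under "Permanent errors"
    backup=$3    # known-good copy (hypothetical location)
    run rm -- "$file"
    run cp -p -- "$backup" "$file"
    run zpool scrub "$pool"
}

# Usage:
#   repair_file tank /tank/some/file /backup/some/file
```

The scrub runs in the background, so `zpool status -v tank` is what shows whether the error is gone once it finishes, and `zpool clear tank` resets the per-device counters afterwards.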
Hello from All.
I don’t know what any of that means, but as a scrub tech, my vote is to scrub.
Why would anyone downvote this? Don’t you have a sense of humour? I thought it was funny.
I have the same issue. For what it’s worth, it’s still running just fine after 3 years, apart from the occasional corrupted file after a scrub. Thankfully that pool is mostly games and media I can just redownload. The error rate is always the same, plus a corrupted file when the controller fucks up. Weirdly, my SSD pool on the same controller seems fine, but it also completes a scrub in less than an hour, unlike the HDD array.
You’ll be fine waiting a week if you want to be sure.
I wish there was an option to retry a few times instead of giving up, as the controller will give the correct data if tried again. It seems to happen when the controller is under heavy load for an extended period of time (e.g. 18h of scrubbing), and it only does it close to the end.
I’ve seen some tunables to make the scrub slower; that might reduce the strain enough to not cause the errors.
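The comment doesn’t say which tunables, so this is a guess: on Linux, OpenZFS exposes scrub-related module parameters such as `zfs_vdev_scrub_min_active`, `zfs_vdev_scrub_max_active` and `zfs_scan_vdev_limit` under `/sys/module/zfs/parameters`. A small sketch to look at the current values before changing anything (the function name `show_scrub_tunables` is made up, and whether lowering these actually avoids the controller errors is untested):

```shell
# Sketch: print current values of a few scrub-related OpenZFS module parameters.
# The default directory is the usual Linux location; pass another path to override.
show_scrub_tunables() {
    dir=${1:-/sys/module/zfs/parameters}
    for name in zfs_vdev_scrub_min_active zfs_vdev_scrub_max_active zfs_scan_vdev_limit; do
        # Skip parameters that this ZFS version does not expose.
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}

# Lowering a value needs root and does not survive a reboot, for example:
#   echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
```

A running scrub can also be paused with `zpool scrub -p tank` and resumed later with `zpool scrub tank`, which may be a simpler way to give the controller a break.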
I have not been observant enough to notice whether the corruption is caused by the scrubs. It makes sense, since that would be the only real time my array gets any stress. That being the case, I’ll leave the scrub until after I get the HBA installed.
I’d shut it down before it corrupts even more, replace the HBA when it arrives, and run a scrub to see what the damage is.
I know that’s the correct response. But, it’s been running like this for many months, maybe even years - as I said in the post, I’ve been lazy. There’s nothing on it that can’t easily be restored, or replaced, and shutting it down would be a PITA.
There’s always a chance your backups might get corrupted too if you let it continue like that.