We're sorry about the downtime that occurred yesterday. One of the drives in uwu.social's server failed. We were running RAID-1, but because the software RAID was misconfigured, we ended up with corrupted filesystems under high I/O.
All data was recovered successfully. The RAID array has been rebuilt with a new (0-hour) drive. One of our concerns was that the remaining drive in the replica would fail during the rebuild, but that didn't happen, which we're grateful for. We will be taking that hard drive out of service within the next month (which will require scheduled downtime).
It seems the drive died under the heavy strain imposed on it by Mastodon and Minio during the old media removal process. We will move to Backblaze B2 before running out again (which will require more scheduled downtime).
Moving forwards, to avoid a situation like this, we will:
- Keep making backups, and back up even more things than before (system SSH keys, Mastodon media, etc.)
- Set up alerts for old drives so we can replace them safely
- Replace drives on all servers if they are old
- Never use Minio again for something with as much content as Mastodon, instead using a cloud provider to mitigate drive strain
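
A minimal sketch of what an age-based drive alert could look like, assuming the power-on hours for each drive have already been read (for example from the SMART Power_On_Hours attribute). The `Drive` class, the `drives_due_for_replacement` helper, the four-year threshold, and the sample values are all hypothetical and only for illustration:

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class Drive:
    device: str          # e.g. "/dev/sda"
    power_on_hours: int  # assumed to be collected elsewhere, e.g. from SMART data


def drives_due_for_replacement(drives, max_age_years=4.0):
    """Return the drives whose power-on time meets or exceeds the age limit.

    The default limit of four years is an arbitrary assumption, not a
    recommendation from any vendor.
    """
    limit_hours = max_age_years * HOURS_PER_YEAR
    return [d for d in drives if d.power_on_hours >= limit_hours]


if __name__ == "__main__":
    # Made-up sample values: one old drive and one brand new (0-hour) drive.
    fleet = [Drive("/dev/sda", 41000), Drive("/dev/sdb", 0)]
    for drive in drives_due_for_replacement(fleet):
        print(
            f"ALERT: {drive.device} has {drive.power_on_hours} power-on hours; "
            "schedule a replacement"
        )
```

Something like this could run on a schedule and feed whatever notification channel is already in use; age alone does not predict failure, so it would complement rather than replace health checks.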
The next time we have extended unexpected maintenance, we will endeavour to put up an updates page again.
Aurieh and I learned a lot from this event. Hopefully, it never happens again.
@dean Thank you for the maintenance page. It was refreshing to see an honest log of developments unfold, unlike many other status pages and post-mortems I've seen.
@dean Thank you and Aurieh both for your hard work. Hardware reliability (esp. around the time some/most/all hardware starts failing from age) can be a real nightmare! Fortunately (from what I saw in replies / local timeline) you also have a very understanding community. Thank you both, again, for the hard work and the transparency!