uwu.social is back online after our scheduled HDD replacement. There were a few roadbumps here and there, so it took 6.5 hours instead of the 4-5 we planned for.
Hetzner put a 16k hour drive in the server instead of a 0 hour server at first, so we had to wait for them to replace it twice.
The bootloader was on the removed disk, so we weren't able to boot the server until we got KVM and figured out the cause.
Next on our agenda is moving all Mastodon media to a cloud storage provider to reduce strain on the disks in the future and to free up some space. We will probably start copying media in the background shortly. The main copy operation will not require downtime, but when we do the switch from local to B2 it will require some more planned downtime in the future.
Thank you for your understanding in regards to the downtime we've been having recently.
Moving forwards, to avoid a situation like this again we will:
- Keep making backups, and backing up even more things than before (system SSH keys, Mastodon media, etc.)
- Set up alerts for old drives so we can replace them safely
- Replace drives on all servers if they are old
- Never use Minio again for something with as much content as Mastodon, instead using a cloud provider to mitigate drive strain
Next time we have extended unexpected maintenance again, we will endeavour to put a updates page up again.
Aurieh and I learned a lot from this event. Hopefully, it never happens again.
We're sorry about the downtime that occurred yesterday. One of the drives in uwu.social's server failed. We were running in RAID-1, but because the software raid was misconfigured we encountered corrupted filesystems due to high I/O.
All data was recovered successfully. The RAID array has been rebuilt with a new (0 hour) drive. One of our concerns was that the remaining drive in the replica would fail during the rebuild but that didn't happen, which we're grateful for. We will be putting that hard drive out of service within the next month (which will require scheduled downtime).
It seems like the drive died while processing the huge amount of strain imposed on it by Mastodon and Minio during the old media removal process. We will move to Backblaze B2 before running out again (which will require more scheduled downtime).
Recently I made a shitty DDR pad to make up for the fact the local arcade sold their ITG cabinet.
The sensors are made out of tin foil lol. Eventually I'll make a improved version with real sensors (weight sensors).
Haven't made a bar for it yet, using an unused bedside table temporarily (it sucks since it's the wrong height and position)
This happened after updating to the newest version, so we were convinced it was broken code but no one else was reporting similar issues.
For debugging, we undid our patches and added tracing to sidekiq and mastodon, which lead us to believe mastodon was filling memory because of elasticsearch failures. We'll reapply the patches shortly.
We've been having a lot of troubles with mastodon recently (filling memory and swap), which we traced back to elasticsearch freaking out about there not being enough space on the storage drive.
We cleared off some old backups/unnecessary data from the disk, so everything should be fine now. Sorry about the recent downtime.
mastodon moderation tools suck. performing moderation actions takes way longer than it should (it loads for a while), and for some reason disabled accounts can still post? I disabled 2 accounts a week ago and they somehow both posted today...
and to top it all off, after I suspended both of the accounts, the posts from one were still available on local timeline until I manually deleted the post.
I switched to fastmail today to try it out, they added labels in beta which was the last thing I needed to switch. gonna stay on the trial for a month but it seems pretty good so far.
the android app felt a bit sluggish but it's whatever. I can use a different mail app if it's a problem, plus I don't use my phone for viewing my mail very often.
admin of the uwu.social
cool guy