This evening I have updated Lemmy from 0.19.8 to 0.19.9
There has been an increasing number of new users. Many of these have turned out to be dishonest trolls and have subsequently been banned. One of the traits of the internet, I guess.
I shall be moving the pict-rs data from local filesystem to local S3 this evening.
The site shall be offline for an extended period during this migration. There is ~450G in just over 3 million files currently hosted on the pict-rs instance.
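As a quick sanity check on those numbers, the average object size works out to roughly 160 KB, which is about what you'd expect from a thumbnail-heavy cache:

```python
# Rough average object size for the pict-rs store:
# ~450 GiB spread over just over 3 million files.
total_bytes = 450 * 1024**3      # ~450 GiB
file_count = 3_000_000

avg_bytes = total_bytes / file_count
print(f"{avg_bytes / 1024:.0f} KiB per file on average")  # ~157 KiB
```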
Lemmy and all its containers sit on an NVMe-backed ZFS pool. With pict-rs fast approaching 500G, I've been exploring alternatives, so I introduced an S3 service backed by spinning rust.
I expect the migration to take several hours so will start this just before I head off for bed between 2300-0000 GMT.
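For reference, pointing pict-rs at object storage is a config change; a rough sketch of the relevant section (field names as I recall them from the pict-rs 0.5 docs; endpoint, bucket, and credentials are placeholders):

```toml
[store]
type = "object_storage"
endpoint = "http://s3.internal:9000"  # local S3 service on spinning rust
use_path_style = true
bucket_name = "pictrs"
region = "us-east-1"
access_key = "REPLACE_ME"
secret_key = "REPLACE_ME"
```

The actual data move is handled by pict-rs itself; if memory serves there is a `migrate-store` subcommand that walks the filesystem store and uploads everything to the configured object store.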
Update 2025-01-21 03:09 GMT
Downtime 3h 17m -- Migration completed. Images seem to be working, both new and old. Site is catching up on missed posts.
The new server is prepped and ready to take to the datacentre. I've created a VXLAN (MikroTik EoIP) to the office, so I have been transferring services to the new server as time allowed over the last few days.
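The tunnel itself is only a couple of lines on each MikroTik (names and the remote address below are made up; the `tunnel-id` must match on both ends):

```
/interface eoip add name=eoip-office remote-address=203.0.113.10 tunnel-id=42
/interface bridge port add bridge=br-lan interface=eoip-office
```

Bridging the EoIP interface into the local bridge is what makes the datacentre and office look like one layer-2 segment, so services can move without re-addressing.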
I just performed the transfer of Lemmy to the new server which was an unannounced 2h30m downtime (sorry!). It was a good backup restore test though.
To complete the project, I shall be visiting the datacentre on Thursday, 2025-01-02 to swap the old server for the new. lazysoci.al will be offline on 2025-01-02 from 0830 to 1200 GMT
I shall be updating the instance over the next couple of weeks to 0.19.8. I'll post an update some hours before the upgrade.
Over the holidays I shall be decommissioning the physical server that this Lemmy instance is hosted on. The new server is being built and will be a Dell PowerEdge R640 running all U.2 NVMe drives. Hopefully, this will make the server the fastest it can possibly be.
Our Lemmy instance stopped processing new activity sometime on Monday morning, 2024-11-18.
The root cause remains unknown. Services were online. Database was responsive.
Lemmy server logs showed the incoming ActivityPub requests and no errors, but no response was being returned to the sender. The system was restarted on 2024-11-19 and processing of requests resumed.
Luckily, the protocol allows for some caching of requests across all servers, so after 30 minutes of heavy load, our server had mostly caught up.
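That catch-up behaviour comes from senders retrying failed deliveries with a growing delay until the receiver responds again. A toy version of that retry schedule (my illustration, not Lemmy's actual federation code, which persists its queue and retries over hours or days):

```python
import time

def deliver_with_retry(send, max_attempts=6, base_delay=1.0):
    """Attempt a delivery, backing off exponentially between failures.

    `send` is any callable that raises on failure and returns the
    response on success.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Once our server started answering again, the backlog of queued retries is what produced the 30 minutes of heavy load.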
I was away on Monday, and I did notice the issue, but I initially thought it was a problem with my mobile app (recently moved to Boost). I normally view Lemmy sorted by "Top - Last Twelve Hours" and on Tuesday this returned zero results, which prompted a closer look.
I have added additional monitoring to the system, checking the age of the latest post. I shall now receive an alert if a new post has not been received for 15 minutes. This may result in some false positives.
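The check itself is just a timestamp comparison; a minimal sketch (where `newest_post_at` comes from is my assumption, e.g. a `max(published)` query, not necessarily how the monitor is actually wired up):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=15)

def post_feed_is_stale(newest_post_at, now=None):
    """Return True if no post has arrived within the alert window."""
    now = now or datetime.now(timezone.utc)
    return now - newest_post_at > STALE_AFTER
```

Quiet periods with genuinely no new federated posts for 15 minutes are exactly where the false positives would come from.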
lazysoci.al was offline whilst the cluster was being updated.
Outage was 38 minutes. 11:00 to 11:38 BST
This was expected to take <15 minutes. The extended outage was due to an issue bringing up a Docker container that was a prerequisite for the load balancer.
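One way to make that start-up ordering explicit is a Compose healthcheck, so the load balancer only starts once its dependency reports healthy (service names and endpoints below are made up for illustration):

```yaml
services:
  upstream-app:
    image: example/upstream-app:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 5s
      retries: 5

  loadbalancer:
    image: haproxy:2.9
    depends_on:
      upstream-app:
        condition: service_healthy
```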
I've had to ban multiple vile accounts this morning. It seems this instance has found itself on the radar of trolls.
To that end, sign-ups now require users to fill in a questionnaire before joining. I always thought it was lame, and it won't really prevent a troll account from joining, but it should slow them down and will likely push them towards an easier-to-join instance.
Upgrade
I shall be upgrading the instance to ~~0.19.4~~ 0.19.5 ~~this afternoon~~ on Friday morning, so expect a little downtime
I'll unpin this notice once the update is complete.
Edit: Unfortunately work got in the way of me performing the update today, so postponing this to first thing tomorrow morning.
Cached images
We use the standard group of services for Lemmy, including the pict-rs image/thumbnail cache. This image cache grew to 700G recently and continues to grow as Lemmy grows. Therefore some effort has been made to keep it under control.
What was meant to be a quick blip ended up being over an hour. Migrating the reverse proxy that sits in front of the Lemmy server failed as Docker Hub was having an outage.
lazysoci.al was offline for 3h 15m today following a database corruption. Server is now back online, federated data is flowing again.
Details
I moved the server to its own dedicated host this morning, for both performance and security (a dedicated VLAN). It should have been a simple case of moving the virtual disk with the Lemmy data to the new VM and spinning up the new Docker image.
The Docker logs didn't show any initial issues; however, writing to the database gave `ERROR: relation "approvals" does not exist` for every UPDATE query.
After some troubleshooting, I concluded the database was corrupted, so I started a restore from last night's backup. The restore took approx. 2h 30m.
Post-restore, the same issue remained. I then updated to the latest beta, and the issue is now resolved.
This has highlighted one problem: I use Proxmox Backup Server and Proxmox Virtual Environment. You can't easily restore a single disk from a VM into a ZFS vo