Maintenance Window - August 2021
We have now completed the lengthy maintenance required to migrate the entire Creators Wave stack to new infrastructure.
In our usual spirit of transparency, here's what we changed, what went well and what we (painfully) learned.
Firstly, why the change?
As some of our eagle-eyed members noticed, our old network server suffered from a corrupted Virtual Machine (VM). This led to a catastrophic failure of the site, leaving it completely inaccessible. We eventually achieved a partial recovery using our backups and snapshots; however, one of the agents running on our server incorrectly flagged forum.js and admin.js (both core files of Flarum) as malicious, and deleted them without warning.
We recovered these files, but a separate agent then corrupted the server's virtual disk, leading to another full outage. We felt we had no choice but to take the drastic step of pulling the entire site offline whilst we investigated the issue.
What we changed
With the failure of our network infrastructure, we took the opportunity to move our core networking back to [Login to see the link].
Our stack now operates over a floating IP address and runs through [Login to see the link]. This means we can easily scale the website to meet growth and demand.
We have also introduced a new, separate SQL cluster to the mix. Our SQL database is now managed on a private network that the website reaches through a direct tunnel, which gives us a much stronger security posture.
In short, the database can only be connected to over a private network that contains the server, an SSH jump-box, and the SQL cluster.
A follow-up blog detailing this will be available in the future.
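For the curious, the connection path can be sketched roughly as below. The hostnames, users, and ports are illustrative placeholders, not our real infrastructure; this is just a sketch of forwarding SQL traffic through an SSH jump-box, under the assumption of a standard MySQL port:

```python
# Sketch: how the web server reaches the SQL cluster via the jump-box.
# All hostnames, users, and ports below are illustrative placeholders.
import shlex

def tunnel_command(jump_host: str, db_host: str, db_port: int = 3306,
                   local_port: int = 3306) -> str:
    """Build an SSH local port-forward through the jump-box, so the
    application can talk to 127.0.0.1:local_port as if it were the DB."""
    return f"ssh -N -L {local_port}:{db_host}:{db_port} {shlex.quote(jump_host)}"

print(tunnel_command("deploy@jump.internal.example", "sql-cluster.internal"))
# The application then connects to 127.0.0.1:3306; the database itself
# never accepts connections from outside the private network.
```

The key property is that nothing on the public internet can reach the database directly: only the jump-box and the web server sit on the private network with it.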
What we learned
We learned some very tough lessons throughout this ordeal, and whilst it is difficult to admit, I must hold my hands up: mistakes were made throughout this process.
- We failed to react quickly enough to the failing server machine, which led to unnecessary delays and a poor public image for the site.
- We failed to identify the cause of the fault quickly enough, which allowed the server agents to do too much (irreparable) damage.
- We failed to take adequate backups (in an offsite location!), which means we have unfortunately lost a portion of our users' display pictures. (No other site data was lost in this ordeal.)
What we will do differently going forwards
We have now updated our internal policies, and we will ensure these are adhered to going forwards. In short, our backups will now be adequate for the job, and any software installed on the server (including agents for monitoring, security, or anything else) will go through rigorous testing in a staging environment first.
- Any installation to the server MUST first complete a full week in staging.
- Any change to the server MUST be made outside of peak times, and the server MUST be snapshotted both before and after the change; each snapshot is retained for 5 days, after which it can be removed.
- Backups will be stored in multiple secure locations, managed under a three-stage backup policy:
- Backups stored on DigitalOcean will be automated via their backups feature.
- Backups of the FTP content and SQL databases will be automated and dumped to an encrypted file share.
- A script will then be run to pull these encrypted file shares to an offline storage medium every week (to protect against a catastrophic outage).
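To give a flavour of stages two and three, here is a rough sketch of how such a pipeline could be scripted. The tool choices and paths (mysqldump, gpg, the share and offline mount points, the recipient key) are illustrative assumptions, not our exact setup:

```python
# Sketch of backup stages 2 and 3: dump SQL, encrypt it, then mirror the
# encrypted files to offline storage. Paths and recipient are placeholders.

def encrypted_dump_command(database: str, out_path: str, gpg_recipient: str) -> str:
    """Stage 2: pipe a SQL dump straight into gpg, so plaintext never
    touches the disk of the share."""
    return (f"mysqldump --single-transaction {database} "
            f"| gpg --encrypt --recipient {gpg_recipient} --output {out_path}")

def offline_sync_command(share_dir: str, offline_mount: str) -> str:
    """Stage 3: weekly pull of the encrypted share to an offline medium."""
    return f"rsync -a --delete {share_dir}/ {offline_mount}/"

print(encrypted_dump_command("flarum", "/mnt/share/flarum.sql.gpg", "backups@example"))
print(offline_sync_command("/mnt/share", "/mnt/offline"))
```

Commands like these would typically run from cron, with the offline medium only mounted for the duration of the weekly sync.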
We once again apologise for the disruption to our site, and assure you that a thorough health check has been carried out to ensure this does not happen again.