Creators Wave downtime
What happened, and what we learnt.
It’s never fun when a major fault leads to an extended period of downtime, and we are truly sorry that this has happened. Unfortunately, the error we experienced was outside of our control: a fault on our hosting provider’s side left our server without network connectivity.
It took around three days to get a response from their support, which is unacceptable and led to an unnecessarily long period of downtime that could have been avoided.
From this experience, however, we have learnt a few things that will aid us in disaster recovery going forward. We have broken these down below:
What we did wrong
Whilst the issue lay with our host for the most part, there is no denying that we ourselves could have done more to minimise risk.
- We did not have a proper backup system in place, and the backups that we did have were either old, corrupted, or both.
- Our environment configuration was not documented anywhere, so historical artefacts from previous configurations led to issues accessing mounts and restoring backups.
- Our monitoring system did not alert us to the issue quickly enough; instead, we had to rely on an automated email from OVH, which arrived a considerable time after the fault it identified, despite the fault time being logged in the alert.
- We publicly vented frustration at our host. Whilst you might well consider that it was called for, I would prefer it had not been made public.
What are we doing to improve?
- Firstly, we now have a box dedicated entirely to backup storage and replication. Every night, our database automatically backs itself up and uploads the dump to this storage (a rough sketch of this job appears after this list).
- We now have version control and commit history through Git, which we will use as both a backup and a central repository for our work. This also means that any failed update can be rolled back by reverting commits.
- Over the coming days, we will be migrating our site away from the current host to a new VPS with much clearer (and better) SLAs that support uptime guarantees, as well as allowing us greater control over our hosting.
- Our monitoring systems will be evaluated to ensure that errors are correctly logged, alerted on, and escalated as required. A system outage such as the one we experienced should trigger an automated call to me (see the health-check sketch below).
- Backups, replicas and snapshots are to be tested in full at random intervals to ensure that they work as intended and are free of corruption (a simple verification sketch also follows).
- Environment configuration and system changes are to be logged centrally so that they can be referred to as and when required. Entries should be dated and signed off by the relevant team member(s) (n.b. this is future-proofing, as I am currently the only person on the team).
- Our backups will be replicated across multiple sites under multiple providers, so that data remains recoverable even if any one location is lost.
- Disaster recovery documentation is to be updated to reflect these changes and what we have learnt from this event.
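For the curious, here is a minimal sketch of the shape of the nightly backup job. It assumes a MySQL database and an SSH-accessible storage box; the database name, paths and remote address are placeholders rather than our real configuration.

```python
#!/usr/bin/env python3
"""Nightly database backup: dump, compress, checksum, upload."""
import datetime
import hashlib
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/var/backups/db")          # local staging area (placeholder)
REMOTE = "backup@storage-box.example:dumps/"  # SSH-reachable storage box (placeholder)

def run_backup() -> None:
    stamp = datetime.date.today().isoformat()
    dump_path = BACKUP_DIR / f"creatorswave-{stamp}.sql.gz"

    # Dump the database and gzip it in one pipeline; credentials live in
    # ~/.my.cnf so they never appear on the command line.
    with open(dump_path, "wb") as out:
        dump = subprocess.Popen(
            ["mysqldump", "--single-transaction", "creatorswave"],
            stdout=subprocess.PIPE,
        )
        subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
        if dump.wait() != 0:
            raise RuntimeError("mysqldump failed")

    # Record a checksum alongside the dump so later restore tests can
    # detect corruption (see the verification sketch further down).
    checksum_path = dump_path.with_name(dump_path.name + ".sha256")
    digest = hashlib.sha256(dump_path.read_bytes()).hexdigest()
    checksum_path.write_text(f"{digest}  {dump_path.name}\n")

    # Ship both files to the storage box over SSH.
    subprocess.run(["scp", str(dump_path), str(checksum_path), REMOTE], check=True)

if __name__ == "__main__":
    run_backup()
```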
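The monitoring improvement boils down to a health check that runs from outside our host’s network, so that a connectivity fault like this one cannot silence its own alert. A rough sketch follows; the health endpoint and the call-escalation webhook are placeholders, as we have not yet settled on an alerting provider.

```python
#!/usr/bin/env python3
"""External health check: probe the site, escalate to a phone call on repeated failure."""
import time
import urllib.request

SITE = "https://creatorswave.example/health"   # placeholder health endpoint
ALERT_HOOK = "https://alerts.example/call-me"  # placeholder call-escalation webhook
FAILURES_BEFORE_ALERT = 3                      # tolerate brief blips
CHECK_INTERVAL = 60                            # seconds between probes

def site_is_up() -> bool:
    try:
        with urllib.request.urlopen(SITE, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

def main() -> None:
    failures = 0
    while True:
        failures = 0 if site_is_up() else failures + 1
        if failures == FAILURES_BEFORE_ALERT:
            # Fire the escalation webhook once per outage; the service
            # behind it places an automated phone call rather than just
            # sending an email.
            urllib.request.urlopen(ALERT_HOOK, data=b"creatorswave down", timeout=10)
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```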
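Finally, a sketch of the corruption check that our random restore tests start from: confirm the checksum recorded at backup time, then make sure the archive actually decompresses. A full test would go on to restore the dump into a scratch database; the backup directory here is again a placeholder.

```python
#!/usr/bin/env python3
"""Backup verification: checksum match plus a full decompression pass."""
import gzip
import hashlib
from pathlib import Path

BACKUP_DIR = Path("/var/backups/db")  # placeholder backup location

def verify_dump(dump_path: Path) -> bool:
    # 1. Compare against the checksum written when the backup was taken.
    recorded = dump_path.with_name(dump_path.name + ".sha256").read_text().split()[0]
    actual = hashlib.sha256(dump_path.read_bytes()).hexdigest()
    if actual != recorded:
        return False

    # 2. Stream the whole archive through gzip, which raises on corrupt
    #    or truncated data.
    try:
        with gzip.open(dump_path, "rb") as fh:
            while fh.read(1 << 20):
                pass
    except (OSError, EOFError):
        return False
    return True

if __name__ == "__main__":
    for dump in sorted(BACKUP_DIR.glob("*.sql.gz")):
        print(dump.name, "OK" if verify_dump(dump) else "CORRUPT")
```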
With thanks to:
I would also like to say a huge thank you to Ben Cousins, who provided invaluable Linux support in helping to recover our environment whilst we awaited a response from OVH. I would have really struggled without him.
Ben has been a huge contributor to Creators Wave, both past and present, and has helped to shape our disaster recovery plans following this incident.
Once again, I would like to apologise for the inconvenience caused by these issues. Rest assured that no data has been lost, leaked, stolen or shared. We have maintained full security and have successfully restored our live environment. The entire environment has now been backed up to a compliant storage host, who store an offline backup of your data in accordance with guidelines provided by ourselves. This storage can be brought online only by a registered contact, as per the agreement we have in place.