Extended maintenance for Community Portal
Incident Report for Mautic Community
Postmortem

On Monday morning, a two-hour outage was planned to move our Decidim instance from a private image to the Mautic GitHub repository and DockerHub image, with the end goal of having automated deployment pipelines that run when a community PR is merged.

During the process, one of the encryption keys essential to Decidim’s functioning, which was supposed to have been backed up, was overwritten and could not be recovered.

We reached out to the hosting provider, Infomaniak, to request a rollback to a backup, but it was after hours, which meant we had to wait until the next morning.

The next day, we called Infomaniak again to request the rollback and they assured us it would happen in the early evening.

When we called for a status update, we were informed that they did not have any backups from which to restore. We therefore had to remove all encrypted data and reconfigure it with new keys, which happened on Wednesday morning.

Actions and learnings

  1. During the process, the existing setup was removed and rebuilt by pulling from the new infrastructure, which resulted in the keys being lost. In the future we will take the longer route of making the production instance read-only and setting up a separate working area to do the update before switching to it.
  2. We had been planning to set up off-site backups rather than relying solely on the hosting provider’s backups, which we now know do not exist. This work has been accelerated, and we now have backups of both the encryption keys and the entire instance in several locations for future resilience. Pipelines will be set up to automatically push backups at regular intervals.
  3. One of the people involved assumed the other had already set up off-site backups and did not think to question it. In the future we’ll explicitly confirm the state and integrity of our backups before making such major changes.
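
As an illustration of the third point, a pre-change check along the following lines could make the backup assumption explicit instead of leaving it to memory. This is only a minimal sketch, not the tooling we actually use: the function names are hypothetical, and it assumes the key material lives in a regular file that can be compared against its backup copies by checksum.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_verified(original: Path, backups: Iterable[Path]) -> bool:
    """Return True only if at least one backup is listed and every
    listed backup exists and has the same digest as the original."""
    copies = list(backups)
    if not original.is_file() or not copies:
        return False
    expected = sha256_of(original)
    return all(copy.is_file() and sha256_of(copy) == expected for copy in copies)
```

A deployment script could call backup_is_verified for each key file and refuse to continue if it returns False, so that a missing or stale backup stops the change before anything is overwritten.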
Posted Dec 06, 2023 - 15:50 UTC

Resolved
This incident has now been resolved.
Posted Dec 06, 2023 - 15:41 UTC
Monitoring
A fix has been implemented for the issues that were experienced. We will continue to monitor to ensure that all systems are working as expected.
Posted Dec 06, 2023 - 11:31 UTC
Update
We have implemented a rollback and are in the process of reconfiguring some aspects which have not been cleanly recovered. We anticipate this will be resolved on Wednesday morning. Until then, login and some pages will be unavailable.
Posted Dec 06, 2023 - 00:01 UTC
Update
We have identified the cause of the issues (broken links and not being able to log in), which relates to a mismatch in encryption keys. We will be rolling back in the morning when the infrastructure provider's support is available, which will resolve the issue and allow us to get back up and running.

Apologies for the inconvenience.
Posted Dec 04, 2023 - 16:42 UTC
Update
We are continuing to work on a fix for this issue.
Posted Dec 04, 2023 - 14:48 UTC
Identified
We are working on finalising the deployment of the Community Portal, which has encountered some problems with broken images and links not working.
Posted Dec 04, 2023 - 14:47 UTC
This incident affected: Mautic Community Portal.