April 14, 2022

Ensure your BCP plan accounts for the Cloud services you depend on going down

Filed under: Computer Software, My Thoughts, Tech Related — Suramya @ 1:53 AM

Long-time readers of the blog and folks who know me know that I am not a huge fan of putting everything on the cloud, and I have written about this in the past (“Cloud haters: You too will be assimilated” – Yeah Right…). Don’t get me wrong, the cloud does have its uses and advantages (some of them significant), but it is not something you want to get into without serious planning and thought about the risks. You need to ensure that the ROI of the move outweighs the increased risk to your company and your data.

One of the major misconceptions about the cloud is that once we put something there we no longer need to worry about backups, uptime, etc. because the service provider takes care of it. This is obviously not true. You still need local backups, and your BCP (Business Continuity Plan) needs to account for what you would do if the provider itself went down and the data in the cloud became unavailable.
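As a concrete example of what “local backups” can look like, here is a minimal sketch in Python (using the requests library) that exports Confluence Cloud pages to local JSON files through the REST content API (check the current API docs for your instance). The site URL, credential environment variables and backup path are placeholders, so treat this as a starting point to adapt rather than a finished tool:

#!/usr/bin/env python3
"""Nightly export of Confluence Cloud pages to local disk (illustrative sketch)."""
import json
import os
import pathlib

import requests  # third-party: pip install requests

# Placeholder site and credentials -- replace with your own
SITE = "https://yourcompany.atlassian.net/wiki"
AUTH = (os.environ["ATLASSIAN_EMAIL"], os.environ["ATLASSIAN_API_TOKEN"])
BACKUP_DIR = pathlib.Path("/backups/confluence")

def export_pages():
    """Page through the Confluence Cloud content API and save each page as JSON."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    start, limit = 0, 50
    while True:
        resp = requests.get(
            f"{SITE}/rest/api/content",
            params={"start": start, "limit": limit, "expand": "body.storage"},
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        for page in data.get("results", []):
            # Keep the raw storage-format body so the page can be re-created if needed
            (BACKUP_DIR / f"{page['id']}.json").write_text(json.dumps(page, indent=2))
        if data.get("size", 0) < limit:
            break  # last batch reached
        start += limit

if __name__ == "__main__":
    export_pages()

Run something like this from cron every night and you at least have the raw content of your wiki available when the hosted instance is unreachable.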

You think this is not something that could happen? The nine-day-and-counting outage over at Atlassian begs to differ. On Monday, April 4th, 20:12 UTC, approximately 400 Atlassian Cloud customers experienced a full outage across their Atlassian products. This is just the latest instance of a cloud provider going down and leaving its users in a bit of a pickle, and as per information sent to some of the affected clients it might take another two weeks to restore service for all users. In Atlassian’s own words, this is what went wrong:

One of our standalone apps for Jira Service Management and Jira Software, called “Insight – Asset Management,” was fully integrated into our products as native functionality. Because of this, we needed to deactivate the standalone legacy app on customer sites that had it installed. Our engineering teams planned to use an existing script to deactivate instances of this standalone application. However, two critical problems ensued:

Communication gap. First, there was a communication gap between the team that requested the deactivation and the team that ran the deactivation. Instead of providing the IDs of the intended app being marked for deactivation, the team provided the IDs of the entire cloud site where the apps were to be deactivated.
Faulty script. Second, the script we used provided both the “mark for deletion” capability used in normal day-to-day operations (where recoverability is desirable), and the “permanently delete” capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.

To recover from this incident, our global engineering team has implemented a methodical process for restoring our impacted customers.

To give you an idea of how serious this outage is, let me use my personal experience with their products and how they were used at one of my previous companies. Without Jira & Crucible/Fisheye, no one would be able to commit code to the repositories or do code reviews of existing commits. Users would not be able to do production/dev releases of any product. With Confluence down, users and teams can’t access the guides, instructions, SOP documents or documentation for any of their systems. Folks who use Bitbucket/Sourcetree would not be able to push code. This is the minimal-impact scenario; it gets worse for organizations with CI/CD pipelines and proper SDLC processes/lifecycles that depend on these products.

If the outage were on on-premises servers, the teams could fail over to the backup servers and continue, but unfortunately for them the issue is on the Atlassian side, and now everyone just has to wait for it to be fixed.

Commit blocks (pre-commit/post-commit hooks, etc.) can be disabled, but unless you have local copies of the documentation stored in Confluence you are SOL. We actually faced this issue once with our on-prem install, where the instructions on how to do the failover were stored on the Confluence server that had gone down. We managed to get it back up through a lot of trial and error, but after that all teams were notified that their BCP/failover documentation needed to be kept in multiple locations, including hard copy.
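One way to keep those blocks from turning into a hard dependency on the cloud service in the first place is to make them fail open. The hypothetical commit-msg hook below (Python, with placeholder Jira URL, credentials and key pattern) rejects a commit only when it can positively confirm that the referenced Jira issue does not exist; if Jira itself is unreachable it just warns and lets the commit through:

#!/usr/bin/env python3
"""commit-msg hook sketch: validate the Jira key, but fail open if Jira is unreachable."""
import os
import re
import sys

import requests  # third-party: pip install requests

JIRA_URL = "https://yourcompany.atlassian.net"  # placeholder site
AUTH = (os.environ.get("ATLASSIAN_EMAIL", ""), os.environ.get("ATLASSIAN_API_TOKEN", ""))
KEY_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")  # e.g. PROJ-1234

def main() -> int:
    # Git passes the path of the commit message file as the first argument
    message = open(sys.argv[1], encoding="utf-8").read()
    match = KEY_PATTERN.search(message)
    if not match:
        print("commit-msg: no Jira issue key found in the message", file=sys.stderr)
        return 1
    try:
        resp = requests.get(f"{JIRA_URL}/rest/api/2/issue/{match.group(0)}",
                            auth=AUTH, timeout=5)
        if resp.status_code == 404:
            print(f"commit-msg: Jira issue {match.group(0)} does not exist", file=sys.stderr)
            return 1
    except requests.RequestException:
        # Jira is down or unreachable: warn but do not block the commit
        print("commit-msg: Jira unreachable, skipping validation", file=sys.stderr)
    return 0

if __name__ == "__main__":
    sys.exit(main())

Drop it into .git/hooks/commit-msg (and make it executable) for client-side use; server-side hooks on the Git/Bitbucket server would need the equivalent change.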

If the companies using these services didn’t prepare for a scenario where Atlassian itself went down, then there are a lot of people scrambling right now to keep their businesses and processes running.

To prevent such issues, we should set up systems that take automatic backups of the online services and store them somewhere else (this can still be in the cloud, but with a different provider, or locally). All documentation should have local copies, and for really critical documents we should ensure hard-copy versions are available. Similarly, we need to ensure that any online repositories are backed up locally or with another provider.
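For the repository side of this, a small mirroring job usually does the trick. The sketch below (Python again, shelling out to git) keeps bare mirror clones of a list of cloud-hosted repositories on storage you control; the repository URLs and mirror path are placeholders, and in practice you would run it from cron and point it at a location outside the provider being mirrored:

#!/usr/bin/env python3
"""Mirror cloud-hosted Git repositories to local (or second-provider) storage."""
import pathlib
import subprocess

# Placeholder list of repositories to protect -- replace with your own
REPOS = [
    "git@bitbucket.org:yourteam/payments.git",
    "git@bitbucket.org:yourteam/infra.git",
]
MIRROR_ROOT = pathlib.Path("/backups/git-mirrors")

def mirror(repo_url: str) -> None:
    """Create or refresh a bare mirror clone of the given repository."""
    name = repo_url.rstrip("/").split("/")[-1]
    target = MIRROR_ROOT / name
    if target.exists():
        # Refresh an existing mirror (fetches all refs, prunes deleted ones)
        subprocess.run(["git", "remote", "update", "--prune"], cwd=target, check=True)
    else:
        # First run: create a bare mirror clone with every branch and tag
        subprocess.run(["git", "clone", "--mirror", repo_url, str(target)], check=True)

if __name__ == "__main__":
    MIRROR_ROOT.mkdir(parents=True, exist_ok=True)
    for url in REPOS:
        mirror(url)

Because a mirror clone carries every branch and tag, you can push it to a new remote and keep working even if the hosted service stays down for weeks.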

This is a bad situation to be in, and I sympathize with all the IT staff and teams trying to ensure that their company’s business keeps running uninterrupted during this time. The person who ran the script on the Atlassian side, on the other hand, should seriously consider getting some sort of evil-eye charm to protect themselves against all the curses flying their way (I am joking… mostly).

Well this is all for now. Will write more later.
