Posts Tagged ‘continous delivery’

SaaS Deployment Fail

October 13, 2013 Leave a comment
Push Button Deploy

Remove the human factor from your deployments to avoid mistakes.

Lesson Learned: Automate your entire deployment pipeline and don’t celebrate until you are done.

Most SaaS (software as a service) companies carefully plan and orchestrate deployments as a core competency. Many veterans of these deployments have learned to be good at it through experiencing failures and learning what not to do. One of the greatest learning opportunities is when things go terribly wrong and it takes heroics to correct. This story is one of those experiences.

In 2005 I was the Director of Quality at Convoq, a video conference start-up. We build web-based and desktop video conference tools that enabled applications like to evoke ad hoc or scheduled video conferences (Convoq ASAP Pro). The application was build upon the Adobe Flash Media Server. Customer accounts, their contacts, and conference history was stored in a relational database. We used physical hardware built from ghost images to assure our servers were carefully replicated for high availability.

We carefully planned and iterated on an instance of the Deployment Document for each release. Although we continued to add automated scripting for deployments, we still had to manually start the scripts and we had to add in all the verification points along the way. We also had a final Deployment Document sign-off ceremony prior to commencing any release. We thought we were pretty good at this pattern. During deployments a member from each group and members from IS would assemble in the conference room to work through the deployment document procedure together. The procedure included putting up a maintenance page during an outage, and we worked at making this outage as short a window as possible. We planned our releases to begin at 5pm Wednesdays, since Wednesday evenings seem to be less impact to customers on the west coast, it gave us all night to complete the deployment if things went wrong, and we had the next day with a full engineering staff to address any problems that might result from the deployment the night before. On planned major release days, we brought in beer to celebrate the completion of a release.

One planned release, we began as we normally do with both email and in-product notification of a scheduled release. There was a short outage window expected with this release. It should be very routine. Here is how it went as I recall it…

  • 5 pm ET: We assembled in the conference room as planned and proceeded to execute pre-deployment steps, including putting up the maintenance page. Each step was called out verbally when started and finished, for both changes and verification. One of the changes was a series of SQL statements that would update the schema and add related tables. Order of changes and completing verification of a change must precede the next change. This included migrating some data to new tables and updating the index.
  • 6 pm ET: As we proceeded to execute through the initial changes everything was going well. This seemed routine for us and the were a bit punchy, joking about customers, and having a great time. We were more than half-way through the release.
  • 6:30 pm ET: The last part is a piece of cake and we expected to complete by 7PM ET. We were feeling pretty good about this release and ready to celebrate. So, we broke out the beer. Boy that first gulp tasted good. Things were still going fine as we get down to final testing of these changes. Then, we noticed the contacts for one customer were appearing under a different customer. Really? How can that happen?
  • 7PM ET: The tone of the room starts to change. We check other customer data and find the same thing. We start searching through changes and database queries to figure out what happened. Apparently a database change was made before a prior database change completed. We lost track of verifying start and end to each step and got out of order. I recall that a change was made to correct the problem and made it worse. Agh, the beer must have clouded our minds and we messed up. All the beer went into the trash. We should not be celebrating this early.
  • 9 pm ET: We are still trying to understand the impact of the problem. Phone calls go out to the database experts to get them looking at the situation. We even had a couple engineers come into the office. We also were trying to replicate the problem in one of our test environments. We call spouses to indicate this will be a late night or all-nighter.
  • 11 pm ET: We are clearer about the issue now and believe we know how to correct it. I make a statement that “we just have to be back up by dawn.” 7AM ET is when our published SLA (service level agreement) with indicates when we will complete any maintenance window by this time, and shortly afterwards the CEO arrives each morning.
  • 1 am ET: We made progress with database corrections. However we still have a data problem that needs correcting.
  • 3 am ET: The coffee is not working to keep me awake. I do some pushups on the floor to wake me up and resume verifying fixes.
  • 5 am ET: We believe we have it all solved, but not ready to turn on traffic yet. It’s still dark out and the birds aren’t singing yet.
  • 6:45 am ET: We enable traffic again. Everything looks good. We keep verifying live data and watching logs to see that all new activity looks right.
  • 7:20 am ET: The CEO walks in the office. I’m usually in the office by then, so he doesn’t yet see anything out of the ordinary. He walks by my cube and asks how the release is doing. I told him everything is working well. He said “great” and proceeds to walk towards his office. He stops, looks back at me, and asks if those are the same clothes I had on the day before and forgot to shave, with a smirk on his face. I told him he was right, and then he noticed more folks in the office than usual at this hour. I then had to tell him we had some problems with the deployment that required us to work through the night, but we were live before dawn and met our SLA. He laughs and then asks me to come to his office to explain.

I don’t believe I mentioned the beer in my explanation. I did mention that we would be much more careful about keeping the deploy steps in order next time and automating as much as we could. It was the human element that got in the way of a successfully executed release. Most of our customers will never know there was any problem, and our SLA was met. We came dangerously close to having a far bigger impact to the field. This experience convinced me to strive for automating the deployment pipeline end to end, including validation between steps.

%d bloggers like this: