
Archive for the ‘Lessons Learned’ Category

Trust Your Instruments

February 4, 2017

Lesson Learned: Calibrate, monitor, and trust your instruments to keep on track.

I love to hear and tell stories to illustrate lessons learned. They really convey the emotional impact of the lesson at the time it was learned. This is a tale of two opposing true stories related to small craft navigation that actually happened to me. They take place a few years apart but have a number of similarities that make the point of the lesson learned.

Foggy Fail

In the summer of 1996 I was invited on a night-time striper fishing trip inside the mouth of the Merrimack River. The skipper was the brother of a colleague, and he had a few friends and relatives on the trip. This was a beautiful 36′ twin-screw motor yacht that was only a year old. It was a very calm, serene evening trip destined to anchor in the area inside the Ben Butler’s Toothpick rocks opposite Joppa Flats in Newburyport, MA. We dropped anchor in a two-point moor to keep our position fixed. I don’t recall what we were using as bait that night, but I know it was not artificial lures. We fished well into the night with little success, and no keepers, but it was a fun night. Shortly after midnight the fog rolled in quickly from the sea, engulfing the boat. It was a weeknight, getting late, and a bit cooler. So, we pulled up anchor and attempted to head back up river to the slip.

This boat was equipped with radar, and we could see the outline of the mainland mass and other larger objects, but we couldn’t see the sandbars or markers on the radar, and we couldn’t see the shore. The tide was low and going lower, but we could see and sense the flow of the river. The skipper started heading upriver. We could now see the first marker, and it looked like we were heading in the right direction. A little further on we saw some other boats anchored that we had to move around. And then the boat touched a sandbar. The skipper motored over that one and back into the channel. We thought things were going well, but the appearance of the channel didn’t match the radar. The skipper started to correct to get back on track with the radar, but had to motor over another sandbar. This didn’t make sense. We should have been on track and in the main channel, but as we moved forward we kept getting off course. The skipper debated backtracking downriver to get the channel and radar back in sync before going further upriver, but decided his instinct was better than the instruments. Big mistake. As it turns out, we were getting further up a tidal channel in Joppa Flats with the tide going out, and we ran hard aground on the sand flats. We hit hard, and you just knew there was damage to the underside. The skipper radioed for the tow boat to come help. It was an off-hour call, so it would take time for them to get there. Before we knew it the boat listed to the side and we were high and dry on the sand. The tow boat could not help except to taxi the guests back to the dock. The skipper had to wait until dawn for the fog to lift and the tide to come back in and float him off. It cost the skipper several thousand dollars that night in repairs and the tow back to the dock.

Foggy Success

In the summer of 2002 I took my Dad out for a ride in my new 21′ cuddy cabin I/O boat to the Isles of Shoals off Portsmouth, NH. It was a calm Saturday morning early in the season, when the water was still a bit cold. I set the Isles of Shoals waypoint in my dash-mounted GPS and we headed down river. It was nice and sunny, and we were up on a nice plane at 22 knots. We left the mouth of the Merrimack and veered off towards the Isles of Shoals. About 10 minutes into the ride we noticed what looked like fog on the horizon. It was moving towards us faster than I thought, and I could feel a westerly breeze picking up. As the fog approached I noticed my Dad looking around my dashboard curiously. My Dad is an old-school electrical engineer, who was a master at the slide rule, map calipers and parallel rules, and calibrating an oil-filled compass. So I had to ask him what he was looking for. He asked where my compass was. I pointed down to my foot locker and said it was in there. He looked up at me shocked as the fog began to engulf us and asked how the hell we were going to navigate now. I dropped off the plane to about 8 knots and continued on track.

I pointed to the GPS on the dashboard and said I was navigating by GPS. It told me our direction, constantly correcting for the westerly breeze nudging us to port. I told him I knew how fast we were going, where the nun buoy at the entrance to the harbor was that we were targeting, and how long it was going to take us at this rate. He said, “Oh, and what if that fails?” I pulled my backup waterproof handheld GPS with fresh batteries out of my pocket and said, “Then we use this.” He looked at me with guarded approval and said, “OK, if you say so.” Clearly my Dad had not used GPS for navigation before. At this point we had about 50′ of visibility. As we approached the targeted buoy, I slowed right down to an idle and said to look off the bow; we should see the buoy in a few seconds. The red nun then popped out of the fog. And my Dad said, “Son of a gun, it worked!” My Dad and I used very different tools to navigate nautical waters, but we each trusted our instruments and were able to keep on track with where we were going and when we would get there, even when we couldn’t see the target along the way.

Wisdom

Both of these stories are memorable. They both share the lesson that you need good instruments that give you the visibility to keep on track, and that you must trust those instruments to guide you when you can’t see the target.

The same can be applied to the SaaS industry. We have many ways to measure and calibrate our instruments for applications, their containers and infrastructure, and their tests. We must choose our measurement tools wisely, then calibrate, monitor, and trust our instruments to keep on track.



Quality Engineering in a Scaled Agile Organization

February 4, 2017

Test Pyramid – test at the earliest and lowest level feasible

Organizational structure is an important part of creating the culture and capability for Quality Engineers to be successful. In this post I share my insights from the journey of evolving the Quality Engineering role in a scaled Agile organization at a SaaS company. The Quality Engineering function is to drive defect prevention, assure continuous delivery confidence, and advocate for customer satisfaction. Defect prevention means efficiently testing the right thing at the earliest and lowest level feasible (the test pyramid). Continuous delivery confidence begins with the Story acceptance criteria and ends with doneness criteria and automated pipeline success. Customer satisfaction is both meeting all expectations as core value and adding more value with delighters that differentiate you from your competition, usually reflected in the NPS (Net Promoter Score) business metric.
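To make the test pyramid concrete, here is a minimal sketch of the layering. Our stacks were Java, RoR, and JS, so treat this Python as a language-agnostic illustration; the apply_discount function and the /health endpoint are hypothetical, not our actual code. The idea is many fast unit checks at the base and a thinner layer of slower service-level checks above them.

```python
# Minimal test-pyramid sketch: most checks live at the fast unit level,
# a thinner layer exercises the service through its REST interface.
# The business logic and endpoint below are hypothetical.
import unittest
from urllib.request import urlopen


def apply_discount(price, percent):
    """Pure business logic -- the cheapest place to catch a defect."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100.0), 2)


class UnitLevel(unittest.TestCase):
    """Base of the pyramid: fast, isolated, run on every commit."""

    def test_discount_math(self):
        self.assertEqual(apply_discount(100.0, 15), 85.0)

    def test_rejects_bad_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)


class ServiceLevel(unittest.TestCase):
    """Middle of the pyramid: fewer, slower checks against a running service."""

    def test_health_endpoint(self):
        # Assumes a locally running instance of the service under test.
        with urlopen("http://localhost:8080/health") as resp:
            self.assertEqual(resp.status, 200)


if __name__ == "__main__":
    unittest.main()
```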

 

Follow Conway’s Law


model team and app/test architecture to mimic each other to reduce friction

Model your team organization and your application/test architecture in similar ways to maximize capability and productivity in each…this is commonly known as Conway’s Law. Shaping engineering culture and software design are two sides of the same coin. They are synergistic patterns that must flow together, or friction will emerge in each. As a change agent, recognizing this natural tendency is eye-opening and makes driving change much more scientific: you shape change as an engineer. Whether you adopt a formal process like SAFe to scale your agile teams or are simply looking for ways to continuously improve specific issues, these insights apply.

 

Organizing Teams, Architecture, and Interfaces

In a modern SaaS (Software as a Service) Scrum Agile organization with more than a couple of teams, the teams and the architecture should be based on small cohesive units that operate with common interface standards and are loosely coupled to the organizations they interface with. In monolithic applications, the teams, architecture, and interfaces tend to be more tightly coupled with their consumers, with standards managed within each team. In micro-services applications, the teams, architecture, and interfaces tend to be more loosely coupled with their consumers, with centralized standards managed across teams so that the interfaces are consistent. Most organizations operate somewhere in between these two polar extremes.

In the past I have inadvertently created friction in one of these areas while attempting to solve problems in another. I have since realized that both have to change to achieve the aspirational results I was seeking. Making sense of this tendency makes strategic planning much clearer, more achievable, and easier to communicate.

Over the last 30 years I’ve worked in many software engineering organizational shapes and sizes, from a basement start-up to giants including Eastman Kodak and Agfa Typographic Systems (in their heyday). This notion applies in all cases, but the approach differs with the size of the organization. As an organization grows, so must the approach needed to move it along in the direction you want to take it. In the past 7 years I’ve experienced the maturation of moving from the monolithic pole towards the distributed pole.

Quality Assurance to Quality Engineering

In 2010 I re-branded Quality Assurance to Quality Engineering on the advice of our VP of Engineering at the time, with the interest of moving the bar on our technical expectations significantly up the engineering scale, towards defect prevention. This proved to be a very effective idea as we continued to drive test automation out to each of the Scrum teams and improve the technical capability within each of these teams. Looking back over the last 7 years of improvement, we went from ~800 tests per week to over 10,000 tests in about 3 to 4 hours. Through growth and attrition we transformed the make-up of the organization to be mostly software engineers with a quality engineering specialty. The regression test automation push swung the pendulum significantly in the direction of automating everything we wanted in order to assure confidence that working software stays working. So much so that we then had to reinforce balancing time spent manually exploring the customer experience against building and maintaining an adequate regression suite.

During this time we evolved our application architecture to be much more distributed (pulling business logic back to services with RESTful interfaces), expanding application stacks (Java, RoR, JS), and providing shared components. Our test approaches also continued to evolve to align with our application code designs, requiring common continuous integration test infrastructure and reporting services with RESTful interfaces, and test repositories co-existing in application code repositories. We continued to invest in test infrastructure, evolving the functional test automation team (dubbed the “Autobots”) into a “Center of Excellence” (CoE) team, and later spinning off a Performance CoE team.
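As an illustration of what a reporting service with a RESTful interface can look like from a CI job’s point of view, here is a minimal sketch; the endpoint URL and payload fields are hypothetical assumptions, not our actual service.

```python
# Minimal sketch of a CI job reporting a test-run summary to a central
# results service over REST. The endpoint and payload fields are hypothetical.
import json
from urllib.request import Request, urlopen


def report_test_run(service_url, suite, passed, failed, duration_s):
    """POST one test-run summary; returns the HTTP status code."""
    payload = {
        "suite": suite,
        "passed": passed,
        "failed": failed,
        "duration_seconds": duration_s,
    }
    req = Request(
        service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # Example call against a hypothetical internal results service.
    status = report_test_run(
        "http://test-results.internal/api/runs", "checkout-regression", 412, 3, 2700
    )
    print("report accepted" if status in (200, 201) else "report failed")
```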

As we grew in size, we also grew the Quality Engineering leadership into a matrixed organization. By the summer of 2015, my Quality Engineering organization at Constant Contact had grown to 50 people (in a 200-person engineering and operations organization). This included 5 Quality Engineering Managers and 2 QE Architects (functional and performance engineering), with test automators embedded in 18 product delivery scrum teams distributed across 5 geographical locations. During this time, we were growing the role of the Scrum Master, limiting the backlog control of Development Managers to an enabling role for the team. There were a lot of efficiencies in Quality Engineering, including a closely knit community of quality engineers. However, there was a growing animosity amongst the Development Managers about silos between Dev and Test roles under different line managers, and a concern about losing control of the work as teams became self-organizing. We knew we needed to change again to improve our efficiencies in the product delivery teams.

Continuous Delivery

The notion of DevOps was introduced to us in 2013. Spurred by the novel “The Phoenix Project” by Gene Kim, and the book “Continuous Delivery” by Jez Humble and David Farley, we charted a direction to improve our efficiencies in how we deliver changes and value to the customer. We were locked into monolithic release cycles for most applications and needed to get to independent release capability. This spawned the creation of a CD team to stitch together a continuous delivery pipeline, including both commercial and internally developed services. Using a pipeline like this requires re-plumbing our applications toward micro-services and static UI application design, changing the way we manage our backlogs and delivery commitments, and changing how we operate as teams. All changes should be independently testable and deliverable. The vision is to release value to the customer an Agile Story at a time, mid-Sprint, with confidence.

 


automate the product delivery pipeline with validation and test feedback loops as gates
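To illustrate what “validation and test feedback loops as gates” means in practice, here is a minimal, hypothetical sketch of a pipeline driver that stops at the first failing stage; the stage names and make targets are illustrative assumptions, not our actual pipeline.

```python
# Minimal sketch of a delivery pipeline where each stage is a gate:
# a failing stage stops the pipeline and nothing later is attempted.
# Stage names and commands are illustrative only.
import subprocess
import sys

STAGES = [
    ("build", ["make", "build"]),
    ("unit tests", ["make", "unit-test"]),
    ("deploy to staging", ["make", "deploy-staging"]),
    ("service tests", ["make", "service-test"]),
    ("deploy to production", ["make", "deploy-prod"]),
]


def run_pipeline():
    for name, command in STAGES:
        print(f"--- stage: {name} ---")
        result = subprocess.run(command)
        if result.returncode != 0:
            # Fast feedback: fail the whole pipeline at the first broken gate.
            print(f"stage '{name}' failed; stopping pipeline")
            return False
    print("all gates passed; change delivered")
    return True


if __name__ == "__main__":
    sys.exit(0 if run_pipeline() else 1)
```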

 

As we looked to reduce the cultural leadership friction in these teams as well, we re-organized the Quality Engineering organization again. This time we re-aligned the Quality Engineers embedded in the scrum product delivery teams to report to the Engineering Managers (formerly Development Managers), re-assigning the QE Managers to other roles. There was also a conscious movement to reduce the number of quality engineers on some teams, in favor of more test development during application development. There was an expectation for software engineers to take more ownership of testing, sharing the test burden across the team. The whole team is responsible for quality craftsmanship, and the quality engineer brings the quality consciousness, test leadership, and customer advocacy.

Preserve the Quality Engineering Community – Guild and CoE

These changes led to the destruction of the QE community as we knew it, and the remaining Quality Engineers had a sense of abandonment by the Quality Engineering leadership. The Engineering Managers felt the rising animosity and the need to resurrect the QE community. To address this, I introduced the formation of the QE Guild community of practice, inspired by the Guild approach pioneered by Spotify and encouraged by Janet Gregory and Lisa Crispin in their book “More Agile Testing” under Organizational Culture. This introduced a loosely grouped community of test-minded engineers focused on career and capability development in the craft of quality engineering, led by a guild coordinator (me as Dir QE for now). We also preserved the CoE (Center of Excellence) “capability radiator” team approach: centralized teams that provide evolving capability to the product delivery teams for the craft of quality engineering.


The Quality Engineering Guild is an organic and wide-reaching “community of interest” group of people across the organization who want to share knowledge, tools, code, and practices (search Spotify Guilds for more about Guilds)

 

Continuous Improvement

Reflecting on where we are now and where we go from here, I can tie this back to the notion of Conway’s Law. Our software architecture is driven more tightly around RESTful standards, micro-services, single-page GUIs (CDN edge-cached), and a managed public authentication layer. The Product Manager (owner) organization also now tightly manages an organizationally ranked backlog of Epics that directs the priorities and alignment of the agile teams. Scrum teams organize around delivering features with these architectural constructs in mind. Continuous test is now part of delivering each Story, and we have a zero-defect policy (issues found during the Story gate its delivery). These teams operate as cohesive “developer”-“tester”-“product owner” partnerships, where the whole team owns quality. It’s not all that way yet, but we continue to strive for this culture.

Constant Contact was purchased by Endurance International Group in February 2016, with the same mission to improve the online marketing success of small businesses. The mission is the same, but the brand organizations are very different. At the time of this writing we are in the midst of aligning the many brands, including the Constant Contact premium email marketing brand. Merging different organizations like this, although at a grander scale, requires similar patterns of re-organization to those we experienced within the Constant Contact brand, and Conway’s Law continues to apply. The product offerings and feature teams may change, but the idea that architecture and team organization beget each other is a truth that leads to greater efficiency in scaling Agile as the organization grows…tightly integrated teams, loosely coupled architectural layers with strict interface and performance standards, and continuous delivery pipelines with fast feedback loops.

In the end, the customer wins with a stronger company delivering better results faster.

SaaS Deployment Fail

October 13, 2013
Push Button Deploy

Remove the human factor from your deployments to avoid mistakes.

Lesson Learned: Automate your entire deployment pipeline and don’t celebrate until you are done.

Most SaaS (software as a service) companies carefully plan and orchestrate deployments as a core competency. Many veterans of these deployments have learned to be good at it through experiencing failures and learning what not to do. One of the greatest learning opportunities is when things go terribly wrong and it takes heroics to correct. This story is one of those experiences.

In 2005 I was the Director of Quality at Convoq, a video conference start-up. We built web-based and desktop video conference tools that enabled applications like Salesforce.com to invoke ad hoc or scheduled video conferences (Convoq ASAP Pro). The application was built upon the Adobe Flash Media Server. Customer accounts, their contacts, and conference history were stored in a relational database. We used physical hardware built from ghost images to assure our servers were carefully replicated for high availability.

We carefully planned and iterated on an instance of the Deployment Document for each release. Although we continued to add automated scripting for deployments, we still had to manually start the scripts, and we had to add in all the verification points along the way. We also had a final Deployment Document sign-off ceremony prior to commencing any release. We thought we were pretty good at this pattern. During deployments, a member from each group and members from IS would assemble in the conference room to work through the deployment document procedure together. The procedure included putting up a maintenance page during an outage, and we worked at making this outage as short a window as possible. We planned our releases to begin at 5 pm Wednesdays, since Wednesday evenings seemed to have less impact on customers on the west coast, it gave us all night to complete the deployment if things went wrong, and we had the next day with a full engineering staff to address any problems that might result from the deployment the night before. On planned major release days, we brought in beer to celebrate the completion of a release.

For one planned release, we began as we normally did, with both email and in-product notification of a scheduled release. There was a short outage window expected with this release. It should have been very routine. Here is how it went as I recall it…

  • 5 pm ET: We assembled in the conference room as planned and proceeded to execute the pre-deployment steps, including putting up the maintenance page. Each step was called out verbally when started and finished, for both changes and verification. One of the changes was a series of SQL statements that would update the schema and add related tables. Changes had to be made in order, and verification of each change had to complete before the next change began. This included migrating some data to new tables and updating the index.
  • 6 pm ET: As we proceeded to execute through the initial changes, everything was going well. This seemed routine for us, and we were a bit punchy, joking about customers, and having a great time. We were more than half-way through the release.
  • 6:30 pm ET: The last part was a piece of cake, and we expected to complete by 7 pm ET. We were feeling pretty good about this release and ready to celebrate. So, we broke out the beer. Boy, that first gulp tasted good. Things were still going fine as we got down to final testing of these changes. Then, we noticed the contacts for one customer were appearing under a different customer. Really? How can that happen?
  • 7 pm ET: The tone of the room starts to change. We check other customer data and find the same thing. We start searching through changes and database queries to figure out what happened. Apparently a database change was made before a prior database change completed. We lost track of verifying the start and end of each step and got out of order. I recall that a change was made to correct the problem and it made things worse. Agh, the beer must have clouded our minds and we messed up. All the beer went into the trash. We should not have been celebrating this early.
  • 9 pm ET: We are still trying to understand the impact of the problem. Phone calls go out to the database experts to get them looking at the situation. We even had a couple engineers come into the office. We were also trying to replicate the problem in one of our test environments. We call spouses to indicate this will be a late night or an all-nighter.
  • 11 pm ET: We are clearer about the issue now and believe we know how to correct it. I make a statement that “we just have to be back up by dawn.” Our published SLA (service level agreement) states that we will complete any maintenance window by 7 am ET, and shortly after that the CEO arrives each morning.
  • 1 am ET: We made progress with the database corrections. However, we still have a data problem that needs correcting.
  • 3 am ET: The coffee is not working to keep me awake. I do some pushups on the floor to wake me up and resume verifying fixes.
  • 5 am ET: We believe we have it all solved, but not ready to turn on traffic yet. It’s still dark out and the birds aren’t singing yet.
  • 6:45 am ET: We enable traffic again. Everything looks good. We keep verifying live data and watching logs to see that all new activity looks right.
  • 7:20 am ET: The CEO walks in the office. I’m usually in the office by then, so he doesn’t yet see anything out of the ordinary. He walks by my cube and asks how the release is doing. I tell him everything is working well. He says “great” and proceeds to walk towards his office. He stops, looks back at me, and with a smirk on his face asks if those are the same clothes I had on the day before and whether I forgot to shave. I tell him he is right, and then he notices more folks in the office than usual at this hour. I then have to tell him we had some problems with the deployment that required us to work through the night, but we were live before dawn and met our SLA. He laughs and then asks me to come to his office to explain.

I don’t believe I mentioned the beer in my explanation. I did mention that we would be much more careful about keeping the deploy steps in order next time and automating as much as we could. It was the human element that got in the way of a successfully executed release. Most of our customers will never know there was any problem, and our SLA was met. We came dangerously close to having a far bigger impact to the field. This experience convinced me to strive for automating the deployment pipeline end to end, including validation between steps.
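As a sketch of what “validation between steps” can look like when the machine, rather than the people in the room, enforces the ordering, here is a minimal, hypothetical example; the step names stand in for the real SQL migrations, data backfills, and index updates.

```python
# Minimal sketch of an ordered deployment where each step's verification
# must pass before the next step is allowed to run. Steps are hypothetical.


class StepFailed(Exception):
    pass


def run_step(name, change, verify):
    """Apply one change, then verify it; refuse to continue on failure."""
    print(f"applying: {name}")
    change()
    if not verify():
        raise StepFailed(f"verification failed after step '{name}'")
    print(f"verified: {name}")


def deploy(steps):
    for name, change, verify in steps:
        run_step(name, change, verify)
    print("deployment complete")


if __name__ == "__main__":
    # Illustrative steps only; real ones would run SQL migrations,
    # data backfills, and index rebuilds against the database.
    state = {"schema_updated": False, "data_migrated": False}

    steps = [
        ("update schema",
         lambda: state.update(schema_updated=True),
         lambda: state["schema_updated"]),
        ("migrate data",
         lambda: state.update(data_migrated=True),
         # The gate: data migration is only valid if the schema step held.
         lambda: state["schema_updated"] and state["data_migrated"]),
    ]
    deploy(steps)
```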

My Biggest Test Mistake

October 13, 2013

Lesson learned: A one character change can have a very significant impact on quality.

I love to hear and tell stories to illustrate lessons learned. They really convey the emotional impact of the lesson at the time it was learned. This is a story about a very costly mistake, made while attempting a small one-character change in application resources, that I would never want to repeat.

In 1993, I was working at Custom Applications‡ on the FreedomOfPress product line. This was a print driver application that provided a software PostScript RIP (raster image processor) for rendering pages to over 150 different printers. We were releasing a new Mac version for mass distribution to the major software distributors (Ingram, MicroD, etc.). At the time, software was distributed on CDs pressed from a glass master and silk-screened. This included sending a CD image to the disc duplicator, who created a glass master image and then stamped out tens of thousands of resin discs. The discs were silk-screened with images and print, and assembled into a printed box along with a printed manual and other material.

A blockbuster release of the new Macintosh Centris 660 AV was the hottest thing in desktop publishing, and the CEO had it in his office. We were able to get Apple hardware ahead of distribution to develop and test with, but we had only one. We had a chance to test with it at times, but it was not available to us at all times. As we were preparing for the trade show where Apple was unveiling this new hardware, we received our registered trademark ® certificate. Of course we had to have this symbol on our box and on the software! The marketing department was all over this.

We received a request from marketing to add one last change and test it, on the day we had to send out the disc image for duplication. This was only a one-character change, right? How hard could it be? Our application artwork was created on a Solaris workstation and saved to the Mac as a resource. We were asked to add the ® symbol to the splash screen that appears when you start up the application. The text was updated on the Solaris workstation and saved to the Mac (as we normally did, right?). We built and spot-checked this new version, with the ® symbol proudly displayed on the splash screen, on machines we had in the lab. We then sent the new disc image to the duplicators by overnight courier.

The next morning the CEO comes into the office and installs the new version of FreedomOfPress onto the Centris 660AV. He launches the application and the splash screen appears first…then the system restarts. What (?!), he says to himself…he thought he must have done something wrong. When the system comes back up he tries again, with the same result. He calls me in, explains what he experienced, and shows me what is happening. Wow! How did this happen and how did we miss this? Well, that machine was nicely tucked into his office when we were testing permutations of machines and printers the day before.

The VP of Engineering was called in and we discussed what might be the issue. After all, we had only made a one-character change from what was running properly on this machine only 24 hours earlier. This is bad…very bad. All the engineers involved in making this change looked carefully at the code and resources and diff’ed the files. Ah, there is one difference other than the ® symbol: a control+D character appears at the end of the text file. This was a normal file termination character for the text file created by the editing application on the Solaris workstation. This control character means something very different on a Mac, however, instructing the computer to reset. Ouch! We had to stop the disc duplication immediately and get a new image to the duplicators.

Marketing called the disc duplicators and found out that they had already created the glass masters and proceeded with high-speed manufacturing of the disc copies. They had already created thousands of copies. Agh, they had to stop the line, and those masters and copies had to be trashed. After removing the control+D character, we spent that day testing through the disc image again, including on the Centris 660AV. The new image was sent out to the duplicators overnight again, and they rushed through the order. We were able to get the new product boxes to the trade show just in the nick of time.

This was an expensive one-character change. As it turns out, a corner was cut when making this change: the text was not re-edited and saved on the Mac before adding the resource to the bundle. If this step had not been skipped, the control+D would not have appeared in the final resource file and this problem would not have been introduced. Also, we should have included the new hardware in the list of things we had to verify.
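In hindsight, this is also the kind of defect a tiny automated check could have caught before the image shipped. Here is a minimal sketch, not part of our actual process at the time, that scans resource text files for stray control characters such as the control+D (0x04) that bit us; the file paths in the example are hypothetical.

```python
# Minimal sketch: scan resource text files for stray control characters
# (such as the Ctrl+D / 0x04 that caused the reset) before building the image.
# File paths are hypothetical.
import sys

# Allow common whitespace; flag every other ASCII control character.
ALLOWED = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return


def find_control_chars(path):
    """Return a list of (offset, byte) pairs for unexpected control bytes."""
    bad = []
    with open(path, "rb") as f:
        for offset, byte in enumerate(f.read()):
            if byte < 0x20 and byte not in ALLOWED:
                bad.append((offset, byte))
    return bad


if __name__ == "__main__":
    # Example: python check_resources.py splash.txt about.txt
    clean = True
    for path in sys.argv[1:]:
        for offset, byte in find_control_chars(path):
            print(f"{path}: control character 0x{byte:02X} at offset {offset}")
            clean = False
    sys.exit(0 if clean else 1)
```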

The real lesson learned here is to pay very close attention to the proven process and not cut any corners when making even simple changes, especially when making a seemingly benign change at the last moment.

‡ a.k.a. ColorAge, purchased by Splash in the late 1990’s
