This post was originally published on CIO.com on April 30th, 2020.
As engineering leaders, we are often asked to make strategic decisions about the direction the company should take. How do we decide whether to prioritize this initiative or that one? I agree with Jim Barksdale that it’s best to start with data.
However, I’ve observed that there is often a conflict between the need to make strategic decisions and having the data needed to make them. Tactical decisions are often much easier. For example, are we running out of CPU? Get faster servers. But even in straightforward cases like this, other considerations can raise questions such as “are we better off spending two weeks optimizing the code?” One of the things I often advise new engineering leaders is to make it a spreadsheet problem. Or, in the case of technical decisions, make it a business problem.
Making a problem a spreadsheet problem can often be simplified to “it’s just math” or “just look at the data”. If we look at our CPU case, what will it cost to move to a faster server instance? Over what time period? What are the other possible ways of solving the problem? Can we shift some of the workload somewhere else? If the workload is not constant, could we spin down resources for part of the day? If we are going to commit engineering resources to improve the code, how long would that take? How many engineers? How much do those resources cost, taking opportunity costs into account?
You can see how the problems may not remain trivial for long. But if we consistently use the same time period when calculating Return on Investment (ROI), we can often see that the company might prefer to spend $200 more a month on server infrastructure rather than $5,000 over three weeks of engineering time to solve a problem ($2,400/yr vs. $5,000/yr). The answer may not have seemed obvious at first, but most organizations would gladly pay less than half as much to achieve the same result.
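To make the comparison concrete, here is a minimal Python sketch of the same arithmetic. It assumes a 12-month horizon and treats the optimization work as a single one-time cost with no ongoing maintenance; the function names are illustrative and not from any particular tool.

```python
# Compare two fixes for a CPU shortfall over the same time horizon.
# Dollar figures are the example numbers from the text; the 12-month
# horizon is an assumption that puts both options on equal footing.

HORIZON_MONTHS = 12


def infrastructure_cost(monthly_increase: float, months: int) -> float:
    """Recurring cost of moving to a larger server instance."""
    return monthly_increase * months


def optimization_cost(one_time_cost: float) -> float:
    """One-time engineering cost of optimizing the code.

    Assumption: ongoing maintenance of the optimized code is ignored.
    """
    return one_time_cost


options = {
    "bigger servers": infrastructure_cost(200, HORIZON_MONTHS),
    "optimize code": optimization_cost(5_000),
}

# List the options from cheapest to most expensive.
for name, cost in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{name:>15}: ${cost:,.0f} over {HORIZON_MONTHS} months")

cheapest = min(options, key=options.get)
ratio = options["bigger servers"] / options["optimize code"]
print(f"cheapest: {cheapest} ({ratio:.0%} of the alternative)")
```

With these inputs the server option comes out at $2,400 against $5,000, or 48% of the cost. Changing HORIZON_MONTHS is a quick way to see where the answer flips; here that happens once the horizon runs past 25 months.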
Douglas Hubbard popularized the idea that even intangibles can be measured in his book How to Measure Anything. In essence, it promotes the wise notion that getting ANY data and making an educated guess will always be better than wild guessing, in the same spirit as Jim Barksdale’s advice above.
I have gone to my management and made the case for killing a product that was rapidly losing customers. We had to demonstrate the actual costs of infrastructure, servers, power, etc., but also the costs of having dedicated engineers on call to handle outages, apply security patches, etc., and the fact that those engineers were spending their time on a declining product instead of creating opportunities by working on products in their growth phase. The time spent by those engineers had costs beyond salary, including morale and missed contributions to other projects. We contrasted this with the revenue, major contract dates, and reputational costs (of a dying product), and were able to project a date after which it made little sense to continue operating the product.
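A sunset projection like this can be sketched in a few lines of Python. This is a minimal model under stated assumptions: revenue decays by a constant fraction each month, operating costs are flat, the engineers’ opportunity cost is an educated guess entered as a dollar figure, and reputational costs are left out. All figures are hypothetical, and the `committed_until` parameter is an illustrative stand-in for the contract dates the product must honor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductEconomics:
    monthly_revenue: float   # current monthly revenue (USD)
    monthly_decline: float   # fraction of revenue lost each month, e.g. 0.08
    infrastructure: float    # servers, power, cloud bill per month
    engineering: float       # on-call and patching engineers per month
    opportunity_cost: float  # educated guess: value of their work elsewhere

    def monthly_cost(self) -> float:
        return self.infrastructure + self.engineering + self.opportunity_cost

    def revenue_in_month(self, month: int) -> float:
        # Month 0 is the current month; revenue decays geometrically.
        return self.monthly_revenue * (1 - self.monthly_decline) ** month


def sunset_month(p: ProductEconomics, committed_until: int = 0,
                 max_months: int = 120) -> Optional[int]:
    """First month when revenue no longer covers cost, but not before
    existing contract commitments end. None if it never happens."""
    for month in range(max_months + 1):
        if p.revenue_in_month(month) < p.monthly_cost():
            return max(month, committed_until)
    return None


# Hypothetical numbers for illustration only.
product = ProductEconomics(
    monthly_revenue=100_000,
    monthly_decline=0.08,
    infrastructure=30_000,
    engineering=40_000,
    opportunity_cost=10_000,
)
print(sunset_month(product))                     # no contract constraint
print(sunset_month(product, committed_until=6))  # a contract runs to month 6
```

With these inputs, revenue falls below the $80,000 monthly cost in month 3, so the first call returns 3 and the second returns 6 because the contract holds the product open until then.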
But not all problems are as easy to demonstrate as numbers on a bill from a cloud provider (albeit mixed with some educated guesses).
One of the things I like about Agile is its bias toward predictability. Yes, the agility of Agile has benefits of its own; however, let’s focus on how it contrasts with one of the biggest complaints about the Waterfall method of project delivery: it was almost impossible to predict with any certainty that any two steps would line up on a date in the future. Agile gave up on long-range prediction and focused on certainty over shorter time horizons, so that if my teams said something could be delivered in two weeks, it was very likely to be delivered in two weeks. The team wasn’t building estimates on top of estimates on top of estimates. Their work systems were constrained either by a time box, as in Scrum, or by a Work In Progress (WIP) limit, as in Kanban. This is why Waterfall still works in construction: the plumber is not asked to work on a pipe by three different teams at once.
If I’m a front line engineering manager and I approach my boss asking for more headcount, I’m almost certain to be met with a “Why do you feel you need more headcount?” response. I can make all kinds of arguments about missed deadlines, upcoming vacations in the summer, or different departments asking my team for help with their projects. Or, I can make it a spreadsheet problem.
The ability to measure work in Agile allows us to turn questions like “Why do you feel you need more headcount?” into a spreadsheet problem. In Scrum, we have Average Team Velocity, the average amount of work a team completes in a time period (typically a sprint). In Kanban, we have lead time and cycle time: lead time measures how long work takes to complete from when it is introduced, and cycle time measures how long it takes from when work begins.
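As an illustration of these metrics, here is a minimal Python sketch that computes average velocity from sprint totals and lead and cycle times from ticket dates. The sprint totals and ticket dates are hypothetical; a real team would pull them from its own tracker.

```python
from datetime import date
from statistics import mean

# Scrum: story points completed in each recent sprint (hypothetical).
sprint_points = [92, 105, 98, 110, 95]
average_velocity = mean(sprint_points)

# Kanban: (introduced, work began, completed) for each ticket (hypothetical).
tickets = [
    ("2020-03-02", "2020-03-04", "2020-03-09"),
    ("2020-03-03", "2020-03-09", "2020-03-12"),
    ("2020-03-05", "2020-03-06", "2020-03-13"),
]


def days_between(start: str, end: str) -> int:
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# Lead time runs from when work is introduced; cycle time from when it begins.
lead_times = [days_between(introduced, done) for introduced, _, done in tickets]
cycle_times = [days_between(began, done) for _, began, done in tickets]

print(f"average velocity:   {average_velocity} points per sprint")
print(f"average lead time:  {mean(lead_times)} days")
print(f"average cycle time: {mean(cycle_times)} days")
```

These sample numbers give a velocity of 100 points per sprint, an average lead time of 8 days, and an average cycle time of 5 days. The gap between the two is time work spends waiting before anyone starts it.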
When faced with the question of why we need more headcount, we can go back to systems thinking. We need to start the conversation in terms of business problems. Is the company getting everything it needs out of the team? Does it want to accomplish more? Now we can start turning the conversation into a spreadsheet problem. How much is it worth to the company for the team to accomplish more? Sometimes the answer is that we should get more work out of the same team members.
If we look at the field of Joint Cognitive Systems (yes, CIOs always operate in one), David Woods and Erik Hollnagel describe four responses to overload in a system:
- Shed load - do less work
- Reduce thoroughness - cut corners, decrease quality
- Shift work in time - we can’t do this with humans because of burnout, contracts, etc.
- Recruit resources - hire (or transfer in) people!
Few leaders would choose to shift work in time or reduce thoroughness. If the business wants more productivity from a team, delivering lower-quality work, or delivering far less work as a result of burnout, is not a good strategy. If we want more productivity, shedding load (doing less work at once) is actually a strategy. In Agile we often say, “You have to go slow to go fast,” with the analogy that if a pipe leaks when running at 100%, the solution is often to run it at 80%. This is why, so often in Kanban, simply reducing WIP produces better results.
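One common way to reason about why limiting WIP helps, not something this post cites but a standard result from queueing theory, is Little’s Law: average cycle time equals average WIP divided by average throughput. The sketch below assumes throughput holds steady at a hypothetical 5 items per week while WIP changes.

```python
def average_cycle_time(avg_wip: float, throughput_per_week: float) -> float:
    """Little's Law rearranged: time in system = WIP / throughput."""
    if throughput_per_week <= 0:
        raise ValueError("throughput must be positive")
    return avg_wip / throughput_per_week


THROUGHPUT = 5  # completed items per week (hypothetical, held constant)

for wip in (20, 15, 10):
    weeks = average_cycle_time(wip, THROUGHPUT)
    print(f"WIP {wip:>2} -> average cycle time {weeks:.1f} weeks")
```

Halving WIP from 20 to 10 halves the average cycle time from 4 weeks to 2. The catch is the constant-throughput assumption: once WIP drops low enough that people sit idle or wait on each other, throughput falls too, which is consistent with the point of diminishing returns the post describes.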
But if it’s a spreadsheet problem, we can show that over the past three months we reduced WIP several times, and each time our cycle time decreased as well, up to a point. After that, any further reduction increased our cycle time. It’s a spreadsheet problem because it can be compared numerically! We can also say that our average Scrum velocity is 100 and that the feedback we’ve received from the business is that people get their deliverables about 20% later than they need them (though always on the date we committed to). If each team member contributes about 20 points of that 100, adding another team member is a spreadsheet problem!
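Both comparisons from that paragraph fit in a short Python sketch. The WIP history is hypothetical, and `extra_headcount` is an illustrative helper that assumes capacity scales linearly with team size and that new members contribute immediately, which only holds up to a point.

```python
import math

# Hypothetical history: WIP limit tried -> average cycle time (days) observed.
wip_history = {12: 9.5, 10: 8.0, 8: 6.5, 6: 6.0, 4: 7.5}
best_wip = min(wip_history, key=wip_history.get)
print(f"lowest average cycle time at WIP limit {best_wip}")


def extra_headcount(velocity: float, lateness: float,
                    points_per_member: float) -> int:
    """Additional people needed so work lands when the business needs it.

    If deliverables arrive `lateness` (e.g. 0.20) later than needed, the
    team must deliver (1 + lateness) times its current velocity. Assumes
    each new member adds `points_per_member` with no ramp-up cost.
    """
    shortfall = velocity * lateness
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(shortfall / points_per_member, 6))


print(extra_headcount(velocity=100, lateness=0.20, points_per_member=20))
```

In the hypothetical history, cycle time bottoms out at a WIP limit of 6 and rises again at 4. With the numbers from the text (velocity 100, deliverables 20% late, 20 points per person), the function returns 1, the one additional team member the paragraph argues for.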
At a certain point, adding more team members will stop adding capacity to the system, just as further reducing WIP will stop improving our cycle time. But, as we learned from Hubbard’s book above, making decisions with data is always better than a wild guess.
Driving decisions from data applies in many areas of the business. In DevOps we often talk about Gene Kim’s First Way, in which we try to optimize the performance of the system as a whole. A local optimization will generally not be our best investment in overall throughput. A simple example would be attaching a 3-inch pipe to the end of a 2-inch pipe: the bottleneck is still the 2-inch pipe. But we have techniques available to turn that problem into a spreadsheet problem. Borrowing from Lean manufacturing, we can apply Value Stream Mapping, where we look at each step in the value stream and record which steps take the most time. If, at the end of the exercise, our spreadsheet shows that a single step takes 2.5 of the 3 hours required for a value stream, we know where to focus our energy.
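A value stream map recorded in a spreadsheet boils down to finding the step with the largest share of total time. This minimal Python sketch uses hypothetical step names and durations, in whole minutes to avoid floating-point rounding, chosen to match the 2.5-of-3-hours example.

```python
# Hypothetical value stream: (step, elapsed minutes).
value_stream = [
    ("build", 6),
    ("automated tests", 15),
    ("wait for manual approval", 150),
    ("deploy", 9),
]


def bottleneck(steps: list[tuple[str, int]]) -> tuple[str, int, float]:
    """Return the slowest step, its duration, and its share of the total."""
    total = sum(minutes for _, minutes in steps)
    name, minutes = max(steps, key=lambda step: step[1])
    return name, minutes, minutes / total


name, minutes, share = bottleneck(value_stream)
total_hours = sum(m for _, m in value_stream) / 60
print(f"{name}: {minutes / 60:.1f} of {total_hours:.1f} hours ({share:.0%})")
```

This reports the approval wait as 2.5 of 3.0 hours, about 83% of the total. Speeding up any other step, the equivalent of widening the pipe downstream of the 2-inch section, barely moves the overall time.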
Finally, there was a time I had to go to an executive vice president and ask for budget to build a continuous delivery pipeline. It would let us run automated tests on the code being written, to ensure high quality in both the code and the result. One of the principles we learn from Jez Humble and Dave Farley in their book Continuous Delivery is that the confidence we can have in the quality of our deliverable is related to the amount of testing we do. If I do very little testing, I will have low confidence; if I do a large amount of testing, I will have higher confidence. We worked out how much test coverage we felt we could build for different dollar amounts. When asked how much money we needed for the new pipeline, we made it a spreadsheet problem and asked how confident they wanted to feel in the product. We turned a technical problem into a business problem. After that, it was up to the business to decide how it felt about the investment.
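The budget conversation can be sketched as a lookup from desired coverage to the cheapest budget tier that reaches it. The tiers and coverage percentages below are hypothetical placeholders for a team’s own estimates, and coverage is used here only as a rough proxy for confidence.

```python
from typing import Optional

# Hypothetical estimates: budget (USD) -> test coverage we expect it to buy.
COVERAGE_BY_BUDGET = {
    50_000: 0.40,
    100_000: 0.65,
    200_000: 0.85,
}


def cheapest_budget_for(target_coverage: float) -> Optional[int]:
    """Smallest budget whose estimated coverage meets the target, if any."""
    for budget in sorted(COVERAGE_BY_BUDGET):
        if COVERAGE_BY_BUDGET[budget] >= target_coverage:
            return budget
    return None


for target in (0.50, 0.80, 0.95):
    budget = cheapest_budget_for(target)
    answer = f"${budget:,}" if budget is not None else "not reachable with these tiers"
    print(f"{target:.0%} coverage -> {answer}")
```

Framed this way, the business chooses the target and engineering supplies the price of each level: 50% maps to $100,000, 80% to $200,000, and 95% is out of reach of every tier in this hypothetical table.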
Be Data Driven
If all we have are opinions, sometimes that’s the best we can do. But if we can get any amount of data, we will almost always be able to make better decisions. If we operate value streams where we track flow metrics related to team capacity, or test code coverage, we can make much more informed decisions about both the capacity and the outcomes of the many systems in which we operate. Choose to be data driven.