
Engineering the Hiring of Engineers

This post was originally published on CIO.com on January 27th, 2020.

Current State

I’ve been either a team lead or a manager for much of my career, and every time I was back in a role that required hiring, I secretly prayed that the industry had gotten better at hiring since the last time I’d tried. It was like Charlie Brown, Lucy, and the football. I’d see lots of positive signs, like people questioning the value of whiteboard coding, and still be left flat on my back in pain after each encounter. I don’t believe this is intentional.

In case you have managed to erase the experience from your memory, it goes like this: I have a role I need to fill, so I contact Human Resources and give them a list of requirements and desired skills. I then wait as the recruiters send dozens of cold InMails or emails to people who may fit the profile, in the hopes that they will get lucky and find THAT person who just happens to be looking. Or we post the job opening on a job board somewhere and get flooded with resumes from folks who aren’t even in the right field, let alone the right person for the job.

In spite of my good faith in the intentions behind the hiring process, it is a system that, to an engineer, makes little sense. As an engineer, I know the system can be better. The inputs to the system should be more consistent. The outputs of the system (the people we hire) should not be highly variable. We would not accept highly variable results in car manufacturing, and we should not accept them in the people with whom we work eight hours a day.

The Inputs

To begin with, we need to set our system up with good inputs. By this, we mean something better than random contacts from a recruiter. There are many ways to acquire these inputs. All good salespeople will tell you that sales is about relationships. Well, recruiting is sales. It is about the candidate selling themselves to the company, and vice versa. In that case, we should leverage relationships to get good inputs.

Referrals

Referrals are a great way to leverage relationships, as they are literally the product of relationships! Your employees already know who is good in the industry and who would want to come and work with them. They can do the job of selling the potential candidate on the company and the role much better than some random hiring manager because they know the prospect already. The biggest mistake a company can make with referrals is to try and save money on the referral bonus. When it is common practice to pay an external recruiter a $25,000 fee if one of the random people they emailed on the Internet gets hired, it makes no sense to pay your own employee a mere $2,000 referral bonus for bringing in vetted talent with whom they’ve worked before. The recruiter has never even written a line of code with the candidate!

Internships

Internships are another great way of finding talent. Much like referrals, internships can be beneficial to both the intern and the company. The intern gets very valuable experience working on a real engineering team, and the company has an opportunity to evaluate talent in real situations (as opposed to a typical interview process) with little risk. The interns will almost always make less money than a regular employee, and all internships are time-boxed by school commitments and the conditions of the internship program. I’ve seen internship programs work so well that there was a seemingly endless supply of junior engineers available to join the company at the conclusion of their studies. The company also had the ability to extend offers to only the top engineers, who had often been back for multiple internships as part of the program.

Bootcamps

Not all candidates will make themselves known through traditional methods, for varying reasons. That doesn’t mean there are not very good candidates out there who, when given an opportunity, can contribute in many ways, often with a supportive network behind them. There is such a shortage of qualified engineers that not exploring non-traditional avenues would be foolish. Setting up a relationship with a bootcamp or an organization like Code2040, Techtonica, Hack the Hood, or Year Up can give any organization access to talent that may otherwise go untapped.

Diversity and Inclusion

Many of the organizations listed above are major proponents of Diversity and Inclusion in the tech industry. Some companies are interested for social justice reasons, some because top talent is very difficult to sign if no one at the company looks like the prospective candidate, and some because they want to build the best company they can. There is overwhelming evidence that diverse teams simply perform better on a number of dimensions. However many of those reasons matter to your organization, diversity & inclusion is a very important part of having successful inputs and outputs in your hiring process.

Clarity

Regardless of your method for ensuring a consistent input to the system, it’s important to have clarity about what you would like each new employee to contribute to the team. Post the jobs on your company’s job board, announce the open positions during all-hands, talk to the members of the team the new employee will join, and make sure in all cases you know what you’re looking for. That will greatly increase your chances of success as you bring candidates through the process.

The Process

Once we have a good candidate in mind, the entire purpose of the process is to build increasing confidence in our extension of an offer. That is, the further down the pipeline the candidate goes, the more confidence we have in extending an offer. We do not want to continue to spend time on candidates who will not get an offer. It’s disrespectful to the candidate and it’s disrespectful to the team.

Each company and culture will be different. The process I describe below is the result of years of hiring, and the intent is to provoke thought and discussion about your hiring process. The approach is to apply the First Way of DevOps (systems thinking) to the problem: we want to optimize the process from beginning to end in order to maximize the flow of work (candidates) through the system.

First Contact

The first contact with a candidate who has agreed to enter the process can be handled by a recruiter or by the hiring manager. If it is a recruiter, it can be to answer basic questions about compensation ranges, benefits, etc. It can also be a very basic screening if you are lucky enough to have more candidates in the pipeline than you, as the hiring manager, can realistically handle yourself.

If you give a recruiter questions for an initial tech screening, you need to be as blatantly specific as possible. If the recruiter is not technical and you leave ANY room for interpretation, you will end up with more work than just doing the work yourself from the start.

Good Examples
  • Firefox, Mozilla, and Safari are all examples of what? (web browsers)
  • In programming, “for” and “while” are examples of what? (loops)
Bad Examples
  • What is the difference between a map and an array?
  • Explain how the Domain Name System works.

The latter examples require either specialized knowledge or nuanced understanding on the part of the recruiter. The Good Examples have straightforward answers that do not leave much room for interpretation. You may be saying to yourself that those questions seem really simple. They are, intentionally. I’ve had positions open for senior engineers with over 10 years of experience where half the applicants were project managers with 2 years of experience. A simple question like “What does DNS stand for?” (Domain Name System) is more than enough for those candidates to be weeded out by a recruiter, requiring no technical background whatsoever.

By the time the candidate gets to the hiring manager, you are trying both to evaluate whether this is the right candidate and to sell the candidate on the position (and by extension the company). If you are a larger company, you will usually talk about the benefits and the ability to focus on a specialty and grow. If you are a smaller company, you will usually talk about the ability to have a big impact, to learn many things from varied areas, and to grow toward a leadership position within the company. Note, this can be technical leadership; if we are hiring engineers, it is best not to try and sell them on a completely different job (people management). Talk about who they will have the opportunity to work with, and tell them how they will be supported in their role. The process of changing jobs carries enough stress with it that anything we can do to make it less stressful and more informed will put your position closer to the top of their list.

When evaluating the candidate, this is the time to understand their background. One of the most important questions you can ask is “What would you like to be doing in your next job?”. This shows the candidate that you care about their career and their opportunities, and that you are not just trying to fill an open slot with a body. There is typically a lot of flexibility in a position, and you want employees who are active and engaged. That is much easier when people are working on things they find appealing. An old boss once told me that “anyone can hold their nose and do something for two months”. What happens after two months?

Communicate

One of the biggest lessons I’ve learned while recruiting is to communicate. We’ve said many times that hiring engineers is partly sales. The best sales are built on relationships and communication is key in any relationship! I have had candidates tell me that the reason they picked the job I was offering was because I moved quickly and communicated clearly throughout the process.

The first contact is an excellent opportunity to explain the complete hiring process to the candidate. Because we’ve thoughtfully constructed our pipeline, we know exactly what happens at each stage. As the candidate progresses from stage to stage, it’s important to keep them updated about where they are in the process.

Even if you don’t have an update, let them know: “I don’t have an update for you, here’s where we are at the moment”. If you start to build a good communication pattern with the candidate early, it can only benefit your relationship if they become an employee. There are so many horror stories of companies that have “ghosted” candidates, only to reach out weeks later, which makes candidates feel like an afterthought at worst and not a priority at best. It makes it unlikely that the candidate would ever consider or recommend working there.

The Coding Test

The coding test this early in the process? If we’re hiring engineers, they will need to write code. Even infrastructure engineers need to write code these days. If you are hiring engineers and do not have a coding test yet, now is the best time to start one, and I’m here to help you get going.

Guiding Principles

The test should be reflective of the actual work. Does the day-to-day job of the person you are hiring require them to regularly implement a bubble sort from scratch, off the top of their head? If not, then do not test for that ability. The object of the test is to assess the ability of the candidate to write clean, functional code that can be maintained over time. It is not a time to determine their ability to recall minutiae. Ask them to write tests, ask them to implement a procedure that is a common part of the job, but don’t test some obscure piece of knowledge from your college computer science classes; this is not Trivia Night at the pub.
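To make that principle concrete, here is a minimal sketch of what such a take-home exercise might look like, assuming a hypothetical role that involves working with web server logs; the task, log format, and field positions are all illustrative rather than prescriptive:

```python
# Hypothetical take-home exercise: summarize server errors from an access log.
# It mirrors day-to-day work (parsing, aggregating, testing), not trivia.
from collections import Counter


def count_errors(log_lines):
    """Return a Counter of HTTP 5xx status codes seen in the log lines."""
    errors = Counter()
    for line in log_lines:
        fields = line.split()
        # Assumes the status code is the second-to-last field, as in
        # common Apache/Nginx access log formats (status, then bytes).
        if len(fields) >= 2 and fields[-2].startswith("5"):
            errors[fields[-2]] += 1
    return errors


def test_count_errors():
    log = [
        '10.0.0.1 - - [27/Jan/2020] "GET / HTTP/1.1" 200 512',
        '10.0.0.2 - - [27/Jan/2020] "GET /api HTTP/1.1" 500 64',
        '10.0.0.3 - - [27/Jan/2020] "GET /api HTTP/1.1" 503 64',
        '10.0.0.4 - - [27/Jan/2020] "GET /api HTTP/1.1" 500 64',
    ]
    assert count_errors(log) == Counter({"500": 2, "503": 1})
```

The candidate writes ordinary code and an ordinary test, which is exactly the kind of work the job will actually require of them.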

If the job does not involve coding on a whiteboard, neither should your test. I don’t believe there are any whiteboard coding jobs out there, and if I’m right and they do not exist, this practice needs to end. No professional writes code without an IDE anymore. Unless your whiteboard has tab completion and syntax highlighting, asking someone to write code in a completely different modality tests nothing more than their ability to write code under 100% artificial conditions. That time could be spent in much better ways, so save the dry erase for conceptual work like architecture diagrams and keeping track of areas of a problem to be discussed, not code.

If the job does not involve someone watching you code, neither should your test. If your company has a well-developed pair programming discipline, this is a great time to introduce your candidate to that practice! I once interviewed at a well-known company where part of the interview was to work with another engineer on a real task from their queue. I had to look things up, write some code, discuss my thoughts, and work side by side with an engineer to solve the problem. If you do not have a well-developed pair programming discipline, then you need to send the candidate a coding test that they can do at home in no more than 3 hours. That is already a very long amount of time to ask someone to dedicate to your interview. The test should be straightforward, with clear instructions that are continually improved based on feedback from candidates. At the end of the test, the candidate should feel like they were genuinely tested, so they know that other employees at the company have met that bar. If the test is too easy, they will think that coding is not valued. If it is too hard, they will think that you are not really interested in their success in the job.

There are some exceptions! If you are interviewing the creator of X, you don’t need to ask them to code in X. If you are interviewing senior candidates, it is completely acceptable to skip having them write any code. If your candidate has an extensive catalog of code on GitHub or elsewhere, ask them to recommend a particular piece of code they believe is representative of their style and ability. Asking a senior engineer with lots of code in the public domain to jump through hoops for your process is insulting to the candidate and ignores the purpose of the coding test. If they are actually a senior candidate, they will also be in high demand in the market. If they even agree to do the test, you are wasting time that could be spent on downstream stages in your hiring pipeline.

What if they cheat?

Many people raise the objection that if they allow a candidate to take the test home, the candidate will “cheat” (as if looking things up on Stack Overflow is cheating), or get someone else to write the code and then submit it as their own. You can certainly have the candidate sign something attesting that they performed the test within the rules. However, the coding test does not end when the test is submitted or graded.

Whether it is code they wrote as part of your test, or code they identified as their own from another source, the next part of the interview process is to review the coding test with the candidate. This most likely happens during the onsite or online interviews.

Onsite/Online interviews

During the onsite or online interviews (online for remote candidates), we are still trying to move the candidate through the pipeline and to gain confidence that they will get an offer at the end of the process.

Coding Test Review

The first interview after the coding test can be onsite or online depending on the culture of your company. The important part of this review is to pick a piece of code to be reviewed together by the candidate and an engineer who is well versed in that language and its techniques.

If the candidate did not write the code, or does not understand the code, this should reveal itself under scrutiny. If the candidate indeed did the work, then this is an opportunity to find out how well the candidate understands what was written. I once had a test submitted by a candidate where one of my senior engineers pointed at a method and said “They didn’t write that method, it’s completely different than all the other code”. When pressed to explain the method line by line, the candidate was evasive and standoffish. After being asked repeatedly if they’d written the code and to explain what they’d been thinking in their choice of variable names, etc., they relented and said they’d copied and pasted it from Stack Overflow. We failed them not because they lied about the authorship of the code, and not because they copied and pasted it from somewhere else, but because they were willing to submit code they did not understand as their own. There are few things more frightening to an engineering manager than to discover their engineers are shipping code to production without actually understanding how it works!

One-on-one and Team Interviews

After the coding test is accepted (because otherwise the process ends), it is time for the candidate to meet with the various people with whom they will be working. These can be members of the team they would join, or leaders of other teams with whom they may often work. The goal of this stage of interviews is to allow the respective parties to get to know each other, to find out if this is a person with whom they would want to work.

They are trying to determine things like:

  • What is it like to work with this person?
  • Are they capable?
  • How do they approach problems?
  • How do they discuss solutions?
  • How willing are they to learn?
  • What is their learning style?
  • What makes them feel supported? Heard?
  • How do they resolve conflicts?
  • How familiar are they already with a certain topic?
  • What related experiences do they have?

You will notice that most of these questions are not standard engineering questions. They are there to assess the person and their interpersonal skills. All engineers work in socio-technical systems, and ignoring one side in favor of the other can leave us with “brilliant jerks”, who are always a bigger drag on the organization than anything they can contribute.

These interviews can take as long as you feel is necessary to assess a candidate. I always try to have these meetings on-site if possible. Even if a candidate is going to work remotely, it’s good for them to put faces to names (unless you have a completely distributed company). This is also a good signal to the candidate that the company is willing to invest in them and is not simply trying to find cheap labor in another market. This can only strengthen the relationship, even if the candidate can’t actually make the trip.

If you are going to have them participate in a series of interviews, there are a few things to keep in mind:

Limit yourself to a few people interviewing each candidate. It is not fair to a candidate to go up against a large panel of interviewers. There should be enough trust between team members that everyone’s presence at the same time is not necessary. Interviewing is an incredibly stressful and draining experience in which the candidate is trying to perform at the top of their game for hours at a time. Going into an environment where they feel ganged up on does not make for a better interview or a more relaxed candidate.

Because it is so draining, allow time for breaks! These are humans we’re talking about, not machines. Everyone appreciates the opportunity to let the glucose back into their brain, use the bathroom, or check in with work or family. By demonstrating that we care about the well-being of the candidate, we are showing them that we’re not the type of organization that expects them to work nights and weekends because we don’t know how to plan properly.

Three or four hours (including lunch) is as many interviews as you can realistically do in a day. After that, the candidate is unlikely to be able to put their best foot forward. They have often taken some time off work and it may look strange for them to be away for so long, or they may have other responsibilities. If you are unable to complete all the interviews with a single onsite, consider whether remote interviewing is a possibility for any further interviews as the candidate may have more flexibility in that situation.

Culture Fit

One interview that is often thrown into the onsite/online series is the one for “culture fit”. Do not do this. Interviewing for culture fit is a fast way to a homogeneous team. As with the Diversity & Inclusion discussion above, if you want a less creative team that underperforms compared to its more diverse peers, then I can’t recommend the culture fit interview enough. Instead, make sure your candidates and your team are able to communicate well with one another, and you will have a team with diverse skills, backgrounds, and talents that other teams will be chasing.

The Outputs

You’ve gone through all the interviews. Either no one has raised any red flags or everyone is enthusiastically supporting hiring the candidate (different companies have different standards as to how consensus is reached). Now it’s time to make the offer. We know that ultimately recruiting is a sales activity.

Make a fair offer. It’s not necessary to be Netflix and pay top of market, but given today’s competitive landscape, expecting people to take less than their worth for the privilege of working at your company is folly. Trying to lowball someone at this point just tells them that they are not really valued, unlike all the other activities that reinforced their importance during the interview process. The compensation conversation should have begun early in the process, so at this point there should be no surprises. Surprises at this point are like exceptions in the process we’ve engineered; if they do happen, we need to go back and fix our process to prevent exceptions at this late phase in the future.

Encourage any new employee to take time off between jobs if possible. It’s a poor beginning to have a new employee show up burned out on the first day. If they can’t afford to take any time off, try and arrange a small signing bonus to cover the time. The financial cost of the bonus is much less than PTO would cost the company, and less than the loss of productivity from an employee who is not ready to start a new job.

Lastly, do not leave the onboarding of the new employee to chance. The team should have a clear plan for at least the first two months, one that gives thoughtful consideration to the ways they will bring their new teammate up to speed. They will need to make accommodations for the learning style of the new employee, but they already learned about those factors during the interview process. When the new employee arrives in a deliberate, supportive environment, it helps establish the psychological safety from which we get our highest performing teams.

When hiring new employees, our process is often a haphazard series of handoffs in which a candidate is identified and shuffled from one department to another, often left forgotten, disappointed, or disheartened about the role they would have in the new organization based on their treatment in the hiring process. If instead we are clear and deliberate about the stages in our process, and show our prospects dignity and respect throughout a smooth-flowing pipeline, we can hire the engineers we want and create high performing teams, while other companies are left to flounder with delays, communication failures, and time spent on meaningless exercises. Which choice will you make?

Distributed, Not Remote

This post was originally published on CIO.com on December 23rd, 2019.

Times they are a-changin’…

Ok, not in the sense of the Bob Dylan song necessarily, but the availability of working outside of a traditional “everyone in one office” environment has changed radically over the past few years. There are companies who have been doing this for a long time. I first started managing a remote team in 2001. I had folks in Europe, Hawaii, and up and down the west coast of the United States. A lot has changed since those days.

My team started out with IRC and then moved to Jabber because it had support for SSL, so we could exchange secrets. We made heavy use of Skype. Now there are Slack, Zoom, and Microsoft Teams, but many of the same ideas apply, and some of the tools are even better.

Today having the ability to work with teams and people around the world is becoming more and more popular. Maybe your company wants access to talent in places they don’t have a traditional office. Maybe your company operates in many different geographies. Maybe your company was founded to be 100% distributed!

Regardless of the reason, there is no question that distributed teams are far more commonplace than they were in 2001. There are many challenges to working in this manner, and there are many things that we can do to make such teams successful.

Challenges

As the expression goes, nothing good is easy. There are huge advantages but there are also a number of challenges to being successful with this style of work.

HQ Effect

If your company has a very heavy headquarters culture, it may be difficult for those not located at HQ to advance effectively. So many of the key things happen there: the executives are there, and most decisions are made there. It can be a challenge for those in other offices or locations to make their contributions and ideas known. Often, many more promotions go to those located at HQ than to people in other places. We need to be conscious of this if we want our high performers, wherever they may be, to be recognized for their excellent work. Similarly, we need to make sure that those who are struggling get the support they need so they can continue to be valuable members of our organizations.

Teams

Ultimately the success of our organizations falls heavily on the performance of our teams. There can be a number of challenges for teams, especially if the majority of a team is located in an office and there are a few team members in other locations. This is why the title of this article is Distributed, Not Remote. We want to think about the team as being distributed. If we think about having “regular” employees and “remote” employees, we create bad dynamics for the team: a hierarchy between those who can communicate freely and those who struggle to be part of the same conversation.

But working on distributed teams can be difficult even if we make the effort to communicate openly, because language itself is a difficult medium. We’ve all seen emails that seemed innocuous enough, and then witnessed someone respond horribly offended because they interpreted a phrase much differently than everyone else. Sarcasm ALWAYS comes through perfectly in text form and is ALWAYS read and appreciated as intended.

I remember a comic strip from when I was a child in which two characters were sitting back to back; one, looking down at an injury on their leg, said “I gotta scratch”, to which the other responded “So go ahead!”. Without shared context, even the simplest statements get misread.

Distributed teams can also be hard for junior staff. Before any team takes on junior members, they should be well aware of the amount of (worthwhile) work required to bring those members up to speed. When someone is located at home, or in an office overseas, this can be a big challenge. Additionally, a big part of being junior staff is learning how to work, not just the work itself. If you are just getting started with your career, and you’re trying to work from your kitchen, and you have roommates, and the dishes are piled up in the sink, and you have a lot of laundry to do, it can be a challenge to get the long blocks of sustained time required to be truly effective.

Approaches

There can be many challenges working on a distributed team. But there are many ways to make those teams more effective.

One remote, all remote

The first rule most successful remote teams follow is: one remote, all remote. That is, the entire team needs to adopt the mentality of a distributed team no matter where each member happens to be working. If you’re fortunate enough to work for a company that is flexible about whether or not you need to be in the office every day, you already have an advantage in this thinking.

If you’re having a conversation with the person in the next cubicle that is probably relevant to the team at large, you need to be cognizant of that fact and take it into a forum that is visible to the entire team, wherever they are. If you are in a private office talking to one of your managers about who should be promoted, obviously that conversation doesn’t belong out in the open. But if people are making an architecture decision that the team will have to support for the next few years, that can’t be something relayed at a later time.

Making Work Visible

Being cognizant of having all team members participate is not the only thing that allows distributed teams to be successful. One of the easiest ways to be transparent with information is to make it a normal part of how teams work, by making work visible.

Chat applications

There are many ways to make this information visible. I’ve used Slack, Teams, etc. to involve everyone in the conversation. Team members in Europe or Asia working with US colleagues at least have a chance to see the topic being discussed, and can contribute their thoughts and ideas when they are awake and at their computer (or, if they so choose, on their mobile devices, as long as that is not expected).

When working at Salesforce, we used a tool called Chatter that was a bit like business Facebook. We would start a conversation on a topic and over the course of the next 24 hours, teams from all over the world would contribute their ideas in a threaded conversation. People could “like” posts to express their support, or they could link to documentation, etc. to give others in the thread more information about the solution that was being proposed.

Of course, all these tools have great search capabilities, so if anyone wants to go back and see the reasons behind a decision or an action, it is all self-documented.

ChatOps

Successful engineering and operations teams take this concept a step further! They practice ChatOps, in which not only the discussions and decisions happen in Slack or Teams, but the actual work itself does too.

They set up chat bots like Lita, Errbot, and Hubot, which allow them to take actions directly in chat. This can be things like silencing alerts or asking for a graph of CPU utilization, but it can also be things like deployments. This is a great way to provide a unified interface for deployments, as the commands used in the backend code can differ depending on what kind of deployment is being performed (container, JAR, schema change, etc.).
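As a sketch of how that unified interface might look with one of those bots (Errbot, which is Python-based), a single chat command can hide the per-artifact differences; the plugin name, command, and backend helpers here are hypothetical, and a real plugin would also ship a .plug metadata file:

```python
# A minimal Errbot plugin sketch: one "!deploy" command, multiple backends.
from errbot import BotPlugin, botcmd


class Deploy(BotPlugin):
    """ChatOps deploys: the team sees one interface, whatever the artifact."""

    @botcmd
    def deploy(self, msg, args):
        # Usage in chat: !deploy container billing-service
        kind, _, artifact = args.partition(" ")
        handlers = {
            "container": self._deploy_container,  # e.g. roll a new image
            "jar": self._deploy_jar,              # e.g. push to the app servers
        }
        handler = handlers.get(kind)
        if handler is None:
            return f"Unknown deploy type '{kind}'. Try one of: {', '.join(handlers)}"
        handler(artifact)
        return f"Deploying {artifact} as a {kind}; watch this channel for status."

    def _deploy_container(self, artifact):
        pass  # call your container orchestration tooling here

    def _deploy_jar(self, artifact):
        pass  # call your JAR deployment tooling here
```

In chat, `!deploy container billing-service` and `!deploy jar billing-service` look the same to everyone watching, while the backend code does whatever each artifact type requires, and the whole exchange stays visible and searchable.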

These teams will also put a lot of effort into custom emoji that reflect the culture of the team and the company. This can often make communicating much simpler and easier. A well placed emoji or meme in response to an event or statement can quickly communicate a long history of common understanding.

Agile tools

Our Agile tools are also excellent ways to make our work visible. A sprint or kanban board allows anyone to see work in progress and anything that has stalled or is blocked. There is no need for worthless “status report” meetings to find out what has been accomplished on some component of a project. Anyone, anywhere, can find out the status of a component by bringing up the board. Better teams that take advantage of integrations between tools like Jira and GitHub can even drill down into the individual commits related to a work item, and see the exact code that has been developed to meet a business need.

Nothing about these tools or this kind of integration requires people to be colocated in the same office, or even the same country.

Presence

Another property of remote work is presence: the ability to meaningfully connect on a human level with other team members over distance.

Psychological Safety

As we learned from Google’s re:Work project, psychological safety is the most important indicator of high performing teams. Gene Kim even highlights this in his most recent novel, The Unicorn Project, as the fourth of the Five Ideals. But how do we foster psychological safety on a distributed team?

One method I’ve found effective for such teams is to use the Agile daily standup to foster friendships and familiarity between team members. We all know that the cardinal rule of standup is to make it as quick and effective as possible; then the meeting ends. With distributed teams, there can be other goals as well. For my distributed SRE teams, I always let standup include a good amount of banter and chatter before the meeting got started. Instead of standup starting at the top of the hour, maybe it actually got started at ten after. During those ten minutes, we heard about pregnancy announcements, or the neighbor who set their house on fire by using a circular saw improperly, or how close someone was to graduating from their night courses. The point of this time was to create personal connections on the team. Knowing the other members of the team on a personal level was critical for those times when there were differences of opinion on technical topics and each team member had to trust that everyone was participating with the best intentions.

But even the physical environment can help to foster these feelings.

It was just another day in the office until we saw the UX design team moving furniture around. No, this was not a common occurrence. They had taken some chairs out of one of the small meeting rooms that had been set up with a nice mic, an HD webcam, and a TV, and were replacing them with a couch and some plants from another part of the office. “Oh, it’s just those crazy artist types doing their thing,” thought all the engineers. What we came to discover, however, was that the small meeting room now felt like a living room. People felt very comfortable in there, and it showed during meetings. Many who follow the “one remote, all remote” philosophy will eschew meetings where those in the office sit in the big, antiseptic conference room; instead, everyone takes the meeting from their desk so that everyone is on equal footing.

But the people who work from home are not in an antiseptic environment. They have specifically constructed an environment where they are comfortable. By interacting with co-located team members who are in more of a living room setting, everyone is put at ease and the conversation flows freely. Those in the room still need to be mindful to watch or listen for clues that those who are not in the office want to contribute; sometimes they even need to consciously ask for input. But by creating a relaxed dynamic for discussion, we allow the benefits of psychological safety to reach the whole team.

That does not mean that all meetings need to take place in a living room setting, as desirable as that may be. But when working as a distributed team, members should feel free to have one-on-one meetings whenever the need arises. Just as you would feel comfortable turning to the person in the next cubicle and asking “What do you think about the proposed caching strategy?”, that bar should be equally low with distributed team members. Sending a direct message in Slack asking “Got a few minutes to talk about the proposed caching strategy?” should be neither weird nor unwelcome.

Virtual

I wrote that the “living room” was set up with a nice HD webcam. I also wrote on Twitter a few weeks ago that I thought all business meetings should be video meetings. Why is this so important? As I’ve spoken about at a number of conferences, human beings have specialized neurons called mirror neurons. When we witness someone else experience something, the neurons in our own brains that would respond to that same experience also fire. It is almost as if we are experiencing that same thing ourselves. I believe that this activation is crucial to developing empathy with, and understanding of, other people. If that is the case, then adding a webcam to our conversations instead of simply using audio, allows us to build deeper connections with others, which also contributes to psychological safety.

The ability to see the expression on another’s face, or their body language during a conversation, allows us to collect so much more information than we would with simple audio or even text. This is not the same as sitting down for a team lunch together or anything like that, but it is an effective way to use non-verbal communication.

One non-verbal technique I learned from a brilliant software engineer was the “thumbs up” sign. In chat, you can often see folks react to a statement with a thumbs up emoji to express agreement. During group discussions over video conference, this can be effective as well. Instead of interrupting the speaker, one can nod or give a thumbs up or thumbs down to the camera, expressing agreement or disagreement with an idea to everyone participating. This is something that is simply not possible with audio-only communication.

Some companies have taken the idea of remote presence to a new level with the use of robots! I remember the strange feeling of seeing one of these robots roll down the hall to a conference room and saying hello to the person on the screen as I opened the door for them to roll to their place at the table for the discussion. That is really being welcoming to distributed team members!

Actual

While encouraging distributed teams and team members to participate over many different mediums, there is still often a need to share time together in a common space to build those strong interpersonal relationships that are difficult to forge any other way. For this reason, many high performing distributed teams will gather once or a few times a year, in the same location.

Whether it’s happy hour, cooking classes, or going to the gun range, being able to spend time in person, with people with whom you spend so much time online, can have a great effect on team cohesiveness. While hack events (hack days, hack weeks) can be great collaboration builders in a distributed manner, there is also something special about friendly competition among engineers, accompanied by lots of cheering, encouragement, and laughs when the demo at the end of the hackday doesn’t work quite right for a team.

DevOps

We’ve talked about how thinking of our team as distributed and not remote is important, but how does this apply to DevOps? We already know that when we build bridges between our Operations and Engineering teams, we get better results; this is demonstrated every year in the State of DevOps report. What we also know from neuroscience is that people behave differently in the presence of those they consider an “out group”, that is, people they do not consider to be in the same group as themselves. If we draw a distinction between employees who are in a specific office and employees who are “remote”, we are drawing a line that divides “in groups” and “out groups” within our own teams. There is a lot of scientific literature that supports this. If instead we think about our team as being distributed, then we are essentially creating one group, of which we are all a part.

When we are talking about diversity and inclusion in the workplace, we discuss this problem as “othering”. When we see people as the other, that is, not a part of our same group, and we are not accepting of others, then various undesirable results can occur. Instead, if we want to build and maintain high performing distributed teams, we need to think about how to expand our “in group” so that all team members can participate equally, regardless of their physical location.

Go forth

It has been a great almost-20 years watching the rise of distributed teams, especially in the technology space. What started out as an exercise for the bold, born of the pure necessity of operating around the globe, has turned into a major advantage for companies that are willing to invest in enabling distributed work to be done properly.

Companies that embrace this strategic advantage are able to find, develop, and maintain talent wherever they happen to be located, instead of being restricted to a few geographies where the competition for talent can be fierce and expensive. By taking advantage of the great tools and learnings about the culture of distributed teams, we can give our businesses a leg up on the competition, and set ourselves up for the win.

Plan the Work, Execute the Work

This post was originally published on CIO.com on November 21st, 2019.

Background

We had just acquired the company and I was getting introduced to different members of the engineering staff. I was asking all sorts of questions about how the architecture was laid out, what their workflows looked like, etc. Then I found out that their on-call rotation was a nightmare. “Why don’t you work on improving the stability of the infrastructure?” I asked. “We’re too busy” came the response. “You’re too busy to stop your life from being a never ending Groundhog Day of fighting fires?”

I couldn’t believe my ears. But in many organizations, the method for choosing work is to work on whatever is in front of them at the moment. The thought process is “If something new lands in front of me, I work on that.” By doing so, we can keep ourselves very busy, and at the end of the day, we are tired. However, rarely is there a sense of accomplishment, because rarely in those circumstances are we actually productive.

Productivity

The 2019 State of DevOps Report defines productivity as “…the ability to get complex, time-consuming tasks completed with minimal distractions and interruptions.” According to this definition, there are a number of conditions that must be satisfied for one to be productive:

  • A block of time (not just 15 minutes here and there)
  • Minimal distractions
  • The work needs to be complex

Peter Drucker wrote in The Effective Executive (also talking about knowledge workers) in 1967: “The more he switches from being busy to achieving results, the more will he shift to sustained efforts - efforts which require a fairly big quantum of time to bear fruit.”

Going back to our example of the acquisition above, we find none of these conditions met. The team was being paged constantly, which meant there was no opportunity to work for a block of time. The pages themselves were a constant source of distraction, and were often ignored. The remediations for these items were never complex; it was always the simplest action that could be performed in order to get past the current alert. When talking about this product, the engineers always said how much work they had to do, but it was all toil; it was never productive work.

In order to be productive, we need to plan the work and execute the work. If we only work on things directly in front of us, we will never be productive. Productivity will not find us, we need to find productivity. In fact, the only way to get out of the situation above is to be productive.

A block of time

Because we want to work on time-consuming tasks, we need the most important resource of all, time.

Drucker observed: “Effective executives know that time is the limiting factor… Of the other major resources, money is actually quite plentiful… People – the third limiting resource – one can hire, though one can rarely hire enough good people. But one cannot rent, hire, buy, or otherwise obtain more time.” If we don’t have the ability to make more time, then we need to be as effective as possible with the time we have.

Agile

One way of doing this is to have structure to our work. I’ve written in the past about Agile meeting schedules and why this structure is essential to being productive. I’ve had people work for me who’ve said that their previous engineering teams would have a meeting once every few weeks. They felt like they were left to figure out what to work on, and every few weeks, they would have an inefficient meandering meeting that was supposed to tell them what to work on next. They did not feel like there was a real plan for the work to be accomplished.

Part of the reason we have so much ceremony in Scrum is that we are trying to be as predictable as possible about the output of the team. Of the ceremonies, Sprint Planning is probably the most important for productive work. During this meeting, the team agrees on how much work to take on, based on past performance, and on what those specific work items should be. This predictable output is, by definition, productive output. It is also output based on a prioritized backlog, which we discuss below. A high performing Scrum team is a productive team. A team whose velocity is extremely low, or varies wildly from sprint to sprint, is unlikely to be very productive.
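As a trivial illustration of what “based on past performance” can look like (the numbers are invented), a rolling average of recent velocities is one simple way to keep the commitment, and therefore the output, predictable:

```python
# A minimal sketch: base the next sprint's commitment on recent velocity.
recent_velocities = [21, 24, 22]  # story points completed in the last 3 sprints
commitment = sum(recent_velocities) / len(recent_velocities)
print(f"Plan roughly {commitment:.0f} points for the next sprint")  # ~22
```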

20% time

There have been many different takes on Google’s idea of 20% time, but regardless of the implementation at different companies, the idea remains that time can be put aside for the purposes of exploring new ideas. The output of this exploration (i.e. innovation) time is often complex, and whether it leads to a new product or not, it is still productive time. Google has a word for work that is not productive, which we will talk about when we explain complex work.

Minimal distractions

Prioritization

One of the best ways to minimize distraction is to make it clear what the priorities are for our teams. Giving overall approaches to how they should think about their work is good and sets the team in the right mindset. But the work must be prioritized in a structured manner if we are to ensure that we are always working on the most important things for the business. This means teams sitting down with a leader (in Scrum, the Product Owner) who has already groomed the backlog, removing items that are not a priority. It also means discussing and ranking the work.

When teams negotiate a prioritized backlog with their leadership, it is essential that there is never a question as to what is the most valuable work to be done. The work is always planned, and if someone were to work on reactive work, which was not prioritized, the rest of the team would ask why they were not working against the backlog. This also allows us to make long term plans, as we have the ability to break up work into smaller units, and then move those units to the top of the backlog when necessary to achieve the longer term goals.

If we were to tackle the problem we discussed at the start, with the team that was getting paged constantly, we could assign some team members to the reactive work, and dedicate others to making that type of work go away, rather than burning out the entire team.

The 4 Flow types

When it comes to allocating the work during prioritization, we can do so according to the four different types of flow items outlined by Dr. Mik Kersten in Project to Product:

  • Feature
  • Defect
  • Risk
  • Debt

Depending on the needs of the business at that time, we can allocate these types of work in different ratios. Are we near the end of the year and want to focus on stability? Perhaps we lean more heavily on Risk and Debt. Are we trying to get a major new release out by the end of September? Perhaps we prioritize Feature delivery. In any case, we want to ensure maximum productivity out of our teams, by keeping the work they perform as highly aligned as possible to the needs of the business. This is being Agile, as opposed to doing Agile.
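As an illustration of making those ratios explicit (the mode names and percentages are invented for the example, not recommendations), the allocation itself can be written down where the whole team can see how capacity maps to the needs of the business:

```python
# Illustrative only: splitting sprint capacity across the four flow item types.
FLOW_ALLOCATIONS = {
    "stability_push": {"feature": 0.2, "defect": 0.2, "risk": 0.3, "debt": 0.3},
    "release_push":   {"feature": 0.6, "defect": 0.2, "risk": 0.1, "debt": 0.1},
}


def allocate(capacity_points, mode):
    """Turn a sprint's capacity (in points) into per-flow-item budgets."""
    ratios = FLOW_ALLOCATIONS[mode]
    return {item: round(capacity_points * share) for item, share in ratios.items()}


print(allocate(40, "stability_push"))
# {'feature': 8, 'defect': 8, 'risk': 12, 'debt': 12}
```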

But what if there are more things on the backlog than we can get done? What if people are adding items to the backlog faster than we can move them to Done? This is where our focus on productivity can really shine. We will never get through all the items in the backlog. The sooner teams give up on that fantasy, the sooner they will get down to the business of maximizing their productivity with the time they have. If we are always working the most important items, we are always being maximally productive. If the business can see how much work is being accomplished, and there is still a desire for more work to be done in a given amount of time, that is when we can start having discussions about increasing efficiency or headcount.

Complex work

In order to be productive, we actually need to be doing complex work, not spending our time on items that do not contribute value to the business.

Toil

If a requirement for being productive is our ability to accomplish “complex, time-consuming tasks”, then Google’s Site Reliability Engineering organization has defined what most of us would recognize as the opposite of productivity: toil.

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. - Vivek Rau

The Google SRE book has a lengthy discussion of toil. Just as in our discussion of Agile above, we do not want to spend a significant amount of time on toil. With many of my clients, I will use the parts of the business executing the most toil as the starting point for discussions about where we can make improvements. You should recognize from the definition above that toil is not productive work. We do not need to clear large blocks of time to concentrate on accomplishing toil. Toil tends to be the type of work that winds up “in front of us”, and therefore it’s what we end up working on. Toil is typically not planned, unless it is part of a larger effort to actually eliminate toil, as in our example above about setting aside some members of the team to handle the reactive work while others work on process improvement.

If you recognize the definition above as being most of what you spend your time on during the work day, then you are probably not planning the work to be done, and then executing against that plan.

Time thieves

In addition to toil, we must also be mindful of the “Time Thieves” described by Dominica DeGrandis in her book Making Work Visible when executing complex work. In it, Ms. DeGrandis discusses in depth the problems with:

  • Too much work-in-progress (WIP)
  • Conflicting priorities
  • Unknown dependencies
  • Unplanned work
  • Neglected work

Each of these “thieves” can keep us from being productive. If we have too much work in progress, we actually accomplish less. Conflicting priorities can leave us oscillating between things that are never completed. You get the idea. Each of these thieves stands in the way of our ability to move items to Done and work on the next complex task required to keep our businesses moving forward faster, and with more quality, than the competition.
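The work-in-progress thief is worth a quick worked example. Little’s Law from queueing theory says that average cycle time equals average WIP divided by average throughput, so with the same team, more simultaneous work means every item finishes later:

```python
# Little's Law: average cycle time = average WIP / average throughput.
def avg_cycle_time_days(wip_items, throughput_items_per_day):
    return wip_items / throughput_items_per_day


print(avg_cycle_time_days(6, 2))   # 3.0 days per item
print(avg_cycle_time_days(20, 2))  # 10.0 days: same team, slower delivery
```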

Conclusion

Too often our teams get caught up in the day-to-day of running the business. They are never idle, so they feel like they must be getting a lot of work done. However, that is not necessarily what is best for the business. In order to be truly productive, we must approach our work deliberately, not simply use recency as our measure.

By doing so, we can execute major changes to our processes, our culture, and our infrastructure. Being Agile along the way allows us to adjust course as we go, so that we are always working toward the right goals. When we plan the work, and then execute the work, we enable the capacity to do great things.

Deploy on Fridays, or Don’t

This post was originally published on Hackernoon on October 24th, 2019.

There seems to be a debate that has gone on for quite some time now on the Twitters about whether or not you should do Friday deploys, whether there should be Friday moratoriums, etc. There are a lot of accusations being thrown around about fear, testing, time to recover, and the like. To be very clear, I am not a big fan of Friday deploys. That opinion is based not merely on how I feel about deploying on Friday, but also on the science of it, as well as my learned experience.

With a title like “Deploy on Fridays, or Don’t”, I realize the expected continuation of that statement would be “…I don’t care”. However, nothing could be further from the truth. Let me explain.

My advice to anyone who will listen is, if you’re cautious about your Friday deploys, don’t feel bad, and don’t let anyone make you feel bad.

Shaming

It is pretty disconcerting to see a tweet like Kelly’s where the vast majority of the comment thread consists of attempts to shame anyone for holding that opinion. The arguments basically boil down to some variant on:

  • Deploying shouldn’t be scary
  • You should be confident in your deploys
  • You lose 20% of your productivity without deploying on Friday
  • You just need more tests

Those are all interesting ideas, and they reflect a very particular kind of smug optimism. They are often “backed up” by quoting Accelerate or the State of DevOps report. They eventually arrive at a compromise: do the best you can, and keep maturing your deployments until you can deploy any day of the year with “confidence”. There is also an acknowledgement that this can be hard, and having worked with a number of clients and companies over the years, with this I agree.

However, here are the main problems with that logic:

  • Quality Engineering
  • Even Elite performers have change failure % > 0
  • Mores are not Moratoriums
  • All days are not the same

Lastly there is some strange argument that choosing not to deploy on Friday “Shouldn’t be a source of glee and pride”. That one I haven’t figured out yet, because I have always had a lot of glee and pride in being extremely (overly?) protective of the work/life balance of the engineers who either work for me, or with me. I don’t expect that to change.

Quality Engineering

While working at Salesforce, I had the opportunity to learn a lot about quality. This was also the time when I read Continuous Delivery by Jez Humble and Dave Farley. That book changed my life, and I say that confidently. One of the things I loved about it was the idea that the more testing you do, the more confident you can be in the artifact you are deploying. When pitching CI/CD pipeline proposals to executives, they would ask how confident we could be in our artifacts, and I would respond with “How much do you want to spend?”. The more money they were willing to spend, the better the testing we could do, and therefore the more confidence we would have. The other thing Continuous Delivery taught me was how important it is to have fast feedback. Ultimately, your confidence when deploying to production is going to be some compromise between those two. If you run automated tests for 15 hours, you should obviously have more confidence than if you run them for 30 seconds.
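One way to picture that compromise is a pipeline that runs its cheapest stages first, so that most failures surface in seconds or minutes while the expensive, confidence-building stages run later. This is only a sketch of the staging idea; the make targets and durations are assumptions:

```python
# Fast feedback sketch: cheap stages first, stop at the first failure.
import subprocess

STAGES = [
    ("lint", ["make", "lint"]),             # seconds
    ("unit tests", ["make", "test-unit"]),  # a few minutes
    ("integration", ["make", "test-int"]),  # tens of minutes
    ("end-to-end", ["make", "test-e2e"]),   # hours; buys the most confidence
]

for name, cmd in STAGES:
    print(f"Running stage: {name}")
    if subprocess.run(cmd).returncode != 0:
        print(f"Stage '{name}' failed; stopping before the slower stages.")
        break
else:
    print("All stages passed: as much confidence as this pipeline can buy.")
```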

One thing that was not in the book, however, was any notion that you could be 100% confident in what you tested. That is, you could not assure the quality of the artifact tested. Now, Salesforce has a very mature testing pipeline. There were literally hundreds of thousands of tests being run more than 5 years ago, and yet they had a quality engineering discipline, not a quality assurance discipline. Why?

Because one cannot assure quality in software. In manufacturing, if I am making shampoo, I can have quality assurance test for quality. QA takes a statistically representative sample of the bottles coming down the line and tests to make sure that the chemical composition of what is being produced is within the tolerances described by the quality specifications. They statistically assure the company that the quality is consistent.

In software, you cannot do this. You cannot take a random sample of the code coming through your continuous delivery pipeline, test those lines of code, and then assure that when that code is deployed to production it will perform at a level consistent with what has been defined.

Therefore it is foolish to lecture people that they should be deploying on Fridays because they just need to “be confident in their code” or “write more tests”. How many people dispensing this advice have hundreds of thousands of tests running against their code? How many have 100% code coverage in their tests (if this sounds appealing: please don’t do it, the last few percentage points suffer from the law of diminishing returns)?

What I did not take from Jez and Dave’s book is that you should fool yourself into thinking you can be 100% confident in everything you push just because you have tests. Thankfully, Jez continues to talk (along with Nicole, Gene, and others) about tests in the State of DevOps report.

DORA Report

The DORA Report is often referenced as proof that you should deploy on Fridays, just like any other day. Because it provides data to help classify organizations, including defining how higher vs. lower performing organizations deploy, it’s useful to look at in cases like this. For instance:

  • Elite performers - mean change failure rate 7.5%, time to recover < 1 hour, deploy multiple times a day
  • Low performers - mean change failure rate 53%, time to recover > 1 week, deploy less than once a month

So, even assuming the advice being dispensed is simply “just become an elite performer” (in no way a trivial exercise), elite performers still have a mean change failure rate of 7.5%! Does that sound like their deploys never fail? That’s one way to make Friday afternoon more exciting! I realize the mean recovery time is < 1 hour, but that is also a mean. What does the distribution of failure rates look like? Is there a cluster at 0%? Is there a cluster at 7.5%? I don’t know. But regardless, there is no guarantee of any deploy being failure free, because even the elite performers have failures.

We also know that change is the leading cause of outages; I’ve seen estimates that 75% or more of all incidents occur at a change boundary. As a friend has said, “introducing a change boundary in the 4-6 remaining hours before the whole team is off for 50+ hours seems … like not a high probability play”. But let’s use the 7.5% change failure rate for elite performers. Do you wear a seatbelt in the car? Yes? Why? What if you had a 7.5% chance of a minor accident? What % chance of a major accident would make you wear a seatbelt? If your argument is that accidents are out of your control, I’d like to introduce you to complex distributed systems…
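To put some rough numbers on that (my own back-of-the-envelope illustration, not a figure from the report), even the elite 7.5% rate compounds quickly if you deploy every Friday, assuming for simplicity that deploys fail independently:

```python
# Probability of at least one change failure across n deploys,
# assuming independent deploys at the elite mean rate of 7.5%.
# (A simplification: real failures are not independent.)
rate = 0.075

for n in (1, 4, 12, 52):
    p_at_least_one = 1 - (1 - rate) ** n
    print(f"{n:>2} deploys: {p_at_least_one:.0%} chance of at least one failure")

# Output:
#  1 deploys: 8% chance of at least one failure
#  4 deploys: 27% chance of at least one failure
# 12 deploys: 61% chance of at least one failure
# 52 deploys: 98% chance of at least one failure
```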

To put it another way, at Google when you violate your SLO, what is slowed down? Releases. Not more tests, not more monitoring, releases.

The other problem is that it’s not even necessary for you to cause an outage for your weekend to be interrupted. I learned long ago to be very careful about when I did firewall upgrades. Why? Because every firewall upgrade was generally accompanied by days of spurious correlations about whether something was affected by the upgrade. “Dave, I can’t print, didn’t you recently upgrade the firewall?” “Neither the network traffic for your laptop, nor the printer goes through the firewall.” “But couldn’t…” “No”.

Having the capability is necessary

Now choosing not to deploy on Fridays is very different than having the capability to deploy on Fridays. You should have the capability to deploy at any time. Things break, code needs to be shipped. You should absolutely be developing this capability if you do not have it.

I have worked with elite performers. We still chose to be very careful about our Friday deploys.

We also chose to make sure our feature flagging, blue/green, and dark launching capabilities were robust. We had developers deploying their own code whenever they wanted. We deployed multiple services multiple times a day. Every day.

But when Friday afternoon came, if someone was going to push a deploy, they would consider if that was necessary, or if it could wait until Monday. After all, there were other things to do.

Cultural Norms are not Moratoriums

Moratoriums

Being choosy about deploying on Friday is not the same as a moratorium. Moratoriums generally require some kind of change advisory board to approve special cases for releasing during the moratorium. Minimizing risk during times when the results of a failure can have outsized impacts is, instead, part of the communication and respect we see in DevOps. If the core of DevOps is empathy, then considering the impact our actions can have on others is exactly that - empathetic. Besides, change advisory boards are useless!

[Image: cabs]

Netflix

I was happy to learn that I’m not the only one who has worked at places that were cautious about their Friday deploys when Aaron Blohowiak tweeted about Netflix:

[Image: Aaron Blohowiak’s tweet about Netflix]

Even the tech giant Netflix has a cultural norm of avoiding Friday afternoon deploys. Not a moratorium, but part of the culture of the company.

Another great example from Netflix is the Chaos Monkey. The Chaos Monkey runs during business hours so that people will be more available to respond should something untoward occur. Is that because Netflix doesn’t do enough testing? Or maybe their monitoring is not good enough to run on the weekend? If every hour of every day were exactly the same as every other, this would make no sense. Instead, they run the Monkey when people are around to address problems, not when they have other commitments.
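The idea generalizes into a trivially small gate that any chaos tooling could check before each run. This is only a sketch of the concept; the hours, the workdays, and the Friday afternoon cutoff are my assumptions, not Netflix’s actual configuration:

```python
from datetime import datetime

# Hours, days, and the Friday cutoff are assumptions for this sketch.
BUSINESS_START, BUSINESS_END = 9, 16  # local time
WORKDAYS = range(0, 5)                # Monday=0 .. Friday=4

def chaos_allowed(now: datetime) -> bool:
    """Only disrupt things when responders are around to notice."""
    if now.weekday() not in WORKDAYS:
        return False                        # nobody is at their desk
    if now.weekday() == 4 and now.hour >= 12:
        return False                        # respect Friday afternoon, too
    return BUSINESS_START <= now.hour < BUSINESS_END

if __name__ == "__main__":
    if chaos_allowed(datetime.now()):
        print("running the experiment")     # placeholder for the actual chaos
    else:
        print("outside business hours; skipping this run")
```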

Complex Distributed Systems

The fact remains that generally we are working on complex distributed systems, and the causes of outages are often elusive. Often when we discover the nature of a problem, it is only obvious in hindsight.

A number of years ago, I was rebuilding a SQL proxy tier at a company. We were automating our proxy builds and deploying new versions of the software. We used these proxies to keep the short-lived nature of our PHP application requests from opening thousands of connections per second to the database. Connection initiation and teardown are not free, so we had an intermediate tier designed to take that kind of load much more effectively than the database itself.

I was building the new tier but was not sending any traffic to it because the weekend was coming. After everything was built, things seemed calm. Until about an hour later, when the database started having problems. Connections were randomly timing out. We looked and saw that the database was often hitting max connections and, as a result, many requests were not making it through. Ultimately we determined this was because my new tier had opened connections to the DB, even though they were not being used, and that had pushed us over the limit when a certain class of traffic appeared. It seems pretty obvious in hindsight, and we pieced together what happened through monitoring.

But the facts were:

  • This was a new tier that had never taken production traffic
  • It was in a brand new VLAN that had never seen production
  • This was a new version of the software
  • The databases had been running fine the entire time this tier was being created
  • The new tier had all the latest monitoring on it and showed no signs of problems
  • The tier it was replacing also showed no signs of problems

And yet the database was dropping connections, and it was an all hands on deck situation on a Friday afternoon when most people were thinking about their weekend. Thankfully, it was relatively easy to resolve.

Are these types of things common with releasing new software? No, but they happen.

Four Day Work Week

So, if we’re not going to deploy on Friday afternoons, what do we do with that time? Do we just give everyone Friday off? Less shipping means no work? One thing we can do is be protective of our employees through work life balance and reduction of stress.

I have read with great interest about four day work week experiments. Among the validated results of moving to a four day work week were:

  • Boosted productivity
  • 24% improvement in work life balance

Being protective of employees is something of which I’ve always been very supportive, whether it’s booting the person off Hangouts who had a blanket over their head and a hot bowl of soup in their hands, or insisting people take the day off when they’ve been up all night troubleshooting an especially difficult issue after a bad deploy.

I realize most companies are not going to investigate a four day work week, but Friday afternoons can still be put to good use.

At one of the companies I worked for, most of the Ops team would go out for lunch every Friday. That meant lots of Friday morning deploys, and then lots of great collaboration in the afternoon.

If you’re working with a globally distributed team, do you do a Friday afternoon deploy on the west coast of the United States? That’s almost the next day in most parts of Europe. Most Europeans are not excited to be called back to work late in their evening to help figure out why 8% of traffic is getting 500 errors after a Friday afternoon deploy.

Weekends (all days are not the same)

One of my favorite things about working at Salesforce was the number of people who chose to wear Hawaiian shirts on Friday. This is something I’d done on and off over the years ever since the release of Office Space as a way of recognizing the specialness of Friday.

If the argument is that releasing new software is the same regardless of the day, that ignores what people do on weekends. People make plans to go away, they go camping, they go to the opera, they read books in a hammock by the shore. They go to their kids’ soccer games, they work in the community garden, etc. Do some of those things happen occasionally on a Tuesday night? Sure they do. But the vast majority of weekend travel happens on the weekend, and doing things that can jeopardize it doesn’t show a lot of respect for your coworkers, or for your employees’ work/life balance. They need that time to rest and recharge.

This is one of the reasons I always liked on-call rotations that rolled over on a Thursday. I always wanted my teams to be able to take a Friday off to get away for a three day weekend. The more downtime in a block, the better.

If the argument is that not shipping on Friday afternoon is going to hurt productivity, remember the old Agile adage “you have to go slow to go fast”. Driving a system at 100% capacity is actually a way to reduce your throughput, not maintain it.
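Basic queueing theory backs this up. In a simple M/M/1 model, time in the system is the service time divided by (1 - utilization), so response times explode as you approach 100% busy. A quick illustration (the model is a simplification, not a claim about any particular team):

```python
# Response time in a simple M/M/1 queue: W = S / (1 - utilization),
# where S is the service time. As utilization approaches 100%,
# time in the system grows without bound.
service_time = 1.0  # one unit of work per unit of time

for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    response_time = service_time / (1 - utilization)
    print(f"{utilization:.0%} utilized -> {response_time:5.1f}x the idle response time")
```

At 50% busy, work takes twice as long as it would on an idle system; at 99% busy, one hundred times as long. People are not queues, but the shape of the curve is familiar to anyone who has worked on an overloaded team.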

Does this mean that you should never deploy on Friday? Of course not. That also does not mean you shouldn’t consider what you’re deploying. It may seem to make sense when someone says “We scheduled the move from Oracle to Postgres for 4 hours, so if we start at 1 p.m., we should be done in plenty of time.” My answer to that logic is: NOOOOOOOO.

You should be able to, but you don’t have to deploy on Friday afternoon. You should not be shamed for having a culture that respects Friday as being different. It is different.

Deploy on Friday’s or Don’t. The choice is up to you.

Make the Right Way, the Easy Way

This post was originally published on CIO.com on October 23rd, 2019.

Background

A number of years ago, I had just started a new job in a leadership role, and I attended a senior leaders discussion led by the Chief Information Security Officer (CISO) of the company. There were various topics covered, but one had to do with the configuration of the servers, which was a topic near and dear to my heart.

The CISO was talking about how they were going to demand compliance from all appropriate staff: the servers were to be configured in a certain way, with a certain version, etc., in order to improve security in the company. If anyone had a problem with that, there were other places they could be employed. If necessary, there would be mandatory training…you get the idea.

To the shock of many in the room, I put my hand up and asked a simple question: “What if we just made the right way, the easy way?”

Effecting Transformation

This notion of making the right way the easy way has come up time and time again when working with clients on their transformations. Almost always, a transformation requires getting people to change the way they operate the business. The tool leaders often reach for is a stick.

The solution is often “We’ll get an executive to tell them they have to do it”.

But that ignores the Westrum Generative culture we are hoping to build in high performing organizations. We need to be able to trust that people will make the right decisions, because they are closest to the information. Information that reaches the executive level is highly filtered and, as a result, suited to strategic decisions, not the decisions made day to day on the front lines. If the mandated configuration does not meet a team’s specific needs, even though their deviation would cause no security problems, we can end up hampering the performance of our own organization. In the best case, they will be delayed, starting down the long, slow path of asking for an exception. In the worst case, they will bypass the new process and hide the change they have made, so that they can still get their job done without jumping through hoops.

If we make the right way the easy way, then almost anyone faced with one of these situations will choose the path that has been provided for them. Very few humans like to go out of their way to take on extra work when they can choose the easier route. In the case of the CISO above, if we simply provided pre-configured, packaged software to the systems engineers, with all the appropriate versions and security configurations, there is very little reason they would try to do all that work from scratch when they could simply install a package. Today we would do this with containers, but the principle is the same. Why would I want to make my own secure (or performant, or whatever) container from scratch, if I could just use the one provided by the company that suits my needs? In the case that it does not meet my needs, that is a problem to be solved, as easily as is warranted.

In Practice

Because we often talk about DevOps in this column, we can look at some examples from different parts of the business where making the right way the easy way can have a big effect.

Dev

We have talked before about how important it is for Operations to facilitate the speed of developers moving software to market. We’ve talked about how Operations provides a platform.

The reason to have a platform is that it makes the right way the easy way. In Operations, we often talk about putting in guardrails. Without them, we turn the developers loose on the AWS platform and allow them to spin up databases, virtual machines, queues, etc. however they like. With a platform, we give them tools to do these things.

If a developer needs an AWS Elastic Compute Cloud virtual machine instance, they could go find some virtual machine image, set up some wide open security group rules, open the instance to the Internet so they don’t have to worry about being blocked by firewall rules, and make all the other possible mistakes along the way.

Or we can give them a tool to launch an instance with all the right settings configured correctly. With this tool, they only need to choose from a minimal set of requirements to get exactly what they need. That’s making the right way the easy way, and there are few developers who would sign up for the manual method.
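As a rough sketch of what such a tool could look like, here is a guarded launcher built on boto3. The AMI, security group, subnet, and allowed instance types are hypothetical placeholders; the point is that the caller chooses almost nothing and gets the blessed configuration by default:

```python
import boto3

# Blessed defaults baked in by the platform team.
# The IDs below are placeholders, not real resources.
APPROVED_AMI = "ami-0123456789abcdef0"        # hardened, patched image
APPROVED_SG = "sg-0123456789abcdef0"          # no 0.0.0.0/0 ingress
APPROVED_SUBNET = "subnet-0123456789abcdef0"  # private subnet
ALLOWED_TYPES = {"t3.small", "t3.medium", "c5.large"}

def launch(instance_type: str, owner: str):
    """Launch an instance the right (and easy) way: callers choose
    only from a minimal set of options; everything else is preset."""
    if instance_type not in ALLOWED_TYPES:
        raise ValueError(f"choose one of {sorted(ALLOWED_TYPES)}")
    ec2 = boto3.client("ec2")
    return ec2.run_instances(
        ImageId=APPROVED_AMI,
        InstanceType=instance_type,
        SecurityGroupIds=[APPROVED_SG],
        SubnetId=APPROVED_SUBNET,
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Owner", "Value": owner}],
        }],
    )
```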

Ops

This can also apply to Operations teams. If the correct way to configure an application is to put the configuration files in /directory/appname for every application, then we can build tools around that pattern to make it very easy. Have some brand new microservice to deploy? No problem: use this tool to set up your configuration and we will make sure that part is deployed successfully. How many people would then demand that their application configuration be placed under /somewhere/else?

This also helps the Operations staff to reason about applications, either during normal operations or during an outage. Because we’ve made the right way the easy way, if it’s 3 a.m. and my SRE has just been woken up for a production site outage, they don’t need to spend time figuring out where to look for an application’s configuration. It’s always under /directory/appname, regardless of the app. That means less time trying to remember things under stress, less time looking through runbooks trying to find the section on configuration, and a much faster Time to Repair (TTR) for our production installation.
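A minimal sketch of what tooling around that convention might look like (the file name and format are my assumptions; the /directory/appname convention is the one from the example above):

```python
import json
from pathlib import Path

CONFIG_ROOT = Path("/directory")  # the single, company-wide convention

def load_config(appname: str) -> dict:
    """Every app's config lives at /directory/<appname>/config.json
    (file name and format are assumptions for this sketch), so nobody
    has to hunt for it at 3 a.m."""
    return json.loads((CONFIG_ROOT / appname / "config.json").read_text())

# The same call works for every service, old or brand new:
#   load_config("billing")
#   load_config("shiny-new-microservice")
```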

Business

Even our business operations folks can take advantage of this idea. If we have a tweet that has to go out in a specific format, at a specific time of day, would we like to spell out those instructions for someone in marketing to follow to the letter? What if they make a mistake? Should we fire them for being human?

Or can we make the right way the easy way and invest in a marketing automation tool that is designed to post tweets, from templates, with a scheduler? I would much rather schedule a tweet to go out at 2 a.m. during the afternoon marketing meeting than get up at 1:58 a.m. and hope that I push the button correctly.

Does anyone believe that more training, or threatening people’s jobs, will prevent problems at 2 a.m., for a process that can be done the right way, and easily, by a computer?

Conclusion

When organizations are going through transformations especially, or even operating normally, there is often a tendency to want to specify exactly what people should do. Unfortunately, situations vary, and there is rarely a one size fits all solution. This is not a problem exclusive to technology companies; it affects aviation, medicine, and other professions as well.

If we put our employees in situations where the “approved” way is also the easiest way, and we make sure it fits their needs, then we will have a situation where people can move quickly and use their expertise to arrive at the most successful outcomes for the company.

Moving Operations to Simple With Cynefin

This post was originally published on CIO.com on September 19th, 2019.

Background

In 2007, David Snowden and Mary Boone published an article in the Harvard Business Review called A Leader’s Framework for Decision Making. In it, they describe a way of looking at different classes of problems, and how the methods used to solve those problems will be different depending upon in which context you are operating. They called this framework “Cynefin”, which is a Welsh word that describes the often unforeseen factors that influence our decisions.

When learning about this framework, I could not help but think of their descriptions in the context of many problems that I’ve had to solve over the course of my career in Production Operations and Engineering. The authors even describe Cynefin in a context that will look very familiar to those who have been in this same role: “Leaders who understand that the world is often irrational and unpredictable will find the Cynefin framework particularly useful.”

Irrational and unpredictable? How many production outages have I been involved with that appeared irrational and unpredictable? Most of them! I began to think about how Cynefin could be applied in a DevOps context.

Cynefin

[Image: the Cynefin model diagram]

As can be seen in the diagram, Cynefin separates our problem and decision types into 4 distinct quadrants: Simple, Complicated, Complex, and Chaotic. In each of these scenarios, different leadership skills must be applied to navigate successfully.

Simple/Obvious

The Simple or Obvious domain is characterized by simple inputs and outputs: a given input leads to a well defined output. There is no ambiguity. The responses here are often characterized as best practice. If I need to make fries at a fast food restaurant, there is a specific volume of fries, cooked for a specific amount of time at a specific temperature. Any deviations from the norm should be minor and easily handled by the operator.

Complicated

In the realm of the Complicated, some expert knowledge must be applied to the problem in order to arrive at a decision. The authors called this “good practice”: there needs to be an interpretation of the problem before a decision is made, not simply a choice of which “best practice” applies. An answer is definitely achievable, but it will not necessarily be immediately obvious. “Reaching decisions in the complicated domain can often take a lot of time, and there is always a trade-off between finding the right answer and simply making a decision.”

Complex

According to Snowden and Boone, many problems in organizations can be characterized as complex. These are situations for which there is no clear, well defined outcome, and the problem must be probed in order to ascertain the correct path forward. I have seen many production environments that I would consider complex; “complex distributed systems” is a very common phrase in our profession. Many outages have happened because some input to the system had a completely unexpected outcome and resulted in a major problem.

The Knight Capital disaster is a classic example. No one had predicted that a deviation on one system would lead to a catastrophic outcome for the company. When dealing in the realm of the Complex, caution is warranted, and decisions should be made based on evidence, not simply past experience.

“Most situations and decisions in organizations are complex because some major change…introduces unpredictability and flux. In this domain, we can understand why things happen only in retrospect. Instructive patterns, however, can emerge if the leader conducts experiments that are safe to fail.”

Chaotic

The Chaotic is the area of unknown unknowns. As it is described, the only objective in the Chaotic domain is to remove oneself from it as quickly as possible. Leaders in this area are advised to make a decision and move to another quadrant, any quadrant, from which a definitive path forward can be taken.

In Practice

So, how can we apply Cynefin in a DevOps context? What can we recognize about these four domains that is applicable to our responsibilities of keeping the site up, and keeping developers moving as fast as possible?

“…then sense where stability is present and from where it is absent, and then respond by working to transform the situation from chaos to complexity, where the identification of emerging patterns can both help prevent future crises and discern new opportunities.”

What I came to realize was that our job in operations is to move problems clockwise around the Cynefin diagram, trying to make most problems faced by developers simple. For example, if I want a new virtual machine in AWS, it is a simple, well defined API call that needs to be made in order for this to happen. All the inputs are well defined, and all the outputs are well understood. Exactly like the bottom right quadrant.

Damon Edwards likes to say that “Operations provides a platform”. If that is the case, then part of our job in Operations is to provide a platform, similar to the one presented by the AWS API, which enables self-service activity by the development teams, so that the tasks they are trying to accomplish are simple and obvious, and do not require them to apply expert knowledge to get their work done. I once worked with an engineering team that estimated they spent more than 60% of their time on “plumbing”, or wiring up the virtual hardware necessary for them to accomplish their task. Work that could be provided by a platform developed by Operations. Coaching these teams to a new way of working provided some very quick ROI for that client!

If our goal is to be as close to the Simple quadrant as possible, we can look at some examples where this is not the case, and some ways in which we can do better.

Environments

I have often worked with clients who have made a large effort to build out their production environments so that everything is very clean and well defined. That does not mean the environment is trivial (or obvious) to understand, but they make it possible. They are using infrastructure as code, they package everything into containers, they do regular deployments, and there is plenty of documentation. I would characterize those environments as Cynefin Complicated. They do require some expert knowledge to understand, but we can reason about them.

When it comes to their staging environments, however, these same clients have let everything become a mess. In a misplaced effort to “save money”, the staging environment is where all the corners are cut. Instead of 5 separate web tiers like production, there are 5 web configurations jammed onto one host on different ports. Instead of an Oracle RAC database, there is a single Postgres instance that is “close enough”. Of course, as this environment looks nothing like production, it’s basically worthless for testing, and because it’s such a hack of previously isolated things jammed together, we’ve actually moved from Complicated to Complex, and have a much harder time maintaining the environment.

A simpler way to deal with the problem (and save money) is to run smaller instances of the production tiers in the staging environment, and use the exact same business logic to build both. If we are running on a c4.4xlarge instance in production, then we can use a c4.large instance in staging (or whatever is appropriate). This way, the environments are basically identical except for load. It also means that any code intended to manage production can be tested in the staging environment first, and as Gene Kim says, the ability to build representative test environments on demand is one of the strongest indicators of high performing IT teams.
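A sketch of that idea, with one definition of the tier parameterized only by environment (instance types are illustrative, echoing the c4 example above):

```python
# One definition of the tier, parameterized only by environment size.
# Instance types here are illustrative placeholders.
SIZES = {
    "production": {"web": "c4.4xlarge", "db": "c4.8xlarge", "count": 5},
    "staging":    {"web": "c4.large",   "db": "c4.xlarge",  "count": 5},
}

def build_tier(env: str) -> list[dict]:
    """Same topology and same business logic for every environment;
    only the instance sizes differ."""
    spec = SIZES[env]
    web_nodes = [{"role": "web", "type": spec["web"]} for _ in range(spec["count"])]
    db_nodes = [{"role": "db", "type": spec["db"]}]
    return web_nodes + db_nodes

# build_tier("staging") has the same shape as build_tier("production"),
# so anything tested against one is representative of the other.
```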

We may not have moved all the way to Simple in this case, but we’re in a much better place than when operating in the Complex.

Deployments

Another example of Cynefin in action can be in our deployment processes. For many years, we have seen deployments as nightmares for Operations teams. Deployments that happen infrequently batch up large amounts of changes just waiting to interact in new and exciting ways under production load.

Often these infrequent releases involve multiple teams executing a series of steps, all designed to work together over multiple hours, until the deployment is finally complete. If there are any problems, there are complicated rollback procedures, only some of which have been tested. Generally, each application will have its own deployment procedure depending on its age, coding language, development team, etc. This is definitely the realm of the Complex: not only do we need to apply expert knowledge as in Complicated, but because every procedure is a unique snowflake, we don’t know what effect any one action may have on any other system.

The first step in moving to the Complicated would be to align all the different deployment schemes around a common pattern or three. That way, for any one deployment, we only have a limited set of possibilities to reason about. This brings us into the area of “good practice”, where we do not need to consider a bunch of anomalous outliers.

If we wish to make the final jump to Simple, we need to create an environment where developers have a self-service platform constructed with well defined inputs and outputs. We can use a chatbot like Hubot, Lita, or Errbot to make the inputs, the interface, uniform for any type of deployment. Regardless of the deployment itself, the chatbot will make everything appear the same and return the same well defined output, even as the actual mechanisms for deployment are hidden from the end user. Thankfully, even in this case, the documentation of how the deployment is actually done is the source code itself, so the mechanisms can be explored and understood as well. We’ve moved our deployments from Complex to Simple. There is no question that this is a large, but worthwhile, investment.
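As a sketch of that uniform interface, here is what a minimal Errbot plugin could look like. The command name, the apps, and the deployment mechanisms behind it are all hypothetical; the point is that every deployment looks identical from chat:

```python
from errbot import BotPlugin, botcmd

class Deploy(BotPlugin):
    """One chat interface for every deployment, whatever runs underneath."""

    @botcmd
    def deploy(self, msg, args):
        """Usage: !deploy <app> <version>"""
        parts = args.split()
        if len(parts) != 2:
            return "usage: !deploy <app> <version>"
        app, version = parts
        # Hypothetical per-app mechanisms; each may work completely
        # differently behind this uniform front.
        mechanisms = {
            "frontend": self._deploy_containers,
            "billing": self._deploy_packages,
        }
        if app not in mechanisms:
            return f"unknown app {app!r}; known apps: {sorted(mechanisms)}"
        mechanisms[app](version)
        return f"{app} {version} deployed"  # the same well defined output every time

    def _deploy_containers(self, version):
        ...  # e.g. roll a container orchestrator deployment

    def _deploy_packages(self, version):
        ...  # e.g. push OS packages and restart services
```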

Conclusion

Often as leaders, we are asked to make decisions about which is the right path forward. Depending on the context of the situation, there can be different choices made. The Cynefin framework gives us a way to look at these situations, and decide what is the appropriate response.

By applying this same framework to Operations work, we can move toward more self-sufficient, high performing engineering teams. As we create platforms that present engineers with interfaces that are Simple, well defined, and don’t require a lot of creativity and expertise to use, we allow them to focus on things that do require those skills, like writing code and growing our businesses.

I look forward to exploring the various ways that we can help Operations teams enable development and product to go ever faster in more detail in future columns.

Reflections on Health 2.0 2019

Background

I’ve been spending a lot of time getting back into the intersection of health and technology (after a long time away). Because I currently have HealthTech clients, and because I wanted to learn more about the specific challenges facing health tech companies with service delivery, I attended the September 2019 version of Health 2.0 in Santa Clara.

Observations

The first thing that struck me about the difference between a health tech conference and the conferences I usually speak at or attend is: it’s just tech! That may sound like a silly observation, but so many people over the past year have told me how health tech is 15 years behind regular tech. The crowd at Health 2.0 was definitely not representative of the larger industry, but it was still interesting to compare and contrast my usual experience with this one.

Similarities

Even at this conference, with the cutting edge attendees, there were still some things that genuinely struck me as conversations we’ve had in the DevOps community for a number of years. Let me explain.

HiTrust vs. ITIL

I haven’t had much time to dig deep into the HiTrust CSF, but I was able to have a great discussion with Ryan Rich and Katie Peterson from Datica about their HiTrust solutions.

Ryan was explaining that there are sometimes very prescriptive requirements for the certification. As a veteran of many certification processes (GDPR, PCI, SOX, ISO, etc.), it was interesting to hear about the approach taken by HiTrust. In some respects, I wondered if there were opportunities to fulfill the requirements differently if we understood the spirit behind the controls.

In other respects, I was concerned that HiTrust might look too much like ITIL. ITIL was definitely best practice for its time, but I have major concerns about a government agency being able to keep up with the pace of change in technology today when specifying standards. After watching communications standards in healthcare hamstrung for so many years by an abundance of overly strict agreements before the advent of FHIR, I’m concerned that the utility of HiTrust may last only for a limited time. After such time (or maybe already), it may become a weight dragging down the pace of innovation in health tech.

DevOps

I also heard the concerns of those who felt that people with a tech background (Google, Apple, Amazon, etc.) may not understand the real concerns facing those in health tech; that the consequences of failure in health tech are so much greater than those in the technology companies that it’s a very different problem. Maybe dragging down the pace of innovation in health tech is not so bad. We certainly don’t want to “move fast and break things” when dealing with heart valves!

I think that really misses a lot of the accomplishments we’ve made with the DevOps movement in the past decade or so. I used to be told at Salesforce that it doesn’t matter if Netflix goes down, that Salesforce was a different beast. My response was always, if that’s the case, why are they so much better at running production systems than you are?

It ignores Gary Gruver’s experience automating of the testing of the JetDirect cards at HP. This was not software being shipped to a website like Google Docs, these were pieces of hardware that were to be installed on customer premises.

There are countless examples of things we’ve learned about service delivery, quality, and safety over the last decade. The irony is that a lot of what John Allspaw and Adaptive Capacity Labs are doing, bringing “research-driven methods and approaches to drive effective incident analysis in software-reliant organizations”, comes from medicine and can be applied right back to health tech “software-reliant organizations”!

In Practice

One of the most direct displays of what we know from the DevOps movement was given by Michael Palantoni from Athena Health. He was talking about what was required to deliver Value Based Care.

[Image: slide on delivering Value Based Care (VBC-Devops)]

While there are many lessons on the slide we’ve learned in DevOps about communicating with, and deeply understanding customers, the two that really jumped out at me were those related to Quality and Collaboration. These are both topics I’ve spoken about at DevOps conferences and recognize as essential for the successful implementation of a socio-technical system, which is at the core of health tech!

Scale

Dr. David Levin talked about how health tech is learning to handle thousands of health tech IoT devices, where communication needs to be a first class citizen. He described how health tech is learning to deploy, collect, analyze, and respond to this type and volume of data. When I ran Site Reliability Engineering for the SolarWinds Cloud, our products routinely ingested and processed hundreds of thousands or millions of data points per second, depending on the product. One product did this while taking 37 seconds of planned partial downtime in a year. Losing a single customer datapoint was taken very seriously. We don’t need to reinvent the wheel; we need to stand on the shoulders of those who have come before.

Dr. Levin’s sentiments were echoed by Cris Ross, CIO of the Mayo Clinic. In announcing their partnership with Google Cloud, he told the audience, “We want to configure ourselves as a platform company”. The explosion of health care data, sensors, EMRs, etc. introduce a lot of complexity and we “need systems that help manage that complexity”.

These challenges are not only challenges for health tech, they are challenges for tech in general and it’s exciting to consider how lessons can be applied from each domain into the other.

Automation

Lastly, I was struck by a statement from Rachel Blank of Rory. She explained that there is often a misconception that her product, and other health tech products, are looking to replace doctors. As Ms. Blank explained, nothing could be further from the truth.

This sounds so much like John Allspaw explaining the role of automation in technology. Automation is not there to replace humans; it’s to augment them, to partner with them, to make them more successful. It would be foolish to hand over all autonomy to automation. This sentiment was echoed by the machine learning/AI folks at the conference as well: radiologists are not being replaced by an algorithm.

Guys

There are so many similarities between tech and health tech that one major difference did jump out. Even in panels specifically focused on diversity topics, or the role of women, the moderators continually referred to the entire panel as “guys”. I understand that this was probably only jarring for me, but at this point it is something to which I’m not accustomed. After hearing Bridget Kromhout passionately explain to the DevOpsDays Ghent crowd in 2014 how this is exclusionary, I’ve taken it to heart.

I’m sure the health tech community is not far behind on this one either.

Differences

One major difference that I discovered between a traditional tech conference and the Health 2.0 one was the emphasis on solving real problems in the world. I’m sure there are many folks in tech who believe that their cat food delivery service is solving a real world problem. Other than perhaps the closing panel on ethics at SRECon APAC this year, I’ve never seen a tech conference address such real societal concerns head on.

These were active discussions about pragmatic solutions to many problems. Just like the automation discussion above, these were not “tech solves everything” solutions. They were using the power of technology in transformational ways.

The first day had a panel called The Unacceptables. There were examples of solutions people were working on dealing with homelessness, maternal mortality, and human trafficking. These are real problems affecting society about which we should all be directly concerned.

The second day had a panel called Social Movements in Healthcare Amplified by Technology. This panel dealt with issues of data privacy, data transparency, diversity in funding, and sexual harassment.

These panels were extremely impressive, not just for the topics they were discussing, but for their belief in using technology for good (as opposed to persecution and death).

Next Steps

I really learned a lot attending the Health 2.0 conference, and I am grateful to the organizers for putting together such a great event. Not only did I have a lot to think about during the two days of attendance, but there are many people I hope to follow up with for more discussion. These are challenges I will be considering for some time to come.

But, I would also like to participate. I would like to be part of the conversation. I’m considering trying to speak at the next HIMSS 2.0 Conference in Orlando, FL next year. What would I talk about? Operational Concerns for Health Tech Startups? How to scale your health tech product with reliability and security? I’m interested in your thoughts.

Stability vs. Speed, Pick Two

This post was originally published on CIO.com on August 20th, 2019.

Background

Having held various leadership positions in Operations engineering organizations over more than 20 years, whether as a team lead, an architect, or leading a global Site Reliability Engineering (SRE) organization, I’ve developed a philosophy that I use as guidance for my teams. I wanted an easy way for them to make decisions, either about what to work on, or how to prioritize their work.

It comes down to two simple rules:

  1. Keep the site up
  2. Keep the developers moving as fast as possible

That’s it. When Operations (often known as systems, DevOps, Site Reliability, SiteOps, Tech Ops, etc) engineers keep those principles in mind, they will almost always make the right decisions for the business. Because I subscribe to David Marquet’s philosophy about the organization achieving its highest performance with active, engaged contributors, this has worked extremely well over the years.

Engineers must decide regularly during their Agile planning meetings which work is most important to the business. Do they make an improvement that ensures faster, more reliable, and easier deployment of new databases? Do they work on being able to ship features faster? There are many considerations that need to be factored in. As those engineers are closest to the information, they are also in the best position to help make the right decisions. Often, as we move “up the ladder” in an engineering organization, the information we get is filtered and interpreted before we receive it. This is why it’s so important to do things like skip-level meetings. This culture is what Ron Westrum describes as Performance Oriented, encouraging collaboration and cooperation among teams. When our engineers are empowered, high performance is the result.

I once had the opportunity to work for an organization where these directives were reversed: the organization valued feature delivery over making sure the site was available to end users. Certainly something to try and wrap your head around on your next commute.

So, how does this work in practice? What do these directives mean? Why are they ranked as such?

Stability

If you’re in the business of running a website or service available over the Internet, the site has to remain available. Even if our primary mission is not developing software, as Watts S. Humphrey said: “Every business is a software business.” This means that no matter what kind of software our business delivers, we cannot make money on that software unless it is available.

If we have downtime (even scheduled downtime), the business impacts can be lost or delayed transactions, less awareness of our products, reputation damage, or frustrated customers. Regardless of the type of damage, the software must not only be available, it must be performant as well. As Charity Majors often points out: “Nines don’t matter if users aren’t happy.”

I once worked for a company whose capacity planning metric was something close to the average page load time when the site was up. The last five words in that sentence are not a typo. The site was operating under the principle that scheduled downtime was OK, and therefore downtime was not to be counted. However, the metric had no component referring to whether or not the downtime was scheduled. If we fell outside this measure, they would look at expanding the size of the database tier, or moving some customers to different hardware. By this measure, if the site was down for 29 out of 30 days in a month but had very good load times for that single day, then everything was fine! Yes, we did manage to get that changed.

Instead, as operations engineers, we need to focus on both the stability and performance of the website. When working with a team that understands this is a priority, we can make significant changes with very little impact to our customers. Oftentimes, making major changes with a minimum of disruption can require additional planning, and additional steps, but good engineers understand the significance to the business.

I had the opportunity to work with an engineering team that was orchestrating a move from the old Amazon Web Services (AWS) Classic environment to the new Virtual Private Cloud (VPC) environment. Aside from some pretty tricky networking requirements, moving most of the services was a fairly straightforward process.

  1. Stand up services in the new environment
  2. Migrate traffic to the new services
  3. Let traffic drain from the old environment
  4. Terminate the old service

Pretty simple for stateless (and possibly even stateful) services, until you get to the primary data store. Unless you are going to run some kind of distributed database cluster across a complicated network topology, there will be downtime involved. At some point, database writes have to move from the database in the old environment to the one in the new. Full stop.

The team moved all traffic so that it ran through database proxies, instead of directly from the applications to the database itself. If they had not placed such an emphasis on availability, they could have skipped this step and instead spent an hour or more updating configurations and restarting services all over the infrastructure, with an associated amount of downtime. They also set up databases in the new environment that were replicas of a replica, ready to be promoted to the new primary. The process became:

  1. Ensure all read traffic was coming off the new replicas
  2. Break the replication from the old environment
  3. Point all write traffic to the database in the new environment

That’s it. For that migration, we took 37 seconds of partial planned downtime for the year. 37 seconds of planned downtime for the year is a pretty enviable achievement, but additionally:

  1. All read traffic continued uninterrupted
  2. All writes for data types that had been seen before were queued and written after the migration
  3. Only brand new data, that had never been seen before in any capacity, received an HTTP 503 error code during those 37 seconds.
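Here is that three-step cutover sketched in code, with stub objects standing in for the real proxy and databases. All the names are hypothetical, and the real migration was of course far more involved. For context, 37 seconds out of a year is better than 99.999% availability, even for the affected class of writes.

```python
# A sketch of the cutover described above. Stub classes stand in for
# the real proxy tier and databases; every name here is hypothetical.

class Database:
    def __init__(self, name):
        self.name = name

    def stop_replication_to(self, other):
        pass  # step 2: break replication from the old environment

class Proxy:
    def __init__(self, read_backend, write_backend):
        self.read_backend = read_backend
        self.write_backend = write_backend

    def set_write_backend(self, db):
        self.write_backend = db  # step 3: repoint all writes at once

def cutover(proxy, old_primary, new_primary):
    # Step 1: all reads must already be coming off the new replicas.
    assert proxy.read_backend is new_primary
    old_primary.stop_replication_to(new_primary)
    proxy.set_write_backend(new_primary)

old = Database("old-classic")
new = Database("new-vpc")
proxy = Proxy(read_backend=new, write_backend=old)
cutover(proxy, old, new)
print(f"writes now go to {proxy.write_backend.name}")
```

The proxy tier is what made this possible: because every application spoke to the proxies rather than to the database directly, the write cutover was a single change in one place rather than a restart of every service.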

Speed

Being able to keep a site up, running, and performant is, of course, only part of the responsibility of an operations team. If we are not shipping software continually, then it is impossible to maintain parity with, or beat, the competition. As Dr. Mik Kersten points out in his book Project to Product, the organizations that master software delivery will survive; those that do not need only look at the Killer B’s of Blockbuster, Barnes & Noble, and Borders.

To that end, Kersten describes four types of work that engineering teams can spend their time on, which he calls flow items:

  • Feature - new value
  • Defect - quality problems, bugs, etc.
  • Risk - security, compliance, etc.
  • Debt - tech debt

Of course, for operations teams, keeping the developers moving as fast as possible is not merely shipping features. As the litany of security breaches over the years has shown us, mitigating risk can also be extremely important! But regardless of the type of work we ship, the more simply, safely, and quickly an operations team can facilitate the flow of that work through the system, the more potential for the success of the business. This focus on the smooth flow of work through the system should be recognizable as the First Way of DevOps to anyone who’s read The Phoenix Project.

As we’ve learned from the DORA State of DevOps reports, this success holds, regardless of the size of the business. Not just with vehicles, speed kills!

StartUps

For startups, the ability to ship safely and often is how we find product-market fit. For anyone who’s read Eric Ries’ The Lean Startup, it’s all about how many experiments we can run in a period of time. Ries calls this “validated learning”, and if we can do it faster and better than our competitors, then we can find our fit first. The competition can try to imitate and copy, but if we are flat out better at delivering software, we can even make more mistakes and still maintain our lead position in the market.

Enterprises

For enterprises, the average age of an S&P 500 company is under 20 years, down from 60 years in the 1950s. Dr. Kersten would argue this is because of their inability to deliver software effectively, as well as, perhaps, our stage in the Deployment Age as proposed by Carlota Perez.

But a simpler explanation could also be a concept with which we are already familiar: the Innovator’s Dilemma. Incumbent companies do not feel the same pressure to innovate as startups do, because of their existing revenue base. From the linked Wikipedia article: “Value to innovation is an S-curve: Improving a product takes time and many iterations. The first of these iterations provide minimal value to the customer but in time the base is created and the value increases exponentially. Once the base is created then each iteration is drastically better than the last.” The words leap off the page with the same rationale we used for startups above. Those who can experiment quickly, and thus innovate, become the disrupters and push the incumbents out.

What better explanation for why it’s critically important that Operations engineering enables the enterprise to innovate as quickly as possible?

In Practice

Perhaps no organization better typifies this balance between stability and speed than Google SRE. As we learned in the Google SRE Book, Google SRE has a concept of error budgets.

These seem pretty easy to understand at first blush. Each product supported by the SRE team reaches an agreement on the amount of downtime the service is permitted each month. Engineering is free to deploy as often as they like until they burn up that error budget. At that point, new feature deployments stop: the budget has been exhausted, and we do not want to give the customer a bad user experience. This seems like a great argument for stability!
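The arithmetic behind an error budget is pleasantly simple: the SLO implies an allowance of downtime (or failed requests) per period, and deploys spend from it. The SLO values here are illustrative, not Google’s:

```python
# Error budget arithmetic: the SLO implies an allowance of downtime
# per period. Numbers here are illustrative.
SECONDS_PER_30_DAYS = 30 * 24 * 3600

for slo in (0.999, 0.9995, 0.9999):
    budget_s = (1 - slo) * SECONDS_PER_30_DAYS
    print(f"SLO {slo:.2%}: {budget_s / 60:6.1f} minutes of budget per 30 days")

# SLO 99.90%:   43.2 minutes of budget per 30 days
# SLO 99.95%:   21.6 minutes of budget per 30 days
# SLO 99.99%:    4.3 minutes of budget per 30 days
```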

But there is a flip side to the concept of the error budget. If your team does not use up a sufficient amount of its error budget in a month, there is also a conversation asking engineering why not. A conversation about being too stable? Yes, because Google recognizes the need for speed, the need to push innovation, the need to stay ahead of the competition.

Stability or Speed? Why not both?

Conclusion

Whether your team is called Operations/System/DevOps, or Site Reliability Engineering, those teams have a very important role to play in the success of the business. I argue that Operations Engineering can be a strategic differentiator for the business, allowing customers constant access to the product, while enabling the developers to outpace the competition.

I’m sometimes asked why the developers can’t just do all this work themselves? Well, they can. But what are we giving up? I’ve seen some of the most talented development teams in the industry try and do Operations type work “on the side” as they try and deliver the four flow items to production. It always looks as if it were done in that manner. I’ve also seen talented development teams spend more than 60% of their week maintaining the “plumbing” required to keep their websites up, running, and deployable, instead of using that time to ship product.

I look forward to exploring the various ways that we can enable our organizations to deliver with stability and speed in more detail in future columns.

An Agile SRE Meeting Plan

[Image: the weekly SRE meeting schedule]

Engineers dislike meetings. What engineers really dislike are meetings in which they perceive no value. Below is a meeting plan developed, iterated upon, and used over many years at multiple companies, which has proven very effective at both maximizing meeting value and minimizing unnecessary time in meetings, so that engineers may do what many enjoy most: building things. This may not be the perfect plan for your organization, but it will hopefully inspire conversation and discussion about how to structure the time of your SRE team.

The Structure

Over the years, I’ve experienced many different Agile implementations. Scrum is considered to be a pretty poor match for interrupt driven teams like Site Reliability Engineers (SREs), but how to get the agile benefits of Kanban and still retain many of the advantages of Scrum? How to have a schedule that is relatively light on meetings, but still keep the maximum amount of communication and transparency? How to continue to be agile, instead of just doing agile, especially with distributed teams?

In the plan outlined herein, we try to balance many of those things. The meetings are laid out Monday to Friday, but they could certainly be Tuesday to Tuesday, or whatever fits best to line up with the development team’s sprints. (Protip: line up as best you can with the development team’s sprints.) When we first embarked on this path, we were on two week iterations, to match what the dev teams were doing. Over time, we discovered that we lacked the responsiveness we wanted to provide to those teams, and thus switched over to one week iterations, in order to maximize (internal) customer satisfaction.

The iteration starts off a bit meeting heavy to emphasize alignment, then allows plenty of time for our standard SRE work (elimination of toil, etc.), and finally closes with time to reflect and improve.

The Meetings

Iteration Planning

Purpose

The iteration planning meeting is, well, just what it sounds like: planning the iteration. Because SRE teams can be interrupt driven and use Kanban, iteration planning is not for committing to delivering specific work in a time box (as in Scrum). Instead, it’s for making sure that the entire team is on the same page in terms of priorities, needs of the business, project work assignment (e.g. Anne really wants to work on the VPC network project), dependencies on other teams, urgency of different tasks, and a look forward at the next few weeks and what work may be taken on.

This is really a time for discussion, and for identifying things that will require more in-depth discussion. It is not a time for going deep on any particular task, but for making sure that everyone on the team is aligned for the next batch of work. Oftentimes, work from the previous iteration will simply continue in the current one, but this is also a good time to check that the work is being delivered to expectations, especially in light of the demos (which we’ll get to later) that closed out the previous iteration.

Because we want to be agile, instead of simply doing Agile, this doesn’t mean that once work is agreed to, it’s set in stone for a week. It just means that we don’t want to oscillate wildly from day to day, or even week to week, and the iteration planning meeting is the opportunity to ensure the team is moving in the same direction simultaneously. Because we are “walking the wall” while negotiating tasks, this is also a great time to recognize blocked work and any of the “time thieves” Dominica Degrandis (@dominicad) describes in her book Making Work Visible.

At the end of this meeting, the team leader should have worked out with the team a balance between work being requested by other parts of the business and work proposed by the team itself. Some iterations, other teams’ requests will occupy more time; some iterations, the needs of the team will take priority. A successful manager navigates the balance between the two, ensuring that the needs of the business are being met while simultaneously allowing the team to reduce toil, perform chaos engineering experiments, collaborate with other teams, etc.

Mechanics

The iteration planning meeting should begin with an already prioritized Kanban board. The team can negotiate changes to the priorities during the meeting, but this is not the time for debating what it is the business values. Depending on the size of the team, and the amount of discussion needed on specific tasks, this meeting should take no more than one hour. Past that point, engineers will lose focus and interest and “just want to start getting things done”.

  1. Holdover discussions from previous iteration
  2. Explanation of the top priorities from the business
  3. Explanation of the top priorities from the team
  4. Identification of merged prioritization
  5. Identification of resources working or interested in tasks
  6. Assignment of work if no resource volunteering for necessary work item
  7. Parking lot

Daily Standup

Purpose

The daily standup has the same purpose as it does in Scrum: to keep the team in close alignment with respect to deliverables, and to identify any items that require assistance from other team members or management in order to keep the team operating at its highest velocity.

This plan only has standup during the middle of the iteration because the beginning and end already have time for team discussion with the iteration planning and iteration review meetings.

The daily standup should be restricted to the work at hand and not devolve into in-depth discussions on specific tasks which extend the meeting and hold the rest of the team hostage for the duration of the discussion.

Mechanics

  1. What did I do yesterday, what am I working on currently, and what are my blockers? (for each team member)
  2. Parking lot

If working with distributed teams, one might want to allow the standup to extend to a full 30 minutes so the team can socialize with one another and build some of the bonds you would otherwise get from colocation. In that case, the standup should still conclude as soon as the parking lot is complete.

Inter-team Sync

Purpose

If good intra-team communication is difficult to do well, then good inter-team communication is even more difficult. Couple this with dependencies between teams, and one can easily see the need to set aside an agenda specifically for this purpose.

The Inter-team Sync ensures close coordination and transparency between the SRE team and their primary customer. We don’t want to fill each iteration with sync meetings between the SRE team and every customer they may have, as that number may be very large, and frequent context switching is a major impediment to the delivery of work. But the team that works most closely with the SRE team should have a short meeting to discuss work in progress, dependencies, upcoming projects, etc. In this way, we attempt to ensure that both teams are working at the maximum safe velocity, and minimize misunderstandings along with the conflicting priorities and unknown dependencies time thieves (see Degrandis above).

The Inter-team Sync is how DevOps is done!

Mechanics

The Inter-team Sync meeting should be no longer than 30 minutes. Any discussions of deep architectural questions should be put on the agenda for the Architecture Meeting. There should be an agenda (we like Kanban boards for this purpose), created and maintained by the team leads or managers, that is widely available, to which anyone can contribute, and that tracks all the work items shared between the two teams, especially dependencies.

The person who runs the meeting simply “walks the wall” until there are no more items to sync upon and then ends the meeting.
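
As a minimal sketch of that walk, assuming a hypothetical export of the shared board (the fields and item names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class SharedWorkItem:
    """One card on the shared inter-team Kanban board."""
    title: str
    owner_team: str                      # e.g. "SRE" or the customer team
    status: str                          # e.g. "todo", "in progress", "done"
    depends_on: list[str] = field(default_factory=list)

def walk_the_wall(board: list[SharedWorkItem]) -> None:
    """Review every unfinished shared item, dependencies first."""
    open_items = [item for item in board if item.status != "done"]
    # Items with dependencies are discussed first; unknown dependencies
    # are exactly the time thief this meeting exists to catch.
    open_items.sort(key=lambda item: len(item.depends_on), reverse=True)
    for item in open_items:
        blockers = ", ".join(item.depends_on) or "none"
        print(f"{item.title} [{item.owner_team}, {item.status}] "
              f"depends on: {blockers}")

board = [
    SharedWorkItem("Provision staging cluster", "SRE", "in progress"),
    SharedWorkItem("Deploy payments service", "AppTeam", "todo",
                   depends_on=["Provision staging cluster"]),
]
walk_the_wall(board)
```

However the board is actually implemented, the important properties are the ones above: every shared item is visible, every dependency is named, and the meeting ends when the list is exhausted.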

Architecture “Arch” Meeting

Purpose

If the iteration planning meeting and the daily standup are not the place for in-depth discussion, then the weekly arch meeting is exactly the place for it. This is the forum for any deep technical discussion on the SRE team. It is also a forum where members of other teams can be invited, or attend regularly, to give guidance, ask questions, and provide clarification about the work with which the SRE team is tasked. In other words, DevOps!

The inputs to and outcomes of the arch meeting are often technical specifications, diagrams, documentation, requirements documents, and experiments. This can be a time for senior staff to give feedback on proposals from other members of the team, both senior and junior. This can be a time to solicit opinions from the group on a new or existing technology, or to review past postmortems. This can be a time to help figure out how to navigate toward a long-term goal. The opportunities are deliberately wide open, but the goal should be that by the end of each arch meeting, the entire team has taken a step toward achieving the goals of the team and of the business.

I’ve heard the phrase “let’s table this and add it to the agenda for the arch meeting” far too many times over the years to count. This is another opportunity for the team to ensure they are highly aligned as they move into the meat of the iteration.

Mechanics

The agenda for the architecture meeting tends to build itself over the course of the previous iterations. We always use a simple Kanban board or Google Doc to keep track of proposed topics. The person running the meeting can cover each topic in turn, the meeting can be run Lean Coffee style, or, if someone has an especially important topic for discussion, that topic can be moved to the beginning of the meeting or the end (to allow more time for open-ended discussion). It is really up to the attendees to determine what best suits the style of the teams involved.

Demos/Retros

Purpose

Students of Gene Kim’s Three Ways of DevOps know that the Second Way is all about feedback. For this to be successful, we need to set aside time in our week (in the form of a meeting) specifically to enable that feedback. The Demo/Retro meeting has two purposes:

  1. Have the team demo the work they accomplished (not necessarily completed) during the iteration.
  2. Have a retrospective to discuss how to improve the team in a psychologically safe environment (see re:Work).

The demo allows the team to get fast feedback on work they have completed or that is already in progress. There is a saying in Agile, “maximize the work not done”, which reminds us to spend our time on work that is critical to our success. If someone is delivering a project that does not meet our needs, we’d like to give them that feedback while they can still adjust course, not after all the work is complete. The bar for a demo is extremely low: unit tests, working demos, command-line tools, and even a single API call are all acceptable. The point isn’t to dazzle; the point is to demonstrate working code.

The retrospective (retro) gives the team the space to improve in kaizen fashion. We follow the traditional retrospective format (what went well, what did not go well, what could be better) with some modifications. The goal is for each team to be higher performing at the end of every year than they were at its start. By setting aside a safe space for the team to talk about how the iteration went and how the team can improve, we create an environment that fosters and encourages that improvement.

Mechanics

The demo part of the meeting should be open to all; any stakeholder who wishes to participate should be able to attend. Borrowing from a technique I developed with Greg Oehman at Salesforce, we always record the demos and post them somewhere afterwards (wiki, Google Drive, etc.) so that anyone who was not able to attend can still see them. This is critical if you have a globally distributed organization where time zones make attendance a challenge, because the feedback from those folks can be invaluable to making sure we deliver the right work on time. Again, we’re trying to create an environment that maximizes transparency.

In the retrospective part of the meeting, only the team members should participate (in Agile terms, only the pigs). There should be no executives or project managers attending this part of the meeting. It is strictly for those who need a psychologically safe space to have an open and honest conversation in order to move the team forward, or to discuss problems without any fear of retribution or interference. Team members fill out a shared document (or perhaps their own documents) with their thoughts about the iteration: what went well, what did not go well, what could be better. Then each member in turn has an opportunity to read their contribution and explain it in greater detail so that they know they have been heard. During this section of the meeting, clarifying questions can be asked, Arch meeting agenda items can be added, etc.

When holding the demo/retro at the end of a week, we like to have the team spend the rest of the day working on documentation, testing in staging, development, etc. Basically anything that does not touch production before a weekend.
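
Pulling the whole cadence together, here is a minimal sketch of one possible week, assuming a one-week iteration. The specific days are an assumption for illustration; the plan only fixes planning at the start, standups in the middle, and the demo/retro at the end:

```python
# Hypothetical one-week iteration calendar. Only the ordering is
# prescribed by the plan; the actual days are up to your teams.
ITERATION_CADENCE = {
    "Monday":    ["Iteration Planning (<= 60 min)"],
    "Tuesday":   ["Daily Standup", "Inter-team Sync (<= 30 min)"],
    "Wednesday": ["Daily Standup", "Architecture Meeting"],
    "Thursday":  ["Daily Standup"],
    "Friday":    ["Demo/Retro", "docs, staging, dev (nothing touching production)"],
}

for day, meetings in ITERATION_CADENCE.items():
    print(f"{day}: {'; '.join(meetings)}")
```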

Conclusion

Finding a cadence upon which to work as an engineer can be difficult. Because engineers are generally averse to meetings, we often wind up with sporadic meetings and a lot of people who are unclear on their priorities and goals. On the other side, we can find ourselves in environments that are extremely meeting heavy, with engineers left wondering when there will be time to actually do the work they believed they were hired to do. Establishing only the necessary meetings, at specifically defined times, allows engineers to plan their time to minimize context switching and to maximize the time invested in their meetings with one another.

This plan is certainly not a one-size-fits-all solution. It is deliberately broad and flexible so it can be modified to fit your organization, while being prescriptive enough about the purpose of each interaction to allow for different implementations that accomplish the same goals: transparency, collaboration, agility, and effectiveness.

I hope you are able to use it to advance the capabilities and success of your SRE teams.

Thanks

The thinking demonstrated in this plan has evolved and will continue to evolve over the years. There is no way it would have been possible without specific input from many trusted friends, and co-workers. I’m in debt to Evan Wiley (@absoludicrous), Peter Haggery, Peter Norton, the SFDC ISD team, Jeff Frasca, SWI Cloud SREs, Jathan McCullum, Mauricette Forzano, Eric Rapin, Stuart McCulla, Kelly Courier, and John Irwin.

Why I’m Leaving Tech for Healthcare

I’m leaving tech. Not leaving technology; I just want to leave technology for technology’s sake. I’ve spent much of my career working for “tech” companies and helping to advance the state of tech, or DevOps, or Operations in the industry. Now I’m looking for an engineering leadership role in healthcare.

Background

My first job out of college was as a research assistant/programmer for the National Institutes of Health in an Alzheimer’s disease research lab. Every day, no matter how bad it was, no matter whether my C code wouldn’t compile, or the network was down, or how much trouble I had finding “normal controls”, I had done something to help people. We talk a lot in our industry about how we are “changing the world”, which is an admirable goal, if that’s indeed what we’re doing. Often, however, it’s just changing the state of technology, which may or may not have a positive effect on the world. I want to go back to doing something that has a tangible benefit every day, not a theoretical one.

Things I’ve Learned

I’ve spent a few months now just trying to get my own research done and in order. I’ve read a lot, and I’ve talked to a lot of people at a lot of different companies. What I’ve learned in that short time breaks down into two categories:

  1. Established companies. These are companies that have been in the healthcare space for a while. As one person I talked to described it, large parts of some of these companies are 10 or more years behind in technology. They may be running some K8s, but they are just as likely to be running COBOL.

  2. Moonshot startups. These companies intend to change something massive about the industry. Just like any startup, some have more traction than others, but all see large opportunities in front of them. These tend to be smaller companies with far fewer legacy artifacts to contend with.

Based on my background, I have a small bias towards the established players.

Why You Should Hire Me

I’ve made a career out of advancing the state of operations and software delivery at companies large and small. If every company is a software company, then at least I would hope those skills would be broadly applicable.

I’ve led DevOps transformations for companies as big as Salesforce and have worked on improving dysfunctional culture at small startups.

One of the CTOs I’ve worked with and I were talking about what I was doing at his company and why. My answer was that they were allowing me to build an engineering organization in which I would want to work as an engineer: an organization that values transparency, diversity, work/life balance, Agility, empowerment, and talent. I feel the results spoke for themselves. We were able to sign an engineer who had interest from 18 companies and had narrowed his choices down to 3. Out of those 3, he chose us because we had exactly the culture described above, which was what he’d been looking for. Even months after he started, he still talked about how he’d made the right decision.

If that’s the kind of culture you want at your company, if you want to be able to attract top talent and have an organization that reaps the benefits described in the 2018 State of DevOps Report, and you’re in the healthcare space, then we should definitely have a conversation.

I’ve talked to some companies that were worried they had too many challenges for an engineering leader like me to want to work there. That is neither an exclusion criterion nor an acceptance criterion for me. I’m looking forward to the discussions.

Thank you.