Maintaining Mean-time-to-Joy: Managing a Global Incident at Netflix

J. Paul Reed

Senior Applied Resilience Engineer
J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful consulting firm, he now spends his days as a Senior Applied Resilience Engineer on Netflix's Critical Operations & Reliability Engineering (CORE) team, focusing on incident analysis, systemic risk identification and mitigation, applied Resilience Engineering, and human factors expressed in the streaming leader's various sociotechnical systems.

Tim Heckman

Senior Site Reliability Engineer
Tim is a Site Reliability Engineer at Netflix, working on the team responsible for the reliability of the Streaming Platform. Prior to becoming an SRE at Netflix, he worked at startups in roles focused on the operation, reliability, and security of their applications and infrastructure, as well as assuming the commander role in active security incidents. While he has largely relinquished security responsibilities in his current position, it is still an area he is deeply passionate about and a focus of the work he does. When not managing an incident or hacking on Go code, you are likely to find him up in San Francisco either vegging out with games on his PC, behind his mixer, or helping out in the Go community.

Holly Allen

Head of Reliability
Holly Allen is the head of reliability at Slack, with SRE, Monitoring, and Resilience Engineering in her portfolio. She is tireless in her efforts to make Slack the software reliable and scalable, and Slack the company a delightful place to work. Prior to Slack, Holly worked at startups and DreamWorks Animation, and was Director of Engineering at 18F, a civic tech startup in the US government.

Lex Neva

Site Reliability Engineer
Lex Neva is interested in all things related to running large, massively multiuser online services. He has years of Systems Engineering, tinkering, and troubleshooting experience and perhaps loves incident response more than he ought to. He’s previously worked for Linden Lab, DeviantArt, and Heroku and currently works as an SRE at Fastly helping to make sure the Internet keeps running.

Tony Ferrelli

VP of Technology Operations
Tony is a 25-year Internet industry veteran who has served in various Network Engineering and Operations leadership roles, including at Google and DoubleClick. Tony spearheads the management and operations of all Catchpoint monitoring data centers, supporting Catchpoint's expanding corporate strategy by delivering stable, secure, and reliable operations.

Maira Zarate

Site Reliability Engineer
Maira is an Application Engineer at Autodesk, based in Novi, Michigan. She is obsessed with learning, especially the learning process that accompanies onboarding monitoring concepts for better site/service performance and availability. She has dedicated the past several years to site reliability, working with different Synthetic and RUM monitoring tools.

J. Paul Reed:

Well, thank you so much, Peter, for that introduction. I really appreciate it. As you said, my name is Paul Reed and I'm from Netflix, here with Tim Heckman. It was funny, we were noticing that the summary of the talk wasn't on the website and that's probably actually because I didn't get it to you. But we're going to be talking today about what it was like at Netflix on the CORE team in a global pandemic and how we dealt with that situation.

Let's see if I can get this to work. What we're going to be talking about today, obviously we're going to follow a little bit of the standard storytelling model. We're going to talk about the setup, who we are, what we do; the confrontation, what we faced in those first few days; and then, of course, the denouement, the resolution. And then, maybe some thoughts for the future for folks as we all make our way through this COVID situation.

This is the standard "About me" slide. I'm not really going to say much about it other than I'm J. Paul Reed on Twitter. So, feel free to interact with me there. Oh, the one other thing I mentioned in my last panel, we talked about human factors and system safety. So, that's my jam and I talk a lot about that. And there's a picture from when I stopped shaving for quarantine, but I decided that was too much. So, I ended up shaving. Heckman?

Tim Heckman:

Yeah, I'm Tim Heckman. I'm a Site Reliability Engineer on the CORE team. You can follow me on Twitter @theckman. Unlike Paul, this is my picture from before I stopped shaving and getting a haircut. And so, we decided to [inaudible 00:02:54] those off of each other. But even though there's two of us, I think the one thing to call out is that there are a lot more folks than us who really supported us in what we did here and who made this successful.

What you're seeing are all the mugshots, as it were, of the CORE team. These folks are spread across multiple time zones and multiple offices, and really the success of this story is largely built on the foundation we've all put together ourselves. And there's also Dave. Dave's our manager. Dave usually doesn't look like this, but there are some cases where things do happen and that's the response you get. But all jokes aside, Dave's an awesome manager. I couldn't be more lucky to work under him, this scary picture aside.

J. Paul Reed:

Yeah, and by the way, of course, we have to put in the plug: we are hiring... Netflix is obviously hiring, but we actually have some roles open for hiring on the CORE team. So, check that out and you can ask both [inaudible 00:03:45] Slack if you're curious about that. All right, so let's talk a little bit about the setup.

I want to set the stage for what CORE does, because the CORE team, which actually stands for Critical Operations and Reliability Engineering, is a little different than maybe a lot of SRE teams or operations teams at various organizations. We do the standard stuff that you'd expect, with a focus on reliability and availability. We do incident management for Netflix, both for the streaming side of the house, and also we've started doing incident management for the content and studio space. So, those are the teams that have applications related to film production and things like that.

And then, also, quality of experience management. This used to actually say customer quality of experience. But again, it also relates to that... We've got productions and producers and actors and film crews using applications to get things done. And so, we're really trying to understand if they're experiencing any quality issues in terms of using those applications, and then how we can really help with that.

This is one of the places where we're a little different from a lot of operations or SRE teams. What CORE does not do: we don't manage other teams' services. We are not on call for other teams' services. So, we follow that "you build it, you run it" ethos, which is becoming more prevalent. We don't run runbooks. We are not a NOC. We don't run other teams' runbooks, and we actually don't really have runbooks of our own, per se.

Dave, I think that picture of him was from when somebody mentioned the term runbook, because he has thoughts and feels about runbooks. We don't page teams when their service breaks. This might be kind of confusing. What that actually means is that if a service has a degradation or a problem or something like that, we expect the teams to be monitoring it and setting up those alerts.

We will page teams if we notice that their service has an impact on some of the KPIs that we measure; there are two main ones for the streaming product, and then we have some for content and studio as well. But if their service doesn't impact those metrics, then we're not going to page them, we're not going to interact with them. We put that freedom and responsibility on them to really own their service in that way.

We're not going to file Jira tickets, even if we do an incident review. That's not the role that we play. We're not going to help you groom your backlog. This is the standard Netflix thing about context, not control. We set the context for teams of what their impact is on some of those KPIs and on that customer experience we want to maintain, and we allow them to do that work because they have the best locally rational knowledge.

And then, I think most importantly, we're not the accountability police. It's not our job to run around the organization with a clipboard and tick things off and make sure that people are doing the right things. Most people probably know this; that's not how the Netflix culture is designed to operate. And so, of course, even as a CORE team that is really responsible for keeping the service operational, we still don't act that way. All right.

Tim Heckman:

Jumping off from there, we mentioned that one of our responsibilities is incident management. And the word incident is actually a very heavily loaded term. What does an incident even mean? That can be different for each business. So, the way that we think about incidents internally at Netflix is that an incident is any unplanned disruption, degradation, or risk which impacts the user experience, the technical systems themselves, or the business.

And quite often, those require coordination; people have roles to play in that. Generally, we do this via Slack. Most of our incidents are managed via a single channel, and we tend not to have a voice bridge to go along with it. For hairier or more complicated incidents, that may change as the situation dictates. But generally, it's all done via Slack.
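(A minimal sketch, in Go, of how an incident record like the one Tim describes might be modeled. The type names, fields, and example values here are hypothetical illustrations, not Netflix's actual incident tooling.)

// Hypothetical sketch: one way to model the incident definition above.
package main

import (
	"fmt"
	"time"
)

// Impact captures the three kinds of impact named in the definition:
// user experience, technical systems, or the business.
type Impact string

const (
	UserExperience   Impact = "user-experience"
	TechnicalSystems Impact = "technical-systems"
	Business         Impact = "business"
)

// Incident is any unplanned disruption, degradation, or risk, coordinated
// in a single Slack channel with named roles.
type Incident struct {
	Summary      string
	Impacts      []Impact
	SlackChannel string
	Roles        map[string]string // role name -> person
	StartedAt    time.Time
	VoiceBridge  bool // usually false; only for hairier incidents
}

func main() {
	inc := Incident{
		Summary:      "Playback errors elevated in one region",
		Impacts:      []Impact{UserExperience},
		SlackChannel: "#incident-1234",
		Roles:        map[string]string{"incident commander": "on-call SRE"},
		StartedAt:    time.Now(),
	}
	fmt.Printf("%s coordinated in %s (voice bridge: %v)\n",
		inc.Summary, inc.SlackChannel, inc.VoiceBridge)
}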

J. Paul Reed:

Oops, wrong window. All right, so now that we have the stage set up a little bit, the confrontation. As you all know, one of the reasons we're doing the SRE From Home conference and from remote is this little monster, which is the coronavirus. And so, we want to talk a little bit about what actually happened in those first few days of March and into April as this unfolded across the world.

And of course, as you can see, it was one of the, I think, interesting things that took not only SRE teams and teams operating various types of services by surprise, but the world by surprise. It's striking how quickly we went from, "Hey, this seems like it might be scary," to "We're all locking down and sheltering in place." I think that headline about COVID-19 shutting down sports in 43 hours is interesting. There was that maybe 72-hour period in, I think, the third week of March where it got really serious.

So, what does that mean? A CORE managed incident in a pandemic, Tim?

Tim Heckman:

Yeah. And so, I think it may make sense to give some history of how we've seen incidents work at Netflix. For incidents that affect availability and reliability, the product working, we find that those tend to be more short-lived, usually a number of hours, sometimes days, depending on the issue. But it's very unusual for an incident, an impact, to run a week, two weeks, a month. And so, one thing our team is built around is really those short-lived incidents: doing communication and coordinating around those sorts of things.

For the longer-running ones, we have found that security incidents usually look a lot more like this. I won't go into it here, but there is a separate team at Netflix that handles security incidents solely. And they've built tooling and muscle around those longer incidents and the behaviors you want to have during them.

But during this, because we had to shift into that longer incident mode, we recognized early on that this wasn't going to be a two-week, three-week sort of issue. We needed to prepare to be in that state for months, maybe even through the end of the year. And so, we had to start thinking about which components of our incident process work well for long-term things, and which don't.

PART 1 OF 4 ENDS [00:10:04]

Tim Heckman:

How do we need to change how we respond and how we communicate to larger groups inside the company when providing context? So, one of the things we started looking at was that we probably needed a regular meeting for this sort of thing, right? It is an incident, but it's not something that's on fire. We're not going to have somebody in the Slack channel 24/7 reacting to things, because most of the time we're just going to be waiting, observing, and seeing how the system changes around us.

We set aside time to come in every day to meet, look at our metrics, look at our signals, and try to see what was changing. Were more people coming in and using Netflix? And was that going to be a scaling challenge for us? Were we going to run into cases where we exhausted network throughput, because certain networks might be oversubscribed relative to their full capacity? So those working group meetings really worked through those intricate details.

J. Paul Reed:

I was just going to mention, we tried to have a little bit of levity. As I remember, Tim, the name of that meeting, and it was at 10:00 in the morning, that was our first meeting, was "And now, your moment of COVID-19."

Tim Heckman:

Yeah. We're not morning people, so 10:00 AM is basically our 8:00 AM. And so we were being a little extra punchy, I think, definitely in that meeting invite. But along with that, I think the big thing we recognized is that this was going to have a substantial impact on the people at the company, right? While the systems may need to scale and we may need to make changes to meet a new global demand, that is much different than how the humans, how our peers, how the people we care about and work with were really going to be impacted by this. We knew that folks would be unavailable, they would have to take care of their children now, or other things would interrupt them throughout the day, which would be a big distraction.

Part of this is that we realized we needed to communicate with folks about what was going on and what the current state of things was, not only so they had the current context of what we were thinking, what we were suggesting, and where we were headed, but also to give them some confidence, to know that the system around them would be fine. They don't have to worry about their database falling over or their service needing to be manually scaled. They know that the teams that focus on that are looking at it and keeping tabs on it. And so, part of the weekly update really was just the warm and fuzzies, letting people know we're still on it, we're still looking at it, and we are still supporting them in this process.

J. Paul Reed:

I was one of the ones that kind of wrote the first draft of that. And I can tell you, when you're writing an email to about 5,000 engineers and trying to make it quick enough to read but also dense with useful information, that email took us what, Tim, at least a day, sometimes a day and a half to get all the verbiage right and make sure that it was highly valuable for the thousands of people reading it.

Tim Heckman:

Yeah, and I think that dovetails nicely into the coordination of the teams, right? One thing that's maybe different here than at other companies is that our engineering leadership did not step in to provide guidance or to direct this response. What we actually ended up seeing is that the engineering leaders looked to our team and said, "Hey, we're going to point to the CORE team and have that team do the coordination, do the communication, and provide all of the recommendations for engineering, because they've built that muscle out to do that regularly." And so, that added weight to those emails, right? The chief product officer and, effectively, the VPs of engineering were looking to us to do the communication and the context sharing for them. And so, that involved working with those leaders, with engineering managers, and even individual contributors across all of those teams to understand what was happening in their local context, what they were seeing and feeling, and what context would be most useful that we could provide to them and their peers.

J. Paul Reed:

One of the things, too, is that I think the role evolved. It's a model that I talk about a lot, [inaudible 00:13:44] if you're an incident commander or a captain, you would play this air traffic control role of seeing the system from a different view and then directing things. What was an interesting, not challenge, but change in the way we think about it was that we wanted to actually be more like a stoplight or informational signposts along a roadway for teams that were thinking about, as Tim mentioned, the database or whatever. And this is the point around interviews. We did a ton of interviews with lots of different teams just to gather: what are they worried about, what keeps them up at night, what are they working on, what did they have to lean into, what did they have to push off their plates?

A lot of that was distilling it down so that when somebody would come to us, come up to that stoplight, and say, "Hey, I'm really curious what data science is doing and the load on X, Y, Z," we could play that stoplight and say, "Okay, go talk to that team." Or those low-effort, easy-to-consume signs along the road of, "Hey, you may have heard that we were concerned about this particular thing, but now we're not, and we're feeling really good about that."

We did a lot of actually qualitative interviews with folks. It wasn't just looking at lots of graphs, lots of data. It was actually talking to people and getting a sense of really gut feels that actually led us to go get data. And in some places we found, oh, that's a major thing we need to actually go look at. And another place, it was like, "Oh, people are worried about that, but they don't need to be." We're in a good place as a system there. Anything else to add to this one, Tim?

Tim Heckman:

No, I think that pretty much covers it.

J. Paul Reed:

Okay. So there's an interesting question, right? We're in this once-in-a-century global pandemic, and the question is, "Do you stop deploying?" And there was a poll question around this, and we wanted to talk a little bit about that. I'll reveal the poll results, because I think they're interesting, after we go through this. But this was discussed on Twitter. There was a conversation about whether companies should put halts on deployments or not. And a lot of DevOps proponents and SREs are like, no, that's actually not what you want to do.

I actually think that's asking the wrong question. This comes from human factors, and it's this idea that we talk about ends of the sociotechnical system; this is how the academic literature refers to it. They talk about the blunt end of the system and the sharp end of the system. The sharp end of the system is where all the work is being done. It's where engineering teams are doing the work, very close to the work, very close to the systems. And the blunt end might be more like the leadership or managers that are actually distant in time and space from the work being done and the decisions being made.

What I think asking that question of, "Did you stop doing deployments," kind of misses is it doesn't take into account who's asking that question.

Tim Heckman:

Yeah, so jumping into that, I think the most important piece is understanding where the question is coming from and why they are asking it. What sort of context is behind it that we're not seeing? I think it's much different if a leader or a leadership person makes a suggestion, asks that question, "Should we stop deploying? Should we no longer be making these changes?", or gives that as an order...

J. Paul Reed:

Yeah, exactly.

Tim Heckman:

Versus the individual contributor asking that question. Because they're coming from different places, right? Where is the pressure coming from that makes them ask that question? For an individual contributor, it might be that they don't feel their system is safe enough, or they don't have confidence in their deploys to know that things will be okay. And from a leader, it could be a similar sort of origin, but it could also be other things. As Paul stated, because they're so distanced from the actual work being done, they may not have the right context to make such a broad statement or ask such a broad question about the system, because it is very nuanced.

J. Paul Reed:

I'll read the results real quick. It's interesting. About a quarter of folks said they were told to reduce or freeze deployments and they complied with the guidance. But 56% said there were no changes to deployment frequency. Tim, if you can walk us through real quick: different teams at Netflix made different decisions about the answer to that question, but it was those teams that made that decision, right?

Tim Heckman:

Yeah. I think it'd be good to give some of the context of how that discussion happened. I think we had some suggestions from the leadership when they handed this project off to us of what they desired. Right? And they wanted us to retain some amount of normalcy throughout this. They weren't asking for anything prescriptive, but just that there are risks when we do stop deploying and there are risks when we do deploy, and that we should try to find some middle ground where things aren't being stuck, but we're able to move in a way that is safe.

Our initial recommendation from the higher level was that we weren't suggesting that folks stop making deployments, right? We weren't suggesting anything, but we were asking them to be more careful, more mindful, more thoughtful about the changes they were making, because there are more people watching Netflix now, and there are more of your colleagues who are not available or not able to respond to issues. And so, not only could those changes impact more customers if something goes wrong, they could also impact the employees as well, because of the side effects of the coronavirus situation.

Tim Heckman:

What we ended up seeing, and it was kind of our desire, is that local groups would decide what was best for them. The client UI teams, I think, for a few weeks, did institute a code freeze. Right? They stopped things and they wanted to see how things shook out before they started making changes again. There were some other teams that took more hybrid approaches. They slowed down. There were others that were full steam ahead.

One interesting anecdote out of that is we had an incident because the system never expected to be paused. There was a deployment pipeline that had never been paused in its full lifetime. And one of its expectations was violated because it didn't have a new version number that day.

J. Paul Reed:

It was supposed to deploy every day, I think. Right?

Tim Heckman:

Exactly.

J. Paul Reed:

Something like that? Yeah.

Tim Heckman:

Yeah. And so, just because they hadn't bumped the version for the previous day, it caused the automation to freak out and actually caused an issue. So we do find that there are cases where code freezes themselves introduce other "changes in the system," because you're not exercising the things you did regularly.
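(A minimal sketch, in Go, of the kind of brittle assumption described in this anecdote: automation that expects a new version number every day, which a pause violates. The functions and data here are hypothetical, not the actual pipeline's code; the second function shows one possible hedge, reusing the last known-good release instead of failing.)

// Hypothetical sketch of a pipeline assumption that a freeze can violate.
package main

import (
	"fmt"
	"time"
)

type release struct {
	version string
	date    time.Time
}

// latestMustBeFromToday is the brittle version: it assumes a release exists
// for today and errors otherwise, which is what a paused pipeline trips on.
func latestMustBeFromToday(releases []release, now time.Time) error {
	last := releases[len(releases)-1]
	if last.date.Format("2006-01-02") != now.Format("2006-01-02") {
		return fmt.Errorf("no new version today; last was %s on %s",
			last.version, last.date.Format("2006-01-02"))
	}
	return nil
}

// latestOrLastKnownGood tolerates a paused pipeline by reusing the last
// known-good release instead of treating "no bump today" as a failure.
func latestOrLastKnownGood(releases []release) release {
	return releases[len(releases)-1]
}

func main() {
	releases := []release{{version: "2020.03.17", date: time.Date(2020, 3, 17, 0, 0, 0, 0, time.UTC)}}
	now := time.Date(2020, 3, 18, 0, 0, 0, 0, time.UTC) // a day into a freeze

	if err := latestMustBeFromToday(releases, now); err != nil {
		fmt.Println("brittle check fails during a freeze:", err)
	}
	fmt.Println("safer fallback uses:", latestOrLastKnownGood(releases).version)
}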

J. Paul Reed:

Tim mentioned this, and I love this: above the line, below the line. It's from Adaptive Capacity Labs, from the people we mentioned in the panel, John Allspaw and Richard Cook. There are three important components to the system that I just want to point out real quickly. There's obviously the people in the system, and then below that line there's the systems as they actually exist.

PART 2 OF 4 ENDS [00:20:04]

J. Paul Reed:

But the third really important component is our mental models of that particular system, so three components. What we found is that initially, as Tim mentioned, there was a lot of focus on the actual systems themselves. Can we handle this capacity? Are we scaled up to do the right things? The other thing we found, or that engineers found, is that because we were seeing this increased load, we were seeing new behaviors, and maybe not deploying, so we were seeing assumptions surface there. Engineers' mental models of the system got a lot of focus too, a lot of updating, and there was learning going on around not only things that had changed before the pandemic but things that had changed as a result of the pandemic. The main point being... yeah, go ahead.

Tim Heckman:

You finished [inaudible 00:20:52] at the end.

J. Paul Reed:

No, go ahead.

Tim Heckman:

Well, I was going to say, I think along with that, while folks were having to change the context around how the system was operating, there was also a side effect that there are a lot of great folks who have expertise in their areas, right? I don't know if we would've jumped to the internet traffic shaping story, but one of the examples I know of was that before anything was in the news about us changing the bit rates in Europe to help the internet be more stable over there, there was an engineer on the CDN team who, like a week before, just by chance on a Saturday, thought, "This might be a useful feature. We might want to be able to degrade the bit rates of the video streams to protect the internet around us."

So, it was kind of a hack-day sort of implementation we put together, trialed it out, and it seemed to work. In less than a week, the need actually came up: the internet [inaudible 00:21:39] was having issues because they were oversubscribed. We were able to quickly iterate on that code to make it more production-ready, deploy it, and make changes to protect those systems. But as Paul said, those were the initial things we looked at, right? That was our first look at this: what systemic things would fall over, where the scaling challenges were going to be.
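(A minimal sketch, in Go, of the bit-rate degradation idea described above: cap the bitrates offered in a region to ease pressure on oversubscribed networks. The region names, cap values, and selection logic are invented for illustration and do not reflect Netflix's real implementation.)

// Hypothetical sketch of a region-scoped bitrate cap.
package main

import "fmt"

// regionBitrateCapKbps is a per-region ceiling; regions not listed are uncapped.
var regionBitrateCapKbps = map[string]int{
	"EU": 7500, // temporarily capped to protect constrained networks (made-up value)
}

// selectBitrate picks the highest rung of the bitrate ladder at or below the
// client's requested bitrate and the region's cap, if any. It assumes the
// ladder is sorted ascending and the lowest rung is always allowed.
func selectBitrate(ladder []int, requested int, region string) int {
	limit := requested
	if ceiling, ok := regionBitrateCapKbps[region]; ok && ceiling < limit {
		limit = ceiling
	}
	best := ladder[0]
	for _, b := range ladder {
		if b <= limit && b > best {
			best = b
		}
	}
	return best
}

func main() {
	ladder := []int{1500, 3000, 5800, 8000, 16000}
	fmt.Println("uncapped region:", selectBitrate(ladder, 16000, "US"), "kbps")
	fmt.Println("capped region:  ", selectBitrate(ladder, 16000, "EU"), "kbps")
}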

J. Paul Reed:

Right. Right. So to Tim's point, though, over time what we realized was, again, that we needed to be focused on the people. So, we mentioned work schedules. This has actually been an ongoing thing that we've been looking at for the past four or five months. I actually wrote up an internal memo that got sent out to everyone around burnout and what burnout looks like. It was based on Christina Maslach's research at Berkeley, and how we can look for that in our colleagues. We'd been talking a lot about the emergent conversations that you might have at the office. How do we replace that? Can we replace that? What does that look like? How are our teams coping?

One of the big ones, actually, especially for platform operators, is: what does on-call look like? Is that still healthy? Because the heuristics around what makes a healthy on-call schedule have gone sideways and wonky. So, those are the things that we really looked at once we were clear that the technical systems were going to be okay and our mental models of those technical systems had been updated and were accurate. Then the focus really became, "All right. Let's all take a breather. How are we feeling as a team of people, as a group, as an organization of lots of people?"

All right. So, resolution: what did that look like? And maybe we'll talk a little bit about the sequel, or what the future looks like. When we started to look at exit criteria, this is something we realized: what does a months-long incident look like? In fact, what does a months-long incident where you have 5,000-plus responders involved in it look like? We quickly realized that, actually, if there's always an incident, there's never an incident; we couldn't keep this running for long periods of time.

One of the interesting heuristics is that we got to a point where there wasn't much to say in those weekly emails anymore. We were referencing things we had said before just as reminders, but there wasn't a lot of new information. So, that was a cue to us managing this incident: "Okay, we need to think about what it looks like to wind this down." So Tim, what are some of the exit criteria we looked at?

Tim Heckman:

Yeah. I mean, I think when we first started this, we didn't really know what to think about. There was a very open acknowledgement that we were not going to drop some exit criteria on the table at the beginning; we'd try to figure things out as we went, because it was an emerging situation. So, going back to the systemic risks that I mentioned, one of the common concerns we heard from teams was, would the system be able to withstand the increase in traffic? Would we hit some peak that exercised one of the components harder than it ever had been before and cause systemic problems that we wouldn't be able to mitigate, or wouldn't easily be able to mitigate, because it wasn't ready to go?

So, part of this was looking at which areas of the system we thought were the most vulnerable to being overloaded, or would provide the most value in alleviating pressure on other parts of the system. One thing that we've built into the streaming product at Netflix is the idea that the experience should degrade as much as possible before it completely stops working. So, if one of the backend systems starts to have trouble, that row in your UI will be gone for a little bit until it comes back, and we try to make it so those things aren't noticeable to you. So the question we had was: what things in the UI are the heaviest? What components can we turn off to still retain most of the user experience while protecting a baseline for the larger group?
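(A minimal sketch, in Go, of the graceful-degradation idea Tim describes: if the backend behind one row of the home screen fails, skip that row rather than failing the whole page. The names and structure here are hypothetical, not Netflix's actual UI or fallback code.)

// Hypothetical sketch of per-row degradation when a backend misbehaves.
package main

import (
	"errors"
	"fmt"
)

type row struct {
	title  string
	titles []string
}

// rowSource is one backend that supplies one row of the home screen.
type rowSource func() (row, error)

// buildHomePage asks each source for its row and simply skips any source
// that errors, so one unhealthy backend degrades the page instead of
// breaking it.
func buildHomePage(sources []rowSource) []row {
	var page []row
	for _, src := range sources {
		r, err := src()
		if err != nil {
			continue // degrade: omit this row until the backend recovers
		}
		page = append(page, r)
	}
	return page
}

func main() {
	healthy := func() (row, error) {
		return row{title: "Continue Watching", titles: []string{"Show A", "Show B"}}, nil
	}
	failing := func() (row, error) {
		return row{}, errors.New("recommendations backend timeout")
	}

	for _, r := range buildHomePage([]rowSource{healthy, failing}) {
		fmt.Println(r.title, r.titles)
	}
}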

J. Paul Reed:

Yeah. There were also some big batch jobs, I remember, where it was like, "Do we need to run all of these computation batch jobs right now?"

Tim Heckman:

Yep. The other was, as I'm sure many of you have seen, depending on the business you were in, your traffic shape was much different as we were going into this lockdown. If you were an entertainment service or anything that people use to burn their time, you probably saw that increase. You probably saw that go up. So, part of our exit criteria was: how do we know that we're getting back to some sort of normalcy, right? What does this growth look like? How long are we going to keep growing for? Is it just going to stop and fall off a cliff? We weren't sure. So, instead of trying to find a certain watermark to aim for or something like that, we tried to look at week-over-week growth. Are we still growing by the same amount, whatever it may be?

Tim Heckman:

Once that starts to slow down, that's an indication that the world around us maybe isn't changing as much, that we're not still transitioning to that new world as much, and that things are starting to stabilize. So, it took a bit of digging, I think, to figure out which metrics we wanted to use for that, whether it was the number of customers coming in or the amount of traffic. Eventually we landed on a few that we looked at to make sure that we weren't going to exhaust some systemic capacities and that the numbers weren't going to keep growing by a very large amount. I think ... Oh, go ahead.
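(A minimal sketch, in Go, of the week-over-week growth signal Tim describes, using made-up weekly totals: compare each week against the prior week rather than chasing an absolute watermark.)

// Hypothetical sketch of a week-over-week growth calculation.
package main

import "fmt"

// weekOverWeekGrowth returns the fractional growth of each week versus the
// previous one, e.g. 0.10 means 10% more than the prior week.
func weekOverWeekGrowth(weeklyTotals []float64) []float64 {
	growth := make([]float64, 0, len(weeklyTotals)-1)
	for i := 1; i < len(weeklyTotals); i++ {
		growth = append(growth, (weeklyTotals[i]-weeklyTotals[i-1])/weeklyTotals[i-1])
	}
	return growth
}

func main() {
	// Made-up weekly totals for some usage metric through a lockdown ramp.
	weeks := []float64{100, 112, 130, 148, 156, 159, 160}
	for i, g := range weekOverWeekGrowth(weeks) {
		fmt.Printf("week %d: %+.1f%% vs prior week\n", i+2, g*100)
	}
}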

J. Paul Reed:

No. Go ahead.

Tim Heckman:

I was going to go to the final one, which I think was a lot more challenging, right? Netflix, while very seasonal, has a good pattern to when people watch; we see things regularly week over week, day over day. We can build models on that. We can build assumptions into the system based on how traffic usually is. There are definitely current events that change that: the Super Bowl halftime show, the World Cup. Those sorts of things will change how people watch Netflix and what the volume is like, but we realized that as more people were going to be locked down and doing things at home, this new change in signal would probably be longer-lasting than any of those previous ones we'd seen before. World Cup matches are a couple of hours long.

That impacts you for that period and things generally come back to normal, but for this, we expected that people would behave and watch Netflix much differently than they had previously. So for our data models, whether for scaling, availability, or number calculations, which of those models no longer made sense or had poor assumptions built in that we were now violating based on the current metrics? So, we had to think about which of those models could be impacted, list them out, and engage with those teams so they were at least aware of what the side effects could be.

J. Paul Reed:

And do they need to rerun them?

Tim Heckman:

Yep.

J. Paul Reed:

So, one of the things ... the point we really wanted to make here is what we quickly realized in terms of what the endgame goal was. It wasn't actually to manage Netflix and the Netflix engineering ecosystem through COVID. That wasn't a thing that we could do, and it wasn't a possibility. What we were really focused on is that this was a time of tremendous change for the Netflix engineering ecosystem. Now, you might've heard the term adaptive capacity. Individuals, teams, and the organization were starting to draw on this adaptive capacity and trying to figure out how to adapt.

That's the transition we were trying to help them through: getting them to a place where summoning and using that adaptive capacity was easy and resulted in good outcomes for them. For some teams, the adaptive capacity they needed to summon might be very small, so that transition was a little easier for them. For other teams, it may have been huge, and they may actually have to cultivate a reserve of this new level of adaptive capacity through the rest of COVID that they can bring to bear on their problems. But our goal was to get all the teams through that transition so that they can then live in this sort of new normal, new world. Any last thoughts on that, Tim?

Tim Heckman:

Nope.

J. Paul Reed:

Okay. Cool. Well, on to the future. This is something that I've been thinking a lot about, and something the CORE team is thinking about, and I'm going to walk you through it. The one thing I'm going to do, I promise, is go through a little bit of an academic definition. It's going to be about two or three minutes, and I promise you there's an interesting insight at the end. So, bear with me; there's about, I don't know, five slides here. This is work by J. Bloom, and he's over in the Red Hat office of... I want to say office of DevOps, but that's not right... Global Transformation. That's right. That's what it is.

He talks a lot about how we perceive time. One of the things he brings up is this idea of the time span of discretion. The point here, and this is a great example of the blunt and sharp ends of systems, is that people doing the work are often thinking about the work in terms of sprints, so on the order of a couple of weeks. As you move towards the blunter end, you might have C-level executives that are looking at three-, five-, or seven-year business plans.

PART 3 OF 4 ENDS [00:30:04]

J. Paul Reed:

So, the thing is that the time span people are looking at can be very variable in a system, and again, it increases as you go from knowledge worker to director, VP, and the C-suite. We'll come back to this and why it's important. The other thing is that we often communicate by telling each other stories, and it's actually a way to reduce cognitive load. We tell each other stories about the future. One of the most relevant ones for an SRE is that an engineering team that builds a service might tell us a story about its reliability, or its observability, or its expected performance, or whatever it might be. They're telling us a story about what's actually going on that we can then rely on in incident management and a bunch of other contexts. And so, a component of that story is that when we look forward in time, there's a set of things that are probable: things that, based on the story we've told each other, are likely to happen. There are things that are plausible, and obviously that's a wider space to look at.

There are things that are preferable, and those are the things that we as a team may want, we as individuals may want, we as C-level executives may want. And then there are obviously things that are possible, and of course, interestingly, COVID is probably on the edge of that possible dot in this chart. It was not something that was really, I think, on the vast majority of people's radar, especially in an SRE context. So, the point is here... Did I duplicate this slide? I did. The point is that we walk through the world shining this flashlight based on where we are, the stories and knowledge that we have, and these different framings of what's possible, what's plausible, what's probable, and what do we actually want, what's preferable. Now, Tim.

Tim Heckman:

Yeah. So, 2020 turned our flashlight into a dumpster fire.

J. Paul Reed:

Yes, and it's funny, Tim and I were laughing about this. We did a team offsite the last week of January, and we came up with lots of good stories and plans that we told each other, and I believe you see those floating down this little river. So, the point that I want to make here is that as we went into COVID, the stories that we told each other, and I mean our colleagues, our kids, the stories that the news media tells us, the story that the stock market tells us with how it reacts, went from on the order of, depending on the relationship, weeks, months, years, to "I don't know that I'm going to be able to eat next week" and "I'm out of toilet paper and I don't know where to get more." So, everybody's sense of time, and the stories that were relevant to their lives, were compressed, extremely compressed, and that's very painful. It's cognitively painful. So, the point here that I want to make is that as we start to emerge from COVID, one of the things the CORE team is really worried about is that we need to actually reconstruct these stories.

The stories that linked these together to make a coherent narrative were broken by COVID. Now, that compression, as I said, was very cognitively painful. It's also actually painful when you start to think again, "Okay, I can start to think three months out, six months out, a year out." It's the same pain in terms of how our brains work. The point, though, is that because these aren't linked, that's a problem we'll need to solve as we come back together as organizations. We need to actually figure out what the stories are for ourselves. And then we need to share them with each other, so that the stories we tell about systems and about teams, whether or not they're underwater, whether they have the capacity, and the same thing with technological systems, become coherent again. That's actually something we need to work at, and we're seeing this a lot. This is why, as teams come back together, as things are starting to settle into a new normal, it feels weird. It feels weird because we've lost that shared narrative that we had.

It's almost like we're not reading the same book anymore, and we need to work on that. All right. So, season recap. What are some of the takeaways from managing a global incident at Netflix?

Tim Heckman:

I think this one was probably our first, or maybe, actually, the more surprising one. If any of you have read the Netflix culture deck, you'll know that we are big on freedom and responsibility. What that ultimately means is that you, as the individual, are free to make the choice that is best for you, your team, and Netflix, and to be responsible for any side effects or things that come out of that decision. So, if you decide to build a service with tons of tech debt, it's on your team to pay that down or be responsible for it. One side effect of that was, during these communications, as I shared earlier, we weren't super prescriptive about what teams should do. We left it up to them to decide what made sense in their local context, because we didn't have all of that nuance in our minds. We did find out that for certain groups, especially individuals or teams that might already have been feeling more pressure on their capacity, that freedom actually added more cognitive load and more of a challenge in the moment, because they were no longer sure whether they should deploy or whether it made sense, just because they weren't grounded in the world around them.

And so, not that I'm saying we would necessarily do anything different, but it's something to clarify and understand when you're communicating your decision and your thinking: give people an out, or give them something that makes it easier to make a decision. Something as simple as, "If you're feeling uncertain about it, maybe just pause your deploys until you feel more comfortable." That initially ran against how we were thinking about it under the guise of freedom and responsibility, which ideally would encourage the right behaviors: if you're not comfortable and you don't want to do it, then don't do it, but if you're feeling good about it and you're feeling confident, go forward.

J. Paul Reed:

Yeah, that was one of the bits of feedback that we got, in terms of a suggestion: some teams were actually so distracted with whatever was going on in their local context, whether it be family or other things, that they said, "It would have been helpful if you had given us a menu of three options," and spelled them out. Then they could have engaged with that and said, "That feels right. That doesn't feel right." Or taken one and said, "This feels right, but we want to tweak it in this way." When we left it kind of open-ended, they had to think about all this stuff and they might have to come and engage us, which takes time. So, that was, I think, a really interesting lesson for us as the CORE team.

So, for long-running incidents, and especially incidents where the blast effects are creeping, and I mean this in terms of the impacts on people's lives, on their personal life, on their ability to be present at work and do work that supports the functioning of the system, the primary signals actually become less important. This really speaks to that point that Tim made. When we started looking at what the exit criteria looked like, we didn't say, "Well, when this one metric gets to this one level and stays at this one level, that's what we'll look at." We took the derivative of some of those metrics and said, "When we see the metric not bouncing wildly around, or not on this really high trajectory, that's what we'll look at to see that the rate of change has stabilized." And so, again, the interesting thing here is that when the length of the incident becomes really long, those primary signals are still useful, still useful from a tactical perspective, but they become less useful in terms of, all right, what do we do from the perspective of strategic management of a long-running incident?
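(A minimal sketch, in Go, of the "rate of change has stabilized" exit criterion Paul describes: look at recent week-over-week growth rates and call the metric stable once the last few all sit within a small band. The window and tolerance values here are hypothetical.)

// Hypothetical sketch of a stabilization check on a metric's rate of change.
package main

import (
	"fmt"
	"math"
)

// hasStabilized reports whether the last `window` week-over-week growth
// rates all stayed within +/- tolerance (e.g. 0.02 for 2%).
func hasStabilized(growthRates []float64, window int, tolerance float64) bool {
	if len(growthRates) < window {
		return false
	}
	for _, g := range growthRates[len(growthRates)-window:] {
		if math.Abs(g) > tolerance {
			return false
		}
	}
	return true
}

func main() {
	// Made-up growth rates: a steep ramp that flattens out.
	rates := []float64{0.12, 0.16, 0.14, 0.05, 0.02, 0.01, 0.005}
	fmt.Println("stable for 3 weeks within 2%:", hasStabilized(rates, 3, 0.02))
}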

Anything to add there, Tim?

Tim Heckman:

Nope.

J. Paul Reed:

And then this is the thing I was saying: you and your teams are going to have to deliberately craft new stories as part of constructing this new normal. We've heard that term repeatedly, "The new normal, the new normal." It's like, what does that actually mean? Part of it is actually repairing this inner predictability and the narrative; we use that narrative to reduce our cognitive load as we move through the world, because there are assumptions we can make, and we can rely on the framing of the story to understand what's going on, and we don't have that right now. So, we actually need to do that deliberately, as an exercise, as part of repairing, in some sense, the trauma that COVID has wrought on all of our socio-technical systems.

Tim Heckman:

Yeah. I think, kind of adding to that real fast, there are some folks on our team, the CORE team, that to this day are not really working much during the week. They have so many things going on outside of work that we have to support them in that, and they're effectively working part time. I think it was in Liz's panel that Tony mentioned you kind of have to make time for you and your family, whether it's kids or your pets. You really have to focus on that, and that's the important thing. And so, a lot of what we've done as an organization, as a company, is recognize that every single person is going to have a new normal now. It's going to be very nuanced to their family situation and where they live. Some folks at Netflix have actually moved cross-country to be closer to family, to get childcare help. And so, there are a lot of things that we've seen change, where we were trying to support the individual and the family and make it really clear that that's their number one focus, and that while Netflix is their job and there's responsibility there, we're really here to support them and help them be successful. Forcing them, or piling work or whatever it may be onto them, isn't going to do any of that for them.

J. Paul Reed:

Yeah, 100%. All right. Well that's all we got. Stay safe, stay healthy, please wear a mask, and we'll SRE through all of this dumpster fire together.

PART 4 OF 4 ENDS [00:39:53]
