Amy has worked in web operations for 20 years at companies of every size, touching everything from kernel code to user interfaces. When she's not working she can usually be found around her home in San Jose, caring for her family, practicing piano, or running slowly in the sun.
Head of Reliability
Holly Allen is the head of reliability at Slack, with SRE, Monitoring, and Resilience Engineering in her portfolio. She is tireless in her efforts to make Slack the software reliable and scalable, and Slack the company a delightful place to work. Prior to Slack, Holly worked at startups and DreamWorks Animation, and was Director of Engineering at 18F, a civic tech startup in the US government.
Site Reliability Engineer
Lex Neva is interested in all things related to running large, massively multiuser online services. He has years of Systems Engineering, tinkering, and troubleshooting experience and perhaps loves incident response more than he ought to. He’s previously worked for Linden Lab, DeviantArt, and Heroku and currently works as an SRE at Fastly helping to make sure the Internet keeps running.
VP of Technology Operations
Tony is a 25-year Internet industry veteran who has served in various Network Engineering and Operation leadership roles, including Google and DoubleClick. Tony spearheads the management and operations of all Catchpoint monitoring data centers, supporting Catchpoint’s expanding corporate strategy, delivering stable, secure, and reliable operations.
Site Reliability Engineer
Maira is an Application Engineer at Autodesk, based in Novi, Michigan. She is obsessed with learning, but especially with the learning process that accompanies on-boarding monitoring concepts for better site/service Performance and Availability. She has dedicated her past years to site reliability, working with different Synthetic and RUM monitoring tools.
All right. Hi, I'm Amy Tobey. I'm not going to spend a lot of time on myself. This is really about all of you. I am a staff SRE at Blameless. I have a strong interest in topics like mental health, trauma, and recently burnout. The last few years, I've learned an awful lot, and I hope I can share a few things with y'all. First, the disclaimers. I am not a doctor or a mental health professional, nor have I played one on TV. What I'm going to do here is I'm going to pull in kind of a lot of what I've learned from reading and studying in the mental health space and in the SRE space, and in the resilience engineering space, to try to tie what worked for me in emerging from burnout with what we already know as SREs.
So content warnings, I'm going to talk about burnout. I'm going to talk about mental health, a little bit of depression and anxiety and things like that, and I'm going to bring up heart disease briefly. And if you need to step away because these topics are hard for you, I totally understand. Take care of yourself first. With all of the things that I might suggest as I talk here, the main thing to keep in mind is that your mileage is going to vary. What works for me isn't going to work for you, and what works for you may not work for me. And only by sharing with more of our peers what we're going through and working with each other can we discover the things that work best for each of us.
So the first kind of tip thing I'd like to bring up is I want to say thank you in advance, or maybe in retrospect. Getting deep into burnout and emerging from it is a painful experience in that we fail the people that we don't want to fail sometimes, and we fail ourselves. And there's that tip that goes around that says, "Stop saying, 'I'm sorry' and try to say 'thank you' instead, and find gratitude." So this is my little attempt, and I hope you'll try this because it has proved useful to me to show gratitude. So to the people who have stuck with me through all this, thank you. I'm really glad we still get to have a relationship.
So let's start with the serious part. A few weeks ago, a dear friend of mine, an SRE, messaged me and said, "Hey, I just want you to know I'm in the hospital." I asked, "What's going on? Is it COVID?" And no, he had a heart problem, and they didn't know what it was for a long time. It took weeks to figure it out. And when they did finally figure out... And of course, I'm worried about my friend being in the hospital. He's in there during COVID, and he can't even have his wife with him because of the restrictions at hospitals. And then to go through this... And the first thing I thought of when he said this to me, the first thing I thought of was burnout. And I'll drop the link in the chat a little later, but there is evidence that deep and extended burnout can start to cause cardiac arrhythmias and other problems, even cardiac arrest.
And so I can't draw that line of causality from burnout to this person's heart problems, but it was immediately what struck me, because I care about this person deeply. And I've watched them over the years as one of the most technically skilled SREs I have ever worked with, bar none, who would start on a project, pour everything he had into it, make incredible progress, pull off technical feats that I can't even dream of sometimes. And then, because he cared so much and because he invested so deeply, and because we live in this society that takes more than it gives, as our companies often do, he'd burn out and then the cycle would start over again. And then it'd start over again, and start over again. And it kept going. And we had a discussion just recently where I asked permission before I talked about this.
And it was a sobering moment to talk about it with someone who is in this condition, where they know they now have permanent hardware installed in their heart that could be the result of doing this too much. So I wanted to start with this to impress on everyone that it might feel like it just kind of sucks right now, but the research shows that extended burnout can impact our bodies in ways that are deadly. So Jamie already talked a little bit this morning about what burnout is, and so I've got a few words up here, phrases from things I found. And another word for burnout is cognitive injury. There are a couple other names, but this one I like. It tends to be more about prolonged work stress, or I sometimes think of it as slow trauma, where we are so very often immersed in stress that we have no control over.
And it goes on for a long time, and it starts to have that compounding impact on ourselves, leading to symptoms of depression, hallmark symptoms of depression. Interestingly, one that I just found recently, and that makes complete sense in retrospect, is executive function impairment. This is what ADHD folks talk about all day long, that our executive function is impaired relative to the normal curve. And so it's harder to start on things. It's harder to do stuff. So when you're burned out and you sit there and you're like, "Why can't I work on this stupid report? It'll take me 10 minutes to do it, and then I will get accolades from my boss." That's executive function impairment. And another thing I really want to drive home is it's not about working too much, although it is often correlated. It's not about how much we work. It's about whether we're working in alignment with ourselves.
So to give you a more formal definition, I'm just going to read one real quick. "Vital exhaustion, commonly referred to as burnout syndrome, is typically caused by prolonged and profound stress at work or home. It differs from depression, which is characterized by low mood, guilt, and poor self-esteem. The results of our study further establish the harm that can be caused in people who suffer from exhaustion that goes unchecked." And this is from one of the papers that Jamie mentioned earlier, but this one was Dr. Parveen K. Garg of the University of Southern California. So I'm not going to spend a lot more time on defining things. I think most of us are already pretty much on the same page here. So let's talk about emerging, right? I wrote so many versions of this trying to figure out how to talk about this, and then I realized, I just need to come back to my sweet spot and yours, which is, we are already people who deal with complex systems.
We already deal with things like cascading problems, like that time that the storage system slowed down, so applications backed up and then the queues backed up. And then when the storage started working again, the queues went full bore and took the network down. And when the network went down, ZooKeeper died and then everything fell apart. That's how burnout feels to me. The time that I got fired from Netflix, I was so deep into burnout. There were other things going on, but when I think about all of the things that got me to that point, where I lost a position that was probably the pinnacle of my career in a lot of ways, there were so many compounding things happening in my life, not so much at work.
There were some things going on there, but it's often the weight of everything. So as SREs, we are good at thinking about this. We do it every day. So as we move forward, I'm going to take this from a perspective of what we already are good at, and use that to try to find ways that we can be more patient with ourselves and to focus on change over months and years, [inaudible 00:08:29] hung up on what works today.
So let's start with some theory, some SRE theory. We often say that there is no such thing as human error. I hope everybody agrees on this. But where I think we stop short sometimes, and I've heard peers and friends do this, is we still blame ourselves for things. We still look inside and beat ourselves up when we make mistakes, when we fail to do the things that we promise. But we're living in a pandemic, and we have to think about that greater environment that we exist in and what drives our behavior sometimes, and what makes it almost impossible to be working at our full cognitive capacity, because it's impossible right now for most of us. And so when we think about human error in our organizations, when somebody pushed the wrong button and took the system down, instead of saying, "Bob, why'd you push that button," we say, "Why was the button there for Bob to push?" And we have to do that for ourselves too. And when we do it for ourselves, we get better at doing it for others.
Root cause is also not a thing. And these things are going to feel like they're layering, because this is all familiar territory. We live in a complex socio-technical system. Take any one thing that we're experiencing, like that day that you can't get out of bed because it's too much, and then it snowballs. It's not just that I can't get out of bed because I don't have enough will. It's because my will has been sapped by all the things going on around me. It's because maybe I haven't been getting enough sleep for the last few weeks. It's maybe that I'm afraid I'm going to lose my healthcare if my company shuts down. All of these things contribute to the emotional states that we're in. And I believe that as professionals who think a lot about these topics of contributing factors, we can turn that inside to help ourselves, to better contextualize the things that are happening around us and not blame ourselves so much.
As I mentioned, contributing factors... I mean, just these three alone are enough to keep me up late at night and to disturb my sleep patterns, and to make me worry about actions I might have to take in the next 6 to 10 months as COVID rages and as the political environment in various nations around the world heads closer to fascism. So I mentioned a little bit, as I was talking about those, is cognitive capacity. I could explain cognitive capacity, but I actually prefer to explain it in terms of spoon theory, because where we are today is probably... One of the themes that I really love... I guess it's kind of weird to love it, but I think a lot of people are learning through COVID is about ableism. And most of us who are experiencing burnout learn this in a very poignant way, in that we are obviously now not able to just do whatever it is our goals are in life without extra work.
Everything becomes a little bit harder. So spoon theory, in a very short version of it, is if I wake up in the morning, and let's say I'm 20 years old again and I go back to that time in my life when I could stay up until two o'clock in the morning, and I could get out of bed at seven, roll out of bed, stumble into the bathroom, brush my teeth, do my morning routine, walk out the front door. If I had a fistful of spoons, by the time I got out the door, I might've spent one, because I was an able-bodied person. I'm a little older now. It's a little harder. I'd probably spend two spoons, because I'm still relatively able-bodied. If I need an assistive device, so if I need a cane or a walker or a wheelchair, obviously now that's an additional burden that I carry that the normal curve doesn't.
So I spend two more spoons. So I spend four spoons before I even get out of my front door. And so that's kind of how spoon theory works, very much in a nutshell. And a lot of the mental process of thinking about it transfers very nicely onto cognitive capacity. Most of us as knowledge workers understand that feeling when you get to like two, three o'clock in the afternoon, and you've been at it real hard all morning and through lunch, and you just hit a wall and you're just like, "Oh, I cannot think anymore," and things come slower to you. It's harder to solve bugs. And that's because you're running low on cognitive capacity. And there are really cool ways to do short term replenishment.
I've been adding more stuff to my schedule, especially working from home. I've been doing it for a while, but I've been doing it more recently. Things like taking naps, going out and sitting in the sun, meditating or coffee breaks. All those little things that can top off our cognitive capacity a little bit. And then there's our most important one. If there's one practice I want to stress, and it often isn't in our control, but try, it's to put our sleep on a schedule. Sleep is our highest priority as cognitive workers, because it is the primary way our bodies recover cognitive capacity.
Hindsight biases. I love this, because very often when we are examining the events of our lives and our performance over time, we're looking backwards, and our biases influence how we interpret those events. And when we're looking at an incident as SREs, very often what we will do is we will look back at the incident with our biases, and then we have a rationalization process where we say, "Okay, so I'm looking at this through my biases. And I go and find other perspectives and other evidence to expand my view, and to find ways to expose my biases and to get closer to the truth." So we do this in our organizations in our everyday work, but we don't often do it for ourselves, and I've heard people do it.
I have this thing I say to people when they're saying, "Oh, I'm ugly," or, "Oh, I'm not that smart," which is: if you were talking about another one of my friends like that, I'd be very upset with you right now. And you are talking about one of my friends like that. So we need to shift how we look at and consider our biases, especially when we're burned out and depressed and have anxiety, because they add bias to how we interpret the past. And so if we go in knowing that we have this bias, we're better armed to be able to look around the bias and see the truth, and see the light a little bit more frequently. And sometimes that's all we're looking for, is that little sliver.
On to observability. This one's a little bit weaker, a little bit more of a stretch, but observability says, "How can we know what is actually happening in this incredibly complex system?" And often what we have is our feelings, and we also have some physical things, like if we have shaky... I'm always kind of shaky. If I have caffeine, I'm even more so. It's a good way for me to judge if my leg is bouncing at 150 BPM as opposed to 100 BPM as it normally does, it tells me that there's something going on with me that I need to think about. And so, if we want to think about how we use observability in our work to observe and see how different events are connected in our huge complex socio-technical systems, we can also turn that skill inward and think about things at a higher level and say, "Okay, what clues are going to tell me what is going on with me?"
SLOs. I got two points from this one. One, the first thing that we say when we talk about SLOs is perfection is impossible, and what most of the research shows is one of the major contributors to burnout is perfectionism. And so already as professionals, we say it all the time to people. We say, "Perfection is impossible." 99.95% is a more reasonable target than five nines, because reaching that five nines has a personal cost to our bodies and to our minds, and to the people around us. And then there's the law of stretched systems, right? When a system is stretched, something's got to give. And if we stretch ourselves too thin, the things that give might be our heart. It could be our minds. It could be all kinds of things. But we have to think of ourselves as stretched systems in this time, because that's how we can start to find what we can give up. What can we drop? What can we delay?
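To put numbers on the gap between those two targets, here is a minimal sketch of the arithmetic (it is not something shown in the talk). The function name `downtime_budget_minutes` and the 365-day year are assumptions made for illustration.

```python
# Illustrative only: allowed downtime per year for a given availability target.
# Assumes a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(target_percent: float) -> float:
    """Minutes per year a service may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - target_percent / 100)


if __name__ == "__main__":
    # 99.95% versus "five nines" (99.999%)
    for target in (99.95, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.1f} minutes/year")
```

Under these assumptions, 99.95% leaves roughly 263 minutes (about 4.4 hours) of downtime a year, while five nines leaves a little over 5 minutes, which is why chasing the last few nines costs so much.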
Chaos engineering. I haven't met an SRE in a long time who doesn't think chaos engineering is super cool. And you know, there's practical applications to chaos engineering. I think of chaos a lot when I do yoga in my backyard. It's one of the exercises I do to help keep myself well. I think about how I am exercising parts of my body that, if I just sit here in my office on my ball, my yoga ball, or stand here, I'm not using. And I have places where maybe muscles are hardening and tendons are tightening, so yoga is a great way for me to kind of exercise this full system.
And again, cardio exercise is a use it or lose it thing. And probably I think the best analogy with chaos engineering is time off. When we take time off, we discover where we were crucial to the business in a way where we are a single point of failure. And so if you need to, you can argue this. Your boss says, "Hey, I want to do chaos engineering, but we're not ready yet." Say, "Okay, let's have people take more vacations. Let's find out who's critical and who is in the middle of an important process where it might cause us a vulnerability."
And finally, it's not necessarily an SRE concept, but if I had a magic wand to change the community of SRE in one way, I would like to put kindness as the number one value of an SRE. Because all of these things that we just talked about, these things from resilience and these things from reliability in our professional lives, come down to a form of kindness that I believe in, which is not a nice kindness. I'm not a nice person, I'm a kind person. That means if somebody comes at me with intolerance or untruth, I have a responsibility to say something to reject it. Being anti-racist is kind to your communities, to yourself.
And if I have one tip, one thing that I've worked really hard on over the last few years, and really I have to thank my therapist for putting me on this path, it's being kind to yourself when you fail, when you miss a deadline, when you're just feeling awful. Instead of that pounding on ourselves that our society has taught us to do, turn that eye of kindness on yourself. And you can still be critical. We have to be critical of ourselves. I brought up anti-racism. That's about being self-critical, but you can still be kind and understand how you got there, and what things you grew up around, and what societies you live in, where you absorb the stuff through your skin, because it's everywhere.
So we have to look with kindness to be able to do the hard work. And when we do the work in incidents without an incident command, or fixing Kubernetes clusters because the developers are locked up, it's still an act of kindness to do it to the best of our ability. So with that, our time is up. I hope that this is useful to some of you. I will be in the Q&A channel for quite a while, and everybody please be kind to others and be kind to yourself. Thank you.