Panel Q&A: Ask an SRE

Follow along as SRE leaders from Slack, Fastly, Catchpoint, and Autodesk discuss defining SRE, avoiding pitfalls, managing incident anxiety, informing the C-Suite, information silo-ing, managing 'at-home' distractions, and starting a career in SRE.

Liz Fong-Jones, Moderator

Principal Developer Advocate
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

Holly Allen

Head of Reliability
Holly Allen is the head of reliability at Slack, with SRE, Monitoring, and Resilience Engineering in her portfolio. She is tireless in her efforts to make Slack the software reliable and scalable, and Slack the company a delightful place to work. Prior to Slack Holly worked at startups, DreamWorks Animation, and was Director of Engineering at 18F, a civic tech startup in the US government.

Lex Neva

Site Reliability Engineer
Lex Neva is interested in all things related to running large, massively multiuser online services. He has years of Systems Engineering, tinkering, and troubleshooting experience and perhaps loves incident response more than he ought to. He’s previously worked for Linden Lab, DeviantArt, and Heroku and currently works as an SRE at Fastly helping to make sure the Internet keeps running.

Tony Ferrelli

VP of Technology Operations
Tony is a 25-year Internet industry veteran who has served in various Network Engineering and Operation leadership roles, including Google and DoubleClick. Tony spearheads the management and operations of all Catchpoint monitoring data centers, supporting Catchpoint’s expanding corporate strategy, delivering stable, secure, and reliable operations.

Maira Zarate

Site Reliability Engineer
Maira is an Application Engineer at Autodesk, based in Novi, Michigan. She is obsessed with learning, but especially with the learning process that accompanies on-boarding monitoring concepts for better site/service Performance and Availability. She has dedicated her past years to site reliability, working with different Synthetic and RUM monitoring tools.

Liz Fong-Jones:

Hello, welcome everyone. Good afternoon to everyone, especially to the folks on the West coast, for whom it's about to actually be noon. So, I figured that we'd dive in with quickly discussing how is SRE defined at your organization? What does SRE mean to you in a few sentences? Let's start with Lex.

Lex Neva:

Okay. Hey folks. First of all, thank you so much, Peter. That was so kind, and more importantly, I want to turn the thank you back to the community for all the writing, because I would be nothing without that. So thank you.

How is SRE defined at Fastly? Still evolving. Definitely it's at least partly an Ops role. We also have a team of incident analysts, and all they do is analyze incidents. I think of them as kind of SRE, although that's not in their title, but yeah.

Liz Fong-Jones:

How about you, Holly?

Holly Allen:

Like Lex says, it's always evolving. At Slack, SREs aren't operators, they're DevOps generalists who partner with product teams to solve problems. So they might directly contribute to some software, product software systems, if that's really high leverage. They might do something like redesign our pre-prod strategy or write tools, for example, to set alerts automatically. So it's all about partnering and trying to solve those systemic problems.

Liz Fong-Jones:

And how about you, Tony?

Tony Ferrelli:

So I've been here for a month, so I'm just learning the way Catchpoint has operated over the years. I've kept close to this with Mehdi, so I know quite a lot about what happens. I think it's really ... I mean, it's the same story, right? It's evolving. I think it's a continuing story, so how do we get better partnering with our dev and product folks on building operating principles into the application itself? How do we make it stronger? How do we make it more reliable? How do we make performance better overall that we can feed back to the community that uses our platform? I think it's a work in progress. I think we have a long way to go, and I'm excited to take that journey. It's been a lot of fun so far.

Liz Fong-Jones:

And how about you, Maira?

Maira Zarate:

Hey. Yeah, so here at Autodesk, it's actually a new formal position that was posted up within our application teams. Since we have so many applications and services to take care of, the monitoring team was just too small, so what was started newly this quarter was that SRE position versus having the monitoring team do all that work.

So it's something, like everyone else mentioned, that's evolving. It's something really exciting to have incorporated, especially because, like I said, it's very hard for a small team to tackle down all the issues and work with all the teams that are responsible for their certain areas and their applications or services, so definitely here the same as all. We're evolving as well.

Liz Fong-Jones:

And in my home organization at Honeycomb, we actually have no one with the SRE job title, but we have definitely internalized a lot of the SRE processes and methodologies. So every engineer at Honeycomb is on call, there is no distinction, but there is a platform engineering team that handles a lot of the infrastructure and scaling needs, but we all practice a lot of the SRE things day-to-day.

So I am, I think, the only person who formally had the SRE title in the past, but we definitely have a lot of folks who, for instance, used to work at the legendary Linden Lab as systems engineers, so there was that rich culture and history that evolved in parallel with the Google SRE discipline.

Liz Fong-Jones:

So this is super exciting to hear. So it sounds like there are some organizations represented among us where SREs have a consulting role, other organizations where SREs are working hands on, and yet other organizations where SREs are [inaudible 00:06:25] practitioners, and then kind of a mix of all these three things. So as you can see, there's no one right way to do SRE. There's a myriad of different ways, but what would you say unifies all of us in our philosophy of SRE? Right? What's the one thing that we can agree upon that SRE strives to accomplish regardless of how exactly we accomplish it? I'm going to open the floor.

Lex Neva:

I was going to say the user experience, that's what I have written down in my notes, but I'm going to change my answer to kindness because Amy Tobey, that was amazing and you are so right. It's all about people. SRE may be about tech stuff, but the tech stuff's about people, so it's kindness.

Holly Allen:

Yeah, for sure. The thing I had written down is that empathy is super key to succeeding as SRE at Slack because you're bridging all those boundaries between different teams and different ideas. It's all about people in the end.

Maira Zarate:

Yeah. Along with that, I would say of course our focus is the user and availability performance of our application and services along with the reliability, but along with that, a lot of our job is communication with our teams, understanding each other, having that bond between any of the development workers, SREs, monitoring teams, so a lot with that.

Tony Ferrelli:

Yeah. I think communication is probably one of the most important pieces because we're not just working with one team, we're pretty much working with all the teams that are creating systems that run on the platform. So being really clear, having good communication between the different groups to make sure that what we build we can support and we can keep it reliable and keep performance good. That's what we do. That's at the heart of what we do.

Liz Fong-Jones:

Excellent. And this is a really great segue into the next question, which actually is an audience question. The audience question is what's the one thing that an SRE always puts first? So Lex, I'm going to call on you first.

Lex Neva:

See, that's what I thought we were answering just now. So, the user experience is all that matters, ultimately. Charity Majors' legendary quote, "Nines don't matter if users aren't happy." That's it.

Liz Fong-Jones:

Yeah, and I think that's definitely what we saw with Twitter's security incident last week: the thing we're prioritizing for is not necessarily even reliability, right? It's meeting users' expectations. And personally, if I'd been in the shoes of a Twitter SRE, I would have pulled the plug on the servers if there was a risk that they were leaking data inappropriately. Turns out there are things that are more important than fetishizing availability, so I agree. We are the people who look after the users' interests and yes, sometimes that involves reliability, but often it extends to other cross-cutting domains as well. What are other people's thoughts? Tony? Maira?

Tony Ferrelli:

Yeah, we're sort of the guardians of what the application is doing for our users. If we're putting something out there, we want to make sure that users are happy. Now, there's design stuff and everything that goes into that, but number one for us is it's got to be up, it's got to be reliable and it's got to do the job that we meant it to do. If it doesn't meet the user expectations, like Lex said, then it defeats the purpose of what we're doing, so just meeting that is super important.

Maira Zarate:

Yeah, coverage would be a big part, so making sure you have everything covered to be able to identify any user from different regions, so having the coverage everywhere, not setting yourself up for a blind spot would probably be what we'd say we'd probably put first.

Liz Fong-Jones:

I really like that because that also hearkens to what Amy just talked about, right? Maybe a thing that we should always put first is our own health and safety even before the reliability of the service. I think that that's another important thing to think about too now that we've brought that subject up. Thanks. That's really, really wonderful.

And that again, perfect segue. It's almost like we planned this. We have another audience question, which is what's the best way to handle anxiety when something breaks? Holly?

Holly Allen:

Yeah, so of course we've had a number of really great talks today about this topic, so I won't go too deep into it. Don't blame yourself in the middle of the incident or afterwards. Don't blame yourself. These things always happen because of systemic problems. I would say after an incident, after it's over and you can debrief with yourself, ask yourself does your organization learn from its incidents? Does it feel like a safe place to fail? Even though it's not your fault, is it a safe place for systems to fail? Those are baseline. If those things aren't true, if you're not learning and it's not safe, then that anxiety when something breaks is real.

If your place of work is safe and the organization does learn, then try to get in touch with how that anxiety feels in your body, because you have to maintain a sense of urgency in an incident, but you have to go slow enough that you're in control of your mental reactions, and your physical reactions are part of that. So, there's a wealth of resources already presented by the great speakers in advance, but asking those questions about the environment you're working in is important.

Maira Zarate:

Yeah. So on my end, I would say the way to handle anxiety when things hit the rocks is to take a shot and focus only on the solution versus anything else. Who did what, don't focus on that. What was done. Understand the issue of course, but just have tunnel vision to the solution. And then after that, of course, things will get fixed and then you can worry about anything else you have to do. But just have that primary focus for your solution.

Liz Fong-Jones:

Does that spark any thoughts for anyone else?

Tony Ferrelli:

Yeah, the only thing I would add is be methodical. Don't go off in 50 different directions at once. Make sure somebody is guiding, somebody steps up and takes control and helps guide what solution we're going to go after. What things are we going to try to fix first. And that helps calm everybody down.

Liz Fong-Jones:

Yeah, I definitely feel that if you have a good incident management practice, that makes a lot of that stress go away, right? If it's something that's rehearsed, if you know what role you're supposed to play, right? And also if you've practiced and rehearsed this in advance. That, again, lowers the stress level when it happens for real.

Tony Ferrelli:

Yeah. I've always been a big fan of test your DR, but test it for real. Go through real life scenarios, right? Cause things to happen in a planned event and then you have the team practice that methodology of what you're going to go through and how you're going to go solve that problem.

Liz Fong-Jones:

Great. And Lex?

Lex Neva:

And one quick thing. We had talked about blamelessness. I think it was either Jamie or Amy or both that said this: I often remember to be blameless with other people, but I forget for myself. And I blame myself even right in the middle of an incident, and we've got to remember to extend it to ourselves.

Liz Fong-Jones:

Cool. Spectacular. These are hopefully good suggestions for people who feel anxiety around incidents. So, moving along, let's talk a little bit about how you adopt the SRE model. So the question specifically from an audience member, which I think was pretty representative of a lot of these questions, was how do you evolve an organization that's a small, maybe 10 to 30 people, keep-the-lights-on systems administration or operations team? How do you transition that to SRE when the skill set is so different, when the mindset is so different, when the pressures on the organization are so different? What has that kind of transition looked like for you in the past? Tony?

Tony Ferrelli:

Yeah. So I'm actually jealous of the 20 to 30. We have about 14 people on the operation side at Catchpoint. So we're small, it's been pretty incident operational focused. So, it's really about start small. You don't have to boil the ocean. Don't try to tackle too many problems at the same time. Focus. Find something. Use your incidents to frame what's a good problem to go after? What's a good scope problem that you know is going to have impact?

Start small, understand it's an investment, which means you got to give people time. So, it's really great if you can seed the team with somebody that already has this experience. That's super great. That doesn't always happen. But finding somebody that's super passionate could be good also.

And then you just got to carve out time for the rest of the team to slowly come up to speed, which means you've got to talk about it a lot. Build a vision and then build a strategy and a plan on how you're going to get there and then don't give up because this is not a three month thing. And a lot of people forget. If they can't solve it in two weeks, then it falls by the wayside. You got to know this is an investment you're going to make over time and it's not going to take three months. It's going to take a year. It's going to take two years. It's going to take three years. It's going to be ever evolving, like we talked about in the beginning. You're never going to get to that end state. But what you'll do is you'll slowly make things better and better and better and more efficient.

Liz Fong-Jones:

What are your thoughts, Lex?

Lex Neva:

So, the original question as written by the person that wrote it in compared the traditional keep the lights on operations team versus SRE in saying that the skill sets are so different. How do we get from one to the other? And I would say, they're not that different. So, your traditional keep the lights on are going to be laser focused on the user experience, which is pretty important.

They're going to have a pretty strong incident response. Lean into those skills. They're going to probably be a pretty good hand with shell scripts or Python for all the tooling that they've written. Lean further into that, learn software engineering kinds of skills. So, just expand on those. That's where I started, at Linden Lab that you mentioned. We weren't SREs there. And I learned those skills there and I extended that to this degree.

Liz Fong-Jones:

Yeah, in my experience, it's not necessarily a problem with people's skills being totally inapplicable, it's that you need that "and," right? You also need to invest in people's software writing skills, or you need to invest in people's ability to write automation, right? Which requires people to have time to learn it and also have time to put it into practice. So, in general, that tends to require people to have time away from doing the break-fix ops work in order to hone those skills and to not be distracted.

Tony Ferrelli:

Yeah. I was going to say carve out that time for people to go and do that and then encourage people. If there's a task that they do manually that takes them five minutes and it's going to take them four hours or five hours to code it the first time, say, "Go code it that time." Because the thing that takes you five minutes, you do 10,000 times in a month, right? Or 1,000 times in a month.

Just think about it. It's going to take you three hours this time or four hours to code it this time. But maybe it takes you an hour next week and maybe it takes you 10 minutes in the future, and just think about the time that you saved from doing those manual steps and figuring it out. RDP to a box, or SSH to a box, run this, run this, run this, run this. Write a script that does that and even if it takes them longer the first time that they do it, that's okay. Invest that time because it's going to pay off in the future. Because the only way people learn to code is if they practice. Learning Python is practice. Do it, do it, do it, do it, and you'll get better at it over time.
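[Editor's sketch, not part of the panel: the payoff Tony describes can be worked out as simple break-even arithmetic. The figures below reuse the ones he mentions (a 5-minute manual task, roughly 4 hours to script it the first time, 1,000 runs a month); the function names are hypothetical, and the default assumption that the scripted run costs no human time is a simplification.]

```python
# Illustrative break-even arithmetic for automating a manual task.
# All figures are assumptions taken from the numbers quoted in the discussion.

def break_even_runs(manual_minutes: float, build_minutes: float,
                    automated_minutes: float = 0.0) -> float:
    """Number of runs after which the time spent scripting is paid back."""
    saved_per_run = manual_minutes - automated_minutes
    if saved_per_run <= 0:
        raise ValueError("automation must be faster than the manual task")
    return build_minutes / saved_per_run


def monthly_hours_saved(manual_minutes: float, runs_per_month: int,
                        automated_minutes: float = 0.0) -> float:
    """Human hours saved per month once the script exists."""
    return (manual_minutes - automated_minutes) * runs_per_month / 60


if __name__ == "__main__":
    runs = break_even_runs(manual_minutes=5, build_minutes=4 * 60)
    hours = monthly_hours_saved(manual_minutes=5, runs_per_month=1000)
    print(f"Break-even after {runs:.0f} runs")
    print(f"About {hours:.1f} hours saved per month at 1,000 runs")
```

[Under these assumptions the four hours of scripting are repaid after 48 runs, which at 1,000 runs a month is less than two days of use.]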

Liz Fong-Jones:

Also, people don't have to do this alone, right? You can bring a software engineer onto a team of systems engineers. That's how the SRE movement at Google got started by this idea of let's combine these two skillsets, put them on one team, hold them jointly accountable for [inaudible 00:19:19] and mentor and train people both ways.

Tony Ferrelli:

Yep. I know on my list, build time with devs. Make sure you get a lot of dev time as you're doing it. Because in the end, what you want is you want your team to start influencing the way developers actually build the products they're going to go use. You want those hooks in there. You want that operational experience.

Liz Fong-Jones:

Great. So let's do a really quick lightning round. Two sentences from each person. What's the easiest thing about SRE to get wrong? Start with Holly.

Holly Allen:

Doing it too much by the book and not meeting the org where it is. I think we've heard, organizations are starting from all different places. People are starting from different places. You have to meet the reality of the moment and do what's going to work and push business outcomes.

Liz Fong-Jones:

Lex.

Lex Neva:

Exactly what Holly said. Also over-focusing on error budgets. Go look at an article by Will Gallego about error budgets, where he talks about how can you know that something that you were doing was going to chip into your error budget? It's an after-the-fact evaluation.
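[Editor's sketch, not part of the panel: an error budget is the number of failures an SLO target permits over a window. The minimal calculation below shows why it is an after-the-fact measure: it can only be computed from requests that have already been served. The function name, the 99.9% target, and the request counts are all hypothetical.]

```python
# Minimal error-budget report computed from already-observed traffic.
# The SLO target and request counts used below are made-up illustration values.

def error_budget_report(slo_target: float, total_requests: int,
                        failed_requests: int) -> dict:
    """Summarize how much of the error budget a past window consumed."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf")  # a 100% target leaves no budget at all
    return {
        "availability": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
    }


if __name__ == "__main__":
    report = error_budget_report(slo_target=0.999,
                                 total_requests=1_000_000,
                                 failed_requests=400)
    for name, value in report.items():
        print(f"{name}: {value:.4f}")
```

[Nothing in this calculation says which upcoming change will cause the next failures, which is the limitation Lex is pointing at.]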

Liz Fong-Jones:

Maira.

Maira Zarate:

Yeah. So, Holly and Lex have two great statements in regards to this, and I'd agree with them. Something easy for SRE to get wrong, from my perspective, for a larger company like mine, would probably be assuming that we know every single bit and end of [inaudible 00:21:01], which in part, we talked earlier about how collaboration and communication play a key part. We don't know it all, especially in the larger corporations with small teams.

Liz Fong-Jones:

And Tony. Two sentences. What's the easiest thing [inaudible 00:21:18]

Tony Ferrelli:

Hm. I think figuring out the right thing to measure and if you're measuring the wrong thing, you're not actually solving the right problem. So establishing those SLOs and making sure those SLOs make sense and actually achieve a [inaudible 00:21:34] result.

Liz Fong-Jones:

For me, the two easiest things that you could get wrong, because I'm the [inaudible 00:21:40] and I can say two things, are, first of all, trying to bite off too much at once, trying to transform your entire org when you haven't even figured out how to do it for one product or team. And the second one is adopting the name SRE without changing anything, which is also a very frequent anti-pattern where people just rename teams without changing what they do and why.

All right. So hopefully that addresses the question from our audience. So, let's talk about the next question, which is, okay, you've set up your SRE team, you've set it on the right path for success. But how do you actually communicate to your stakeholders that this experiment is worth extending or practicing wider across your organization? How do you keep the C-suite informed about the values of your contributions?

Holly Allen:

Yeah. This is really key, right? Because you need that C-suite buy in on the whole way of working. So, not to be a broken record, but you've got to tie it into the business outcomes that the business cares about. So, maybe they care about developer productivity because the development teams aren't delivering things fast enough. Or maybe they care about uptime or error rate. Whatever that is, tie into that and really solve those problems and then tell that story.

And so I think that it has to be a mix of providing the hard metrics and data, but also telling those anecdotes that paint the fuller picture that people are going to then go and repeat and use as justification when you're not in the room. And then the last thing there is include other teams in those wins, right? This isn't about the SRE team succeeding. It's about the business getting better. And so you've got to make those wins be cross team, cross pillar, cross everything.

Liz Fong-Jones:

I think the other element is talking about not just where did we fail and where did we succeed, but instead, what failures did we prevent? The more near misses that you can talk about, the more things where you can say, "We improved this, and this is why this isn't happening," I think that's a huge improvement. Or talking about like, "Hey, this is how we improved the velocity of this team and its ability to reliably ship features." Right? [inaudible 00:23:44] don't matter if you're not actually able to ship code, able to deliver value to users. And therefore you really have to focus on how you're making the developer experience and the customer experience so much better.

Are there any other thoughts on this topic from the rest of our panel?

I'll take that as a no.

All right. So the next question is starting to get to the "from home" bit of SRE from home, rather than the SRE bit. So what steps do you take to combat information silo-ing, especially when people no longer are sitting in the same office and overhearing conversations? So let's start with Maira.

Maira Zarate:

Yeah. So the first thing you'd probably have to think to yourself is how important is your team to you? So because we're in a different environment, at least most of us are; there's a few that have always been working from home. But either way, I think once you determine that your team is important to you, you reach out to them. So having conversations in standups, having an email, whatever works for you guys. A Slack channel DM is something that we do where we have the team and I'll ping someone, or someone on my team will ping me and be like, "Hey, we have this issue. This is how we solved it." And have a group conversation, whatever works the best.

We also set up things like Wiki documentations and let people know, "Hey, this is here." Just figure out how your team works as one, because one, you just don't want to keep it to yourself. You can't move forward as a team and improve on your skills without everyone having that skill amped up. So I would say, number one thing, how important is your team to you? And go ahead and relay all the information to them.

Liz Fong-Jones:

Excellent. And how about Tony? What are your thoughts?

Tony Ferrelli:

Yeah, that was a great answer. So for us what we've been doing, and I think daily standups make people cringe sometimes, but daily standups are so great because it gets everybody in the same room, virtually, and you get to talk about issues. And then I think it's really important for the TLs, or the team leads or the manager, to make sure, because you're aware of more stuff going on than the team hears. So it's important to draw those conversations out.

Like, "Hey, we had an outage yesterday, this is what happened. Bob, why don't you explain, or Jill, why don't you explain what happened?" And it gets the team talking and it gets people involved in stuff that maybe they didn't see right away. And then honestly, Teams and Slack and IRC and all these things are really, really great tools, just to keep the communication up and working.

The other thing I like is, set up a water cooler VC, that people can just join throughout the day, or set up a coffee break, or just set up a place where people can just gather and have a conversation. That it's just open and people can drop in and out, and there's no obligation that they have to stay on there. And I think it just gives people, it mimics that turn around from your desk and have a conversation with somebody over your shoulder. It just helps keep that communication going. And it's hard. Like if you're not used to working from home, it's going to be more difficult, especially if you really like that team environment. So you have to figure out ways to mimic it. And then as a leader, you have to help encourage those conversations to happen sometimes.

Maira Zarate:

Yeah. So just to chime in here, I feel as an individual, sometimes regardless of your position, sometimes we have to take initiative. So sometimes you want to say, "Hey, manager needs to tell us all this and all that." But sometimes our managers have a lot to do, or maybe the leads have a lot to do. So if something's overheard, how much of your time is it going to take you to help out your peers, or make sure that everyone's on the same page with things?

So sometimes not relying so much on, "Hey, I don't have to worry about it, because my manager will talk to me if it's really that big a deal." Take that initiative. Go push and make sure your whole team knows about everything that happened. How this solution was worked. Why not? Let's not be so dependent on one individual, because they are management, like why?

Tony Ferrelli:

Okay.

Liz Fong-Jones:

One thing that my team has discovered is we're encountering increased fatigue around open water cooler conversations. But a thing that has a lot of momentum on our team is a Slack bot called Donut that will set up random pairings of people and actually prod you to schedule 30 minutes to meet with each other. And that's really, really great for actually facilitating conversations where people can feel like they can actually be candid rather than, "Oh my goodness. I'm speaking in front of an audience of 10 people."

Tony Ferrelli:

Right. That's great.

Holly Allen:

Yeah. We use that too. And I think all of these ideas are really good, but I'll just say, in my experience, we're doing all of those things and it's still really hard. And you have to devote a ton of intention to it. So I don't think that expecting it to be just as good as all working in the same open office, which has its own problems, will be possible. And so you have to put in that effort.

Liz Fong-Jones:

All right, excellent. Let's go ahead and move on to the next subject, which I know is near and dear to me, to Holly, and to you, Maira, which is what's the biggest challenge in handling the observability tools and infrastructure, and why is it our job as SREs to think about observability? So I'll start with Maira.

Maira Zarate:

The hardest part of maintaining observability tools? From my end, it's the management of the tool itself, the functionality and any of the personal maintenance from our side. But talking about Autodesk and how we experience things, it's that we have a really small team, over 3000 active monitors, different services. Just keeping up with what's going on, what patterns we're seeing, the communication with the application teams. And sometimes even getting their attention.

So I would say it's everywhere, from our point of view; everything's difficult. But mostly it's the management of the tool itself, because primarily you need to make sure that your tool is healthy and stable in order to be able to test all these services and applications and functionalities.

Liz Fong-Jones:

Yeah. I definitely agree with that. Like if your monitoring and observability tools are not trustworthy, then nothing else is, because you have no way of measuring it.

Maira Zarate:

Exactly. Yeah.

Liz Fong-Jones:

Yeah, and a lot of the solutions that are out there that are self-managed are very costly in time, even if they're free. Like every single person who runs their own ELK Stack, you will talk to them and they'll tell you about every time they've torn their hair out doing it. How about you, Holly? What's most challenging about running an observability stack and being responsible for it?

Holly Allen:

Yeah. So, we're lucky enough to have a dedicated monitoring team. And they're not in charge of monitoring everyone's stuff. They're in charge of maintaining the tools and making sure they're meeting the needs. And one of the biggest problems we have is getting out of that operator mindset of "the ELK Stack is working, the third-party tools are working, our SLOs are fine" and into "are we solving the real problems that developers have?"

When we switched a couple of years back from having a centralized operations team, how do you teach all these developers how to use these tools well? How do you balance the fact that they do need to learn some new skills and expertise, but also how do we meet them where they're at and build better tools to make it easier for them? And therefore you make your monitoring team into a software development team like everything else.

I think a product manager can really help here if your company can spare it. And then again, partner with dev teams and solve those problems shoulder to shoulder, which can be challenging to make time for when you've got that giant ELK Stack to keep working. But it's got to come back to that. But it is very challenging because typically outside of the data, any kind of data warehouse that you've got, it's typically one of the biggest systems in the company. And so running that and just keeping it working isn't enough. You have to also solve the real problems. And that can be hard.

Liz Fong-Jones:

Yeah. One of the things that Charity Majors, my boss, says is that over time, the observability needs of an organization will become roughly 30% of its cloud budget. That you need to spend that amount of energy, whether it be of your own engineers, or through a vendor solution, in order to have sufficient visibility into what your systems are doing, what your software is doing. So you can either spend it by wasting your engineers' time, or you can spend it by spending money.

And in a lot of cases, it's not worth doing the heavy lifting yourself. These are things where you can rely on shared solutions. But yeah [crosstalk 00:33:30] I think the other interesting angle here is observability-driven development. With test-driven development, the testers on modern software engineering teams build testing tooling. They don't write your tests for you. The same should be true of monitoring and of observability. Holly, I cut you off earlier. You were wanting to say something.

Holly Allen:

I just loved that framing. It's like, don't think about it purely as your cloud budget, or your on-prem budget; think about it as your developers' time. And that's the most valuable thing you have in a software-driven organization. And so you've really got to then turn around and be able to quantify that.

Liz Fong-Jones:

Excellent. So another question about SRE from home, what are people's productivity and self-care tips for people who are adapting to working from home for the first time? And especially if you are in a situation where your life has been changed, both by your change to your working schedule, as well as by potentially having children, whether it be you have children, or there are people working on your team who have children. How does that change the dynamic of your teams?

Tony Ferrelli:

So, I don't have children, but I do have lots of folks that have children. So I think the advice we got from the previous speaker, I think was Amy. She said, "Be kind." You got to be kind to yourself. You're going to have a different schedule. So you have to establish what that schedule is going to be. And don't be afraid to talk to your team about it, talk to your manager about it. Like if you need to go take care of the kids for an hour, or two hours or something in the middle of the day, that has to be okay.

I have dogs, two dogs, I have two Huskies. They need lots of walks. They have a lot of energy. So I carve out time in my day to make sure that I can go do that, and then just adjust my working schedule to it. And if I have problems, then I talk to my boss, I talk to my team about it, and we figure out how we're going to work around that issue. So the first thing you really have to do is just forgive yourself. It's never going to be perfect. Just figure out what's going to work for you, and see if you can establish something.

And then with your partner, or your kids, or everything, try to lay out some ground rules. Sometimes that works, sometimes that doesn't work. It's going to fail sometimes. So it's okay if your kid pops up behind you on a VC. Actually, it's really entertaining. And I think people get excited about seeing my dog jump on my shoulder. They're not here right now. I locked them out of the room.

But that's okay. Let that stuff kind of happen. It's life, right? We're living life right now, so just be really kind to yourself and be flexible. Just understand you have to be flexible.

Holly Allen:

Yeah, those of us that ... Oh.

Liz Fong-Jones:

I love that people focus on kind of the idea of the team as a unit of delivery, not the individual. Like we can cover for each other, right? We can show each other that compassion. Also, we are not just working from home, we are working from home during a pandemic and during fascism. That's very right.

Maira Zarate:

Yeah, it doesn't just affect the working people, it affects the entire family down to pets, right? [inaudible 00:00:42]. So from my perspective, I've been working from home. However, like I said, what everyone's going through right now, it didn't just affect me in my work because I've always been working from home. I do have two children and it affected the whole family. So before it was like, okay, they went to school and I was alone. I got me time. I was able to focus and I never realized how much they ate in a day. Oh my God. But now that they're home, of course, we set up our agenda. So what I need to do now, because they're home all day with me, is I need to set up my agenda. Sometimes, like Tony said, I need to block a few hours during the day so I can make lunch and then have lunch together, have breakfast.

One thing that they've actually just recently learned (I have an eight- and a six-year-old) is how to be more independent. I'm like, "Hey, in the morning if I don't make breakfast guys, and my door's locked, you guys need to leave me alone and go get some milk and cereal." They'll go ahead and do it themselves. I mean, it's about working out what works for us as individuals. You want to make sure you get your pauses, your 15-minute breaks, go out for a walk so you don't stress yourself out over work. But I mean, it's a coping mechanism. Just figure out your best solutions for yourself. Take care of yourself, and of course your family, but it's hit hard for everyone.

Tony Ferrelli:

Yeah, and then be careful with the trap of sitting at your desk all day. It's really easy to spend 10 hours staring at a screen, right?

Maira Zarate:

Time flies.

Tony Ferrelli:

Time flies really, really fast. So take those breaks, go stretch, go do yoga in the backyard. Do something else. Be conscious of what you're doing and make sure you take some time for yourself. Then for God's sake, take a day off. I know we can't really go anywhere anymore, but if you need a day, take a day. Take a day off. It's not a bad thing. Go take a drive if you have to, but just be safe about it. You were going to say something, sorry. Lex, rather. Sorry.

Lex Neva:

My tip for working from home is to set specific work hours. So nine to five, something like that. Leave work at five as if you were going to catch a train, even though I know you're just going to your living room or whatever. That way your family knows when they can trust that you'll be home. Your work knows that they can't ask you to stay forever, so you're not working a 10-hour day. You'll have better work-life balance for it.

Take a lunch break as well because as SREs, we need to ... I mean, work life balance, as Amy says, it's really important for ourselves, also for the reliability of the services that we're supporting. If you have a bunch of people that are pushing themselves into burnout, then your service reliability will suffer as a result. So we need to be ensuring that we're keeping that work life balance and also modeling it for everyone who's not an SRE and maybe doesn't have reliability on their mind all the time. I'm kind of a pain everywhere I work, poking people to take vacations or telling them, "Hey, you've been on for a long time today, are you sure you still need to do this today?" That kind of thing. Take time off from normal hours if you have to respond to an incident at 3:00 AM or even 6:00 PM. Yeah, model that work life balance to improve our reliability. It's really important.

Liz Fong-Jones:

I really love what Amy said about taking vacations because that's a form of predicted breaking, right? I'd rather break and find out about a single point of failure when I know it's going to happen rather than have a catastrophic failure that, for better or for worse, leaves someone offline for two months. That's bad if you burn someone out to the point they can't work for two months. So be really cognizant of that and give yourself a lot of forgiveness.

All right, so with our remaining couple of minutes, I don't know whether we're supposed to get back on track with the original schedule, but kind of maybe really quick, what's everyone's ... we've talked about organizational dynamics, but what about the individual people who are on this call who are looking to get into SRE? How can people make that transition into SRE, potentially coming from a software development background, maybe college students? What would you recommend that people look into?

Holly Allen:

It kind of depends on where they're starting. If they're starting from a software background, start running some systems. Start learning that operational component so that you can think about the full life cycle of the code that you're writing and get beyond the like, "I checked it in to GitHub. I'm done," thing that we all learn, especially in college. If you're starting from the operational background, do the opposite. Learn how to write software, which has its own flow. It's not just writing scripts. You have to learn the whole life cycle of software development as well. And so balance out what [inaudible 00:05:50].

Maira Zarate:

Along with that, I feel that something a lot of software engineers don't really step deeply into is monitoring. Monitoring is key. Make sure you understand how to monitor successfully: go out there and look at common issues and how they occur. If you're already working for a company, as was mentioned earlier in our conversation with these panelists, go and see what kinds of issues there have been in the past. Question everything. I question everything, especially when you start. Why did this happen? Why are we doing it this way? Make sure you fully, 100% understand, because only when you do will you know, when something breaks, how things work and where to start. You have somewhere to jump to. That's something I would say, especially for everybody who's stepping in: "Hey, make sure you know monitoring. Take a look at all the tools and all the effects that can happen in a system, all the issues that we're seeing or that have been encountered in the past. And just always ask why. Why? Why? Question everything."

Lex Neva:

Focus on how all the pieces of the system fit together from a macro level. No one's actually going to be able to get the whole system in their mind, but start to see how those pieces fit together and ask yourself reliability questions like, "What happens when this fails? How are we going to respond to this? How does the whole system, including the people, work to deal with this kind of problem?" Read as much as you can of the greats in our field, like Laura Maguire, Lisanne Bainbridge, Richard Cook, David Woods, John Allspaw, all of them.

Liz Fong-Jones:

Clarify what you mean by "our field" Lex, because I think that those are books that many people who've been admins for a long time or software engineers for a long time may not have read. What is this field to you?

Lex Neva:

Oh, great. Now you've done it. I don't know. Next question. I don't know. SRE, reliability engineering. What does it mean to engineer reliability or resilience? That's a good question. And I don't know that we have a real answer for that yet, even though I make a newsletter for it.

Liz Fong-Jones:

Yeah, we're working it out, but I think a lot of these things are lessons. Like Richard Cook writes about human factors, and writes about safety systems and human factors, which is something that a lot of SREs don't necessarily have a formal background in, but many more should I would argue.

Lex Neva:

Yes, thank you. Yeah, definitely.

Liz Fong-Jones:

I definitely think the number one trait about a good SRE is curiosity, trying to understand why does it work, how does it work in both the human factors as well as the technical factors. I think that that attitude will get you a long way. You need to find a place where you are empowered to ask questions and then ask lots of questions, right? Perpetually be a learner. That's really a critical trait for SREs that you can cultivate. Holly, what backgrounds have you seen people come into your work with?

Holly Allen:

I've been lucky to have a mix. So I've got some folks who definitely came from more of that admin background who are taking what some considered like the traditional path into SRE. Then I've got some folks who came way more from a software background who just love the impact they can have as an SRE and are learning the things about monitoring and about network level work that you can do to increase reliability and things like that. So that's really great because then you can set mentoring relationships.

Liz Fong-Jones:

Yeah. We joke a lot, or we talk a lot in Silicon Valley about the kind of T-shaped profile, like both really, really wide, but also deep in an area. I think that that is especially true for SREs, that SREs by necessity need to feel comfortable collaborating with other teams, like knowing the boundaries of their expertise and always pushing at that to understand, "Okay, here's who I talk to to build those human relationships," to understand who I talk to if I encounter something at the edge of my expertise, but also trying to widen the expertise so you can cross functionally diagnose things.

All right, excellent. So we've now run about 45 minutes, so we're going to go ahead and wrap up the live portion of this panel.

