OK, so you are not Google. What should SRE mean for your organization?
SRE, that is, SRE as defined by Google, is not applicable to most organizations. Organizations need to take the thought process and culture behind Google’s SRE and adapt it just enough to make it suitable and viable for their own business needs. As I see it today, large enterprises are mostly failing at doing this. They are either attempting to adopt SRE in its purest form, not realizing they are not Google, or totally changing (corrupting) it to suit how they have always done things and their broken culture, making what they call SRE an SRE in name only. This session will delve into the underlying philosophy behind SRE and present practical approaches to adapt and adopt SRE in the enterprise.
Sanjeev Sharma is an internationally known DevOps, Cloud Transformation, and Data Modernization thought leader, technology executive, and author. Sanjeev’s industry experience includes tenures as CTO, Technical Executive, and Cloud Architect leader. As a former IBM Distinguished Engineer, Sanjeev was recognized at the highest levels of IBM’s core of technical leaders. He is currently a Principal Analyst at Accelerated Strategies. Sanjeev provides leadership to drive the adoption of cutting-edge solutions, architectures, and strategies for DevOps and Cloud transformations, and advises C-level and senior technical executives leading these transformations. Sanjeev published his second bestselling book, ‘The DevOps Adoption Playbook’, in 2017. He regularly blogs and podcasts on DevOps, Cloud, and Data Modernization on his popular blog http://sdarchitect.blog
Head of Reliability
Holly Allen is the head of reliability at Slack, with SRE, Monitoring, and Resilience Engineering in her portfolio. She is tireless in her efforts to make Slack the software reliable and scalable, and Slack the company a delightful place to work. Prior to Slack, Holly worked at startups and DreamWorks Animation, and was Director of Engineering at 18F, a civic tech startup in the US government.
Site Reliability Engineer
Lex Neva is interested in all things related to running large, massively multiuser online services. He has years of Systems Engineering, tinkering, and troubleshooting experience and perhaps loves incident response more than he ought to. He’s previously worked for Linden Lab, DeviantArt, and Heroku and currently works as an SRE at Fastly helping to make sure the Internet keeps running.
VP of Technology Operations
Tony is a 25-year Internet industry veteran who has served in various Network Engineering and Operations leadership roles, including at Google and DoubleClick. Tony spearheads the management and operations of all Catchpoint monitoring data centers, supporting Catchpoint’s expanding corporate strategy and delivering stable, secure, and reliable operations.
Site Reliability Engineer
Maira is an Application Engineer at Autodesk, based in Novi, Michigan. She is obsessed with learning, especially the learning process that accompanies on-boarding monitoring concepts for better site/service Performance and Availability. She has dedicated her past years to site reliability, working with different Synthetic and RUM monitoring tools.
Let's get started. So thanks for the introduction. I gave you some tongue twisters there, but I appreciate it. But as mentioned, my name is Sanjeev. Sanjeev Sharma. My background has been in the DevOps space; I've written two books on DevOps. I wrote the original DevOps for Dummies, IBM Edition, which was a free 70- to 80-page download from IBM, available over there. And then a few years ago, actually the same year when the SRE book came out from Google, I wrote the DevOps Adoption Playbook. So these are my claims to fame. And feel free to get hold of these books. And I blog very regularly on my blog, sdarchitect.blog.
And, in fact, this talk came out of a blog piece I wrote, and the title of that blog post was "You Don't Need SRE. What You Need Is SRE." And, essentially, I was talking about what we're going to talk about today is, how do you adopt SRE in an organization which is not Google, right? I mean, Google's SRE book, while it is great, I mean, it is the standard, it defined SRE. What do you do when your organization does not look like Google, right? So let's jump into that. Right?
So the first question that arises there is, what does Google's organization look like? So I can know whether I look like that or not. Right? So here's some key factors. Why did Google actually go and create SRE as a practice? Well, Google has massive data centers. And what's interesting about these data centers is that they are all based on commodity hardware. Right? The hardware is fairly identical across teams. And they are commodity hardware, and just because of their massive size, they have a very high incidence of failure. Hard drives fail, switches fail, other physical devices fail. And they talk about this in the introduction chapter of the SRE book. That it is not uncommon to see dozens or hundreds of failures in a massive data center on a regular basis.
So what they need to do is have the ability to dynamically shift workloads around so that when one piece, one hardware component, is being serviced, they can easily move the workload which is being run, without interruption or loss of quality of service, to another part of the data center which is still operating as desired. This, by itself, could disqualify most organizations. Because in most companies where I've gone and looked at data centers, they are made up of hardware which is, first of all, not commodity. And in most cases, it is custom hardware. But if not custom, it is generic hardware that has been optimized for the workload being run. Right?
So you get hardware, you get compute, network, storage, memory, based on the specs of what you need for the workload you're going to run on it. Not the other way around, which is what happens in a cloud, where you go look at certain [inaudible 00:04:07] of images and instance types available. And then you pick one of them based on your workload. And if it's not optimal, you change your workload. You can't go to Google and say, I want you to create a new image type for me. I want you to create something different for me. Right? The services which are available from them are what is available.
So at Google, these operation teams have to handle these incidents and outages on a constant basis. And like we heard in Jamie's talk earlier on, two talks ago, right? It results in a lot of pain. It results in a lot of stress for them because they're constantly being faced with these situations. But one good thing about that is that because their data centers are fairly homogenous, you can walk into any data center anywhere and it looks pretty much the same, and this is true for all hyperscale cloud vendors. They can automate a lot of these things that happen, the responses to the incidents, because they keep happening over and over again. So the idea here behind SREs is that I will have [inaudible 00:05:08] looking in my ops team. And I'm oversimplifying. And their job will be to identify tasks which are repetitive, and automate them. So that the actual ops people who are dedicated to doing incident response and ops can focus on the outliers.
They can focus on things which are not the norm. They are outside the boundary conditions. They are once-in-a-few-months kinds of incidents, which are total outliers. So the goal of SRE, at an organization like Google, is to develop software to automate as many of the repetitive tasks as possible, and that's detection and remediation of incidents and outages and degradation in quality of service. So that your experts in incident response can focus on the outliers. Now, so you heard this sentence before, right? I mean, when software engineering meets ops, this is when you get SRE. So what do you do in an organization which is not Google, where you don't have these kinds of highly optimized, massive data centers which are homogenous, where incidents of the same kind happen over and over again?
Well, you still need reliability engineering. You still need to make your services and your systems more reliable. So maybe you change the definition of the S from site reliability engineering to service reliability engineering. But you don't need to do it the way Google does. You need to do it a bit differently because your needs are different. So before I talk about how to do SRE, and I've been in this space for around three, four years now, and I've worked with a lot of clients where I've helped them adopt SRE-like practices. And even before that, I've helped clients adopt reliability engineering in general, resilience and reliability. How to meet their SLA goals for high availability, for example. And there are certain anti-patterns I see. And these are the top three anti-patterns I see, right?
First is, they replace their ops team with SREs. And the entire ops team is gone, and they're replaced with people who are software developers. And they attempt to do this, but it never actually happens, because they realize it doesn't work. But this is an anti-pattern, right? Replacing your ops team with all SREs, and getting rid of all the expertise you have on your data centers, is an anti-pattern. And we'll delve into each one of these in the next few slides. The other anti-pattern I've seen, which is even more common, is you take your DevOps team and rename it the SRE team. Right? In fact, I was talking to a colleague of mine and he was saying that a developer he mentors, his title has just changed from DevOps to SRE. And he's been given no training.
He's been given no tools, and suddenly he is an SRE. Now, and I'll go into this in more detail in the coming slide, right? And third is, try to adopt the SRE principles by the book with no effort to change your culture. Bring all the automation, but the culture doesn't change. Well, that's a problem. You will have significant challenges when you do that, because SRE requires a cultural change. So let's delve into each and every one of these. So often, the enterprise is not easy. If you're a startup, or if you're a company, even an enterprise, which is purely in the cloud, and you're going all in on one cloud and everything is there, life would be easier. But most large enterprises are not like that. They are not homogenous. They are more monolithic, right? I mean, there will be multiple generations of technologies living together, right?
And in a large enterprise, a bank or healthcare company, or even a telco, you walk in. Or any kind of company, right? I was recently doing some work for a trucking company. You walk into the data center, and it's what, if it were a human household, we would call a multigenerational family. Right? There are five generations living there together, right? There are mainframes running 20-year-old systems. There is Kubernetes running in another part of the data center. There is a very outdated version of WebLogic running somewhere else. Windows is all over the place. And I mean Windows, older versions, and they're all trying to patch it to keep it up to date. It's a multigenerational household. Secondly, the data centers have very limited capacity. They're not elastic, right? You're not going to buy 50% more capacity in your data center so that you can move things around dynamically, like a hyperscale cloud does.
As I already mentioned, most of the hardware is custom. Optimized for the workload you're running. Thirdly, the networks are a pain, because you're trying to link all these systems together. And most organizations are hybrid and multicloud. You have SaaS things running somewhere else. You have third-party services coming in. Networks are a problem. Most of these data centers have hardware, services, and systems coming in from multiple vendors. Vendor management is itself a challenge, right? I mean, some network goes down, who do you call? Do you call Equinix? Do you call your network router vendor? Do you call your switch vendor? Who do you call? Is it a software issue? And hiring software engineers to do ops is extremely difficult. In fact, this is a question I ask every CIO who I have this conversation with: are you having trouble hiring software engineers to do software engineering?
And the normal answer is, yes, it's very painful. And yeah, so you are having trouble hiring developers to do development, for developing your product. And you think you're not going to have a problem hiring software engineers to become SREs and write code for ops? What makes you have that assumption? So you still need your ops team, but you have to supplement them with SREs. So secondly is, how do you adapt? How do you [inaudible 00:10:41] SRE teams, right? So renaming your DevOps team SREs is an anti-pattern. So first and foremost, and let me get on my soap box here. You shouldn't have a DevOps team. There is no such thing as a DevOps team. DevOps is something everybody does. They have a different role in how DevOps is adopted and adapted, and how the various tasks are performed, but there shouldn't be a new silo called the DevOps team which is now the intermediary between all the stakeholders in your application delivery pipeline.
You don't have a DevOps pipeline, you have an application delivery pipeline, and you shouldn't have a DevOps team. If you have a DevOps team, most likely those are the people writing automation, right? They're writing scripts. And that's a very, very, very narrow sliver of DevOps. But let me get off my soap box. SRE roles and DevOps roles are different. SREs focus on reliability engineering, on resilience, on managing your SLAs and SLOs. And on identifying and automating stuff. Whereas DevOps is about having a continuous flow of innovation and feedback from what you are delivering so that you can learn. So let ops specialists do ops. Let SREs, the software engineers who you hire to do ops, which is going to be very difficult to do, do SRE. And let everybody else who's involved, all the stakeholders together, operate as a unit and do DevOps, right?
Let's get rid of that DevOps team. If you have a DevOps team, if there is a team called the DevOps team, they should be DevOps coaches. Their job is to teach people how to do DevOps correctly. But I'll get off the soap box real quickly. Lastly, establishing an SRE culture. Right? First and foremost, in most traditional enterprises the measure of success is mean time between failure. It is a horrible metric. Mean time to restore, mean time to recovery, is a much better metric. Why? Because mean time between failure tells me nothing about the failure. Did I have a 10 minute failure? Did I have a 20 second failure? Did I have a two day failure? Did I lose data? Did I just lose quality of service? What happened? Am I going to be fined by the Federal Reserve because I had an outage? Am I going to have to pay penalties because I didn't meet the SLO?
Doesn't tell me anything. So you have to move from a culture of mean time between failure to mean time to recovery. And establish the right SLIs, the indicators, and the SLOs for that. I think Jamie talked about this earlier: most organizations don't have [inaudible 00:12:55] budgets. Exactly. Get them. Establish them, right? And incident postmortems, right? We talk about blameless incident postmortems, right? And trust me, most postmortems of incidents which happen in large enterprises are blameful, they're not blameless. The whole idea is to figure out, who can we throw under the bus? To me, that's the wrong thinking. But it is even incomplete thinking to figure out why it happened. Not the, what happened, right? The goal of the incident postmortem is to figure [inaudible 00:13:23] it happened. And what can we change? What can we learn from it, so that we can improve how we operate? Whereas [inaudible 00:13:30] so that it doesn't happen [inaudible 00:13:33]. Right? Based on repeating this. I once used the term, I'm like a broken record in front of my children.
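To make the point about mean time between failure versus mean time to recovery concrete, here is a minimal Python sketch. It is an illustration only, not something shown in the talk. The two services, their incident times, the 30-day window, and the 99.9% availability SLO are all made-up assumptions. The budget calculation also assumes that the budgets mentioned above are error budgets, since the word is inaudible in the recording.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One outage, described only by when it started and when service was restored."""
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def total_downtime(incidents):
    # Sum of all outage durations in the window.
    return sum((i.duration for i in incidents), timedelta())


def mtbf(incidents, window):
    # Mean time between failures: uptime divided by the number of failures.
    if not incidents:
        raise ValueError("need at least one incident")
    return (window - total_downtime(incidents)) / len(incidents)


def mttr(incidents):
    # Mean time to recovery: downtime divided by the number of failures.
    if not incidents:
        raise ValueError("need at least one incident")
    return total_downtime(incidents) / len(incidents)


def budget_remaining(incidents, window, slo):
    # Allowed downtime under the SLO, minus the downtime actually incurred.
    # A negative result means the budget is overspent.
    return window * (1 - slo) - total_downtime(incidents)


# Hypothetical data: both services fail twice in the same 30-day window.
WINDOW = timedelta(days=30)
SLO = 0.999  # assumed availability target
T0 = datetime(2024, 1, 1)

service_a = [  # two 5-minute outages
    Incident(T0 + timedelta(days=5), T0 + timedelta(days=5, minutes=5)),
    Incident(T0 + timedelta(days=20), T0 + timedelta(days=20, minutes=5)),
]
service_b = [  # two 12-hour outages
    Incident(T0 + timedelta(days=5), T0 + timedelta(days=5, hours=12)),
    Incident(T0 + timedelta(days=20), T0 + timedelta(days=20, hours=12)),
]

for name, incidents in (("A", service_a), ("B", service_b)):
    remaining_min = budget_remaining(incidents, WINDOW, SLO).total_seconds() / 60
    print(f"service {name}: MTBF={mtbf(incidents, WINDOW)}, "
          f"MTTR={mttr(incidents)}, budget left={remaining_min:.1f} min")
```

In this made-up data the two services have nearly the same MTBF (roughly 15 days versus 14.5 days), yet one recovers in minutes and stays within its budget while the other takes half a day per outage and overspends its budget many times over. Only the recovery time and the budget show that difference.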