The Pandemic Brief: Assuring Essential Services

Henri Helvetica

Freelance Developer
Henri is a freelance developer who has turned his interests to a potpourri of performance engineering with pinches of user experience. When not reading the deluge of daily research docs and case studies, or indiscriminately auditing sites in devtools, Henri can be found contributing back to the community, co-programming meetups including the Toronto Web Performance Group or volunteering his time for lunch and learns at various bootcamps.

Holly Allen

Head of Reliability
Holly Allen is the head of reliability at Slack, with SRE, Monitoring, and Resilience Engineering in her portfolio. She is tireless in her efforts to make Slack the software reliable and scalable, and Slack the company a delightful place to work. Prior to Slack Holly worked at startups, DreamWorks Animation, and was Director of Engineering at 18F, a civic tech startup in the US government.

Lex Neva

Site Reliability Engineer
Lex Neva is interested in all things related to running large, massively multiuser online services. He has years of Systems Engineering, tinkering, and troubleshooting experience and perhaps loves incident response more than he ought to. He’s previously worked for Linden Lab, DeviantArt, and Heroku and currently works as an SRE at Fastly helping to make sure the Internet keeps running.

Tony Ferrelli

VP of Technology Operations
Tony is a 25-year Internet industry veteran who has served in various Network Engineering and Operation leadership roles, including Google and DoubleClick. Tony spearheads the management and operations of all Catchpoint monitoring data centers, supporting Catchpoint’s expanding corporate strategy, delivering stable, secure, and reliable operations.

Maira Zarate

Site Reliability Engineer
Maira is an Application Engineer at Autodesk, based in Novi, Michigan. She is obsessed with learning, especially the learning process that accompanies onboarding monitoring concepts for better site and service performance and availability. She has dedicated the past several years to site reliability, working with different synthetic and RUM monitoring tools.

Cool, cool, cool. So once again, just want to thank everyone for having me, Catchpoint, Peter, shout outs, Elena, and the good people, like I said, at Catchpoint. I'm sorry, I just have to move a few things here. So that being said, I'm going to get going, because like I said, it's going to feel like a lightning talk. It's not often that I get a nice lean 20 minutes to do a presentation, but I think it's going to be a lot of fun. So that being said, hey, here we go.

All right, so arrows it is. Why is my screen frozen? Oh, there we go. Hello, my name is Henri, actually. So I do speak French. It is a French name. That's why it's spelled with an I. You can find me @henrihelvetica, which is on Twitter, and also on Instagram. And long story short, it is a nom de plume, it's not my real last name, if you're not familiar. So that being said, I hope that answers that question. I'm from the greatest city on the planet, which is Toronto. It's in Canada, home of the world champion Toronto Raptors, so I have to throw that in. And it's also home of one of the greatest Caribbean festivals around the world, but certainly in North America. It's called Caribana. I mean, it's not going to happen this year, obviously because of COVID. But if you're ever around in 2021 and you want to see something like this, please come by, check us out. We have a massive Caribbean community here in Toronto and I love it and that's why I stay here.

That being said, I don't have a lot of time, so I want to get cracking right away. So we're here talking about SRE. So it's funny because I never knew what SRE was. I remember a little while back, I was digging through some documentation, I was like site reliability engineering. I'm like, "I guess it's sort of performancy, but I'm not sure." And I remember asking some people left and right. And they kind of gave me a little heads up about it, but I was like, "All right cool." But it sounds like there is a little bit of performance involved.

Why is that important? Because I tend to hang around the performance spaces. I've picked up a lot of documentation. I read blogs, I've spoken at conferences almost exclusively around performance. I've been doing a little bit of accessibility lately, but performance is something that I enjoy talking about. So we're here talking essentially a bit about performance, which is what I'm going to lean on for the next little bit. Now I was digging through some docs and some blog posts and whatnot, and I did [inaudible 00:03:09] this quote by Johan Anderson, who works at Google actually. And he said, "When it comes to SRE, you're responsible for whether a service is working or not." And I thought to myself, "Okay, that sounds sort of performancy, so I could take that." And then he added that (again, what is wrong with my deck?), "In SRE, the projects being worked on have direct impact on users."

And that's when I started to really vibe with that because on the performance side, we certainly do talk about the users and specifically the user experience. So that's when I was starting to really understand what was going on. And like I said, the user experience is kind of like a P1 for us. And another quote that I liked, which was this one right here by Liz, who's going to be speaking a little later: she described SRE as a behind-the-scenes role, but a critical one. And again, I felt that that really created some parallels in my head because on the performance side, we are sort of behind the scenes, we sort of pop up in dev tools. We look around, we poke around, it tends to be a little slow. We do some profiling, some tracing, we read source code, yada, yada, yada. It's not really that fun.

But performance, like I said, is something that's super important and it's certainly important to me. And I wanted to sort of talk about it in the scope of this pandemic that we're having. So I'm entitling this talk The Pandemic Brief. You might see exactly what's going on here, but specifically it's because of the sort of information that's out there that we have to get ahold of. And we are trying to assure the essential services are there for us to access. So with that being said, let's get going.

This story is really about, and it is a story, it's about the web and how important the web has become to us. It's become a bit of a vital resource. In fact, the web for us really, and I'm sure myself and yourselves as well, it's providing a bit of a quotidian service. Pretty much every day we're on the web, getting something done. We might be transferring money. We might be paying a bill. I don't know, we might be buying clothes, getting food, renting a car, whatever it is, paying for transit even, but the web has become an essential service, that's without a doubt.

So with that being said, when a crisis comes around, we certainly are going to be faced with a challenge because again, will you be able to pay these bills? Will you be able to get on a Metro? Will you be able to, like I said, rent the car or pay for food, whatever it is that you might need to do? So without the web and without some kind of stability, reliability, things can get kind of serious. Or as I would like to say, things can get SRE, serious. I don't know how you pronounce that, anyways, but that's where in fact performance could come in very handy because we've seen many times that a site that is not performing well... A site that's not performant can bring about certain challenges.

And this talk really kind of came about when I was sort of rifling through Twitter recently, and a buddy of mine actually brought up a tweet of mine from 2018. And it was right around the time Hurricane Michael kind of hit the US shores. And I tweeted this out and I was basically looking at some crisis sites and specifically I'd gone to, I believe it was fema.gov. And I was basically doing a quick profile, quick audit of the site. And I was noticing some heavy resources coming down the pipe. And I thought to myself, "This is the kind of site that needs to be lean." Because people might go to fema.gov for some information here and there. But obviously during a crisis moment, they're likely going to be flooded with requests.

And that's when I started to take a look at the site and I was like, "Okay, you know what? This could be a little different and whatnot." And just kind of rifling through some ideas in my head. But like I said, people are going to these sites because they are the one location they need to go to for some vital information. And I do mean vital because you're talking about literally tornadoes, hurricanes, and the information that they need is likely going to be posted on this site. And I get back to the idea that, again, these websites are the places that we need to go to. They are essentially essential services, or part of an essential service.

Now this is something that, again, I've been sort of kicking around and playing with for some time. And I remember, I mean, everyone goes to cnn.com, but I didn't know whether or not people knew that even some of these news sites understand that there are moments in time, moments in life, where there might be, again, a cataclysm, hurricanes, tornadoes, whatever it might be. They do have these variants and these lite sites. And if you actually go to lite.cnn.io, you're basically going to get the exact same information as cnn.com, but in a lite version. So a lot less CSS, if any at all, very little JS, if any at all, and it's meant to load as fast as possible, under the conditions that you'd probably expect during a crisis, like crappy bandwidth and intermittent connections, because who knows, some towers may be down. So you want to make sure that some of that vital information, some of the vital resources can come down the wire onto your mobile device and you can get the information that you need.

Now in 2020, I'm trying to think, has there been a crisis that we're kind of... Oh yeah, there might be. This may look familiar, you may have seen it up close. If you haven't, it kind of looks like this. COVID-19, now I'm sure like everyone else, it was a bit of a shock, certainly. You probably woke up one morning and the lights just went off, but the crisis that it created overnight and worldwide was literally unprecedented. You're talking about internet traffic that went from just what we believed was normal to suddenly skyrocketing because people were obviously working from home. If they weren't working, they were certainly at home and doing what they knew to do, which was get on the internet. Whether it be looking for information, whether it be working from home and working from home now meant being on Skype, Zoom, Google Hangouts, or whatever tiers of video conferences that they have over at Google.

But this was taken from a blog post from Fastly, just giving you an idea of the change in traffic across the world. And this is mostly G7 countries with some data from the US as well. So you can see right here that there was definitely a surge in internet traffic, and there was a need for people to be on the net to get the information, if not to work.

So bunch of things start to happen. I mean, say information like this. I thought that this was actually kind of funny and almost a joke, but it turned out not to be because we are seeing that again, the reliance on the net, video games operate right on the internet and they need a lot of bandwidth and breaking news was that they're asking video game players to basically ease up on the bandwidth or times of day that they were playing so that we can assure that there would be some kind of bandwidth for people who need to do work, et cetera. And this was just the beginning.

Many may have heard about this, sort of this news that came out of Europe first, obviously. So the big video platforms were being asked to cut their streaming quality, to ease up on the networks. And this was actually really interesting to me as well, because first of all, it did have to do with some performance. As well, I've been sort of playing around with, or not playing around with, researching really what was going on in the video space, because since video was being projected to be one of the top resources being used a couple years from now, there was some research being done around codecs and things like that. But COVID basically forced that maybe two, three years early. So right now I would not be surprised if by the end of the year, we start to look at some of the data and we see that video was immensely the number one resource being used across the net.

So when the European Commission asked YouTube and Netflix to cut down the quality, [inaudible 00:14:03]. Pardon me. So Netflix, the engineers got to work and they started to sort of figure out what to do. They had been asked to go from HD to SD at one point, but they're obviously smart people over there, so they made do and they were able to cut their traffic bandwidth down by 25% for some of the key areas like Italy and Spain early, because they got hit.

Now, there are so many metrics that we employ to sort of look after performance, from TTFB, time to first byte, [inaudible 00:14:41] content loaded, FCP, LCP, speed index, TTI, FID, there are so many. In fact, in a time of crisis, there's another one called the Waffle House Index, which I thought was very interesting. FEMA was so amazed at how the Waffle House could manage in a crisis that they started to look at the Waffle House in terms of the severity of a storm coming. But they're also impressed with how reliably they've been able to manage their own resources during a crisis. You have to look that up, I'll post the link in Slack maybe, but it's really interesting. But when we get back to metrics, they are absolutely important. And that's how we're going to essentially figure out the health of a site. As we like to say, you have to measure all the time, and from measuring, you're going to get some metrics.

Now, let me get you to the brunt of the story. As the crisis sort of grew, and we knew that users were going to get to the internet to get all the information they needed, people had to make updates. You have to make sure your site is ready when it goes from possibly getting the odd request here and there for some information to an immense amount of traffic, because people need to get some of that vital information.

Now, this is a great story, actually, so the California government set up this covid-19.ca.gov, but specifically made sure that they made some site updates so it was as user friendly as possible. And this is a blog post you'll be able to read, but I thought this was fantastic. They essentially wanted to make sure that they prioritize the user in a crisis. And they talked about building the California COVID-19 response site. And they go through a bunch of the technologies that they use and why they used [inaudible 00:16:56]. But they certainly list performance as one of the key factors. And you'll see right here later on the page, they talk about preserving performance and they posted some performance scores that they have here.

Now I'm sure many of you are familiar with Lighthouse, which is a Google tool that is there to sort of audit a website, and one of the audits is performance. And you get a performance score out of it. And here on a Moto G4 at 3G, they had a performance score of 96. Now 96 is a fantastic score, it's out of a hundred. Obviously some people are out there looking to try and get their hundred, their C-note, all the time. But the point of the matter is that they looked after the user. They made sure that their website was as accessible as possible. They actually talk about accessibility in this blog post as well, but they certainly do talk about the performance. And what does that mean? They probably have a disciplined delivery of resources coming down the wire. Now the performance score here is 96 and I'm bringing that back up because the average Lighthouse score across the net right now is 41. So essentially there's a lot of work to do, but it's great to see that they are absolutely near the top.

But I also decided to take a look at some of these other crisis areas that are having a bit of a hard time dealing with COVID. And I'm assuming that a lot of the residents in these states will go to their sort of emergency sites to get some information. Well, this is the Florida Department of Health website, so floridahealth.gov, and they scored a 22 out of a hundred in Lighthouse. And you can see there's some recommendations, we can get into that a little later. Their Time to Interactive was a little low, First Contentful Paint was low, but that's what they're faced with. Texas, which is apparently not in the greatest place right now, their performance score was 25. So they unfortunately have a little bit of work to do. And Georgia, again, going through a little bit of a moment right now, they have a score of 23, and that was the DPH, the Department of Public Health, at georgia.gov.
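To put the scores quoted above side by side, here is a minimal Python sketch. The numbers are the ones cited in this talk; the 0-49 / 50-89 / 90-100 bands follow Lighthouse's red, orange, and green scoring ranges; the names `SCORES`, `lighthouse_band`, and `report` are hypothetical and only for illustration.

```python
# Lighthouse performance scores quoted in the talk (each out of 100).
SCORES = {
    "California COVID-19 site": 96,
    "Florida Department of Health": 22,
    "Texas": 25,
    "Georgia DPH": 23,
}
WEB_AVERAGE = 41  # average Lighthouse score across the web, as cited in the talk


def lighthouse_band(score: int) -> str:
    """Map a 0-100 score to Lighthouse's three scoring bands."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 90:
        return "good"
    if score >= 50:
        return "needs improvement"
    return "poor"


def report(scores: dict, average: int) -> list:
    """Return one summary line per site, highest score first."""
    lines = []
    for site, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        delta = score - average
        lines.append(
            f"{site}: {score} ({lighthouse_band(score)}, {delta:+d} vs. average)"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(report(SCORES, WEB_AVERAGE)))
```

Read this way, the California site sits well above the cited average, while the other three sites fall below it and land in the lowest band.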

So that being said, I'm not saying that they're not looking after their users, but I'm certainly saying that their users could be challenged in accessing the site. And there's some ways to go here. So once again, this is about getting vital information down to the user as best as possible. How do you avoid bottlenecks? How do you make sure that they can log on? Because they're going to be on mobile, and probably with the multitude of people hitting the site at the same time, they could be up against some congestion. You want to make sure that the resources can come down the wire as quickly as possible.

Now this is all about COVID right now, but as Don was talking about earlier, what about education? We know that right now, a lot of students are going to be staying at home, not everyone has amazing bandwidth. I don't want to sit there and brag, but I'm going to for a hot second, I just posted the fact that I think I was getting north of 500 megabytes a second, but this is very serious because with the rush to get the students back in a sort of like classroom, whether it be virtual or not, if it's going to be virtual, they're going to be online. So you want to make sure that everyone can access the information as easily as possible.

Now, with that being said, a lot of the challenge is that, especially in education, sites are going to be media rich. That means it's images, that means it's a lot of videos, and that means also it's going to be data rich. So it's going to be a lot of bytes coming down, or megabytes coming down the wire. Luckily, right now, there are next gen video formats and image formats that are being developed, worked on as we speak. That's what I'd love to sit there and talk about for like another hour because I could, but maybe I'll do that in a Slack channel afterwards. But this is the type of research that's being done.

We talk about SREs understanding and knowing best practices. Well, an SRE would probably either know themselves or make sure they work with their team on understanding what the best practice in this area, media management, would be. So whether it be using the latest codecs or something like WebP, because that's now supported by Safari, these are things that you want to make sure you work into your solutions.
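One common way to work a format like WebP into a solution is server-side content negotiation: inspect the request's `Accept` header and serve WebP only to browsers that advertise support for it, falling back to JPEG otherwise. The sketch below is a minimal illustration under stated assumptions, not any particular site's implementation. The function names and the `hero.webp` / `hero.jpg` file names are hypothetical, and the header parsing is simplified (it only looks for an exact media type with a non-zero `q` value).

```python
def accepts(accept_header: str, mime: str) -> bool:
    """Return True if `mime` is listed explicitly in the Accept header
    with a non-zero q-value. Wildcards such as */* are ignored on purpose,
    so a format is only served when the client names it."""
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if fields[0].lower() != mime:
            continue
        q = 1.0  # q defaults to 1 when the parameter is absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        return q > 0
    return False


def pick_image(accept_header: str, variants: list) -> str:
    """`variants` is a list of (mime, path) pairs in preference order.
    The last entry is the universal fallback and is always acceptable."""
    for mime, path in variants[:-1]:
        if accepts(accept_header, mime):
            return path
    return variants[-1][1]


# Hypothetical variants of one image: WebP preferred, JPEG as the fallback.
VARIANTS = [("image/webp", "hero.webp"), ("image/jpeg", "hero.jpg")]

if __name__ == "__main__":
    print(pick_image("image/webp,image/apng,*/*;q=0.8", VARIANTS))
    print(pick_image("image/png,*/*;q=0.8", VARIANTS))
```

If responses chosen this way pass through a cache or CDN, the server would normally also send `Vary: Accept` so that a WebP response is not handed to a client that never asked for it.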

So in conclusion, I talk about the disciplined delivery of data, and I think COVID is teaching us that firsthand. Obviously the people at the california.gov department that looked after the COVID-19 site also understand that. And from what I remember, I think they're working with a lot of different teams across the country. So ultimately what I'm trying to say is, try to be responsible, try to make sure that your team is as frugal as possible with resources, so that everyone has access to your site as much as possible. And lastly, I just want to say, wear your masks, please. And thank you once again, Peter, Catchpoint, Elena, and I'm out.
