Polar Squad

Get started with SRE, the easy way

Site Reliability Engineering (SRE) is all the rage right now. It might be trendy, but it deals with timeless concepts to improve your teamwork, processes and tools, ultimately improving the reliability of your services. In this article, I’m going to provide you with easy ways to determine if SRE is for you and to get started with reaping its benefits. I argue that done right, SRE is a consistent, no-nonsense way of enhancing the effectiveness and wellbeing of your tech teams.

What is SRE?

The quote that goes along with the acronym SRE is by Benjamin Treynor Sloss from Google: “SRE is what happens when you ask a software engineer to design an operations team.”

SRE deals with creating scalable, reliable systems using software engineering methodologies. Automation is an important part of SRE, but so are cultural aspects, communication and the creation of multi-disciplinary teams for these purposes. SRE is often seen as heavily concerned with uptime. In practice, it’s less about maximising uptime and more about finding a meaningful service level objective (SLO) and working towards it.

Do I need SRE?

At Polar Squad, we see SRE as one possible practical implementation of the ideas of DevOps. It’s important to note that the concepts are not novel – DevOps deals with things that have been done since the nineties. What’s new about SRE is the number of responsibilities that an SRE team has and coordinates.

That is to say, you probably already ‘have’ SRE in some form or another. However, if you observe certain symptoms, it is worthwhile to review your processes. 

  • If you’re a bank and your maintenance window overflows into Monday, that’s a symptom. 

  • If you have too many service disruptions, that’s a symptom. 

  • If your customers are complaining about degraded performance or poor reliability, that’s a symptom.

  • If your ops team members are burning out, that’s a symptom as well.

A competent SRE consultant can look at both business and engineering to help trace the symptoms to their causes, regardless of whether these are technical or cultural.

Who’s the right person for the job?

Typically, companies start by renaming their system ops to SRE, which is in many ways logical, but not optimal.

When you try to shift from system ops to SRE, the sysadmin – who’s very much used to dealing with just technology – suddenly needs to talk to people and manage incidents. If they identify a problem in app logic, they need to contact the relevant teams and propose solutions while considering team dynamics, communication and other social aspects. It takes another set of skills. It also requires specific leadership: You need to lead an SRE team differently from a sysadmin team simply because you need to encourage and motivate communication between teams.

The fact that sysadmins do not automatically make for good SRE experts highlights what’s actually new about SRE – not the skills themselves, but rather who has to have these skills.

So we arrive at the way I recommend starting out with SRE: In my experience, it’s often beneficial to bring in SRE consultants to survey the situation. If you have people who are already conscious of reliability and other SRE-related themes, great! You can start building the needed culture and tech stack. Consultants can help you identify potential experts for forming SRE teams. It’s good to note that the work doesn’t have to be carried out by a dedicated SRE team – what’s important is that the SRE work gets done.

What’s the typical SRE process?

Most of the time, it’s a good idea to start with an SRE assessment, which will help you identify your maturity in SRE-related topics. 

This assessment is a snapshot of your system, teams and environment. Each customer has their particular mix of culture and technology. The assessment helps us learn how well this mix works.

Working together with your teams, we’ll identify the symptoms and go through the underlying factors. The assessment team might, for example, take a look at one specific service from the standpoints of both business and engineering reliability.

As a result, you will have a comprehensive overview of the reliability situation in your company. There’s always a workshop describing the results and the recommended steps to take.

Assessments are extensive work, so it’s a good idea to start small. With a smaller scope, it’s easier to see specific, actionable results. 

Many recommendations make sense to tackle with internal forces, while some might call for assistance. Alert fatigue is a good example. It happens when too many alerts go to humans, usually because alerts are routed too broadly out of fear that something critical will be missed. A team in that situation usually benefits from an external expert, who can help you look at the alerts critically and root out the unnecessary ones. We can also help you communicate the changes to relevant parties.
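To make the idea of rooting out unnecessary alerts a bit more tangible, here is a minimal sketch of a routing rule. Everything in it is an assumption made for illustration: the `Alert` fields, the `route` function and the three destinations (`page`, `ticket`, `log`) are hypothetical and not taken from any particular alerting tool. The point is only that an alert should reach a human immediately when users are affected and someone can actually act on it.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical fields, chosen for illustration only.
    name: str
    user_facing: bool  # does the problem affect users right now?
    actionable: bool   # can a human do something about it immediately?


def route(alert: Alert) -> str:
    """Return a hypothetical destination: 'page', 'ticket' or 'log'."""
    if alert.user_facing and alert.actionable:
        return "page"    # interrupt a human right away
    if alert.actionable:
        return "ticket"  # handle during working hours
    return "log"         # keep for later analysis, interrupt nobody


alerts = [
    Alert("checkout error rate high", user_facing=True, actionable=True),
    Alert("disk 70% full on batch node", user_facing=False, actionable=True),
    Alert("single pod restarted", user_facing=False, actionable=False),
]

for alert in alerts:
    print(f"{alert.name}: {route(alert)}")
```

In this sketch only the first alert would page someone. The other two end up in a ticket queue and a log, which is the kind of outcome a critical review of alerts tends to aim for. Real rules will depend on your services and your teams.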

The payoff is tremendous. Staying on top of alerts means staying ahead: you’ll be able to communicate problems proactively to your customers. Reliability isn’t about being 100% up, but about realizing when you’re down and telling your users proactively. All told, roughly half of the management of SRE is about communications.

Conclusion

Numerous companies answer questions about uptime objectives with “Oh, it’s 100%” or some random number of “nines of availability.” If you’re chasing something that you can’t achieve or that isn’t meaningful to your users, every moment of downtime is a stressful failure.

When you understand that 100% is impossible and define attainable service level objectives, life is good again. Achievable goals enable you to work smarter and foster good communications with users and customers.
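To put numbers on those “nines”, the following sketch converts an availability target into the downtime it permits. The function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration. Your own objective may be measured over a different period, or on something other than plain uptime.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


for target in (99.0, 99.9, 99.99, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, while a 100% target leaves none at all, so any incident counts as a failure.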

We can help establish the culture. Being a good tech organization is not about hitting 100% uptime, but about coming to terms with “shit’s gonna happen” and planning ahead.


Vítek Urbanec is SRE Lead at Polar Squad – he’s been in Reliability Engineering since before it was “Site.”