What’s SRE anyway? Part 2

 
 

The Three Amigos of SRE are back! This time, we’re answering questions about the principles of SRE. You heard all about SRE in part 1, but now you want to understand how it really works and what all those terms mean. Our experts are back to explain ideas such as the SLO, the SLA, and the error budget.


Isn’t SRE sort of a silo? What makes them so different from the classic operations team?

Yair: Like many other ideas, SRE can become a silo; it depends on the implementation. I was part of an SRE team some time ago; the whole company was a huge silo, and of course we were a silo too (a silo inside a silo). If you treat your new SRE team as a regular Ops team, then why bother? You need to understand why and how the SRE personnel will be used. This is not an easy effort, and all parts of the company should be ready to accept the change.

Jaakko: Unlike a classic operations team, SRE is integrated as part of every team. It’s part of the entire software development process from developers’ editors to running software in production. The SRE work itself would be delivered as part of the team’s schedule instead of thrown over the fence to another team. The work can be completed by an SRE, a developer, an ops person etc. depending on how the team has agreed to split the work.

Juhani: Nothing necessarily. There’s a real risk that the SRE team is just a retitled sysadmin/ops team. In my mind, the notion of 70–30 is what separates SRE from sysadmin teams.


What’s the relation between QA and SRE?

Yair: QA should work in tandem with the SRE team; they can exchange many ideas and a lot of information. I would say QA nowadays should understand that resilience is a feature, and should be ready to test it against well-defined objectives or requirements. Both teams benefit from sharing what they know.

Jaakko: QA and SRE go both hand-in-hand and head-to-head. Having a solid foundation for QA helps prevent issues before they hit production, which reduces the burden on production support. However, if the QA process is siloed, then it can hurt teams’ ability to deliver new things fast enough.

My understanding is that QA typically functions in a testing/pre-production environment, while SRE operates at all levels. However, QA considers the user experience, and not just that the service works.

Juhani: QA is about end-user UX, SRE is about dev and business UX.


How do you think we should define SLI, SLO, and SLA?

Yair: SLIs are the measurements we look at, SLOs are the goals we want to achieve for those SLIs, and the SLA is an agreement we have with our customers. We should start with a good idea of our SLIs, since the SLO is derived from them.

Jaakko: I think it’s essential to define SLIs in a way that they accurately represent how well your software is working for the user, while at the same time keeping them measurable. Both the SLA and SLO values should be derived from SLI measurements, with the SLO set stricter than the SLA.

I think the most common way things go wrong is when the SLI doesn’t represent availability accurately enough for a user. That’s why I think it’s important to re-evaluate the SLIs regularly.

Juhani: Google SLx and you’ll have your answer.
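
To make the relationship between the three terms concrete, here is a minimal sketch in Python. It assumes an availability-style SLI (successful requests divided by all requests) and made-up targets of 99.9% for the SLO and 99.5% for the SLA; the names `Targets`, `availability_sli`, and `evaluate` are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Illustrative targets; the numbers used below are assumptions."""
    slo: float  # internal objective, stricter than the SLA
    sla: float  # promise made to customers


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def evaluate(sli: float, targets: Targets) -> str:
    """Compare a measured SLI against the SLO first, then the SLA."""
    if sli >= targets.slo:
        return "meeting SLO"
    if sli >= targets.sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


targets = Targets(slo=0.999, sla=0.995)
sli = availability_sli(good_requests=9_970, total_requests=10_000)
print(f"SLI = {sli:.4f}: {evaluate(sli, targets)}")
```

Because the SLO is stricter than the SLA, the middle state works as an early warning: the team sees a missed objective before the customer agreement is at risk.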


What’s the “error budget”, and when should we use it?

Yair: An error budget is a construct that makes the development team think before releasing new features to production. If the team exceeds its defined error budget within the agreed time window, it stops releasing new features and focuses on reliability instead. The idea is to get the service back within the SLO.

Jaakko: It’s how much leeway you have to break things without breaking your goals for reliability, i.e. 1 - SLO. I’d say its primary purpose is to bring clarity on when to launch risky changes: if you don’t have any remaining error budget, you should postpone the changes. If you have a lot of budget remaining, you could spend it on chaos tests that may expose components you need to improve further, or on running tricky changes where downtime is expected.

Juhani: If it’s production, there should be a plan for error budgeting or an error budget in place. An error budget is a rebranding of the service level agreement.
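
The "1 - SLO" definition and the release rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the budget is counted in failed requests over one measurement window, and the function names and numbers are hypothetical rather than taken from any particular tool.

```python
def error_budget(slo: float) -> float:
    """Error budget as a fraction of all requests: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("this sketch expects an SLO strictly between 0 and 1")
    return 1.0 - slo


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Share of the budget still unspent in the window (negative if overspent)."""
    allowed_failures = error_budget(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_release(remaining: float) -> bool:
    """Risky changes go ahead only while some budget is left."""
    return remaining > 0.0


# Assumed example: a 99.9% SLO over one million requests allows about 1,000 failures.
remaining = budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(f"{remaining:.0%} of the budget left, release allowed: {can_release(remaining)}")
```

With 250 failures, roughly three quarters of the budget is still available, so a risky change or a chaos test fits. At 1,200 failures the remaining share turns negative and the sketch would tell the team to postpone changes and work on reliability.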


SRE brings many new terms. Which one do you think is the most important and why?

Yair: I don’t know if there is one term that is more important; they are all part of an ecosystem that tries to bring a new (or old but forgotten) idea to the table. For me, SRE is a bit like the Go programming language: you use old constructs but stretch them to fit our new demands, which are different but always the same. If I had to choose one, I would pick the SLI, since from those measurements we can derive the SLO.

Jaakko: I’d say the SLO or error budget are the most important concepts in SRE. I think they’re the foundation for SRE. With them, you can derive most other aspects associated with SRE, such as how much you should invest in improving your service reliability.

Juhani: Weeeell, I’d argue it doesn’t necessarily bring anything new to the table, but for me, the 70–30 is the most important thing.


Do you have any questions about SRE or DevOps? Empathy is one of our values, and sharing our knowledge and views is one of our ways to contribute to it.

Now might be a good time to take the first steps in your DevOps transformation. Maybe we can guide you in the right direction. Feel free to contact us or anyone from the Polar Squad team.

Polar Squad