Polar Squad

What’s SRE anyway? Part 1

We are a DevOps company, but we are also deeply invested in the Site Reliability Engineering world. For this reason, we decided to run a series of interviews with three of our experts, in which we share our views on SRE.

Can you give a bit of background about yourself?

Yair: 25 years in the business. I started with VAX/VMS, and now I work all around the cloud-native computing world. I enjoy changing the way people release and maintain software for the better (yes, it can be measured). I am DevOps Lead Germany at Polar Squad. I studied philosophy and history.

Jaakko: I’m a software developer turned SRE/DevOps consultant. I’m interested in proper full-stack development: everything that goes into developing a software service from designing and typing the code to maintaining it in production. Today, I’m the Lead SRE Consultant for Polar Squad.

Juhani: An all-round good guy turned CNCF lead and DevOps/SRE consultant. I’m passionate about helping devs be more productive and secure so that businesses can rely 110% on whatever the product might be. Also, I’m a self-proclaimed YAML wizard with a taste for un-hipster beer.

What is SRE and what does it mean to you?

Yair: Site Reliability Engineering stems from Google. It’s a methodology for maintaining production for high-volume sites. Unlike traditional software teams, SRE teams think about resilience from day one and change the way we deliver and maintain software.

Jaakko: To me, SRE is about finding the balance between reliability and rapid delivery of software by taking a data-driven and holistic approach to software development. “Data-driven”, meaning that decisions to improve reliability or deliver new software are based on measurements rather than guesswork. “Holistic”, meaning that we take all layers related to running the software in production into account.

Juhani: SRE for me is an old concept rebranded, but it comes with a lot of good old stuff that doesn’t age in software development, namely: resilience, reliability, automation, metrics, monitoring, SLIs, SLOs, SLAs, etc. What Google newly brought to the table with SRE, in my mind, is the concept of 70–30, so “sysadmins” spend 70% of their time in dev and 30% putting out fires.
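
To make the SLI and SLO terms above a bit more concrete, here is a minimal sketch in Python. It assumes a simple request-based availability SLI; the function names, the request counts, and the 99.9% target are all made up for illustration and do not come from the interview.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that were served successfully (the SLI)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """Check the measured SLI against the agreed target (the SLO)."""
    return sli >= slo_target


# Hypothetical numbers: 99,950 good requests out of 100,000 in the window.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"Measured availability: {sli:.4%}")
print("SLO of 99.9% met:", meets_slo(sli, slo_target=0.999))
```

The SLI is what you measure, and the SLO is the target you hold that measurement to. An SLA would be the contractual promise layered on top, which this sketch does not model.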

The whole idea of SRE started at Google, and it seems to fit many software companies. What benefits do you think other organizations can get from SRE? Why should, for example, an automotive company use it?

Yair: As I see it, in today’s world almost every company is doing software. Most companies that are doing software need to be resilient in one way or another.

Jaakko: I wouldn’t be surprised if other industries were already following similar practices in their fields. The core of SRE is basically quality control for software services done continuously.

Juhani: If you want to give your business people hard facts about “why SRE?”, just ask them: “How much would it cost if assembly line robots malfunctioned because of a software bug pushed into production?”

What are the business benefits that a well-defined resilience plan can give you?

Yair: I think that the most important feature of any software is reliability. It’s so important that we take it for granted. Any organization dealing with software should treat resilience as an important goal; it should be advocated right from the design stage. When done right, reliability can be measured, but deciding what to measure is hard.

As I see it, software is becoming more and more complex, which demands a different paradigm in how we monitor it and even how we perceive it.

A business built with a sustainable resilience plan can release and maintain better software.

Jaakko: I’d say the benefits of resilience are comparable to the benefits of security. Both are ways to mitigate the risk of losing business and trust. Similarly, having a resilience plan is a way to attract customers that expect it. Some customers, especially enterprise ones, require you to reach a certain level of reliability before you can call them your customers.

Juhani: Can you measure the value of trust? I think it’s the only thing to strive for.

What’s the relation between DevOps and SRE?

Yair: Google says: “class SRE implements DevOps”. The DevOps movement does not explicitly define success criteria; it is like an abstract class or interface in programming. It defines the overall behavior of the system, but the implementation details are left up to the author.
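
Yair’s abstract-class analogy can be sketched in code. The following Python example is only an illustration of the analogy, not anything Google or Polar Squad publishes: the class names, the single method, and the 99.9% target are hypothetical.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': states what must be decided, not how to decide it."""

    @abstractmethod
    def is_reliable_enough(self, good_requests: int, total_requests: int) -> bool:
        """Decide whether the service is currently reliable enough."""


class SRE(DevOps):
    """One concrete implementation: reliability is judged against an explicit target."""

    def __init__(self, slo_target: float) -> None:
        self.slo_target = slo_target  # hypothetical target, e.g. 0.999

    def is_reliable_enough(self, good_requests: int, total_requests: int) -> bool:
        if total_requests == 0:
            return True  # assumption: no traffic means nothing has failed
        return good_requests / total_requests >= self.slo_target


team = SRE(slo_target=0.999)
print(team.is_reliable_enough(good_requests=99_950, total_requests=100_000))
```

Trying to instantiate `DevOps()` directly raises a `TypeError`, because it leaves the success criteria undefined. `SRE` fills them in, which is the point of the analogy.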

Jaakko: To me, they’re attempting to achieve the same thing, but from a different perspective. In both DevOps and SRE, the idea is to make the delivery of software services as seamless as possible without sacrificing the quality of service, and apply improvements to all areas of the service lifecycle from development to production. Both ideas share pretty much the same practices, but SRE specifically brings in its own principles for how to handle service quality.

Juhani: DevOps is a framework and SRE is a toolset.


Join us next time when we will be answering questions about SRE principles!