DevOps, but it’s just the tools

 
 

What does it look like when you adopt DevOps tools but don’t change your organization or ways of working? This is what a lot of organizations do, but it’s not the best way to get the full benefits of DevOps. This blog post shows what a tools-first approach can look like.

DevOps is not a well-defined concept, so different people attach different meanings to it. Some people think of DevOps as just tools: Implement a CI/CD pipeline, set up a Kubernetes cluster and some time-series-based monitoring, and now you’re “doing DevOps.” You might also set up a “DevOps team” with “DevOps engineers” who are responsible for setting up and maintaining all this infrastructure.

While there are benefits to adopting modern tools, you won’t get the full benefits out of DevOps if that’s all you do. The tools are crucial, but they only enable the things that provide real value. Understanding what the tools are for and breaking out of the old mindset of sysadmins and development teams is essential.

But what does it look like in practice when you don’t just adopt DevOps tools but also change your mindset and ways of working? To make this more concrete, let’s look at the evolution of monitoring in a fictional company. While this example is fictional, I’ve based it on behaviors and patterns I’ve observed in various real-world projects.

Phase 1: Traditional monitoring: Nagios and states

At the start of our story, work at the company is neatly organized into development and operations teams. There are clear responsibilities: The developers write software and hand it over to the operations team to run. The developers aren’t too familiar with what happens on the operations side. The sysadmins in the operations team see the software in terms of processes running on a server.

 

The good old Nagios interface.

 

The operations team uses an application called Nagios to monitor the health of the servers and applications that they are responsible for operating. Nagios uses states and traffic light colors to represent the health of systems:

  • OK (green)

  • WARNING (yellow)

  • CRITICAL (red)

There’s also a fourth UNKNOWN state which means Nagios can’t determine the state. When something is broken (WARNING or CRITICAL), Nagios sends out alerts to sysadmins, who investigate the issue and implement fixes until everything is back in the OK state.
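
To make the state model a bit more tangible: Nagios checks are typically small plugin programs that report their result through an exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) plus one line of text. Below is a minimal sketch of what such a check could look like for disk usage. The thresholds and the function names (classify_usage, check_disk, main) are made up for illustration and are not taken from any real plugin.

```python
import shutil

# Exit codes that Nagios plugins use to report a state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_usage(used_percent, warn_at=80.0, crit_at=90.0):
    """Map a usage percentage to a state. The thresholds are illustrative."""
    if used_percent is None or not 0.0 <= used_percent <= 100.0:
        return UNKNOWN
    if used_percent >= crit_at:
        return CRITICAL
    if used_percent >= warn_at:
        return WARNING
    return OK


def check_disk(path="/"):
    """Return (state, message) for the filesystem that contains path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_percent = 100.0 * usage.used / usage.total
    state = classify_usage(used_percent)
    return state, f"DISK {STATE_NAMES[state]} - {used_percent:.1f}% used on {path}"


def main():
    """Print the status line and return the exit code for the monitoring system."""
    state, message = check_disk("/")
    print(message)
    return state

# A real plugin script would end with: raise SystemExit(main())
```

Notice how little information survives: a whole system is reduced to one of four states and a single line of text. That is enough for a sysadmin deciding whether to react, but it says little about how the application behaves over time.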

The sysadmins are monitoring and alerting on things like:

  • Application server health

  • The health of processes running on the application servers

  • Resource usage (CPU, memory, and disk)

Sysadmins in the operations team are the only ones looking at the monitoring dashboards. The developers don’t consider it part of their job to know how the software they write works in production, and Nagios doesn’t provide good actionable information for them anyway.

There are a couple of feedback loops:

  • Developers get feedback by running the software locally on their own machines and seeing what works and what doesn’t. The feedback comes from observing the software and running automated tests.

  • Sysadmins get feedback from the monitoring system and react to it by fixing things as needed.

What’s missing is a feedback loop that covers multiple teams.

Phase 2: Modern tools, old ways of thinking

Time has passed, and the operations team is now called the DevOps team. Its members are no longer sysadmins but rather DevOps Engineers. The application is now running on Kubernetes, and the DevOps Engineers have deployed Prometheus for monitoring. They are also using Grafana to visualize the time series data collected by Prometheus. Alertmanager is used to send alerts to DevOps team members whenever something breaks.

 

The Grafana interface shows some time series. (Linux Screenshots, CC-BY 2.0)

The DevOps Engineers are monitoring and alerting on things like:

  • Server health in the Kubernetes cluster

  • The health of containers in the Kubernetes cluster

  • Resource usage (CPU, memory, and disk)

DevOps Engineers in the DevOps team are the only ones looking at the monitoring dashboards. The developers don’t consider it part of their job to know how the software they write works in production, even though Prometheus could provide helpful information on application performance in production.

People in the organization are still quite siloed and don’t communicate that much. The feedback loops are much the same as before – what’s missing are feedback loops that would cover the whole product lifecycle. For instance, people outside the DevOps team don’t look at the data collected by the monitoring system. There’s no feedback loop from how the software behaves in production to people involved in the earlier stages of product development. Potential production issues can remain hidden until they grow into bigger problems later.

When incidents occur, the DevOps team works independently to resolve them, but there are no follow-up actions or postmortem meetings that would involve anyone outside of the DevOps team. The DevOps team does not systematically notify development teams about recurring incidents caused by bugs in the software. There’s an on-call rotation, but only DevOps Engineers are in that rotation.

Phase 3: Change of mindset

After using the new infrastructure for a while, the teams have learned how to use it better. DevOps is no longer seen as the responsibility of just one team, so there’s no DevOps team anymore. Instead, there’s a platform team responsible for providing Kubernetes as a productized platform for development teams.

Development teams are using the monitoring stack by adding custom metrics to the software they write. These metrics are automatically picked up by the monitoring system and used by developers to create dashboards that give feedback about how the software they write performs in production.
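
To give a rough idea of what a custom metric involves, here is a minimal sketch that uses only the Python standard library. It keeps a counter inside the application and renders it in the plain-text format that Prometheus scrapes. The Counter class, the shop_orders_total metric, and the process_order function are all hypothetical and exist only for this illustration. A real team would most likely use an official Prometheus client library rather than writing this by hand.

```python
import threading


class Counter:
    """A counter that only goes up, tracked separately per label combination."""

    def __init__(self, name, help_text, label_names=()):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(str(labels[name]) for name in self.label_names)
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        """Render the counter in the Prometheus text exposition format.

        Simplification: label values are not escaped here.
        """
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            items = sorted(self._values.items())
        for key, value in items:
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            label_part = f"{{{pairs}}}" if pairs else ""
            lines.append(f"{self.name}{label_part} {value}")
        return "\n".join(lines) + "\n"


# Hypothetical application metric: orders processed, split by outcome.
ORDERS_TOTAL = Counter(
    "shop_orders_total", "Orders processed by the application.", ["outcome"]
)


def process_order(order):
    """Stand-in for real business logic that records its own outcome."""
    outcome = "success" if order.get("items") else "rejected"
    ORDERS_TOTAL.inc(outcome=outcome)
    return outcome
```

If the application serves the output of ORDERS_TOTAL.render() over HTTP, the monitoring system can collect it on every scrape. The point is that the metric describes something the developers care about (orders, not CPU usage), so the resulting dashboards answer questions about the product rather than just the servers.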

The people responsible for maintaining the infrastructure collaborate with the development teams frequently. Whenever there’s an incident, all relevant stakeholders hold a postmortem where they try to figure out what could have been done to prevent the incident from happening in the first place.

The most important thing that has changed is that information and feedback now flow more freely within the organization. The tools are only a means to an end, although they are important as well: It would be more challenging to implement the same feedback loops using older tools like Nagios.

Where to start?

DevOps is more about information flows and breaking down silos than about tools. While the tools are crucial, they are only enablers for the things that provide real value. The first thing to do is find common ground between people in traditional ops roles and development teams. What sort of useful information could people responsible for operations communicate to developers, and how?

Joint production meetings between operations and developers could be one starting point. Initially, the agenda can be sharing pain points and seeing if another participant – perhaps from the “other” side – could help to solve that pain point.

What I described above is just one example of how operations might evolve. This example is not a prescriptive maturity model or the only way to accomplish good results. There are many other ways to achieve the same things. The most important concept here is feedback loops: They should operate beyond the scope of just one team.

 
 

Risto Laurikainen is a DevOps Consultant with a decade of experience in building cloud computing platforms.

 