Polar Squad

Polar Squad & Keyko: Working together to create a reliable new web!

Web3 + Site Reliability Engineering = Ecosystem Reliability Engineering

The road to novel solution development is often long and arduous. So when a like-minded traveler offers to share the burden, you jump at the chance to split the load. 

Such was the case when the paths of Keyko and Polar Squad crossed. With similar outlooks on the world of tech and a shared desire to improve the status quo, we both saw in each other a kindred spirit. 

Keyko, focused on Web3 development, and Polar Squad, focused on DevOps & Site Reliability Engineering, may seem like strange bedfellows at first glance. But we both subscribe to the same goal: to make our respective domains better, and have fun doing it. 

And so we set forth on this journey together. Two upstart talismans. Two unique skill sets. One united goal: SRE for Web3, aka Ecosystem Reliability Engineering.

What is SRE?

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. According to Ben Treynor, founder of Google's Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.”

One could see SRE as a very specific implementation of the DevOps methodology. Instead of focusing on how fast new features are deployed to production, SREs focus on how reliable those features are once they run on real infrastructure in production environments.

What is Web3?

Web3 applications, or DApps, are built on decentralized peer-to-peer networks like Ethereum and Celo. Instead of being run by a sole entity, these networks are built, operated, and maintained by their user communities. They’re self-organizing and lack a central organization or authority that oversees decision-making. Additionally, they’re open-source, meaning it is possible for anyone to build upon this shared infrastructure. Some say that Web3 treats the internet itself as a shared infrastructure.

If it’s decentralized, is it already reliable by design?

If no one has full control over all the infrastructure, a very common way of thinking about it would be to believe it’s even more reliable. One would assume that there is no one single point of failure and would expect reliability to be a feature of the DApps.

However, reliability is not just the availability of certain services or of the infrastructure used by the application. It is also a property that has to be designed into the application from the start. The fact that there is no single organization that controls everything actually makes reliability even harder.

Both Polar Squad and Keyko believe that reality is different. With all the current novel technologies, such as cloud-native computing and infrastructure as code, the traditional web is moving away from a single point of failure design paradigm. When it’s so easy to create networks and deploy computing power, companies will go the extra mile to have a redundant application. Now we also need to incorporate elements of SRE into governance protocols so that these decentralized solutions can scale and entire ecosystems can become resilient, from soup to nuts.

What are the risks that are specific to Web3 and Reliability?

The beauty and complexity of this new unstoppable machine is that it adds a totally new dimension to the traditional operation of an existing system. In fully decentralized networks, where there is no single entity or actor controlling the destiny of the whole ecosystem, there exist the users and the governance they provide. The reliability is in some way displaced from the technology side to the operational side via this new governance setup.   

A typical decentralized governance framework includes a list of stakeholders and a set of on-chain actions, each of which is executed only once it has been confirmed by a threshold of stakeholders that is defined upfront. With this, you achieve more democratic and transparent decision making. But the price paid can be significant. Stakeholders of a decentralized governance committee could have different interests or priorities, they could be in different timezones, and they could be breathing or not (hello AI!). Depending on the nature of a governance change, it could affect the effective reliability of the solution built on top of a Web3 network.
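
To make that price a little more concrete, here is a minimal sketch in Python. It assumes every stakeholder independently responds within the decision window with the same probability, which is a simplification. The function name and the numbers are illustrative only and do not come from any specific governance framework.

```python
from math import comb


def quorum_probability(signers: int, threshold: int, availability: float) -> float:
    """Chance that at least `threshold` of `signers` confirm an action in time.

    Assumption (not from any real protocol): each signer responds
    independently with probability `availability`.
    """
    if not 0 <= threshold <= signers:
        raise ValueError("threshold must be between 0 and the number of signers")
    # Sum the binomial probabilities for threshold, threshold + 1, ..., signers responders.
    return sum(
        comb(signers, k) * availability**k * (1 - availability) ** (signers - k)
        for k in range(threshold, signers + 1)
    )


if __name__ == "__main__":
    for needed in (3, 4, 5):
        chance = quorum_probability(signers=5, threshold=needed, availability=0.9)
        print(f"{needed} of 5 signers: {chance:.3f}")
```

Under these assumptions, with five signers who are each reachable 90% of the time, requiring all five confirmations succeeds only about 59% of the time, while requiring three of five succeeds more than 99% of the time. The threshold is therefore a reliability parameter as much as a governance one.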

 

Site Reliability or Ecosystem Reliability Engineering?

Both Polar Squad and Keyko think that the emergent ecosystem approach requires a change in perspective. This means that the idea of SRE, or Site Reliability, may not fit here. What we need is a new look at how things should be done. Hence we’ve coined the term ERE: Ecosystem Reliability Engineering!

We think that the idea of a site becomes redundant when there is no longer one entity governing the infrastructure and data that is used by the new standard. Instead, reliability must be ensured across all sites and the governance that defines them across the ecosystem.

The goal of Ecosystem Reliability Engineering is to place more importance on how the ecosystem governs the decentralized environment, including the following:

  • Community Voting. How the community discusses and votes on the different actions to be taken that ultimately allow the ecosystem to evolve.

  • Consensus Agreements. How decentralized is the process? What are the necessary conditions to be met in order to confirm an agreement able to modify the current behavior of the system? Do we need a ⅔ majority of signatures for approval, or just a 51% stake threshold? 

  • Community Decisions Execution. How the ecosystem executes the decisions taken by the community. After consensus and approval, are the decisions executed automatically? Or is there a centralized manual part where someone needs to pull the trigger? 

  • Governance Monitoring. How do you validate the ecosystem participants that can promote and vote for changes related to the ecosystem operations?

  • Solid Software and Release Process. How do you guarantee the quality of the software that controls the core of the networks? Is there an automated process to validate changes, and/or should there be a manual audit to oversee every core change?
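
The Consensus Agreements question above (a ⅔ majority of signatures versus a 51% stake threshold) can be illustrated with a small Python sketch. Everything in it is hypothetical: the function names, the stakeholder names, and the stake values are made up for illustration and are not taken from any real protocol.

```python
from fractions import Fraction


def approved_by_signatures(approvals: int, total_signers: int,
                           required: Fraction = Fraction(2, 3)) -> bool:
    """One signer, one vote: approve when the share of signatures meets `required`."""
    if total_signers <= 0:
        raise ValueError("total_signers must be positive")
    return Fraction(approvals, total_signers) >= required


def approved_by_stake(stakes: dict[str, int], approvers: set[str],
                      required: Fraction = Fraction(51, 100)) -> bool:
    """Stake-weighted vote: approve when the approving stake meets `required`."""
    total = sum(stakes.values())
    if total <= 0:
        raise ValueError("total stake must be positive")
    # Only count approvers that actually hold stake.
    approving = sum(stakes[name] for name in approvers if name in stakes)
    return Fraction(approving, total) >= required


if __name__ == "__main__":
    # Hypothetical stake distribution: one large holder, two small ones.
    stakes = {"alice": 60, "bob": 20, "carol": 20}
    approvers = {"alice"}
    print(approved_by_signatures(len(approvers), len(stakes)))  # one of three signers
    print(approved_by_stake(stakes, approvers))                 # 60 of 100 stake
```

In this made-up example a single large stakeholder fails the signature rule but passes the stake rule, which is why the choice of rule says a lot about how decentralized the process really is.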

A technological or cultural challenge? Maybe both?

We believe that reliability is a fundamental feature for any application or service, and unlike older models, it is something that should be implemented from the design stage in an agile way. In order to be effectively implemented into the process, the traditional SRE methodology requires several key considerations:

  • Reasonable Monitoring -  How do we design and set up the right monitoring that would fit Web3 applications? 

  • Incident Response - Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. 

  • Embracing Risk - Being aware of the fact that building a 100% reliable complex system is impossible or prohibitively expensive. By embracing the risk, we acknowledge the fact that our system will fail. This allows us to identify and mitigate risks.

  • Eliminating Toil - How do we remove all the boring, repetitive work and put more focus on the cool and important things?

  • Simplicity - It’s much easier to maintain a simple system, but how can we achieve that as a cultural and technological goal? This is especially difficult in decentralized systems...
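
Embracing Risk is usually made tangible in SRE through an error budget: the amount of unreliability that a chosen availability target still allows. The Python sketch below shows the arithmetic. The 99.9% target and the 30-day window are example values, not recommendations, and the function name is ours.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed downtime in the window for an availability target `slo`.

    `slo` is a fraction such as 0.999 (99.9%); both arguments are example inputs.
    """
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(slo=0.999, window_days=30)
    print(f"Allowed downtime: {budget:.1f} minutes")
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime to spend. In a decentralized setting, a governance action that stalls while waiting for confirmations could consume that same budget.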

Decentralized systems require a cultural change motivated by the concept of no single entity defining the next steps and availability of the solution. As we discussed before, the classical SRE pillars require augmentation to include the situation where there is not a single entity controlling the system. In Web3 applications, ERE & Governance are here to help by trying to mitigate the complexity of many independent actors driving the same solution in a way that allows for the best reliability possible across the ecosystem.