
Your software’s reliability on the public cloud: key themes (DevOps)

Traditional ISVs are often split into operations and development teams with clear boundaries: operations deploys and manages the infrastructure, while development focuses on building and improving the software.

The question remains: who is responsible for the overall reliability?

Reading time 3 minutes. Published: 25 March 2024

Reliability on the public cloud

Traditionally, operations provides a ‘Virtual Machine’ (VM) with the necessary SDKs and runtimes installed. However, as ‘Platform as a Service’ (PaaS) and cloud-native technologies gain popularity, the lines between responsibilities are starting to blur. Reliability has become a shared responsibility, and a big one. It spans the cloud provider and all teams working on the product, DevOps included.

Poor infrastructure decisions can result in performance and availability issues. If the infrastructure is not designed in line with the application’s requirements, the application’s behaviour can degrade the infrastructure’s availability. The opposite is true as well: decisions about the software architecture can impact the infrastructure. And then there is the overall architecture. Your architecture design determines what SLAs you get and what SLA you can promise to your customers.
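To make the link between architecture and SLA concrete, here is a minimal sketch of how availability figures combine. The percentages are illustrative assumptions, not published SLAs for any Azure service, and the redundancy formula assumes that the copies fail independently and that failover itself never fails.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of N independent copies where one healthy copy is enough."""
    return 1 - (1 - availability) ** copies


# Illustrative figures only (assumptions, not real service SLAs).
web_tier = 0.9995
database = 0.9999

# A request needs both the web tier and the database.
single_region = serial_availability([web_tier, database])

# The same stack deployed in two regions, either of which can serve traffic.
two_regions = redundant_availability(single_region, copies=2)

print(f"single region: {single_region:.4%}")
print(f"two regions:   {two_regions:.6%}")
```

Chaining components lowers the figure you can promise, while redundancy raises it. That is why the SLA you offer follows from the design rather than from any single service.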

With this shared responsibility in mind, this article dives into the key topics and technologies for reliability.

1. Azure Advisor

Azure Advisor has proven to be valuable for reliability. It provides recommendations on cost, operational excellence, performance, security, and reliability, not just for VMs but for a range of Azure technologies such as AI, compute, networking, and databases.

2. Chaos engineering

Testing during the development cycle, e.g. unit testing or smoke testing, is a common best practice. But how do you test infrastructure? How do you ensure that your design remains highly available, resilient, and reliable?

Chaos engineering is your answer! This practice induces controlled failures to determine how resilient a system is. What if one of your VMs has memory issues? What if DNS becomes unavailable? These are common scenarios, and they can be caused by Azure issues, faulty code, or a human mistake while maintaining the infrastructure.

Azure Chaos Studio facilitates this process by letting you create experiments on different types of resources, individually or simultaneously. It helps you understand your architecture’s resilience, and it shows you how well your monitoring and internal processes respond when something fails.
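The idea behind such an experiment can be sketched in a few lines of plain Python. This is not the Azure Chaos Studio API; it is an in-process illustration of controlled fault injection, and every name in it (the resolver, the hostname, the failure rate) is a hypothetical stand-in.

```python
import random


class DnsUnavailable(Exception):
    """Simulated fault: name resolution failed."""


def make_flaky_resolver(failure_rate, rng):
    """Return a hypothetical resolver that fails a controlled fraction of calls."""
    def resolve(hostname):
        if rng.random() < failure_rate:
            raise DnsUnavailable(hostname)
        return "10.0.0.4"  # placeholder address, not a real lookup
    return resolve


def resolve_with_retry(resolve, hostname, attempts=3):
    """The behaviour under test: retry a few times before giving up."""
    for attempt in range(attempts):
        try:
            return resolve(hostname)
        except DnsUnavailable:
            if attempt == attempts - 1:
                raise


def run_experiment(failure_rate, attempts, calls=1000, seed=42):
    """Inject faults and return the fraction of calls that still succeeded."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    resolve = make_flaky_resolver(failure_rate, rng)
    succeeded = 0
    for _ in range(calls):
        try:
            resolve_with_retry(resolve, "db.internal.example", attempts)
            succeeded += 1
        except DnsUnavailable:
            pass
    return succeeded / calls


without_retry = run_experiment(failure_rate=0.3, attempts=1)
with_retry = run_experiment(failure_rate=0.3, attempts=3)
print(f"success without retry: {without_retry:.1%}")
print(f"success with retry:    {with_retry:.1%}")
```

The shape is the same as a real experiment: state a hypothesis ("retries absorb intermittent DNS failures"), inject the fault in a controlled way, and measure whether the system behaves as expected.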

3. Disaster recovery

You may have promised RPOs (recovery point objectives) and RTOs (recovery time objectives) to your customers and/or internal stakeholders. But can you keep that promise? Let’s look at a basic example of a database recovery. In an SLA, you agreed with your customer that a database will be restored within one hour if a restore is ever needed. However, the time required for a database recovery was last tested several years ago, and the database has grown since then. Is it still feasible to perform a database recovery in one hour?
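A back-of-the-envelope check shows how quickly growth can break that promise. The sketch below assumes restore time scales linearly with database size plus a fixed overhead. The sizes, throughput, and overhead are made-up numbers, and an estimate like this is a prompt to run a real restore test, not a substitute for one.

```python
from datetime import timedelta


def estimate_restore_time(size_gb, throughput_gb_per_min,
                          overhead=timedelta(minutes=10)):
    """Rough estimate: fixed overhead plus size divided by restore throughput."""
    return overhead + timedelta(minutes=size_gb / throughput_gb_per_min)


def meets_rto(size_gb, throughput_gb_per_min, rto):
    """True if the estimated restore time fits within the promised RTO."""
    return estimate_restore_time(size_gb, throughput_gb_per_min) <= rto


rto = timedelta(hours=1)

# Hypothetical figures: the size at the last restore test versus the size today.
size_at_last_test_gb = 200
size_today_gb = 450
throughput_gb_per_min = 5  # assumed to be unchanged since the last test

for label, size in [("last test", size_at_last_test_gb), ("today", size_today_gb)]:
    estimate = estimate_restore_time(size, throughput_gb_per_min)
    verdict = "within RTO" if meets_rto(size, throughput_gb_per_min, rto) else "exceeds RTO"
    print(f"{label}: {estimate} ({verdict})")
```

With these assumed numbers, the restore that fit in 50 minutes at the last test would now take 100 minutes, well past the one-hour commitment.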

Fortunately, we see more and more customers on Microsoft Azure using Infrastructure as Code, such as ARM templates, Bicep, or third-party solutions like Terraform. Not only do these tools provide consistent deployments and enable GitOps practices, they also provide a solid disaster recovery scenario for your infrastructure.

4. Deploy infrastructure in different regions

What if there is a regional outage? Do you have resources set up in various regions to ensure availability? Not many organizations do, as it significantly increases operational expenses for resources that aren’t actively used. You will need escalation paths in place, and you should strive to be able to deploy your infrastructure to another region with a single click.

Are you using Azure directly through Microsoft, or perhaps through a Cloud Solution Provider (CSP)? Either way, is your team or partner capable of assisting in case of an emergency? Good support contracts (with Microsoft) can provide you with insights (and maybe even ETAs) to help you decide if failing over to a new region is the way to go. And partners can provide capable staff to help you get back on your feet.

Reliable software on the cloud

To wrap up, there are many facets to reliability on the public cloud. The most important is the shared responsibility between DevOps and your cloud provider. With good collaboration and the right technologies, you can deliver reliable software to your customers.

Contact Romy Balver

Let's make it a shared responsibility!

We can help you manage your application with our Managed Services. 
