Demystifying the Azure SLA

Each Azure service has its own SLA with associated terms and conditions, limitations and service credits. But what does this mean exactly? In this article we will explain the Azure SLA and what this means for your organization.

The Azure Service Level Agreement (SLA) describes Microsoft’s commitments for uptime and connectivity for individual Azure Services.

Some (free) services, such as Azure DevTest Labs, don’t have an SLA. Others, such as Virtual Machines, only qualify for an SLA in a specific configuration. SLA levels range from a lowly 95% for single-instance Virtual Machines using Standard HDD disks to 99.99% for multi-instance Virtual Machines deployed across two or more Availability Zones in the same Azure region.

SLAs are updated regularly and therefore always carry a version number.

(A very cool Azure SLA board is available at: https://azurecharts.com/sla)

 

SLA Levels

At the time of writing, only one Azure service has a 100% SLA: Azure DNS, which underpins many other services such as Azure AD. Microsoft guarantees that ‘valid DNS requests will receive a response from at least one Azure DNS name server 100% of the time’. Microsoft accomplishes this by offering four DNS name servers in different geographies using different top-level domains: .com, .net, .org and .info (for example ns{x}-{xx}.azure-dns.net). To qualify for the 100% SLA you need to configure your app or service to use all four DNS servers.

Percentages matter

SLA uptime percentages always look like big numbers. 99.5% seems like a good SLA, and 99.999% (the magic five nines) might seem excessive. But the picture changes considerably once you look at the associated maximum downtime, as few users will accept your app experiencing unplanned downtime for more than an instant. In my humble opinion, any design offering a service level below 99.95% or 99.99% should not be used for production environments.

Uptime percentage and maximum downtime:

99.5% = 7 minutes 12 seconds downtime per day
99.9% = 1 minute 26 seconds downtime per day
99.95% = 43 seconds downtime per day
99.99% = 8 seconds downtime per day
99.999% = 0.8 seconds downtime per day (6 seconds per week)

Source: https://uptime.is
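
If you want to reproduce these numbers for your own SLA targets, a minimal Python sketch could look like the one below. The function name max_downtime is just an illustration and not part of any Azure tooling; it simply multiplies the length of the period by the allowed downtime fraction.

```python
from datetime import timedelta


def max_downtime(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the longest total outage that still meets the uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    # Allowed downtime is the fraction of the period that is not covered.
    return period * (1.0 - uptime_percent / 100.0)


# Print the allowed downtime per day and per 30-day month for common SLA levels.
for sla in (99.5, 99.9, 99.95, 99.99, 99.999):
    per_day = max_downtime(sla, timedelta(days=1)).total_seconds()
    per_month = max_downtime(sla, timedelta(days=30)).total_seconds() / 60
    print(f"{sla}% -> {per_day:.1f} s per day, {per_month:.1f} min per 30 days")
```

Looking at the same percentage per month rather than per day is often more useful, because the Azure SLAs are measured per calendar month.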

 

Composite SLA

When your app uses multiple (Azure) services, you need to look at the SLA for each service. What would you guess is the expected maximum downtime for an app that consists of a Web App (99.95% SLA) and a SQL Database (99.99% SLA)?


If either the Web App or the SQL Database fails, the whole application fails. The probability of each service failing is independent, so the composite SLA for this application is:

  • 99.95% (Web App) × 99.99% (Database) = 99.94%

That's lower than either of the individual SLAs.
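
The multiplication above can be sketched in a few lines of Python. This is only an illustration: the helper name composite_sla is made up for this article, and it assumes the services fail independently and that the app needs all of them at the same time.

```python
from math import prod


def composite_sla(*sla_percents: float) -> float:
    """Composite SLA in percent for services that must all be up together."""
    # Convert each percentage to a probability, multiply, convert back.
    return prod(p / 100.0 for p in sla_percents) * 100.0


web_app = 99.95       # Web App SLA used in this example
sql_database = 99.99  # SQL Database SLA used in this example

print(f"Composite SLA: {composite_sla(web_app, sql_database):.2f}%")
```

Every extra service you add to the chain goes into the same product, so the composite SLA can only go down as the chain gets longer.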

By adding an independent fallback path, you can greatly improve availability. What do you think will happen when we add Queue Storage, with its lower 99.9% SLA, as a fallback for the SQL Database?

In this example, the Web App is still available even when it can’t connect to the SQL Database. The app will only fail when both the SQL Database and Queue Storage are down at the same time. The probability of a simultaneous failure is 0.0001 × 0.001, so the SLA for this combined path is:

  • SQL Database or Queue Storage = 1.0 − (0.0001 × 0.001) = 99.99999%

Which makes the Composite SLA:

  • Web app and (database or queue) = 99.95% × 99.99999% = ~99.95%
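
To play with fallback paths yourself, the calculation can be sketched as follows. The helper names all_required and any_sufficient are hypothetical, and the sketch assumes that failures are independent, which real outages do not always respect.

```python
def all_required(*availabilities: float) -> float:
    """Availability when every listed component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def any_sufficient(*availabilities: float) -> float:
    """Availability when one working component out of the list is enough."""
    all_down = 1.0
    for availability in availabilities:
        # Probability that this component is down as well.
        all_down *= 1.0 - availability
    return 1.0 - all_down


# SQL Database (99.99%) with Queue Storage (99.9%) as the fallback path.
data_path = any_sufficient(0.9999, 0.999)
# The Web App (99.95%) still has to be up for the app to work at all.
app = all_required(0.9995, data_path)

print(f"Data path: {data_path:.5%}")
print(f"Whole app: {app:.2%}")
```

The result shows that the app can never be more available than its weakest mandatory component, in this case the Web App.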

This example and a multi-region calculation are available in the Azure Architecture Center. You can also find an example of an extremely highly available design with a 99.999999% SLA using four Azure regions.

 

SLA Limits

As you might expect, Microsoft has set limits to what the Azure SLA will cover. Anything outside of Microsoft’s reasonable control, such as natural disasters, war, acts of terrorism, riots or government actions, is not covered by the SLA. Any network or device failure external to the Azure data centers, including at your own site or between your site and the Azure data center, is also excluded.

 

Service Credits

Most SLAs offer a Service Credit, which is the percentage of the Applicable Monthly Service Fees credited to you following Microsoft’s claim approval. Some services, such as Virtual Machines, offer up to 100% service credits when the monthly uptime percentage falls below 95%, and 25% when uptime falls below 99%. Other services, such as Azure Functions, offer a maximum of 25% service credits when the monthly uptime percentage falls below 99%.
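
As a rough illustration of how a credit tier is picked, consider the Python sketch below. The tier table EXAMPLE_TIERS and both function names are hypothetical and only serve as an example; the uptime formula assumes the usual "available minutes minus downtime" definition. Always check the current SLA document of the specific service for the real thresholds and definitions.

```python
def monthly_uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Uptime percentage for one month, based on total and downtime minutes."""
    return (total_minutes - downtime_minutes) / total_minutes * 100.0


def service_credit_percent(uptime_percent: float, tiers) -> int:
    """Return the highest credit whose uptime threshold was missed, else 0."""
    return max(
        (credit for threshold, credit in tiers if uptime_percent < threshold),
        default=0,
    )


# Hypothetical tiers: (uptime threshold in percent, service credit in percent).
EXAMPLE_TIERS = [(99.9, 10), (99.0, 25), (95.0, 100)]

# A 30-day month with 50 minutes of downtime.
uptime = monthly_uptime_percent(30 * 24 * 60, 50)
credit = service_credit_percent(uptime, EXAMPLE_TIERS)
print(f"Uptime {uptime:.3f}% -> {credit}% service credit")
```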

The SLA only offers service credits and does not cover any additional damages your organization might have incurred while your app was down. It is therefore by no means a substitute for adding high availability or resiliency to your Azure design.

To request a service credit, you’ll need to submit a claim in the form of a support ticket with all required information by the end of the calendar month following the month in which the incident occurred. You will also need to add the Outage incident identifier from the Service Health page. Your Cloud Solution Provider (CSP) partner can assist you with this process.

 

Increasing availability

The Azure SLA is an excellent starting point when considering your Azure design. By looking at the requirements for a specific service level for a specific Azure service, you get an insight into how Microsoft intended the service to be used. Take Cosmos DB, for example: by default it offers an impressive service level of 99.99%, but when you configure multiple Azure regions as writable endpoints for a database account, Cosmos DB offers a 99.999% SLA for both read and write availability.

You can improve availability by using several native Azure options. For Virtual Machines you can use Availability Sets. An Availability Set is a logical grouping of VMs that allows Azure to understand how your app is designed to provide for redundancy and availability.

For services like VM Scale Sets (used in Azure Kubernetes Service), Azure Backup, Storage Accounts, Key Vaults and many more, you can use Availability Zones. An Availability Zone is a high-availability offering that protects your applications (and data) from datacenter failures. Availability Zones are unique physical locations within an Azure region, and each enabled region has at least three separate zones. Each zone is made up of one or more datacenters equipped with independent power, cooling, and networking.

You can also scale your app worldwide and use the cross-region load balancer to enable some awesome geo-redundant high availability scenarios.

 

Monitoring Availability

You can monitor and track the global Azure status on the Azure Status page. For individual Azure components you can access the Service Health page of that specific resource. The Health page contains ongoing service issues, upcoming planned maintenance, explanations and relevant advisories. Events on the Health page are stored for 90 days. Resource Health provides information about the health of individual cloud resources, such as a specific virtual machine instance.

For Web Apps you can also use Application Insights to calculate and report the SLA using the Availability -> SLA Report option. You can configure your own criteria, such as the web test parameters, failure thresholds and outage windows. The SLA Report also allows you to break down outages by time or even by location.


To continue your availability journey, visit the Azure Architecture Center or attend one of the free Azure Events hosted by Intercept. Topics include Application Insights, Fundamentals for ISVs, Cloud Native in a Day, AKS Deepdive or a DevOps workshop.

Written by

Rinie Huijgen

CTO at Intercept