In this article, we will talk about the Azure SLA and the following topics:
- SLA Levels;
- Percentages that matter for your SLA;
- Composite SLA;
- SLA limitations;
- Service Credits;
- How to increase the availability;
- The availability of monitoring.
The Azure Service Level Agreement (SLA) describes Microsoft’s commitments for uptime and connectivity for individual Azure Services.
Each Azure service has its own SLA with associated terms, limitations, and service credits. Some (free) services, such as Azure DevTest Labs, don’t have an SLA. Other services, such as Virtual Machines, require a specific configuration to qualify. The Virtual Machines SLA ranges from a lowly 95% for single-instance Virtual Machines using Standard HDD disks to 99.99% for multi-instance Virtual Machines deployed across two or more Availability Zones in the same Azure region.
SLAs are updated regularly and therefore always have a version number.
A cool Azure SLA board is available here.
SLA Levels
At the time of writing, only one Azure service has a 100% SLA: Azure DNS, which is at the core of many services such as Azure AD. Microsoft guarantees that ‘valid DNS requests will receive a response from at least one Azure DNS name server 100% of the time’. Microsoft accomplishes this by offering four DNS name servers in different geos using different top-level domains: .com, .net, .org, and .info (for example, ns{x}-{xx}.azure-dns.net). To qualify for the 100% SLA, you need to configure your app or service to use all four DNS servers.
Percentages matter
SLA uptime percentages are always high numbers. 99.5% seems like a good SLA, and 99.999% (the magic five nines) might seem excessive. But this changes considerably when you look at the associated maximum downtime: few users will accept your app experiencing unplanned downtime for more than an instant. In my humble opinion, any design offering a service level below 99.95% or 99.99% should not be used for production environments.
Uptime percentage and maximum downtime:
99.5% = 7 minutes 12 seconds of downtime per day
99.9% = 1 minute 26 seconds of downtime per day
99.95% = 43 seconds of downtime per day
99.99% = 8 seconds of downtime per day
99.999% = 0.8 seconds of downtime per day (6 seconds per week)
Source: https://uptime.is
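The figures above follow from simple arithmetic: the allowed downtime is the period multiplied by (100% minus the SLA percentage). Below is a minimal Python sketch of that calculation. The function name max_downtime is a hypothetical helper chosen for illustration, not part of any Azure tooling.

```python
from datetime import timedelta


def max_downtime(sla_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime an SLA percentage allows within a period.

    Hypothetical helper for illustration only.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    # The fraction of the period that may be spent down is (100 - SLA) / 100.
    return period * ((100 - sla_percent) / 100)


if __name__ == "__main__":
    for sla in (99.5, 99.9, 99.95, 99.99, 99.999):
        per_day = max_downtime(sla, timedelta(days=1)).total_seconds()
        per_week = max_downtime(sla, timedelta(weeks=1)).total_seconds()
        print(f"{sla}%: {per_day:.1f} s per day, {per_week:.1f} s per week")
```

Passing a different period, such as timedelta(days=30), gives the monthly budget, which is usually the window that matters for service credits.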
Composite SLA
When your app uses multiple (Azure) services, you need to look at the SLA for each service. What would you guess is the expected maximum downtime for an app that consists of a Web App writing to a SQL Database?
If either the Web App or the SQL Database fails, the whole application fails. The probability of each service failing is independent, so the composite SLA for this application is:
- 99.95% (Web App) × 99.99% (Database) = 99.94%
That's lower than either of the individual SLAs.
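The multiplication generalizes to any number of services that must all be up at the same time. Here is a minimal Python sketch, assuming independent failures as described above. The name composite_sla is a hypothetical helper, and the 99.95% Web App figure is taken from the composite calculation later in this article.

```python
import math


def composite_sla(*sla_percents: float) -> float:
    """Composite SLA (in percent) when every listed service must be available.

    Hypothetical helper; assumes the services fail independently.
    """
    # Convert each percentage to a probability, multiply, and convert back.
    return math.prod(p / 100 for p in sla_percents) * 100


if __name__ == "__main__":
    # Web App (99.95%) that depends on a SQL Database (99.99%).
    print(f"{composite_sla(99.95, 99.99):.2f}%")  # prints "99.94%"
```

Every extra dependency in the chain pulls the result down further, so the composite value is never higher than the weakest service in the chain.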
By adding an independent fallback path, you can greatly improve availability. What do you think will happen when we add Queue Storage, with its lower 99.9% SLA, to the example?
In this example, the Web App is still available even when it can’t connect to the SQL Database. The app will only fail when both the SQL Database and Queue Storage are down at the same time. The probability of a simultaneous failure is 0.0001 × 0.001, so the SLA for this combined path is:
- SQL Database or Queue Storage = 1.0 − (0.0001 × 0.001) = 99.99999%
Which makes the Composite SLA:
- Web app and (database or queue) = 99.95% × 99.99999% = ~99.95%
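The fallback calculation can be sketched the same way: a redundant path only fails when all of its members fail together, and the result is then multiplied with the services that remain mandatory. The function names below are hypothetical helpers for illustration, and the sketch assumes independent failures.

```python
def any_available(*sla_percents: float) -> float:
    """SLA (in percent) of a path where any one of the services is enough.

    Hypothetical helper; assumes the services fail independently.
    """
    all_fail = 1.0
    for p in sla_percents:
        all_fail *= 1 - p / 100  # probability that this service is down
    return (1 - all_fail) * 100


def all_available(*sla_percents: float) -> float:
    """SLA (in percent) of a chain where every service must be available."""
    result = 1.0
    for p in sla_percents:
        result *= p / 100
    return result * 100


if __name__ == "__main__":
    data_path = any_available(99.99, 99.9)   # SQL Database or Queue Storage
    total = all_available(99.95, data_path)  # Web App and (database or queue)
    print(f"data path: {data_path:.5f}%")
    print(f"composite: {total:.2f}%")
```

With the fallback in place, the Web App itself becomes the limiting factor, which is why the composite value ends up at roughly the Web App's own 99.95%.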
This example and a multi-region calculation are available in the Azure Architecture Center. You can also find an example of an extremely high availability design with a 99.999999% SLA using four Azure regions.