Design servers for high availability.

Recently I came across an interesting article about DynamoDB. Did you know that DynamoDB offers what is termed four-nines (99.99%) availability, and that this is a backed commitment: if DynamoDB fails to meet it, AWS compensates its customers?

The major public cloud providers make similar commitments for their services. To understand why this is crucial, let's look at what system availability means. Availability equates to system uptime: the percentage of time the system is up and running. For example, 99% availability means that the system may be unavailable for up to 3.65 days a year (1% of 365 days). Availability can also be measured as the success rate of requests. In that case, 99% availability is 99/100, which means that 1 out of every 100 requests fails.
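The arithmetic behind these figures is easy to check. Below is a minimal Python sketch; the function names `downtime_days_per_year` and `request_availability` are made up for illustration, and a 365-day year is assumed.

```python
def downtime_days_per_year(availability_percent: float, days_per_year: float = 365.0) -> float:
    """Return the allowed downtime, in days per year, for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    # The unavailable share of the year is whatever is left after the uptime share.
    return days_per_year * (100.0 - availability_percent) / 100.0


def request_availability(successful: int, total: int) -> float:
    """Return availability as the percentage of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


if __name__ == "__main__":
    # Compare two, three, and four nines.
    for percent in (99.0, 99.9, 99.99):
        days = downtime_days_per_year(percent)
        minutes = days * 24 * 60
        print(f"{percent}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why a four-nines commitment is much harder to keep than a two-nines one.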

This raises the question: what is a highly available system? Is 99% availability good enough? 98%? 90%? The accurate answer is none of them. High availability is not about numbers but about architecture and process.

Consider this scenario: a single-instance server has not failed for a week, meaning all its requests were fully processed. Can you describe that as a highly available system? No. If the single server fails, the entire system becomes unavailable, and who knows when it will become available again?

For a highly available system, high uptime is not a goal; it is a byproduct. A highly available system reaches several nines of availability because it was designed, implemented, and maintained to do so. High availability is about putting more focus on the design process and the architecture of the system.

Below are patterns and guidelines that can be used in designing highly available systems.

Eliminate single points of failure.
Be able to switch servers without losing data.
Protect the system from atypical client behavior.
Protect the system against the failure of any of its dependencies.
Detect and monitor failures, so that they are noticed as they occur.
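
To make the first and last guidelines a little more concrete, here is a minimal Python sketch of client-side failover across several replicas. It is an illustration under simplifying assumptions, not a production design: `call_with_failover`, `AllReplicasFailed`, and the two stand-in servers are hypothetical names, and each replica is modelled as a plain function.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a real system would catch narrower error types
            # Record the failure so it can be reported, then try the next replica.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors}")


# Hypothetical stand-ins for real servers.
def broken_server(request: str) -> str:
    raise ConnectionError("server is down")


def healthy_server(request: str) -> str:
    return f"handled {request}"


if __name__ == "__main__":
    # The first replica fails, so the request is served by the second one.
    print(call_with_failover([broken_server, healthy_server], "GET /items"))
```

With a single replica in the list, one failure makes the whole call fail, which is the single point of failure the first guideline warns about. Adding a second replica lets the request survive that failure, and the collected errors are where failure detection starts.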

Below are some engineering processes behind a highly available system:

But to truly know that your system is available, you need to measure its availability. One common way is to guarantee a minimum availability value to the client. More specifically, the client is given a concrete number. That number is the SLO (Service Level Objective). It can be a single value or a range of values. The commitment is made between the system's service provider (you) and the client, and the agreement that records it is usually called a Service Level Agreement (SLA).
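
As a rough illustration of measuring availability against such a number, the Python sketch below compares a measured request success rate with a single-value SLO. The function names and the request counts are invented for the example; a real system would read these counts from its monitoring data.

```python
def measured_availability(successful: int, total: int) -> float:
    """Return the percentage of requests that succeeded in the measurement window."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    return 100.0 * successful / total


def meets_slo(successful: int, total: int, slo_percent: float) -> bool:
    """Return True when the measured availability is at or above the SLO."""
    return measured_availability(successful, total) >= slo_percent


if __name__ == "__main__":
    slo = 99.9  # assumed objective, in percent
    for ok, total in ((9_999, 10_000), (9_980, 10_000)):
        status = "meets" if meets_slo(ok, total, slo) else "misses"
        print(f"{ok}/{total} successful requests {status} the {slo}% SLO")
```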

That is it for today. Thank you for your time.

Why you should use the public cloud and how to design servers for high availability.