Have you ever wondered what happens when you go to some website on the internet? For example, you search for something on the “Google Search” page and it’s just always there. Many other pages work almost always as well. But sometimes something like this happens:
“Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.” (source: https://www.fastly.com/blog/summary-of-june-8-outage)
And Fastly actually powers a lot of websites, so errors on 85% of its network meant a lot of websites were down.
The topic of how to make services/websites more available still fascinates me. And it’s actually pretty hard to achieve. I found this article a great introduction to high availability (HA): https://www.atlassian.com/blog/statuspage/high-availability
What if we are not satisfied with 99.9% availability, which still allows almost 9 hours of downtime per year? What can we do? I found these resources very useful:
1. “Beyond Five 9s”: https://aws.amazon.com/builders-library/beyond-five-9s-lessons-from-our-highest-available-data-planes/ (the Amazon Builders’ Library has great content in general)
2. The idea of shuffle sharding from the article above, explained in more detail: https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding
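3. Caches: use two TTLs, a soft TTL and a hard TTL. Try to refresh an item once its soft TTL has passed, but if the downstream service is unavailable, keep serving the cached value until the hard TTL is reached. From https://aws.amazon.com/builders-library/caching-challenges-and-strategies/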
4. “Building Secure and Reliable Systems”, one of the Google SRE books: https://static.googleusercontent.com/media/sre.google/pl//static/pdf/building_secure_and_reliable_systems.pdf This book covers much more than availability. It explains reliability and security thoroughly. I’d recommend that anyone involved in programming or system design at least skim through it and read the chapters most useful to them.
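
To make the two-TTL idea from item 3 a bit more concrete, here is a minimal sketch in Python. It is my own illustration, not code from the AWS article: the class name `TwoTtlCache`, the `fetch` callback, and the injectable `clock` are all assumptions made up for this example. Within the soft TTL the cached value is returned directly; after the soft TTL the cache tries to refresh; if the refresh fails, the stale value is still served until the hard TTL runs out.

```python
import time


class TwoTtlCache:
    """Hypothetical cache with a soft TTL (refresh) and a hard TTL (give up)."""

    def __init__(self, fetch, soft_ttl, hard_ttl, clock=time.monotonic):
        if soft_ttl > hard_ttl:
            raise ValueError("soft_ttl must not exceed hard_ttl")
        self._fetch = fetch        # callable(key) -> value, may raise
        self._soft_ttl = soft_ttl  # seconds before we try to refresh
        self._hard_ttl = hard_ttl  # seconds before stale data is unusable
        self._clock = clock        # injectable so the example is testable
        self._entries = {}         # key -> (value, stored_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)

        # Fresh enough: no call to the downstream service at all.
        if entry is not None and now - entry[1] < self._soft_ttl:
            return entry[0]

        try:
            value = self._fetch(key)
        except Exception:
            # Assumption for this sketch: any exception means "downstream
            # is unavailable". Serve the stale value while the hard TTL
            # has not been reached; otherwise propagate the error.
            if entry is not None and now - entry[1] < self._hard_ttl:
                return entry[0]
            raise

        self._entries[key] = (value, now)
        return value
```

For example, `TwoTtlCache(load_profile, soft_ttl=30, hard_ttl=600)` (with `load_profile` being whatever function calls your backend) would refresh entries older than 30 seconds, but keep answering from the cache for up to 10 minutes during a backend outage. A real implementation would probably also want locking, eviction, and request coalescing, which this sketch leaves out.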