Category:

systems

Interesting discoveries – High Availability

by Tomasz Jarosik July 10, 2021

Have you ever wondered what happens when you visit a website? For example, you search for something on the “Google Search” page and it’s just always there. Many other pages work almost all the time as well. But sometimes something like this happens:

“Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.”
https://www.fastly.com/blog/summary-of-june-8-outage

Fastly is a CDN that sits in front of a lot of websites, so errors from 85% of its network meant a lot of websites were down.

How to make services and websites more available still fascinates me, and it’s actually pretty hard to achieve. I found this article a great introduction to HA: https://www.atlassian.com/blog/statuspage/high-availability

What if we are not satisfied with 99.9% availability (roughly 8.8 hours of downtime per year)? What can we do? I found these resources very useful:
1. “Beyond Five 9s” https://aws.amazon.com/builders-library/beyond-five-9s-lessons-from-our-highest-available-data-planes/ (the Amazon Builders’ Library contains great content in general)
2. The idea of shuffle sharding, mentioned in the above, explained in more detail: https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding
3. Caches: use two TTLs, a soft TTL and a hard TTL, from https://aws.amazon.com/builders-library/caching-challenges-and-strategies/
4. Google SRE book “Building Secure and Reliable Systems”: https://static.googleusercontent.com/media/sre.google/pl//static/pdf/building_secure_and_reliable_systems.pdf This book is about much more than availability; it explains reliability and security thoroughly. I’d recommend that anyone involved in programming or design at least skim through it and read the chapters most useful to them.
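
The two-TTL idea from item 3 can be sketched roughly as follows. This is only my illustration in Python, not code from the linked article: the class name `TwoTtlCache`, the `loader` callback, and the injectable `clock` are all hypothetical. The assumption is that entries older than the soft TTL trigger a refresh attempt, and that a stale entry keeps being served on refresh failure until the hard TTL is reached.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TwoTtlCache:
    """Cache with a soft TTL (try to refresh) and a hard TTL (stop serving)."""

    def __init__(self, soft_ttl: float, hard_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if soft_ttl > hard_ttl:
            raise ValueError("soft_ttl must not exceed hard_ttl")
        self.soft_ttl = soft_ttl
        self.hard_ttl = hard_ttl
        self.clock = clock
        # key -> (value, time the value was stored)
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None:
            value, stored_at = entry
            age = now - stored_at
            if age < self.soft_ttl:
                return value  # fresh: the backend is not called
            if age < self.hard_ttl:
                try:
                    return self._load(key, loader, now)
                except Exception:
                    return value  # backend failed: serve the stale value
        # Missing or past the hard TTL: a loader failure reaches the caller.
        return self._load(key, loader, now)

    def _load(self, key: str, loader: Callable[[str], Any], now: float) -> Any:
        value = loader(key)
        self._entries[key] = (value, now)
        return value
```

The point of the design is that a short outage of the backend is hidden from callers (they get slightly stale data), while the hard TTL bounds how stale that data can get.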

systems

Pack your own parachute

by Tomasz Jarosik June 9, 2020

“An example of a really responsible system is the system the Romans used when they built an arch. The guy who created the arch stood under it as the scaffolding was removed. It’s like packing your own parachute.”
― Charles T. Munger

software-development systems

LXC snapshots scheduling and expiration

by Tomasz Jarosik June 6, 2020

I run most of my containers with LXC. Just recently I discovered that there is an automatic way of making snapshots and expiring them. All it takes is a few commands. You can apply these options per single container or per profile. In my case the profile is called “prod-nvme”:

lxc profile set prod-nvme snapshots.pattern 'snapshot-{{creation_date.Format("20060102")}}-%d'
lxc profile set prod-nvme snapshots.expiry "14d"
lxc config set prod-nvme-wordpress snapshots.schedule "0 2 * * SAT"

The snapshot schedule follows the cron format, so it’s very easy to use. See https://crontab.guru/#0_2_*_*_SAT for an explanation of the schedule above. All newly created snapshots will follow snapshots.pattern and will expire after 14 days. This doesn’t apply to snapshots created before these options were set, but it does apply to all manual snapshots created afterwards.

$ lxc info prod-nvme-wordpress
Name: prod-nvme-wordpress
Created: 2018/06/23 16:33 UTC
Status: Running
Type: container
Profiles: default, prod-nvme
.............
Snapshots:
  prod-nvme-wordpress-30-04-2020 (taken at 2020/04/30 20:08 UTC) (stateless)
  backup-1.05.2020 (taken at 2020/05/01 20:28 UTC) (stateless)
  backup-23.05.2020 (taken at 2020/05/23 21:07 UTC) (stateless)
  snapshot-20200606-0 (taken at 2020/06/06 00:00 UTC) (expires at 2020/06/20 00:00 UTC) (stateless)
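
To make the three settings concrete, here is a small Python sketch of what they mean. It is only an illustration, not how LXC implements them: the function names are hypothetical, I assume the schedule is evaluated in UTC, and I assume the trailing %d in the pattern is a counter, as the snapshot-20200606-0 entry in the listing suggests.

```python
from datetime import datetime, timedelta, timezone

SATURDAY = 5  # datetime.weekday(): Monday is 0, Saturday is 5


def next_saturday_2am(now: datetime) -> datetime:
    """Next time matching the cron expression "0 2 * * SAT", strictly after now."""
    candidate = now.replace(hour=2, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(SATURDAY - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)  # this week's slot has already passed
    return candidate


def snapshot_name(created: datetime, index: int) -> str:
    # Go's reference layout "20060102" corresponds to %Y%m%d in Python.
    return f"snapshot-{created:%Y%m%d}-{index}"


def expiry(created: datetime, days: int = 14) -> datetime:
    # snapshots.expiry "14d": the snapshot expires 14 days after creation.
    return created + timedelta(days=days)


if __name__ == "__main__":
    now = datetime(2020, 6, 3, 12, 0, tzinfo=timezone.utc)  # a Wednesday
    run_at = next_saturday_2am(now)
    print(run_at, snapshot_name(run_at, 0), expiry(run_at))
```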

OSS software-development systems

SSH with U2F based on Teleport and Yubikey

by Tomasz Jarosik June 2, 2020

I wanted simple and secure shell access to my home lab, which runs many containers. I have a physical U2F key from Yubico, so I wanted to use it as a second factor. Recording of SSH sessions would also be nice. I found all of that in the Teleport service from Gravitational. See here: https://gravitational.com/teleport/

Usage

From the user’s perspective, you can access the Teleport service via the web or the command line:

The user manual is easy to follow: https://gravitational.com/teleport/docs/user-manual/

After a successful login, you can see all machines. The “Login as” column lists usernames. For example, you might have access to the “tjarosik” username, but if that username doesn’t exist on a machine, you will not be able to log in to it.

When you click on a user in the “Login as” column, a shell opens in your browser. You can also join an existing session, or view and replay previous sessions from the sessions list:

From smartphone

It is also possible to use your smartphone to log in securely with U2F:

Configuration

In my case, the Teleport proxy web interface sits behind a load balancer. One thing to remember is that ports, even the default 443, must be specified explicitly in the config files. Setup is really simple. Here are sample configs:

Auth server and proxy config:

Single node config (in my case it’s a container running Ubuntu):

productivity software-development systems

Uptime 15,364 days

by Tomasz Jarosik January 10, 2020

There is a fascinating talk on YouTube: https://www.youtube.com/watch?v=H62hZJVqs2o It’s a rare thing to see a system working for so long while being operated completely remotely. Worth watching!

I’m usually curious about how something like that came to be: what kind of choices were made to make the project possible. There are a few things in this talk that can be applied in many domains (well, not everyone can build a spaceship ;)).

“Don’t make engineering choices which could limit the lifetime of a spacecraft.”

There are also three main design principles that helped this project last so long, and that I think can be applied to other systems as well:
1. Reliability
2. Redundancy
3. Reconfigurability
