AWS — Improving System Resilience While Being Commercially Reasonable.
We were adversely affected by the infamous AWS us-east-1 outage of 25th November 2020 (I intentionally refrain from using the phrase “being a victim of”). Once the initial confusion had settled and I had managed a full night's sleep, my thoughts shifted towards ways of improving the resilience of our production systems.
AWS has since clarified (in surprising detail) what caused the Kinesis Data Streams outage and how it affected other services such as Cognito. Based on that information, Cognito was only using Kinesis streams for analytics gathering and should not have been critically affected, but a bug on the Cognito side made it unavailable during the prolonged Kinesis unavailability. AWS's own Service Health Dashboard was using Cognito, so their own support operators could not log in to the dashboard to post updates.
In the aftermath of the incident, I saw several people suggesting that the designers of all the affected systems should have planned for such an occurrence and kept their systems available across every form of failure. This is a naïve way of thinking, and it disregards several main aspects that shape the architecture of any system:
· Infrastructure cost
· Development cost/time
· Maintenance complexity
The availability requirement of a system is business driven and needs to be balanced against the aspects mentioned above. Often the SLA of a system is a best effort, bounded by commercial viability, which is the sum of those aspects. AWS's own SLA for their services reads: “AWS will use commercially reasonable efforts to make each Included Service available with a Monthly Uptime Percentage of at least 99.9% for each AWS region during any monthly billing cycle.”
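To put the 99.9% figure in perspective, the downtime it still permits is simple arithmetic. The sketch below assumes a 30-day billing month; that assumption and the function name are mine, not part of the SLA wording.

```python
def allowed_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime that a monthly uptime percentage still permits."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - uptime_percent) / 100.0


# 99.9% over a 30-day month leaves roughly 43 minutes of permitted downtime.
print(round(allowed_downtime_minutes(99.9), 1))
```

An outage lasting many hours sits far outside that budget, but the SLA only commits to "commercially reasonable efforts", which is exactly the trade-off discussed here.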
Our own case
Based on the business requirements of our production system, my main focus was to find a minimal approach to increasing availability without drastic changes. In other words, to be “commercially reasonable”.
Despite not having a dedicated DR infrastructure setup, our system architecture was resilient in the traditional sense. The backend services and the front end are all deployed through ECS, with multiple instances running in different availability zones. The main database is deployed through Aurora RDS, with a writer and a dedicated read replica running in different availability zones. All of these were up, running, and available during the outage. Of course, since CloudWatch was inaccessible, we could not look at logs or metrics, but the ALB target group health checks were passing and the services were accessible.
We have some external data being pumped into the system through AWS Kinesis Data Streams. These were obviously unavailable, so our Kinesis consumers were sitting idle. This by itself was not critical, as the system should still have been somewhat usable, even with stale data. But none of this mattered, because one key service was unavailable: AWS Cognito.
Since Cognito was down, none of the users could log in to the system. Ironically, we had introduced AWS Cognito to our system only four months prior, as part of a larger security and single sign-on implementation. The implementation was smooth and the desired results were achieved, but it had obviously introduced an extra point of failure on a critical path for the system.
We use Cognito with a SAML-based federated identity provider through a user pool. Cognito itself does not store user information in its own pool. We have also configured several App Clients in Cognito.
To make this interaction tolerant of a regional failure, we can introduce another user pool in a different region, with a configuration identical to the primary. As of now, AWS Cognito does not provide any native cross-region user pool sync mechanism, so this would have to be done manually. If the users were instead managed by Cognito itself (without a federated identity provider), you could set up a sync mechanism by utilising the cross-region replication of a service like DynamoDB. Such an approach is explained here.
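Because the second pool is kept in step by hand, it can silently drift from the primary. A minimal sketch of a drift check follows, assuming the two app client configurations have already been fetched as plain dictionaries (for example from the `UserPoolClient` part of a boto3 `describe_user_pool_client` response). The function name, the ignored-key list, and the sample values are hypothetical.

```python
# Fields expected to differ between two pools (IDs, secrets, timestamps).
# This list is an assumption; adjust it to the fields your setup returns.
IGNORED_KEYS = {"ClientId", "UserPoolId", "ClientSecret", "CreationDate", "LastModifiedDate"}


def config_drift(primary: dict, secondary: dict, ignored=IGNORED_KEYS) -> dict:
    """Return {setting: (primary_value, secondary_value)} for every setting that differs."""
    keys = (set(primary) | set(secondary)) - set(ignored)
    return {
        key: (primary.get(key), secondary.get(key))
        for key in sorted(keys)
        if primary.get(key) != secondary.get(key)
    }


# Hypothetical sample data, for illustration only.
primary_client = {
    "ClientId": "primary-client-id",
    "CallbackURLs": ["https://app.example.com/callback"],
    "AllowedOAuthFlows": ["code"],
}
secondary_client = {
    "ClientId": "secondary-client-id",
    "CallbackURLs": ["https://app.example.com/callback"],
    "AllowedOAuthFlows": ["implicit"],
}
print(config_drift(primary_client, secondary_client))
```

Running a check like this on a schedule is cheap, and it catches the case where someone changes the primary pool and forgets the backup.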
The application services can fail over, either automatically or manually, to the backup user pool hosted in the other region. This failover can happen at the DNS level through Route 53, which is a region-agnostic service.
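Route 53 failover records are configured in AWS rather than in application code. As an application-side illustration of the same primary/secondary decision, here is a minimal sketch; it is not the DNS mechanism itself. The pool descriptors and the health check callable are hypothetical, and a real check might probe the pool's sign-in endpoint with a short timeout.

```python
from typing import Callable, Optional, Sequence


def choose_user_pool(
    candidates: Sequence[dict],
    is_healthy: Callable[[dict], bool],
) -> Optional[dict]:
    """Return the first pool, in priority order, that passes the health check."""
    for pool in candidates:
        try:
            if is_healthy(pool):
                return pool
        except Exception:
            # A check that raises (timeout, connection error) counts as unhealthy.
            continue
    return None  # Nothing is healthy; the caller decides how to degrade.


# Hypothetical descriptors: primary first, backup second.
POOLS = [
    {"region": "us-east-1", "user_pool_id": "primary-pool-id"},
    {"region": "us-west-2", "user_pool_id": "backup-pool-id"},
]

# Simulate the primary region being down.
selected = choose_user_pool(POOLS, lambda pool: pool["region"] != "us-east-1")
print(selected["region"])
```

Keeping the health check injectable makes the manual option easy as well: an operator-controlled flag can stand in for the automatic probe.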
Another AWS service in the critical path is ElastiCache (Redis), which was not affected by this specific outage but should be considered when improving system resilience. Unlike Cognito, ElastiCache for Redis supports native cross-region replication via Global Datastore. If the primary cluster experiences a degradation, a secondary cluster in another region can be promoted to primary. This promotion has to be done manually, through the console or the CLI.
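The manual promotion can also be scripted. The sketch below wraps the ElastiCache `FailoverGlobalReplicationGroup` API as exposed by boto3; to my understanding the call takes the three parameters shown, but verify them against the current boto3 documentation before relying on this. The wrapper name and all identifiers are hypothetical, and the client is passed in so the function itself has no AWS dependency.

```python
def promote_secondary(client, global_group_id: str, target_region: str, target_group_id: str):
    """Ask ElastiCache to make the replication group in target_region the new primary."""
    return client.failover_global_replication_group(
        GlobalReplicationGroupId=global_group_id,  # ID of the Global Datastore
        PrimaryRegion=target_region,               # region that should become primary
        ReplicationGroupId=target_group_id,        # secondary group being promoted
    )


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   client = boto3.client("elasticache", region_name="us-west-2")
#   promote_secondary(client, "my-global-datastore", "us-west-2", "my-secondary-group")
```

Wrapping the call in a small, reviewed script means the promotion is one deliberate command during an incident instead of a series of console clicks.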
We could go further and set up backup Kinesis streams, cross-region read replicas for Aurora RDS, backup ECS clusters in a different region, and so on, but as I mentioned before, things have to be commercially reasonable. It is all too easy to over-engineer things and end up with minimal returns or, worse yet, system instability.
The critical services that you have to consider for your own system will be different from these, and the scope of the regional failure tolerance you end up implementing may be different as well. But the core idea I want to convey is that your solution should be commercially reasonable for you, rather than blindly replicating every service, as a few other sources suggest.