With the recent Cloudflare and AWS outages (and the earlier Meta outage), it comes down to one simple tradeoff:
On one side, everyone running an internet service saves some money by paying a cloud provider to run the infrastructure for them -
so the cloud outfit gets to amortize a lot of costs over all its customers by running a small number of big data centers (numbered in the thousands) instead of the millions of enterprise computing services otherwise run by every Tom, Dick and Harry, Tesco, Twitter, OpenAI, Slack, Signal, Ticketmaster etc etc (all actual examples of organisations that lost service during the AWS and Cloudflare outages).
As Spidey says, "with great power comes great responsibility". So cloud providers not only provision carefully, but they also provide some level of fault tolerance by running redundant servers, and even run consistency protocols so that, if a customer's service needs very high availability, then as long as a majority of the replica servers are running, the service is OK. This can operate globally, so even if a whole country is disconnected (e.g. an international fiber cut, or a national grid outage, both also real events in recent years), the rest of the world can move on OK...
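To make the "majority of replicas" idea concrete, here is a minimal sketch in Python (not any provider's actual code, and all the names are made up): a write only counts once a strict majority of replicas acknowledge it, so losing a minority of servers - or a whole region - doesn't stop the service.

    # Minimal majority-quorum sketch: a write succeeds only if more than half
    # of the replicas acknowledge it, so the service stays available while a
    # majority of servers are up.
    from dataclasses import dataclass

    @dataclass
    class Replica:
        name: str
        up: bool

        def acknowledge(self, key: str, value: str) -> bool:
            # In a real system this would be a network call; here a live
            # replica simply stores the write and acks it.
            return self.up

    def quorum_write(replicas: list, key: str, value: str) -> bool:
        acks = sum(1 for r in replicas if r.acknowledge(key, value))
        return acks > len(replicas) // 2   # strict majority required

    replicas = [Replica("eu-west", True), Replica("us-east", True), Replica("ap-south", False)]
    print(quorum_write(replicas, "user:42", "profile-v7"))   # True: 2 of 3 replicas are up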
But it would seem that they don't fully apply this distributed, replicated, fault-tolerant/high-availability, possibly even somewhat decentralised thinking to the implementation of their own necessary internal infrastructure. In both the AWS and Cloudflare cases, the error was central. Someone at AWS didn't consider a particular pattern of performance in their DNS configuration system (their design is 95% sane), which let a slow server updating DNS overwrite more recent entries, so customer services that needed those new entries were unable to find them.
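A rough sketch of the kind of guard that prevents that failure mode - this is hypothetical, not AWS's actual DNS automation, and the names are invented - is to version each update and refuse to apply anything older than what is already in place:

    # Hypothetical version check: a slow, stale updater cannot overwrite a
    # newer DNS entry because its version number is too old.
    dns_table = {}   # name -> (version, value)

    def apply_update(name: str, version: int, value: str) -> bool:
        current = dns_table.get(name)
        if current is not None and current[0] >= version:
            # A newer (or equal) plan has already been applied; drop the stale write.
            return False
        dns_table[name] = (version, value)
        return True

    apply_update("db.example.internal", version=7, value="10.0.0.7")          # fast updater, newer plan
    print(apply_update("db.example.internal", version=5, value="10.0.0.5"))   # slow updater: rejected -> False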
In the Cloudflare case, a centrally managed configuration file roughly doubled in size in one overnight update, exceeding a maximum file size constraint in Cloudflare's services (this is actually rather sad in terms of being fairly easy to prevent by normal system checking/validation processes - see the sketch below). The AWS one is slightly more subtle, but not much more. In fact, earlier outages in replicated/distributed services (actually at Cloudflare earlier) took PhD-level thinking to come up with a long-term solution - see this paper for one example: Examining Raft's behaviour during partial network failures.
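The "fairly easy to prevent" point amounts to a pre-deployment check along these lines (the file name and size limit here are invented for illustration; Cloudflare's real pipeline is obviously more elaborate):

    # Refuse to push a generated config/feature file that breaks a known size
    # limit; further checks (parse it, count entries, alarm on a sudden 2x
    # growth versus the previous version) would go in the same place.
    import os, sys

    MAX_CONFIG_BYTES = 1_000_000   # hypothetical hard limit the consuming service enforces

    def validate_config(path: str) -> None:
        size = os.path.getsize(path)
        if size > MAX_CONFIG_BYTES:
            sys.exit(f"refusing to deploy {path}: {size} bytes exceeds limit of {MAX_CONFIG_BYTES}")
        print(f"{path}: {size} bytes, OK to deploy")

    # validate_config("features.conf")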
The Cloudflare example is also a little reminiscent of the Crowdstrike outage, but that wasn't cloud. Crowdstrike has a rulebase for its firewall products, and Microsoft Windows is required to allow third parties to install firewall products (even though a modern Microsoft OS firewall is actually good). Crowdstrike had a bug in a new rulebase, so when all the Windows machines using that product updated their rules, the firewall code (inside the OS, allowed in by Microsoft due to anti-monopoly rules) read a broken file, which caused an undetected code bug to trigger an exception; due to an oversight, the Crowdstrike software engineers had not put in an exception handler, which would have led to a safe exit of that code, so instead the exception caused an OS crash (i.e. bluescreen!)... In this case, the central error affected millions of edge systems directly - and due to the way the s/w update worked, it needed a lot of manual intervention by many, many people in many organisations.
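The missing safety net amounts to something like this - sketched in Python rather than the kernel-level code actually involved, with made-up names - where a broken rules file leads to a safe fallback to the last known-good rule set instead of a crash:

    # If parsing a newly pushed rules file can throw, the caller needs a handler
    # that keeps running on the old rules instead of letting the exception take
    # the whole system down (the bluescreen case above).
    import json

    def load_rules(path: str, last_known_good: list) -> list:
        try:
            with open(path) as f:
                rules = json.load(f)
            if not isinstance(rules, list):
                raise ValueError("rules file is not a list of rules")
            return rules
        except Exception as e:
            print(f"bad rules update ({e}); keeping previous rule set")
            return last_known_good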
In a non-cloud setup, you'd have natural levels of diversity across the millions of different enterprise deployments (even if just different versions of things running), so outages would typically be restricted to particular services; but in the cloud setup, an infrastructure outage takes down thousands of enterprises.... (I think AWS reckoned about 8000 large customers were affected - not sure about Cloudflare, but estimates are they run about 25% of the Internet ecosystem's defenses)...
Background:
AWS explainer
Cloudflare explainer
Replication failure during partial network outages
UK government report on data center sustainability - very useful, has lots of statistics
To be fair to the cloud service folks at AWS and Cloudflare, they found, fixed and publicly reported the problems in under a day; the concentration of resources in the cloud also means a concentration of highly paid, really expert people who can troubleshoot a problem, and once it's fixed, the deployment is also quick. On the other hand, a decentralised setup (more like the Crowdstrike example) could also have been fixed fairly fast if they had been slightly more careful about their s/w update process....
So I'd say cloud v. edge: at the moment it's hard to pick which is more resilient, or which is cheaper.