ArchiTECHt Daily: The great Amazon S3 outage of 2017

By ARCHITECHT • Issue #27
So, that was a trying few hours yesterday, huh? Amazon S3 starts seeing “increased error rates” in its Northern Virginia region, and the world starts claiming that the internet is broken. 
In defense of the hysteria, though, the outage did bring down a whole lot of popular sites. Jordan Novet at VentureBeat compiled the longest list I’ve seen, although I’m sure there’s a longer one floating around somewhere. There were no doubt thousands of smaller companies and minor applications storing stuff directly on S3 or via third-party services like Heroku (including Revue, my newsletter provider) that went down as well. Here’s the list from VentureBeat:
The issues appear to be affecting Adobe’s services, Amazon’s Twitch, Atlassian’s Bitbucket and HipChat, Autodesk Live and Cloud Rendering, Buffer, Business Insider, Carto, Chef, Citrix, Clarifai, Codecademy, Coindesk, Convo, Coursera, Cracked, Docker, Elastic, Expedia, Expensify, FanDuel, FiftyThree, Flipboard, Flippa, Giphy, GitHub, GitLab, Google-owned Fabric, Greenhouse, Heroku, Home Chef, iFixit, IFTTT, Imgur, Ionic, isitdownrightnow.com, Jamf, JSTOR, Kickstarter, Lonely Planet, Mailchimp, Mapbox, Medium, Microsoft’s HockeyApp, the MIT Technology Review, MuckRock, New Relic, News Corp, OrderAhead, PagerDuty, Pantheon, Quora, Razer, Signal, Slack, Sprout Social, StatusPage (which Atlassian recently acquired), Travis CI, Trello, Twilio, Unbounce, the U.S. Securities and Exchange Commission (SEC), The Verge, Vermont Public Radio, VSCO, Wix, Xero, and Zendesk, among other things. Airbnb, Down Detector, Freshdesk, Pinterest, SendGrid, Snapchat’s Bitmoji, and Time Inc. are currently working slowly.
Apple is acknowledging issues with its App Stores, Apple Music, FaceTime, iCloud services, iTunes, Photos, and other services on its system status page, but it’s not clear they’re attributable to today’s S3 difficulties.
Parts of Amazon itself also seem to be facing technical problems at the moment. Ironically, it’s restricting AWS’ ability to show errors.
There are conflicting reports about whether Netflix went down, which may have something to do with geographic location. However, Netflix is often the poster child for smart AWS architecture during these outages (including in September 2015, when “increased error rates” took down the Amazon DynamoDB service for a while), illustrating the importance of building highly available services and planning for failure. 
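Planning for failure can start with something as basic as not depending on a single region for reads. The sketch below is a rough illustration of that idea, not Netflix’s actual code and not the AWS SDK: it assumes objects have already been copied to a second region (S3’s cross-region replication feature is one way to get there), and `InMemoryStore`, `RegionUnavailable` and `read_with_failover` are hypothetical names, with the in-memory store standing in for a real storage client.

```python
from typing import Dict, List


class RegionUnavailable(Exception):
    """Raised when a region is failing requests (hypothetical error type)."""


class InMemoryStore:
    """Hypothetical stand-in for an object-storage client bound to one region."""

    def __init__(self, region: str, objects: Dict[str, bytes], healthy: bool = True) -> None:
        self.region = region
        self.objects = objects
        self.healthy = healthy

    def get(self, key: str) -> bytes:
        if not self.healthy:
            # Simulates the region returning elevated error rates.
            raise RegionUnavailable(f"{self.region}: increased error rates")
        # A missing key raises KeyError, which is NOT treated as a regional outage.
        return self.objects[key]


def read_with_failover(key: str, stores: List[InMemoryStore]) -> bytes:
    """Try each region in priority order, moving on only when a region is unavailable."""
    errors: List[str] = []
    for store in stores:
        try:
            return store.get(key)
        except RegionUnavailable as exc:
            errors.append(str(exc))  # remember why this region failed, then try the next one
    raise RegionUnavailable("all regions failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Same object replicated to two regions; the primary (Northern Virginia) is down.
    replicated = {"logo.png": b"hello"}
    primary = InMemoryStore("us-east-1", dict(replicated), healthy=False)
    secondary = InMemoryStore("us-west-2", dict(replicated))
    print(read_with_failover("logo.png", [primary, secondary]).decode())  # prints "hello"
```

Only region-level failures trigger the fallback here; a missing key in a healthy region still raises `KeyError`, so a genuine “not found” isn’t masked by quietly checking another region.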
Some folks will use yesterday’s outage as an example of why companies shouldn’t use the cloud, or (rightly) why they should consider platforms other than or in addition to AWS. But it’s important to remember that AWS only amassed such a large number of users because it works so well overall. Frankly, many of the services affected wouldn’t even exist if not for AWS, and those that did exist might be down far more often if they were forced to rely on their own infrastructure.
For every paragon of software and infrastructure engineering like Facebook, there are several fail whales.
And while folks in Redmond and Mountain View might have been smiling ear to ear yesterday, most of them knew better than to get too cocky. Outages at competing cloud providers don’t cause nearly this large a stir because those providers aren’t serving nearly as many popular applications. (Although, god, what would I do if Snapchat were down?!)
Those other cloud services aren’t perfect, either. A comment by a Googler on the Hacker News thread about S3 quickly drew a litany of complaints about Google’s cloud services.
However, the good news for everyone is that cloud computing providers are getting better, cloud services are getting better and cloud-native architectures are getting better. Hopefully, we’ll see far fewer of these hiccups in the coming years, and much smaller impacts when they do occur. In a few years, it would be reassuring to know that if our favorite services are down, it’s because some serious shit went down.

Sponsor: Datos IO
Around the web: Cloud and infrastructure
Speaking of cloud providers and their performance/availability … Partnerships like this will only help Google in its quest to win enterprise workloads.
Here’s a feature on some of the Facebook telco efforts I highlighted in yesterday’s newsletter. But the official line all sounds a little too altruistic: I still assume Facebook et al. want to cut out the middleman where they can.
Revenue up 27 percent to $2.29 billion, net loss up more than 100 percent to $51.4 million. I know Salesforce is entrenched, but it seems ripe for real disruption.
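For a sense of scale, the two growth figures imply the prior-period baseline. This is a back-of-the-envelope check that assumes both percentages compare against the same period a year earlier (the blurb doesn’t say so explicitly) and reads “more than 100 percent” as the net loss more than doubling:

```latex
\begin{align*}
R_{\text{prior}} &= \frac{R_{\text{now}}}{1.27} = \frac{2.29}{1.27} \approx 1.80 \quad (\text{\$ billion}) \\
L_{\text{prior}} &< \frac{L_{\text{now}}}{2} = \frac{51.4}{2} = 25.7 \quad (\text{\$ million})
\end{align*}
```

In other words, revenue grew by roughly $0.49 billion while the loss went from under $25.7 million to $51.4 million.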
Good analysis of the situation AWS is currently facing regarding how tightly to embrace Kubernetes for container orchestration. It’s worth noting that AWS did open source its Blox scheduler for ECS, not that anybody cared much.
The fact that so many companies ran MongoDB environments exposed to the public internet and seemingly ignored warnings from hackers is not really the fault of MongoDB. But these incidents look bad nonetheless.
It thinks CPU+GPU processors are the key to hitting a major milestone for high-performance computing. I’m just hoping to see brontoscale systems in my lifetime, because I will chuckle every time I hear that prefix.
Around the web: Artificial intelligence
This has long been cited as low-hanging fruit for NLP and voice-recognition systems, although augmenting human agents might be more effective, depending on the situation and the domain-specific training required.
Aka “influencers” in social media circles or “whales” in casinos. Basically, the argument here is that smarter algorithms will help companies find the small fraction of customers, metrics, etc., that deliver outsized returns.
hbr.org
This is some very cool research out of Georgia Tech, teaching computers to rationalize why they’re making a certain decision. It could help mitigate black-box situations, as well as optimize human-machine interactions in factory or other settings.
arxiv.org
Source: Fast Company / Periscopic
Around the web: All things data
Keeping the original headline here so I can point out a critical distinction: Big data is a technology; IoT is a use case for big data that, yes, is driving its adoption.
More specifically, a metadata vault for government research that has been vanishing (or is in danger of vanishing) from Data.gov. Projects like this could prove quite important over time.
medium.com
If Reflect can master the art of embeddable data graphics, more power to it. But developers don’t always like to pay, and there’s a lot of competition.
I’m not sure this is too informative, actually, but it is interesting.
Just going to use this as an opportunity to remind everyone of my interview with C3 founder Thomas Siebel (yes, that Siebel) on the ArchiTECHt site.
The most interesting news, analysis, blog posts and research in cloud computing, artificial intelligence and software engineering. Delivered daily to your inbox. Curated by Derrick Harris. Check out the ArchiTECHt site at https://architecht.io