ArchiTECHt Daily: Should Google have open sourced Spanner? No.

By ARCHITECHT • Issue #19
According to journalist Timothy Prickett Morgan at the Next Platform, it would have been a bigger deal if Google’s Valentine’s Day announcement had been an open source release of Spanner, or perhaps support for CockroachDB (an open source Spanner clone), rather than the proprietary Cloud Spanner service. The argument certainly has some merit from a user point of view (who doesn’t like open source?), but the economics of cloud computing suggest it’s probably not the best decision financially.
The biggest reason is that the business of open source databases has not been a particularly lucrative one thus far. As the recent RethinkDB saga illustrates, it’s often a dog-eat-dog world where open source projects fight over precious users but there’s no guarantee of big money. By all accounts, the revenues at MongoDB, the current big dog in that world (or even MySQL before it was twice-acquired), would barely move the needle at a company like Oracle.
The thing with databases is that if they’re fundamentally important to businesses, like Google expects Cloud Spanner to be, companies will often pay for them right up front—absent some proven, viable open source alternative. An example from the startup world might be MemSQL, which, although I don’t know for sure, appears to be doing alright despite not being open source. 
Amazon DynamoDB highlights another factor to consider: For some companies, computing in the cloud is as much about operational simplicity as it is about anything else. They want a database that works, that’s scalable and that they don’t have to invest manpower in maintaining. For cloud providers, it’s almost certainly easier to build these services from scratch using code they’ve developed in-house than it is to turn someone else’s project into a managed service.
When Amazon Web Services launched its DynamoDB service some years ago, word began to spread that it was the fastest-growing AWS service ever. Yes, it could have instead open sourced Dynamo or built a managed version of MongoDB or something. But why?
A counterpoint to all of this might be that cloud providers, including Google, were quick to offer managed versions of open source projects such as Hadoop and Spark. However—unlike pretty much any open source database you can name—those were already very popular projects with no real peers. Cloud providers had the choice of offering managed versions with a built-in price premium, or having users install the Apache versions and only pay for cloud resources. But not running Hadoop or Spark was not really an option.
(For more on the fight over who will own big data workloads in the cloud, you should also read Matt Asay’s InfoWorld post from yesterday, titled Hadoop finds a happier home in the cloud.)
However, even offering Hadoop and/or Spark services hasn’t stopped cloud providers from also offering their own proprietary options. In the cloud, business is all about providing users with what they want, but also having an opinion. Google, for example, has Cloud Dataflow as a Spark alternative based on its own technology, which Google argues is the superior method for programming and running data-processing jobs. 
Google wants its customers to use Cloud Dataflow, but it’s fighting an uphill battle against the entrenched options. So it released a lot of Dataflow as part of the Apache Beam project, hoping to spur adoption from the bottom up. (I discuss that effort here.)
Google might eventually open source Spanner, or decide to support CockroachDB in some fashion, but it will require a lot of pressure, primarily in the form of users voting with their feet. Until then, Google is hoping Cloud Spanner can be a cash cow, and the competitive advantage that helps it make a serious run at AWS and Microsoft up ahead.

Images from the YouTube-8M dataset
Around the web: Artificial intelligence
Speaking of Dataflow, Google released the 1.0 version of its deep learning cousin, TensorFlow, yesterday. It’s faster, it’s more flexible and, Google says, it’s more production-ready.
This is part of the release of a big audio-visual dataset from YouTube. Contestants can train on the YouTube-8M dataset, then label about 700,000 unseen audio-visual examples. It’s pretty cool.
More from Google. This time, it’s Eric Schmidt admitting at RSA that he vastly underestimated how good AI technologies would become, and how fast. Perhaps the first “AI winter” jaded him.
Via SparkML, TensorFlow, H2O and more. It’s hard to tell if users are asking for this, or if IBM is hoping they’ll thank it later.
As far as guides to the various deep learning frameworks go, this one is pretty short and sweet.
It’s getting hard for me to not take these advances in computer-vision-based diagnosis for granted. It’s probably a bit easier if one of these algorithms saves your life or at least improves it.
I posted the blog post on this the other day, but this is a research paper about visual search. And, really, who has more invested in this space than Pinterest?
Is anyone really asking for their BI tools to explain results? There was a moment a few years ago when several companies were trying … maybe Narrative Science has nailed it where others have been underwhelming.
Around the web: Cloud and infrastructure
This is always a fun read, even if we assume RightScale has quite limited visibility into cloud usage overall. This year it suggests multi-cloud is real, and that Azure and Google Cloud are growing while AWS is flat.
Even the biggest software-as-a-service provider has nowhere near as many users, or probably as much data, as someone like Facebook. That some are tired of running data centers shouldn’t be too surprising.
Speaking of SaaS, Equinix customers can now plug into Salesforce without relying on the public internet for a connection.
Living in Nevada, I cannot stress enough the outsized influence Switch has in the state. Also, companies and governments still need data center space, and Switch seems to be getting a lot of that business lately, even from well-known names like eBay.
I’ll admit that I pooh-poohed DataDog when judging a startup competition several years ago. It has done pretty well for itself, and is now into the lucrative world of application monitoring.
There are several reasons this particular project might not take off, but as Spanner has shown us, consistency and global scale are the way of the future.
Around the web: All things data
Remember open federal data and a White House dedicated to helping solve societal problems using that data? That was nice while it lasted. On a related note, I didn’t like seeing Intel stop backing science fairs.
On the one hand, it’s a great time to be in the Apache Kafka business. On the other hand, Kafka, like all oldish-guard data technologies, is at risk from cloud-based services.
For some reason this list seems wrong. Maybe because so much real data science is done in open source?
This is marketing copy from Hortonworks, but gets at a data architecture I think many enterprises would be happy to see. Have to wonder how hard it will be to get there in brownfield Hadoop environments.
The most interesting news, analysis, blog posts and research in cloud computing, artificial intelligence and software engineering. Delivered daily to your inbox. Curated by Derrick Harris. Check out the Architecht site at