ArchiTECHt Daily: Spark, we hardly knew ye

By ARCHITECHT • Issue #46
The subject line of this issue is a bit of prognostication, of course, but it’s nonetheless remarkable how fast the next big thing comes along. Apache Spark likely has a long road ahead of it as the data-processing engine of choice for a wide variety of jobs, ranging from stream processing to machine learning. However, researchers are working fast to overcome some of Spark’s limitations. 
The latest effort, called Flare, comes from researchers at Purdue and Stanford. They have developed a system they claim can outperform Spark on SQL queries, while still maintaining solid performance across other processing tasks. Importantly, they have also designed Flare to take advantage of the prevalence of resource-rich cloud computing instances, which can increase efficiency by letting users scale up rather than scaling out. 
As they explain: 
“Today, machines with dozens of cores and memory in the TB range are readily available, both for rent and for purchase. At the time of writing, Amazon EC2 instances offer up to 2 TB main memory, with 64 cores and 128 hardware threads. Built-to-order machines at Dell can be configured with up to 12 TB, 96 cores and 192 hardware threads. NVIDIA advertises their latest 8-GPU system as a ‘supercomputer in a box,’ with compute power equal to hundreds of conventional servers. With such powerful machines becoming increasingly commonplace, large clusters are less and less frequently needed. Many times, ‘big data’ is not that big, and often computation is the bottleneck. As such, a small cluster or even a single large machine is sufficient, if it is used to its full potential.”
Notably, the researchers at RISELab—UC-Berkeley’s successor to AMPLab, which birthed Spark—are also working on various projects that seek to improve on Spark’s limitations. Professor Michael Jordan presented on a machine-learning-focused project, called Ray, at the O'Reilly Strata conference earlier this month.
I can’t speak to the details of the RISELab work, but the researchers behind Flare have retained support for Spark’s APIs, which are user-friendly, powerful and, at this point, widely used.
Basically, what we’re seeing right now across the big data space is what happens when innovation in cloud computing mixes with open source development. The result is that platforms like Hadoop and Spark, designed for yesterday’s use cases using yesterday’s infrastructure, end up showing their age faster than some folks might expect. 
This isn’t necessarily a big deal for the current big data software market. Enterprise adoption is still the bellwether by which companies trying to commercialize these technologies are judged, and enterprises are typically slower to adopt new technologies and have lots of legacy systems that they need to integrate. They also tend to prefer proven, commercially supported technologies to research projects. For Cloudera or Hortonworks or MapR or Databricks, there’s still plenty of work to do and plenty of opportunity to bring those large companies up to speed at their own pace.
Spark, and even Hadoop, will have a long and fruitful life inside large companies as the focal point of data efforts ranging from analytics to artificial intelligence.
What I always wonder, though, is what the pace of innovation means for the startups of today and tomorrow, born of the cloud and open source eras. They’re all aiming to become the next large enterprise, and they don’t have technical debt tied to legacy databases or even big data systems. Some newer large companies with open source roots, like Facebook, continue to develop their own technologies when the legacy stack can’t keep up.
It seems like we’re poised to start hearing about the next big thing in big data rather soon. I’m excited to find out what that will be, and also curious to see how long it lasts before its replacement comes along.

Around the web: Artificial intelligence
I’ll give Steven Mnuchin the benefit of the doubt and assume he is not worried about humanoid robots and superintelligent systems taking jobs for at least 50 years. Because automation and machine learning are totally already doing that.
You could take issue with some of these (I might argue AI companies should think about their tech’s effects on workers), but companies should heed the advice about conversational interfaces.
I remember speaking with folks about this 2-3 years ago while at Gigaom, and it’s amazing how far we’ve come.
This article looks at the serious uptick in AI papers published by Google over the past several years, and why that might matter for attracting talent. To be clear, though: Better research does not mean better products.
Korean scientists have created something that’s low-power, physically flexible and possibly ideal for imbuing connected devices with “intelligence.”
This is a good column pushing Canadian companies and research institutions to work on commerce as well as science. If AI has the economic impact some predict, the stakes are huge.
DARPA wants to build an AI system that learns from its past rather than always being retrained for new tasks.  
This obviously has huge potential for investigative purposes. Less so for real-time identification, given the consequences of false positives.
The press release talks about X-rays, so I assume they’re working on baggage screening. I unknowingly flew a roundtrip with a Swiss army knife in my backpack, so maybe they’re onto something.
Sponsor: Datos IO
Around the web: All things data
To be honest, I didn’t realize that Alteryx was going public. It’s an analytics company that’s kind of like a mix between SAS and Tableau.
“Reverse engineer” is a strong term, but the author of this post does show what’s possible using public data sets. He’s able to figure out patterns in power consumption by type of business, which is the first step in optimizing efficiency.
Around the web: Cloud and infrastructure
This story got way more press than I would have expected. I think it’s the realization that China is a huge market that American cloud providers might not own. Also that machine learning is a big deal.
Standardization is probably a good thing, if only so storage providers aren’t forced to choose one platform or scheduler over another. Seems like container companies need to make sure they don’t step on storage providers’ toes, though.
Did you enjoy this issue?
The most interesting news, analysis, blog posts and research in cloud computing, artificial intelligence and software engineering. Delivered daily to your inbox. Curated by Derrick Harris. Check out the Architecht site at
Carefully curated by ARCHITECHT with Revue. If you were forwarded this newsletter and you like it, you can subscribe here. If you don't want these updates anymore, please unsubscribe here.