Hadoop successor sparks a data analysis evolution

2015 could be the year for Apache Spark

If 2014 was the year that Apache Hadoop sparked the big data revolution, 2015 may be the year that Apache Spark supplants Hadoop with its superior capabilities for richer and more timely analysis.

"There is a strong industry consensus that Spark is the way to go," said Curt Monash, head of the IT analyst firm Monash Research.

"Next year, you will see a lot of [Hadoop] use cases that transcend Hadoop," said Ali Ghodsi, CEO and co-founder of Databricks, a company formed by a number of the creators of Spark that offers a hosted Spark service, as well as technical support for software distributors selling Spark packages.

Spark is an engine for analyzing data stored across a cluster of computers. Like Hadoop, Spark can be used to examine data sets that are too large to fit into a traditional data warehouse or a relational database. Also like Hadoop, Spark can work on unstructured data, such as event logs, that hasn't been formatted into database tables.

Spark, however, goes beyond what Hadoop can easily do, in that it can analyze streaming data as it is coming off the wire.

As such, it can serve as a faster replacement for the Hadoop MapReduce framework for data analysis. In the annual Daytona Gray Sort Challenge, which benchmarks the speed of data analysis systems, Spark easily trumped Hadoop MapReduce, sorting 100 terabytes of records in 23 minutes. Hadoop took over three times as long to execute the same task, about 72 minutes.

At first, real-time processing may not seem like a big distinction. Such capabilities, however, have been used to create entirely new lines of business.

"We've built our intellectual property around Spark," explained ClearStory Data CEO and co-founder Sharmila Shahani-Mulligan. ClearStory Data offers a new business intelligence service that allows teams to assemble a series of data visualizations into a narrative, as if they were a PowerPoint presentation. The data can come from many sources and can be updated as new data comes in.

"People want fast response times. They don't want to wait a day for an answer," Ghodsi said. For instance, Spark could be used to help digital advertisers decide what ad to serve to users based on their last few clicks, rather than on what sites they clicked on a few days or weeks prior. Spark's data processing speed is important, because while the amount of data we collect is growing rapidly, the advancement of computer processing power is tapering off.
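The ad-selection idea can be illustrated with a small sketch in plain Python. It is not Spark code and not how any particular advertiser works; the class name `RecentClicks`, the window size and the category labels are all hypothetical. It only shows the principle of deciding from a user's last few clicks rather than from older history.

```python
from collections import Counter, deque


class RecentClicks:
    """Keeps only a user's most recent clicks (hypothetical helper)."""

    def __init__(self, window=3):
        # A bounded deque silently discards the oldest click once full.
        self.clicks = deque(maxlen=window)

    def record(self, category):
        self.clicks.append(category)

    def pick_ad(self, default="general"):
        # With no clicks yet, fall back to a default ad category.
        if not self.clicks:
            return default
        # Otherwise choose the most frequent category in the window.
        return Counter(self.clicks).most_common(1)[0][0]


user = RecentClicks(window=3)
for category in ["travel", "travel", "shoes", "shoes", "shoes"]:
    user.record(category)

chosen = user.pick_ad()
```

In this example the two older "travel" clicks have fallen out of the window, so the choice reflects only the latest activity. A streaming engine applies the same idea continuously and across many users at once.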

Spark also offers a richer palette of ways to analyze data, Monash said. Hadoop's default analysis engine, MapReduce, chiefly handles one kind of problem, involving the filtering and sorting of data across different servers (the "map" portion of the job) and the summarizing of the results (the "reduce" side of the problem).
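For readers unfamiliar with the pattern, the "map" and "reduce" steps can be sketched in a few lines of plain Python. This is a single-machine illustration, not Hadoop code; the function names and the sample log lines are hypothetical, and a real cluster would spread each step across many servers.

```python
from collections import defaultdict


def map_phase(records):
    """Filter the log lines and emit (server, 1) for each error."""
    for line in records:
        parts = line.split()
        if len(parts) == 2 and parts[1] == "ERROR":
            yield parts[0], 1


def shuffle(pairs):
    """Group the emitted values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Summarize each group, here by adding up its values."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical event log: one "<server> <status>" entry per line.
logs = ["web1 ERROR", "web2 OK", "web1 ERROR", "db1 ERROR"]
error_counts = reduce_phase(shuffle(map_phase(logs)))
```

Here `error_counts` holds the number of errors per server. The fixed shape of the job (filter, group, summarize) is what makes this style of processing a natural fit for one kind of problem and a harder fit for others.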

In contrast, Spark was designed to tackle more complex queries involving techniques of machine learning and predictive modeling, among others. "Things that Hadoop MapReduce was pretty good at, Spark is potentially better at," Monash said.

Another early adopter of Spark has been music streaming service Spotify, which uses the technology with a set of machine learning algorithms to generate playlists tailored to each user's tastes.

Even Hadoop users are getting the message. Hadoop distributor Cloudera, which also includes Spark in its releases, has about 60 enterprise customers using Spark in some form or another, according to Monash. Other Hadoop distributors, notably Hortonworks and MapR, also offer Spark in their distributions.

The Spark project was started in 2009 at the University of California, Berkeley's AMPLab (the AMP stands for Algorithms, Machines and People). Now under the guidance of the Apache Software Foundation, the project gets more contributions than any other Apache software project. Core contributors include engineers and developers from companies such as Intel, Yahoo, Groupon, Alibaba and Mint.

Spark can be used in conjunction with Hadoop, to analyze data on the Hadoop Distributed File System (HDFS), or it can be run on its own. Developers build applications on Spark using the Python, Java or Scala programming languages.

"Part of the attraction of Spark is that it has a pretty nice API [application programming interface] that makes it accessible to use for developers and engineers," said Reynold Xin, a Databricks co-founder.

We will see many more products and services based on Spark next year, predicted Databricks' Ghodsi. Programmers are often asked about their Spark chops.

"We've had multiple [job] candidates out there say that they have seen multiple exciting Spark projects," Ghodsi said.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
