Data Scientists Frustrated by Data Variety, Find Hadoop Limiting

A survey of data scientists finds that a majority of them believe their work has grown more difficult.

Companies are focusing more and more attention on building out big data analytics capabilities, and data scientists are feeling the pressure.

In a study of more than 100 data scientists released this week, Paradigm4, creator of open source computational database management system SciDB, found that 71 percent of data scientists believe their jobs have grown more difficult as a result of a multiplying variety of data sources, not just data volume.

Notably, only 48 percent of respondents said they had used Hadoop or Spark for their work and 76 percent felt Hadoop is too slow, takes too much effort to program or has other limitations.

"The increasing variety of data sources is forcing data scientists into shortcuts that leave data and money on the table," says Marilyn Matz, CEO of Paradigm4. "The focus on the volume of data hides the real challenge of analytics today. Only by addressing the challenge of utilizing diverse types of data will we be able to unlock the enormous potential of analytics."

Even with the challenges surrounding the Hadoop platform, something has to give. About half of the survey respondents (49 percent) said they're finding it difficult to fit their data into relational database tables. Fifty-nine percent of respondents said their organizations are already using complex analytics -- math functions like covariance, clustering, machine learning, principal components analysis and graph operations, as opposed to 'basic analytics' like business intelligence reporting -- to analyze their data.

Another 15 percent plan to begin using complex analytics in the next year and 16 percent anticipate using complex analytics within the next two years. Only four percent of respondents said their organizations have no plans to use complex analytics.

Paradigm4 believes this means that the "low-hanging fruit" of big data has been exploited and data scientists will have to step up their game to extract additional value.

"The move from simple to complex analytics on big data presages an emerging need for analytics that scale beyond single server memory limits and handle sparsity, missing values and mixed sampling frequencies appropriately," Paradigm4 writes in the report. "These complex analytics methods can also provide data scientists with unsupervised and assumption-free approaches, letting all the data speak for itself."

Sometimes Hadoop Isn't Enough

Paradigm4 also believes Hadoop has been unrealistically hyped as a universal, disruptive big data solution, noting that it is not a viable solution for some use cases that require complex analytics. Basic analytics, Paradigm4 says, are "embarrassingly parallel" (sometimes referred to as "data parallel"), while complex analytics are not.

Embarrassingly parallel problems can be separated into multiple independent sub-problems that can run in parallel -- there is little or no dependency between the tasks and thus you do not require access to all the data at once. This is the approach Hadoop MapReduce uses to crunch data. Analytics jobs that are not embarrassingly parallel, like many complex analytics problems, require using and sharing all the data at once and communicating intermediate results among processes.
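To make the distinction concrete, here is a minimal Python sketch. It is an illustration only, not Hadoop code and not anything from Paradigm4's report: the data, the function names and the use of a local thread pool in place of a cluster are all assumptions. The first function computes a mean from independent per-partition sums, which is the embarrassingly parallel pattern. The second runs a simple one-dimensional k-means clustering, where every round has to merge results from all partitions and send updated centroids back out before the next round can start.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data, split into partitions the way a cluster might hold it.
partitions = [[1.0, 2.0, 3.0], [10.0, 11.0], [12.0, 2.5]]


def partial_sum(part):
    """Map step: each task needs only its own partition."""
    return sum(part), len(part)


def parallel_mean(parts):
    """Embarrassingly parallel: one pass, no communication between tasks."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(partial_sum, parts))
    total = sum(s for s, _ in results)
    count = sum(n for _, n in results)
    return total / count


def assign_and_sum(part, centroids):
    """One k-means round on one partition; depends on the shared centroids."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in part:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def kmeans_1d(parts, centroids, rounds=10):
    """Not embarrassingly parallel: every round merges intermediate results
    from all partitions and redistributes the new centroids."""
    for _ in range(rounds):
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda p: assign_and_sum(p, centroids), parts))
        updated = []
        for k in range(len(centroids)):
            total = sum(r[0][k] for r in results)
            count = sum(r[1][k] for r in results)
            # Keep the old centroid if no point was assigned to it.
            updated.append(total / count if count else centroids[k])
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    print(parallel_mean(partitions))
    print(kmeans_1d(partitions, [0.0, 20.0]))
```

In the mean calculation the workers never need to hear from one another. In the clustering loop, no worker can begin a round until the centroids from the previous round have been combined and shared, which is the kind of intermediate communication the article describes.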

Twenty-two percent of the data scientists surveyed said Hadoop and Spark were not well-suited to their analytics. Paradigm4 also found that 35 percent of data scientists who tried Hadoop or Spark have stopped using it.

Paradigm4's survey of 111 U.S. data scientists was fielded by independent research firm Innovation Enterprise from March 27 to April 23, 2014. Paradigm4 put together this infographic of its survey results.
