The misunderstood relationship between big data and machine learning

Moving big data from "it's complicated" to the honeymoon phase.

By Markus Noga, head of machine learning at SAP, and Dan Wellers, global lead for digital futures at SAP

It’s no secret that machine learning and big data have emerged as a “power couple” for enterprises looking to leverage new automation technologies. Machine learning trains itself on data, and for a time, that data was scarce. This is no longer a problem. By 2025, the world will create 180 zettabytes of data per year (up from 4.4 zettabytes in 2013), according to IDC.

Big data and machine learning may seem to be a perfect match, coming together at just the right time. Their relationship is often understood in terms of a simple equation: large amounts of data mined = actionable insights that were previously unknown or invisible.

But it’s not that simple. Without a thorough understanding of both the strengths and limitations of the data at hand, having more of it can actually increase the likelihood of making spurious connections.  
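To make that concrete, here is a minimal Python sketch (our illustration, not drawn from the article) of why more data can mean more spurious connections: among enough purely random features, some will correlate strongly with any target by chance alone.

    import numpy as np

    rng = np.random.default_rng(seed=42)
    n_samples = 100
    target = rng.normal(size=n_samples)          # a target with no real relationship to anything
    t = (target - target.mean()) / target.std()  # standardize once for fast correlation

    for n_features in (10, 1_000, 100_000):
        features = rng.normal(size=(n_features, n_samples))  # pure noise, no signal
        f = (features - features.mean(axis=1, keepdims=True)) / features.std(axis=1, keepdims=True)
        corrs = f @ t / n_samples                # Pearson correlation of each feature with the target
        print(f"{n_features:>7} random features -> strongest |correlation| = {abs(corrs).max():.2f}")

Run this, and the strongest chance correlation climbs from roughly 0.2 with 10 features toward 0.5 with 100,000, even though every feature is noise: exactly the kind of previously “invisible insight” that evaporates out of sample.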

Historically, most of the data that businesses analyzed for the purpose of decision-making has been of the structured variety: easily entered, stored and queried. In the digital age, however, the connected world enables the capture and storage of more—and more diverse—data sets than ever before. Nearly 5,000 devices are being connected to the internet every minute today; within ten years, there will be 80 billion devices collecting and transmitting data around the world. As a recent McKinsey Global Institute report noted: “Much of this newly available data is in the form of clicks, images, text, or signals of various sorts, which is very different than the structured data that can be cleanly placed in rows and columns.”

This creates a data-management challenge best summed up by the age-old computing axiom: garbage in, garbage out. To quote UC Berkeley professor and machine learning expert Michael I. Jordan, data variety leads to a decline in data quality—“It’s like having billions of monkeys typing.”

So how do we move big data and machine learning out of the “it’s complicated” zone and into the honeymoon phase? For machine learning tools to work, they need to be fed high-quality data, and they must also be guided by highly skilled humans. Preparing data can be heavy lifting, but it can also be the most important part of a data scientist’s job—one that accounts for as much as 50 percent of his or her time, according to some estimates. In fact, it took one bank 150 people and two years to achieve the data quality necessary to build an enterprise-wide data lake from which advanced analytics tools might drink.
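As a flavor of that heavy lifting, here is a minimal Python sketch (hypothetical column names and records, not the bank’s actual pipeline) of routine preparation steps: deduplicating records, dropping unusable rows, and coercing inconsistently formatted fields.

    import pandas as pd

    raw = pd.DataFrame({
        "customer_id": [101, 101, 102, 103, None],
        "signup_date": ["2017-01-05", "2017-01-05", "2017-01-12", "not available", "2017-02-10"],
        "balance":     ["1,200.50", "1,200.50", "980", None, "-"],
    })

    clean = (
        raw
        .drop_duplicates()                # remove exact duplicate records
        .dropna(subset=["customer_id"])   # records without an ID are unusable
        .assign(
            # parse dates; unparseable values like "not available" become NaT
            signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"),
            # strip thousands separators and coerce to numbers; junk like "-" becomes NaN
            balance=lambda d: pd.to_numeric(
                d["balance"].str.replace(",", "", regex=False), errors="coerce"
            ),
        )
    )
    print(clean)

None of this is glamorous, but every downstream model inherits whatever slips through it, which is why the garbage-in, garbage-out axiom looms so large.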

These challenges are not insurmountable, but they reinforce that big data and machine learning will only be a perfect match with the necessary human intelligence in the mix. Demand for data scientists has reached critical levels and is predicted to grow at double-digit rates for the foreseeable future. Done properly, with the right human workforce, the pairing of big data and machine learning will almost certainly yield huge benefits for enterprises. Over time, companies must work through the complications and drawbacks to reap the long-term rewards of this oft-hyped couple.

This post is based on a piece by Dr. Markus Noga and Dan Wellers published in the online edition of The Digitalist Magazine.

Comments

  • Data quality can be improved by using digital communication signals instead of analog signals such as 4-20 mA; digital signals include error detection and status indication. Learn how digital plants do it from this essay: https://www.linkedin.com/pulse/iiot-control-can-you-trust-your-sensors-jonas-berge Also, in my personal opinion, machine learning (ML) is not a silver bullet. There are many applications where other forms of analytics are more suitable. ML requires a learning period. For instance, to detect and distinguish between different kinds of pump failures, ML has to see several failures of each kind. This will take a long time, since pumps run reliably for years. Moreover, the plant suffers downtime during these failures, which is disruptive and costly. Right? With model-based analytics, by contrast, there is no learning period: all the subject-matter knowledge is built into the ready-made app, which needs only quick configuration and commissioning.
