If there is one cliché that makes Wael Elrifai wince, it is that big data is the "new oil".
As Solution Engineering VP for Hitachi Vantara, he develops artificial intelligence (AI)-powered solutions that help clients transform their relationship with data.
The goal: moving from mere reporting to prediction, and then to prescription.
But data, he tells Computer Business Review, is nothing like oil: "It's a terrible metaphor, not least because oil is a finite resource that's getting harder to access, while data is enjoying a production increase that's off the charts. It's just a clunky way to say data is valuable."
To put that production increase in context, 90 percent of the data in the world was generated in the last two years alone. As DOMO's sixth Data Never Sleeps report highlights, over 2.5 quintillion bytes of data are created every single day, and it's only going to grow from there. By 2020, it's estimated that 1.7MB of data will be created every second for every person on earth.
These figures are unlikely to surprise most enterprises - most of which face burgeoning volumes of data - but what does surprise many is the complexity of extracting value from discrete data sets in a multiplicity of formats. Everyone wants to jump on the AI and machine learning (ML) train; nobody wants to get left behind.
But extracting, preparing, and blending enterprise data from industry "silos" (another cliché that Elrifai is keen to see consigned to the history books) is still a major challenge for many industry players - and only the start of the process of generating value from big data.
Many in the industry use Hadoop, the open-source framework that allows developers to build software that analyses big data - for example, social media data on retail trends - to generate predictive answers about future sales. This, in theory, might allow a company to optimise its supply chain amid a move to "fast fashion", in which first-mover advantage based on Instagram reaction to Paris Fashion Week is crucial to sales success.
But even its advocates admit that Hadoop is complex and difficult to use, and the value added is often offset by the opex required for the army of engineers needed to write code against it.
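To make that point concrete, here is a minimal sketch - not any vendor's method - of the kind of hand-rolled code a Hadoop Streaming job involves. The brand watch-list is invented for illustration, and a real deployment would pair this mapper with a reducer that sums the counts per brand.

    #!/usr/bin/env python3
    # mapper.py - a Hadoop Streaming mapper: reads raw social posts on
    # stdin and emits "brand<TAB>1" for every mention of a watched brand.
    # A separate reducer script would sum the counts per brand.
    import sys

    BRANDS = {"acme", "contoso", "globex"}  # hypothetical watch-list

    for line in sys.stdin:
        for word in line.lower().split():
            token = word.strip(".,!?#@\"'")
            if token in BRANDS:
                print(f"{token}\t1")

Multiply that by every data source, format and business question, and the opex Elrifai describes becomes easy to picture.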
And that assumes users are making sure the data is clean enough and diverse enough to power machine learning, in which algorithms get smarter with each batch of data. Because data, Elrifai emphasises, looks very different depending on who you ask.
"Let's say you're an automotive company, You sell 4000 cars and 1000 get returned. How many have you sold? Your sales team will give you one figure; your aftermarket repairs team a very different one - they'll have disparate data sets about different things in different formats."
Understanding the Industry
As a result, to generate meaningful insight from big data, you have to know your client and you have to know the industry before you can apply machine learning techniques such as artificial neural networks (ANNs).
Elrifai offers two examples: the oil industry and the shipping industry. "Many oil companies operating in the Gulf of Mexico face elevated levels of unscheduled downtime; say 14% rather than an industry standard of 3%."
"To make sense of this algorithmically and generate value from your data you need myriad data points: is the geological structure different - i.e. you need to blend geological surveys across sites; you need data points on machinery age; an understanding of the capital structure - when you check that and assess the downtime and costs associated with replacement of legacy equipment, 14% might be good, for example. Or let's say you can generate a model based on industry data that can predict with 100% accuracy the failure rate of engines in a shipping fleet - 20 minutes in advance of the event. What good is that to anyone if it happens in the middle of the ocean? You need to understand the supply chain; the timings of parts delivery, etc."
Coding (or not) for Big Data Insight
While the computational capabilities that make predictive data analytics possible have finally become widely available, the hurdle remains - as touched on above - of coding integrations across a multiplicity of data sources in heterogeneous formats.
Hitachi Vantara, which Elrifai describes as a "$4 billion startup", was established last year by blending the data centre specialist Hitachi Data Systems, the BI and analytics brand Pentaho, and the big data unit Hitachi Insight Group. It aims to tackle this problem head-on.
The company is a powerhouse in the sector (encompassing hardware, software and consulting) with a unique capability: bringing "drag and drop" simplicity to some of the most complex coding challenges out there.
Why This Matters
Data analytics, streaming and intelligence tools go through waves of industry popularity: one year it may be Spark, or Flink, or Hadoop. Vantara's Pentaho toolkit allows users to deploy any of these while coding only once - something Gartner has recognised as reducing the time needed to code data analytics platforms by 85%.
As Elrifai notes: "Spark, for example, needs coding in Scala, Python or Java. Try to find a Scala programmer these days - that's not an easy task!"
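For a sense of what that hand-coding involves, here is a minimal sketch of a Spark job written in Python, of the sort a visual, code-once tool is meant to replace. The file names and schemas are hypothetical; the point is that every source, join and aggregation below is code somebody must write, test and maintain.

    # A minimal PySpark job blending two sources in two formats.
    # File names and schemas are hypothetical, for illustration only.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("blend_demo").getOrCreate()

    # Two sources in two formats: a CSV of orders, a JSON clickstream.
    orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
    clicks = spark.read.json("clickstream.json")

    # Blend them and aggregate revenue by region.
    revenue = (orders.join(clicks, on="customer_id")
                     .groupBy("region")
                     .agg(F.sum("order_value").alias("revenue")))

    revenue.show()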
Hitachi Vantara describes Pentaho as a "tightly coupled data integration and business analytics platform" that helps customers gain value from blended big data far more quickly than other tools. It allows customers to architect big data blends at the source and stream them directly into more complete and accurate analytics - dashboards, reports and interactive visualisations - using an open standards-based architecture that makes it easy to integrate the platform or extend existing infrastructure.
It's a solution that has seen some of the world's most data-intensive companies - Nasdaq, for example - turn to it as an answer to a question that had been asked for years: how can I improve business transparency, mitigate compliance risk, and see things coming that are not yet anywhere on the horizon?
As Elrifai puts it: "I've been coding since I was five years old. I have over 30 years of coding experience, but if you put 1,000 lines of code in front of me, it's still going to take me a while to work out precisely what it does. By using visual interfaces and automating swathes of the back-end of the data engineering process, it's not only orders of magnitude faster and less buggy, but it's self-documenting. Tools like ours can dramatically cut your time to concrete predictions; that might be understanding the precise outcome of what happens on a production line if X change occurs, or optimising your sales team to make sure they understand why X sub-segment of customers is dropping off."
It's smoothing, in short, the path from business intelligence to artificial intelligence.