The digital economy has spurred a move from systems of records to systems of intelligence. Every interaction that a human or machine has with the digital world leaves behind a trace. This digital footprint carries significant information, and it provides fertile ground for building systems of intelligence.
Today, an increasing number of enterprises are inclined to make data-driven decisions, and a renewed focus on data strategy is gaining ground. The quality and robustness of our insights, however, depend directly on the quality of the data. The following simile provides a helpful way of viewing data in this context:
“Data is just like crude. It’s valuable, but if unrefined it cannot really be used.” – Clive Humby (2006)
How to refine data is an important consideration. Data-driven insights are generated from data that is both structured (collected as part of our systems of records) and unstructured (non-transactional data such as social media, video, speech, and text content that traditional systems of records do not capture). Combining data from such disparate sources is difficult to achieve, because their formats and meanings differ, and almost impossible to maintain. What’s more, data pollution, leakage, and spillages can have serious repercussions (much like oil spills).
It’s dangerous for machine learning implementations to rely on an external data source that lacks a clear authoritative origin, or whose collection methods are poorly understood. This can create hidden debt that leads to non-scalable, unstable systems.
Finding a safe, data-driven approach is therefore important, and the quicker you find it, the better. The impact of machine learning on humankind is growing, and governments around the world are enacting new regulations to ensure transparency around the type of data enterprises use, how they use their insights, and how privacy should be protected. Hence, standardization and data governance focus on fixing problems at their source, which puts the responsibility on the organizations that own the data.
On this point, this second comparison of data mining with oil exploration is illuminating:
“The difference between oil and data is that the product of oil does not generate more oil, whereas the product of data will generate more.” – Piero Scaruffi (2016)
Data is the raw material for building systems of intelligence. It’s vitally important that enterprises are equipped with good-quality data so that they gain consistent value on their journey toward becoming data-driven organizations. That way, data can serve as a clean natural resource for your machine learning implementation.
Published in The D!gitalist Magazine by Paul Pallath on July 17, 2017. Dr. Pallath is the Chief Data Scientist & Senior Director with the Advanced Analytics Organisation at SAP.