Data, like water, comes in many forms. The human mind has evolved to filter out most of the data that comes our way because there's simply so much of it.
When you open your eyes and ears, data is everywhere. The color of the wall, the sound of the air conditioning and the smell of your neighbor's coffee are treated like humidity. The water is in the air all the time but it's not useful to pay much attention to it.
When water condenses into fog, it forces you to see it and makes understanding the world around you all the more difficult. Incomplete datasets, corrupted data, bad science, false conclusions and cognitive bias all make you lose your way in the mist.
Data falls like rain. When there's just a little, it is wildly unsatisfying: just enough to make your car dirty and confuse the conversation. You find yourself wiping away the spot on your glasses as somebody spouts some random data point, gleaned from some obscure source.
- Stale water in a shallow pond is dangerous. Data, collected from an unreliable supply, neither cleansed nor normalized and left to grow stagnant, can easily lead to faulty conclusions.
- A steady trickle of water can be just enough to fill a canteen or sustain a woodland ecosystem. Just three data points (the number of emails sent, versus opened, versus clicked) can sustain a marketing program.
- A healthier flow of data in the form of a small creek can be used for bathing. A continuous data flow allows benchmarking and historical comparison. Landing page optimization can be accomplished with steady conversion data.
- A modest river can power a mill to saw wood or grind wheat. A recommendation engine only needs the reliable contribution of a handful of tributaries to increase the value of shopping carts.
- A waterfall can propel a huge waterwheel, and a sufficient influx of information can drive a real-time, dynamic content system.
- A river that's wide and deep enough can support an entire transportation industry. Enough data can float barges and cargo ships in the shape of a collection of cookies from advertising networks, loyalty card program data aggregators, and data brokers.
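To make the steady trickle concrete, here is a minimal sketch of what those three email numbers (sent, opened, clicked) can tell you. The function name `email_funnel_rates` and the sample counts are made up for illustration; they are not drawn from any real campaign.

```python
def email_funnel_rates(sent: int, opened: int, clicked: int) -> dict[str, float]:
    """Turn three raw counts into the ratios a marketer usually watches."""
    if sent <= 0:
        raise ValueError("sent must be a positive count")
    if opened < 0 or clicked < 0:
        raise ValueError("opened and clicked must not be negative")
    return {
        # Share of delivered emails that were opened.
        "open_rate": opened / sent,
        # Share of delivered emails that produced a click.
        "click_through_rate": clicked / sent,
        # Share of opened emails that produced a click (0.0 if nothing was opened).
        "click_to_open_rate": clicked / opened if opened else 0.0,
    }


# Hypothetical campaign: 10,000 sent, 2,200 opened, 330 clicked.
rates = email_funnel_rates(sent=10_000, opened=2_200, clicked=330)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Tracked week over week, those three ratios are the canteen: small, but enough to tell you whether a subject line or a call to action is doing its job.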
When data arrives in expected amounts at anticipated times, it can be captured, channeled and put to use. Irrigation systems, dams and reservoirs provide a feeling of control and allow for construction of an ever-broadening infrastructure of canals and locks. Data warehouses have been built on less trustworthy flows.
Cleanliness is Next to Godliness
Clean water is vital to the success of life, irrigation, running power plants, etc. The definition of 'clean' might change with the purpose: it's OK if there is algae in water that cools a power plant, but it is not acceptable if there are more than 10 parts per billion of arsenic in drinking water.
Data is the same. In a direct mail application, whether you have a person's title (Mr., Mrs., Ms.) is inconsequential… unless you're mailing to doctors. But dirty data will trip you up every time.
As U.S. Chief Data Scientist DJ Patil put it at a First Round CTO Summit, “If you're not thinking about how to keep your data clean from the very beginning, you're f^¢&ed. I guarantee it. Trying to clean it up after the fact will take months at least.”
If you heat water to the boiling point, it can power an entire Industrial Revolution. Data seems to be doing the same thing. From the moment computers could store as well as calculate, data has been collected as fast as storage equipment could be built to hold it.
The Data Lake
As the data from these tributaries trickles through the mills' engines, it all ends up in the lake, behind the dam. As data is let out in a controlled fashion, it powers the turbines of the data industry: those giant engines of data processing with names like Google and Facebook. There will be no drought here.
And, finally, there is a deep pool of water, waiting for the analyst to dive in. Scuba gear and spear gun in hand, the analyst investigates the deep, maps new ground and discovers new species. It's a very exciting time to be a data explorer.
That's why so many of them have been showing up for the eMetrics Summit since 2002. The next opportunity is in Boston, September 27 to October 1, 2015.
A Bridge Too Far
And what of the power of data to carve the next Grand Canyon? What about the glacial melting of structured data? How do we treat waste water in a world becoming more and more privacy conscious?
Those are questions for another time and water under the bridge.