As organizations start on the journey to become sustainably data driven, there is a tendency to focus immediately on data acquisition; the story goes that data will be needed no matter what happens. While this is certainly true, how data is acquired, and by whom, is an important but often ignored aspect of these first steps.
To be leveraged effectively, collected data must be properly documented and stored with accurate and robust metadata, which will enable future consumers to place the data in its appropriate context and thereby generate insights. Similarly, organizations must focus on implementing robust and agile data quality measurement to ensure that data is and continues to be fit for purpose; data is rarely static and is subject to degradation in its quality for a variety of reasons. Additionally, a robust data governance practice is key to overcoming these challenges and achieving your data-driven goals.
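To make "data quality measurement" a little more concrete, here is a minimal sketch of one possible metric: completeness of a single field, checked against a threshold. The names `QualityReport` and `measure_completeness`, and the 95% default threshold, are illustrative assumptions rather than anything prescribed here; a real program would track several dimensions (validity, timeliness, consistency) and re-measure them on a schedule, since quality degrades over time.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    field: str
    completeness: float    # share of records with a non-empty value
    fit_for_purpose: bool  # True when completeness meets the threshold


def measure_completeness(records, field, threshold=0.95):
    """Measure how often `field` is populated across `records`.

    The threshold is a hypothetical default; each consumer of the data
    would normally set its own definition of "fit for purpose".
    """
    total = len(records)
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    completeness = filled / total if total else 0.0
    return QualityReport(field, completeness, completeness >= threshold)


sample = [
    {"postal_code": "M5V 2T6"},
    {"postal_code": ""},
    {"postal_code": None},
    {"postal_code": "K1A 0B1"},
]
report = measure_completeness(sample, "postal_code")
print(report)
```

Running the same measurement after every load, and storing the results alongside the metadata, is one simple way to notice when a source starts to drift.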
Frequently, how data is initially stored is not properly thought out, leading to drastic consequences. With the advent of big data and schema-on-read technologies, it is common for organizations to dump data into data lakes exactly as it exists in the original source system, an approach that often creates data swamps instead. Data standardization is an integral part of the ingestion or acquisition process. The extent to which data needs to be standardized depends on the organization: some organizations deploy entire industry models to standardize data and to serve as a foundation for further consumption. At the very least, we recommend that you evaluate all fields being acquired and classify them according to an Enterprise Taxonomy. Do you need six different formats for postal codes? While diversity can be wonderful, it is certainly not needed in the data formats of our analytical stores. Too much diversity will only hinder the agile enablement of analytics.
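As an illustration of the postal code question, the sketch below collapses several common spellings into a single canonical form at ingestion time. It assumes Canadian-style codes (letter-digit-letter, digit-letter-digit) purely for the sake of example, and `standardize_postal_code` is a hypothetical helper name; the pattern checks shape only, not whether the code actually exists.

```python
import re

# Shape check only (A1A1A1); an assumption made for this example.
_POSTAL_SHAPE = re.compile(r"^[A-Z]\d[A-Z]\d[A-Z]\d$")


def standardize_postal_code(raw):
    """Return the code as 'A1A 1A1', or None if it does not fit the shape."""
    compact = re.sub(r"[\s\-]", "", raw).upper()  # drop spaces and hyphens
    if not _POSTAL_SHAPE.match(compact):
        return None
    return f"{compact[:3]} {compact[3:]}"


variants = ["m5v2t6", "M5V-2T6", " m5v 2t6 ", "M5V 2T6"]
print({standardize_postal_code(v) for v in variants})
```

Values that return `None` are candidates for the validation and rejection handling at ingestion, rather than something to pass silently into the analytical store.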
Validation rules should also be used at the ingestion phase to address issues proactively and to provide confidence in the data. Most data consumption will rely on the curated, validated, and standardized data; however, the raw data is also needed on occasion. Effective maintenance requires agility, even at the level of the rules and taxonomies. The rules used to curate and standardize data evolve over time, and as such may need to be re-run against the raw source data. For this reason we advocate that the raw data be kept along with the separate "curated" version. The raw and curated zones form the foundational layers of semantic ascent.
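The reason for keeping raw data next to the curated copy can be sketched in a few lines. In this hypothetical example, `raw_zone` is an untouched list of source records, rules are plain named functions, and `curate` is an illustrative helper; none of these names come from a specific product. When the rule set evolves from `rules_v1` to `rules_v2`, the curated output is simply rebuilt from the same raw records.

```python
def curate(raw_records, rules):
    """Apply named validation rules; return (curated, rejected) lists.

    Raw records are never modified, so the rules can be re-run later.
    """
    curated, rejected = [], []
    for record in raw_records:
        failures = [name for name, check in rules.items() if not check(record)]
        if failures:
            rejected.append((record, failures))
        else:
            curated.append(dict(record))  # copy, leaving the raw zone intact
    return curated, rejected


raw_zone = [
    {"customer_id": "C001", "age": 34},
    {"customer_id": "", "age": 51},
    {"customer_id": "C003", "age": 130},
]

# First version of the rules, then a later version with one more check.
rules_v1 = {"has_customer_id": lambda r: bool(r.get("customer_id"))}
rules_v2 = {**rules_v1, "plausible_age": lambda r: 0 <= r.get("age", -1) <= 120}

curated_v1, rejected_v1 = curate(raw_zone, rules_v1)
curated_v2, rejected_v2 = curate(raw_zone, rules_v2)  # re-run against raw
print(len(curated_v1), len(curated_v2))
```

Had only the first curated output been kept, the record with the implausible age could not have been re-evaluated without going back to the source system.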
In terms of capabilities, organizations often tackle formal acquisition from the core systems but then drop the proverbial ball by not planning for a robust and agile ad hoc acquisition capability. Typically, external sources, such as census information or weather data, are ignored and left to the wild west of data acquisition. We should note that an analytical process is only as good as its weakest data source; it is important that any useful data be available promptly and be robust.
Now for the crux of the issue: if your organization has accurate and robust metadata and robust and agile data quality measurement, you likely have an effective data governance program, or at least the key foundational aspects of one. If you have taken care to standardize and classify data as part of your ingestion, then you likely have a good handle on the data you are managing. Furthermore, if you have implemented a robust and agile ad hoc acquisition mechanism for the "other" sources, you are well on your way! Now let me let you in on a little secret: the secret sauce for Adastra is automation. We aim to take data ingestion and turn it into an exercise in configuration rather than an exercise in development. Automation gives you improvements in time-to-market and cost savings, as well as massive improvements in the quality of your ingestion processes.
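To give a feel for what "configuration rather than development" can mean, here is a generic, deliberately tiny sketch. It is not a description of Adastra's tooling: the `TRANSFORMS` registry, the `CUSTOMER_FEED` configuration, and the `ingest` function are all hypothetical names invented for this example. The point is that onboarding a new feed means writing a new configuration entry, not new code.

```python
# Registry of reusable, pre-tested transforms (illustrative only).
TRANSFORMS = {
    "strip": str.strip,
    "upper": str.upper,
    "to_int": int,
}

# A feed is described declaratively: source field, target field, transforms.
CUSTOMER_FEED = {
    "fields": [
        {"source": "cust_nm", "target": "customer_name", "transforms": ["strip"]},
        {"source": "prov", "target": "province", "transforms": ["strip", "upper"]},
        {"source": "age", "target": "age", "transforms": ["strip", "to_int"]},
    ]
}


def ingest(record, config):
    """Map one source record to its standardized form, driven by config."""
    out = {}
    for spec in config["fields"]:
        value = record[spec["source"]]
        for name in spec["transforms"]:
            value = TRANSFORMS[name](value)
        out[spec["target"]] = value
    return out


print(ingest({"cust_nm": " Ada ", "prov": " on", "age": " 42 "}, CUSTOMER_FEED))
```

Because every feed runs through the same small engine, fixes and quality checks added to that engine benefit every source at once, which is where the time-to-market and quality gains tend to come from.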
Now surely some of you have attained these goals at an enterprise level. Let me assure you that you are to be commended as you are the cream of the crop. If you are not quite where you want to be you are certainly not alone! Feel free to reach out to see how I may be able to help!
Stay tuned for the following steps in the "sustainably data driven" series:
- Step 2: How to build data structures for efficient and flexible consumption, in an agile manner. To store or not to store, that is the question…
- Step 3: Reporting, dashboarding and visualization in a big data world.
- Step 4: Advanced analytics that make a difference: how to derive insight in a robust and agile fashion.