Innovation is the driving force in the world of technology. Because of this, few technologies have remained unchanged over time. The relational database, however, is one extremely pervasive exception! An entire Information Management industry has been built around the definition, storage, and consumption of data based on relational databases and tables. While these technologies have improved significantly, they haven’t strayed very far from the basic concept first proposed in 1970.
While this industry has created immense value since 1970, the time has come to innovate, and Big Data technologies are central to this creative destruction. Naturally, we can expect some resistance here: while those who store and consume data have had a steady stream of new tools to learn over the last 47 years, the world of data architects, the people who define data structures, has remained relatively untouched.
Big Data technologies have brought us "schema-on-read", which allows data to be stored in advance of its structure being known: in the Big Data world, one can iteratively define the structure of the data after storing it. This enables a much more agile approach that is one of the most cited reasons for organizations to move into the Big Data world. The idea is that "data architecture is a thing of the past" – the claim is that by using schema-on-read, you get to have your data without incurring the cost of figuring out how to store it! Could it be that we have finally found the elusive free lunch?
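To make the idea concrete, here is a minimal sketch of schema-on-read in plain Python. The raw records and field names are hypothetical; the point is that the data was stored as-is, and the structure (which fields matter, their types, their defaults) is imposed only when someone reads it.

```python
import json

# Raw events were stored exactly as they arrived, with no schema enforced at write time.
raw_records = [
    '{"user": "alice", "amount": "19.99", "ts": "2017-03-01"}',
    '{"user": "bob", "amount": "5.00"}',  # note: no timestamp was captured for this event
]

def read_with_schema(lines):
    """Apply a schema at read time: select fields, coerce types, supply defaults."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "user": rec["user"],
            "amount": float(rec["amount"]),  # stored as a string; coerced on read
            "ts": rec.get("ts"),             # optional field defaults to None
        }

rows = list(read_with_schema(raw_records))
```

A second consumer with different needs could write a different `read_with_schema` over the same raw records, which is exactly the flexibility (and, as discussed below, the duplicated cost) that schema-on-read brings.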
Not so fast! In my experience, the leading practitioners of data architecture rarely articulate its value proposition well, which reinforces the idea that data architecture is merely a cost center that takes too long. New users assume that with schema-on-read you can define your structures when you need to consume them, et voilà, instant results! In practice, this shortcut is not always feasible, for two fundamental reasons: performance and cost. The performance of such an approach is often suspect, especially at large volumes. And when many consumers share a data resource, it makes little sense for each of them to incur the cost of defining their own schema at consumption time; it can also have drastic consequences for data governance: if every consumer defines their own schema, can you be sure they are all interpreting the data the same way? That said, does it make sense to invest in fully defining a data model up front, before you know your consumption patterns?
At Adastra, we recognize that agility is paramount and that working with a sense of urgency is essential for success in today's business world. While we recognize and support the demise of up-front, big bang data architecture, schema-on-read is not a silver bullet, and data architecture still has an important role to play. Data architecture is about so much more than an up-front investment to define data structures that will be used to store data; it's about understanding the data and its relationships and how these can be put together to generate insights. Ask yourself the following question: does it make sense to have each consumer spend time understanding the data and its relationships, only to throw away the fruit of all of this labour and have the next consumer perform the same (or very similar) work again?
There is a better way! Our approach, evolutionary data architecture, is agile and focuses on doing only as much work as is needed to get the job done. The key is to ensure that the fruit of this work can be leveraged by future consumers through the appropriate use of metadata and agile development practices. Adastra’s approach relies on agile, iterative prototyping that uses schema-on-read to quickly understand what needs to be built, and then to build it. If a search of the data catalog for existing applicable structures comes up empty, prototype structures are built from data ingested according to the process discussed in my previous post, using tools that automatically capture all aspects of metadata about these data products: lineage, as well as documentation of assumptions, caveats, and so on. Prototyping also uncovers new and improved validation rules, which are funneled back to the ingestion process to ensure data is of the utmost quality. As insights are generated, they are added to the data catalog for broader consumption.
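The kind of catalog entry described above can be sketched as a small data structure. This is an illustrative toy, not any particular catalog tool; the entry names, fields, and helper functions are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One data product registered in the catalog (illustrative only)."""
    name: str
    lineage: list                                     # upstream datasets it was derived from
    assumptions: list = field(default_factory=list)   # caveats noted while prototyping

catalog = {}

def register(entry):
    """Add a prototyped data product to the catalog for future consumers."""
    catalog[entry.name] = entry

def find(name):
    """A consumer searches the catalog before building anything new."""
    return catalog.get(name)

# A prototype's metadata is captured so the next consumer can reuse the work.
register(CatalogEntry(
    name="monthly_revenue",
    lineage=["raw.orders", "curated.customers"],
    assumptions=["refunds excluded"],
))
```

The next consumer who calls `find("monthly_revenue")` inherits the lineage and caveats instead of rediscovering them, which is the reuse the approach is after.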
To further enable agility, end users do not consume data directly. It's important to decouple data storage and data consumption to enable you to add new insights cost-effectively and without impacting consumers who do not need access to this new data. To accomplish this, we deploy a semantic layer – a simple view on a relational table in many cases – that enables changes to the underlying structures without impacting consumers.
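The decoupling is easy to see with a view over a relational table. The sketch below uses SQLite for portability; the table and column names are invented for the example. Consumers query only the view, so the storage team can restructure the underlying table without breaking them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Underlying storage table, owned by the operational data team.
cur.execute("CREATE TABLE sales_v1 (cust TEXT, amt REAL)")
cur.executemany("INSERT INTO sales_v1 VALUES (?, ?)",
                [("alice", 10.0), ("bob", 5.0)])

# Semantic layer: consumers query this view, never the table directly.
cur.execute("""CREATE VIEW sales AS
               SELECT cust AS customer, amt AS amount FROM sales_v1""")

# The storage team can later evolve the table (here, adding a column);
# the view definition shields consumer queries from the change.
cur.execute("ALTER TABLE sales_v1 ADD COLUMN region TEXT")

# A consumer query, written purely against the semantic layer.
total = cur.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

If the underlying structure ever changes incompatibly, only the view definition needs to be repointed; consumer SQL against `sales` stays the same.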
Deployment of these derived data structures marks the achievement of the next level of semantic ascent after the raw and curated zones (described in Step One) have been established. Our prototyping approach lets us implement business rules, derive analytical master data, implement analytical models, and generate insight. The consumers of the insight may be the experts in what data is needed, but they may not be experts in how this data should be provisioned. Once the required structures are created, it is important that this work be transitioned to an operational team responsible for maintaining, optimizing and provisioning the data. Often, the initial prototypes are good enough; if so, they should be transitioned as they are, in the name of agility. Sometimes you will know in advance what structures will be needed; you can build them and transition them to the operational team even before consumers start prototyping.
So the question is, do we store or do we not store? In short: yes. Build data structures optimized for consumption, and, if they have any potential future value, transition them to the operational team to enable future use. Evolutionary data architecture provides an incremental, agile approach to storing and optimizing consumption data structures, and to setting up a process that encourages reuse across the organization, reduces time-to-market, and enables more robust data consumption.
Stay tuned for the following steps in the "sustainably data driven" series:
- Step 3: Reporting, Dashboarding and Visualization in a Big Data World.
- Step 4: Advanced Analytics that Make a Difference: How to Derive Insight in a Robust and Agile Fashion.