Becoming Sustainably Data Driven: Step 3

This post is the third in our "Becoming Sustainably Data Driven" series. Read the other entries here:


Reporting, Dashboarding and Visualization in a Big Data World

In the “olden days” of BI, reports were generated from structured data, most of which came from a relational database whose schemas were predefined to serve a specific purpose. The advent of Big Data has increased not only the volume and variety of data, but its sources as well: increasingly, BI is asked to consume and deliver data from schema-on-read sources. Data now comes from sensors, log files, emails, website clicks, and a host of other unstructured sources. For example, a Hadoop cluster with thousands of nodes can extract data from, and perform text analytics on, a stream of PDF documents that contain a treasure trove of information otherwise inaccessible to traditional, structured databases.

The relative lack of a standardized structure presents a major challenge for BI developers, who must now meet their organization’s demand for meaningful, contextualized reporting not only from traditional Data Warehouses, but also from a Data Lake or an analytics platform.

When developing BI from Big Data, your first decision is which reporting tool you will use: a traditional BI tool, a visualization tool, or one of the new Big Data tools, taking into account your organization’s existing tools and the skills of your employees. Most traditional BI tools now support a Hadoop connection, either with a native connector or a standard ODBC/JDBC connection; in some cases, you might need to connect to the data source through a REST API or a third-party connector. The newer visualization tools also support querying Hadoop via ODBC/JDBC. And, of course, Big Data tools such as Apache Hive and Apache Spark are built to query Hadoop data directly, via SQL on Hive or Spark SQL.
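For illustration, here is a minimal Python sketch of the ODBC route. It assumes a pre-configured Hive ODBC DSN (the name "HiveDSN" and the table sales.orders are hypothetical); the same query could just as easily be issued through a native connector or a JDBC driver.

```python
# Minimal sketch: querying Hive over ODBC from Python, assuming a DSN named
# "HiveDSN" (hypothetical) that points at HiveServer2 through a Hive ODBC driver.
import pyodbc

# autocommit=True is typically needed because Hive does not support
# transactions the way a relational database does.
conn = pyodbc.connect("DSN=HiveDSN", autocommit=True)
cursor = conn.cursor()

# A simple aggregate a BI tool might issue against a (hypothetical) Hive table.
cursor.execute("""
    SELECT region, COUNT(*) AS orders
    FROM sales.orders
    GROUP BY region
""")
for region, orders in cursor.fetchall():
    print(region, orders)

conn.close()
```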

Once you decide on the tool(s), you will need to consider several important aspects:

Security: You will need a comprehensive framework that allows easy and secure access for a growing number of Hadoop users, each with a user identity defined by your security architect. If you use Kerberos for Active Directory (AD) and LDAP authentication, consider the Apache Knox Gateway (Knox) for centralized authentication and access to the Hadoop environment, and use an SSL connection to secure data transfers. You can also set up security specific to Hive and Spark; for example, you can manage access to Hive and/or Spark on an individual basis.
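As a sketch of what a Kerberos-secured connection looks like from a reporting client, the snippet below uses PyHive against a hypothetical HiveServer2 host and assumes a Kerberos ticket has already been obtained with kinit. Routing through Knox or adding SSL would require additional transport configuration not shown here.

```python
from pyhive import hive  # requires the pyhive and sasl packages to be installed

# Hypothetical host; authentication relies on an existing Kerberos ticket (kinit).
conn = hive.Connection(
    host="hive.example.com",       # hypothetical HiveServer2 host
    port=10000,
    auth="KERBEROS",               # authenticate with the current Kerberos ticket
    kerberos_service_name="hive",  # service principal name, commonly "hive"
    database="default",
)

cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())
```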

Data Architecture: A sound data model is crucial to a BI strategy. As we discussed in post 2 of this series, the data model should be defined iteratively and use a semantic layer, so that it can evolve while minimizing the impact on data consumers (including your BI tools themselves). Some organizations opt for a SQL-style structure on Hive or Spark, while others keep all their data in NoSQL or even flat-file formats. A common reporting data flow, sketched in code below, would be:

Source Data → Data Lake → Hive tables → Views on Hive.
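To make that flow concrete, here is a PySpark sketch with Hive support enabled. The database, table, and path names (raw.web_clicks, reporting.daily_clicks, /data/lake/web_clicks) are hypothetical: raw files land in the data lake, are exposed as an external Hive table, and a curated view acts as the semantic layer the BI tool queries.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("reporting-data-flow")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical databases for raw and curated layers.
spark.sql("CREATE DATABASE IF NOT EXISTS raw")
spark.sql("CREATE DATABASE IF NOT EXISTS reporting")

# 1. Expose raw data-lake files as an external Hive table (schema-on-read).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.web_clicks (
        user_id    STRING,
        url        STRING,
        clicked_at TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION '/data/lake/web_clicks'
""")

# 2. Publish a semantic-layer view for BI consumers; the view can evolve
#    without breaking the reports that query it.
spark.sql("""
    CREATE OR REPLACE VIEW reporting.daily_clicks AS
    SELECT to_date(clicked_at) AS click_date, COUNT(*) AS clicks
    FROM raw.web_clicks
    GROUP BY to_date(clicked_at)
""")
```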

Traditional ETL tools can be used to move the data along this flow; however, once your data is on Hive, some fine-tuning may be required. For example, you might end up materializing certain views to improve performance, or, if you wish to perform time-frame analysis, you might consider moving that specific data to a traditional database.
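One lightweight way to materialize a heavy view, sketched below with PySpark and reusing the hypothetical names from the previous snippet, is simply to persist its result as a table on a schedule and point the reports at the persisted copy.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Persist the result of the expensive view as a physical table; a scheduled
# job can refresh it, and reports query the pre-computed copy instead of
# re-aggregating the raw data on every request.
(spark.table("reporting.daily_clicks")      # hypothetical view from the sketch above
      .write
      .mode("overwrite")
      .saveAsTable("reporting.daily_clicks_mat"))
```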

Administration: Capacity planning is also very important: you need to monitor the number of users, the data types (zip files, PDFs, etc.), and the data volumes. On the access-control side, Apache Ranger provides a centralized security framework that allows administrators to define access to files, folders, databases, tables, or columns. Apache Ranger works across the Hadoop Distributed File System (HDFS), Hive, HBase, Storm, Knox, Solr, Kafka, NiFi, and YARN.
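As a simple illustration of the capacity-monitoring side, the sketch below shells out to the standard `hdfs dfs -du` command to report how much space each top-level data-lake directory consumes. The /data/lake path is hypothetical, and a real deployment would more likely pull these numbers from your cluster manager's monitoring tools.

```python
import subprocess

# Summarize the size of each top-level directory under the (hypothetical)
# data-lake root using the standard HDFS CLI; "-du -s" prints a total per path.
result = subprocess.run(
    ["hdfs", "dfs", "-du", "-s", "/data/lake/*"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    print(line)
```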

Performance: The sheer volume of data involved in Hadoop business cases is at the core of the challenge faced by any reporting tool. Existing performance techniques are still valid, but not always feasible. For instance, in cases where you cannot cache your reports in advance, you should include an appropriate selection of prompts or parameters to reduce the amount of data retrieved for an end user. In some cases, creating in-memory cubes can improve performance. Regardless of your approach, fine-tuning the SQL produced by the reporting tool is a must.
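For example, a date-range prompt can be pushed into the query itself so that only the requested slice of data is scanned and returned to the reporting layer. The sketch below illustrates the idea in PySpark, again using the hypothetical reporting.daily_clicks view; the date values stand in for what a report prompt would supply.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Values that would normally come from a report prompt or parameter.
start_date, end_date = "2024-01-01", "2024-01-31"

# Apply the prompt as a filter before any further processing, so only the
# requested slice of data flows back to the reporting layer.
clicks = (spark.table("reporting.daily_clicks")   # hypothetical view
               .where(F.col("click_date").between(start_date, end_date)))

clicks.show()
```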

Self-service: There has been a notable shift in the industry away from depending on a central IT department to develop products and towards a self-service model. Extending this to BI has forced organizations to implement governance frameworks to manage their data effectively, and to evaluate visualization tools for their ability to work within a self-service framework. The newer visualization tools allow experienced employees with minimal technical knowledge to develop a business model that can be used by business end users and decision makers. The self-service model only works in large organizations when it is implemented following good practices, standards, and a governance model. Moreover, when the data resides on Hadoop, its intrinsic complexity, combined with the particular culture of the organization, presents significant challenges that a self-service model must address.

Skills: Traditional BI developers rely on skills in SQL, basic coding, data modeling, and business knowledge. New roles and skills have risen alongside Big Data technologies: Data Scientists and Data Analysts are now in high demand for their facility with now-fundamental tools such as R, Python, Hive, MATLAB, NoSQL, Spark, Java, SPSS, and so on. Creating a Center of Excellence (COE) team is highly recommended if you are planning an ad hoc analysis practice on Hadoop.

Big Data sources have altered fundamental aspects of the workflow and technologies associated with BI. BI developers must reconsider security, performance, data architecture, methodologies, and best practices, and how each must be adapted to the Big Data challenge. In addition, Big Data tools are still evolving alongside the organizations they serve. Only organizations with the right culture, a focus on experimentation, and the right employees and partners will succeed in fully realizing Big Data’s potential.


Elkin Arboleda, Data Visualization and Reporting Practice Lead