The Emergence of the Data Lakehouse

by Gary Bordey

Data is the indispensable asset that powers business decisions and enables innovative thinking. In recent years it has become increasingly available, usable, and trusted by organizations and their personnel. Because data is so indispensable, data professionals have developed well-established tools and processes to prepare and consume it. Organizations are finding success with their data initiatives, as noted in the recent NewVantage Partners annual survey, in which 92.1% of organizations reported realizing measurable business benefits, up from just 48.4% in 2017. Of the organizations surveyed, 91.7% indicated that their company planned to increase investments in data and artificial intelligence (AI). As these figures suggest, the economic value that organizations are realizing from their investments is material and incontrovertible. While it is encouraging to see the tremendous progress of the data + AI juggernaut, challenges to sustaining that realized value of data assets remain.

It is not uncommon to find different business groups within an organization employing their own sets of processes, tools, and data. This segregation into specialized silos of data and tools increases complexity and reduces agility as well as speed to insight. Additionally, inefficient governance over such fragmented data landscapes can leave organizations unable to adequately address the dimensions of a properly functioning data governance program. There is also a cost premium for learning how to use, run, and support multiple architectures and platforms.

Can a reimagined data architecture and data platform remedy this predicament in which many organizations find themselves today? Increasingly, organizations are answering by unifying their data architecture and modernizing their data platform.


Reduce the Complexity

Traditionally, organizations employ data warehouses for their business intelligence (BI) use cases, and data lakes to support their artificial intelligence initiatives. In some cases, they also employ data from bespoke arrangements with system vendors.

Data warehouses, data lakes, and bespoke data constructs are each architected differently and typically reside on different platforms. Supporting this myriad of architectures and platforms is a burdensome complexity that creates an inadequate and unsustainable environment. It inhibits an organization's agility to develop data-driven insights, products, and designs, at a time when success is predicated on speed to market.

Complexity is rooted in the presence of multiple architectures and platforms, each specialized to deliver a specific type of data use case. Could there be a way to solve this complexity with a single, unified solution that can deliver on most, if not all, data use cases?

The content, structure, and intent of each data warehouse, data lake, and bespoke data system differ. Each design is optimized to address its own universe of use cases. Data warehouses enable data professionals to efficiently execute BI use cases, while data lakes enable AI use cases. Somewhere in between, and in conjunction with BI and AI platforms, data from bespoke systems addresses its own set of use cases.

Can an evolved architecture consolidate the structures, content, and inherent benefits of the data warehouse, data lake, and data from bespoke systems to promote strengths and mitigate weaknesses of each individual architecture? Could a whole greater than the sum of its parts be realized?

As it turns out, there is one solution that offers potential: enter the data lakehouse. Its design features data structures and data management similar to those found in data warehouses, but with the low-cost, flexible storage of data lakes. All data, whether structured, semi-structured, or otherwise, is welcome to stay. It checks into the lakehouse, and no further movement or proliferation is necessary.
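
To make the idea concrete, here is a minimal sketch of landing both structured and semi-structured data in a single lakehouse storage layer. It assumes a Spark session configured with the open source Delta Lake package (discussed further below); all paths, file names, and table contents are illustrative.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed; these two settings enable
# Delta Lake's SQL extensions and catalog in a standard Spark session.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Structured data: a CSV extract from an operational system (hypothetical path)
orders = spark.read.option("header", True).csv("/raw/orders.csv")
orders.write.format("delta").mode("overwrite").save("/lakehouse/orders")

# Semi-structured data: JSON event logs land in the same storage layer
events = spark.read.json("/raw/events.json")
events.write.format("delta").mode("overwrite").save("/lakehouse/events")

# BI queries and ML feature pipelines can now read the same tables,
# with no further copying or movement of the data
spark.read.format("delta").load("/lakehouse/orders").show(5)
```

Both datasets check in once and are read in place thereafter, which is the essence of the lakehouse pattern.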

Centralize the Data

Data situated across an organization in different depots is difficult to manage, particularly when that data is subject to regulatory mandates. Even when some measure of governance is applied to data located across various domains of an organization, it is difficult to verify if that data remains relevant today, especially when it has been copied and stored multiple times. Situating requisite data in a unified architecture and platform enables organizations to apply governance consistently and to realize—and monetize—the benefits from well-managed data.

There are economies of scale to be realized when data assets are stored in an infrastructure with a common architecture. The data management disciplines can be consolidated and standardized. Automation and self-service are more efficient when all requisite data is known and within reach. The lakehouse keeps data consistent, addressing a long-standing concern with data lake content being copied and living elsewhere. Developing a sound, robust governance program for data assets is easier to achieve and sustain with a data lakehouse, where you can exercise agility and build governance capabilities incrementally, over time.
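
As one illustration of how a lakehouse aids governance, Delta Lake records every change to a table in a transaction log, which supports auditing changes and reproducing past states. The sketch below reuses the Spark session and hypothetical table path from the earlier example.

```python
# Audit trail: every write to the table is recorded with a version number,
# timestamp, and operation type
spark.sql("DESCRIBE HISTORY delta.`/lakehouse/orders`").show(truncate=False)

# Time travel: read the table as it existed at an earlier version,
# useful for audits, reproducibility, and rollback
past_orders = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/lakehouse/orders")
)
```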

Data proliferation and ineffective governance have additional consequences. In their article for the MIT Sloan Management Review, Redman and Davenport note that failure to address data quality "leads [personnel] to waste critical resources (time and money) dealing with mundane issues. Bad data, in turn, breeds mistrust in the data, further slowing efforts to create advantage." How can organizations fix this? One place to start is to support the tools already in use.

Support the Tools Already in Use

Moving to a new architecture or platform has traditionally necessitated the adoption of new processes and tools to perform data-driven duties. New tools and new processes are not trivial investments, and they can prove a difficult hurdle for organizations and their personnel.

With the Delta Lake open source storage framework, the lakehouse can work with existing data lake storage systems and avoid lock-in to a proprietary format. The open nature of the lakehouse allows users to employ various tools and continue using their current applications. The processes that data personnel follow remain largely the same, and consumers of data assets served by the lakehouse experience minimal disruption. If a team's responsibilities include developing and delivering insights in Microsoft® Power BI®, for example, connectors exist so they can keep using their tool of choice.
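
As a sketch of what this continuity looks like in practice, the lakehouse table written earlier can be registered and queried with plain SQL, the same language analysts already use; BI tools such as Power BI then connect through their usual connectors. The table name and the order_date column are illustrative assumptions, and the Spark session comes from the earlier sketch.

```python
# Register the Delta table written earlier so it is addressable by name
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orders
    USING DELTA
    LOCATION '/lakehouse/orders'
""")

# Existing SQL carries over unchanged; a BI tool would issue an
# equivalent query through its own connector
spark.sql("""
    SELECT order_date, COUNT(*) AS order_count
    FROM sales_orders
    GROUP BY order_date
    ORDER BY order_date
""").show()
```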

Support All the Use Cases

Organizations have myriad uses for data, and their success is often predicated on the ability to execute on all of them. Most organizations currently rely on multiple architectures and platforms to deliver business intelligence and artificial intelligence. A unified solution must deliver on all the use cases that organizations execute on their current set of platforms. 

Can a novel architecture featuring a lakehouse platform support all the use cases currently satisfied by data warehouses, data lakes, or other platforms? Very likely, yes. When properly architected and implemented, the data lakehouse is a natural evolution of the data lake and the data warehouse, unifying them into a single solution that simplifies operations and, owing to its cloud design, carries the added benefit of shifting expenditures from capital to operational.

Bring the Data Professionals Together

Bringing technology, processes, data, and tools into a unified construct creates opportunities for efficiencies and synergy, provided organizations also bring their professionals together. We noted earlier that business groups within organizations often have their own teams of data professionals. Data analysts, data scientists, operations analysts, and a host of other titles often work independently of one another.

The convergence of technology, processes, data, and tools in a novel architecture that features a lakehouse confers to an organization the opportunity to unify its data and its people. Compared with the traditional siloed approach, a shared and common approach to data work can facilitate the collaboration necessary to infuse data into an organization's products, services, and decision-making. The virtual barrier once imposed by various technologies, toolsets, data structures, content, and processes is minimized by the lakehouse.

David Waller, who penned an article on creating a data-driven culture for the Harvard Business Review, advises "Don't pigeonhole your data scientists … in addition to dragging data science closer to the business … pull the business toward data science."


Sustained Value Realization

An organization’s investments in BI and AI technologies, as well as in people, processes, and data, undeniably result in measurable economic value. Organizations are bullish on investing in data + AI, and both the initiatives being launched and the investments behind them are healthy and growing. In the healthcare industry, Optum found that 98% of the 500 senior executives it surveyed either have an AI strategy or are planning one.

BI architectures and infrastructures now benefit from years of maturity, so effort can focus on refining use case execution rather than on stability. Likewise, AI and its enabling technologies are now relatively stable, shifting concern from technical considerations such as data volume, velocity, and variety to data science matters. AI now routinely informs the decisions organizations make about their future. Each day, a model is created, refined, and put to good use.

At the current stage of maturity and stability, is there any reason for organizations to look beyond their instantiations of BI and AI designs? I think there are many. After all, no one settles for the proverbial (mature and stable) mousetrap. We all want to build a better design, and so do data-driven organizations, which will push for the rapid evolution of their data designs to bring more data, information, and insights into service delivery, product design, and decision-making.

Organizations are going to want data warehouse performance with data lake economics. They are going to want next-generation data engineering that abstracts away complexity. They will want more real-time data, streamed through a robust and scalable pipe. They will want a unified data science team with all the organization’s data available to accelerate endeavors. They will want to automate operational analytics and reporting. They will want consolidated, secure governance applied to all their data assets. They will want to enrich their data and bring in supplemental data wherever and whenever they find it. They will want to scale on demand without being locked into proprietary technology. They will want all the benefits that a convergence of BI and AI people, process, technology, and data can give them.
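
The real-time ambition on that list is already within reach of the same pattern. As a sketch, Spark Structured Streaming can feed events continuously into the same Delta table that batch BI and AI workloads read; the source directory, schema fields, and paths are illustrative assumptions, and the session comes from the earlier sketches.

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Streaming file sources require an explicit schema (hypothetical fields)
events_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Continuously pick up new JSON files as they arrive
stream = (
    spark.readStream
    .format("json")
    .schema(events_schema)
    .load("/raw/incoming-events/")
)

# Append into the same lakehouse table that batch jobs query; the
# checkpoint lets the stream restart exactly where it left off
query = (
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/lakehouse/_checkpoints/events")
    .outputMode("append")
    .start("/lakehouse/events")
)
```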

Today, advancements in data architecture and technology enable us to realize even further benefits that businesses will continue to seek. Paired with a platform, a novel architecture featuring a data lakehouse will enable data-driven organizations to realize more successes, extending gains previously unattainable in a world of segregated BI and AI constructs. The agility and capabilities enabled by the data lakehouse translate not only into the ability to execute more use cases, but also into speed to market.

Organizations which have invested in BI and AI have realized value. The continued commercial success of BI and AI is paving the way for new architectures and technologies which improve upon incumbent systems. It is against this backdrop that we see the data lakehouse come to the fore, heralding the evolution in the BI and AI space—a convergence which minimizes the limitations of each, while amplifying their respective strengths.

For many organizations, the time is now to consider the evolution of their data capabilities in the quest to become a better, data-driven organization. With maturing BI and AI processes and technologies, organizations can now explore new architectures and platforms to underpin next-generation data capabilities. Organizations now know a lot more than they did when BI and AI initiatives commenced in their respective areas. I think organizations will find that a modern architecture and platform featuring a data lakehouse will elevate capabilities and enable the continued realization of an investment’s value for the foreseeable future.

References

Bean, Randy, et al. [2022]. "Data and AI Leadership Executive Summary 2022," newvantage.com. https://www.newvantage.com/_files/ugd/e5361a_ad5a8b3da8254a71807d2dccdb0844be.pdf.

Bordey, G., [2019]. "Agile in Data Governance Design," Business Intelligence Journal, 23(2), pp. 23-32.

Redman, T., and T. Davenport, [2020]. "Getting Serious About Data and Data Science," sloanreview.mit.edu. https://sloanreview.mit.edu/article/getting-serious-about-data-and-data-science.

Waller, David, [2020]. "10 Steps to Creating a Data-Driven Culture," hbr.org. https://hbr.org/2020/02/10-steps-to-creating-a-data-driven-culture.

Optum, [2021]. "Special Report: Learn What 500 Health Care Leaders Think About AI," optum.com. https://www.optum.com/business/resources/library/2021-ai-survey-report.html.

