Data and AI Summit 2023: Delta Lake 3.0 UniForm, Unifying analytics and AI on your data

Teepika R M
5 min read · Jul 2, 2023

At the Data and AI Summit 2023, Databricks made an exciting announcement about Delta Lake 3.0, the latest release of the Linux Foundation's open source Delta Lake project. The Linux Foundation is a non-profit organization that fosters the growth of open source software and technologies. In this post, we will take a detailed look at UniForm, one of the most promising features released as part of Delta Lake 3.0.

Let’s unravel the technical jargon to gain a deeper understanding of the underlying concepts.

The Databricks Lakehouse Platform:

Data Lake + Data Warehouse = Lakehouse

The Databricks Lakehouse Platform is a comprehensive platform from Databricks that brings together the best features of a data warehouse and a data lake.

What is a data warehouse?

A data warehouse offers a consolidated and structured view of data, facilitating informed business decision-making. But it works well only for structured data, whereas modern enterprises often work with raw, unstructured data characterized by high variety, velocity, and volume. As a result, the traditional data warehouse model may not be well suited to many use cases.

What is a data lake?

A data lake is a vast repository designed to store large amounts of diverse data, including raw, unstructured, semi-structured, and structured data. Unlike traditional data storage systems, data lakes prioritize flexibility over rigid schema enforcement: organizations can store raw and unprocessed data without predefined schemas, which allows easy adaptation to evolving data requirements. However, because of that same flexibility, data lakes may lack data quality measures and do not support traditional transactional capabilities such as ACID properties.

Benefits of combining a data warehouse & a data lake:

The combination of a data warehouse and a data lake creates a unified platform that offers numerous benefits. The lakehouse platform, which brings together the features of both, provides a single system for storing and processing structured, semi-structured, and unstructured data. By leveraging data lake capabilities such as schema-on-read and support for raw data, as well as data warehouse capabilities like optimized query performance and support for structured data, organizations can effectively harness their data for a wide range of use cases. Additionally, the lakehouse platform enables governance policies and data quality checks to be applied to both structured and unstructured data. Databricks' ultimate goal is a unified lakehouse platform with these combined capabilities, offering enterprises a powerful solution for data management, analysis, and governance.

Apache Spark, Delta Lake, and MLflow form the foundation of the Lakehouse and provide the features necessary for its functionality.

Apache Spark: open source engine for powerful distributed computing

Delta Lake: open source storage layer that brings ACID properties and reliability to data lakes

MLflow: open source platform for managing the ML life cycle

Since this post is about Delta Lake, let's take a closer look at it.

Delta Lake: high-performance ACID table storage over cloud object stores

Delta Lake adds a reliable and scalable storage layer to data lakes. The added value it brings includes ACID transaction capabilities, schema enforcement, time travel, metadata management, and data versioning. By adding these features to data lakes, it bridges the gap between traditional data warehouses and raw data lakes, resulting in a unified lakehouse architecture.
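To make this concrete, here is a minimal PySpark sketch of writing a Delta table and using time travel. It assumes a Spark session already configured with the delta-spark package; the path and column names are purely illustrative.

```python
from pyspark.sql import SparkSession

# Assumes Spark is configured with the delta-spark package
# (the Delta SQL extension and catalog settings are already in place).
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Write a small DataFrame as a Delta table; the transaction log under
# _delta_log is what provides ACID guarantees and versioning.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta_table")

# Appends (and updates, schema checks, etc.) go through the same log.
spark.createDataFrame([(3, "carol")], ["id", "name"]) \
    .write.format("delta").mode("append").save("/tmp/demo_delta_table")

# Time travel: read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta_table")
v0.show()
```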

So why do companies struggle to make the move towards a lakehouse?

A lakehouse enables using a single copy of data for various use cases with the right tool, and avoids locking data into proprietary formats. When data is locked in a proprietary format, it is stored in a format specific to a particular software product or vendor. This creates limitations for companies: proprietary data may be difficult to access with other tools or systems, which limits integration. By moving towards a lakehouse, companies can avoid the pitfalls of proprietary formats and store data in open formats that are not tied to a single software product or vendor. This frees them from a limited set of choices and lets them pick the best tools for their needs. However, choosing an open format for the lakehouse is itself a complex decision. Companies don't want to lock into an open format and later face the difficulty of moving the data around; they want to make the right choice upfront to save costs, picking a data format that works well for ETL, business intelligence, real-time streaming, and artificial intelligence workloads.

There are different open formats available to add reliability and ACID properties to data lakes; Apache Hudi, Apache Iceberg, and Delta Lake are the notable ones. Companies don't want to lock into one of these formats and regret it later when the situation demands migrating to another. This hesitation hinders companies from making the move towards a lakehouse architecture.

Delta Universal Format (UniForm) comes to the rescue!

Before the introduction of the UniForm feature in Delta Lake 3.0, Delta Lake, Hudi, and Iceberg each required their own copy of the data. That means data had to be either duplicated or converted to be compatible with the other frameworks.

The UniForm feature added in Delta Lake 3.0 enables a single copy of data in one format (Delta) to be used by different query engines, i.e., by different data lake management frameworks, without additional copies or conversions.
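As a rough sketch, UniForm is enabled through Delta table properties. The property name below follows the Delta Lake 3.0 / Databricks UniForm documentation, and the schema and table names are hypothetical; check the documentation of your runtime for the exact properties and supported formats.

```python
# Create a Delta table with UniForm enabled so that Iceberg-compatible
# metadata is generated alongside the Delta transaction log.
# NOTE: property name per the Delta 3.0 UniForm docs; table name is hypothetical.
spark.sql("""
    CREATE TABLE analytics.events (
        event_id BIGINT,
        event_type STRING,
        event_ts TIMESTAMP
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```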

Hudi, Iceberg, and Delta Lake are all built on top of Parquet data files, and each generates its own metadata based on its own specification; this is what results in three different storage formats. UniForm exploits this fact and generates a layer of metadata that is compatible with all three. This means a table written in the Delta format can be accessed by Iceberg and Hudi readers through that shared metadata layer, resulting in seamless interoperability between them.

A real-world example where the UniForm feature plays a major role:

Consider a scenario where different teams within an organization use different data processing frameworks. Some teams might prefer Apache Hudi for its incremental data processing capabilities, while others might prefer Apache Iceberg for its schema evolution features. But data still needs to be shared and accessed across these teams. In this case, the data can be stored in the Delta format with UniForm enabled, which allows it to be queried through Hudi, Iceberg, or Delta Lake readers. Each team can use its preferred framework without any hassle, as shown in the sketch below. The UniForm feature of Delta Lake 3.0 promotes interoperability.
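For instance, the Iceberg-leaning team could point an Iceberg-aware Spark session at the very same table. The sketch below assumes the Apache Iceberg Spark runtime is on the classpath and that an Iceberg catalog named `shared` has already been configured to see the table; the catalog and table names are hypothetical.

```python
# Read the same underlying data as an Iceberg table through a configured
# Iceberg catalog (e.g. spark.sql.catalog.shared = org.apache.iceberg.spark.SparkCatalog,
# plus its warehouse/connection settings). No copy or conversion of the data is needed.
iceberg_df = spark.table("shared.analytics.events")  # hypothetical catalog/table names
iceberg_df.groupBy("event_type").count().show()
```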


Teepika R M

AWS Certified Big Data Specialty | Linux Certified Kubernetes Application Developer | Hortonworks Certified Spark Developer | Hortonworks Certified Hadoop Developer