Databricks Open-Sources Delta Lake to Ensure Reliability of Enterprise Data Lake


Databricks CEO Ali Ghodsi announced that the company is open-sourcing Delta Lake to ensure data integrity / Photo by: Steve Jurvetson via Wikimedia Commons


Databricks has open-sourced its storage layer, Delta Lake, which aims to ensure data integrity as new data flows into a business's data lake by bringing ACID transactions to these vast repositories of data.

Delta Lake had been a proprietary part of the offering from Databricks, the tech firm founded by the original developers of the Apache Spark big data analytics engine. TechCrunch reports that Delta Lake is already used in production by companies such as Viacom, Edmunds, Riot Games, and McGraw Hill.

TechCrunch adds that Delta Lake can enforce specific schemas (which can be altered as necessary), create snapshots, and ingest streaming data or backfill the lake as a batch job.
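To make the ideas of schema enforcement and snapshots more concrete, here is a minimal toy sketch in plain Python. It is not Delta Lake's implementation or API: the `ToyTable` class, the `SchemaMismatchError` exception, and the version numbering are all hypothetical names chosen for illustration. The sketch assumes an append-only log of committed batches, rejects a whole batch if any row violates the schema, and rebuilds a snapshot by replaying the log up to a given version. Schema changes are left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


class SchemaMismatchError(ValueError):
    """Raised when a row does not match the table's declared schema."""


@dataclass
class ToyTable:
    """Hypothetical in-memory table; illustrative only, not Delta Lake."""

    schema: dict[str, type]
    # Append-only log: each entry is one committed batch of rows.
    _log: list[list[dict[str, Any]]] = field(default_factory=list)

    def _validate(self, row: dict[str, Any]) -> None:
        # Column names must match the schema exactly.
        if set(row) != set(self.schema):
            raise SchemaMismatchError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        # Each value must be an instance of the declared type.
        for name, expected in self.schema.items():
            if not isinstance(row[name], expected):
                raise SchemaMismatchError(
                    f"column {name!r} expects {expected.__name__}, "
                    f"got {type(row[name]).__name__}"
                )

    def append(self, rows: list[dict[str, Any]]) -> int:
        """Commit a batch all-or-nothing and return the new version number."""
        batch = [dict(row) for row in rows]
        # Validate every row before touching the log, so one bad row
        # rejects the whole batch and leaves the table unchanged.
        for row in batch:
            self._validate(row)
        self._log.append(batch)
        return len(self._log)

    def snapshot(self, version: int | None = None) -> list[dict[str, Any]]:
        """Return the rows as of a version (0 = empty table, None = latest)."""
        latest = len(self._log)
        upto = latest if version is None else version
        if not 0 <= upto <= latest:
            raise ValueError(f"version must be between 0 and {latest}")
        return [dict(row) for batch in self._log[:upto] for row in batch]


table = ToyTable(schema={"user_id": int, "event": str})
v1 = table.append([{"user_id": 1, "event": "login"}])
v2 = table.append([{"user_id": 2, "event": "purchase"}])

# The second row has a string user_id, so the whole batch is rejected.
try:
    table.append(
        [{"user_id": 3, "event": "logout"}, {"user_id": "4", "event": "login"}]
    )
    rejected = False
except SchemaMismatchError:
    rejected = True
```

In this sketch, `table.snapshot(v1)` returns only the first committed row even after later writes, which is the basic idea behind reading a table as of an earlier version.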

Delta Lake also uses the Spark engine to handle the data lake's metadata, which is often a big data problem in its own right. TechCrunch says Databricks also plans to add an audit trail, among other features, over time.

"Today nearly every company has a data lake they are trying to gain insights from, but data lakes have proven to lack data reliability. Delta Lake has eliminated these challenges for hundreds of enterprises," said Ali Ghodsi, co-founder and CEO at Databricks.

"By making Delta Lake open source, developers will be able to easily build reliable data lakes and turn them into 'Delta Lakes.'"

TechCrunch notes that Delta Lake runs on top of existing data lakes and is compatible with the Apache Spark APIs.

Databricks is still working out how the project will be governed going forward. According to Ghodsi, the company is still exploring various open source project governance models, although the GitHub model is "well understood and presents a good trade-off" between the ability to accept contributions and governance overhead.

"One thing we know for sure is we want to foster a vibrant community, as we see this as a critical piece of technology for increasing data reliability on data lakes. This is why we chose to go with a permissive open source license model: Apache License v2, [the] same license that Apache Spark uses."

As with the Spark project, Databricks plans to accept outside contributions in order to engage that community. Ghodsi said that having the technology used everywhere, on-prem and in the cloud, by both small and large enterprises is the fastest way to develop something that has the potential to become a standard.

Letting the community give direction and contribute to the tool's development is also why the company did not adopt a Commons Clause license. Databricks believes such a license "is restrictive and will discourage adoption," which would run counter to its primary goal of driving adoption on-prem and in the cloud.