Today, at its annual Data + AI Summit, Databricks announced that it is open-sourcing its core declarative ETL framework as Apache Spark Declarative Pipelines, making it available to the entire Apache Spark community in an upcoming release.
Databricks launched the framework as Delta Live Tables (DLT) in 2022 and has since expanded it to help teams build and operate reliable, scalable data pipelines end-to-end. The move to open-source it reinforces the company’s commitment to open ecosystems while marking an effort to one-up rival Snowflake, which recently launched its own Openflow service for data integration—a crucial component of data engineering.
Snowflake’s offering taps Apache NiFi to centralize any data from any source into its platform, while Databricks is making its in-house pipeline engineering technology open, allowing users to run it anywhere Apache Spark is supported — and not just on its own platform.
Declare pipelines, let Spark handle the rest
Traditionally, data engineering has been associated with three main pain points: complex pipeline authoring, manual operations overhead and the need to maintain separate systems for batch and streaming workloads.
With Spark Declarative Pipelines, engineers describe what their pipeline should do using SQL or Python, and Apache Spark handles the execution. The framework automatically tracks dependencies between tables, manages table creation and evolution and handles operational tasks like parallel execution, checkpoints, and retries in production.
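For illustration, here is a minimal sketch of what a declarative pipeline definition can look like in Python, using the Delta Live Tables-style decorator API that Spark Declarative Pipelines grows out of. The module name, table names and storage path are illustrative, and the open-source release may expose a slightly different namespace:

```python
import dlt  # Delta Live Tables-style pipelines module; the open-source package name may differ
from pyspark.sql.functions import col

# "spark" is the active SparkSession provided by the pipeline runtime.

@dlt.table(comment="Raw orders loaded from cloud object storage")
def raw_orders():
    # Batch read of JSON files; the path is a placeholder
    return spark.read.json("s3a://example-bucket/orders/")

@dlt.table(comment="Orders with invalid rows filtered out")
def clean_orders():
    # Referencing raw_orders declares a dependency; the framework
    # resolves execution order, table creation and retries itself
    return dlt.read("raw_orders").where(col("order_id").isNotNull())
```

The author only states what each table should contain; the engine infers that clean_orders depends on raw_orders and plans the execution accordingly.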
“You declare a series of datasets and data flows, and Apache Spark figures out the right execution plan,” Michael Armbrust, distinguished software engineer at Databricks, said in an interview with VentureBeat.
The framework supports batch, streaming and semi-structured data, including files from object storage systems like Amazon S3, ADLS or GCS, out of the box. Engineers define both real-time and periodic processing through a single API, with pipeline definitions validated before execution to catch issues early, removing the need to maintain separate systems.
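As a rough sketch of that single-API idea, the same decorator style can mix a streaming ingest step with a batch table that is recomputed on each pipeline update. Again, this follows the DLT-style Python surface from which the framework originates; the schema, paths and names are illustrative only:

```python
import dlt  # DLT-style pipelines module; the open-source package name may differ
from pyspark.sql.functions import count
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", TimestampType()),
])

@dlt.table(comment="Continuously ingested events from object storage")
def events_stream():
    # Streaming read of newly arriving JSON files (placeholder path)
    return spark.readStream.schema(event_schema).json("s3a://example-bucket/events/")

@dlt.table(comment="Per-event-type counts, refreshed on each pipeline update")
def event_counts():
    # A batch table can read from the streaming table above;
    # both kinds of processing are declared through the same API
    return dlt.read("events_stream").groupBy("event_type").agg(count("*").alias("n"))
```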
“It’s designed for the realities of modern data like change data feeds, message buses, and real-time analytics that power AI systems. If Apache Spark can process it (the data), these pipelines can handle it,” Armbrust explained. He added that the declarative approach marks the latest effort from Databricks to simplify Apache Spark.