[SPARK-56451][DOCS][SDP] Document how SDP datasets are stored and refreshed #55277
moomindani wants to merge 2 commits into apache:master
Conversation
Hi, @moomindani .
Apache Spark community uses JIRA IDs for bug tracking. Your PR title is wrong.
SPARK-55276 is "Upgrade scala-maven-plugin to 4.9.9", a different issue.
Add a new section to the Spark Declarative Pipelines programming guide that explains the storage and refresh mechanics, including:

- Default table format and how to specify a different format
- How materialized views are refreshed (full recomputation via TRUNCATE + append)
- How streaming tables are refreshed (incremental processing with checkpoints)
- Full refresh behavior for both dataset types
Thank you for pointing that out, @dongjoon-hyun. I've updated the PR title and commit message to use the correct JIRA ID: SPARK-56451. The GitHub issue has been closed.
jaceklaskowski
left a comment
LGTM (with some tiny changes)
> SDP itself does not restrict which table formats can be used. However, the table format must be supported by the configured catalog. For example, a Delta catalog only supports Delta tables, while the default session catalog supports Parquet, ORC, and other built-in formats.
What's a "catalog" here? Table formats are set up via packages on command line when Spark Connect server's started.
Thank you for the feedback. Revised to: "SDP itself does not restrict which table formats can be used. Any table format available in your Spark environment can be specified. By default, tables are created using Spark's default format (parquet), which is configured by spark.sql.sources.default."
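The revised wording can be illustrated with plain Spark SQL DDL (a generic Spark example, not text from the guide itself): the `USING` clause picks a format explicitly, while tables created without one fall back to `spark.sql.sources.default`.

```sql
-- Choose the table format explicitly with USING:
CREATE TABLE events_orc USING orc AS SELECT * FROM raw_events;

-- Change the fallback format for tables created without USING
-- (the default is parquet):
SET spark.sql.sources.default=orc;
```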
> This means that every refresh is a **full recomputation** - there is no incremental or differential update. For tables with large amounts of data, be aware that each pipeline run will reprocess the entire dataset.
>
> Because of this mechanism, the materialized view's underlying table format must support the `TRUNCATE TABLE` operation.
Suggested change:

```diff
-Because of this mechanism, the materialized view's underlying table format must support the `TRUNCATE TABLE` operation.
+Because of this mechanism, the materialized view's underlying table format must support the `TRUNCATE TABLE` operation (e.g., Delta Lake).
```
Thank you for the suggestion. I checked and the built-in formats such as Parquet, ORC, JSON, and CSV also support TRUNCATE TABLE, so I kept this without a specific example to avoid implying it is limited to Delta Lake.
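The TRUNCATE + append semantics discussed here can be sketched as a toy model (plain Python, not SDP code; the function and table names are illustrative):

```python
# Toy model of a materialized view refresh: every run truncates the
# backing table and rewrites the full query result from scratch.
def refresh_materialized_view(table, compute_full_result):
    table.clear()                         # TRUNCATE TABLE
    table.extend(compute_full_result())   # append the recomputed result
    return table

source = [1, 2, 3, 4]
mv = []
refresh_materialized_view(mv, lambda: [x * 2 for x in source])
# mv == [2, 4, 6, 8]

source.append(5)
refresh_materialized_view(mv, lambda: [x * 2 for x in source])
# The entire dataset was reprocessed, not just the new row:
# mv == [2, 4, 6, 8, 10]
```

Note that the second run recomputes all five rows even though only one is new, which is the cost the guide warns about for large tables.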
> 2. New data is appended to the existing table data.
> 3. A checkpoint tracks the processing progress so subsequent runs resume from where the last run left off.
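The incremental semantics in the steps above can be sketched as a toy model (plain Python, not SDP code; names and the doubling transform are illustrative):

```python
# Toy model of a streaming table refresh: a checkpoint records how far
# the previous run got, so each run processes only new source rows and
# appends the results to the existing table data.
def refresh_streaming_table(table, source, checkpoint):
    new_rows = source[checkpoint["offset"]:]   # resume past the checkpoint
    table.extend(x * 2 for x in new_rows)      # append transformed rows only
    checkpoint["offset"] = len(source)         # advance the checkpoint
    return table

source = [1, 2, 3]
st, ckpt = [], {"offset": 0}
refresh_streaming_table(st, source, ckpt)   # st == [2, 4, 6]

source += [4, 5]
refresh_streaming_table(st, source, ckpt)   # only rows 4 and 5 processed
# st == [2, 4, 6, 8, 10]
```

Unlike the materialized-view sketch, the second run touches only the two new rows; that is the incremental behavior the checkpoint enables.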
> Streaming tables require a checkpoint directory on a Hadoop-compatible file system (e.g., HDFS, Amazon S3, Azure ADLS Gen2, Google Cloud Storage, or local file system). The checkpoint directory is configured via the `storage` field in the pipeline spec file.
Suggested change:

```diff
-Streaming tables require a checkpoint directory on a Hadoop-compatible file system (e.g., HDFS, Amazon S3, Azure ADLS Gen2, Google Cloud Storage, or local file system). The checkpoint directory is configured via the `storage` field in the pipeline spec file.
+Streaming tables require a checkpoint directory on a Hadoop-compatible file system (e.g., local file system, HDFS, Amazon S3, Azure ADLS Gen2, Google Cloud Storage). The checkpoint directory is configured via the `storage` field in the pipeline spec file.
```
Applied, thank you.
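For illustration, a pipeline spec fragment setting the checkpoint location might look like the following. Only the `storage` field is taken from the text above; the other field names and the bucket path are assumptions about the spec format, not confirmed SDP fields.

```yaml
# Hypothetical pipeline spec sketch; `name` and the path are illustrative,
# only `storage` is described in the guide text above.
name: my_pipeline
storage: s3://my-bucket/pipelines/my_pipeline   # checkpoint directory root
```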
> ### Full Refresh
>
> You can force a full refresh of specific datasets or the entire pipeline using the `--full-refresh` or `--full-refresh-all` CLI options. A full refresh:
Suggested change:

```diff
-You can force a full refresh of specific datasets or the entire pipeline using the `--full-refresh` or `--full-refresh-all` CLI options. A full refresh:
+You can force a full refresh of specific datasets or the entire pipeline using the `--full-refresh` or `--full-refresh-all` CLI options, respectively. A full refresh:
```
Applied, thank you.
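As a hedged sketch of the CLI usage (the `spark-pipelines` entry point and the `orders` dataset name are assumptions for illustration; check the CLI's own help output for the exact command shape):

```shell
# Full refresh of one named dataset (hypothetical dataset name):
spark-pipelines run --full-refresh orders

# Full refresh of every dataset in the pipeline:
spark-pipelines run --full-refresh-all
```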
Please add
- Clarify table format description: any format available in the Spark environment works
- Reorder checkpoint filesystem examples to list local file system first
- Add "respectively" to the full refresh CLI options description
Added.
### What changes were proposed in this pull request?
Add a new "How Datasets are Stored and Refreshed" section to the Spark Declarative Pipelines programming guide. This section covers:

- Default table format (`parquet` via `spark.sql.sources.default`) and how to specify a different format, with Python and SQL examples
- How materialized views are refreshed (full recomputation via TRUNCATE + append)
- How streaming tables are refreshed (incremental processing with checkpoints)
- Full refresh behavior for both dataset types

### Why are the changes needed?
The current programming guide explains how to define datasets but does not explain how they are stored or refreshed. Users need to understand:

- What `--full-refresh` actually does for each dataset type

Without this information, users cannot make informed decisions about table formats, storage configurations, or pipeline performance.
### Does this PR introduce any user-facing change?
No. Documentation only.
### How was this patch tested?
Documentation change only. Verified the content is accurate by reading the SDP implementation (`DatasetManager.scala`, `FlowExecution.scala`).

### Was this patch authored or co-authored using generative AI tooling?
Yes.