[SPARK-56451][DOCS][SDP] Document how SDP datasets are stored and refreshed #55277
moomindani wants to merge 2 commits into apache:master
Conversation
Hi, @moomindani .
Apache Spark community uses JIRA IDs for bug tracking. Your PR title is wrong.
SPARK-55276 is "Upgrade scala-maven-plugin to 4.9.9", a different issue.
Add a new section to the Spark Declarative Pipelines programming guide that explains the storage and refresh mechanics, including:

- Default table format and how to specify a different format
- How materialized views are refreshed (full recomputation via TRUNCATE + append)
- How streaming tables are refreshed (incremental processing with checkpoints)
- Full refresh behavior for both dataset types
Thank you for pointing that out, @dongjoon-hyun. I've updated the PR title and commit message to use the correct JIRA ID: SPARK-56451. The GitHub issue has been closed.
jaceklaskowski
left a comment
LGTM (with some tiny changes)
> SDP itself does not restrict which table formats can be used. However, the table format must be supported by the configured catalog. For example, a Delta catalog only supports Delta tables, while the default session catalog supports Parquet, ORC, and other built-in formats.
What's a "catalog" here? Table formats are set up via packages on command line when Spark Connect server's started.
Thank you for the feedback. Revised to: "SDP itself does not restrict which table formats can be used. Any table format available in your Spark environment can be specified. By default, tables are created using Spark's default format (parquet), which is configured by spark.sql.sources.default."
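The revised wording can be illustrated with plain Spark SQL DDL (a generic Spark example, not text from the guide itself): the `USING` clause picks a format explicitly, while tables created without one fall back to `spark.sql.sources.default`.

```sql
-- Choose the table format explicitly with USING:
CREATE TABLE events_orc USING orc AS SELECT * FROM raw_events;

-- Change the fallback format for tables created without USING
-- (the default is parquet):
SET spark.sql.sources.default=orc;
```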
> This means that every refresh is a **full recomputation** - there is no incremental or differential update. For tables with large amounts of data, be aware that each pipeline run will reprocess the entire dataset.
>
> Because of this mechanism, the materialized view's underlying table format must support the `TRUNCATE TABLE` operation.
Suggested change:

```diff
-Because of this mechanism, the materialized view's underlying table format must support the `TRUNCATE TABLE` operation.
+Because of this mechanism, the materialized view's underlying table format must support the `TRUNCATE TABLE` operation (e.g., Delta Lake).
```
Thank you for the suggestion. I checked and the built-in formats such as Parquet, ORC, JSON, and CSV also support TRUNCATE TABLE, so I kept this without a specific example to avoid implying it is limited to Delta Lake.
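The TRUNCATE + append semantics discussed here can be sketched as a toy model (plain Python, not SDP code; the function and table names are illustrative):

```python
# Toy model of a materialized view refresh: every run truncates the
# backing table and rewrites the full query result from scratch.
def refresh_materialized_view(table, compute_full_result):
    table.clear()                         # TRUNCATE TABLE
    table.extend(compute_full_result())   # append the recomputed result
    return table

source = [1, 2, 3, 4]
mv = []
refresh_materialized_view(mv, lambda: [x * 2 for x in source])
# mv == [2, 4, 6, 8]

source.append(5)
refresh_materialized_view(mv, lambda: [x * 2 for x in source])
# The entire dataset was reprocessed, not just the new row:
# mv == [2, 4, 6, 8, 10]
```

Note that the second run recomputes all five rows even though only one is new, which is the cost the guide warns about for large tables.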
> 2. New data is appended to the existing table data.
> 3. A checkpoint tracks the processing progress so subsequent runs resume from where the last run left off.
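The incremental semantics in the steps above can be sketched as a toy model (plain Python, not SDP code; names and the doubling transform are illustrative):

```python
# Toy model of a streaming table refresh: a checkpoint records how far
# the previous run got, so each run processes only new source rows and
# appends the results to the existing table data.
def refresh_streaming_table(table, source, checkpoint):
    new_rows = source[checkpoint["offset"]:]   # resume past the checkpoint
    table.extend(x * 2 for x in new_rows)      # append transformed rows only
    checkpoint["offset"] = len(source)         # advance the checkpoint
    return table

source = [1, 2, 3]
st, ckpt = [], {"offset": 0}
refresh_streaming_table(st, source, ckpt)   # st == [2, 4, 6]

source += [4, 5]
refresh_streaming_table(st, source, ckpt)   # only rows 4 and 5 processed
# st == [2, 4, 6, 8, 10]
```

Unlike the materialized-view sketch, the second run touches only the two new rows; that is the incremental behavior the checkpoint enables.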
> Streaming tables require a checkpoint directory on a Hadoop-compatible file system (e.g., HDFS, Amazon S3, Azure ADLS Gen2, Google Cloud Storage, or local file system). The checkpoint directory is configured via the `storage` field in the pipeline spec file.
Suggested change:

```diff
-Streaming tables require a checkpoint directory on a Hadoop-compatible file system (e.g., HDFS, Amazon S3, Azure ADLS Gen2, Google Cloud Storage, or local file system). The checkpoint directory is configured via the `storage` field in the pipeline spec file.
+Streaming tables require a checkpoint directory on a Hadoop-compatible file system (e.g., local file system, HDFS, Amazon S3, Azure ADLS Gen2, Google Cloud Storage). The checkpoint directory is configured via the `storage` field in the pipeline spec file.
```
Applied, thank you.
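For illustration, a pipeline spec fragment setting the checkpoint location might look like the following. Only the `storage` field is taken from the text above; the other field names and the bucket path are assumptions about the spec format, not confirmed SDP fields.

```yaml
# Hypothetical pipeline spec sketch; `name` and the path are illustrative,
# only `storage` is described in the guide text above.
name: my_pipeline
storage: s3://my-bucket/pipelines/my_pipeline   # checkpoint directory root
```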
> ### Full Refresh
>
> You can force a full refresh of specific datasets or the entire pipeline using the `--full-refresh` or `--full-refresh-all` CLI options. A full refresh:
Suggested change:

```diff
-You can force a full refresh of specific datasets or the entire pipeline using the `--full-refresh` or `--full-refresh-all` CLI options. A full refresh:
+You can force a full refresh of specific datasets or the entire pipeline using the `--full-refresh` or `--full-refresh-all` CLI options, respectively. A full refresh:
```
Applied, thank you.
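As a hedged sketch of the CLI usage (the `spark-pipelines` entry point and the `orders` dataset name are assumptions for illustration; check the CLI's own help output for the exact command shape):

```shell
# Full refresh of one named dataset (hypothetical dataset name):
spark-pipelines run --full-refresh orders

# Full refresh of every dataset in the pipeline:
spark-pipelines run --full-refresh-all
```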
Please add
- Clarify table format description: any format available in the Spark environment works
- Reorder checkpoint filesystem examples to list local file system first
- Add "respectively" to the full refresh CLI options description
Added.
### What changes were proposed in this pull request?
Add a new "How Datasets are Stored and Refreshed" section to the Spark Declarative Pipelines programming guide. This section covers:

- Default table format (`parquet` via `spark.sql.sources.default`) and how to specify a different format, with Python and SQL examples
- How materialized views are refreshed (full recomputation via TRUNCATE + append)
- How streaming tables are refreshed (incremental processing with checkpoints)
- Full refresh behavior for both dataset types

### Why are the changes needed?
The current programming guide explains how to define datasets but does not explain how they are stored or refreshed. Users need to understand:

- What `--full-refresh` actually does for each dataset type

Without this information, users cannot make informed decisions about table formats, storage configurations, or pipeline performance.
### Does this PR introduce any user-facing change?
No. Documentation only.
### How was this patch tested?
Documentation change only. Verified the content is accurate by reading the SDP implementation (`DatasetManager.scala`, `FlowExecution.scala`).

### Was this patch authored or co-authored using generative AI tooling?
Yes.