[SPARK-55645][SQL][FOLLOWUP] Move serdeName to last parameter and filter empty strings#54860
[SPARK-55645][SQL][FOLLOWUP] Move serdeName to last parameter and filter empty strings#54860cloud-fan wants to merge 4 commits intoapache:masterfrom
Conversation
…ter empty strings
Move serdeName to the last parameter of CatalogStorageFormat with a
default value of None, so that existing callers that construct
CatalogStorageFormat positionally (e.g., third-party connectors like
Hudi) remain source-compatible without code changes.
Also filter empty strings when reading serdeName from the Hive
Metastore API — Hive returns "" for tables without an explicit serde
name, which should map to None rather than Some("").
I agree as I was concerned here but what do you think, @pan3793 ? You were concerned about hidden bugs. |
|
I agree with mapping |
| val options = storage.properties + (ParquetOptions.MERGE_SCHEMA -> | ||
| SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString) | ||
| storage.copy( | ||
| serdeName = None, |
There was a problem hiding this comment.
Sorry if this change was not correct in the original PR, but doesn't unsetting serdeName match the sentiment of unsetting serde here?
There was a problem hiding this comment.
it's just a name for display, so does not really matter.
There was a problem hiding this comment.
but still better to clear it.
| val options = storage.properties | ||
| if (SQLConf.get.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") { | ||
| storage.copy( | ||
| serdeName = None, |
There was a problem hiding this comment.
Same question as line 180.
| ) | ||
| } else { | ||
| storage.copy( | ||
| serdeName = None, |
There was a problem hiding this comment.
Same question as line 180.
|
This is just a name for display in DESC TABLE, I don't think it worths people's attention and effort by breaking source-compatibility. |
|
@cloud-fan, many spark apps/plugins use spark public api with only a few set private api, and they tend to create a unified jar to be compatible with multiple Spark versions. in such cases, source-compatible but binary-incompatible easily cause hidden bugs. one example is in apache kyuubi, we build a thrift server on the driver as a spark app, which is similar to STS and is a unified jar that can run on all Spark 3.x versions. while a source compatible method change of spark method signature breaks our assumption - the worst thing is, though we have built multi-level tests to check the compatibility (compile/run unit tests with different spark versions, compile with the default spark version and submit the app jar to all other supported spark versions, and run integration tests), but it's related to web ui, which is not covered by our integration tests, thus the bug was not caught. for cases like delta and iceberg, which use spark private api heavily, should have shim layer to easily adapt to such change. |
|
I don't get the point, what can do wrong if we miss to set serde name? I won't have agreed to merge the PR at the first place if we break compatibility for such a small feature. |
pan3793
left a comment
There was a problem hiding this comment.
in offline discussion, @cloud-fan provided a good case (scala notebook) for the benefit of keeping source compatibility.
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
Outdated
Show resolved
Hide resolved
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
Outdated
Show resolved
Hide resolved
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
Outdated
Show resolved
Hide resolved
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
| table.storage.inputFormat.map(toInputFormat).foreach(hiveTable.setInputFormatClass) | ||
| table.storage.outputFormat.map(toOutputFormat).foreach(hiveTable.setOutputFormatClass) | ||
| table.storage.serdeName.foreach(hiveTable.getSd.getSerdeInfo.setName) | ||
| table.storage.serdeName.foreach(hiveTable.getTTable.getSd.getSerdeInfo.setName) |
There was a problem hiding this comment.
minor: found all other places use h.getTTable.getSd, so make it consistent here.
What changes were proposed in this pull request?
Two followup improvements to #54467 (SPARK-55645):
Move
serdeNameto the last parameter ofCatalogStorageFormatwith a default value ofNone, so that existing callers that constructCatalogStorageFormatpositionally remain source-compatible without code changes.Filter empty strings when reading
serdeNamefrom the Hive Metastore API — Hive returns""for tables without an explicit serde name, which should map toNonerather thanSome("").Why are the changes needed?
Adding
serdeNameas a required positional parameter in the middle of the parameter list breaks source compatibility for all external callers (e.g., third-party connectors) that constructCatalogStorageFormatpositionally. Moving it to the last position with a default value avoids this.The Hive Metastore returns an empty string for
SerDeInfo.namewhen no serde name is explicitly set. Without filtering,Option("")producesSome("")instead of the semantically correctNone, which could cause unexpected behavior in downstream code that checksserdeName.isDefinedor pattern-matches on it.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a new test
serdeName should be None for tables without an explicit serde namethat verifies the empty string filtering. Existing tests cover the parameter reordering.Was this patch authored or co-authored using generative AI tooling?
Yes