feat: add support for add_column with backfill #91
base: main
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
...base_2.12/src/main/java/com/lancedb/lance/spark/extention/ExtendedDataSourceV2Strategy.scala
NamedReference segmentId = Expressions.column(LanceConstant.ROW_ADDRESS);
SortValue sortValue =
    new SortValue(segmentId, SortDirection.ASCENDING, NullOrdering.NULLS_FIRST);
return new SortValue[] {sortValue};
Should both keys, _fragid and _rowaddr, be sorted here?
The physical plan is:
== Physical Plan ==
CommandResult <empty>
+- AppendData org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$1782/1752894940@6f7b8ae1, com.lancedb.lance.spark.write.AddColumnsWrite@6c8d8b60
+- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
*(2) Sort [_rowaddr#11L ASC NULLS FIRST], false, 0
+- AQEShuffleRead coalesced
+- ShuffleQueryStage 0
+- Exchange hashpartitioning(_fragid#12, 200), REPARTITION_BY_COL, [plan_id=84]
+- *(1) Project [_rowaddr#11L, _fragid#12, id#8, (id#8 * 100) AS new_col#39]
+- *(1) ColumnarToRow
+- BatchScan add_column_table[id#8, _rowaddr#11L, _fragid#12] class com.lancedb.lance.spark.read.LanceScan RuntimeFilters: []
+- == Initial Plan ==
Sort [_rowaddr#11L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(_fragid#12, 200), REPARTITION_BY_COL, [plan_id=68]
+- Project [_rowaddr#11L, _fragid#12, id#8, (id#8 * 100) AS new_col#39]
+- BatchScan add_column_table[id#8, _rowaddr#11L, _fragid#12] class com.lancedb.lance.spark.read.LanceScan RuntimeFilters: []

The sort operator runs after the distribution, so I think we can sort the data by _rowaddr only.
Let's say the data is first repartitioned by _fragid, which guarantees that all rows from the same fragment end up in the same partition.
But if the sort doesn’t include _fragid, rows inside that partition may get interleaved? E.g., (1,0), (2,0), (1,1), (2,2)
But if the sort doesn’t include _fragid, rows inside that partition may get interleaved? E.g., (1,0), (2,0), (1,1), (2,2)
_rowaddr is a u64 composed of:
frag_id (32 bits) + row_index (32 bits)
So I think rows will not get interleaved.
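As a sanity check, here is a minimal Java sketch of that bit layout (assuming the 32/32 split described above; the helper names are hypothetical, not from this PR):

// Hypothetical helpers illustrating the assumed _rowaddr layout:
// fragment id in the high 32 bits, row index in the low 32 bits.
// Because frag_id occupies the high bits, an ascending sort on the
// u64 row address can never interleave rows from different fragments.
static long rowAddress(int fragId, int rowIndex) {
  return ((long) fragId << 32) | (rowIndex & 0xFFFFFFFFL);
}

static int fragIdOf(long rowAddr) {
  return (int) (rowAddr >>> 32);
}

static int rowIndexOf(long rowAddr) {
  return (int) rowAddr;
}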
The sort operator runs after the distribution, so I think we can sort the data by _rowaddr only.
If the view has 3 new columns, namely new_col1, new_col2, new_col3, and we only want to do backfill for new_col1 and new_col2, will new_col3 be excluded from the shuffle stage?
Dataset<Row> df2 = result.withColumn("new_col", functions.expr("id * 100"));

// Write back with backfill option
df2.write()
I am a bit hesitant about this user experience... we are clearly inventing something new here, which is great, but overloading the overwrite mode feels wrong to me. So far the "norm" is that we invent something in SQL, and eventually DataFrame operations catch up. For example, there was MERGE INTO in SQL, and in Spark 4.0 there are now merge DataFrame operations. So to me a SQL extension feels "official", whereas using write overwrite feels much more like a hack.
df2.createOrReplaceTempView("backfill_data")
spark.sql("ALTER TABLE ADD COLUMNS col1, col2 AS SELECT * FROM backfill_data")

I guess that brings the complexity of needing to add the SQL extension and related parsers, but we probably need that anyway for features like compaction.
Another thing: compared to this, there is technically also the add-column-using-a-function experience, which would be harder to express in a DataFrame, but we could invent SQL like:

spark.sql("ALTER TABLE ADD COLUMNS col1 AS my_udf(col2)")
I agree that a custom DSL/SQL will deliver a better user experience. For now, changing the semantics of overwrite/merge-into seems okay to me.
For the UDF scenario, if the computed column only depends on the current dataset (and is 1:1 row-aligned), the UDF approach is more intuitive. It's worth noting that there are also cases where we need to combine external data to compute new columns (e.g., joining with external tables or lookup services), where the DataFrame style offers greater expressiveness.
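For the external-data case, a rough DataFrame sketch (the path and the `lookup`/`category` names are hypothetical; `result` is assumed to be the DataFrame read from the Lance table, as in this PR's tests, with `spark` in scope):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// The new column comes from a join against an external lookup table,
// which is hard to phrase as a single per-row UDF but natural as a DataFrame op.
Dataset<Row> lookup = spark.read().parquet("/path/to/lookup.parquet"); // hypothetical external source
Dataset<Row> df2 = result.join(lookup, "id")
    .select(result.col("*"), lookup.col("category").as("new_col")); // "category" is illustrative
// df2 is then written back with the backfill option, as elsewhere in this PR.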
I greatly appreciate your suggestion. I also think using SQL extensions feels more official.
}

@Override
public SortOrder[] requiredOrdering() {
Shouldn't the row address naturally be ascending within a fragment? I don't think we need to force an ordering?
Shouldn't the row address naturally be ascending within a fragment? I don't think we need to force an ordering?

The data in the original table may be processed in a distributed manner, which can change the order of the records. So it is necessary to force an ordering by _rowaddr.
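For context, a hedged sketch of how the distribution and ordering requirements pair up through Spark's standard DSv2 write interface; the column names are taken from the plan above, but the class shape is illustrative, not this PR's actual AddColumnsWrite:

import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.NullOrdering;
import org.apache.spark.sql.connector.expressions.SortDirection;
import org.apache.spark.sql.connector.expressions.SortOrder;
import org.apache.spark.sql.connector.expressions.SortValue;
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

class BackfillWriteSketch implements RequiresDistributionAndOrdering {
  @Override
  public Distribution requiredDistribution() {
    // Cluster by fragment id so that every row of a fragment lands in one task.
    return Distributions.clustered(new Expression[] {Expressions.column("_fragid")});
  }

  @Override
  public SortOrder[] requiredOrdering() {
    // Within each task, restore the fragment's original row order; sorting by
    // _rowaddr alone suffices because frag_id sits in its high 32 bits.
    return new SortOrder[] {
      new SortValue(
          Expressions.column("_rowaddr"),
          SortDirection.ASCENDING,
          NullOrdering.NULLS_FIRST)
    };
  }
}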
Force-pushed 555bd7f to 33a412c
Force-pushed f2d0827 to fb84266
@jackye1995 @qidian99 This PR is ready. Could you please review it? Thank you.
Force-pushed 9553e73 to d65a315
.withColumn("new_col1", functions.expr("id * 100"))
.withColumn("new_col2", functions.expr("id * 2"));

df2.createOrReplaceTempView("tmp_view");
Can we create the temporary view using pure Spark SQL syntax, e.g., CREATE VIEW [view_id] AS [query], to verify that users can leverage this functionality entirely through Spark SQL?
Additionally, it would be helpful to add another test case where some of the columns targeted for backfill already exist, to ensure the error handling behaves as expected.
Also, maybe another case where the source view/table is not aligned with the Lance dataset, e.g., 10 records in Lance but 9 records in the view, and expect an error to be thrown?
@qidian99 Thanks for your suggestions.
- I added test cases for pure Spark SQL and for columns that already exist.
- If the newly added records do not match the original dataset, two scenarios must be considered:
  - a) If the new column is nullable (e.g., String), the missing rows will contain null.
  - b) If the new column is non-nullable (e.g., Int32), an error will be raised.
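To make the two scenarios concrete, a hypothetical test shape using only standard Spark APIs (the actual write step with this PR's backfill options is elided; `result` is assumed to be the 10-row table from the tests):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// 10 rows in the Lance table, but the backfill source drops one of them.
Dataset<Row> misaligned = result.filter("id != 9")
    // (a) a String column: if the Lance column is nullable, the missing row backfills as null
    .withColumn("new_col_str", functions.lit("x"))
    // (b) an Int32 column: if the Lance column is non-nullable, the write should raise an error
    .withColumn("new_col_int", functions.expr("CAST(id AS INT)"));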
import org.apache.spark.sql.connector.catalog._
import org.apache.spark.sql.execution.{SparkPlan, SparkStrategy}

case class ExtendedDataSourceV2Strategy(session: SparkSession) extends SparkStrategy
Consider renaming to LanceDataSourceV2Strategy, or LanceSparkStrategy?
If the view has 3 new columns, namely new_col1, new_col2, new_col3, and we only want to do backfill for new_col1 and new_col2, will new_col3 be excluded from the shuffle stage?

I made an optimization to AddColumnsBackfillExec: if some columns don't need to be added, a new Project is introduced so that only the columns being added are shuffled.
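At the DataFrame level, the effect of that Project is roughly the following (column names hypothetical, matching the examples above; the real change lives in the physical plan, not user code):

// Only the row address / fragment id and the columns actually being backfilled
// survive into the shuffle; untouched columns are pruned beforehand.
Dataset<Row> pruned = df2.select("_rowaddr", "_fragid", "new_col1", "new_col2");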
Consider renaming to LanceDataSourceV2Strategy, or LanceSparkStrategy?
Good suggestion. I renamed it to LanceDataSourceV2Strategy.
Force-pushed 31d8010 to 97cd432
Force-pushed 97cd432 to b089dde
Looks like we are making a lot of progress, let me know whenever this is ready for another pass!

@jackye1995 I have tested this feature in a Spark cluster. I think this PR is ready. Could you please review it again? Thanks a lot.
Related issue: #32
To add a column with backfill data, some configuration must be set.
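A hypothetical sketch of the kind of setup this refers to ("spark.sql.extensions" is Spark's standard hook for injecting planner rules; the extension class name below is an assumed placeholder, not confirmed by this PR):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("lance-backfill")
    // Hypothetical extension class name; the package path is taken from the diff above.
    .config("spark.sql.extensions", "com.lancedb.lance.spark.extention.LanceSparkSessionExtensions")
    .getOrCreate();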