mydump,importer: skip redundant CastColumnValue for parquet import #67054

Open
joechenrh wants to merge 37 commits into pingcap:master from joechenrh:optimize-encoding-v2

Conversation

@joechenrh
Contributor

@joechenrh joechenrh commented Mar 16, 2026

What problem does this PR solve?

Issue Number: ref #xxx

Problem Summary: During parquet-based IMPORT INTO, every column value goes through table.CastColumnValue() even when the parquet physical/logical type already guarantees the datum is correct under strict SQL mode. This adds unnecessary overhead on the hot encoding path.

What changed and how does it work?

This PR adds a skip-cast optimization for parquet imports that bypasses table.CastColumnValue() when the parquet source type is guaranteed to produce a correct datum.

Layer responsibilities:

  1. mydump/Row — Added a SkipCast []bool field to carry per-column skip-cast decisions alongside datum values. nil when no skip-cast info is available (CSV/SQL formats).

  2. mydump/ParquetParser (parser side) — Makes all skip-cast decisions:

    • Precheck phase: At parser init, computes a columnSkipCastPrecheck per column based on parquet physical/logical type vs target TiDB column type. Determines whether a column can unconditionally skip cast, needs a per-value post-check (string length, decimal precision), or must always cast.
    • Per-row phase: fillSkipCast() applies post-checks on each datum to produce the final SkipCast slice.
    • Target column info ([]*model.ColumnInfo) is threaded through to enable signedness-aware integer conversion and temporal type FSP matching.
  3. executor/importer (encoder side) — Consumes Row.SkipCast via a centralized fileColIdxForInsertCol mapping that translates between file-column indices and insert-column indices. When skipCast[i] is true, assigns the datum directly; otherwise falls through to table.CastColumnValue().

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Improve `IMPORT INTO` performance for parquet files by skipping redundant type casting when the parquet source type already guarantees correct datum values.

Summary by CodeRabbit

  • New Features

    • Per-row skip-cast metadata now flows through import and encoding, enabling selective casting.
    • Parquet imports carry target-column metadata for more accurate type-aware reads.
  • Improvements

    • Skip-cast optimization to avoid unnecessary conversions.
    • Better temporal handling (fractional-second precision, UTC adjustments).
    • Stronger numeric range/unsigned handling and stricter string/decimal validation to prevent invalid casts/truncation.
  • Tests

    • Added extensive unit tests covering skip-cast logic, temporal behavior, numeric/string/decimal validations.

joechenrh and others added 5 commits March 14, 2026 08:02
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add temporal truncation to setTemporalDatum (DATE zeroes time, DATETIME
  truncates sub-FSP nanoseconds)
- Fix integer setter signedness dispatch based on target column signedness
- Internalize skip-cast types (columnSkipCastPrecheck, skipCheckKind)
- Broaden string targets to CHAR (non-binary) and VARBINARY
- Implement fillSkipCast in ParquetParser called from ReadRow
- Per-value post-checks for string (charset+length) and decimal (precision/frac)
- Results communicated via Row.SkipCast []bool

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Precalculate fileColIdxForInsertCol[] once at encoder init. The
canSkipCastColumnValue function centralizes the file->insert column
mapping lookup. No per-row remapping needed.

Also pass targetColumns to NewParquetParser so skip-cast prechecks
can be built at parser init time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix Int64/None setter to dispatch on target signedness
- Add early null check in fillSkipCast (skip cast for nulls)
- Add intest.Assert in initInsertColFileMapping for mapping invariant
- Update spec: signed→unsigned integer not eligible (range includes negatives)
- Add VARBINARY byte-length-exceeded test
- Add null value post-check test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 16, 2026
@pantheon-ai

pantheon-ai bot commented Mar 16, 2026

Review failed due to infrastructure/execution failure after retries. Please re-trigger review.


@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 16, 2026
@tiprow

tiprow bot commented Mar 16, 2026

Hi @joechenrh. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.


@coderabbitai

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Adds per-column and per-row "skip-cast" metadata and logic for Parquet import; extends TableKVEncoder.Encode to accept per-row skipCast flags; threads target column metadata from parsers into encoders and tests to enable selective bypass of type conversions.

Changes

Cohort / File(s) Summary
Encoder API & Call Sites
pkg/dxf/importinto/conflict_resolution_test.go, pkg/dxf/importinto/conflictedkv/deleter_test.go, pkg/dxf/importinto/conflictedkv/handler.go, pkg/dxf/importinto/conflictedkv/handler_test.go, pkg/executor/importer/kv_encode_test.go
Updated all Encode call sites to the new signature including a second argument (tests pass nil): Encode(row, skipCast, rowID). Reflects Encode signature change across tests and handlers.
Executor — Encode Integration
pkg/executor/importer/chunk_process.go, pkg/executor/importer/kv_encode.go
Propagated per-row skipCast []bool in rowToEncode; TableKVEncoder gains insertColToFileIdx and currentSkipCast; Encode now accepts skipCast []bool; added initInsertColFileMapping and canSkipCastColumnValue.
Parquet Parser & Type Conversion
pkg/lightning/mydump/parquet_parser.go, pkg/lightning/mydump/parquet_type_converter.go, pkg/lightning/mydump/parquet_type_skipcast_test.go
Parser now accepts/propagates targetColumns, builds per-column skip-cast prechecks and produces per-row SkipCast []bool. Converted setters now return (bool, error) to signal skip-cast and include target-aware casting checks (temporal, numeric unsigned, decimal, string). Large test coverage added for skip-cast behavior.
Metadata & Runtime Plumbing
pkg/lightning/mydump/loader.go, pkg/dxf/importinto/subtask_executor.go, pkg/lightning/mydump/parser.go
Added TargetColumns []*model.ColumnInfo to ParquetFileMeta; subtask_executor attaches target columns to Parquet metadata; Row struct now includes SkipCast []bool.
Tests, BUILD & Test Helpers
pkg/lightning/mydump/BUILD.bazel, pkg/lightning/mydump/parquet_parser_test.go
Updated BUILD deps to include //pkg/meta/model and test deps; added test helper newInt96 and minor test expectation adjustments.
sequenceDiagram
    participant Client as Import Client
    participant Parser as Parquet Parser
    participant ChunkProc as Chunk Processor
    participant Encoder as TableKVEncoder
    participant Mapper as Column Mapper

    Client->>Parser: NewParquetParser(targetColumns)
    Parser->>Parser: buildSkipCastPrechecks(targetColumns)
    Parser->>Parser: ReadRow() -> lastRow.SkipCast
    ChunkProc->>Parser: parserEncodeReader.Read()
    ChunkProc->>ChunkProc: assemble rowToEncode{row, rowID, skipCast}
    ChunkProc->>Encoder: Encode(row, skipCast, rowID)
    Encoder->>Encoder: set currentSkipCast = skipCast
    Encoder->>Mapper: canSkipCastColumnValue(insertColIdx)?
    alt skip allowed
        Mapper-->>Encoder: use raw value
    else
        Mapper-->>Encoder: perform cast/convert
    end
    Encoder-->>ChunkProc: return KV pairs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through columns, sniffed each cast,
A nudge of metadata made decisions fast.
Skip a conversion where safe and true,
Parquet seeds to KV — a lighter chew.
Crunchy bytes, carrot-tuned view. 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 7.14%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Description check — ❓ Inconclusive: the description covers the problem, solution approach, layer responsibilities, and includes a release note, but the Issue Number field shows 'ref #xxx' (a placeholder) rather than an actual linked issue. Resolution: replace 'Issue Number: ref #xxx' with an actual issue reference (e.g., 'close #67054' or 'ref #XXXXX').

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: the title clearly and specifically describes the main optimization, skipping redundant CastColumnValue for parquet import, which is the central focus of the PR.



@hawkingrei
Member

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. do-not-merge/needs-linked-issue labels Mar 16, 2026
joechenrh and others added 3 commits March 16, 2026 03:43
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@joechenrh
Contributor Author

/retest

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@joechenrh joechenrh changed the title mydump,executor/importer: skip redundant CastColumnValue for parquet import mydump,importer: skip redundant CastColumnValue for parquet import Mar 16, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/lightning/mydump/parquet_type_skipcast_test.go (1)

172-212: Add a negative signed→unsigned regression here.

The new target-aware integer setters are only exercised with positive inputs. A -1 case against an unsigned target would lock down the no-wraparound behavior that this optimization depends on.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_type_skipcast_test.go` around lines 172 - 212,
Add a regression subtest in TestIntSetterSignedness that exercises a negative
signed→unsigned conversion to lock down no-wraparound behavior: create a
convertedType for Int32, build an unsigned target via
newParquetTargetColumnInfo(..., mysql.UnsignedFlag, ...), get the setter with
getInt32Setter, call setter(-1, &d) and assert no error and that the Datum
preserves the signed negative value (types.KindInt64 and d.GetInt64() == -1) so
the code does not wrap -1 into a large uint64.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/executor/importer/kv_encode.go`:
- Around line 85-92: The file→insert-column mapping (built in
TableKVEncoder.initInsertColFileMapping and used via fileColIdxForInsertCol to
decide skipping CastColumnValue) must fail closed in production instead of using
intest.Assert; update initInsertColFileMapping (and the same logic in the other
initialization block around the second occurrence) to validate that every
insertColumns entry has a matching fieldMappings entry and return an error (or
panic) if any mapping is missing, removing the intest.Assert dependency; ensure
TableKVEncoder construction in the calling site handles the returned error (or
accept the panic) so production cannot silently proceed with wrong skip-cast
bits.

In `@pkg/lightning/mydump/parquet_type.go`:
- Around line 608-613: The setters are reinterpreting source signed integers as
unsigned based on target.HasUnsignedFlag (e.g., the branch that currently does
d.SetUint64(uint64(uint32(val))) for val int32); change them to preserve the
source integer semantics instead of flipping signedness: for signed source
params (e.g., val int32) always set the Datum with the signed representation
(d.SetInt64(int64(val))), and for unsigned source params use the unsigned
representation, letting CastColumnValue handle any overflow/signedness
conversions. Update the analogous branches referenced (the current one plus the
similar branches at the other locations noted) to stop casting signed source
values to uint64 and instead use the source-typed int64/uint64 setters.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e49ac364-d597-40be-95e1-1b7ccc63d3e6

📥 Commits

Reviewing files that changed from the base of the PR and between 0763fed and 37c6e24.

📒 Files selected for processing (18)
  • lightning/pkg/importer/chunk_process.go
  • lightning/pkg/importer/get_pre_info.go
  • pkg/dxf/importinto/conflict_resolution_test.go
  • pkg/dxf/importinto/conflictedkv/deleter_test.go
  • pkg/dxf/importinto/conflictedkv/handler.go
  • pkg/dxf/importinto/conflictedkv/handler_test.go
  • pkg/executor/importer/chunk_process.go
  • pkg/executor/importer/import.go
  • pkg/executor/importer/kv_encode.go
  • pkg/executor/importer/kv_encode_test.go
  • pkg/lightning/mydump/BUILD.bazel
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_parser_test.go
  • pkg/lightning/mydump/parquet_type.go
  • pkg/lightning/mydump/parquet_type_converter.go
  • pkg/lightning/mydump/parquet_type_skipcast_test.go
  • pkg/lightning/mydump/parquet_wrapper.go
  • pkg/lightning/mydump/parser.go
💤 Files with no reviewable changes (1)
  • pkg/lightning/mydump/parquet_type_converter.go

joechenrh and others added 7 commits March 16, 2026 04:24
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ocations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
…ng through methods

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Set TargetColumns in importMinimalTaskExecutor.Run instead of passing
as a separate parameter to NewParquetParser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 42.44186% with 198 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.7596%. Comparing base (0763fed) to head (8526db9).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67054        +/-   ##
================================================
+ Coverage   77.6966%   77.7596%   +0.0630%     
================================================
  Files          2013       1945        -68     
  Lines        551311     543539      -7772     
================================================
- Hits         428350     422654      -5696     
+ Misses       121229     120211      -1018     
+ Partials       1732        674      -1058     
Flag Coverage Δ
integration 43.6090% <31.6715%> (-4.5465%) ⬇️


Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 48.7501% <ø> (-12.1143%) ⬇️

joechenrh and others added 2 commits March 16, 2026 04:58
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joechenrh and others added 11 commits March 16, 2026 05:10
Merge physical type into columnType struct and remove separate
physicalTypes arg from buildSkipCastPrechecks test calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
…pCast

- Add target *model.ColumnInfo field to columnSkipCastPrecheck
- Remove separate targetCols field from ParquetParser
- Pre-allocate skipCast []bool at init time (same length as file columns)
- Simplify fillSkipCast by removing bounds/cap checks
- Fix postCheckString binary path regression

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
prechecks is always the same length as colTypes (file column count).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
castRequired, castSkipAlways, castCheckString, castCheckDecimal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Boolean values (0 or 1) fit in any integer type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All integer converted types share the same int32 physical range.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All integer converted types share the same int64 physical range.
Remove unused unsignedRangeFitsTarget.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… InsertColumns

FieldMappings is 1:1 with file columns, while InsertColumns includes
SET clause columns and excludes user-var columns — the indices don't
align with file column positions.

Also skip buildSkipCastPrechecks when TargetColumns is nil (lightning path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parquet file columns may not match FieldMappings length.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
pkg/executor/importer/kv_encode.go (1)

260-278: ⚠️ Potential issue | 🟠 Major

Production should fail-closed on column ID mismatch.

The intest.Assert at lines 268-269 only fires in test builds. In production, if fieldMappings and insertColumns have drifted (column ID mismatch), the wrong skipCast bit would be used, potentially causing silent data corruption.

This was flagged in a previous review. Consider converting to a hard error:

🔧 Proposed fix
-func (en *TableKVEncoder) initInsertColFileMapping() {
+func (en *TableKVEncoder) initInsertColFileMapping() error {
 	en.insertColToFileIdx = make([]int, len(en.insertColumns))
 	insertIdx := 0
 	for fileIdx, mapping := range en.fieldMappings {
 		if mapping == nil || mapping.Column == nil {
 			continue
 		}
 		if insertIdx < len(en.insertColToFileIdx) {
-			intest.Assert(mapping.Column.ID == en.insertColumns[insertIdx].ID,
-				"fieldMapping column ID mismatch with insertColumns")
+			if mapping.Column.ID != en.insertColumns[insertIdx].ID {
+				return errors.Errorf(
+					"field mapping column id %d does not match insert column id %d at index %d",
+					mapping.Column.ID, en.insertColumns[insertIdx].ID, insertIdx,
+				)
+			}
 			en.insertColToFileIdx[insertIdx] = fileIdx
 		}
 		insertIdx++
 	}
 	// SET clause columns have no file column
 	for ; insertIdx < len(en.insertColToFileIdx); insertIdx++ {
 		en.insertColToFileIdx[insertIdx] = -1
 	}
+	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/kv_encode.go` around lines 260 - 278, The code
currently uses intest.Assert in TableKVEncoder.initInsertColFileMapping which
only runs in tests; convert that check into a production hard-fail by replacing
the intest.Assert with a runtime check that compares mapping.Column.ID and
en.insertColumns[insertIdx].ID and calls panic or a fatal logger with a clear
message and the two IDs when they differ, before assigning
en.insertColToFileIdx[insertIdx] = fileIdx; ensure you still handle nil
mapping/Column and preserve the subsequent logic that fills -1 for remaining
insertColToFileIdx entries.
🧹 Nitpick comments (1)
pkg/lightning/mydump/parquet_parser.go (1)

310-311: Consider documenting the skipCast slice reuse semantics.

The skipCast slice (line 311) is allocated once and reused across ReadRow calls. When LastRow().SkipCast is accessed, it returns a reference to this shared buffer.

Current usage in kv_encode.go is safe because Encode consumes skipCast synchronously and clears its reference in the defer. However, this contract is implicit.

Consider adding a comment to document this:

// skipCast holds per-file-column skip-cast decisions for the current row.
// WARNING: This slice is reused across ReadRow calls. Callers must consume
// LastRow().SkipCast before the next ReadRow call.
skipCast []bool

Also applies to: 566-567

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_parser.go` around lines 310 - 311, Add a clear
comment by the skipCast field documenting its reuse semantics: explain that
skipCast is a per-file-column buffer populated by ReadRow and returned via
LastRow().SkipCast, but it is a shared slice reused across successive ReadRow
calls so callers must consume or copy its contents before the next ReadRow;
reference the ReadRow, LastRow().SkipCast and kv_encode.go:Encode usage patterns
as examples of safe synchronous consumption. Also add the same explanatory
comment at the other occurrence noted (around lines 566-567) so both
declarations carry the warning.
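The reuse semantics flagged above can be demonstrated with a minimal, self-contained sketch; the types here are illustrative stand-ins, not the real ParquetParser. A caller that retains the returned slice past the next read sees it clobbered; copying first is safe.

```go
package main

import "fmt"

type parser struct {
	skipCast []bool // shared buffer, overwritten in place by every readRow
}

func (p *parser) readRow(flags []bool) []bool {
	copy(p.skipCast, flags)
	return p.skipCast // aliases the shared buffer
}

func main() {
	p := &parser{skipCast: make([]bool, 2)}
	row1 := p.readRow([]bool{true, true})
	saved := append([]bool(nil), row1...) // copy before the next readRow: safe
	p.readRow([]bool{false, false})       // clobbers the buffer row1 aliases
	fmt.Println(row1, saved)              // prints [false false] [true true]
}
```

This is why the implicit contract matters: the encoder's synchronous consumption in Encode is safe, but any caller that defers consumption must copy.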

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 89b6c96f-5385-42b3-88fa-a04c61640005

📥 Commits

Reviewing files that changed from the base of the PR and between 37c6e24 and 4c07c5f.

📒 Files selected for processing (7)
  • pkg/dxf/importinto/subtask_executor.go
  • pkg/executor/importer/kv_encode.go
  • pkg/lightning/mydump/BUILD.bazel
  • pkg/lightning/mydump/loader.go
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_type_converter.go
  • pkg/lightning/mydump/parquet_type_skipcast_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/lightning/mydump/BUILD.bazel

joechenrh and others added 2 commits March 16, 2026 07:55
Physical type is int32 so all integer converted types share the same
[MinInt32, MaxInt32] range — use default case with signedRangeFitsTarget.

Also fix inverted nil check in buildSkipCastPrechecks (== nil → != nil).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unsigned converted types produce [0, MaxUint32] or [0, MaxUint64] via
the setter, not the signed physical range.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 595-606: The int64 setter currently bases signedness on
target.GetFlag() and uses inverted logic compared to the int32 case; change it
to use the source converted type's unsignedness (the same check used in the
int32 setter/precheck) instead of target.GetFlag(), and ensure the branch maps
unsigned -> d.SetUint64(uint64(val)) and signed -> d.SetInt64(val) (use
mysql.HasUnsignedFlag on the source converted-type flag or the variable
representing the source converted type).
- Around line 539-549: The setter currently chooses SetUint64 vs SetInt64 based
on the target column signedness (using target.GetFlag()), which breaks skip-cast
paths; change it to use the source parquet converted type signedness (use
isUnsignedConvertedType(converted.converted)) so the closure preserves source
semantics (call SetUint64(uint64(uint32(val))) when the source converted type is
unsigned, otherwise call SetInt64(int64(val))); this ensures CastColumnValue
still performs any target-side conversion. Update the default branch that
returns the func(val int32, d *types.Datum) error to check
isUnsignedConvertedType(converted.converted) instead of target.GetFlag().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: dedb444e-5ccc-4600-94c9-9e8ab922504b

📥 Commits

Reviewing files that changed from the base of the PR and between 4c07c5f and cdc56f0.

📒 Files selected for processing (1)
  • pkg/lightning/mydump/parquet_type_converter.go

- Flatten parquetColumnPrecheck into switch on physical type
- Rename to int32CanSkipCast/int64CanSkipCast/decimalCanSkipCast/stringCanSkipCast
- Use mysql.IsIntegerType and types.Integer{Signed,Unsigned}{Lower,Upper}Bound
- signedRangeFitsTarget rejects unsigned targets early
- Collapse integer setter default cases by target signedness
- Clamp numColumns to min(fileColumns, len(TargetColumns)) in NewParquetParser

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/lightning/mydump/parquet_parser.go (1)

239-257: ⚠️ Potential issue | 🔴 Critical

Guard precheck indexing to avoid panic when target columns are absent.

Line 256 unconditionally reads prechecks[idx], but prechecks is nil unless meta.TargetColumns is set. This can panic in regular parquet parser flows (e.g., sampling).

🔧 Proposed fix
 func (rgp *rowGroupParser) init(colTypes []columnType, loc *time.Location, prechecks []columnSkipCastPrecheck) (err error) {
@@
 	for idx := range numCols {
 		tp := meta.Schema.Column(idx).PhysicalType()
-		iter := createColumnIterator(tp, &colTypes[idx], loc, prechecks[idx].target, readBatchSize)
+		var target *model.ColumnInfo
+		if idx < len(prechecks) {
+			target = prechecks[idx].target
+		}
+		iter := createColumnIterator(tp, &colTypes[idx], loc, target, readBatchSize)
 		if iter == nil {
 			return errors.Errorf("unsupported parquet type %s", tp.String())
 		}

Also applies to: 403-403, 704-707

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_parser.go` around lines 239 - 257, The init
method on rowGroupParser accesses prechecks[idx] unconditionally which can panic
when prechecks is nil or shorter than the number of columns; update
rowGroupParser.init to guard access to prechecks (and the other occurrences at
the same pattern) by checking len(prechecks) > idx and prechecks != nil before
using prechecks[idx].target, and pass a nil or default target value to
createColumnIterator when no precheck exists; ensure the loop over columns uses
for idx := 0; idx < numCols; idx++ (instead of range numCols) to avoid iteration
bugs and close iterators on error as already implemented.
♻️ Duplicate comments (1)
pkg/lightning/mydump/parquet_type_converter.go (1)

482-488: ⚠️ Potential issue | 🔴 Critical

Preserve source integer semantics in default setters.

Line 483 and Line 539 still branch on target signedness, which can mis-decode parquet logical unsigned/signed values and produce wrong datums when skip-cast is true. Use source converted type signedness instead.

🔧 Proposed fix
@@
 	default:
-		if target != nil && mysql.HasUnsignedFlag(target.GetFlag()) {
+		switch converted.converted {
+		case schema.ConvertedTypes.Uint8, schema.ConvertedTypes.Uint16, schema.ConvertedTypes.Uint32:
 			return func(val int32, d *types.Datum) error {
 				d.SetUint64(uint64(uint32(val)))
 				return nil
 			}
+		default:
+			return func(val int32, d *types.Datum) error {
+				d.SetInt64(int64(val))
+				return nil
+			}
 		}
-		return func(val int32, d *types.Datum) error {
-			d.SetInt64(int64(val))
-			return nil
-		}
@@
 	default:
-		if target != nil && !mysql.HasUnsignedFlag(target.GetFlag()) {
+		switch converted.converted {
+		case schema.ConvertedTypes.Uint8, schema.ConvertedTypes.Uint16,
+			schema.ConvertedTypes.Uint32, schema.ConvertedTypes.Uint64:
+			return func(val int64, d *types.Datum) error {
+				d.SetUint64(uint64(val))
+				return nil
+			}
+		default:
 			return func(val int64, d *types.Datum) error {
 				d.SetInt64(val)
 				return nil
 			}
 		}
-		return func(val int64, d *types.Datum) error {
-			d.SetUint64(uint64(val))
-			return nil
-		}
 	}
 }

Also applies to: 538-547

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_type_converter.go` around lines 482 - 488,
the default setter branches on target signedness (target.GetFlag() /
mysql.HasUnsignedFlag), which can mis-decode parquet signed/unsigned source
values when skip-cast is enabled. Update the int32/int64 default setter blocks
to inspect the signedness of the source converted type instead of the target,
and choose SetUint64 vs SetInt64 accordingly. Apply the same change to the
mirrored block around lines 538 - 547 so both places decide the Datum setter
from source signedness rather than target.GetFlag()/mysql.HasUnsignedFlag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 058a4973-899d-4189-8277-74debb7bf9b2

📥 Commits

Reviewing files that changed from the base of the PR and between cdc56f0 and 3ea1a89.

📒 Files selected for processing (2)
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_type_converter.go

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/lightning/mydump/parquet_parser.go`:
- Around line 542-545: pp.lastRow.SkipCast currently aliases the mutable parser
state pp.skipCast, which is overwritten on subsequent ReadRow calls. Change the
assignment in ReadRow so pp.lastRow.SkipCast stores an independent copy of the
skip flags (for a slice, allocate a new slice and copy) instead of a reference,
so previously returned rows retain their original flags. Add a test that
mutates pp.skipCast after ReadRow to validate there is no aliasing.

In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 111-124: stringCheckFunc currently falls back to utf8.RuneCount
for length checks, which is incorrect for binary/varbinary semantics. When
val.Kind() == types.KindBytes, targetFlen is a byte limit, so return false if
len(b) > targetFlen; for text (types.KindString) keep the existing rune-count
check when targetFlen != types.UnspecifiedLength. Preserve the initial kind
check and the enc.IsValid(b) validation; only the final utf8.RuneCount branch
needs to distinguish bytes from string kinds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b8215e4e-2c70-462b-b1b0-732aa6361b53

📥 Commits

Reviewing files that changed from the base of the PR and between 3ea1a89 and cecf41d.

📒 Files selected for processing (4)
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_parser_test.go
  • pkg/lightning/mydump/parquet_type_converter.go
  • pkg/lightning/mydump/parquet_type_skipcast_test.go

joechenrh and others added 3 commits March 17, 2026 01:27
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
For binary/varbinary targets, flen is byte count. Previously, when
len(b) > targetFlen, the function fell through to utf8.RuneCount which
could undercount (e.g. 2-byte 'é' = 1 rune), incorrectly allowing
skip-cast. Now check EncodingTpBin and return false immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update expected collation from "utf8mb4_bin" to "binary" for parquet
string datums when no target columns are provided.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ti-chi-bot

ti-chi-bot bot commented Mar 17, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@ti-chi-bot

ti-chi-bot bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign d3hunter for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot

ti-chi-bot bot commented Mar 17, 2026

@joechenrh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen | 8526db9 | link | true | /test pull-unit-test-next-gen |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@joechenrh
Contributor Author

/test unit-test

@tiprow

tiprow bot commented Mar 17, 2026

@joechenrh: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow
/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

- do-not-merge/needs-linked-issue
- ok-to-test: Indicates a PR is ready to be tested.
- release-note: Denotes a PR that will be considered when it comes time to generate release notes.
- size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
