mydump,importer: skip redundant CastColumnValue for parquet import #67054

Open
joechenrh wants to merge 37 commits into pingcap:master from joechenrh:optimize-encoding-v2

Conversation

@joechenrh
Contributor

@joechenrh joechenrh commented Mar 16, 2026

What problem does this PR solve?

Issue Number: ref #xxx

Problem Summary: During parquet-based IMPORT INTO, every column value goes through table.CastColumnValue() even when the parquet physical/logical type already guarantees the datum is correct under strict SQL mode. This adds unnecessary overhead on the hot encoding path.

What changed and how does it work?

This PR adds a skip-cast optimization for parquet imports that bypasses table.CastColumnValue() when the parquet source type is guaranteed to produce a correct datum.

Layer responsibilities:

  1. mydump/Row — Added a SkipCast []bool field to carry per-column skip-cast decisions alongside datum values. nil when no skip-cast info is available (CSV/SQL formats).

  2. mydump/ParquetParser (parser side) — Makes all skip-cast decisions:

    • Precheck phase: At parser init, computes a columnSkipCastPrecheck per column based on parquet physical/logical type vs target TiDB column type. Determines whether a column can unconditionally skip cast, needs a per-value post-check (string length, decimal precision), or must always cast.
    • Per-row phase: fillSkipCast() applies post-checks on each datum to produce the final SkipCast slice.
    • Target column info ([]*model.ColumnInfo) is threaded through to enable signedness-aware integer conversion and temporal type FSP matching.
  3. executor/importer (encoder side) — Consumes Row.SkipCast via a centralized fileColIdxForInsertCol mapping that translates between file-column indices and insert-column indices. When skipCast[i] is true, assigns the datum directly; otherwise falls through to table.CastColumnValue().

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Improve `IMPORT INTO` performance for parquet files by skipping redundant type casting when the parquet source type already guarantees correct datum values.

Summary by CodeRabbit

  • New Features

    • Per-row skip-cast metadata now flows through import and encoding, enabling selective casting.
    • Parquet imports carry target-column metadata for more accurate type-aware reads.
  • Improvements

    • Skip-cast optimization to avoid unnecessary conversions.
    • Better temporal handling (fractional-second precision, UTC adjustments).
    • Stronger numeric range/unsigned handling and stricter string/decimal validation to prevent invalid casts/truncation.
  • Tests

    • Added extensive unit tests covering skip-cast logic, temporal behavior, numeric/string/decimal validations.

joechenrh and others added 5 commits March 14, 2026 08:02
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add temporal truncation to setTemporalDatum (DATE zeroes time, DATETIME
  truncates sub-FSP nanoseconds)
- Fix integer setter signedness dispatch based on target column signedness
- Internalize skip-cast types (columnSkipCastPrecheck, skipCheckKind)
- Broaden string targets to CHAR (non-binary) and VARBINARY
- Implement fillSkipCast in ParquetParser called from ReadRow
- Per-value post-checks for string (charset+length) and decimal (precision/frac)
- Results communicated via Row.SkipCast []bool

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Precalculate fileColIdxForInsertCol[] once at encoder init. The
canSkipCastColumnValue function centralizes the file->insert column
mapping lookup. No per-row remapping needed.

Also pass targetColumns to NewParquetParser so skip-cast prechecks
can be built at parser init time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix Int64/None setter to dispatch on target signedness
- Add early null check in fillSkipCast (skip cast for nulls)
- Add intest.Assert in initInsertColFileMapping for mapping invariant
- Update spec: signed→unsigned integer not eligible (range includes negatives)
- Add VARBINARY byte-length-exceeded test
- Add null value post-check test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 16, 2026
@pantheon-ai

pantheon-ai bot commented Mar 16, 2026

Review failed due to infrastructure/execution failure after retries. Please re-trigger review.


@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 16, 2026
@tiprow

tiprow bot commented Mar 16, 2026

Hi @joechenrh. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.


@coderabbitai

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Adds per-column and per-row "skip-cast" metadata and logic for Parquet import; extends TableKVEncoder.Encode to accept per-row skipCast flags; threads target column metadata from parsers into encoders and tests to enable selective bypass of type conversions.

Changes

Cohort / File(s) Summary
Encoder API & Call Sites
pkg/dxf/importinto/conflict_resolution_test.go, pkg/dxf/importinto/conflictedkv/deleter_test.go, pkg/dxf/importinto/conflictedkv/handler.go, pkg/dxf/importinto/conflictedkv/handler_test.go, pkg/executor/importer/kv_encode_test.go
Updated all Encode call sites to the new signature including a second argument (tests pass nil): Encode(row, skipCast, rowID). Reflects Encode signature change across tests and handlers.
Executor — Encode Integration
pkg/executor/importer/chunk_process.go, pkg/executor/importer/kv_encode.go
Propagated per-row skipCast []bool in rowToEncode; TableKVEncoder gains insertColToFileIdx and currentSkipCast; Encode now accepts skipCast []bool; added initInsertColFileMapping and canSkipCastColumnValue.
Parquet Parser & Type Conversion
pkg/lightning/mydump/parquet_parser.go, pkg/lightning/mydump/parquet_type_converter.go, pkg/lightning/mydump/parquet_type_skipcast_test.go
Parser now accepts/propagates targetColumns, builds per-column skip-cast prechecks and produces per-row SkipCast []bool. Converted setters now return (bool, error) to signal skip-cast and include target-aware casting checks (temporal, numeric unsigned, decimal, string). Large test coverage added for skip-cast behavior.
Metadata & Runtime Plumbing
pkg/lightning/mydump/loader.go, pkg/dxf/importinto/subtask_executor.go, pkg/lightning/mydump/parser.go
Added TargetColumns []*model.ColumnInfo to ParquetFileMeta; subtask_executor attaches target columns to Parquet metadata; Row struct now includes SkipCast []bool.
Tests, BUILD & Test Helpers
pkg/lightning/mydump/BUILD.bazel, pkg/lightning/mydump/parquet_parser_test.go
Updated BUILD deps to include //pkg/meta/model and test deps; added test helper newInt96 and minor test expectation adjustments.
sequenceDiagram
    participant Client as Import Client
    participant Parser as Parquet Parser
    participant ChunkProc as Chunk Processor
    participant Encoder as TableKVEncoder
    participant Mapper as Column Mapper

    Client->>Parser: NewParquetParser(targetColumns)
    Parser->>Parser: buildSkipCastPrechecks(targetColumns)
    Parser->>Parser: ReadRow() -> lastRow.SkipCast
    ChunkProc->>Parser: parserEncodeReader.Read()
    ChunkProc->>ChunkProc: assemble rowToEncode{row, rowID, skipCast}
    ChunkProc->>Encoder: Encode(row, skipCast, rowID)
    Encoder->>Encoder: set currentSkipCast = skipCast
    Encoder->>Mapper: canSkipCastColumnValue(insertColIdx)?
    alt skip allowed
        Mapper-->>Encoder: use raw value
    else
        Mapper-->>Encoder: perform cast/convert
    end
    Encoder-->>ChunkProc: return KV pairs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through columns, sniffed each cast,
A nudge of metadata made decisions fast.
Skip a conversion where safe and true,
Parquet seeds to KV — a lighter chew.
Crunchy bytes, carrot-tuned view. 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 7.14%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Description check — ❓ Inconclusive: the description covers the problem, solution approach, layer responsibilities, and includes a release note, but the Issue Number field shows 'ref #xxx' (a placeholder) rather than an actual linked issue. Resolution: replace 'Issue Number: ref #xxx' with an actual issue reference (e.g., 'close #67054' or 'ref #XXXXX').

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: the title clearly and specifically describes the main optimization, skipping redundant CastColumnValue for parquet import, which is the central focus of the PR.



@hawkingrei
Member

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. do-not-merge/needs-linked-issue labels Mar 16, 2026
joechenrh and others added 3 commits March 16, 2026 03:43
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@joechenrh
Contributor Author

/retest

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@joechenrh joechenrh changed the title mydump,executor/importer: skip redundant CastColumnValue for parquet import mydump,importer: skip redundant CastColumnValue for parquet import Mar 16, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/lightning/mydump/parquet_type_skipcast_test.go (1)

172-212: Add a negative signed→unsigned regression here.

The new target-aware integer setters are only exercised with positive inputs. A -1 case against an unsigned target would lock down the no-wraparound behavior that this optimization depends on.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_type_skipcast_test.go` around lines 172 - 212,
Add a regression subtest in TestIntSetterSignedness that exercises a negative
signed→unsigned conversion to lock down no-wraparound behavior: create a
convertedType for Int32, build an unsigned target via
newParquetTargetColumnInfo(..., mysql.UnsignedFlag, ...), get the setter with
getInt32Setter, call setter(-1, &d) and assert no error and that the Datum
preserves the signed negative value (types.KindInt64 and d.GetInt64() == -1) so
the code does not wrap -1 into a large uint64.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/executor/importer/kv_encode.go`:
- Around line 85-92: The file→insert-column mapping (built in
TableKVEncoder.initInsertColFileMapping and used via fileColIdxForInsertCol to
decide skipping CastColumnValue) must fail closed in production instead of using
intest.Assert; update initInsertColFileMapping (and the same logic in the other
initialization block around the second occurrence) to validate that every
insertColumns entry has a matching fieldMappings entry and return an error (or
panic) if any mapping is missing, removing the intest.Assert dependency; ensure
TableKVEncoder construction in the calling site handles the returned error (or
accept the panic) so production cannot silently proceed with wrong skip-cast
bits.

In `@pkg/lightning/mydump/parquet_type.go`:
- Around line 608-613: The setters are reinterpreting source signed integers as
unsigned based on target.HasUnsignedFlag (e.g., the branch that currently does
d.SetUint64(uint64(uint32(val))) for val int32); change them to preserve the
source integer semantics instead of flipping signedness: for signed source
params (e.g., val int32) always set the Datum with the signed representation
(d.SetInt64(int64(val))), and for unsigned source params use the unsigned
representation, letting CastColumnValue handle any overflow/signedness
conversions. Update the analogous branches referenced (the current one plus the
similar branches at the other locations noted) to stop casting signed source
values to uint64 and instead use the source-typed int64/uint64 setters.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e49ac364-d597-40be-95e1-1b7ccc63d3e6

📥 Commits

Reviewing files that changed from the base of the PR and between 0763fed and 37c6e24.

📒 Files selected for processing (18)
  • lightning/pkg/importer/chunk_process.go
  • lightning/pkg/importer/get_pre_info.go
  • pkg/dxf/importinto/conflict_resolution_test.go
  • pkg/dxf/importinto/conflictedkv/deleter_test.go
  • pkg/dxf/importinto/conflictedkv/handler.go
  • pkg/dxf/importinto/conflictedkv/handler_test.go
  • pkg/executor/importer/chunk_process.go
  • pkg/executor/importer/import.go
  • pkg/executor/importer/kv_encode.go
  • pkg/executor/importer/kv_encode_test.go
  • pkg/lightning/mydump/BUILD.bazel
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_parser_test.go
  • pkg/lightning/mydump/parquet_type.go
  • pkg/lightning/mydump/parquet_type_converter.go
  • pkg/lightning/mydump/parquet_type_skipcast_test.go
  • pkg/lightning/mydump/parquet_wrapper.go
  • pkg/lightning/mydump/parser.go
💤 Files with no reviewable changes (1)
  • pkg/lightning/mydump/parquet_type_converter.go

joechenrh and others added 7 commits March 16, 2026 04:24
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ocations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
…ng through methods

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Set TargetColumns in importMinimalTaskExecutor.Run instead of passing
as a separate parameter to NewParquetParser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 42.44186% with 198 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.7596%. Comparing base (0763fed) to head (8526db9).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67054        +/-   ##
================================================
+ Coverage   77.6966%   77.7596%   +0.0630%     
================================================
  Files          2013       1945        -68     
  Lines        551311     543539      -7772     
================================================
- Hits         428350     422654      -5696     
+ Misses       121229     120211      -1018     
+ Partials       1732        674      -1058     
Flag Coverage Δ
integration 43.6090% <31.6715%> (-4.5465%) ⬇️


Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 48.7501% <ø> (-12.1143%) ⬇️

joechenrh and others added 2 commits March 16, 2026 04:58
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joechenrh and others added 11 commits March 16, 2026 05:10
Merge physical type into columnType struct and remove separate
physicalTypes arg from buildSkipCastPrechecks test calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
…pCast

- Add target *model.ColumnInfo field to columnSkipCastPrecheck
- Remove separate targetCols field from ParquetParser
- Pre-allocate skipCast []bool at init time (same length as file columns)
- Simplify fillSkipCast by removing bounds/cap checks
- Fix postCheckString binary path regression

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
prechecks is always the same length as colTypes (file column count).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
castRequired, castSkipAlways, castCheckString, castCheckDecimal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Boolean values (0 or 1) fit in any integer type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All integer converted types share the same int32 physical range.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All integer converted types share the same int64 physical range.
Remove unused unsignedRangeFitsTarget.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… InsertColumns

FieldMappings is 1:1 with file columns, while InsertColumns includes
SET clause columns and excludes user-var columns — the indices don't
align with file column positions.

Also skip buildSkipCastPrechecks when TargetColumns is nil (lightning path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parquet file columns may not match FieldMappings length.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
pkg/executor/importer/kv_encode.go (1)

260-278: ⚠️ Potential issue | 🟠 Major

Production should fail-closed on column ID mismatch.

The intest.Assert at lines 268-269 only fires in test builds. In production, if fieldMappings and insertColumns have drifted (column ID mismatch), the wrong skipCast bit would be used, potentially causing silent data corruption.

This was flagged in a previous review. Consider converting to a hard error:

🔧 Proposed fix
-func (en *TableKVEncoder) initInsertColFileMapping() {
+func (en *TableKVEncoder) initInsertColFileMapping() error {
 	en.insertColToFileIdx = make([]int, len(en.insertColumns))
 	insertIdx := 0
 	for fileIdx, mapping := range en.fieldMappings {
 		if mapping == nil || mapping.Column == nil {
 			continue
 		}
 		if insertIdx < len(en.insertColToFileIdx) {
-			intest.Assert(mapping.Column.ID == en.insertColumns[insertIdx].ID,
-				"fieldMapping column ID mismatch with insertColumns")
+			if mapping.Column.ID != en.insertColumns[insertIdx].ID {
+				return errors.Errorf(
+					"field mapping column id %d does not match insert column id %d at index %d",
+					mapping.Column.ID, en.insertColumns[insertIdx].ID, insertIdx,
+				)
+			}
 			en.insertColToFileIdx[insertIdx] = fileIdx
 		}
 		insertIdx++
 	}
 	// SET clause columns have no file column
 	for ; insertIdx < len(en.insertColToFileIdx); insertIdx++ {
 		en.insertColToFileIdx[insertIdx] = -1
 	}
+	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/kv_encode.go` around lines 260 - 278, The code
currently uses intest.Assert in TableKVEncoder.initInsertColFileMapping which
only runs in tests; convert that check into a production hard-fail by replacing
the intest.Assert with a runtime check that compares mapping.Column.ID and
en.insertColumns[insertIdx].ID and calls panic or a fatal logger with a clear
message and the two IDs when they differ, before assigning
en.insertColToFileIdx[insertIdx] = fileIdx; ensure you still handle nil
mapping/Column and preserve the subsequent logic that fills -1 for remaining
insertColToFileIdx entries.
🧹 Nitpick comments (1)
pkg/lightning/mydump/parquet_parser.go (1)

310-311: Consider documenting the skipCast slice reuse semantics.

The skipCast slice (line 311) is allocated once and reused across ReadRow calls. When LastRow().SkipCast is accessed, it returns a reference to this shared buffer.

Current usage in kv_encode.go is safe because Encode consumes skipCast synchronously and clears its reference in the defer. However, this contract is implicit.

Consider adding a comment to document this:

// skipCast holds per-file-column skip-cast decisions for the current row.
// WARNING: This slice is reused across ReadRow calls. Callers must consume
// LastRow().SkipCast before the next ReadRow call.
skipCast []bool

Also applies to: 566-567

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_parser.go` around lines 310 - 311, Add a clear
comment by the skipCast field documenting its reuse semantics: explain that
skipCast is a per-file-column buffer populated by ReadRow and returned via
LastRow().SkipCast, but it is a shared slice reused across successive ReadRow
calls so callers must consume or copy its contents before the next ReadRow;
reference the ReadRow, LastRow().SkipCast and kv_encode.go:Encode usage patterns
as examples of safe synchronous consumption. Also add the same explanatory
comment at the other occurrence noted (around lines 566-567) so both
declarations carry the warning.
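The reuse semantics flagged above can be demonstrated with a minimal, self-contained sketch; the types here are illustrative stand-ins, not the real ParquetParser. A caller that retains the returned slice past the next read sees it clobbered; copying first is safe.

```go
package main

import "fmt"

type parser struct {
	skipCast []bool // shared buffer, overwritten in place by every readRow
}

func (p *parser) readRow(flags []bool) []bool {
	copy(p.skipCast, flags)
	return p.skipCast // aliases the shared buffer
}

func main() {
	p := &parser{skipCast: make([]bool, 2)}
	row1 := p.readRow([]bool{true, true})
	saved := append([]bool(nil), row1...) // copy before the next readRow: safe
	p.readRow([]bool{false, false})       // clobbers the buffer row1 aliases
	fmt.Println(row1, saved)              // prints [false false] [true true]
}
```

This is why the implicit contract matters: the encoder's synchronous consumption in Encode is safe, but any caller that defers consumption must copy.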

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 89b6c96f-5385-42b3-88fa-a04c61640005

📥 Commits

Reviewing files that changed from the base of the PR and between 37c6e24 and 4c07c5f.

📒 Files selected for processing (7)
  • pkg/dxf/importinto/subtask_executor.go
  • pkg/executor/importer/kv_encode.go
  • pkg/lightning/mydump/BUILD.bazel
  • pkg/lightning/mydump/loader.go
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_type_converter.go
  • pkg/lightning/mydump/parquet_type_skipcast_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/lightning/mydump/BUILD.bazel

joechenrh and others added 2 commits March 16, 2026 07:55
Physical type is int32 so all integer converted types share the same
[MinInt32, MaxInt32] range — use default case with signedRangeFitsTarget.

Also fix inverted nil check in buildSkipCastPrechecks (== nil → != nil).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unsigned converted types produce [0, MaxUint32] or [0, MaxUint64] via
the setter, not the signed physical range.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 595-606: The int64 setter currently bases signedness on
target.GetFlag() and uses inverted logic compared to the int32 case; change it
to use the source converted type's unsignedness (the same check used in the
int32 setter/precheck) instead of target.GetFlag(), and ensure the branch maps
unsigned -> d.SetUint64(uint64(val)) and signed -> d.SetInt64(val) (use
mysql.HasUnsignedFlag on the source converted-type flag or the variable
representing the source converted type).
- Around line 539-549: The setter currently chooses SetUint64 vs SetInt64 based
on the target column signedness (using target.GetFlag()), which breaks skip-cast
paths; change it to use the source parquet converted type signedness (use
isUnsignedConvertedType(converted.converted)) so the closure preserves source
semantics (call SetUint64(uint64(uint32(val))) when the source converted type is
unsigned, otherwise call SetInt64(int64(val))); this ensures CastColumnValue
still performs any target-side conversion. Update the default branch that
returns the func(val int32, d *types.Datum) error to check
isUnsignedConvertedType(converted.converted) instead of target.GetFlag().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: dedb444e-5ccc-4600-94c9-9e8ab922504b

📥 Commits

Reviewing files that changed from the base of the PR and between 4c07c5f and cdc56f0.

📒 Files selected for processing (1)
  • pkg/lightning/mydump/parquet_type_converter.go

- Flatten parquetColumnPrecheck into switch on physical type
- Rename to int32CanSkipCast/int64CanSkipCast/decimalCanSkipCast/stringCanSkipCast
- Use mysql.IsIntegerType and types.Integer{Signed,Unsigned}{Lower,Upper}Bound
- signedRangeFitsTarget rejects unsigned targets early
- Collapse integer setter default cases by target signedness
- Clamp numColumns to min(fileColumns, len(TargetColumns)) in NewParquetParser

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/lightning/mydump/parquet_parser.go (1)

239-257: ⚠️ Potential issue | 🔴 Critical

Guard precheck indexing to avoid panic when target columns are absent.

Line 256 unconditionally reads prechecks[idx], but prechecks is nil unless meta.TargetColumns is set. This can panic in regular parquet parser flows (e.g., sampling).

🔧 Proposed fix
 func (rgp *rowGroupParser) init(colTypes []columnType, loc *time.Location, prechecks []columnSkipCastPrecheck) (err error) {
@@
 	for idx := range numCols {
 		tp := meta.Schema.Column(idx).PhysicalType()
-		iter := createColumnIterator(tp, &colTypes[idx], loc, prechecks[idx].target, readBatchSize)
+		var target *model.ColumnInfo
+		if idx < len(prechecks) {
+			target = prechecks[idx].target
+		}
+		iter := createColumnIterator(tp, &colTypes[idx], loc, target, readBatchSize)
 		if iter == nil {
 			return errors.Errorf("unsupported parquet type %s", tp.String())
 		}

Also applies to: 403-403, 704-707

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_parser.go` around lines 239 - 257, The init
method on rowGroupParser accesses prechecks[idx] unconditionally which can panic
when prechecks is nil or shorter than the number of columns; update
rowGroupParser.init to guard access to prechecks (and the other occurrences at
the same pattern) by checking len(prechecks) > idx and prechecks != nil before
using prechecks[idx].target, and pass a nil or default target value to
createColumnIterator when no precheck exists; ensure the loop over columns uses
for idx := 0; idx < numCols; idx++ (instead of range numCols) to avoid iteration
bugs and close iterators on error as already implemented.
♻️ Duplicate comments (1)
pkg/lightning/mydump/parquet_type_converter.go (1)

482-488: ⚠️ Potential issue | 🔴 Critical

Preserve source integer semantics in default setters.

Line 483 and Line 539 still branch on target signedness, which can mis-decode parquet logical unsigned/signed values and produce wrong datums when skip-cast is true. Use source converted type signedness instead.

🔧 Proposed fix
@@
 	default:
-		if target != nil && mysql.HasUnsignedFlag(target.GetFlag()) {
+		switch converted.converted {
+		case schema.ConvertedTypes.Uint8, schema.ConvertedTypes.Uint16, schema.ConvertedTypes.Uint32:
 			return func(val int32, d *types.Datum) error {
 				d.SetUint64(uint64(uint32(val)))
 				return nil
 			}
+		default:
+			return func(val int32, d *types.Datum) error {
+				d.SetInt64(int64(val))
+				return nil
+			}
 		}
-		return func(val int32, d *types.Datum) error {
-			d.SetInt64(int64(val))
-			return nil
-		}
@@
 	default:
-		if target != nil && !mysql.HasUnsignedFlag(target.GetFlag()) {
+		switch converted.converted {
+		case schema.ConvertedTypes.Uint8, schema.ConvertedTypes.Uint16,
+			schema.ConvertedTypes.Uint32, schema.ConvertedTypes.Uint64:
+			return func(val int64, d *types.Datum) error {
+				d.SetUint64(uint64(val))
+				return nil
+			}
+		default:
 			return func(val int64, d *types.Datum) error {
 				d.SetInt64(val)
 				return nil
 			}
 		}
-		return func(val int64, d *types.Datum) error {
-			d.SetUint64(uint64(val))
-			return nil
-		}
 	}
 }

Also applies to: 538-547

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_type_converter.go` around lines 482 - 488,
the default setter branches on target signedness (target.GetFlag() /
mysql.HasUnsignedFlag), which can mis-decode parquet signed/unsigned source
values when skip-cast is enabled. Update the int32/int64 default setter blocks
to inspect the signedness of the source converted type instead of the target,
and choose SetUint64 vs SetInt64 accordingly. Apply the same change to the
mirrored block around lines 538 - 547 so both places decide the Datum setter
from source signedness rather than target.GetFlag()/mysql.HasUnsignedFlag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 058a4973-899d-4189-8277-74debb7bf9b2

📥 Commits

Reviewing files that changed from the base of the PR and between cdc56f0 and 3ea1a89.

📒 Files selected for processing (2)
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_type_converter.go

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/lightning/mydump/parquet_parser.go`:
- Around line 542-545: pp.lastRow.SkipCast currently aliases the mutable parser
state pp.skipCast, which is overwritten on subsequent ReadRow calls. Change the
assignment in ReadRow so pp.lastRow.SkipCast stores an independent copy of the
skip flags (for a slice, allocate a new slice and copy) instead of a reference,
so previously returned rows retain their original flags. Add a test that
mutates pp.skipCast after ReadRow to validate there is no aliasing.

In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 111-124: stringCheckFunc currently falls back to utf8.RuneCount
for length checks, which is incorrect for binary/varbinary semantics. When
val.Kind() == types.KindBytes, targetFlen is a byte limit, so return false if
len(b) > targetFlen; for text (types.KindString) keep the existing rune-count
check when targetFlen != types.UnspecifiedLength. Preserve the initial kind
check and the enc.IsValid(b) validation; only the final utf8.RuneCount branch
needs to distinguish bytes from string kinds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b8215e4e-2c70-462b-b1b0-732aa6361b53

📥 Commits

Reviewing files that changed from the base of the PR and between 3ea1a89 and cecf41d.

📒 Files selected for processing (4)
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_parser_test.go
  • pkg/lightning/mydump/parquet_type_converter.go
  • pkg/lightning/mydump/parquet_type_skipcast_test.go

joechenrh and others added 3 commits March 17, 2026 01:27
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
For binary/varbinary targets, flen is byte count. Previously, when
len(b) > targetFlen, the function fell through to utf8.RuneCount which
could undercount (e.g. 2-byte 'é' = 1 rune), incorrectly allowing
skip-cast. Now check EncodingTpBin and return false immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update expected collation from "utf8mb4_bin" to "binary" for parquet
string datums when no target columns are provided.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ti-chi-bot

ti-chi-bot bot commented Mar 17, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@ti-chi-bot

ti-chi-bot bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign d3hunter for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot

ti-chi-bot bot commented Mar 17, 2026

@joechenrh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen | 8526db9 | link | true | /test pull-unit-test-next-gen |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@joechenrh
Contributor Author

/test unit-test

@tiprow

tiprow bot commented Mar 17, 2026

@joechenrh: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow
/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

- do-not-merge/needs-linked-issue
- ok-to-test: Indicates a PR is ready to be tested.
- release-note: Denotes a PR that will be considered when it comes time to generate release notes.
- size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
