mydump,importer: skip redundant CastColumnValue for parquet import #67054
joechenrh wants to merge 37 commits into pingcap:master
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add temporal truncation to setTemporalDatum (DATE zeroes time, DATETIME truncates sub-FSP nanoseconds)
- Fix integer setter signedness dispatch based on target column signedness
- Internalize skip-cast types (columnSkipCastPrecheck, skipCheckKind)
- Broaden string targets to CHAR (non-binary) and VARBINARY
- Implement fillSkipCast in ParquetParser called from ReadRow
- Per-value post-checks for string (charset+length) and decimal (precision/frac)
- Results communicated via Row.SkipCast []bool
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Precalculate fileColIdxForInsertCol[] once at encoder init. The canSkipCastColumnValue function centralizes the file->insert column mapping lookup. No per-row remapping needed. Also pass targetColumns to NewParquetParser so skip-cast prechecks can be built at parser init time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix Int64/None setter to dispatch on target signedness
- Add early null check in fillSkipCast (skip cast for nulls)
- Add intest.Assert in initInsertColFileMapping for mapping invariant
- Update spec: signed→unsigned integer not eligible (range includes negatives)
- Add VARBINARY byte-length-exceeded test
- Add null value post-check test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi @joechenrh. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with
I understand the commands that are listed here.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
📝 Walkthrough
Adds per-column and per-row "skip-cast" metadata and logic for Parquet import; extends TableKVEncoder.Encode to accept per-row skipCast flags; threads target column metadata from parsers into encoders and tests to enable selective bypass of type conversions.
sequenceDiagram
participant Client as Import Client
participant Parser as Parquet Parser
participant ChunkProc as Chunk Processor
participant Encoder as TableKVEncoder
participant Mapper as Column Mapper
Client->>Parser: NewParquetParser(targetColumns)
Parser->>Parser: buildSkipCastPrechecks(targetColumns)
Parser->>Parser: ReadRow() -> lastRow.SkipCast
ChunkProc->>Parser: parserEncodeReader.Read()
ChunkProc->>ChunkProc: assemble rowToEncode{row, rowID, skipCast}
ChunkProc->>Encoder: Encode(row, skipCast, rowID)
Encoder->>Encoder: set currentSkipCast = skipCast
Encoder->>Mapper: canSkipCastColumnValue(insertColIdx)?
alt skip allowed
Mapper-->>Encoder: use raw value
else
Mapper-->>Encoder: perform cast/convert
end
Encoder-->>ChunkProc: return KV pairs
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
/ok-to-test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/retest
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/lightning/mydump/parquet_type_skipcast_test.go (1)
172-212: Add a negative signed→unsigned regression here.
The new target-aware integer setters are only exercised with positive inputs. A -1 case against an unsigned target would lock down the no-wraparound behavior that this optimization depends on.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lightning/mydump/parquet_type_skipcast_test.go` around lines 172 - 212, Add a regression subtest in TestIntSetterSignedness that exercises a negative signed→unsigned conversion to lock down no-wraparound behavior: create a convertedType for Int32, build an unsigned target via newParquetTargetColumnInfo(..., mysql.UnsignedFlag, ...), get the setter with getInt32Setter, call setter(-1, &d) and assert no error and that the Datum preserves the signed negative value (types.KindInt64 and d.GetInt64() == -1) so the code does not wrap -1 into a large uint64.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/executor/importer/kv_encode.go`:
- Around line 85-92: The file→insert-column mapping (built in
TableKVEncoder.initInsertColFileMapping and used via fileColIdxForInsertCol to
decide skipping CastColumnValue) must fail closed in production instead of using
intest.Assert; update initInsertColFileMapping (and the same logic in the other
initialization block around the second occurrence) to validate that every
insertColumns entry has a matching fieldMappings entry and return an error (or
panic) if any mapping is missing, removing the intest.Assert dependency; ensure
TableKVEncoder construction in the calling site handles the returned error (or
accept the panic) so production cannot silently proceed with wrong skip-cast
bits.
In `@pkg/lightning/mydump/parquet_type.go`:
- Around line 608-613: The setters are reinterpreting source signed integers as
unsigned based on target.HasUnsignedFlag (e.g., the branch that currently does
d.SetUint64(uint64(uint32(val))) for val int32); change them to preserve the
source integer semantics instead of flipping signedness: for signed source
params (e.g., val int32) always set the Datum with the signed representation
(d.SetInt64(int64(val))), and for unsigned source params use the unsigned
representation, letting CastColumnValue handle any overflow/signedness
conversions. Update the analogous branches referenced (the current one plus the
similar branches at the other locations noted) to stop casting signed source
values to uint64 and instead use the source-typed int64/uint64 setters.
---
Nitpick comments:
In `@pkg/lightning/mydump/parquet_type_skipcast_test.go`:
- Around line 172-212: Add a regression subtest in TestIntSetterSignedness that
exercises a negative signed→unsigned conversion to lock down no-wraparound
behavior: create a convertedType for Int32, build an unsigned target via
newParquetTargetColumnInfo(..., mysql.UnsignedFlag, ...), get the setter with
getInt32Setter, call setter(-1, &d) and assert no error and that the Datum
preserves the signed negative value (types.KindInt64 and d.GetInt64() == -1) so
the code does not wrap -1 into a large uint64.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e49ac364-d597-40be-95e1-1b7ccc63d3e6
📒 Files selected for processing (18)
- lightning/pkg/importer/chunk_process.go
- lightning/pkg/importer/get_pre_info.go
- pkg/dxf/importinto/conflict_resolution_test.go
- pkg/dxf/importinto/conflictedkv/deleter_test.go
- pkg/dxf/importinto/conflictedkv/handler.go
- pkg/dxf/importinto/conflictedkv/handler_test.go
- pkg/executor/importer/chunk_process.go
- pkg/executor/importer/import.go
- pkg/executor/importer/kv_encode.go
- pkg/executor/importer/kv_encode_test.go
- pkg/lightning/mydump/BUILD.bazel
- pkg/lightning/mydump/parquet_parser.go
- pkg/lightning/mydump/parquet_parser_test.go
- pkg/lightning/mydump/parquet_type.go
- pkg/lightning/mydump/parquet_type_converter.go
- pkg/lightning/mydump/parquet_type_skipcast_test.go
- pkg/lightning/mydump/parquet_wrapper.go
- pkg/lightning/mydump/parser.go
💤 Files with no reviewable changes (1)
- pkg/lightning/mydump/parquet_type_converter.go
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ocations Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ng through methods Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set TargetColumns in importMinimalTaskExecutor.Run instead of passing as a separate parameter to NewParquetParser. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #67054 +/- ##
================================================
+ Coverage 77.6966% 77.7596% +0.0630%
================================================
Files 2013 1945 -68
Lines 551311 543539 -7772
================================================
- Hits 428350 422654 -5696
+ Misses 121229 120211 -1018
+ Partials 1732 674 -1058
Flags with carried forward coverage won't be shown.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge physical type into columnType struct and remove separate physicalTypes arg from buildSkipCastPrechecks test calls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pCast
- Add target *model.ColumnInfo field to columnSkipCastPrecheck
- Remove separate targetCols field from ParquetParser
- Pre-allocate skipCast []bool at init time (same length as file columns)
- Simplify fillSkipCast by removing bounds/cap checks
- Fix postCheckString binary path regression
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
prechecks is always the same length as colTypes (file column count). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
castRequired, castSkipAlways, castCheckString, castCheckDecimal Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Boolean values (0 or 1) fit in any integer type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All integer converted types share the same int32 physical range. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All integer converted types share the same int64 physical range. Remove unused unsignedRangeFitsTarget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… InsertColumns FieldMappings is 1:1 with file columns, while InsertColumns includes SET clause columns and excludes user-var columns — the indices don't align with file column positions. Also skip buildSkipCastPrechecks when TargetColumns is nil (lightning path). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parquet file columns may not match FieldMappings length. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
♻️ Duplicate comments (1)
pkg/executor/importer/kv_encode.go (1)
260-278: ⚠️ Potential issue | 🟠 Major: Production should fail closed on column ID mismatch.
The intest.Assert at lines 268-269 only fires in test builds. In production, if fieldMappings and insertColumns have drifted (column ID mismatch), the wrong skipCast bit would be used, potentially causing silent data corruption.
This was flagged in a previous review. Consider converting to a hard error:
🔧 Proposed fix
-func (en *TableKVEncoder) initInsertColFileMapping() {
+func (en *TableKVEncoder) initInsertColFileMapping() error {
 	en.insertColToFileIdx = make([]int, len(en.insertColumns))
 	insertIdx := 0
 	for fileIdx, mapping := range en.fieldMappings {
 		if mapping == nil || mapping.Column == nil {
 			continue
 		}
 		if insertIdx < len(en.insertColToFileIdx) {
-			intest.Assert(mapping.Column.ID == en.insertColumns[insertIdx].ID,
-				"fieldMapping column ID mismatch with insertColumns")
+			if mapping.Column.ID != en.insertColumns[insertIdx].ID {
+				return errors.Errorf(
+					"field mapping column id %d does not match insert column id %d at index %d",
+					mapping.Column.ID, en.insertColumns[insertIdx].ID, insertIdx,
+				)
+			}
 			en.insertColToFileIdx[insertIdx] = fileIdx
 		}
 		insertIdx++
 	}
 	// SET clause columns have no file column
 	for ; insertIdx < len(en.insertColToFileIdx); insertIdx++ {
 		en.insertColToFileIdx[insertIdx] = -1
 	}
+	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/importer/kv_encode.go` around lines 260 - 278, The code currently uses intest.Assert in TableKVEncoder.initInsertColFileMapping which only runs in tests; convert that check into a production hard-fail by replacing the intest.Assert with a runtime check that compares mapping.Column.ID and en.insertColumns[insertIdx].ID and calls panic or a fatal logger with a clear message and the two IDs when they differ, before assigning en.insertColToFileIdx[insertIdx] = fileIdx; ensure you still handle nil mapping/Column and preserve the subsequent logic that fills -1 for remaining insertColToFileIdx entries.
🧹 Nitpick comments (1)
pkg/lightning/mydump/parquet_parser.go (1)
310-311: Consider documenting the skipCast slice reuse semantics.
The skipCast slice (line 311) is allocated once and reused across ReadRow calls. When LastRow().SkipCast is accessed, it returns a reference to this shared buffer.
Current usage in kv_encode.go is safe because Encode consumes skipCast synchronously and clears its reference in the defer. However, this contract is implicit.
Consider adding a comment to document this:
// skipCast holds per-file-column skip-cast decisions for the current row.
// WARNING: This slice is reused across ReadRow calls. Callers must consume
// LastRow().SkipCast before the next ReadRow call.
skipCast []bool
Also applies to: 566-567
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lightning/mydump/parquet_parser.go` around lines 310 - 311, Add a clear comment by the skipCast field documenting its reuse semantics: explain that skipCast is a per-file-column buffer populated by ReadRow and returned via LastRow().SkipCast, but it is a shared slice reused across successive ReadRow calls so callers must consume or copy its contents before the next ReadRow; reference the ReadRow, LastRow().SkipCast and kv_encode.go:Encode usage patterns as examples of safe synchronous consumption. Also add the same explanatory comment at the other occurrence noted (around lines 566-567) so both declarations carry the warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/executor/importer/kv_encode.go`:
- Around line 260-278: The code currently uses intest.Assert in
TableKVEncoder.initInsertColFileMapping which only runs in tests; convert that
check into a production hard-fail by replacing the intest.Assert with a runtime
check that compares mapping.Column.ID and en.insertColumns[insertIdx].ID and
calls panic or a fatal logger with a clear message and the two IDs when they
differ, before assigning en.insertColToFileIdx[insertIdx] = fileIdx; ensure you
still handle nil mapping/Column and preserve the subsequent logic that fills -1
for remaining insertColToFileIdx entries.
---
Nitpick comments:
In `@pkg/lightning/mydump/parquet_parser.go`:
- Around line 310-311: Add a clear comment by the skipCast field documenting its
reuse semantics: explain that skipCast is a per-file-column buffer populated by
ReadRow and returned via LastRow().SkipCast, but it is a shared slice reused
across successive ReadRow calls so callers must consume or copy its contents
before the next ReadRow; reference the ReadRow, LastRow().SkipCast and
kv_encode.go:Encode usage patterns as examples of safe synchronous consumption.
Also add the same explanatory comment at the other occurrence noted (around
lines 566-567) so both declarations carry the warning.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 89b6c96f-5385-42b3-88fa-a04c61640005
📒 Files selected for processing (7)
- pkg/dxf/importinto/subtask_executor.go
- pkg/executor/importer/kv_encode.go
- pkg/lightning/mydump/BUILD.bazel
- pkg/lightning/mydump/loader.go
- pkg/lightning/mydump/parquet_parser.go
- pkg/lightning/mydump/parquet_type_converter.go
- pkg/lightning/mydump/parquet_type_skipcast_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/lightning/mydump/BUILD.bazel
Physical type is int32 so all integer converted types share the same [MinInt32, MaxInt32] range — use default case with signedRangeFitsTarget. Also fix inverted nil check in buildSkipCastPrechecks (== nil → != nil). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unsigned converted types produce [0, MaxUint32] or [0, MaxUint64] via the setter, not the signed physical range. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 595-606: The int64 setter currently bases signedness on
target.GetFlag() and uses inverted logic compared to the int32 case; change it
to use the source converted type's unsignedness (the same check used in the
int32 setter/precheck) instead of target.GetFlag(), and ensure the branch maps
unsigned -> d.SetUint64(uint64(val)) and signed -> d.SetInt64(val) (use
mysql.HasUnsignedFlag on the source converted-type flag or the variable
representing the source converted type).
- Around line 539-549: The setter currently chooses SetUint64 vs SetInt64 based
on the target column signedness (using target.GetFlag()), which breaks skip-cast
paths; change it to use the source parquet converted type signedness (use
isUnsignedConvertedType(converted.converted)) so the closure preserves source
semantics (call SetUint64(uint64(uint32(val))) when the source converted type is
unsigned, otherwise call SetInt64(int64(val))); this ensures CastColumnValue
still performs any target-side conversion. Update the default branch that
returns the func(val int32, d *types.Datum) error to check
isUnsignedConvertedType(converted.converted) instead of target.GetFlag().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: dedb444e-5ccc-4600-94c9-9e8ab922504b
📒 Files selected for processing (1)
pkg/lightning/mydump/parquet_type_converter.go
- Flatten parquetColumnPrecheck into switch on physical type
- Rename to int32CanSkipCast/int64CanSkipCast/decimalCanSkipCast/stringCanSkipCast
- Use mysql.IsIntegerType and types.Integer{Signed,Unsigned}{Lower,Upper}Bound
- signedRangeFitsTarget rejects unsigned targets early
- Collapse integer setter default cases by target signedness
- Clamp numColumns to min(fileColumns, len(TargetColumns)) in NewParquetParser
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/lightning/mydump/parquet_parser.go (1)
239-257: ⚠️ Potential issue | 🔴 Critical: Guard precheck indexing to avoid panic when target columns are absent.
Line 256 unconditionally reads prechecks[idx], but prechecks is nil unless meta.TargetColumns is set. This can panic in regular parquet parser flows (e.g., sampling).
🔧 Proposed fix
 func (rgp *rowGroupParser) init(colTypes []columnType, loc *time.Location, prechecks []columnSkipCastPrecheck) (err error) {
 @@
 	for idx := range numCols {
 		tp := meta.Schema.Column(idx).PhysicalType()
-		iter := createColumnIterator(tp, &colTypes[idx], loc, prechecks[idx].target, readBatchSize)
+		var target *model.ColumnInfo
+		if idx < len(prechecks) {
+			target = prechecks[idx].target
+		}
+		iter := createColumnIterator(tp, &colTypes[idx], loc, target, readBatchSize)
 		if iter == nil {
 			return errors.Errorf("unsupported parquet type %s", tp.String())
 		}
Also applies to: 403-403, 704-707
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lightning/mydump/parquet_parser.go` around lines 239 - 257, The init method on rowGroupParser accesses prechecks[idx] unconditionally which can panic when prechecks is nil or shorter than the number of columns; update rowGroupParser.init to guard access to prechecks (and the other occurrences at the same pattern) by checking len(prechecks) > idx and prechecks != nil before using prechecks[idx].target, and pass a nil or default target value to createColumnIterator when no precheck exists; ensure the loop over columns uses for idx := 0; idx < numCols; idx++ (instead of range numCols) to avoid iteration bugs and close iterators on error as already implemented.
♻️ Duplicate comments (1)
pkg/lightning/mydump/parquet_type_converter.go (1)
482-488: ⚠️ Potential issue | 🔴 Critical: Preserve source integer semantics in default setters.
Line 483 and Line 539 still branch on target signedness, which can mis-decode parquet logical unsigned/signed values and produce wrong datums when skip-cast is true. Use source converted type signedness instead.
🔧 Proposed fix
@@
 	default:
-		if target != nil && mysql.HasUnsignedFlag(target.GetFlag()) {
+		switch converted.converted {
+		case schema.ConvertedTypes.Uint8, schema.ConvertedTypes.Uint16, schema.ConvertedTypes.Uint32:
 			return func(val int32, d *types.Datum) error {
 				d.SetUint64(uint64(uint32(val)))
 				return nil
 			}
+		default:
+			return func(val int32, d *types.Datum) error {
+				d.SetInt64(int64(val))
+				return nil
+			}
 		}
-		return func(val int32, d *types.Datum) error {
-			d.SetInt64(int64(val))
-			return nil
-		}
@@
 	default:
-		if target != nil && !mysql.HasUnsignedFlag(target.GetFlag()) {
+		switch converted.converted {
+		case schema.ConvertedTypes.Uint8, schema.ConvertedTypes.Uint16,
+			schema.ConvertedTypes.Uint32, schema.ConvertedTypes.Uint64:
+			return func(val int64, d *types.Datum) error {
+				d.SetUint64(uint64(val))
+				return nil
+			}
+		default:
 			return func(val int64, d *types.Datum) error {
 				d.SetInt64(val)
 				return nil
 			}
 		}
-		return func(val int64, d *types.Datum) error {
-			d.SetUint64(uint64(val))
-			return nil
-		}
 	}
 }
Also applies to: 538-547
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lightning/mydump/parquet_type_converter.go` around lines 482 - 488, The default setter branches on target signedness (using target.GetFlag() / mysql.HasUnsignedFlag) which can mis-decode parquet source signed/unsigned values when skip-cast is enabled; update the logic in the int32/int64 default setter blocks (where the code currently returns a func(val int32/int64, d *types.Datum) error and uses d.SetUint64 for unsigned) to inspect the source converted type's signedness instead of target (use the source conversion metadata or converted type flag provided by the Parquet-to-MySQL conversion context), and choose SetUint64 vs SetInt64 (or SetUint32/SetInt32 semantics) accordingly; apply the same change to the other mirrored block around the 538-547 region so both places use source signedness rather than target.GetFlag()/mysql.HasUnsignedFlag to decide which Datum setter to call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@pkg/lightning/mydump/parquet_parser.go`:
- Around line 239-257: The init method on rowGroupParser accesses prechecks[idx]
unconditionally which can panic when prechecks is nil or shorter than the number
of columns; update rowGroupParser.init to guard access to prechecks (and the
other occurrences at the same pattern) by checking len(prechecks) > idx and
prechecks != nil before using prechecks[idx].target, and pass a nil or default
target value to createColumnIterator when no precheck exists; ensure the loop
over columns uses for idx := 0; idx < numCols; idx++ (instead of range numCols)
to avoid iteration bugs and close iterators on error as already implemented.
---
Duplicate comments:
In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 482-488: The default setter branches on target signedness (using
target.GetFlag() / mysql.HasUnsignedFlag) which can mis-decode parquet source
signed/unsigned values when skip-cast is enabled; update the logic in the
int32/int64 default setter blocks (where the code currently returns a func(val
int32/int64, d *types.Datum) error and uses d.SetUint64 for unsigned) to inspect
the source converted type's signedness instead of target (use the source
conversion metadata or converted type flag provided by the Parquet-to-MySQL
conversion context), and choose SetUint64 vs SetInt64 (or SetUint32/SetInt32
semantics) accordingly; apply the same change to the other mirrored block around
the 538-547 region so both places use source signedness rather than
target.GetFlag()/mysql.HasUnsignedFlag to decide which Datum setter to call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 058a4973-899d-4189-8277-74debb7bf9b2
📒 Files selected for processing (2)
- pkg/lightning/mydump/parquet_parser.go
- pkg/lightning/mydump/parquet_type_converter.go
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/lightning/mydump/parquet_parser.go`:
- Around line 542-545: pp.lastRow.SkipCast currently aliases the mutable parser
state pp.skipCast which gets overwritten on subsequent ReadRow calls; change the
assignment in the ReadRow branch that sets pp.lastRow.SkipCast so it stores an
independent copy of the skip flags instead of a reference. Concretely, locate
where pp.lastRow.Row, pp.lastRow.Length and pp.lastRow.SkipCast are set
(symbols: pp.lastRow, pp.skipCast, ReadRow) and replace the direct assignment of
SkipCast with a clone (for a slice use a new slice and copy/append, for a map
create a new map and copy entries, or perform a deep copy of the flag structure)
so previously returned rows retain their original skip flags. Ensure tests cover
mutating pp.skipCast after ReadRow to validate no aliasing.
In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 111-124: The stringCheckFunc currently falls back to
utf8.RuneCount for length checks which is incorrect for binary/varbinary
semantics; update stringCheckFunc to perform byte-length enforcement when
val.Kind() == types.KindBytes (i.e., treat targetFlen as a byte limit and return
false if len(b) > targetFlen), while for text (types.KindString) keep the
existing behavior of allowing by rune count when targetFlen !=
types.UnspecifiedLength; preserve the initial kind check and enc.IsValid(b)
validation and only replace the final utf8.RuneCount branch with a conditional
that distinguishes bytes vs string kinds.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b8215e4e-2c70-462b-b1b0-732aa6361b53
📒 Files selected for processing (4)
- pkg/lightning/mydump/parquet_parser.go
- pkg/lightning/mydump/parquet_parser_test.go
- pkg/lightning/mydump/parquet_type_converter.go
- pkg/lightning/mydump/parquet_type_skipcast_test.go
For binary/varbinary targets, flen is byte count. Previously, when len(b) > targetFlen, the function fell through to utf8.RuneCount which could undercount (e.g. 2-byte 'é' = 1 rune), incorrectly allowing skip-cast. Now check EncodingTpBin and return false immediately. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update expected collation from "utf8mb4_bin" to "binary" for parquet string datums when no target columns are provided. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
[FORMAT CHECKER NOTIFICATION] Notice: To remove the
📖 For more info, you can check the "Contribute Code" section in the development guide.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@joechenrh: The following test failed, say
Full PR test history. Your PR dashboard.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/test unit-test
@joechenrh: The specified target(s) for
Use
Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What problem does this PR solve?
Issue Number: ref #xxx
Problem Summary: During parquet-based IMPORT INTO, every column value goes through table.CastColumnValue() even when the parquet physical/logical type already guarantees the datum is correct under strict SQL mode. This adds unnecessary overhead on the hot encoding path.
What changed and how does it work?
This PR adds a skip-cast optimization for parquet imports that bypasses table.CastColumnValue() when the parquet source type is guaranteed to produce a correct datum.
Layer responsibilities:
- mydump/Row: adds a SkipCast []bool field to carry per-column skip-cast decisions alongside datum values; it is nil when no skip-cast info is available (CSV/SQL formats).
- mydump/ParquetParser (parser side): makes all skip-cast decisions. At init it builds a columnSkipCastPrecheck per column based on the parquet physical/logical type vs the target TiDB column type, determining whether a column can unconditionally skip cast, needs a per-value post-check (string length, decimal precision), or must always cast. During ReadRow, fillSkipCast() applies the post-checks to each datum to produce the final SkipCast slice. Target column metadata ([]*model.ColumnInfo) is threaded through to enable signedness-aware integer conversion and temporal type FSP matching.
- executor/importer (encoder side): consumes Row.SkipCast via a centralized fileColIdxForInsertCol mapping that translates between file-column indices and insert-column indices. When skipCast[i] is true, the datum is assigned directly; otherwise it falls through to table.CastColumnValue().
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
New Features
Improvements
Tests