Describe the bug
CudfRegexTranspiler.rewrite's arms for RegexHexDigit and
RegexOctalChar rewrite codepoints >= 0x80 as
RegexChar(r.codePoint.toChar). Char is a 16-bit UTF-16 code unit,
so any codepoint above U+FFFF is silently truncated to its low 16
bits.
For example, \x{1F600} (😀, codepoint U+1F600) is rewritten as
RegexChar('\uF600') (a private-use-area code unit). cuDF then
matches U+F600 instead of the supplementary U+1F600, which is a silent
wrong result for any pattern containing emoji or many CJK supplementary
characters.
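The truncation can be checked with plain arithmetic. The sketch below is only an illustration in Python (the language of the repro), not plugin code; `truncate_to_utf16_unit` is a hypothetical helper that mimics what Scala's `Int.toChar` does by keeping the low 16 bits.

```python
def truncate_to_utf16_unit(cp):
    """Mimic Scala's Int.toChar: keep only the low 16 bits of the codepoint."""
    return cp & 0xFFFF


cp = 0x1F600  # 😀, a supplementary-plane codepoint
truncated = truncate_to_utf16_unit(cp)

# The high bits are dropped, leaving a different codepoint entirely.
print(f"U+{cp:X} -> U+{truncated:04X}")  # U+1F600 -> U+F600

# U+F600 lies in the BMP private use area (U+E000..U+F8FF).
print(0xE000 <= truncated <= 0xF8FF)  # True
```

Codepoints at or below U+FFFF pass through unchanged, which is why only supplementary characters are affected.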
Steps/Code to reproduce bug
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("\U0001F600",), ("a",), ("hello \U0001F600 world",)], ["a"])
# Pattern matches U+1F600 (😀) as a literal codepoint.
df.selectExpr("a RLIKE '\\\\x{1F600}'").show()
Observed:
- CPU: input "😀" matches → True
- GPU: input "😀" does not match → False
(The GPU is matching U+F600 instead of U+1F600.)
Expected behavior
\x{1F600} should match the actual supplementary codepoint U+1F600.
GPU and CPU should agree.
Suggested fix
Encode supplementary codepoints as a UTF-8 byte sequence (or via a
dedicated AST node that emits the correct byte sequence). In Scala
terms:
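private def codepointToCharNode(cp: Int): RegexAST = {
  if (cp <= 0xFFFF) {
    // BMP codepoints fit in a single UTF-16 code unit.
    RegexChar(cp.toChar)
  } else {
    // Encode as a sequence of UTF-8 bytes, each a separate
    // RegexChar with the byte value. encodeUtf8 is a helper still to
    // be written; mask with 0xFF because Byte.toChar sign-extends.
    RegexSequence(encodeUtf8(cp).map(b => RegexChar((b & 0xFF).toChar)))
  }
}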
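For reference, the two standard encodings of a supplementary codepoint can be computed as follows. This is a rough Python sketch, not plugin code: `utf8_bytes` and `utf16_surrogates` are hypothetical helper names, and which representation cuDF's regex engine actually expects is an assumption that would need to be confirmed before choosing a fix.

```python
def utf8_bytes(cp):
    """Return the UTF-8 encoding of a codepoint as a list of unsigned byte values."""
    return list(chr(cp).encode("utf-8"))


def utf16_surrogates(cp):
    """Return the (high, low) UTF-16 surrogate pair for a supplementary codepoint."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary codepoint")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


cp = 0x1F600
print([f"{b:02X}" for b in utf8_bytes(cp)])         # ['F0', '9F', '98', '80']
print([f"{u:04X}" for u in utf16_surrogates(cp)])   # ['D83D', 'DE00']
```

All four UTF-8 bytes are at or above 0x80, so they are negative when held in a signed JVM `Byte`; any conversion to `Char` has to mask them to the unsigned range first.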
Affected file
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
— CudfRegexTranspiler.rewrite (the RegexHexDigit and
RegexOctalChar arms).
Environment details
- Environment location: Standalone
- Spark configuration settings:
spark.rapids.sql.enabled=true, spark.rapids.sql.regexp.enabled=true
Additional context
Severity: correctness; silent wrong-result for any pattern with
\x{...} codepoint > U+FFFF (emoji, many CJK supplementary
characters).
Reproduced against main. Verified via integration test
test_repro_hex_octal_supplementary_codepoint_truncation in
integration_tests/src/main/python/spark_rapids_regression_charclass_test.py
under Spark 4.0.0 + Scala 2.13. Result: GPU False vs CPU True on
input "😀" matched against \x{1F600}.