[BUG] Transpiler truncates supplementary codepoints (\\x{1F600} becomes U+F600); silent wrong matches for non-BMP characters #14744

@revans2

Describe the bug

CudfRegexTranspiler.rewrite's arms for RegexHexDigit and
RegexOctalChar rewrite codepoints >= 0x80 as
RegexChar(r.codePoint.toChar). Char is a 16-bit UTF-16 code unit,
so any codepoint above U+FFFF is silently truncated to its low 16
bits.

For example, \x{1F600} (😀, codepoint U+1F600) is rewritten as
RegexChar('\uF600'), a private-use-area code unit. cuDF then
matches U+F600 instead of the supplementary U+1F600, a silent wrong
result for any pattern containing emoji or many CJK supplementary
characters.
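The truncation is plain bit arithmetic and can be checked outside the plugin. A minimal sketch in Python, where the hypothetical helper scala_int_to_char models Scala's Int.toChar as "keep the low 16 bits"; it is not code from the plugin:

```python
def scala_int_to_char(cp: int) -> int:
    """Model Scala's Int.toChar: keep only the low 16 bits."""
    return cp & 0xFFFF


cp = 0x1F600  # 😀, a supplementary (non-BMP) codepoint
truncated = scala_int_to_char(cp)
print(f"U+{cp:04X} -> U+{truncated:04X}")  # U+1F600 -> U+F600

# U+F600 lies in the BMP private-use area (U+E000..U+F8FF).
in_private_use_area = 0xE000 <= truncated <= 0xF8FF

# BMP codepoints pass through unchanged, which is why only
# supplementary codepoints are affected.
bmp_unchanged = scala_int_to_char(0x00E9) == 0x00E9
```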

Steps/Code to reproduce bug

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("\U0001F600",), ("a",), ("hello \U0001F600 world",)], ["a"])
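# Pattern matches U+1F600 (😀) as a literal codepoint.
# Python turns the four backslashes into two; Spark's SQL string
# parser (default settings) turns those into one, giving the regex \x{1F600}.
df.selectExpr("a RLIKE '\\\\x{1F600}'").show()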

Observed:

  • CPU: input "😀" matches → True
  • GPU: input "😀" does not match → False

(The GPU is matching U+F600 instead of U+1F600.)

Expected behavior

\x{1F600} should match the actual supplementary codepoint U+1F600.
GPU and CPU should agree.
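The expected per-row results for the repro data can be modelled without Spark. A minimal sketch, assuming RLIKE uses find (substring) semantics and using Python's re module as a stand-in for the CPU regex engine. It models only the literal match, not cuDF itself, and the "truncated" list is what the truncation would predict, not a captured GPU result:

```python
import re

# Same three inputs as the repro DataFrame.
rows = ["\U0001F600", "a", "hello \U0001F600 world"]

# Pattern the user asked for: the literal codepoint U+1F600.
intended = re.compile(re.escape(chr(0x1F600)))
# Pattern after truncation to the low 16 bits: the literal U+F600.
truncated = re.compile(re.escape(chr(0x1F600 & 0xFFFF)))

expected = [bool(intended.search(s)) for s in rows]
predicted_with_truncation = [bool(truncated.search(s)) for s in rows]

print(expected)                   # [True, False, True]
print(predicted_with_truncation)  # [False, False, False]
```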

Suggested fix

Encode supplementary codepoints as a UTF-8 byte sequence (or via a
dedicated AST node that emits the correct byte sequence). In Scala
terms:

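private def codepointToCharNode(cp: Int): RegexAST = {
  if (cp <= 0xFFFF) {
    // BMP codepoints fit in a single 16-bit Char.
    RegexChar(cp.toChar)
  } else {
    // Encode as a sequence of UTF-8 bytes, each a separate
    // RegexChar with the byte value. encodeUtf8 is a helper that
    // still has to be written. Mask with 0xFF so that a signed
    // Byte such as 0xF0 does not sign-extend to U+FFF0.
    RegexSequence(encodeUtf8(cp).map(b => RegexChar((b & 0xFF).toChar)))
  }
}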
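To make the target encodings concrete, here is a small Python sketch of the values involved for U+1F600. The helper names (utf8_bytes, utf16_surrogates, jvm_byte_to_char) are hypothetical and only model the arithmetic. Whether the transpiler should emit UTF-8 byte values or a UTF-16 surrogate pair depends on how it serializes RegexChar nodes, which this issue does not establish:

```python
def utf8_bytes(cp: int) -> list:
    """Unsigned UTF-8 byte values for one codepoint."""
    return list(chr(cp).encode("utf-8"))


def utf16_surrogates(cp: int) -> tuple:
    """High and low UTF-16 surrogates for a supplementary codepoint."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary codepoint")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def jvm_byte_to_char(b: int) -> int:
    """Model JVM Byte -> Char: sign-extend the byte, then keep 16 bits."""
    signed = b - 256 if b >= 0x80 else b
    return signed & 0xFFFF


cp = 0x1F600
print([f"{b:02X}" for b in utf8_bytes(cp)])        # ['F0', '9F', '98', '80']
print([f"{u:04X}" for u in utf16_surrogates(cp)])  # ['D83D', 'DE00']

# Pitfall: converting a signed byte straight to a char sign-extends it.
print(f"{jvm_byte_to_char(0xF0):04X}")             # FFF0, not 00F0
```

The last line is why a byte-based encoding needs an unsigned conversion (for example masking with 0xFF) before each value is wrapped in a character node.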

Affected file

sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
CudfRegexTranspiler.rewrite (the RegexHexDigit and
RegexOctalChar arms).

Environment details

  • Environment location: Standalone
  • Spark configuration settings: spark.rapids.sql.enabled=true, spark.rapids.sql.regexp.enabled=true

Additional context

Severity: correctness; silent wrong-result for any pattern with
\x{...} codepoint > U+FFFF (emoji, many CJK supplementary
characters).

Reproduced against main. Verified via integration test
test_repro_hex_octal_supplementary_codepoint_truncation in
integration_tests/src/main/python/spark_rapids_regression_charclass_test.py
under Spark 4.0.0 + Scala 2.13. Result: GPU False vs CPU True on
input "😀" matched against \x{1F600}.

Labels: bug