[BUG] Transpiler truncates supplementary codepoints (`\x{1F600}` becomes U+F600); silent wrong matches for non-BMP characters #14744

**Describe the bug**

`CudfRegexTranspiler.rewrite`'s arms for `RegexHexDigit` and
`RegexOctalChar` rewrite codepoints `>= 0x80` as
`RegexChar(r.codePoint.toChar)`. `Char` is a 16-bit UTF-16 code unit,
so any codepoint above U+FFFF is silently truncated to its low 16
bits.
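
The truncation is easy to see outside the JVM. The sketch below models the `Int` to `Char` narrowing as a mask with `0xFFFF`; the helper name `to_char16` is hypothetical and exists only for this illustration.

```python
def to_char16(code_point: int) -> int:
    """Model the JVM Int -> Char narrowing: keep only the low 16 bits."""
    return code_point & 0xFFFF


truncated = to_char16(0x1F600)
print(f"U+{0x1F600:04X} -> U+{truncated:04X}")  # U+1F600 -> U+F600

# Codepoints inside the BMP survive the conversion unchanged.
assert to_char16(0x00E9) == 0x00E9
```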

For example, `\x{1F600}` (😀, codepoint U+1F600) is rewritten as
`RegexChar('\uF600')` (a private-use-area code unit). cuDF then
matches U+F600 instead of the supplementary U+1F600, which is a silent
wrong result for any pattern containing emoji or many CJK supplementary
characters.

**Steps/Code to reproduce bug**

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("\U0001F600",), ("a",), ("hello \U0001F600 world",)], ["a"])
# Pattern matches U+1F600 (😀) as a literal codepoint.
df.selectExpr("a RLIKE '\\\\x{1F600}'").show()
```

Observed:
- CPU: input `"😀"` matches → `True`
- GPU: input `"😀"` does **not** match → `False`

(The GPU is matching U+F600 instead of U+1F600.)
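
The mismatch can be modelled without Spark. This is only a sketch using Python's `re` with search semantics, assuming the GPU side effectively looks for U+F600; the names `intended`, `truncated`, `cpu_like`, and `gpu_like` are illustrative and not part of the plugin.

```python
import re

inputs = ["\U0001F600", "a", "hello \U0001F600 world"]

# What the pattern asks for: the literal codepoint U+1F600.
intended = re.compile(re.escape(chr(0x1F600)))
# What is left after dropping everything above the low 16 bits: U+F600.
truncated = re.compile(re.escape(chr(0x1F600 & 0xFFFF)))

cpu_like = [bool(intended.search(s)) for s in inputs]
gpu_like = [bool(truncated.search(s)) for s in inputs]

print(cpu_like)  # [True, False, True]
print(gpu_like)  # [False, False, False]
```

None of the inputs contains U+F600, so the truncated pattern never matches, which is consistent with the `False` reported for `"😀"`.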

**Expected behavior**

`\x{1F600}` should match the actual supplementary codepoint U+1F600.
GPU and CPU should agree.

**Suggested fix**

Encode supplementary codepoints as a UTF-8 byte sequence (or via a
dedicated AST node that emits the correct byte sequence). In Scala
terms:

```scala
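import java.nio.charset.StandardCharsets

// Hypothetical helper: the UTF-8 bytes of a single codepoint.
private def encodeUtf8(cp: Int): Array[Byte] =
  new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8)

private def codepointToCharNode(cp: Int): RegexAST = {
  if (cp <= 0xFFFF) {
    RegexChar(cp.toChar)
  } else {
    // Encode as a sequence of UTF-8 bytes, each a separate
    // RegexChar with the byte value. Mask with 0xFF: Byte is signed,
    // so a bare b.toChar would sign-extend 0xF0 to 0xFFF0.
    RegexSequence(encodeUtf8(cp).map(b => RegexChar((b & 0xFF).toChar)))
  }
}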
```
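
To make the target concrete, the sketch below prints the encodings of U+1F600 that any fix would have to preserve: the four UTF-8 bytes, and the UTF-16 surrogate pair that a JVM `String` holds for the same codepoint. It only illustrates the encodings; the helper names `utf8_bytes` and `surrogate_pair` are hypothetical, and nothing here is the plugin's AST.

```python
def utf8_bytes(code_point: int) -> bytes:
    """UTF-8 encoding of a single codepoint."""
    return chr(code_point).encode("utf-8")


def surrogate_pair(code_point: int) -> tuple[int, int]:
    """UTF-16 high/low surrogates for a codepoint above U+FFFF."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(utf8_bytes(0x1F600).hex(" "))        # f0 9f 98 80
high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")             # D83D DE00
```

Python `bytes` are unsigned, so `0xF0` prints as-is; on the JVM the same byte is negative when held in a `Byte`, which matters for any per-byte conversion.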

**Affected file**

`sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala`
— `CudfRegexTranspiler.rewrite` (the `RegexHexDigit` and
`RegexOctalChar` arms).

**Environment details**
 - Environment location: Standalone
 - Spark configuration settings: `spark.rapids.sql.enabled=true`, `spark.rapids.sql.regexp.enabled=true`

**Additional context**

Severity: correctness. Silent wrong results for any pattern with a
`\x{...}` codepoint > U+FFFF (emoji, many CJK supplementary
characters).

Reproduced against `main`. Verified via integration test
`test_repro_hex_octal_supplementary_codepoint_truncation` in
`integration_tests/src/main/python/spark_rapids_regression_charclass_test.py`
under Spark 4.0.0 + Scala 2.13. Result: GPU `False` vs CPU `True` on
input `"😀"` matched against `\x{1F600}`.
