Initcap behaves differently in Spark and in DataFusion (also Comet) #1052

Open
Blizzara opened this issue Nov 4, 2024 · 1 comment
Blizzara commented Nov 4, 2024

Describe the bug

DataFusion's initcap behaves differently from Spark's. Both upper-case the first letter of each word and lower-case the rest, but they disagree on what counts as a word: Spark treats anything separated by whitespace (' ') as a word, while DataFusion treats every character that is not ASCII alphanumeric as a separator. (DataFusion's code would also fail to upper-case or lower-case non-ASCII characters, but that does not surface as a separate issue because it already treats them as separators.)
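To make the two word-boundary rules concrete, here is a minimal Python sketch of the behaviour described above. It is not the Spark or DataFusion source: the function names `initcap_spark_like` and `initcap_datafusion_like` are hypothetical, and the lower-case sample inputs are assumed for illustration rather than taken from the test in #1051.

```python
def initcap_spark_like(s: str) -> str:
    """Assumed Spark-like rule: only a space (' ') starts a new word."""
    out = []
    start_of_word = True
    for ch in s:
        # Upper-case the first character of a word, lower-case the rest.
        out.append(ch.upper() if start_of_word else ch.lower())
        start_of_word = ch == " "
    return "".join(out)


def initcap_datafusion_like(s: str) -> str:
    """Assumed DataFusion-like rule: any char that is not ASCII alphanumeric is a separator."""
    out = []
    start_of_word = True
    for ch in s:
        if ch.isascii() and ch.isalnum():
            out.append(ch.upper() if start_of_word else ch.lower())
            start_of_word = False
        else:
            # Separators (including non-ASCII letters) pass through unchanged.
            out.append(ch)
            start_of_word = True
    return "".join(out)


for name in ["james ähtäri", "robert rose-smith"]:
    print(initcap_spark_like(name), "|", initcap_datafusion_like(name))
```

Under these assumptions the sketch yields the same two pairs of differing values that appear in the results reported in this issue, which suggests the separator rule alone is enough to explain the mismatch.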

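#1051 shows the problem by adding two cases to the test, one using a dash and one using non-ASCII letters (from Finnish). Judging by the behaviour described above, the left column in the output below is Spark's result and the right column is Comet's (DataFusion's), despite the "Spark Answer" label printed by the test harness.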

== Results ==
!== Correct Answer - 7 ==       == Spark Answer - 7 ==
 struct<initcap(name):string>   struct<initcap(name):string>
 [James Smith]                  [James Smith]
 [James Smith]                  [James Smith]
![James Ähtäri]                 [James äHtäRi]
 [Michael Rose]                 [Michael Rose]
 [Rames Rose]                   [Rames Rose]
![Robert Rose-smith]            [Robert Rose-Smith]
 [Robert Williams]              [Robert Williams]

Steps to reproduce

Call initcap on an input that contains characters that are neither ASCII alphanumeric nor whitespace, for example a dash or a non-ASCII letter such as ä.

Expected behavior

initcap in Comet should match Spark's output.

Additional context

No response

@Blizzara Blizzara added the bug Something isn't working label Nov 4, 2024
@andygrove andygrove added this to the 0.4.0 milestone Nov 4, 2024
viirya commented Nov 4, 2024

Thanks for reporting this bug.
