-
Notifications
You must be signed in to change notification settings - Fork 116
Description
I believe there’s a bug in the varchar() helper when used with SQL Server.
Lines 119 to 128 in c8a2e26
| varchar <- function(x, type = "varchar") { | |
| # at least 255 characters, use max if more than 8000: | |
| max_length <- max(c(255, nchar(as.character(x))), na.rm = TRUE) | |
| if (max_length > 8000) { | |
| max_length <- "max" | |
| } | |
| paste0(type, "(", max_length, ")") | |
| } |
The SQL Server documentation for VARCHAR(n) defines n as the number of bytes, not characters.
Currently, varchar() determines n using nchar(type = "chars"), which underestimates the required width for multibyte UTF-8 strings. This can cause string truncation when inserting data longer than 255 bytes.
Reprex
string <- paste0("\xe2\x80\x99", paste(rep("a", 255), collapse = ""))
nchar(string)
#> [1] 256
nchar(string, type = "bytes")
#> [1] 258
odbc:::varchar(string) # not 258
#> [1] "varchar(256)"Created on 2025-10-29 with reprex v2.1.1
Expected
varchar() should use the byte length of the input (nchar(x, type = "bytes")) rather than character count.
That aligns with SQL Server’s definition and avoids truncation with UTF-8 input.
Using byte length instead of character count should not adversely affect other database backends.
Proposed change
I think that this would be the cleanest fix, given nchar() supports returning the byte count.
# before
max_length <- max(c(255, nchar(as.character(x))), na.rm = TRUE)
# after
max_length <- max(c(255, nchar(as.character(x), type = "bytes")), na.rm = TRUE)I plan to open a PR with this proposed change.