Skip to content

varchar() measures characters instead of bytes, causing truncation with multibyte UTF-8 strings (SQL Server) #960

@lachlangh

Description

@lachlangh

I believe there’s a bug in the varchar() helper when used with SQL Server.

odbc/R/aaa-odbc-data-type.R

Lines 119 to 128 in c8a2e26

varchar <- function(x, type = "varchar") {
# at least 255 characters, use max if more than 8000:
max_length <- max(c(255, nchar(as.character(x))), na.rm = TRUE)
if (max_length > 8000) {
max_length <- "max"
}
paste0(type, "(", max_length, ")")
}

The SQL Server documentation for VARCHAR(n) defines n as the number of bytes, not characters.
Currently, varchar() determines n using nchar(type = "chars"), which underestimates the required width for multibyte UTF-8 strings. This can cause string truncation when inserting data longer than 255 bytes.

Reprex

string <- paste0("\xe2\x80\x99", paste(rep("a", 255), collapse = ""))
nchar(string)
#> [1] 256
nchar(string, type = "bytes")
#> [1] 258
odbc:::varchar(string) # not 258
#> [1] "varchar(256)"

Created on 2025-10-29 with reprex v2.1.1

Expected

varchar() should use the byte length of the input (nchar(x, type = "bytes")) rather than character count.
That aligns with SQL Server’s definition and avoids truncation with UTF-8 input.
Using byte length instead of character count should not adversely affect other database backends.

Proposed change

I think that this would be the cleanest fix, given nchar() supports returning the byte count.

# before
max_length <- max(c(255, nchar(as.character(x))), na.rm = TRUE)

# after
max_length <- max(c(255, nchar(as.character(x), type = "bytes")), na.rm = TRUE)

I plan to open a PR with this proposed change.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions