Skip to content

SSO for strings#25593

Open
Araq wants to merge 24 commits intodevelfrom
araq-sso2
Open

SSO for strings#25593
Araq wants to merge 24 commits intodevelfrom
araq-sso2

Conversation

@Araq
Copy link
Member

@Araq Araq commented Mar 9, 2026

No description provided.

@hourianto
Copy link
Contributor

hourianto commented Mar 12, 2026

Four current issues with the PR, please ignore if they're known.

Found by GPT-5.4, all verified.

  1. SIGBUS:
proc mutate(x: var openArray[char]) =
  x[0] = 'X'

const longLit = "0123456789abcdef"

doAssert longLit.len == 16

var a = longLit
var b = a
mutate(a)
doAssert a == "X123456789abcdef"
doAssert b == longLit
  1. Mutates a, but comparison gives wrong result:
proc poke(c: var char) =
  c = 'X'

var s = "0123456789abcdef"
poke(s[0])
doAssert s == "X123456789abcdef"
  1. 32-bit short-string cutoff bug
proc c_exit(code: cint) {.importc: "exit", header: "<stdlib.h>", noreturn.}
proc c_printf(fmt: cstring): cint {.importc: "printf", header: "<stdio.h>", varargs.}
proc c_fflush(stream: pointer): cint {.importc: "fflush", header: "<stdio.h>".}

type SmallStringMirror = object
  bytes: uint
  more: pointer

proc dump(label: cstring; s: string) =
  let ss = cast[ptr SmallStringMirror](unsafeAddr s)
  let p = cast[ptr UncheckedArray[byte]](cast[pointer](ss))
  discard c_printf(
    "%s raw=0x%08x more=%p bytes=%02x %02x %02x %02x %02x %02x %02x %02x\n",
    label,
    cuint(ss.bytes),
    ss.more,
    cuint(p[0]),
    cuint(p[1]),
    cuint(p[2]),
    cuint(p[3]),
    cuint(p[4]),
    cuint(p[5]),
    cuint(p[6]),
    cuint(p[7]),
  )

var a = newString(4)
a[0] = 'a'
a[1] = 'b'
a[2] = 'c'
a[3] = 'a'

var b = newString(4)
b[0] = 'a'
b[1] = 'b'
b[2] = 'c'
b[3] = 'b'

var lit = "abcdef"

dump("a", a)
dump("b", b)
dump("lit", lit)
discard c_printf("a_eq_b=%d expected=0\n", cint(a == b))
discard c_fflush(nil)
c_exit(0)

On 32-bit targets, the PR treats all strings of length <= 7 as “short” and compares them using only the bytes field, but bytes holds only the first 3 payload chars on a 32-bit layout; chars 4..7 live in more. That makes runtime-built 4-char strings like "abca" and "abcb" compare equal, and it also makes emitted 4..7 char literals wrong because genStringLiteralV3 stores them as if all payload fit in bytes.

The fix is to stop using AlwaysAvail = 7 as the “all data is in bytes” cutoff. Short/SWAR fast paths and short literal emission need to use sizeof(uint) - 1 as the inline-in-bytes limit, and lengths above that must use the medium path that includes more.

  1. Implementation is currently broken on big-endian systems:
proc c_exit(code: cint) {.importc: "exit", header: "<stdlib.h>", noreturn.}
proc c_printf(fmt: cstring): cint {.importc: "printf", header: "<stdio.h>", varargs.}
proc c_fflush(stream: pointer): cint {.importc: "fflush", header: "<stdio.h>".}

type SmallStringMirror = object
  bytes: uint
  more: pointer

proc dump(label: cstring; s: string) =
  let ss = cast[ptr SmallStringMirror](unsafeAddr s)
  let p = cast[ptr UncheckedArray[byte]](addr ss.bytes)
  discard c_printf(
    "%s raw=0x%016llx b0=%u b7=%u more=%p chars=%02x %02x %02x %02x %02x %02x %02x\n",
    label,
    culonglong(ss.bytes),
    cuint(p[0]),
    cuint(p[7]),
    ss.more,
    cuint(p[1]),
    cuint(p[2]),
    cuint(p[3]),
    cuint(p[4]),
    cuint(p[5]),
    cuint(p[6]),
    cuint(p[7]),
  )

var short = "abc"
var medium = "abcdefgh"

dump("short", short)
dump("medium", medium)
discard c_fflush(nil)
c_exit(0)

The SmallString layout stores slen in byte 0 on big-endian (b0=3 for "abc", b0=8 for "abcdefgh"), but the current runtime reads byte 7 instead, which is 0 for short strings and 'g' (103) for medium strings. As a result, short strings get the wrong length and medium strings are misclassified as long strings, causing more to be treated as a real LongString* even though it only contains packed inline bytes. Simplest fix: make ssLen read the first byte in memory on all endiannesses, because that is where literals and setSSLen already store slen. In practice, the big-endian branch in lib/system/strs_v3.nim should stop reading addr s.bytes + (sizeof(uint) - 1) and just read addr s.bytes.

@Araq
Copy link
Member Author

Araq commented Mar 12, 2026

3 and 4 are known and 1-2 are too and supposed to be caught at compile-time later.

@arnetheduck
Copy link
Contributor

need to warn/error on cast[string](seq[byte]) btw since this will be even more wrong now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants