-i matches multi-character sequences via Unicode case folding where GNU matches one #32

@sylvestre

Under -i, uu_grep lets a single-character pattern match a multi-character sequence when that sequence is the Unicode case folding of one character, e.g. st (the folding of the ligatures ſt/ﬆ), ss (of ß), or ff/fi/ffi (of ﬀ/ﬁ/ﬃ). GNU grep under LC_ALL=C folds case 1:1, so a single-character pattern matches exactly one character. The input here is plain ASCII, so this is not the locale/encoding limitation: the extra matching comes from the case folder, not from byte-vs-codepoint handling.
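To illustrate the one-to-many expansion described above, here is a minimal sketch. Rust's standard library has no case-folding API, so it uses `char::to_uppercase` as a stand-in that shows the same kind of expansion. The helper name `expand` is hypothetical, and this is not the code path uu_grep actually takes.

```rust
/// Hypothetical helper: apply Unicode uppercasing to one character.
/// Stand-in for case folding (std has no folding API); it only demonstrates
/// that a single character can map to several characters.
fn expand(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // U+00DF (sharp s), U+FB06 (st ligature), U+FB03 (ffi ligature)
    for c in ['\u{00DF}', '\u{FB06}', '\u{FB03}'] {
        println!("U+{:04X} -> {}", c as u32, expand(c));
    }
    // A plain ASCII letter maps to exactly one character.
    println!("s -> {}", expand('s'));
}
```

A matcher that treats such a multi-character expansion as equivalent to the single source character can end up consuming two or three input characters for a one-character pattern, which is the symptom reported here.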

Found by the differential fuzzer (fuzz_grep).

Rust (incorrect)

$ printf 'st\n' | ./target/release/grep -o -i '[[:alpha:]]'
st
# one match spanning two characters

GNU (correct)

$ printf 'st\n' | LC_ALL=C /usr/bin/grep -o -i '[[:alpha:]]'
s
t
# two separate single-character matches

More cases (input: Rust vs GNU): ss: ss vs s/s; ff: ff vs f/f; ffi: ffi vs f/f/i. The behavior is order-sensitive (st merges but ts does not) and changes match counts, not just -o output.
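As a reference for the expected results, the sketch below lists what `grep -o -i '[[:alpha:]]'` should print for these ASCII inputs under LC_ALL=C: one match per alphabetic character. The helper `expected_matches` is hypothetical and assumes ASCII-only input. It is not part of uu_grep or the fuzzer.

```rust
/// Hypothetical helper: the matches expected from `grep -o -i '[[:alpha:]]'`
/// on an ASCII line under LC_ALL=C, one entry per alphabetic character.
/// `-i` changes nothing here because the class already covers both cases.
fn expected_matches(line: &str) -> Vec<String> {
    line.chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_string())
        .collect()
}

fn main() {
    for input in ["st", "ts", "ss", "ff", "ffi"] {
        let matches = expected_matches(input);
        println!("{input}: {} matches {:?}", matches.len(), matches);
    }
}
```

Comparing the match count per input, rather than only the printed text, would also catch the count differences mentioned above.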
