fix(format): decode SBOMs encoded as UTF-16 or with a UTF-8 BOM #4919

Open: Dashtid wants to merge 1 commit into anchore:main from Dashtid:fix/sbom-input-bom-utf16-encoding

Dashtid (Contributor) commented May 14, 2026
Fixes #4916.

Problem

Running `syft.exe scan dir:X -o json > sbom.json` from PowerShell on Windows produces an SBOM file encoded as UTF-16LE with a BOM (the default output encoding of the `>` operator in Windows PowerShell 5.1). `grype sbom:sbom.json` and `syft convert sbom.json -o ...` then fail with `unable to decode sbom: sbom format not recognized`, because the per-format `Identify` methods only recognize UTF-8 byte sequences. Verified on current versions (grype v0.112.0, syft main):

| Input encoding | First bytes | `format.Identify` result |
|---|---|---|
| UTF-8 (no BOM) | `7b 22 61 ...` (`{"a...`) | syft-json v16.1.3 |
| UTF-16LE BOM | `ff fe 7b 00 22 00 61 00 ...` | rejected → format not recognized |

Same SBOM content, only the on-disk encoding differs.
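
For reviewers who want to reproduce the byte patterns in the table without a Windows machine, here is a minimal stdlib-only sketch. `encodeUTF16LEWithBOM` is a hypothetical helper written for this illustration, not something in syft; it only mimics what a UTF-16LE writer emits.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodeUTF16LEWithBOM is a hypothetical helper (not part of syft). It mimics
// a UTF-16LE writer: an FF FE byte order mark followed by each UTF-16 code
// unit written low byte first.
func encodeUTF16LEWithBOM(s string) []byte {
	units := utf16.Encode([]rune(s))
	out := make([]byte, 0, 2+2*len(units))
	out = append(out, 0xFF, 0xFE) // UTF-16LE BOM
	for _, u := range units {
		out = append(out, byte(u), byte(u>>8)) // little-endian code unit
	}
	return out
}

func main() {
	doc := `{"a`
	fmt.Printf("UTF-8:    % x\n", []byte(doc))
	fmt.Printf("UTF-16LE: % x\n", encodeUTF16LEWithBOM(doc))
	// Output:
	// UTF-8:    7b 22 61
	// UTF-16LE: ff fe 7b 00 22 00 61 00
}
```

The interleaved `00` bytes are why a matcher that expects UTF-8 JSON or XML never sees a recognizable prefix.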

Fix

Detect a leading UTF-8 BOM (EF BB BF), UTF-16LE BOM (FF FE), or UTF-16BE BOM (FE FF) in `SeekableReader`, the single funnel that `DecoderCollection.{Decode,Identify}` use for every SBOM input, and transcode the content to UTF-8 via `golang.org/x/text/encoding/unicode.BOMOverride` before per-format identification.
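
To make the detect-then-transcode idea concrete, here is a rough sketch that uses only the standard library. It is a stand-in, not the code in this PR: the PR relies on `BOMOverride` from `golang.org/x/text`, and the names `bomKind`, `detectBOM`, and `toUTF8` are hypothetical. It also works on a whole in-memory buffer rather than a stream.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// bomKind and the helpers below are hypothetical names for illustration only.
type bomKind int

const (
	bomNone bomKind = iota
	bomUTF8
	bomUTF16LE
	bomUTF16BE
)

// detectBOM classifies the leading bytes of an input.
func detectBOM(head []byte) bomKind {
	switch {
	case bytes.HasPrefix(head, []byte{0xEF, 0xBB, 0xBF}):
		return bomUTF8
	case bytes.HasPrefix(head, []byte{0xFF, 0xFE}):
		return bomUTF16LE
	case bytes.HasPrefix(head, []byte{0xFE, 0xFF}):
		return bomUTF16BE
	}
	return bomNone
}

// toUTF8 strips a UTF-8 BOM, or decodes BOM-marked UTF-16 into UTF-8.
// Input without a BOM is returned unchanged.
func toUTF8(data []byte) []byte {
	var order binary.ByteOrder
	switch detectBOM(data) {
	case bomUTF8:
		return data[3:]
	case bomUTF16LE:
		order = binary.LittleEndian
	case bomUTF16BE:
		order = binary.BigEndian
	default:
		return data
	}
	body := data[2:] // skip the two-byte BOM
	units := make([]uint16, 0, len(body)/2)
	for i := 0; i+1 < len(body); i += 2 { // a trailing odd byte is ignored
		units = append(units, order.Uint16(body[i:]))
	}
	return []byte(string(utf16.Decode(units)))
}

func main() {
	utf16le := []byte{0xFF, 0xFE, '{', 0x00, '}', 0x00}
	utf8bom := []byte{0xEF, 0xBB, 0xBF, '{', '}'}
	fmt.Printf("%s %s %s\n", toUTF8(utf16le), toUTF8(utf8bom), toUTF8([]byte("{}")))
	// prints: {} {} {}
}
```

A streaming transformer avoids holding both the raw and decoded copies in memory, which is one reason to prefer the `x/text` route over a hand-rolled decoder like this one.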

The peek-and-branch shape preserves the existing fast path for io.ReadSeeker inputs (and its offset-aware contract for JSONL-style multi-document readers) when no BOM is present; we only buffer when transcoding is required.

Scope is limited to BOM-marked encodings — UTF-8, UTF-16LE, UTF-16BE — per @willmurphyscode's confirmation on #4916. No heuristic detection of unmarked Windows-1252 etc. (too easy to misclassify legitimate binary content).

golang.org/x/text was already an indirect dep; this PR promotes it to direct.

Test plan

  • 7 new cases in TestSeekableReader covering UTF-8 BOM / UTF-16LE / UTF-16BE with both seekable and non-seekable inputs, plus short-input edge case
  • New Test_hasBOM covering 9 cases (each BOM variant, near-miss, empty, short, plain JSON, plain ASCII)
  • New Test_peekHead_preservesReaderPosition covering 4 cases (seekable at 0, seekable at non-zero offset, non-seekable, short input — verifies the seek-back contract that the offset-aware fast path depends on)
  • Existing 7 TestSeekableReader cases unchanged and passing (no-BOM path is byte-for-byte unchanged)
  • End-to-end smoke: the actual PowerShell-generated UTF-16LE file from the issue's repro is now correctly identified as syft-json v16.1.3, decoded without error, and yields the expected package count
  • golangci-lint run --tests=false clean on the modified file

oss-housekeeper (bot) added the dependencies label (dealing with project dependencies) May 14, 2026
PowerShell's `>` operator emits UTF-16LE on Windows, so any tool that
pipes syft into a file via the documented `syft.exe ... > sbom.json`
pattern produces an SBOM that the format identifiers reject as
"sbom format not recognized" — the per-format Identify methods only
recognize UTF-8.

Detect a leading UTF-8 BOM, UTF-16LE BOM, or UTF-16BE BOM at the
SeekableReader seam (the single funnel for all SBOM input) and
transcode the content to UTF-8 before per-format identification.
Streams without a BOM are unchanged and still take the existing
fast path for io.ReadSeeker inputs (preserving the offset-aware
contract used by JSONL-style multi-document readers).

Fixes anchore#4916

Signed-off-by: David Dashti <david.dashti@hermesmedical.com>
Dashtid force-pushed the fix/sbom-input-bom-utf16-encoding branch from 2fd77a2 to e4aea5e June 6, 2026 17:41
Development

Successfully merging this pull request may close these issues.

Syft-generated Windows SBOM unable to identify format