fix(format): decode SBOMs encoded as UTF-16 or with a UTF-8 BOM #4919
Open
Dashtid wants to merge 1 commit into
PowerShell's `>` operator emits UTF-16LE on Windows, so any tool that pipes syft into a file via the documented `syft.exe ... > sbom.json` pattern produces an SBOM that the format identifiers reject as "sbom format not recognized": the per-format Identify methods only recognize UTF-8.

Detect a leading UTF-8 BOM, UTF-16LE BOM, or UTF-16BE BOM at the SeekableReader seam (the single funnel for all SBOM input) and transcode the content to UTF-8 before per-format identification. Streams without a BOM are unchanged and still take the existing fast path for io.ReadSeeker inputs (preserving the offset-aware contract used by JSONL-style multi-document readers).

Fixes anchore#4916

Signed-off-by: David Dashti <david.dashti@hermesmedical.com>
Fixes #4916.
Problem
Running `syft.exe scan dir:X -o json > sbom.json` from PowerShell on Windows produces an SBOM file encoded as UTF-16LE (the OS-level default for PowerShell's `>` operator). `grype sbom:sbom.json` and `syft convert sbom.json -o ...` then fail with `unable to decode sbom: sbom format not recognized`, because the per-format `Identify` methods only recognize UTF-8 byte sequences.

Verified on current versions (grype v0.112.0, syft `main`):

| First bytes of file | `format.Identify` result |
| --- | --- |
| `7b 22 61 ...` (`{"a...`) | `syft-json v16.1.3` |
| `ff fe 7b 00 22 00 61 00 ...` | format not recognized |

Same SBOM content, only the on-disk encoding differs.
Fix
Detect a leading UTF-8 BOM (`EF BB BF`), UTF-16LE BOM (`FF FE`), or UTF-16BE BOM (`FE FF`) at `SeekableReader`, the single funnel `DecoderCollection.{Decode,Identify}` use for every SBOM input, and transcode the content to UTF-8 via `golang.org/x/text/encoding/unicode.BOMOverride` before per-format identification.

The peek-and-branch shape preserves the existing fast path for `io.ReadSeeker` inputs (and its offset-aware contract for JSONL-style multi-document readers) when no BOM is present; we only buffer when transcoding is required.

Scope is limited to BOM-marked encodings (UTF-8, UTF-16LE, UTF-16BE) per @willmurphyscode's confirmation on #4916. No heuristic detection of unmarked Windows-1252 etc. (too easy to misclassify legitimate binary content).
`golang.org/x/text` was already an indirect dep; this PR promotes it to direct.

Test plan

- `TestSeekableReader` covering UTF-8 BOM / UTF-16LE / UTF-16BE with both seekable and non-seekable inputs, plus short-input edge case
- `Test_hasBOM` covering 9 cases (each BOM variant, near-miss, empty, short, plain JSON, plain ASCII)
- `Test_peekHead_preservesReaderPosition` covering 4 cases (seekable at 0, seekable at non-zero offset, non-seekable, short input; verifies the seek-back contract that the offset-aware fast path depends on)
- Existing `TestSeekableReader` cases unchanged and passing (no-BOM path is byte-for-byte unchanged)
- `syft-json v16.1.3`, decoded without error, and yields the expected package count
- `golangci-lint run --tests=false` clean on the modified file