-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Description
Note
This issue was drafted with Copilot assistance.
Per this comment from @stephentoub, System.Xml.Base64Decoder has its own hand-rolled base64 decoding implementation that should be replaced with the modern System.Buffers.Text.Base64 APIs.
Current state
Base64Decoder.cs contains:
- A 123-byte lookup table mapping ASCII chars to base64 digit values
- Manual bit-shifting to assemble decoded bytes (
(b << 6) | digit, extract whenbFilled >= 8) - Hand-rolled
=padding consumption and post-padding whitespace validation - Streaming state carried across calls via
_bits/_bitsFilledfields
The encoder side (Base64Encoder.cs) already delegates to Convert.ToBase64CharArray — the decoder is the outlier.
Proposed change
Replace the core Decode(ReadOnlySpan, Span, ...) logic with System.Buffers.Text.Base64 APIs. The OperationStatus-based overloads (e.g., Base64.DecodeFromUtf8) support partial input/output natively, which maps well to the streaming IncrementalReadDecoder pattern where the caller may supply more chars than fit in the output buffer.
Key considerations:
- The XML decoder skips whitespace inline; the BCL base64 APIs may need a pre-pass or use of an overload that handles whitespace.
- Error messages should identify the offending character (as fixed in Fix OOM in BinHexDecoder, Base64Decoder, and XmlSchemaValidator when throwing on large invalid input #125930), not dump the whole buffer.
- The
IncrementalReadDecodercontract requires tracking how many chars were consumed and bytes produced, whichOperationStatusprovides.
Aside: other XML codec opportunities
While auditing the XML encoder/decoder surface, two other areas surfaced that are less clear-cut but worth noting:
XmlUtf8RawTextWriter.EncodeMultibyteUTF8 / EncodeSurrogate — Hand-rolled UTF-8 byte assembly using 0xC0/0xE0/0xF0 masks and bit-shifting in unsafe pointer code. Could potentially use Rune.EncodeToUtf8, but this is a hot path with unsafe pointer arithmetic where the manual approach may be faster. Would need benchmarking before changing.
BinHexDecoder.Decode(ReadOnlySpan, bool) (static overload) — Allocates a byte[], decodes char-by-char, then Array.Resizes. The instance path already uses HexConverter.FromChar (modern), but the static convenience method could potentially use Convert.FromHexString with whitespace pre-stripping. The complication is XML-specific behavior: inline whitespace skipping and optional odd-count tolerance. The BinHexEncoder side is already fully modernized (HexConverter.EncodeToUtf16, Convert.ToHexString).
Feasibility / effort assessment
Note
This assessment was generated with Copilot assistance.
Code churn: Low
- 1 file changed:
Base64Decoder.cs(~170 lines total, ~80-line coreDecodemethod) - Deletes the 123-byte lookup table, manual bit-shifting loop, and
_bits/_bitsFilledstate fields - Replaces with calls to
Base64.DecodeFromChars(ReadOnlySpan<char>, Span<byte>, out int charsConsumed, out int bytesWritten, bool isFinalBlock)which already handles inline whitespace - The
IncrementalReadDecoderwrapper and itsSetNextOutputBuffer/DecodedCount/IsFullcontract stay unchanged - Integration surface is well-encapsulated:
ReadContentAsBinaryHelper.csandXmlTextReaderImpl.cscreate/use the decoder, but their code doesn't change
Difficulty: Low–Medium
- Easy part: The BCL API is almost a drop-in —
DecodeFromCharsacceptsReadOnlySpan<char>, returnsOperationStatuswithcharsConsumed/bytesWritten, and skips whitespace natively - Tricky parts:
- Whitespace definition: The XML decoder uses
XmlCharType.IsWhiteSpace; the BCL uses space/tab/CR/LF. Need to verify these match for base64 content - Error reporting: Current code throws
XmlException(SR.Xml_InvalidBase64Value, ...)— the BCL returnsOperationStatus.InvalidDatawithout identifying the offending character. The error message quality from Fix OOM in BinHexDecoder, Base64Decoder, and XmlSchemaValidator when throwing on large invalid input #125930 needs preserving, so some post-failure inspection logic may be needed - Streaming state: Current code carries
_bits/_bitsFilledacross calls. The BCL'sisFinalBlock=falsehandles this, but the mapping needs care - Padding behavior: Current code has custom
=consumption and post-padding whitespace-only validation. Need to verify BCL matches
- Whitespace definition: The XML decoder uses
Test coverage: Good (indirect)
- ~35+ test methods in
ReadBase64.cscover: valid decoding, chunked/streaming reads, whitespace in middle,=padding, invalid chars, argument validation, overflow regression, reader state after reads - Tests run across 5+ reader implementations (factory, subtree, wrapped, char-checking, custom) via inherited test classes
- Gap: No direct unit tests for
Base64Decoder— all coverage is throughXmlReaderAPIs. Sufficient to catch regressions, but a contributor could optionally add targeted edge-case tests
Perf risk: Low
- The BCL
DecodeFromCharsis SIMD-optimized and should be faster than the scalar char-by-char lookup table loop - This is not a hot path in typical XML workloads (base64 in XML is relatively niche: MTOM, embedded binary)
Perf test coverage: None ⚠️
- No XML base64 benchmarks exist in
dotnet/performance(searched forReadContentAsBase64,ReadElementContentAsBase64,XmlReaderbenchmarks) - BCL-level
Base64benchmarks exist but don't exercise the XML reader path - Not a blocker given the low perf risk
Reviewability: High
- Small, self-contained change in one file with a clear "delete hand-rolled, call BCL" narrative
What we get
- Eliminates a 123-byte magic lookup table and manual bit-shifting
- Consistency: encoder side already delegates to
Convert.ToBase64CharArray; decoder becomes symmetric - Future BCL base64 optimizations flow through automatically
- Net deletion of ~50–60 lines of tricky decoding logic