Skip to content

Replace hand-rolled Base64Decoder in System.Private.Xml with System.Buffers.Text.Base64 #125993

@danmoseley

Description

@danmoseley

Note

This issue was drafted with Copilot assistance.

Per this comment from @stephentoub, System.Xml.Base64Decoder has its own hand-rolled base64 decoding implementation that should be replaced with the modern System.Buffers.Text.Base64 APIs.

Current state

Base64Decoder.cs contains:

  • A 123-byte lookup table mapping ASCII chars to base64 digit values
  • Manual bit-shifting to assemble decoded bytes ((b << 6) | digit, extract when bFilled >= 8)
  • Hand-rolled = padding consumption and post-padding whitespace validation
  • Streaming state carried across calls via _bits / _bitsFilled fields

The encoder side (Base64Encoder.cs) already delegates to Convert.ToBase64CharArray — the decoder is the outlier.

Proposed change

Replace the core Decode(ReadOnlySpan, Span, ...) logic with System.Buffers.Text.Base64 APIs. The OperationStatus-based overloads (e.g., Base64.DecodeFromUtf8) support partial input/output natively, which maps well to the streaming IncrementalReadDecoder pattern where the caller may supply more chars than fit in the output buffer.

Key considerations:

Aside: other XML codec opportunities

While auditing the XML encoder/decoder surface, two other areas surfaced that are less clear-cut but worth noting:

XmlUtf8RawTextWriter.EncodeMultibyteUTF8 / EncodeSurrogate — Hand-rolled UTF-8 byte assembly using 0xC0/0xE0/0xF0 masks and bit-shifting in unsafe pointer code. Could potentially use Rune.EncodeToUtf8, but this is a hot path with unsafe pointer arithmetic where the manual approach may be faster. Would need benchmarking before changing.

BinHexDecoder.Decode(ReadOnlySpan, bool) (static overload) — Allocates a byte[], decodes char-by-char, then Array.Resizes. The instance path already uses HexConverter.FromChar (modern), but the static convenience method could potentially use Convert.FromHexString with whitespace pre-stripping. The complication is XML-specific behavior: inline whitespace skipping and optional odd-count tolerance. The BinHexEncoder side is already fully modernized (HexConverter.EncodeToUtf16, Convert.ToHexString).

Feasibility / effort assessment

Note

This assessment was generated with Copilot assistance.

Code churn: Low

  • 1 file changed: Base64Decoder.cs (~170 lines total, ~80-line core Decode method)
  • Deletes the 123-byte lookup table, manual bit-shifting loop, and _bits/_bitsFilled state fields
  • Replaces with calls to Base64.DecodeFromChars(ReadOnlySpan<char>, Span<byte>, out int charsConsumed, out int bytesWritten, bool isFinalBlock) which already handles inline whitespace
  • The IncrementalReadDecoder wrapper and its SetNextOutputBuffer/DecodedCount/IsFull contract stay unchanged
  • Integration surface is well-encapsulated: ReadContentAsBinaryHelper.cs and XmlTextReaderImpl.cs create/use the decoder, but their code doesn't change

Difficulty: Low–Medium

  • Easy part: The BCL API is almost a drop-in — DecodeFromChars accepts ReadOnlySpan<char>, returns OperationStatus with charsConsumed/bytesWritten, and skips whitespace natively
  • Tricky parts:
    • Whitespace definition: The XML decoder uses XmlCharType.IsWhiteSpace; the BCL uses space/tab/CR/LF. Need to verify these match for base64 content
    • Error reporting: Current code throws XmlException(SR.Xml_InvalidBase64Value, ...) — the BCL returns OperationStatus.InvalidData without identifying the offending character. The error message quality from Fix OOM in BinHexDecoder, Base64Decoder, and XmlSchemaValidator when throwing on large invalid input #125930 needs preserving, so some post-failure inspection logic may be needed
    • Streaming state: Current code carries _bits/_bitsFilled across calls. The BCL's isFinalBlock=false handles this, but the mapping needs care
    • Padding behavior: Current code has custom = consumption and post-padding whitespace-only validation. Need to verify BCL matches

Test coverage: Good (indirect)

  • ~35+ test methods in ReadBase64.cs cover: valid decoding, chunked/streaming reads, whitespace in middle, = padding, invalid chars, argument validation, overflow regression, reader state after reads
  • Tests run across 5+ reader implementations (factory, subtree, wrapped, char-checking, custom) via inherited test classes
  • Gap: No direct unit tests for Base64Decoder — all coverage is through XmlReader APIs. Sufficient to catch regressions, but a contributor could optionally add targeted edge-case tests

Perf risk: Low

  • The BCL DecodeFromChars is SIMD-optimized and should be faster than the scalar char-by-char lookup table loop
  • This is not a hot path in typical XML workloads (base64 in XML is relatively niche: MTOM, embedded binary)

Perf test coverage: None ⚠️

  • No XML base64 benchmarks exist in dotnet/performance (searched for ReadContentAsBase64, ReadElementContentAsBase64, XmlReader benchmarks)
  • BCL-level Base64 benchmarks exist but don't exercise the XML reader path
  • Not a blocker given the low perf risk

Reviewability: High

  • Small, self-contained change in one file with a clear "delete hand-rolled, call BCL" narrative

What we get

  • Eliminates a 123-byte magic lookup table and manual bit-shifting
  • Consistency: encoder side already delegates to Convert.ToBase64CharArray; decoder becomes symmetric
  • Future BCL base64 optimizations flow through automatically
  • Net deletion of ~50–60 lines of tricky decoding logic

Metadata

Metadata

Assignees

No one assigned

    Labels

    area-System.Xmlhelp wanted[up-for-grabs] Good issue for external contributorsuntriagedNew issue has not been triaged by the area owner

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions