Clarify WriteAsync contract: add explicit IngestionDocument parameter to WriteAsync by Copilot · Pull Request #7433 · dotnet/extensions

Copilot · 2026-03-28T10:15:27Z

WriteAsync implicitly assumed that all chunks belong to a single document (pre-existing keys were fetched only once, for the first chunk's document). This made the contract ambiguous and error-prone. Make it explicit by adding IngestionDocument document as a required parameter on the WriteAsync method signature.

API changes

IngestionChunkWriter<T>.WriteAsync: Add IngestionDocument document as the first parameter

// Before
public abstract Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default);

// After
public abstract Task WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default);

Note: IngestionChunk<T>.Document is intentionally preserved — chunks still reference their source document for use in processors/enrichers. The new WriteAsync parameter makes the single-document contract explicit at the writer level, and VectorStoreWriter now uses the explicit document parameter instead of chunk.Document for fetching pre-existing keys and setting the document ID on records.
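A minimal sketch of what an implementer sees after this change. The `IngestionDocument`, `IngestionChunk<T>` and `IngestionChunkWriter<T>` types below are simplified stand-ins declared only so the snippet compiles on its own (the real types have more members), and `InMemoryChunkWriter` and `Demo.ChunksOf` are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-ins for the library types (assumption: real types carry more members).
public sealed class IngestionDocument
{
    public IngestionDocument(string identifier) => Identifier = identifier;
    public string Identifier { get; }
}

public sealed class IngestionChunk<T>
{
    public IngestionChunk(T content, IngestionDocument document)
    {
        Content = content;
        Document = document;
    }

    public T Content { get; }
    public IngestionDocument Document { get; }
}

public abstract class IngestionChunkWriter<T>
{
    // Signature as proposed in this PR: the document is the first parameter.
    public abstract Task WriteAsync(
        IngestionDocument document,
        IAsyncEnumerable<IngestionChunk<T>> chunks,
        CancellationToken cancellationToken = default);
}

// Hypothetical writer that keeps (document ID, content) pairs in memory.
public sealed class InMemoryChunkWriter : IngestionChunkWriter<string>
{
    public List<(string DocumentId, string Content)> Records { get; } = new();

    public override async Task WriteAsync(
        IngestionDocument document,
        IAsyncEnumerable<IngestionChunk<string>> chunks,
        CancellationToken cancellationToken = default)
    {
        await foreach (var chunk in chunks.WithCancellation(cancellationToken))
        {
            // The document ID comes from the explicit parameter, not from chunk.Document.
            Records.Add((document.Identifier, chunk.Content));
        }
    }
}

public static class Demo
{
    // Hypothetical helper: turns a few strings into an async stream of chunks of one document.
    public static async IAsyncEnumerable<IngestionChunk<string>> ChunksOf(
        IngestionDocument document, params string[] contents)
    {
        foreach (var content in contents)
        {
            yield return new IngestionChunk<string>(content, document);
            await Task.Yield();
        }
    }
}
```

A caller would then write `await writer.WriteAsync(doc, Demo.ChunksOf(doc, "a", "b"));`, passing the same document both as the argument and (indirectly) through each chunk.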

…to WriteAsync
- Remove Document property and constructor parameter from IngestionChunk<T>
- Add IngestionDocument document parameter to IngestionChunkWriter<T>.WriteAsync
- Update VectorStoreWriter to use the new document parameter
- Update IngestionPipeline to pass document to WriteAsync
- Update all chunkers (DocumentTokenChunker, ElementsChunker, HeaderChunker, SectionChunker, SemanticSimilarityChunker) to not pass document to chunks
- Update all tests to match the new API

Agent-Logs-Url: https://github.com/dotnet/extensions/sessions/d041591e-b70e-45f7-9302-c04e4787e92e
Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>

adamsitnik

It's not as bad as I thought it would be.

@roji could you PTAL and let me know if it addresses your feedback from #6970?

roji · 2026-04-02T10:14:03Z

src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionChunk.cs

    /// <paramref name="tokenCount"/> is negative.
    /// </exception>
-    public IngestionChunk(T content, IngestionDocument document, int tokenCount, string? context = null)
+    public IngestionChunk(T content, int tokenCount, string? context = null)


This is beyond the scope of the MEVD writer, but are you sure you want to remove the reference to the document from the chunk? Wouldn't that preclude some legitimate uses of this in processors/enrichers/whatever in the ingestion pipeline, e.g. having some metadata at the document level, and then doing something to chunks which are part of a document that has certain metadata? Removing the reference between the chunks and their document makes the ingestion data model poorer (and less normalized), forcing e.g. duplicating document metadata onto each and every chunk.

Just saying, because my original comment wasn't about changing IngestionChunk - just making a clearer (and ideally enforced) contract in IngestionChunkWriter.

@copilot please address this feedback and restore Document to IngestionChunk. Make sure you revert the changes, so the diff is empty. But keep the changes that extended IngestionChunkWriter.WriteAsync with a mandatory IngestionDocument document argument, and actually make the IngestionDocument document the first argument of the WriteAsync method.

Restored Document property on IngestionChunk and reverted all chunker/test changes. The only remaining diff vs base is the WriteAsync signature change — document is now the first argument: WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken). See fbc1730.

adamsitnik

@copilot please address my feedback

adamsitnik · 2026-04-02T13:38:50Z

src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionChunk.cs

    /// <paramref name="tokenCount"/> is negative.
    /// </exception>
-    public IngestionChunk(T content, IngestionDocument document, int tokenCount, string? context = null)
+    public IngestionChunk(T content, int tokenCount, string? context = null)


@copilot please address this feedback and restore Document to IngestionChunk. Make sure you revert the changes, so the diff is empty. But keep the changes that extended IngestionChunkWriter.WriteAsync with a mandatory IngestionDocument document argument, and actually make the IngestionDocument document the first argument of the WriteAsync method.

…sync

Addresses review feedback:
- Restore Document property on IngestionChunk (revert removal)
- Keep IngestionDocument document as explicit parameter on WriteAsync
- Make document the first argument of WriteAsync
- Only 4 files changed vs base: IngestionChunkWriter, VectorStoreWriter, IngestionPipeline, VectorStoreWriterTests

Agent-Logs-Url: https://github.com/dotnet/extensions/sessions/1e9e1aa4-84c0-4e3c-a6b1-b1f90cafeda3
Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>

Copilot · 2026-04-02T13:47:34Z

@copilot please address my feedback

Done in fbc1730 — Document restored on IngestionChunk, document is now the first argument of WriteAsync, and all reverted files show an empty diff vs base.

adamsitnik

LGTM. @roji, PTAL once more.

roji

Thanks, PTAL at some thoughts on whether introducing the new parameter makes sense here (or just renaming the method).

roji · 2026-04-02T16:59:49Z

src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionChunkWriter.cs

    /// <param name="cancellationToken">The token to monitor for cancellation requests.</param>
    /// <returns>A task representing the asynchronous write operation.</returns>
-    public abstract Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default);
+    public abstract Task WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default);


Optional: consider renaming to WriteDocumentAsync rather than adding the document parameter: writing a document is what this function actually does.

I think renaming makes more sense than adding the new document parameter - the parameter has to always be exactly the same as the document referenced by all chunks in the 2nd parameter; in other words, the 1st parameter is useless: it's only there for documentation and/or validation purposes (see below).

But even if you do decide to keep the new parameter (which I'd recommend against), I'd still consider renaming to make the API extra-clear.

the parameter has to always be exactly the same as the document referenced by all chunks in the 2nd parameter; in other words, the 1st parameter is useless: it's only there for documentation and/or validation purposes (see below).

That is why the original design did not have it ;P

Optional: consider renaming to WriteDocumentAsync rather than adding the document parameter: writing a document is what this function actually does.

Hmm, let me think out loud about it:

We have IngestionDocumentReader that comes with ReadAsync (not ReadDocumentAsync) that reads a document.
Then we have an IngestionChunkWriter that comes with WriteAsync. It's not called IngestionDocumentWriter because it does not write the document, but the chunks that were created from a document. Because of that, I don't think the method should be called WriteDocumentAsync.

So perhaps we should not rename anything but add doc remarks that clarify that all the chunks belong to a single document?

For one thing, we could simply add validation. That is, if WriteAsync is passed chunks from multiple documents, it throws - should be easy to do, right?

Regardless, I'm a big believer in API names clearly expressing their behavior/semantics; so it feels like somehow the API naming should express that there's a single-document assumption/restriction here.

So... With IngestionDocumentReader, presumably ReadAsync just returns a document type, so there's much less need to clarify anything via naming (the type already does that). But on the writer side things are more complicated, since we receive chunks as input, but there's an unwritten limitation that they have to belong to the same document. That's an important difference IMHO.

So I'd still consider calling the method WriteDocumentChunksAsync, or WriteDocumentAsync if we want to prioritize brevity (the parameter type already tells you that the "document" here is passed in the form of a bunch of chunks).

If we want to 100% lock down this assumption at the type level, we can also rename the type from IngestionChunkWriter to IngestionDocumentChunkWriter (or again, IngestionDocumentWriter to prioritize brevity). I'd do this only if you're 100% sure that you'll never add an API that allows e.g. mixing chunks from different documents; we can do that, but it's maybe a strong commitment to make for the future... It might be safer to only change the naming on the method level, this way if we ever want to add another method which allows multiple documents, we can.

The way I see it, at the end of the day the WriteAsync method accepts and writes a document to the database; it happens to have already been decomposed into chunks, but that's just a form/shape/format for that document.

BTW your ingestion data model might have been different here, with the document actually holding a list of its chunks; if that were the case, you'd just pass in the document directly as the only parameter and everything would be 100% crystal-clear. But since the document (currently) doesn't reference its chunks, we pass the chunks instead as a replacement for the document; but what's written is still the document.

What do you think?
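For illustration only, the alternative data model mentioned above (a document that holds its own chunks) might look roughly like the sketch below. None of these types exist in the library: `ChunkedDocument<T>`, `ChunkedDocumentWriter<T>` and `ListDocumentWriter` are hypothetical names, and the sketch materializes chunks in a list rather than streaming them through `IAsyncEnumerable`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical model, NOT the library's: the document owns its chunks.
public sealed class ChunkedDocument<T>
{
    public ChunkedDocument(string identifier, IReadOnlyList<T> chunks)
    {
        Identifier = identifier;
        Chunks = chunks;
    }

    public string Identifier { get; }
    public IReadOnlyList<T> Chunks { get; }
}

// Hypothetical writer: the document is the only input, so a single call
// cannot mix chunks that come from different documents.
public abstract class ChunkedDocumentWriter<T>
{
    public abstract Task WriteAsync(
        ChunkedDocument<T> document,
        CancellationToken cancellationToken = default);
}

// Hypothetical implementation that collects records in memory.
public sealed class ListDocumentWriter : ChunkedDocumentWriter<string>
{
    public List<(string DocumentId, string Content)> Records { get; } = new();

    public override Task WriteAsync(
        ChunkedDocument<string> document,
        CancellationToken cancellationToken = default)
    {
        foreach (var chunk in document.Chunks)
        {
            cancellationToken.ThrowIfCancellationRequested();
            Records.Add((document.Identifier, chunk));
        }

        return Task.CompletedTask;
    }
}
```

With this shape the single-document restriction is carried by the parameter type itself, at the cost of holding all chunks of a document in memory before writing.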

roji · 2026-04-02T17:00:41Z

src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionChunkWriter.cs

 {
    /// <summary>
-    /// Writes chunks asynchronously.
+    /// Writes the chunks of a single document asynchronously.


Unimportant: do we still add "asynchronously" to the docs of each and every async function (especially when these don't have a corresponding synchronous overload)? Seems a bit useless to me (but obviously let's follow latest practices and patterns).

roji · 2026-04-02T17:02:05Z

src/Libraries/Microsoft.Extensions.DataIngestion/Writers/VectorStoreWriter.cs

                Content = chunk.Content,
                Context = chunk.Context,
-                DocumentId = chunk.Document.Identifier,
+                DocumentId = document.Identifier,


Assuming we keep the document parameter, add validation that the document of all the chunks is the same as the document argument?
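One possible shape for that validation, as a sketch rather than the PR's implementation: an extension method that streams chunks through unchanged and throws on the first chunk whose Document differs from the argument. The `IngestionDocument` and `IngestionChunk<T>` types are simplified stand-ins, `EnsureBelongTo` and `Samples.ToAsync` are hypothetical names, and comparing by reference (rather than by Identifier) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-ins for the library types (assumption: real types carry more members).
public sealed class IngestionDocument
{
    public IngestionDocument(string identifier) => Identifier = identifier;
    public string Identifier { get; }
}

public sealed class IngestionChunk<T>
{
    public IngestionChunk(T content, IngestionDocument document)
    {
        Content = content;
        Document = document;
    }

    public T Content { get; }
    public IngestionDocument Document { get; }
}

public static class ChunkValidation
{
    // Hypothetical helper: yields each chunk as-is, but throws as soon as a chunk
    // references a document instance other than the expected one.
    public static async IAsyncEnumerable<IngestionChunk<T>> EnsureBelongTo<T>(
        this IAsyncEnumerable<IngestionChunk<T>> chunks,
        IngestionDocument document,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        await foreach (var chunk in chunks.WithCancellation(cancellationToken))
        {
            // Assumption: "the same document" means the same object instance.
            if (!ReferenceEquals(chunk.Document, document))
            {
                throw new ArgumentException(
                    $"Chunk belongs to document '{chunk.Document.Identifier}', expected '{document.Identifier}'.",
                    nameof(chunks));
            }

            yield return chunk;
        }
    }
}

public static class Samples
{
    // Hypothetical helper: exposes an in-memory sequence as an async stream.
    public static async IAsyncEnumerable<T> ToAsync<T>(IEnumerable<T> items)
    {
        foreach (var item in items)
        {
            yield return item;
            await Task.Yield();
        }
    }
}
```

A writer could wrap its input once, for example `await foreach (var chunk in chunks.EnsureBelongTo(document, cancellationToken))`, so the check runs lazily while chunks are consumed and does not require buffering them.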

Initial plan

92df59c

Copilot AI assigned Copilot and adamsitnik Mar 28, 2026

Copilot started work on behalf of adamsitnik March 28, 2026 10:15

Copilot AI linked an issue Mar 28, 2026 that may be closed by this pull request

[MEDI] Clarify the IngestionChunkWriter.WriteAsync contract around documents #6970

Open

github-actions bot added the area-infrastructure label Mar 28, 2026

Copilot AI changed the title ~~[WIP] Clarify the IngestionChunkWriter.WriteAsync contract around documents~~ Clarify WriteAsync contract: move document from IngestionChunk to WriteAsync parameter Mar 28, 2026

Copilot finished work on behalf of adamsitnik March 28, 2026 10:29

Copilot AI requested a review from adamsitnik March 28, 2026 10:29

adamsitnik approved these changes Mar 28, 2026

adamsitnik requested a review from roji March 28, 2026 10:59

adamsitnik marked this pull request as ready for review March 28, 2026 11:03

adamsitnik added area-data-ingestion and removed area-infrastructure labels Mar 28, 2026

adamsitnik added this to the Data Ingestion Preview 2 milestone Mar 28, 2026

roji reviewed Apr 2, 2026

adamsitnik reviewed Apr 2, 2026

Copilot started work on behalf of adamsitnik April 2, 2026 13:39

Copilot AI changed the title ~~Clarify WriteAsync contract: move document from IngestionChunk to WriteAsync parameter~~ Clarify WriteAsync contract: add explicit IngestionDocument parameter to WriteAsync Apr 2, 2026

Copilot finished work on behalf of adamsitnik April 2, 2026 13:48

Copilot AI requested a review from adamsitnik April 2, 2026 13:48

adamsitnik approved these changes Apr 2, 2026

adamsitnik requested a review from roji April 2, 2026 13:58

roji reviewed Apr 2, 2026
