Clarify WriteAsync contract: add explicit IngestionDocument parameter to WriteAsync #7433

Open
Copilot wants to merge 3 commits into data-ingestion-preview2 from copilot/medi-clarify-ingestionchunkwriter-contract

Contributor

Copilot AI commented Mar 28, 2026

WriteAsync implicitly assumed that all chunks belong to a single document: pre-existing keys were fetched only once, for the first chunk's document. This made the contract ambiguous and error-prone. This PR makes it explicit by adding IngestionDocument document as a required parameter on the WriteAsync method signature.

API changes

  • IngestionChunkWriter<T>.WriteAsync: Add IngestionDocument document as the first parameter
// Before
public abstract Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default);

// After
public abstract Task WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default);

Note: IngestionChunk<T>.Document is intentionally preserved — chunks still reference their source document for use in processors/enrichers. The new WriteAsync parameter makes the single-document contract explicit at the writer level, and VectorStoreWriter now uses the explicit document parameter instead of chunk.Document for fetching pre-existing keys and setting the document ID on records.

…to WriteAsync

- Remove Document property and constructor parameter from IngestionChunk<T>
- Add IngestionDocument document parameter to IngestionChunkWriter<T>.WriteAsync
- Update VectorStoreWriter to use the new document parameter
- Update IngestionPipeline to pass document to WriteAsync
- Update all chunkers (DocumentTokenChunker, ElementsChunker, HeaderChunker,
  SectionChunker, SemanticSimilarityChunker) to not pass document to chunks
- Update all tests to match the new API

Agent-Logs-Url: https://github.com/dotnet/extensions/sessions/d041591e-b70e-45f7-9302-c04e4787e92e

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Clarify the IngestionChunkWriter.WriteAsync contract around documents" to "Clarify WriteAsync contract: move document from IngestionChunk to WriteAsync parameter" Mar 28, 2026
Copilot AI requested a review from adamsitnik March 28, 2026 10:29
Member

@adamsitnik adamsitnik left a comment

It's not as bad as I thought it would be.

@roji could you PTAL and let me know if it addresses your feedback from #6970?

@adamsitnik adamsitnik requested a review from roji March 28, 2026 10:59
@adamsitnik adamsitnik marked this pull request as ready for review March 28, 2026 11:03
@adamsitnik adamsitnik added this to the Data Ingestion Preview 2 milestone Mar 28, 2026
  /// <paramref name="tokenCount"/> is negative.
  /// </exception>
- public IngestionChunk(T content, IngestionDocument document, int tokenCount, string? context = null)
+ public IngestionChunk(T content, int tokenCount, string? context = null)
Member

This is beyond the scope of the MEVD writer, but are you sure you want to remove the reference to the document from the chunk? Wouldn't that preclude some legitimate uses of this in processors/enrichers/whatever in the ingestion pipeline, e.g. have some metadata at the document level, and then do something to chunks which are part of a document that has certain metadata? Removing the reference between the chunks and their document makes the ingestion data model poorer (and less normalized), forcing e.g. duplicating document metadata onto each and every chunk.

Just saying, because my original comment wasn't about changing IngestionChunk - just making a clearer (and ideally enforced) contract in IngestionChunkWriter.

Member

@copilot please address this feedback and restore Document to IngestionChunk. Make sure you revert the changes, so the diff is empty. But keep the changes that extended IngestionChunkWriter.WriteAsync with the mandatory IngestionDocument document argument, and actually make IngestionDocument document the first argument of the WriteAsync method.

Contributor Author

Restored Document property on IngestionChunk and reverted all chunker/test changes. The only remaining diff vs base is the WriteAsync signature change — document is now the first argument: WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken). See fbc1730.

Member

@adamsitnik adamsitnik left a comment

@copilot please address my feedback

…sync

Addresses review feedback:
- Restore Document property on IngestionChunk (revert removal)
- Keep IngestionDocument document as explicit parameter on WriteAsync
- Make document the first argument of WriteAsync
- Only 4 files changed vs base: IngestionChunkWriter, VectorStoreWriter,
  IngestionPipeline, VectorStoreWriterTests

Agent-Logs-Url: https://github.com/dotnet/extensions/sessions/1e9e1aa4-84c0-4e3c-a6b1-b1f90cafeda3

Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 2, 2026

> @copilot please address my feedback

Done in fbc1730. Document restored on IngestionChunk, document is now the first argument of WriteAsync, and all reverted files show an empty diff vs base.

Copilot AI changed the title from "Clarify WriteAsync contract: move document from IngestionChunk to WriteAsync parameter" to "Clarify WriteAsync contract: add explicit IngestionDocument parameter to WriteAsync" Apr 2, 2026
Copilot AI requested a review from adamsitnik April 2, 2026 13:48
Member

@adamsitnik adamsitnik left a comment

LGTM, @roji PTAL once more

@adamsitnik adamsitnik requested a review from roji April 2, 2026 13:58
Member

@roji roji left a comment

Thanks, PTAL at some thoughts on whether introducing the new parameter makes sense here (or just renaming the method).

  /// <param name="cancellationToken">The token to monitor for cancellation requests.</param>
  /// <returns>A task representing the asynchronous write operation.</returns>
- public abstract Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default);
+ public abstract Task WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default);
Member

Optional: consider renaming to WriteDocumentAsync rather than adding the document parameter: writing a document is what this function actually does.

I think renaming makes more sense than adding the new document parameter - the parameter has to always be exactly the same as the document referenced by all chunks in the 2nd parameter; in other words, the 1st parameter is useless: it's only there for documentation and/or validation purposes (see below).

But even if you do decide to keep the new parameter (which I'd recommend against), I'd still consider renaming to make the API extra-clear.

Member

> the parameter has to always be exactly the same as the document referenced by all chunks in the 2nd parameter; in other words, the 1st parameter is useless: it's only there for documentation and/or validation purposes (see below).

That is why the original design did not have it ;P

> Optional: consider renaming to WriteDocumentAsync rather than adding the document parameter: writing a document is what this function actually does.

Hmm, let me think out loud about it:

We have IngestionDocumentReader that comes with ReadAsync (not ReadDocumentAsync) that reads a document.
Then we have an IngestionChunkWriter that comes with WriteAsync. It's not called IngestionDocumentWriter because it does not write the document, but the chunks that were created from a document. Because of that, I don't think the method should be called WriteDocumentAsync.

So perhaps we should not rename anything but add doc remarks that clarify that all the chunks belong to a single document?

Member

For one thing, we could simply add validation. That is, if WriteAsync is passed chunks from multiple documents, it throws - should be easy to do, right?

Regardless, I'm a big believer in API names clearly expressing their behavior/semantics; so it feels like somehow the API naming should express that there's a single-document assumption/restriction here.

So... With IngestionDocumentReader, presumably ReadAsync just returns a document type, so there's much less need to clarify anything via naming (the type already does that). But on the writer side things are more complicated, since we receive chunks as input, but there's an unwritten limitation that they have to belong to the same document. That's an important difference IMHO.

So I'd still consider calling the method WriteDocumentChunksAsync, or WriteDocumentAsync if we want to prioritize brevity (the parameter type already tells you that the "document" here is passed in the form of a bunch of chunks).

If we want to 100% lock down this assumption at the type level, we can also rename the type from IngestionChunkWriter to IngestionDocumentChunkWriter (or again, IngestionDocumentWriter to prioritize brevity). I'd do this only if you're 100% sure that you'll never add an API that allows e.g. mixing chunks from different documents; we can do that, but it's maybe a strong commitment to make for the future... It might be safer to only change the naming on the method level, this way if we ever want to add another method which allows multiple documents, we can.

The way I see it, at the end of the day the WriteAsync method accepts and writes a document to the database; it happens to have already been decomposed into chunks, but that's just a form/shape/format for that document.

BTW your ingestion data model might have been different here, with the document actually holding a list of its chunks; if that were the case, you'd just pass in the document directly as the only parameter and everything would be 100% crystal-clear. But since the document (currently) doesn't reference its chunks, we pass the chunks instead as a replacement for the document; but what's written is still the document.

What do you think?
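
The validation idea raised in this comment (throw when chunks from more than one document are passed) could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from this PR: IngestionDocument and IngestionChunk<T> are reduced to hypothetical stand-ins, EnsureSingleDocument is a made-up helper name, and "same document" is assumed to mean reference equality.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins for the real ingestion types, reduced to what the sketch needs.
public sealed class IngestionDocument
{
    public IngestionDocument(string identifier) => Identifier = identifier;
    public string Identifier { get; }
}

public sealed class IngestionChunk<T>
{
    public IngestionChunk(T content, IngestionDocument document)
    {
        Content = content;
        Document = document;
    }

    public T Content { get; }
    public IngestionDocument Document { get; }
}

public static class SingleDocumentGuard
{
    // Hypothetical helper: streams the chunks through unchanged and throws as soon as
    // a chunk references a different document instance than the first chunk did.
    public static async IAsyncEnumerable<IngestionChunk<T>> EnsureSingleDocument<T>(
        this IAsyncEnumerable<IngestionChunk<T>> chunks,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        IngestionDocument? first = null;
        await foreach (IngestionChunk<T> chunk in chunks.WithCancellation(cancellationToken))
        {
            first ??= chunk.Document;
            if (!ReferenceEquals(first, chunk.Document))
            {
                throw new InvalidOperationException(
                    $"Expected only chunks of document '{first.Identifier}', " +
                    $"but got a chunk of document '{chunk.Document.Identifier}'.");
            }

            yield return chunk;
        }
    }
}
```

Because the check runs lazily while the chunks are enumerated, chunks seen before the offending one may already have been written; whether that is acceptable depends on the writer.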

  {
  /// <summary>
- /// Writes chunks asynchronously.
+ /// Writes the chunks of a single document asynchronously.
Member

Unimportant: do we still add "asynchronously" to the docs of each and every async function (especially when these don't have a corresponding synchronous overload)? Seems a bit useless to me (but obviously let's follow latest practices and patterns).

  Content = chunk.Content,
  Context = chunk.Context,
- DocumentId = chunk.Document.Identifier,
+ DocumentId = document.Identifier,
Member

Assuming we keep the document parameter, add validation that the document of all the chunks is the same as the document argument?
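
A minimal sketch of what this check might look like inside a writer, assuming the document parameter stays. The types below are hypothetical stand-ins rather than the real ingestion types, InMemoryChunkWriter is invented for illustration (it collects records in a list instead of writing to a vector store), and comparing documents by reference is an assumption; comparing Identifier values would be another option.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins for the real ingestion types, reduced to what the sketch needs.
public sealed class IngestionDocument
{
    public IngestionDocument(string identifier) => Identifier = identifier;
    public string Identifier { get; }
}

public sealed class IngestionChunk<T>
{
    public IngestionChunk(T content, IngestionDocument document)
    {
        Content = content;
        Document = document;
    }

    public T Content { get; }
    public IngestionDocument Document { get; }
}

// Hypothetical writer that keeps records in memory, used only to show the validation.
public sealed class InMemoryChunkWriter<T>
{
    public List<(string DocumentId, T Content)> Records { get; } = new();

    public async Task WriteAsync(
        IngestionDocument document,
        IAsyncEnumerable<IngestionChunk<T>> chunks,
        CancellationToken cancellationToken = default)
    {
        if (document is null) throw new ArgumentNullException(nameof(document));
        if (chunks is null) throw new ArgumentNullException(nameof(chunks));

        await foreach (IngestionChunk<T> chunk in chunks.WithCancellation(cancellationToken))
        {
            // Reject any chunk whose source document is not the one passed explicitly.
            if (!ReferenceEquals(chunk.Document, document))
            {
                throw new ArgumentException(
                    $"Chunk belongs to document '{chunk.Document.Identifier}', " +
                    $"expected '{document.Identifier}'.",
                    nameof(chunks));
            }

            Records.Add((document.Identifier, chunk.Content));
        }
    }
}
```

In this sketch, chunks written before the mismatch is detected stay in Records; a real writer would have to decide whether to validate up front or roll back.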

Development

Successfully merging this pull request may close these issues.

[MEDI] Clarify the IngestionChunkWriter.WriteAsync contract around documents
