Clarify WriteAsync contract: add explicit IngestionDocument parameter to WriteAsync#7433
Conversation
…to WriteAsync - Remove Document property and constructor parameter from IngestionChunk<T> - Add IngestionDocument document parameter to IngestionChunkWriter<T>.WriteAsync - Update VectorStoreWriter to use the new document parameter - Update IngestionPipeline to pass document to WriteAsync - Update all chunkers (DocumentTokenChunker, ElementsChunker, HeaderChunker, SectionChunker, SemanticSimilarityChunker) to not pass document to chunks - Update all tests to match the new API Agent-Logs-Url: https://github.com/dotnet/extensions/sessions/d041591e-b70e-45f7-9302-c04e4787e92e Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
| /// <paramref name="tokenCount"/> is negative. | ||
| /// </exception> | ||
| public IngestionChunk(T content, IngestionDocument document, int tokenCount, string? context = null) | ||
| public IngestionChunk(T content, int tokenCount, string? context = null) |
There was a problem hiding this comment.
This is beyond the scope of the MEVD writer, but are you sure you want to remove the reference to the document from the chunk? Wouldn't that preclude some legitimate uses of this in processors/erichers/whatever in the ingestion pipeline, e.g. have some metadata at the document level, and then do something to chunks which are part of a document that has certain metadata? Removing the reference between the chunks and their document makes the ingestion data model poorer (and less normailzed), forcing e.g. duplicating document metadata onto each and every chunk.
Just saying, because my original comment wasn't about changing IngestionChunk - just making a clearer (and ideally enforced) contract in IngestionChunkWriter.
There was a problem hiding this comment.
@copilot please address this feedback and restore Document to IngestionChunk. Make sure yo revert the changes, so the diff is empty. But keep the changes that extended IngestionChunkWriter.WriteAsync with mandatory IngestionDocument document argument and actually make the IngestionDocument document the first argument of WriteAsync method.
There was a problem hiding this comment.
Restored Document property on IngestionChunk and reverted all chunker/test changes. The only remaining diff vs base is the WriteAsync signature change — document is now the first argument: WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken). See fbc1730.
adamsitnik
left a comment
There was a problem hiding this comment.
@copilot please address my feedback
| /// <paramref name="tokenCount"/> is negative. | ||
| /// </exception> | ||
| public IngestionChunk(T content, IngestionDocument document, int tokenCount, string? context = null) | ||
| public IngestionChunk(T content, int tokenCount, string? context = null) |
There was a problem hiding this comment.
@copilot please address this feedback and restore Document to IngestionChunk. Make sure yo revert the changes, so the diff is empty. But keep the changes that extended IngestionChunkWriter.WriteAsync with mandatory IngestionDocument document argument and actually make the IngestionDocument document the first argument of WriteAsync method.
…sync Addresses review feedback: - Restore Document property on IngestionChunk (revert removal) - Keep IngestionDocument document as explicit parameter on WriteAsync - Make document the first argument of WriteAsync - Only 4 files changed vs base: IngestionChunkWriter, VectorStoreWriter, IngestionPipeline, VectorStoreWriterTests Agent-Logs-Url: https://github.com/dotnet/extensions/sessions/1e9e1aa4-84c0-4e3c-a6b1-b1f90cafeda3 Co-authored-by: adamsitnik <6011991+adamsitnik@users.noreply.github.com>
adamsitnik
left a comment
There was a problem hiding this comment.
LGTM, @roji PTAL one more look
roji
left a comment
There was a problem hiding this comment.
Thanks, PTAL at some thoughts on whether introducing the new parameter makes sense here (or just renaming the method).
| /// <param name="cancellationToken">The token to monitor for cancellation requests.</param> | ||
| /// <returns>A task representing the asynchronous write operation.</returns> | ||
| public abstract Task WriteAsync(IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default); | ||
| public abstract Task WriteAsync(IngestionDocument document, IAsyncEnumerable<IngestionChunk<T>> chunks, CancellationToken cancellationToken = default); |
There was a problem hiding this comment.
Optional: consider renaming to WriteDocumentAsync rather than adding the document parameter: writing a document is what this function actually does.
I think renaming makes more sense than adding the new document parameter - the parameter has to always be exactly the same as the document referenced by all chunks in the 2nd parameter; in other words, the 1st parameter is useless: it's only there for documentation and/or validation purposes (see below).
But even if you do decide to keep the new parameter (which I'd recommend against), I'd still consider renaming to make the API extra-clear.
There was a problem hiding this comment.
the parameter has to always be exactly the same as the document referenced by all chunks in the 2nd parameter; in other words, the 1st parameter is useless: it's only there for documentation and/or validation purposes (see below).
That is why the original design did not have it ;P
Optional: consider renaming to WriteDocumentAsync rather than adding the document parameter: writing a document is what this function actually does.
Hmm let me think loud about it:
We have IngestionDocumentReader that comes with ReadAsync (not ReadDocumentAsync) that reads a document.
Then we have a IngestionChunkWriter that comes with WriteAsync. It's not called IngestionDocumentWriter because it does not write the document, but the chunks that were created from a document. Because of that I don't think that the method should be called WriteDocumentAsync.
So perhaps we should not rename anything but add doc remarks that clarify that all the chunks belong to a single document?
There was a problem hiding this comment.
For one thing, we could simply add validation. That is, if WriteAsync is passed chunks from multiple documents, it throws - should be easy to do, right?
Regardless, I'm a big believer in API names clearly expressing their behavior/semantics; so it feels like somehow the API naming should express that there's a single-document assumption/restriction here.
So... With IngestionDocumentReader, presumably ReadAsync just returns a document type, so there's much less need to clarify anything via naming (the type already does that). But on the writer side things are more complicated, since we receive chunks as input, but there's an unwritten limitation that they have to belong to the same document. That's an important difference IMHO.
So I'd still consider calling the method WriteDocumentChunksAsync, or WriteDocumentAsync if we want to prioritize brevity (the parameter type already tells you that the "document" here is passed in the form of a bunch of chunks).
If we want to 100% lock down this assumption at the type level, we can also rename the type from IngestionChunkWriter to IngestionDocumentChunkWriter (or again, IngestionDocumentWriter to prioritize brevity). I'd do this only if you're 100% sure that you'll never add an API that allows e.g. mixing chunks from different documents; we can do that, but it's maybe a strong commitment to make for the future... It might be safer to only change the naming on the method level, this way if we ever want to add another method which allows multiple documents, we can.
The way I see it, at the end of the day the WriteAsync method accepts and writes a document to the database; it happens to have already been decomposed into chunks, but that's just a form/shape/format for that document.
BTW your ingestion data model might have been different here, with the document actually holding a list of its chunks; if that were the case, you'd just pass in the document directly as the only parameter and everything would be 100% crystal-clear. But since the document (currently) doesn't reference its chunks, we pass the chunks instead as a replacement for the document; but what's written is still the document.
What do you think?
| { | ||
| /// <summary> | ||
| /// Writes chunks asynchronously. | ||
| /// Writes the chunks of a single document asynchronously. |
There was a problem hiding this comment.
Unimportant: do we still add "asynchronously" to the docs of each and every async function (especially when these don't have a corresponding synchronous overload)? Seems a bit useless to me (but obviously let's follow latest practices and patterns).
| Content = chunk.Content, | ||
| Context = chunk.Context, | ||
| DocumentId = chunk.Document.Identifier, | ||
| DocumentId = document.Identifier, |
There was a problem hiding this comment.
Assuming we keep the document parameter, add validation that the document of all the chunks is the same as the document argument?
WriteAsyncimplicitly assumed all chunks belong to a single document (pre-existing keys only fetched once for the first chunk's document). This made the contract ambiguous and error-prone. Make it explicit by addingIngestionDocument documentas a required parameter on theWriteAsyncmethod signature.API changes
IngestionChunkWriter<T>.WriteAsync: AddIngestionDocument documentas the first parameterNote:
IngestionChunk<T>.Documentis intentionally preserved — chunks still reference their source document for use in processors/enrichers. The newWriteAsyncparameter makes the single-document contract explicit at the writer level, andVectorStoreWriternow uses the explicitdocumentparameter instead ofchunk.Documentfor fetching pre-existing keys and setting the document ID on records.