Skip to content

Conversation

@daniele-pini
Copy link

Brief description

This PR fixes unexpected header recognitions across calls to parser.pause() / parser.resume().

Issue #998 was happening because the header recognition code was called multiple times for rows that were not the first, when the above mentioned functions were invoked.

Comment on lines 508 to +511
this.parseChunk = function(chunk, isFakeChunk)
{
var notFirstChunk = !this.isFirstChunk;

Copy link
Author

@daniele-pini daniele-pini Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explaining the changes to facilitate the review. When parsing chunks, we want parsing to keep track of whether some previous chunk was already processed or not. This is basically to know whether header recognition should trigger.

We already have this information in the ChunkStreamer as the isFirstChunk property. We need to invert it before passing it down in order to make the default a falsy value. This is used for example at this point:

PapaParse/papaparse.js

Lines 1350 to 1355 in b10b87e

var preview = new Parser({
comments: comments,
delimiter: delim,
newline: newline,
preview: 10
}).parse(input);

NOTE: I think the first chunk may not necessarily contain the header, although it usually does. It depends on how big the header is and whether skipFirstNLines was used. In general, we could find the header on following chunks.

This is also a problem in current implementation, but fixing that would require further refactoring that I'm not comfortable carrying out myself. I would like to leave this problem to a future PR.

The correct way to refactor all this really right, I think, should be to move the headerParsed variable out of the Parser class somehow, because that one gets reinitialized all the time. A "parseContext" variable initialized in the highest level parse function could be used for that.

Comment on lines -1743 to +1745
if (config.header && !baseIndex && data.length && !headerParsed)
if (config.header && !notFirstChunk && data.length && !headerParsed)
Copy link
Author

@daniele-pini daniele-pini Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The notFirstChunk variable gets passed down the parsing call chain up to this point. Previously, the baseIndex variable was used to determine whether to stop header recognition, while we now use the explicit notFirstChunk parameter.

The baseIndex (i.e. the cursor in the streamed file at the start of the chunk) was probably intended to be used for this role, except this doesn't work for the "fake chunks" generated when pausing and resuming. In particular, the first real chunk has a baseIndex of 0, and "fake chunks" inside it would also use that - which caused the header recognition to trigger multiple times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant