12 changes: 7 additions & 5 deletions papaparse.js
@@ -507,6 +507,8 @@ License: MIT

this.parseChunk = function(chunk, isFakeChunk)
{
var notFirstChunk = !this.isFirstChunk;

Comment on lines 508 to +511
@daniele-pini (Author) commented on Sep 2, 2025:

Explaining the changes to facilitate the review. When parsing chunks, we want the parser to keep track of whether a previous chunk has already been processed. This is essentially how we know whether header recognition should trigger.

We already have this information in the ChunkStreamer as the isFirstChunk property. We need to invert it before passing it down so that the default (an omitted argument) is a falsy value. This matters, for example, at this call site:

papaparse.js, lines 1350 to 1355 at commit b10b87e:

var preview = new Parser({
comments: comments,
delimiter: delim,
newline: newline,
preview: 10
}).parse(input);

NOTE: I think the first chunk may not necessarily contain the header, although it usually does. It depends on how big the header is and whether skipFirstNLines was used; in general, the header could turn up in a later chunk.

This is also a problem in the current implementation, but fixing it would require further refactoring that I'm not comfortable carrying out myself. I would like to leave it to a future PR.

The right way to refactor this properly, I think, is to move the headerParsed variable out of the Parser class somehow, because that class gets reinitialized all the time. A parseContext variable, initialized in the highest-level parse function, could be used for that.
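The inversion described above can be illustrated with a small standalone sketch (the function body and names here are illustrative, not the actual PapaParse internals): because the new parameter is the negation of isFirstChunk, one-shot callers like the preview parser can simply omit it, and the falsy `undefined` default preserves first-chunk behavior.

```javascript
// Sketch of why the flag is inverted before being passed down: callers
// that never deal with chunking omit the argument entirely, and
// `undefined` is falsy, which matches "first chunk" behavior by default.
function parse(input, baseIndex, ignoreLastRow, notFirstChunk) {
  // Header recognition should only run on the first chunk.
  var shouldDetectHeader = !notFirstChunk;
  return { input: input, detectHeader: shouldDetectHeader };
}

// A one-shot caller (like the delimiter-guessing preview) passes nothing:
var preview = parse('a,b,c\n1,2,3');
// preview.detectHeader === true

// A streamer on a later chunk inverts its own isFirstChunk flag:
var isFirstChunk = false;
var later = parse('4,5,6\n', 100, true, !isFirstChunk);
// later.detectHeader === false
```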

// First chunk pre-processing
const skipFirstNLines = parseInt(this._config.skipFirstNLines) || 0;
if (this.isFirstChunk && skipFirstNLines > 0) {
@@ -530,7 +532,7 @@ License: MIT
// Rejoin the line we likely just split in two by chunking the file
var aggregate = this._partialLine + chunk;
this._partialLine = '';
var results = this._handle.parse(aggregate, this._baseIndex, !this._finished);
var results = this._handle.parse(aggregate, this._baseIndex, !this._finished, notFirstChunk);

if (this._handle.paused() || this._handle.aborted()) {
this._halted = true;
@@ -1080,7 +1082,7 @@ License: MIT
* and ignoreLastRow parameters. They are used by streamers (wrapper functions)
* when an input comes in multiple chunks, like from a file.
*/
this.parse = function(input, baseIndex, ignoreLastRow)
this.parse = function(input, baseIndex, ignoreLastRow, notFirstChunk)
{
var quoteChar = _config.quoteChar || '"';
if (!_config.newline)
@@ -1111,7 +1113,7 @@ License: MIT

_input = input;
_parser = new Parser(parserConfig);
_results = _parser.parse(_input, baseIndex, ignoreLastRow);
_results = _parser.parse(_input, baseIndex, ignoreLastRow, notFirstChunk);
processResults();
return _paused ? { meta: { paused: true } } : (_results || { meta: { paused: false } });
};
@@ -1458,7 +1460,7 @@ License: MIT
var cursor = 0;
var aborted = false;

this.parse = function(input, baseIndex, ignoreLastRow)
this.parse = function(input, baseIndex, ignoreLastRow, notFirstChunk)
{
// For some reason, in Chrome, this speeds things up (!?)
if (typeof input !== 'string')
@@ -1740,7 +1742,7 @@ License: MIT
/** Returns an object with the results, errors, and meta. */
function returnable(stopped)
{
if (config.header && !baseIndex && data.length && !headerParsed)
if (config.header && !notFirstChunk && data.length && !headerParsed)
Comment on lines -1743 to +1745
@daniele-pini (Author) commented on Sep 2, 2025:

The notFirstChunk flag gets passed down the parsing call chain to this point. Previously, the baseIndex variable was used to decide whether to stop header recognition; we now use the explicit notFirstChunk parameter instead.

The baseIndex (i.e. the cursor in the streamed file at the start of the chunk) was probably intended to be used for this role, except this doesn't work for the "fake chunks" generated when pausing and resuming. In particular, the first real chunk has a baseIndex of 0, and "fake chunks" inside it would also use that - which caused the header recognition to trigger multiple times.
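The failure mode can be reproduced with a small standalone sketch (the chunk objects and helper names below are hypothetical, for illustration only): any "fake chunk" emitted while resuming inside the first real chunk shares its baseIndex of 0, so the old `!baseIndex` test reports "first chunk" more than once, while the explicit flag fires exactly once.

```javascript
// Old heuristic: treat baseIndex 0 as "first chunk".
function isFirstByBaseIndex(baseIndex) { return !baseIndex; }
// New approach: use the explicit flag threaded down from the streamer.
function isFirstByFlag(notFirstChunk) { return !notFirstChunk; }

// The real first chunk starts at offset 0, and so do fake chunks
// generated while pausing/resuming inside it:
var chunks = [
  { baseIndex: 0, notFirstChunk: false },  // real first chunk
  { baseIndex: 0, notFirstChunk: true },   // fake chunk after resume
  { baseIndex: 120, notFirstChunk: true }  // next real chunk
];

var byBaseIndex = chunks.map(function(c) { return isFirstByBaseIndex(c.baseIndex); });
// [true, true, false]: header recognition would trigger twice

var byFlag = chunks.map(function(c) { return isFirstByFlag(c.notFirstChunk); });
// [true, false, false]: header recognition triggers exactly once
```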

{
const result = data[0];
const headerCount = Object.create(null); // To track the count of each base header