Handle frontend loaded zip archive content (h5p, bloom(pub/d), perseus, kpub) that contains large files#12805
rtibbles wants to merge 12 commits into learningequality:release-v0.19.x from
Conversation
Build Artifacts

Smoke test screenshot
Hi @rtibbles, I noticed the following issues:
- large.mp4 (attached video)
- 'Unidad 1 - Herramientas para el empoderamiento' from the QA channel, where the player buttons are missing

I've created the following test channel.
Thanks @pcenov - I guess there was a good reason I marked this as a draft!

Note the backend parts of this PR have been cherry-picked into #12993.
38dada3 to 269f3a8
269f3a8 to 12743fd
@pcenov @radinamatic this has been updated - this is not super urgent, and QA of it shouldn't take precedence over other work - but in case there is some spare bandwidth, I think the latest updates here should have resolved the issues that were previously being observed.
Hi @rtibbles - unfortunately after testing again with the resources from
Point 2 from my previous comment is fixed now and the icons are displayed correctly.
I will take an incremental improvement - thanks for testing; I think this remaining issue will be quicker to hunt down!
12743fd to 40dc62c
77443cc to f274fff
rtibblesbot
left a comment
Well-structured approach to lazy-loading large zip archives on the frontend. The AdaptiveHttpReader, chunked fetching, and large-file URL delegation are solid.
CI: linting passed, other checks pending. No new strings added (appropriate for release branch target).
- suggestion: `ZIP_EXTRA_FIELD_ESTIMATE` may underestimate — see inline
- suggestion: tail prefetch range validation — see inline
- praise: Good decision to only defer audio/video to URL generation while keeping large non-media files in chunks
- praise: CSP_MEDIA_SRC addition correctly enables cross-origin media streaming
@rtibblesbot's comments are generated by an LLM, and should be evaluated accordingly
How was this generated?
Reviewed the pull request diff checking for:
- Correctness: bugs, edge cases, undocumented behavior, resource leaks, hardcoded values
- Design: unnecessary complexity, naming, readability, comment accuracy, redundant state
- Architecture: duplicated concerns, minimal interfaces, composition over inheritance
- Testing: behavior-based assertions, mocks only at hard boundaries, accurate coverage
- Completeness: missing dependencies, unupdated usages, i18n, accessibility, security
- Principles: DRY (same reason to change), SRP, Rule of Three (no premature abstraction)
- Checked CI status and linked issue acceptance criteria
- For UI changes: inspected screenshots for layout, visual completeness, and consistency
f274fff to 5b795b3
rtibblesbot
left a comment
Good progress addressing prior feedback — both the ZIP_EXTRA_FIELD_ESTIMATE increase and tail prefetch bounds check are in place.
2 prior finding(s) resolved.
CI: Frontend tests failing (2 tests in AdaptiveHttpReader.spec.js). Linting passes, other checks pending.
- blocking: Test assertions need updating after `ZIP_EXTRA_FIELD_ESTIMATE` change — see inline
- suggestion: `_tryFullDownload` onprogress abort path doesn't set `this.size` — see inline
- nitpick: BloomRunner comment copy-paste — see inline
@rtibblesbot's comments are generated by an LLM, and should be evaluated accordingly
How was this generated?
Compared the current PR state against findings from a prior review:
- Retrieved prior bot reviews deterministically via the GitHub API
- Classified each prior finding as RESOLVED, UNRESOLVED, PARTIALLY_ADDRESSED, or CONTESTED
- Only raised NEW findings for newly introduced code
- Reviewed the pull request diff checking for correctness, design, architecture, testing, completeness, and adherence to DRY/SRP principles
- Checked CI status and linked issue acceptance criteria
rtibbles
left a comment
Some notes for myself.
5b795b3 to ceb4d67
rtibblesbot
left a comment
Good progress — the core adaptive loading and large media delegation are solid.
5 prior finding(s) resolved.
Prior findings still open:
- [suggestion] `_tryFullDownload` onprogress abort path doesn't set `this.size` — UNRESOLVED. When `event.lengthComputable` is false and the Content-Length header is absent, `this.size` stays 0 after abort, breaking downstream lazy-mode operations. @rtibbles noted the onprogress handler may not be needed at all — either way, this needs resolution before merge.
CI: Linting failing — unused variable in test file. Frontend tests pass.
- blocking: Lint failure from unused `rangesBefore` — see inline
- suggestion: README "Large File Handling" naming is misleading per @rtibbles' own review — see inline
- suggestion: Lazy loading test comment block can be cleaned up per @rtibbles' own review — see inline
@rtibblesbot's comments are generated by an LLM, and should be evaluated accordingly
How was this generated?
Compared the current PR state against findings from a prior review:
- Retrieved prior bot reviews deterministically via the GitHub API
- Classified each prior finding as RESOLVED, UNRESOLVED, PARTIALLY_ADDRESSED, or CONTESTED
- Only raised NEW findings for newly introduced code
- Reviewed the pull request diff checking for correctness, design, architecture, testing, completeness, and adherence to DRY/SRP principles
- Checked CI status and linked issue acceptance criteria
ceb4d67 to e7a2bed
rtibblesbot
left a comment
Solid implementation — the adaptive loading, chunked fetching, and large media delegation are well-designed and well-tested.
7 prior finding(s) resolved. 1 prior finding acknowledged (author accepted, pending fix).
CI: Frontend tests passing, linting passing, Python tests in progress. No new i18n strings (appropriate for release-v0.19.x target).
- suggestion: `embeddedFilePath` in `zipContentUrl` is not URL-encoded — see inline
- suggestion: `chunksConfigured` getter has a subtle edge case — see inline
- praise: Comprehensive test coverage for both `AdaptiveHttpReader` and `ZipFile` lazy loading, including edge cases like tail chunk bounds fallback and large file gap detection
@rtibblesbot's comments are generated by an LLM, and should be evaluated accordingly
How was this generated?
Compared the current PR state against findings from a prior review:
- Retrieved prior bot reviews deterministically via the GitHub API
- Classified each prior finding as RESOLVED, UNADDRESSED, ACKNOWLEDGED, or CONTESTED
- Only raised NEW findings for newly introduced code
- Reviewed the pull request diff checking for correctness, design, architecture, testing, completeness, and adherence to DRY/SRP principles
- Checked CI status and linked issue acceptance criteria
…largeFile handling. Add zip origin to CSP for media.
Add unquote() call for embedded_filepath so filenames with spaces or special characters resolve correctly. Add test coverage for both backend (zip_wsgi) and frontend (zipContentUrl) URL encoding. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
e7a2bed to 29a6dab
rtibblesbot
left a comment
Implementation is solid — the adaptive loading, chunked fetching, and large media delegation are well-designed and thoroughly tested.
3 prior finding(s) resolved. 1 prior finding contested (URL encoding — URL constructor handles it, confirming test added).
CI: Build assets job in progress. No new i18n strings (appropriate for release-v0.19.x target).
- praise: Consistent `largeFileUrlGenerator` integration across all five consumers (Perseus, SafeHtml5, QTI, H5P, Bloom) with a clean API surface
@rtibblesbot's comments are generated by an LLM, and should be evaluated accordingly
How was this generated?
Compared the current PR state against findings from a prior review:
- Retrieved prior bot reviews deterministically via the GitHub API
- Classified each prior finding as RESOLVED, UNADDRESSED, ACKNOWLEDGED, or CONTESTED
- Only raised NEW findings for newly introduced code
- Reviewed the pull request diff checking for correctness, design, architecture, testing, completeness, and adherence to DRY/SRP principles
- Checked CI status and linked issue acceptance criteria
try {
  // Only use largeMediaThreshold if we have a generator to serve large files
  // Otherwise, treat all files the same (include them all in chunks)
  const effectiveThreshold = this.largeFileUrlGenerator ? this.largeMediaThreshold : Infinity;
praise: Smart use of Infinity threshold when no URL generator is provided — ensures all files are included in chunks without needing a separate code path.
The README states:

> Audio/video files larger than this use `largeFileUrlGenerator` instead of being extracted

Which made me go back to check whether it threw an error if the size was larger than the threshold but a URL generator wasn't supplied. This ternary prevents that situation, but silently. Since threshold handling is managed in AdaptiveHttpReader, this all leads to a bit of a confusing situation if misconfigured.
I've added a warning in the constructor now to handle this case.
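A minimal sketch of what such a guard could look like. The class and option names follow the thread's discussion (`largeMediaThreshold`, `largeFileUrlGenerator`, 500KB default from the README table) but this is an illustration, not the actual kolibri-zip code:

```javascript
// Hypothetical sketch of the constructor guard discussed above; names follow
// the PR discussion but this is not the actual kolibri-zip implementation.
class ZipFileOptions {
  constructor(options = {}) {
    const { largeMediaThreshold = 500 * 1024, largeFileUrlGenerator = null } = options;
    this.largeFileUrlGenerator = largeFileUrlGenerator;
    // Without a generator there is nothing to delegate large files to, so
    // use Infinity and include every file in the chunked extraction path.
    this.largeMediaThreshold = largeFileUrlGenerator ? largeMediaThreshold : Infinity;
    // Warn only when a threshold was explicitly configured but cannot take effect.
    if ('largeMediaThreshold' in options && !largeFileUrlGenerator) {
      console.warn('largeMediaThreshold is ignored without a largeFileUrlGenerator.');
    }
  }
}
```

This keeps the silent-fallback behavior but makes the misconfiguration visible to developers.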
rtibbles
left a comment
A quick self review for some signposting!
next(response.streaming_content).decode(), self.embedded_file_str
)

def test_request_for_url_encoded_embedded_file(self):
This came up during local self-review using Claude; it seems like an edge case we've not really hit previously, but it was simple enough to write a test for and fix.
# Includes a prefix that is almost certain not to collide
# with a filename embedded in a zip file. Prefix is:
# @*._ followed by the encoded base url
This comment was outdated.
CSP_FRAME_SRC = CSP_DEFAULT_SRC + frame_src

# Allow media from our zipcontent origin as well
CSP_MEDIA_SRC = CSP_DEFAULT_SRC + frame_src
This shouldn't affect our sandboxing, as it only adds a CSP allowance for media URLs specifically, and doesn't loosen anything in the opposite direction.
if (!this.isH5P) {
  // In the case that this is being routed via a remote URL
  // ensure we preserve that for the zip endpoint.
  const url = new URL(this.defaultFile.storage_url, window.location.href);
We were repeating this logic, so I moved it into the zip url function instead.
  const storageUrl = this.defaultFile.storage_url;
- const zipFile = new ZipFile(storageUrl);
+ const zipFile = new ZipFile(storageUrl, {
+   largeFileUrlGenerator: filepath => urls.zipContentUrl(this.defaultFile, filepath),
I suppose I could have made a function factory that gets reused, but it was such a small amount of code, it didn't seem worth abstracting.
@@ -0,0 +1,455 @@
// Polyfills for Node.js and older browsers (must be before zip.js import)
This is all new code, a subclass of the HttpReader class from zip.js to implement the lazy/chunked loading behaviour.
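As a rough sketch of the underlying technique (illustrative only; the function name and the injectable XHR class are assumptions for testability, not the actual AdaptiveHttpReader API):

```javascript
// Sketch of a single byte-range read with XMLHttpRequest, which the PR uses
// instead of fetch for iOS Safari 9.3 compatibility. XHRClass is injectable
// so the sketch also runs outside a browser; the real reader layers chunk
// grouping and tail prefetching on top of reads like this one.
function readRange(url, start, end, XHRClass) {
  return new Promise((resolve, reject) => {
    const xhr = new XHRClass();
    xhr.open('GET', url);
    // Ask the server for only the bytes we need; it replies 206 Partial Content.
    xhr.setRequestHeader('Range', `bytes=${start}-${end}`);
    xhr.responseType = 'arraybuffer';
    xhr.onload = () => resolve(new Uint8Array(xhr.response));
    xhr.onerror = () => reject(new Error(`Range request failed: ${url}`));
    xhr.send();
  });
}
```

In a browser the global `XMLHttpRequest` would be passed as `XHRClass`.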
this._fileLoadingPromise = this._init();
}

async _init() {
This seemed like a good opportunity to switch these methods to async/await, as there was a lot of gnarly Promise callback nesting otherwise.
@@ -1 +0,0 @@
- {"js":"application/javascript","json":"application/json","doc":"application/msword","pdf":"application/pdf","rtf":"text/rtf","xls":"application/vnd.ms-excel","ppt":"application/vnd.ms-powerpoint","odp":"application/vnd.oasis.opendocument.presentation","ods":"application/vnd.oasis.opendocument.spreadsheet","odt":"application/vnd.oasis.opendocument.text","pptx":"application/vnd.openxmlformats-officedocument.presentationml.presentation","xlsx":"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet","docx":"application/vnd.openxmlformats-officedocument.wordprocessingml.document","swf":"application/x-shockwave-flash","xhtml":"application/xhtml+xml","xml":"text/xml","mp3":"audio/mpeg","m4a":"audio/x-m4a","ogg":"audio/ogg","wav":"audio/x-wav","otf":"font/otf","ttf":"font/ttf","woff":"font/woff","bmp":"image/x-ms-bmp","gif":"image/gif","jpeg":"image/jpeg","jpg":"image/jpeg","png":"image/png","svg":"image/svg+xml","tif":"image/tiff","tiff":"image/tiff","css":"text/css","csv":"text/csv","html":"text/html","htm":"text/html","md":"text/markdown","txt":"text/plain","vtt":"text/vtt","mp4":"video/mp4","webm":"video/webm"}
zip.js includes more robust mimetype handling internally, which I chose to supplement rather than continuing to maintain our own.
@@ -0,0 +1,1154 @@
// Polyfills for Node.js/jsdom environment - must run BEFORE any module imports
I wrote this test suite before doing any refactor to zip.js to give me confidence of the public API of the kolibri-zip module remaining stable. This is also why the tests here still rely on fflate, because that was what was in place previously.
I had to make a few tweaks to the tests (for example adding the polyfills for webstream and text encoder/decoder etc), but for the most part even the tests of the novel behaviour were written first (with Claude Code assistance) in order to take a RED/GREEN approach here.
});

describe('Large file with URL generator', () => {
  test('large media file cannot be converted to string when URL generator is used', async () => {
We could in theory enable this by turning the toString method into an async/await method, but it feels of limited use for media files.
bjester
left a comment
The most notable feedback I have is regarding the potentially confusing configuration situation with `largeMediaThreshold` and `largeFileUrlGenerator`.
// web-streams-polyfill/polyfill patches globals as a side effect
import 'web-streams-polyfill/polyfill';
// Use Node.js built-in TextEncoder/TextDecoder (available since Node 11)
import { TextEncoder, TextDecoder } from 'util';
Recommended practice now is to use the node: protocol prefix.
// Restore original URL methods
global.URL.createObjectURL = originalCreateObjectURL;
global.URL.revokeObjectURL = originalRevokeObjectURL;
This type of thing is handled for you by using proper jest mocks on objects, together with jest.restoreAllMocks();
// Use Node.js built-in TextEncoder/TextDecoder (available since Node 11)
import { TextEncoder, TextDecoder } from 'util';

import mock from 'xhr-mock';
Explicitly naming this xhrMock is much clearer.
const videoUrl = videoFile.toUrl();

// KEY ASSERTION: largeFileUrlGenerator must be called
expect(generatedUrls).toContain('media/video.mp4');
Using a local array of call args to assert behavior is a hack for what mock functions do normally. This pattern is repeated in several places, creating significant test logic code that duplicates what mocks do.
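To illustrate the reviewer's point, here is a minimal hand-rolled version of the call recording that `jest.fn()` provides out of the box (in real tests, `jest.fn()` / `jest.spyOn()` plus `expect(...).toHaveBeenCalledWith(...)` replace all of this):

```javascript
// What jest.fn() gives you for free: each call's arguments are recorded on a
// calls array, so tests assert behavior without hand-maintained local arrays.
function makeMockFn(impl = () => undefined) {
  const mockFn = (...args) => {
    mockFn.calls.push(args);
    return impl(...args);
  };
  mockFn.calls = [];
  return mockFn;
}

// Hypothetical generator standing in for largeFileUrlGenerator in a test.
const generator = makeMockFn(filepath => `/zipcontent/${filepath}`);
generator('media/video.mp4');
// Equivalent of expect(generator).toHaveBeenCalledWith('media/video.mp4'):
console.log(generator.calls.some(args => args[0] === 'media/video.mp4')); // true
```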
expect(video._urlGenerator).toBeNull();
expect(video.obj).toBeInstanceOf(Uint8Array);
expect(video.obj.length).toBe(600 * 1024);
// toString() would work (but would be slow for large binary), toUrl() returns blob
It appears there are other tests calling toString(), but this comment feels like an implementation note rather than a good reason not to actually test .toString(). "would work" => well, test it!
It was attempted in 'large media file cannot be converted to string when URL generator is used', which makes the reciprocal case sensible IMO.
Yes, have tweaked - this was a leftover note I missed cleaning up.
global.URL.createObjectURL = jest.fn(() => {
  return `blob:mock-${Math.random().toString(36).substr(2, 9)}`;
});
global.URL.revokeObjectURL = jest.fn();
Never restored? See my other comment.
kolibri/core/content/zip_wsgi.py
Outdated
embedded_filepath = embedded_filepath.replace("//", "/")
# Decode any URL-encoded characters in the embedded filepath so that
# filenames with spaces or special characters resolve correctly.
embedded_filepath = unquote(embedded_filepath)
Does the webserver not already unquote these, meaning this could inadvertently mess up embedded paths that legitimately have character sequences that would be considered url-encoded?
Yes - you're right - the test was not veridical because it artificially encoded the path, whereas the WSGI server would already have dealt with that. Have reverted the test and the changes.
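The double-decoding hazard described here can be shown with `decodeURIComponent` (a JavaScript illustration of the same problem; the actual code path is Python's `unquote` in zip_wsgi.py):

```javascript
// A filename that legitimately contains "%20" must be sent as "%2520" on the
// wire. One decode (the server's) recovers it; a second decode corrupts it.
const onTheWire = 'file%2520name.txt'; // as received in the raw URL
const afterServerDecode = decodeURIComponent(onTheWire);
console.log(afterServerDecode); // 'file%20name.txt' (the correct literal name)
console.log(decodeURIComponent(afterServerDecode)); // 'file name.txt' (corrupted)
```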
packages/kolibri-zip/README.md
Outdated
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `maxFullLoadSize` | `number` | `2.5MB` | Files larger than this trigger lazy loading with range requests |
| `largeMediaThreshold` | `number` | `500KB` | Audio/video files larger than this use `largeFileUrlGenerator` instead of being extracted |
| `largeFileUrlGenerator` | `function` | `null` | Function `(filename) => url` for serving large audio/video files directly instead of extracting them |
| `chunkSize` | `number` | `500KB` | Size of chunks for grouped range requests |
| `filePathMappers` | `object` | (built-in) | Custom path replacement mappers by file extension |
There's more documentation here than exists in the code. Please consider more JSDocs over separate README details.
Moved all code-specific documentation to JSDocs.
…to add largeFile handling. Add zip origin to CSP for media. fixup! Change API of zipContentUrl. Update all kolibri-zip consumers to add largeFile handling. Add zip origin to CSP for media. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fixup! Refactor kolibri-zip to use zip.js Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  // A fallback URL to the zipcontent endpoint for this H5P file
  this.zipcontentUrl = new URL(
-   `../../zipcontent/${this.filepath.substring(this.filepath.lastIndexOf('/') + 1)}`,
+   `../../zipcontent/${this.filepath.split('/').pop()}`,
Propagated the update from the BloomRunner here.
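The two expressions extract the same final path segment; the `split('/').pop()` form simply reads more clearly (the filepath value below is a hypothetical example):

```javascript
// split('/').pop() behaves identically to the substring/lastIndexOf form,
// including for strings that contain no slash at all.
const filepath = 'content/archives/sample.h5p'; // hypothetical example path
console.log(filepath.substring(filepath.lastIndexOf('/') + 1)); // 'sample.h5p'
console.log(filepath.split('/').pop());                         // 'sample.h5p'
```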
fixup! Refactor kolibri-zip to use zip.js Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
67e51c3 to d9d912c
fixup! Refactor kolibri-zip to use zip.js Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>







Summary

Refactors the `kolibri-zip` package to use `zip.js` and implements adaptive loading for large ZIP archives (H5P, Bloom, Perseus, QTI). Instead of downloading entire archives upfront, the new approach:

- A `largeFileUrlGenerator` callback provides streaming URLs pointing to the `zipcontent` backend, so video/audio plays via standard HTTP without being loaded into memory

Other changes:

- Changed the `zipContentUrl` API — now takes a file object instead of separate parameters
- Added `CSP_MEDIA_SRC` to allow media streaming from the zipcontent origin
- Test suites for `AdaptiveHttpReader` and `kolibri-zip`

References
Fixes #9761
Reviewer guidance

How to test:

- `zapif-babuz` — it contains content with large embedded video/audio
- Test on `release-v0.19.x`
- Test Bloom (`.bloomd`/`.bloompub`) and Perseus content as well

Risky areas deserving extra review:

- `packages/kolibri-zip/src/AdaptiveHttpReader.js` — entirely new; implements HTTP range request reading with chunking, tail prefetching, and adaptive full-download-to-lazy switching
- `packages/kolibri-zip/src/index.js` — rewritten from Promise/callback-based code (`fflate` with `unzip` callbacks) to async/await using `zip.js`. The control flow is substantially different even though the public API (`file`, `files`, `filesFromExtension`, `close`) is preserved. Also adds the new large file URL generation path through `ExtractedFile`.
- `kolibri/deployment/default/settings/base.py:489-491` — CSP change adding `CSP_MEDIA_SRC` to allow media streaming from the zipcontent origin

Architecture decisions:

- `XMLHttpRequest` instead of `fetch` for range requests to maintain iOS Safari 9.3 compatibility

AI usage
Claude Code was used extensively to implement the `AdaptiveHttpReader`, write test suites, and update all consumer components to the new API. All generated code was reviewed, tested, and refined, with a particular focus on removing or refining redundant test and production code paths.