Pull request overview
This PR introduces a streaming BED file parser for the WASM module, enabling efficient parsing of BED files in the browser without loading entire files into memory upfront. The parser supports automatic gzip detection and provides a handle-based API consistent with the existing refget streaming implementation.
Changes:
- Added a new `bed_stream.rs` module implementing a streaming BED parser with gzip auto-detection
- Exposed the BED streaming API at the top level of the WASM module
- Added the `flate2` workspace dependency for gzip decompression support
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| gtars-wasm/src/bed_stream.rs | New streaming BED parser with handle-based API (new/update/finish/free/progress), auto-detection of gzip compression, and incremental line parsing |
| gtars-wasm/src/lib.rs | Added bed_stream module declaration and re-exported its public API; updated comment to reflect multiple re-exports |
| gtars-wasm/Cargo.toml | Added flate2 workspace dependency for gzip decompression |
| .github/workflows/rust-publish.yml | Added gtars-genomicdist to publishable crates list and removed trailing whitespace |
```rust
pub struct BedParserProgress {
    pub region_count: usize,
    pub bytes_processed: usize,
}
```
The bed_stream module lacks test coverage. Following the pattern in refget.rs (lines 426-459), consider adding wasm-bindgen-test tests for the streaming BED parser API to cover: basic lifecycle (new/update/finish), invalid handle errors, gzip detection, malformed BED data handling, and progress reporting.
```rust
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use wasm_bindgen_test::*;

    wasm_bindgen_test_configure!(run_in_browser);

    /// Ensure that requesting progress for an invalid handle returns an error.
    #[wasm_bindgen_test]
    fn bed_parser_progress_invalid_handle() {
        let result = bed_parser_progress(999_999);
        assert!(result.is_err(), "expected error for invalid parser handle");
    }

    /// Ensure that finishing an invalid handle returns an error.
    #[wasm_bindgen_test]
    fn bed_parser_finish_invalid_handle() {
        let result = bed_parser_finish(999_999);
        assert!(result.is_err(), "expected error when finalizing invalid parser handle");
    }

    /// Ensure that freeing an invalid handle returns false.
    #[wasm_bindgen_test]
    fn bed_parser_free_invalid_handle() {
        let freed = bed_parser_free(999_999);
        assert!(!freed, "expected false when freeing invalid parser handle");
    }
}
```
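These tests can typically be run headlessly with `wasm-pack test --headless --firefox gtars-wasm` (command assumed from the standard wasm-bindgen-test workflow rather than this repo's docs; adjust the browser flag and crate path to match the project).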
```rust
#[wasm_bindgen(js_name = "bedParserNew")]
pub fn bed_parser_new() -> u32 {
    with_bed_storage(|storage| storage.insert(BedStreamParser::new()))
}

#[wasm_bindgen(js_name = "bedParserUpdate")]
pub fn bed_parser_update(handle: u32, chunk: &[u8]) -> Result<(), JsError> {
    with_bed_storage(|storage| {
        if let Some(parser) = storage.get_mut(handle) {
            parser
                .update(chunk)
                .map_err(|e| JsError::new(&format!("Failed to process chunk: {}", e)))
        } else {
            Err(JsError::new("Invalid parser handle"))
        }
    })
}

/// Finalize parsing and return BedEntries (array of [chr, start, end, rest] tuples).
/// The result can be passed directly to `new RegionSet(entries)`.
#[wasm_bindgen(js_name = "bedParserFinish")]
pub fn bed_parser_finish(handle: u32) -> Result<JsValue, JsError> {
    let parser = with_bed_storage(|storage| storage.remove(handle))
        .ok_or_else(|| JsError::new("Invalid parser handle"))?;

    let region_set = parser
        .finish()
        .map_err(|e| JsError::new(&format!("Failed to finalize parser: {}", e)))?;

    let entries: Vec<(String, u32, u32, String)> = region_set
        .regions
        .into_iter()
        .map(|r| (r.chr, r.start, r.end, r.rest.unwrap_or_default()))
        .collect();

    serde_wasm_bindgen::to_value(&BedEntries(entries))
        .map_err(|e| JsError::new(&format!("Serialization error: {}", e)))
}

#[wasm_bindgen(js_name = "bedParserFree")]
pub fn bed_parser_free(handle: u32) -> bool {
    with_bed_storage(|storage| storage.remove(handle).is_some())
}

#[wasm_bindgen(js_name = "bedParserProgress")]
pub fn bed_parser_progress(handle: u32) -> Result<JsValue, JsError> {
    let progress = with_bed_storage(|storage| {
        storage.get_mut(handle).map(|parser| BedParserProgress {
            region_count: parser.region_count(),
            bytes_processed: parser.bytes_processed(),
        })
    });

    match progress {
        Some(p) => serde_wasm_bindgen::to_value(&p)
            .map_err(|e| JsError::new(&format!("Serialization error: {}", e))),
        None => Err(JsError::new("Invalid parser handle")),
    }
}
```
Following the documentation convention in refget.rs (e.g., lines 101-124), add detailed doc comments to the public WASM functions explaining parameters, return values, and providing JavaScript usage examples. This is especially important for bedParserNew, bedParserUpdate, bedParserFinish, and bedParserProgress.
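For illustration, a hedged sketch of what such a doc comment could look like on `bedParserUpdate` (the parameter descriptions and the embedded JavaScript example are assumptions, not text copied from refget.rs):

````rust
/// Feed the next chunk of a BED file to the parser identified by `handle`.
///
/// * `handle` - parser handle returned by `bedParserNew`.
/// * `chunk` - raw file bytes, in order; chunks may split lines arbitrarily.
///
/// Returns an error if the handle is invalid or the chunk cannot be processed.
///
/// # Example (JavaScript)
///
/// ```js
/// const handle = bedParserNew();
/// for await (const chunk of file.stream()) {
///     bedParserUpdate(handle, chunk);
/// }
/// const entries = bedParserFinish(handle);
/// ```
#[wasm_bindgen(js_name = "bedParserUpdate")]
pub fn bed_parser_update(handle: u32, chunk: &[u8]) -> Result<(), JsError> { /* as above */ }
````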
```diff
@@ -37,4 +38,4 @@ jobs:
       run: |
         cargo publish -p ${{ github.event.inputs.crate }} --locked
       env:
-        CARGO_REGISTRY_TOKEN: ${{ steps.auth.outputs.token }}
\ No newline at end of file
+        CARGO_REGISTRY_TOKEN: ${{ steps.auth.outputs.token }}
```
This workflow change adds gtars-genomicdist to the list of publishable crates and removes trailing whitespace. These changes appear unrelated to the "streaming bed" functionality described in the PR. If this is an unintended inclusion, consider removing it from this PR and creating a separate PR for workflow updates. If it is intentional, consider updating the PR description to mention these changes.
```rust
let last_newline = match self.text_buf.rfind('\n') {
    Some(pos) => pos,
    None => return,
};

let complete = self.text_buf[..last_newline].to_string();
let remainder = self.text_buf[last_newline + 1..].to_string();
self.text_buf = remainder;

for line in complete.split('\n') {
    self.parse_line(line);
}
```
When parsing lines, the code copies the completed portion of the buffer into a new String and allocates another String for the remainder (line 94), even though the `split` on line 97 only yields slices. For very large files with many lines, consider avoiding these full-buffer copies by parsing lines directly from the buffer and draining the processed prefix, as in the suggestion below.
```rust
let last_newline = match self.text_buf.rfind('\n') {
    Some(pos) => pos,
    None => return,
};

let complete = self.text_buf[..last_newline].to_string();
let remainder = self.text_buf[last_newline + 1..].to_string();
self.text_buf = remainder;

for line in complete.split('\n') {
    self.parse_line(line);
}
```

Suggested change (adjusted so it borrow-checks: `parse_line` takes `&mut self`, so the buffer is moved out with `mem::take` before its contents are borrowed):

```rust
// Move the buffer out so its contents can be borrowed while calling
// parse_line(&mut self, ...).
let buf = std::mem::take(&mut self.text_buf);
let mut start = 0;
while let Some(rel_pos) = buf[start..].find('\n') {
    let line_end = start + rel_pos;
    self.parse_line(&buf[start..line_end]);
    start = line_end + 1;
}
// Keep only the trailing partial line for the next chunk.
self.text_buf = buf[start..].to_string();
```
```rust
    return;
}

let start = match parts[1].parse::<u32>() {
    Ok(v) => v,
    Err(_) => return,
};
let end = match parts[2].parse::<u32>() {
    Ok(v) => v,
    Err(_) => return,
```
The parser silently skips lines with parse errors (lines 117-120, 121-124). While this may be acceptable for malformed data, it makes debugging difficult. Consider adding a mechanism to report parsing errors or warnings, such as collecting error messages that can be retrieved via a separate API call. This would help users identify issues with their BED files.
```rust
    return;
}

let start = match parts[1].parse::<u32>() {
    Ok(v) => v,
    Err(_) => return,
};
let end = match parts[2].parse::<u32>() {
    Ok(v) => v,
    Err(_) => return,
```

Suggested change:

```rust
    eprintln!(
        "BED parse warning: expected at least 3 tab-separated fields, got {} in line: {:?}",
        parts.len(),
        trimmed
    );
    return;
}

let start = match parts[1].parse::<u32>() {
    Ok(v) => v,
    Err(e) => {
        eprintln!(
            "BED parse warning: invalid start coordinate {:?} in line {:?}: {}",
            parts[1], trimmed, e
        );
        return;
    }
};
let end = match parts[2].parse::<u32>() {
    Ok(v) => v,
    Err(e) => {
        eprintln!(
            "BED parse warning: invalid end coordinate {:?} in line {:?}: {}",
            parts[2], trimmed, e
        );
        return;
    }
};
```
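Note that `eprintln!` output is not surfaced in a browser. A minimal sketch of the alternative the comment describes — collecting warnings on the parser and exposing them through a separate call (the `warnings` field and `bedParserWarnings` function are hypothetical additions, not part of this PR):

```rust
// Hypothetical: BedStreamParser gains a `warnings: Vec<String>` field that
// parse_line() pushes messages to instead of printing them.
#[wasm_bindgen(js_name = "bedParserWarnings")]
pub fn bed_parser_warnings(handle: u32) -> Result<Vec<String>, JsError> {
    with_bed_storage(|storage| {
        storage
            .get_mut(handle)
            .map(|parser| parser.warnings.clone())
            .ok_or_else(|| JsError::new("Invalid parser handle"))
    })
}
```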
```rust
if self.is_gzipped == Some(true) {
    self.compressed_buf.extend_from_slice(chunk);
} else {
    let text = String::from_utf8_lossy(chunk);
    self.text_buf.push_str(&text);
    self.parse_complete_lines();
}
```
When gzipped data is detected, all chunks are buffered in memory and only decompressed during finish. For large files, this could lead to high memory usage. Consider implementing streaming decompression by processing the gzip data incrementally as chunks arrive, similar to how plain text is parsed line-by-line during update.
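A minimal sketch of that approach, assuming the parser grows a persistent decoder field: `flate2::write::GzDecoder` accepts compressed bytes via `Write` and appends decompressed bytes to its inner buffer (handling of multi-byte UTF-8 split across chunk boundaries is ignored here):

```rust
use std::io::Write;

use flate2::write::GzDecoder;

/// Hypothetical incremental decompressor: feed compressed chunks as they
/// arrive and drain whatever plain text is available, instead of buffering
/// the entire compressed stream until finish().
struct StreamingGzip {
    decoder: GzDecoder<Vec<u8>>,
}

impl StreamingGzip {
    fn new() -> Self {
        Self { decoder: GzDecoder::new(Vec::new()) }
    }

    fn update(&mut self, chunk: &[u8]) -> std::io::Result<String> {
        self.decoder.write_all(chunk)?;
        self.decoder.flush()?;
        // Take the decompressed bytes accumulated so far, leaving the inner
        // buffer empty for the next chunk.
        let plain = std::mem::take(self.decoder.get_mut());
        Ok(String::from_utf8_lossy(&plain).into_owned())
    }
}
```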
```rust
use std::collections::HashMap;
use std::io::{Cursor, Read};
use std::sync::Mutex;

use flate2::read::GzDecoder;
use gtars_core::models::{Region, RegionSet};
use serde::{Deserialize, Serialize};
use wasm_bindgen::prelude::*;

use crate::models::BedEntries;
```
The bed_stream module lacks module-level documentation. Following the convention in refget.rs (lines 1-25), add module-level documentation explaining the purpose of the module, the streaming API usage pattern with JavaScript examples, and any important notes about gzip handling or memory considerations.
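A hedged sketch of what such module docs could look like (the wording and the embedded JavaScript example are illustrative, not copied from refget.rs):

````rust
//! Streaming BED parser for the WASM module.
//!
//! Parses BED files chunk-by-chunk via a handle-based API, so the browser
//! never has to hold the whole file in memory as text. Gzip input is
//! auto-detected from the magic bytes of the first chunk; compressed data
//! is currently buffered and decompressed in `finish()`.
//!
//! # Usage (JavaScript)
//!
//! ```js
//! const handle = bedParserNew();
//! bedParserUpdate(handle, chunk1);
//! bedParserUpdate(handle, chunk2);
//! const entries = bedParserFinish(handle); // also frees the handle
//! ```
````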
```rust
    let start = match parts[1].parse::<u32>() {
        Ok(v) => v,
        Err(_) => return,
    };
    let end = match parts[2].parse::<u32>() {
        Ok(v) => v,
        Err(_) => return,
    };

    let rest = if parts.len() > 3 {
        Some(parts[3..].join("\t"))
    } else {
        None
    };

    self.regions.push(Region {
        chr: parts[0].to_string(),
        start,
        end,
        rest: rest.filter(|s| !s.is_empty()),
    });
}
```
The parser does not validate that start is less than end. Invalid BED entries where start >= end will be silently accepted and added to the regions list. Consider adding validation to reject or skip such entries, as they represent invalid genomic intervals.
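A guard right after the coordinate parsing would be enough; a sketch (whether to skip silently or record a warning is a design choice left open here):

```rust
// Hypothetical validation: BED intervals are half-open, so start must be < end.
if start >= end {
    return; // skip the invalid interval (or record a warning, as suggested above)
}
```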
```rust
if self.is_gzipped.is_none() {
    self.is_gzipped = Some(chunk.len() >= 2 && chunk[0] == 0x1f && chunk[1] == 0x8b);
}
```
The gzip detection is only performed on the first chunk received. If the first chunk is very small (e.g., 1 byte), the detection will fail and the parser will incorrectly treat gzipped data as plain text. Consider buffering the first few bytes until at least 2 bytes are available before making the determination, or document the minimum chunk size requirement.
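A sketch of the buffering approach (assumes a new `magic_buf: Vec<u8>` field on the parser and a `Result`-returning `update`; re-feeding the buffered bytes through the normal path afterwards is elided):

```rust
// Hypothetical fix: withhold classification until at least two bytes arrive.
if self.is_gzipped.is_none() {
    self.magic_buf.extend_from_slice(chunk);
    if self.magic_buf.len() < 2 {
        return Ok(()); // too early to tell; wait for more data
    }
    self.is_gzipped = Some(self.magic_buf[0] == 0x1f && self.magic_buf[1] == 0x8b);
    // The bytes accumulated in magic_buf must now be routed through the
    // compressed or plain-text path instead of the raw chunk.
}
```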