
Streaming bed#235

Open
nsheff wants to merge 6 commits into dev from streaming_bed

Conversation

Member

@nsheff nsheff commented Feb 19, 2026

Changes:

  • Added BED streaming instead of Region initialization in JS

@khoroshevskyi khoroshevskyi marked this pull request as ready for review February 19, 2026 20:06
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a streaming BED file parser for the WASM module, enabling efficient parsing of BED files in the browser without loading entire files into memory upfront. The parser supports automatic gzip detection and provides a handle-based API consistent with the existing refget streaming implementation.

Changes:

  • Added new bed_stream.rs module implementing a streaming BED parser with gzip auto-detection
  • Exposed the bed streaming API at the top level of the WASM module
  • Added flate2 dependency for gzip decompression support

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

gtars-wasm/src/bed_stream.rs: New streaming BED parser with handle-based API (new/update/finish/free/progress), auto-detection of gzip compression, and incremental line parsing
gtars-wasm/src/lib.rs: Added bed_stream module declaration and re-exported its public API; updated comment to reflect multiple re-exports
gtars-wasm/Cargo.toml: Added flate2 workspace dependency for gzip decompression
.github/workflows/rust-publish.yml: Added gtars-genomicdist to publishable crates list and removed trailing whitespace


pub struct BedParserProgress {
    pub region_count: usize,
    pub bytes_processed: usize,
}

Copilot AI Feb 19, 2026


The bed_stream module lacks test coverage. Following the pattern in refget.rs (lines 426-459), consider adding wasm-bindgen-test tests for the streaming BED parser API to cover: basic lifecycle (new/update/finish), invalid handle errors, gzip detection, malformed BED data handling, and progress reporting.

Suggested change
  }
}

#[cfg(test)]
mod tests {
    use super::*;
    use wasm_bindgen_test::*;

    wasm_bindgen_test_configure!(run_in_browser);

    /// Ensure that requesting progress for an invalid handle returns an error.
    #[wasm_bindgen_test]
    fn bed_parser_progress_invalid_handle() {
        let result = bed_parser_progress(999_999);
        assert!(result.is_err(), "expected error for invalid parser handle");
    }

    /// Ensure that finishing an invalid handle returns an error.
    #[wasm_bindgen_test]
    fn bed_parser_finish_invalid_handle() {
        let result = bed_parser_finish(999_999);
        assert!(result.is_err(), "expected error when finalizing invalid parser handle");
    }

    /// Ensure that freeing an invalid handle returns false.
    #[wasm_bindgen_test]
    fn bed_parser_free_invalid_handle() {
        let freed = bed_parser_free(999_999);
        assert!(!freed, "expected false when freeing invalid parser handle");
    }
}

Comment on lines +202 to +260
#[wasm_bindgen(js_name = "bedParserNew")]
pub fn bed_parser_new() -> u32 {
    with_bed_storage(|storage| storage.insert(BedStreamParser::new()))
}

#[wasm_bindgen(js_name = "bedParserUpdate")]
pub fn bed_parser_update(handle: u32, chunk: &[u8]) -> Result<(), JsError> {
    with_bed_storage(|storage| {
        if let Some(parser) = storage.get_mut(handle) {
            parser
                .update(chunk)
                .map_err(|e| JsError::new(&format!("Failed to process chunk: {}", e)))
        } else {
            Err(JsError::new("Invalid parser handle"))
        }
    })
}

/// Finalize parsing and return BedEntries (array of [chr, start, end, rest] tuples).
/// The result can be passed directly to `new RegionSet(entries)`.
#[wasm_bindgen(js_name = "bedParserFinish")]
pub fn bed_parser_finish(handle: u32) -> Result<JsValue, JsError> {
    let parser = with_bed_storage(|storage| storage.remove(handle))
        .ok_or_else(|| JsError::new("Invalid parser handle"))?;

    let region_set = parser
        .finish()
        .map_err(|e| JsError::new(&format!("Failed to finalize parser: {}", e)))?;

    let entries: Vec<(String, u32, u32, String)> = region_set
        .regions
        .into_iter()
        .map(|r| (r.chr, r.start, r.end, r.rest.unwrap_or_default()))
        .collect();

    serde_wasm_bindgen::to_value(&BedEntries(entries))
        .map_err(|e| JsError::new(&format!("Serialization error: {}", e)))
}

#[wasm_bindgen(js_name = "bedParserFree")]
pub fn bed_parser_free(handle: u32) -> bool {
    with_bed_storage(|storage| storage.remove(handle).is_some())
}

#[wasm_bindgen(js_name = "bedParserProgress")]
pub fn bed_parser_progress(handle: u32) -> Result<JsValue, JsError> {
    let progress = with_bed_storage(|storage| {
        storage.get_mut(handle).map(|parser| BedParserProgress {
            region_count: parser.region_count(),
            bytes_processed: parser.bytes_processed(),
        })
    });

    match progress {
        Some(p) => serde_wasm_bindgen::to_value(&p)
            .map_err(|e| JsError::new(&format!("Serialization error: {}", e))),
        None => Err(JsError::new("Invalid parser handle")),
    }
}

Copilot AI Feb 19, 2026


Following the documentation convention in refget.rs (e.g., lines 101-124), add detailed doc comments to the public WASM functions explaining parameters, return values, and providing JavaScript usage examples. This is especially important for bedParserNew, bedParserUpdate, bedParserFinish, and bedParserProgress.

Comment on lines 24 to 41
@@ -37,4 +38,4 @@ jobs:
         run: |
           cargo publish -p ${{ github.event.inputs.crate }} --locked
         env:
-          CARGO_REGISTRY_TOKEN: ${{ steps.auth.outputs.token }}
\ No newline at end of file
+          CARGO_REGISTRY_TOKEN: ${{ steps.auth.outputs.token }}

Copilot AI Feb 19, 2026


This workflow change adds gtars-genomicdist to the list of publishable crates and removes trailing whitespace. These changes appear unrelated to the "streaming bed" functionality described in the PR. If this is an unintended inclusion, consider removing it from this PR and creating a separate PR for workflow updates. If it is intentional, consider updating the PR description to mention these changes.

Comment on lines +88 to +99
        let last_newline = match self.text_buf.rfind('\n') {
            Some(pos) => pos,
            None => return,
        };

        let complete = self.text_buf[..last_newline].to_string();
        let remainder = self.text_buf[last_newline + 1..].to_string();
        self.text_buf = remainder;

        for line in complete.split('\n') {
            self.parse_line(line);
        }

Copilot AI Feb 19, 2026


When parsing lines, the code creates a String allocation for the remainder on line 94, then creates individual line strings on line 97 via split. For very large files with many lines, consider optimizing to avoid unnecessary allocations by using string slices where possible or processing lines directly from the buffer.

Suggested change (as originally posted, the replacement held `&self.text_buf[start..line_end]` while calling `self.parse_line(&mut self, ...)`, which the borrow checker rejects; taking the buffer out first avoids that conflict):

-        let last_newline = match self.text_buf.rfind('\n') {
-            Some(pos) => pos,
-            None => return,
-        };
-
-        let complete = self.text_buf[..last_newline].to_string();
-        let remainder = self.text_buf[last_newline + 1..].to_string();
-        self.text_buf = remainder;
-
-        for line in complete.split('\n') {
-            self.parse_line(line);
-        }
+        // Take the buffer so parsing does not borrow `self.text_buf`
+        // while `self.parse_line` needs `&mut self`.
+        let buf = std::mem::take(&mut self.text_buf);
+        let mut start = 0;
+        while let Some(rel_pos) = buf[start..].find('\n') {
+            let line_end = start + rel_pos;
+            self.parse_line(&buf[start..line_end]);
+            start = line_end + 1;
+        }
+        // Keep any trailing partial line for the next chunk.
+        self.text_buf = buf[start..].to_string();

Comment on lines +114 to +123
return;
}

let start = match parts[1].parse::<u32>() {
Ok(v) => v,
Err(_) => return,
};
let end = match parts[2].parse::<u32>() {
Ok(v) => v,
Err(_) => return,

Copilot AI Feb 19, 2026


The parser silently skips lines with parse errors (lines 117-120, 121-124). While this may be acceptable for malformed data, it makes debugging difficult. Consider adding a mechanism to report parsing errors or warnings, such as collecting error messages that can be retrieved via a separate API call. This would help users identify issues with their BED files.

Suggested change

-            return;
-        }
-
-        let start = match parts[1].parse::<u32>() {
-            Ok(v) => v,
-            Err(_) => return,
-        };
-        let end = match parts[2].parse::<u32>() {
-            Ok(v) => v,
-            Err(_) => return,
+            eprintln!(
+                "BED parse warning: expected at least 3 tab-separated fields, got {} in line: {:?}",
+                parts.len(),
+                trimmed
+            );
+            return;
+        }
+
+        let start = match parts[1].parse::<u32>() {
+            Ok(v) => v,
+            Err(e) => {
+                eprintln!(
+                    "BED parse warning: invalid start coordinate {:?} in line {:?}: {}",
+                    parts[1], trimmed, e
+                );
+                return;
+            }
+        };
+        let end = match parts[2].parse::<u32>() {
+            Ok(v) => v,
+            Err(e) => {
+                eprintln!(
+                    "BED parse warning: invalid end coordinate {:?} in line {:?}: {}",
+                    parts[2], trimmed, e
+                );
+                return;
+            }

Comment on lines +46 to +52
        if self.is_gzipped == Some(true) {
            self.compressed_buf.extend_from_slice(chunk);
        } else {
            let text = String::from_utf8_lossy(chunk);
            self.text_buf.push_str(&text);
            self.parse_complete_lines();
        }

Copilot AI Feb 19, 2026


When gzipped data is detected, all chunks are buffered in memory and only decompressed during finish. For large files, this could lead to high memory usage. Consider implementing streaming decompression by processing the gzip data incrementally as chunks arrive, similar to how plain text is parsed line-by-line during update.

Comment on lines +1 to +10
use std::collections::HashMap;
use std::io::{Cursor, Read};
use std::sync::Mutex;

use flate2::read::GzDecoder;
use gtars_core::models::{Region, RegionSet};
use serde::{Deserialize, Serialize};
use wasm_bindgen::prelude::*;

use crate::models::BedEntries;

Copilot AI Feb 19, 2026


The bed_stream module lacks module-level documentation. Following the convention in refget.rs (lines 1-25), add module-level documentation explaining the purpose of the module, the streaming API usage pattern with JavaScript examples, and any important notes about gzip handling or memory considerations.

Comment on lines +117 to +138
        let start = match parts[1].parse::<u32>() {
            Ok(v) => v,
            Err(_) => return,
        };
        let end = match parts[2].parse::<u32>() {
            Ok(v) => v,
            Err(_) => return,
        };

        let rest = if parts.len() > 3 {
            Some(parts[3..].join("\t"))
        } else {
            None
        };

        self.regions.push(Region {
            chr: parts[0].to_string(),
            start,
            end,
            rest: rest.filter(|s| !s.is_empty()),
        });
    }

Copilot AI Feb 19, 2026


The parser does not validate that start is less than end. Invalid BED entries where start >= end will be silently accepted and added to the regions list. Consider adding validation to reject or skip such entries, as they represent invalid genomic intervals.

Comment on lines +42 to +44
        if self.is_gzipped.is_none() {
            self.is_gzipped = Some(chunk.len() >= 2 && chunk[0] == 0x1f && chunk[1] == 0x8b);
        }

Copilot AI Feb 19, 2026


The gzip detection is only performed on the first chunk received. If the first chunk is very small (e.g., 1 byte), the detection will fail and the parser will incorrectly treat gzipped data as plain text. Consider buffering the first few bytes until at least 2 bytes are available before making the determination, or document the minimum chunk size requirement.

