Pull request overview
This PR introduces a streaming BED file parser for the WASM module, enabling efficient parsing of BED files in the browser without loading entire files into memory upfront. The parser supports automatic gzip detection and provides a handle-based API consistent with the existing refget streaming implementation.
Changes:
- Added a new `bed_stream.rs` module implementing a streaming BED parser with gzip auto-detection
- Exposed the BED streaming API at the top level of the WASM module
- Added the `flate2` workspace dependency for gzip decompression support
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| gtars-wasm/src/bed_stream.rs | New streaming BED parser with handle-based API (new/update/finish/free/progress), auto-detection of gzip compression, and incremental line parsing |
| gtars-wasm/src/lib.rs | Added bed_stream module declaration and re-exported its public API; updated comment to reflect multiple re-exports |
| gtars-wasm/Cargo.toml | Added flate2 workspace dependency for gzip decompression |
| .github/workflows/rust-publish.yml | Added gtars-genomicdist to publishable crates list and removed trailing whitespace |
```rust
pub struct BedParserProgress {
    pub region_count: usize,
    pub bytes_processed: usize,
}
```
The bed_stream module lacks test coverage. Following the pattern in refget.rs (lines 426-459), consider adding wasm-bindgen-test tests for the streaming BED parser API to cover: basic lifecycle (new/update/finish), invalid handle errors, gzip detection, malformed BED data handling, and progress reporting.
```rust
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use wasm_bindgen_test::*;

    wasm_bindgen_test_configure!(run_in_browser);

    /// Ensure that requesting progress for an invalid handle returns an error.
    #[wasm_bindgen_test]
    fn bed_parser_progress_invalid_handle() {
        let result = bed_parser_progress(999_999);
        assert!(result.is_err(), "expected error for invalid parser handle");
    }

    /// Ensure that finishing an invalid handle returns an error.
    #[wasm_bindgen_test]
    fn bed_parser_finish_invalid_handle() {
        let result = bed_parser_finish(999_999);
        assert!(result.is_err(), "expected error when finalizing invalid parser handle");
    }

    /// Ensure that freeing an invalid handle returns false.
    #[wasm_bindgen_test]
    fn bed_parser_free_invalid_handle() {
        let freed = bed_parser_free(999_999);
        assert!(!freed, "expected false when freeing invalid parser handle");
    }
}
```
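These tests can typically be run headlessly with `wasm-pack test --headless --firefox gtars-wasm` (command assumed from the standard wasm-bindgen-test workflow rather than this repo's docs; adjust the browser flag and crate path to match the project).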
```rust
#[wasm_bindgen(js_name = "bedParserNew")]
pub fn bed_parser_new() -> u32 {
    with_bed_storage(|storage| storage.insert(BedStreamParser::new()))
}

#[wasm_bindgen(js_name = "bedParserUpdate")]
pub fn bed_parser_update(handle: u32, chunk: &[u8]) -> Result<(), JsError> {
    with_bed_storage(|storage| {
        if let Some(parser) = storage.get_mut(handle) {
            parser
                .update(chunk)
                .map_err(|e| JsError::new(&format!("Failed to process chunk: {}", e)))
        } else {
            Err(JsError::new("Invalid parser handle"))
        }
    })
}

/// Finalize parsing and return BedEntries (array of [chr, start, end, rest] tuples).
/// The result can be passed directly to `new RegionSet(entries)`.
#[wasm_bindgen(js_name = "bedParserFinish")]
pub fn bed_parser_finish(handle: u32) -> Result<JsValue, JsError> {
    let parser = with_bed_storage(|storage| storage.remove(handle))
        .ok_or_else(|| JsError::new("Invalid parser handle"))?;

    let region_set = parser
        .finish()
        .map_err(|e| JsError::new(&format!("Failed to finalize parser: {}", e)))?;

    let entries: Vec<(String, u32, u32, String)> = region_set
        .regions
        .into_iter()
        .map(|r| (r.chr, r.start, r.end, r.rest.unwrap_or_default()))
        .collect();

    serde_wasm_bindgen::to_value(&BedEntries(entries))
        .map_err(|e| JsError::new(&format!("Serialization error: {}", e)))
}

#[wasm_bindgen(js_name = "bedParserFree")]
pub fn bed_parser_free(handle: u32) -> bool {
    with_bed_storage(|storage| storage.remove(handle).is_some())
}

#[wasm_bindgen(js_name = "bedParserProgress")]
pub fn bed_parser_progress(handle: u32) -> Result<JsValue, JsError> {
    let progress = with_bed_storage(|storage| {
        storage.get_mut(handle).map(|parser| BedParserProgress {
            region_count: parser.region_count(),
            bytes_processed: parser.bytes_processed(),
        })
    });

    match progress {
        Some(p) => serde_wasm_bindgen::to_value(&p)
            .map_err(|e| JsError::new(&format!("Serialization error: {}", e))),
        None => Err(JsError::new("Invalid parser handle")),
    }
}
```
Following the documentation convention in refget.rs (e.g., lines 101-124), add detailed doc comments to the public WASM functions explaining parameters, return values, and providing JavaScript usage examples. This is especially important for bedParserNew, bedParserUpdate, bedParserFinish, and bedParserProgress.
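For illustration, a hedged sketch of what such a doc comment could look like on `bedParserUpdate` (the parameter descriptions and the embedded JavaScript example are assumptions, not text copied from refget.rs):

````rust
/// Feed the next chunk of a BED file to the parser identified by `handle`.
///
/// * `handle` - parser handle returned by `bedParserNew`.
/// * `chunk` - raw file bytes, in order; chunks may split lines arbitrarily.
///
/// Returns an error if the handle is invalid or the chunk cannot be processed.
///
/// # Example (JavaScript)
///
/// ```js
/// const handle = bedParserNew();
/// for await (const chunk of file.stream()) {
///     bedParserUpdate(handle, chunk);
/// }
/// const entries = bedParserFinish(handle);
/// ```
#[wasm_bindgen(js_name = "bedParserUpdate")]
pub fn bed_parser_update(handle: u32, chunk: &[u8]) -> Result<(), JsError> { /* as above */ }
````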
```diff
@@ -37,4 +38,4 @@ jobs:
       run: |
         cargo publish -p ${{ github.event.inputs.crate }} --locked
       env:
-        CARGO_REGISTRY_TOKEN: ${{ steps.auth.outputs.token }}
\ No newline at end of file
+        CARGO_REGISTRY_TOKEN: ${{ steps.auth.outputs.token }}
```
This workflow change adds gtars-genomicdist to the list of publishable crates and removes trailing whitespace. These changes appear unrelated to the "streaming bed" functionality described in the PR. If this is an unintended inclusion, consider removing it from this PR and creating a separate PR for workflow updates. If it is intentional, consider updating the PR description to mention these changes.
```rust
let last_newline = match self.text_buf.rfind('\n') {
    Some(pos) => pos,
    None => return,
};

let complete = self.text_buf[..last_newline].to_string();
let remainder = self.text_buf[last_newline + 1..].to_string();
self.text_buf = remainder;

for line in complete.split('\n') {
    self.parse_line(line);
}
```
When parsing lines, the code copies the completed portion of the buffer into a new String and allocates another String for the remainder (line 94), even though the `split` on line 97 only yields slices. For very large files with many lines, consider avoiding these full-buffer copies by parsing lines directly from the buffer and draining the processed prefix, as in the suggestion below.
```rust
let last_newline = match self.text_buf.rfind('\n') {
    Some(pos) => pos,
    None => return,
};

let complete = self.text_buf[..last_newline].to_string();
let remainder = self.text_buf[last_newline + 1..].to_string();
self.text_buf = remainder;

for line in complete.split('\n') {
    self.parse_line(line);
}
```

Suggested change (adjusted so it borrow-checks: `parse_line` takes `&mut self`, so the buffer is moved out with `mem::take` before its contents are borrowed):

```rust
// Move the buffer out so its contents can be borrowed while calling
// parse_line(&mut self, ...).
let buf = std::mem::take(&mut self.text_buf);
let mut start = 0;
while let Some(rel_pos) = buf[start..].find('\n') {
    let line_end = start + rel_pos;
    self.parse_line(&buf[start..line_end]);
    start = line_end + 1;
}
// Keep only the trailing partial line for the next chunk.
self.text_buf = buf[start..].to_string();
```
```rust
    return;
}

let start = match parts[1].parse::<u32>() {
    Ok(v) => v,
    Err(_) => return,
};
let end = match parts[2].parse::<u32>() {
    Ok(v) => v,
    Err(_) => return,
```
The parser silently skips lines with parse errors (lines 117-120, 121-124). While this may be acceptable for malformed data, it makes debugging difficult. Consider adding a mechanism to report parsing errors or warnings, such as collecting error messages that can be retrieved via a separate API call. This would help users identify issues with their BED files.
```rust
    return;
}

let start = match parts[1].parse::<u32>() {
    Ok(v) => v,
    Err(_) => return,
};
let end = match parts[2].parse::<u32>() {
    Ok(v) => v,
    Err(_) => return,
```

Suggested change:

```rust
    eprintln!(
        "BED parse warning: expected at least 3 tab-separated fields, got {} in line: {:?}",
        parts.len(),
        trimmed
    );
    return;
}

let start = match parts[1].parse::<u32>() {
    Ok(v) => v,
    Err(e) => {
        eprintln!(
            "BED parse warning: invalid start coordinate {:?} in line {:?}: {}",
            parts[1], trimmed, e
        );
        return;
    }
};
let end = match parts[2].parse::<u32>() {
    Ok(v) => v,
    Err(e) => {
        eprintln!(
            "BED parse warning: invalid end coordinate {:?} in line {:?}: {}",
            parts[2], trimmed, e
        );
        return;
    }
};
```
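Note that `eprintln!` output is not surfaced in a browser. A minimal sketch of the alternative the comment describes — collecting warnings on the parser and exposing them through a separate call (the `warnings` field and `bedParserWarnings` function are hypothetical additions, not part of this PR):

```rust
// Hypothetical: BedStreamParser gains a `warnings: Vec<String>` field that
// parse_line() pushes messages to instead of printing them.
#[wasm_bindgen(js_name = "bedParserWarnings")]
pub fn bed_parser_warnings(handle: u32) -> Result<Vec<String>, JsError> {
    with_bed_storage(|storage| {
        storage
            .get_mut(handle)
            .map(|parser| parser.warnings.clone())
            .ok_or_else(|| JsError::new("Invalid parser handle"))
    })
}
```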
```rust
if self.is_gzipped == Some(true) {
    self.compressed_buf.extend_from_slice(chunk);
} else {
    let text = String::from_utf8_lossy(chunk);
    self.text_buf.push_str(&text);
    self.parse_complete_lines();
}
```
When gzipped data is detected, all chunks are buffered in memory and only decompressed during finish. For large files, this could lead to high memory usage. Consider implementing streaming decompression by processing the gzip data incrementally as chunks arrive, similar to how plain text is parsed line-by-line during update.
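A minimal sketch of that approach, assuming the parser grows a persistent decoder field: `flate2::write::GzDecoder` accepts compressed bytes via `Write` and appends decompressed bytes to its inner buffer (handling of multi-byte UTF-8 split across chunk boundaries is ignored here):

```rust
use std::io::Write;

use flate2::write::GzDecoder;

/// Hypothetical incremental decompressor: feed compressed chunks as they
/// arrive and drain whatever plain text is available, instead of buffering
/// the entire compressed stream until finish().
struct StreamingGzip {
    decoder: GzDecoder<Vec<u8>>,
}

impl StreamingGzip {
    fn new() -> Self {
        Self { decoder: GzDecoder::new(Vec::new()) }
    }

    fn update(&mut self, chunk: &[u8]) -> std::io::Result<String> {
        self.decoder.write_all(chunk)?;
        self.decoder.flush()?;
        // Take the decompressed bytes accumulated so far, leaving the inner
        // buffer empty for the next chunk.
        let plain = std::mem::take(self.decoder.get_mut());
        Ok(String::from_utf8_lossy(&plain).into_owned())
    }
}
```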
```rust
use std::collections::HashMap;
use std::io::{Cursor, Read};
use std::sync::Mutex;

use flate2::read::GzDecoder;
use gtars_core::models::{Region, RegionSet};
use serde::{Deserialize, Serialize};
use wasm_bindgen::prelude::*;

use crate::models::BedEntries;
```
The bed_stream module lacks module-level documentation. Following the convention in refget.rs (lines 1-25), add module-level documentation explaining the purpose of the module, the streaming API usage pattern with JavaScript examples, and any important notes about gzip handling or memory considerations.
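A hedged sketch of what such module docs could look like (the wording and the embedded JavaScript example are illustrative, not copied from refget.rs):

````rust
//! Streaming BED parser for the WASM module.
//!
//! Parses BED files chunk-by-chunk via a handle-based API, so the browser
//! never has to hold the whole file in memory as text. Gzip input is
//! auto-detected from the magic bytes of the first chunk; compressed data
//! is currently buffered and decompressed in `finish()`.
//!
//! # Usage (JavaScript)
//!
//! ```js
//! const handle = bedParserNew();
//! bedParserUpdate(handle, chunk1);
//! bedParserUpdate(handle, chunk2);
//! const entries = bedParserFinish(handle); // also frees the handle
//! ```
````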
```rust
    let start = match parts[1].parse::<u32>() {
        Ok(v) => v,
        Err(_) => return,
    };
    let end = match parts[2].parse::<u32>() {
        Ok(v) => v,
        Err(_) => return,
    };

    let rest = if parts.len() > 3 {
        Some(parts[3..].join("\t"))
    } else {
        None
    };

    self.regions.push(Region {
        chr: parts[0].to_string(),
        start,
        end,
        rest: rest.filter(|s| !s.is_empty()),
    });
}
```
The parser does not validate that start is less than end. Invalid BED entries where start >= end will be silently accepted and added to the regions list. Consider adding validation to reject or skip such entries, as they represent invalid genomic intervals.
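A guard right after the coordinate parsing would be enough; a sketch (whether to skip silently or record a warning is a design choice left open here):

```rust
// Hypothetical validation: BED intervals are half-open, so start must be < end.
if start >= end {
    return; // skip the invalid interval (or record a warning, as suggested above)
}
```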
```rust
if self.is_gzipped.is_none() {
    self.is_gzipped = Some(chunk.len() >= 2 && chunk[0] == 0x1f && chunk[1] == 0x8b);
}
```
The gzip detection is only performed on the first chunk received. If the first chunk is very small (e.g., 1 byte), the detection will fail and the parser will incorrectly treat gzipped data as plain text. Consider buffering the first few bytes until at least 2 bytes are available before making the determination, or document the minimum chunk size requirement.
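A sketch of the buffering approach (assumes a new `magic_buf: Vec<u8>` field on the parser and a `Result`-returning `update`; re-feeding the buffered bytes through the normal path afterwards is elided):

```rust
// Hypothetical fix: withhold classification until at least two bytes arrive.
if self.is_gzipped.is_none() {
    self.magic_buf.extend_from_slice(chunk);
    if self.magic_buf.len() < 2 {
        return Ok(()); // too early to tell; wait for more data
    }
    self.is_gzipped = Some(self.magic_buf[0] == 0x1f && self.magic_buf[1] == 0x8b);
    // The bytes accumulated in magic_buf must now be routed through the
    // compressed or plain-text path instead of the raw chunk.
}
```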