paper/paper.md (4 additions & 4 deletions)
@@ -18,17 +18,17 @@ bibliography: paper.bib
# Summary
-
`KrakenParser` is an open-source software tool (with a command-line interface and Python API) designed to streamline the post-analysis of metagenomic classification results produced by `Kraken2` and similar taxonomic profilers such as `Bracken` and `Metabuli`. `Kraken2` is a widely used taxonomic classifier that assigns metagenomic reads to taxa using exact k-mer matches, achieving high speed and accuracy. However, the raw output of `Kraken2` (and related tools) is a text report that can be cumbersome to interpret and aggregate across multiple samples. `KrakenParser` addresses this need by converting multiple Kraken-format reports into structured tables (CSV files) at various taxonomic ranks (from phylum down to species), performing filtering and normalization (including relative abundance calculations), and providing APIs to produce publication-ready plots. The tool automates the multi-step process of combining and cleaning Kraken results, allowing researchers to quickly obtain human-readable summaries of community composition. `KrakenParser`’s focus is on efficiency, ease-of-use, and integration: it can run an entire conversion pipeline with a single command and also be imported as a Python library for custom workflows. In summary, `KrakenParser` significantly reduces the manual effort required to post-process metagenomic classification data, enabling scientists to go from raw classifier output to analysis-ready tables and figures in one step.
+
`KrakenParser` is an open-source software tool (with a command-line interface and Python API) designed to streamline the post-analysis of metagenomic classification results produced by `Kraken2` [@wood2019kraken2] and similar taxonomic profilers such as `Bracken` [@lu2017bracken] and `Metabuli` [@kim2024metabuli]. `Kraken2` is a widely used taxonomic classifier that assigns metagenomic reads to taxa using exact k-mer matches, achieving high speed and accuracy. However, the raw output of `Kraken2` [@wood2019kraken2] (and related tools) is a text report that can be cumbersome to interpret and aggregate across multiple samples. `KrakenParser` addresses this need by converting multiple `Kraken`-format reports into structured tables (CSV files) at various taxonomic ranks (from phylum down to species), performing filtering and normalization (including relative abundance calculations), and providing APIs to produce publication-ready plots. The tool automates the multi-step process of combining and cleaning `Kraken` results, allowing researchers to quickly obtain human-readable summaries of community composition. `KrakenParser`’s focus is on efficiency, ease of use, and integration: it can run an entire conversion pipeline with a single command and can also be imported as a Python library for custom workflows. In summary, `KrakenParser` significantly reduces the manual effort required to post-process metagenomic classification data, enabling scientists to go from raw classifier output to analysis-ready tables and figures in one step.
# Statement of need
-
Analyzing the taxonomic profiles of metagenomic samples often involves running k-mer based classifiers (like `Kraken2`) that generate detailed reports of read counts and abundances across taxa. These reports, while information-rich, are not immediately convenient for comparative analysis: they list each taxon in a hierarchical format for a single sample, and researchers must manually parse and merge multiple files to compare communities across samples. Existing scripts such as the `KrakenTools` suite (developed alongside `Kraken`) provide some post-processing functionality, but they require multiple steps and technical expertise to use. Similarly, interactive tools like `Pavian` focus on visualization and exploration of `Kraken` results rather than automated batch processing. There is a clear need for a streamlined solution to transform raw `Kraken`-family outputs into tidy data matrices and summary statistics that can be readily used in downstream analysis or publication figures. `KrakenParser` fulfills this need by offering an all-in-one pipeline that reads in multiple `Kraken2`/`Bracken`/`Metabuli` reports and outputs clean CSV tables of taxonomic counts or relative abundances, optionally filtering out low-abundance taxa or non-target taxa (e.g. human reads) as specified by the user. This greatly simplifies metagenomic workflows, especially in comparative studies or clinical settings where dozens of samples must be processed consistently. By bridging the gap between raw classifier output and statistical analysis, KrakenParser empowers researchers who may not be bioinformatics experts to leverage high-throughput metagenomics with minimal data wrangling.
+
Analyzing the taxonomic profiles of metagenomic samples often involves running k-mer-based classifiers (like `Kraken2`) that generate detailed reports of read counts and abundances across taxa. These reports, while information-rich, are not immediately convenient for comparative analysis: they list each taxon in a hierarchical format for a single sample, and researchers must manually parse and merge multiple files to compare communities across samples. Existing scripts such as the `KrakenTools` suite [@lu2022kraken] (developed alongside `Kraken`) provide some post-processing functionality, but they require multiple steps and technical expertise to use. Similarly, interactive tools like `Pavian` focus on visualization and exploration of `Kraken` results rather than automated batch processing [@breitwieser2020pavian]. There is a clear need for a streamlined solution to transform raw `Kraken`-family outputs into tidy data matrices and summary statistics that can be readily used in downstream analysis or publication figures. `KrakenParser` fulfills this need by offering an all-in-one pipeline that reads in multiple `Kraken2`/`Bracken`/`Metabuli` reports and outputs clean CSV tables of taxonomic counts or relative abundances, optionally filtering out low-abundance taxa or non-target taxa (e.g., human reads) as specified by the user. This greatly simplifies metagenomic workflows, especially in comparative studies or clinical settings where dozens of samples must be processed consistently. By bridging the gap between raw classifier output and statistical analysis, `KrakenParser` empowers researchers who may not be bioinformatics experts to leverage high-throughput metagenomics with minimal data wrangling.
-
Metagenomic classification has seen rapid development, with numerous tools available for assigning sequencing reads to taxa. `Kraken` was introduced in 2014 as an ultrafast k-mer based classifier `[@wood2014kraken]`, and its successor `Kraken2` further reduced memory usage and improved speed . Other k-mer classifiers include `Bracken`, which refines `Kraken`’s counts to improve abundance estimates, `KrakenUniq` which tracks unique k-mers per taxon to reduce false positives, `Centrifuge` which uses an FM-index to allow classification with compressed databases, and `CLARK` which uses discriminative k-mers for fast classification. More recently, tools like `Kaiju` perform classification in protein space for greater sensitivity (especially on viruses), and `Metabuli` combines DNA and translated amino acid matching to improve accuracy. Comprehensive evaluations have benchmarked these methods’ accuracy and speed, and community challenges like `CAMI` have pushed development of improved classifiers. Despite the variety of classifiers, a common challenge remains: the output format. Many tools output reports similar to `Kraken`’s: tab-delimited text with hierarchical labels and counts. To interpret such outputs, researchers often rely on additional scripts or manual processing. `KrakenTools` provides scripts to combine `Kraken` reports, convert to other formats (e.g., `Krona` for visualization, or `BIOM` for ecological analysis), and filter results. `Pavian` and other interactive platforms allow users to visualize results with `Sankey` diagrams and heatmaps, but require use of a web interface or `R` environment. There are also lightweight utilities (e.g., `spideog` and `scrubby`) to convert Kraken reports to CSV or clean them, and researchers adept in programming sometimes write custom parsing scripts. 
In summary, prior to `KrakenParser`, users had to piece together multiple tools to achieve tasks like merging reports from multiple samples, summing reads at specific taxonomic ranks, and computing relative abundances. `KrakenParser` builds on this state of the field by consolidating the post-processing steps into one tool. It serves as an ideological successor to `KrakenTools`, using some of the same internal conversion steps (like `KrakenTools`’ report-to-MPA conversion) but adding improvements in automation, filtering, and output formatting. By producing standardized CSV tables (with samples as rows and taxa as columns) and by computing percentages automatically, `KrakenParser` greatly accelerates the transition from raw classification data to biological insights. This is particularly valuable given the increasing scale of metagenomic studies (where dozens or hundreds of samples are profiled) and the need for reproducible, efficient analysis pipelines.
+
Metagenomic classification has seen rapid development, with numerous tools available for assigning sequencing reads to taxa. `Kraken` was introduced in 2014 as an ultrafast k-mer-based classifier [@wood2014kraken], and its successor `Kraken2` [@wood2019kraken2] further reduced memory usage and improved speed. Other k-mer classifiers include `Bracken` [@lu2017bracken], which refines `Kraken`’s counts to improve abundance estimates; `KrakenUniq`, which tracks unique k-mers per taxon to reduce false positives [@breitwieser2018krakenuniq]; `Centrifuge`, which uses an FM-index to allow classification with compressed databases [@kim2016centrifuge]; and `CLARK`, which uses discriminative k-mers for fast classification [@ounit2015clark]. More recently, tools like `Kaiju` perform classification in protein space for greater sensitivity (especially on viruses) [@menzel2016kaiju], and `Metabuli` combines DNA and translated amino acid matching to improve accuracy [@kim2024metabuli]. Comprehensive evaluations have benchmarked these methods’ accuracy and speed, and community challenges like `CAMI` have pushed development of improved classifiers [@sczyrba2017cami]. Despite the variety of classifiers, a common challenge remains: the output format. Many tools output reports similar to `Kraken`’s: tab-delimited text with hierarchical labels and counts. To interpret such outputs, researchers often rely on additional scripts or manual processing. `KrakenTools` [@lu2022kraken] provides scripts to combine `Kraken` reports, convert to other formats (e.g., `Krona` for visualization, or `BIOM` for ecological analysis), and filter results. `Pavian` and other interactive platforms allow users to visualize results with `Sankey` diagrams and heatmaps [@breitwieser2020pavian], but require use of a web interface or `R` environment. There are also lightweight utilities (e.g., `spideog` and `scrubby`) to convert `Kraken` reports to CSV or clean them, and researchers adept in programming sometimes write custom parsing scripts. In summary, prior to `KrakenParser`, users had to piece together multiple tools to achieve tasks like merging reports from multiple samples, summing reads at specific taxonomic ranks, and computing relative abundances. `KrakenParser` builds on this state of the field by consolidating the post-processing steps into one tool. It serves as an ideological successor to `KrakenTools` [@lu2022kraken], using some of the same internal conversion steps (like `KrakenTools`’ report-to-MPA conversion) but adding improvements in automation, filtering, and output formatting. By producing standardized CSV tables (with samples as rows and taxa as columns) and by computing percentages automatically, `KrakenParser` greatly accelerates the transition from raw classification data to biological insights. This is particularly valuable given the increasing scale of metagenomic studies (where dozens or hundreds of samples are profiled) and the need for reproducible, efficient analysis pipelines.
# Implementation
-
`KrakenParser` is implemented in `Python` (available via `PyPI` as `krakenparser`) with several auxiliary scripts. It leverages the original `KrakenTools` scripts for initial data reshaping and then applies its own pure-`Python` processing for downstream formatting. The software follows a pipeline of six main steps, which can be executed automatically in sequence (`--complete` mode) or run individually as needed:
+
`KrakenParser` is implemented in `Python` (available via `PyPI` as `krakenparser`) with several auxiliary scripts. It leverages the original `KrakenTools` [@lu2022kraken] scripts for initial data reshaping and then applies its own pure-`Python` processing for downstream formatting. The software follows a pipeline of six main steps, which can be executed automatically in sequence (`--complete` mode) or run individually as needed:
1. Convert reports to MPA format: Each `Kraken2`/`Bracken`/`Metabuli` report (a text file with one line per taxon) is converted to an “MPA” table format using `KrakenTools`’ `kreport2mpa.py` script. In MPA format, each row holds one taxon's full lineage, written as rank-prefixed names joined by `|`, followed by its read count, allowing easy combination of multiple samples.
2. Combine MPA files: All per-sample MPA files are merged into a single master table (taxa as rows, one count column per sample) using `KrakenTools`’ `combine_mpa.py`. This yields a matrix of raw read counts, with zeros filling the entries where a taxon is absent from a sample.
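
The combining step can be illustrated with a minimal sketch. It is not the `KrakenTools` implementation: the helper names `parse_mpa` and `combine_mpa_tables`, the `#Classification` header label, the `d__`/`p__` rank prefixes, and the toy counts are all assumptions made for illustration. It only shows the idea of taking the union of lineages across samples and filling missing entries with zero.

```python
def parse_mpa(text):
    """Parse MPA-style text (one 'lineage<TAB>count' pair per line) into a dict."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        lineage, count = line.rsplit("\t", 1)
        counts[lineage] = int(count)
    return counts


def combine_mpa_tables(samples):
    """Merge {sample_name: {lineage: count}} into a header and a list of rows.

    Each row is [lineage, count_in_sample_1, count_in_sample_2, ...];
    a taxon that is absent from a sample gets a count of 0.
    """
    names = list(samples)
    lineages = sorted({lin for counts in samples.values() for lin in counts})
    header = ["#Classification"] + names  # header label is an assumption
    rows = [[lin] + [samples[name].get(lin, 0) for name in names] for lin in lineages]
    return header, rows


# Hypothetical toy inputs, not real classifier output.
sample_a = "d__Bacteria\t100\nd__Bacteria|p__Firmicutes\t60\n"
sample_b = "d__Bacteria\t80\nd__Bacteria|p__Proteobacteria\t50\n"

header, rows = combine_mpa_tables({"A": parse_mpa(sample_a), "B": parse_mpa(sample_b)})
for row in [header] + rows:
    print("\t".join(str(value) for value in row))
```

Keeping the full lineage string as the row key means that taxa with the same name at different positions in the hierarchy stay distinct when samples are merged.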