Skip to content

Commit 4834520

Browse files
committed
Switch tools to use SQLite database, add tests for common_ancestor_distance
1 parent d549f34 commit 4834520

8 files changed

+172
-118
lines changed

.gitignore

-1
Original file line numberDiff line numberDiff line change
@@ -2,4 +2,3 @@
22
/target/
33
**/*.rs.bk
44
Cargo.lock
5-
data/ncbi_taxonomy.sqlite

Cargo.toml

+1-1
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
[package]
22
name = "ncbitaxonomy"
33
description = "Read NCBI Taxonomy Database from files and work with NCBI Taxonomy DB"
4-
version = "0.3.0"
4+
version = "1.0.0"
55
authors = ["Peter van Heusden <[email protected]>"]
66
license = "MIT"
77

README.md

+45-17
Original file line numberDiff line numberDiff line change
@@ -5,9 +5,10 @@
55
This is a Rust crate (i.e. library) for working with a local copy of the
66
[NCBI Taxonomy database](https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/).
77
The database can be downloaded (either `taxdump.zip` or `taxdump.tar.gz`) from the
8-
[NCBI Taxonomy FTP site](https://ftp.ncbi.nih.gov/pub/taxonomy/).
8+
[NCBI Taxonomy FTP site](https://ftp.ncbi.nih.gov/pub/taxonomy/) and reformatted into a SQLite database
9+
using the `taxonomy_util` utility's `to_sqlite` subcommand.
910

10-
Documentation for version 0.1.3 is available at [crates.io](https://docs.rs/ncbitaxonomy/0.1.3/ncbitaxonomy/struct.NcbiTaxonomy.html).
11+
Documentation is available at [crates.io](https://crates.io/crates/ncbitaxonomy).
1112

1213
### taxonomy_filter_refseq
1314

@@ -18,42 +19,42 @@ are retained.
1819

1920
```bash
2021
$ taxonomy_filter_refseq --help
21-
taxonomy_filter_refseq 0.1.2
22+
taxonomy_filter_refseq 1.0.0
2223
Peter van Heusden <[email protected]>
2324
Filter NCBI RefSeq FASTA files by taxonomic lineage
2425

2526
USAGE:
26-
taxonomy_filter_refseq [OPTIONS] <INPUT_FASTA> <TAXONOMY_DIR> <ANCESTOR_NAME> [OUTPUT_FASTA]
27+
taxonomy_filter_refseq [FLAGS] [OPTIONS] <INPUT_FASTA> <ANCESTOR_NAME> [OUTPUT_FASTA]
2728

2829
FLAGS:
2930
--no_curated Don't accept curated RNAs and proteins (NM_, NR_ and NP_ accessions)
3031
--no_predicted Don't accept computationally predicted RNAs and proteins (XM_, XR_ and XP_ accessions)
31-
-h, --help Prints help information
32-
-V, --version Prints version information
32+
-h, --help Prints help information
33+
-V, --version Prints version information
3334

3435
OPTIONS:
35-
-t, --tax_prefix <TAXONOMY_FILENAME_PREFIX> String to prepend to names of nodes.dmp and names.dmp
36+
-d, --db <TAXDB_URL> URL for SQLite taxonomy database
3637

3738
ARGS:
3839
<INPUT_FASTA> FASTA file with RefSeq sequences
39-
<TAXONOMY_DIR> Directory containing the NCBI taxonomy nodes.dmp and names.dmp files
4040
<ANCESTOR_NAME> Name of ancestor to use as ancestor filter
4141
<OUTPUT_FASTA> Output FASTA filename (or stdout if omitted)
42-
4342
```
4443
4544
### taxonomy_filter_fastq
4645
4746
(new in version 0.2.0)
4847
48+
4949
```bash
5050
$ taxonomy_filter_fastq --help
51-
taxonomy_filter_refseq 0.1.2
51+
taxonomy_filter_fastq 1.0.0
5252
Peter van Heusden <[email protected]>
53-
Filter NCBI RefSeq FASTA files by taxonomic lineage
53+
Filter FASTQ files whose reads have been classified by Centrifuge or Kraken2, only retaining reads in taxa descending
54+
from given ancestor
5455

5556
USAGE:
56-
taxonomy_filter_fastq [FLAGS] [OPTIONS] <INPUT_FASTQ> --ancestor_taxid <ANCESTOR_ID> --taxdir <TAXONOMY_DIR> --tax_report_filename <TAXONOMY_REPORT_FILENAME> <--centrifuge|--kraken2>
57+
taxonomy_filter_fastq [FLAGS] [OPTIONS] <INPUT_FASTQ>... --ancestor_taxid <ANCESTOR_ID> --tax_report_filename <TAXONOMY_REPORT_FILENAME> <--centrifuge|--kraken2>
5758

5859
FLAGS:
5960
-d, --output_dir Directory to deposited filtered output files in
@@ -64,12 +65,39 @@ FLAGS:
6465

6566
OPTIONS:
6667
-A, --ancestor_taxid <ANCESTOR_ID> Name of ancestor to use as ancestor filter
67-
-T, --taxdir <TAXONOMY_DIR>
68-
Directory containing the NCBI taxonomy nodes.dmp and names.dmp files
69-
70-
-t, --tax_prefix <TAXONOMY_FILENAME_PREFIX> String to prepend to names of nodes.dmp and names.dmp
68+
-d, --db <TAXDB_URL> URL for SQLite taxonomy database
7169
-F, --tax_report_filename <TAXONOMY_REPORT_FILENAME> Output from Kraken2 (default) or Centrifuge
7270

7371
ARGS:
74-
<INPUT_FASTQ> FASTA file with RefSeq sequences
72+
<INPUT_FASTQ>... FASTA file with RefSeq sequences
7573
```
74+
75+
### taxonomy_util
76+
77+
(new in 1.0.0)
78+
79+
Utilities to convert NCBI taxonomy database files into SQLite database (the input format used in other tools).
80+
81+
```bash
82+
taxonomy_util 1.0.0
83+
Peter van Heusden <[email protected]>
84+
Utilities for working with the NCBI taxonomy database
85+
86+
USAGE:
87+
taxonomy_util [OPTIONS] [SUBCOMMAND]
88+
89+
FLAGS:
90+
-h, --help Prints help information
91+
-V, --version Prints version information
92+
93+
OPTIONS:
94+
-d, --db <TAXDB_URL> URL for SQLite taxonomy database
95+
96+
SUBCOMMANDS:
97+
common_ancestor_distance find the tree distance to te common ancestor between two taxa
98+
get_id find taxonomy ID for name
99+
get_lineage get lineage for name [unimplemented]
100+
get_name find name for taxonomy ID
101+
help Prints this message or the help of the given subcommand(s)
102+
to_sqlite save taxonomy database loaded from files to SQLite database file
103+
```

data/ncbi_taxonomy.sqlite

100 KB
Binary file not shown.

src/bin/taxonomy_filter_fastq.rs

+8-31
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ use flate2::Compression;
1919
use flate2::read::GzDecoder;
2020
use flate2::write::GzEncoder;
2121
use seq_io::fastq::Record;
22-
use ncbitaxonomy::{NcbiTaxonomy, NcbiFileTaxonomy};
22+
use ncbitaxonomy::{NcbiTaxonomy, NcbiSqliteTaxonomy};
2323

2424
enum FilterTool {
2525
Centrifuge,
@@ -35,24 +35,6 @@ impl fmt::Display for FilterTool {
3535
}
3636
}
3737

38-
fn read_taxonomy(tax_prefix: &str, ncbi_taxonomy_path: &Path) -> NcbiFileTaxonomy {
39-
let nodes_path = ncbi_taxonomy_path.join(tax_prefix.to_owned() + "nodes.dmp");
40-
if !nodes_path.exists() {
41-
eprintln!("NCBI Taxonomy {}nodes.dmp file not found in {}", tax_prefix, ncbi_taxonomy_path.to_str().unwrap());
42-
process::exit(1);
43-
}
44-
45-
let names_path = ncbi_taxonomy_path.join(tax_prefix.to_owned() + "names.dmp");
46-
if !names_path.exists() {
47-
eprintln!("NCBI Taxonomy {}names.dmp file not found in {}", tax_prefix, ncbi_taxonomy_path.to_str().unwrap());
48-
process::exit(1);
49-
}
50-
51-
ncbitaxonomy::NcbiFileTaxonomy::from_ncbi_files(
52-
nodes_path.as_path().to_str().unwrap(),
53-
names_path.as_path().to_str().unwrap()).expect("Failed to load NCBI Taxonomy")
54-
}
55-
5638
fn filter_fastq(fastq_filename: &Path, tax_report_filename: &str,
5739
taxonomy: &dyn NcbiTaxonomy,
5840
output_dir: &Path, filter_tool: &FilterTool, ancestor_id: i32) {
@@ -162,12 +144,12 @@ fn filter_fastq(fastq_filename: &Path, tax_report_filename: &str,
162144
}
163145

164146
pub fn main() {
165-
let matches = clap_app!(taxonomy_filter_refseq =>
147+
// TODO: write test suite
148+
let matches = clap_app!(taxonomy_filter_fastq =>
166149
(version: ncbitaxonomy::VERSION)
167150
(author: "Peter van Heusden <[email protected]>")
168-
(about: "Filter NCBI RefSeq FASTA files by taxonomic lineage")
169-
(@arg TAXONOMY_FILENAME_PREFIX: -t --tax_prefix +takes_value "String to prepend to names of nodes.dmp and names.dmp")
170-
(@arg TAXONOMY_DIR: -T --taxdir +takes_value +required "Directory containing the NCBI taxonomy nodes.dmp and names.dmp files")
151+
(about: "Filter FASTQ files whose reads have been classified by Centrifuge or Kraken2, only retaining reads in taxa descending from given ancestor")
152+
(@arg TAXDB_URL: -d --db +takes_value "URL for SQLite taxonomy database")
171153
(@arg ANCESTOR_ID: -A --ancestor_taxid +takes_value +required "Name of ancestor to use as ancestor filter")
172154
(@group filter_tool +required =>
173155
(@arg centrifuge: -C --centrifuge !required "Filter using report from Centrifuge")
@@ -178,13 +160,6 @@ pub fn main() {
178160
(@arg INPUT_FASTQ: ... +required "FASTA file with RefSeq sequences")
179161
).get_matches();
180162

181-
let tax_prefix = match matches.value_of("TAXONOMY_FILENAME_PREFIX") {
182-
Some(name) => name,
183-
None => ""
184-
}.to_string();
185-
186-
let ncbi_taxonomy_path = Path::new(matches.value_of("TAXONOMY_DIR").unwrap());
187-
188163
let output_dir = match matches.value_of("OUTPUT_DIR") {
189164
Some(path) => Path::new(path),
190165
None => Path::new(".")
@@ -202,7 +177,9 @@ pub fn main() {
202177

203178
let tax_report_filename = matches.value_of("TAXONOMY_REPORT_FILENAME").unwrap();
204179

205-
let taxonomy = read_taxonomy(&tax_prefix, ncbi_taxonomy_path);
180+
let taxdb_url = if matches.is_present("TAXDB_URL") { Some(matches.value_of("TAXDB_URL").unwrap()) } else { None };
181+
182+
let taxonomy = NcbiSqliteTaxonomy::new(taxdb_url);
206183
if !taxonomy.contains_id(ancestor_id) {
207184
eprintln!("Taxonomy does not contain an ancestor with taxid {}", ancestor_id);
208185
process::exit(1);

src/bin/taxonomy_filter_refseq.rs

+5-26
Original file line numberDiff line numberDiff line change
@@ -6,14 +6,13 @@ extern crate ncbitaxonomy;
66
use std::cmp;
77
use std::fs::File;
88
use std::io;
9-
use std::path::Path;
109
use std::process;
1110
use std::vec::Vec;
1211

1312
use bio::io::fasta;
1413
use bio::utils::TextSlice;
1514

16-
use ncbitaxonomy::{NcbiTaxonomy, NcbiFileTaxonomy};
15+
use ncbitaxonomy::{NcbiTaxonomy, NcbiSqliteTaxonomy};
1716

1817
// wrap a TextSlice (a rust-bio name for a &[u8] i.e. byte array)
1918
// at a certain width (e.g. 80 to look like NCBI RefSeq)
@@ -35,15 +34,15 @@ fn wrap(seq: TextSlice, width: usize) -> Vec<u8> {
3534
}
3635

3736
pub fn main() {
37+
// TODO: use functions, write testing suite
3838
let matches = clap_app!(taxonomy_filter_refseq =>
3939
(version: ncbitaxonomy::VERSION)
4040
(author: "Peter van Heusden <[email protected]>")
4141
(about: "Filter NCBI RefSeq FASTA files by taxonomic lineage")
42+
(@arg TAXDB_URL: -d --db +takes_value "URL for SQLite taxonomy database")
4243
(@arg NO_PREDICTED: --no_predicted "Don't accept computationally predicted RNAs and proteins (XM_, XR_ and XP_ accessions)")
4344
(@arg NO_CURATED: --no_curated "Don't accept curated RNAs and proteins (NM_, NR_ and NP_ accessions)")
44-
(@arg TAXONOMY_FILENAME_PREFIX: -t --tax_prefix +takes_value "String to prepend to names of nodes.dmp and names.dmp")
4545
(@arg INPUT_FASTA: +required "FASTA file with RefSeq sequences")
46-
(@arg TAXONOMY_DIR: +required "Directory containing the NCBI taxonomy nodes.dmp and names.dmp files")
4746
(@arg ANCESTOR_NAME: +required "Name of ancestor to use as ancestor filter")
4847
(@arg OUTPUT_FASTA: "Output FASTA filename (or stdout if omitted)")
4948
).get_matches();
@@ -62,24 +61,8 @@ pub fn main() {
6261
let input_fasta = File::open(input_fasta_filename).unwrap_or_else(|_| panic!("Failed to open input FASTA file ({})", input_fasta_filename));
6362
let input_fasta_reader = fasta::Reader::new(input_fasta);
6463

65-
let ncbi_taxonomy_path = Path::new(matches.value_of("TAXONOMY_DIR").unwrap());
66-
67-
let tax_prefix = match matches.value_of("TAXONOMY_FILENAME_PREFIX") {
68-
Some(name) => name,
69-
None => ""
70-
}.to_string();
71-
72-
let nodes_path = ncbi_taxonomy_path.join(tax_prefix.clone() + "nodes.dmp");
73-
if ! nodes_path.exists() {
74-
eprintln!("NCBI Taxonomy {}nodes.dmp file not found in {}", tax_prefix, ncbi_taxonomy_path.to_str().unwrap());
75-
process::exit(1);
76-
}
77-
78-
let names_path = ncbi_taxonomy_path.join(tax_prefix.clone() + "names.dmp");
79-
if ! names_path.exists() {
80-
eprintln!("NCBI Taxonomy {}names.dmp file not found in {}", tax_prefix, ncbi_taxonomy_path.to_str().unwrap());
81-
process::exit(1);
82-
}
64+
let taxdb_url = if matches.is_present("TAXDB_URL") { Some(matches.value_of("TAXDB_URL").unwrap()) } else { None };
65+
let taxonomy = NcbiSqliteTaxonomy::new(taxdb_url);
8366

8467
// the use of Box here is inspired by:
8568
// https://stackoverflow.com/questions/26378842/how-do-i-overcome-match-arms-with-incompatible-types-for-structs-implementing-sa
@@ -93,10 +76,6 @@ pub fn main() {
9376

9477
let ancestor_name = matches.value_of("ANCESTOR_NAME").unwrap();
9578

96-
let taxonomy = NcbiFileTaxonomy::from_ncbi_files(
97-
nodes_path.as_path().to_str().unwrap(),
98-
names_path.as_path().to_str().unwrap()).expect("Failed to load NCBI Taxonomy");
99-
10079
if !taxonomy.contains_name(ancestor_name) {
10180
eprintln!("Taxonomy does not contain an ancestor named {}", ancestor_name);
10281
process::exit(1);

src/bin/taxonomy_util.rs

+22-10
Original file line numberDiff line numberDiff line change
@@ -8,8 +8,8 @@ use ncbitaxonomy::{NcbiTaxonomy, NcbiSqliteTaxonomy};
88

99
fn common_ancestor_distance(taxonomy: &dyn NcbiTaxonomy, name1: &str, name2: &str, only_canonical: bool) {
1010
match taxonomy.get_distance_to_common_ancestor(name1, name2, only_canonical) {
11-
Some(distance) => {
12-
println!("{}", distance);
11+
Some((distance, common_ancestor_name)) => {
12+
println!("{}\t{}", distance, common_ancestor_name);
1313
},
1414
None => {
1515
eprintln!("no common ancestor found");
@@ -19,7 +19,9 @@ fn common_ancestor_distance(taxonomy: &dyn NcbiTaxonomy, name1: &str, name2: &st
1919
}
2020

2121
pub fn main() {
22-
let app_m = clap_app!(taxonomy_filter_refseq =>
22+
// TODO:
23+
// * write get_lineage - print lineage of taxon
24+
let app_m = clap_app!(taxonomy_util =>
2325
(version: ncbitaxonomy::VERSION)
2426
(author: "Peter van Heusden <[email protected]>")
2527
(about: "Utilities for working with the NCBI taxonomy database")
@@ -30,16 +32,21 @@ pub fn main() {
3032
(@arg NAME1: +required "Name of first taxon")
3133
(@arg NAME2: +required "Name of second taxon")
3234
)
33-
(@subcommand find_id =>
35+
(@subcommand get_id =>
3436
(about: "find taxonomy ID for name")
3537
(@arg NAME: +required "Name of taxon")
3638
)
37-
(@subcommand find_name =>
39+
(@subcommand get_name =>
3840
(about: "find name for taxonomy ID")
3941
(@arg ID: +required "Taxonomy ID to look up")
4042
)
41-
(@subcommand sqlite =>
42-
(about: "sqlite testing")
43+
(@subcommand get_lineage =>
44+
(about: "get lineage for name [unimplemented]")
45+
(@arg DELIMITER: --delimiter -D +takes_value "Delimiter for lineage string")
46+
(@arg NAME: +required "Name of taxon")
47+
)
48+
(@subcommand to_sqlite =>
49+
(about: "save taxonomy database loaded from files to SQLite database file")
4350
(@arg TAXONOMY_FILENAME_PREFIX: -t --tax_prefix +takes_value "String to prepend to names of nodes.dmp and names.dmp")
4451
(@arg TAXONOMY_DIR: "Directory containing the NCBI taxonomy nodes.dmp and names.dmp files")
4552
)
@@ -56,20 +63,25 @@ pub fn main() {
5663
let name2 = sub_m.value_of("NAME2").unwrap();
5764
common_ancestor_distance(&taxonomy, name1, name2, only_canonical);
5865
},
59-
("find_id", Some(sub_m)) => {
66+
("get_id", Some(sub_m)) => {
6067
let name = sub_m.value_of("NAME").unwrap();
6168
match taxonomy.get_id_by_name(name) {
6269
Some(val) => println!("{}", val),
6370
None => eprintln!("name {} not found in taxomomy", name)
6471
}
6572
},
66-
("find_name", Some(sub_m)) => {
73+
("get_name", Some(sub_m)) => {
6774
let taxid = (sub_m.value_of("ID").unwrap()).parse::<i32>().unwrap();
6875
match taxonomy.get_name_by_id(taxid) {
6976
Some(val) => println!("{}", val),
7077
None => eprintln!("id {} not found in taxonomy", taxid)
7178
}
7279
},
80+
("get_lineage", Some(sub_m)) => {
81+
let _ = sub_m.value_of("DELIMITER").unwrap_or(";");
82+
let _ = sub_m.value_of("NAME").unwrap();
83+
// TODO: implement here (depends on expanding NcbiTaxonomy method signature
84+
}
7385
("to_sqlite", Some(sub_m)) => {
7486
let ncbi_taxonomy_path = Path::new(sub_m.value_of("TAXONOMY_DIR").unwrap());
7587

@@ -96,7 +108,7 @@ pub fn main() {
96108
names_path.as_path().to_str().unwrap()).expect("Failed to load NCBI Taxonomy");
97109
eprintln!("taxonomy loaded");
98110

99-
taxonomy.save_to_sqlite().expect("failed to save taxonomy database to SQLite");
111+
taxonomy.save_to_sqlite(taxdb_url).expect("failed to save taxonomy database to SQLite");
100112
},
101113
_ => {
102114
eprintln!("Unknown subcommand");

0 commit comments

Comments
 (0)