Skip to content

Commit 18e1c17

Browse files
authored
Msa refactor (#17)
* Initial PIP branch length optimisation Initial impl of branch length optimisation under PIP * Refactoring: tree stores all nodes in one list Refactoring the tree structure to simplify access/creation: now all nodes are stored in a single vector instead of leaves and internals separately. Additional modifications to make all other structures work with the new version. * Removing get_ prefixes * PhyloInfo refactor to include builder Big refactoring for PhyloInfo to create the object using a builder rather than random functions. Refactor of the alphabet module (renamed from "sequences") to have own alphabet not relying on bio::alphabets. * Refactoring alphabets Refactoring alphabets: made checking whether something is a word faster using a HashSet; removed all small letters, only using uppercase now; rename nucleotide/aa sets to improve readability and remove mixing DNA/nucleotide and Protein/aminoacid. * Alignment structure overhaul Alignment structure overhaul to include both pairwise and a complete alignment and to work with a local and global representation. Added tests for all the representations. * Creating proper alignments from aligned sequences Creating a proper aligment from supplied aligned sequences, reconstructing the local alignments per node, saving leaf maps for later, removing gaps from saved leaf sequences. * Tree refac: adding accessors, fields more private Minor refactoring of the tree class: made a few fields more private (nodes. prostorder, preorder) to discourage direct access (for myself), added accessors for nodes and node ids, and len() and is_empty() methods. * Fix duplicate code for Node/NodeIdx Display Removed duplicate code from the Display impl in Node and NodeIdx. Added explicit Debug to NodeIdx for easier debugging. * Modify usize::from for NodeIdx to reduce deref Modified usize::from for both &NodeIdx and NodeIdx to reduce dereferencing. * Refactor alignment to use builder, class for seqs Refactored alignment creation into a builder struct, the internal representation can be built from either a given MSA or from a map of internal nodes. Now the alignment class also stores the ungapped sequences and, for a current alignment, the maps of leaf sequences to gaps/characters. Sequences have a separate struct for storage that has several methods to simplify access. * Changes throughout to use Sequences and Alignment Massive refactoring to use Sequences and Alignment structs throughout. PhyloInfo can only be generated at the moment with aligned sequences, but leaf encodings are stored for sequences without gaps and are used to compute conditional probabilities at the tips with leaf maps from the alignment. * Add test for gaps treated as X for non-gap models Added a test case checking whether gaps are indeed treated as ambiguous characters when a non-gap model is given. * Added missing test data file * Refactor AlignBuilder to separate file Refactored AlignmentBuilder to a separate file to clean up code. Rename alignment_tests.rs to tests.rs to remove repetition in names. * Sequences to uppercase when reading from file Sequences are now transformed into uppercase as they are read from file, not afterwards + doctest for this. Signature for write_sequences_to_file now is &Pathbuf, not Pathbuf to avoid copying. * PhyloInfoBuilder to new module + docs Moved PhyloInfoBuilder to a separate file and added docs + doctests. * Docs for PhyloInfo and move leaf_encoding creation Added docs for PhyloInfo struct. Moved leaf_encoding creation to a more logical place within PhyloInfo rather than in the builder. * Alphabet is defined in Sequences and only there The sequence struct now defines the Alphabet for the whole setup and that is the only place where the Alphabet exists. The Alphabet is used to compute leaf sequence encodings to simplify likelihood computation. * move alphabets_tests -> tests * Remove unnecessary method * Remove unnecessary GapHandling enum Removed the GapHandling enum for good, the evolutionary models themselves now handle getting the correct encodings for gaps and characters. * Removed stale files * Remove vscode settings file * Remove vscode files from git
1 parent 225914d commit 18e1c17

36 files changed

+1631
-1578
lines changed

.gitignore

+3
Original file line numberDiff line numberDiff line change
@@ -18,3 +18,6 @@ Cargo.lock
1818

1919
# Mac files
2020
.DS_Store
21+
22+
# VSCode files
23+
.vscode/

.vscode/settings.json

-3
This file was deleted.

phylo/data/sequences_DNA_small.fasta

+4-4
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,8 @@
11
>A
2-
aAAAAAA
2+
aAAAAAAa
33
>B
4-
ccccCCC
4+
ccccCCCc
55
>C
6-
gGggggg
6+
gGgggggG
77
>D
8-
ttttttt
8+
tttttttT
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
(((A:1.0,B:2.0):3.0,(C:3.5,D:0.5):5.25):1.0,(E:4.0,F:5.0):0.5);
+121
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
use std::collections::HashMap;
2+
3+
use anyhow::bail;
4+
5+
use crate::align;
6+
use crate::alignment::{Alignment, InternalMapping, Mapping, PairwiseAlignment, Sequences};
7+
use crate::tree::{NodeIdx, NodeIdx::Internal as Int, NodeIdx::Leaf, Tree};
8+
use crate::Result;
9+
10+
pub struct AlignmentBuilder<'a> {
11+
tree: &'a Tree,
12+
seqs: Sequences,
13+
node_map: InternalMapping,
14+
}
15+
16+
impl<'a> AlignmentBuilder<'a> {
17+
pub fn new(tree: &'a Tree, seqs: Sequences) -> AlignmentBuilder<'a> {
18+
AlignmentBuilder {
19+
tree,
20+
seqs,
21+
node_map: InternalMapping::new(),
22+
}
23+
}
24+
25+
pub fn msa(mut self, msa: InternalMapping) -> Self {
26+
self.node_map = msa;
27+
self
28+
}
29+
30+
pub fn build(self) -> Result<Alignment> {
31+
if self.node_map.is_empty() {
32+
if self.seqs.aligned {
33+
self.build_from_seqs()
34+
} else {
35+
self.build_from_unaligned()
36+
}
37+
} else {
38+
self.build_from_map()
39+
}
40+
}
41+
42+
/// This assumes that the tree structure matches the alignment structure and that the sequences are aligned.
43+
fn build_from_seqs(self) -> Result<Alignment> {
44+
let msa_len = self.seqs.msa_len();
45+
let mut stack = HashMap::<NodeIdx, Mapping>::with_capacity(self.tree.len());
46+
let mut msa = InternalMapping::with_capacity(self.tree.n);
47+
for node in self.tree.postorder.iter() {
48+
match node {
49+
Int(_) => {
50+
let childs = self.tree.children(node);
51+
let map_x = stack[&childs[0]].clone();
52+
let map_y = stack[&childs[1]].clone();
53+
stack.insert(*node, Self::stack_maps(msa_len, &map_x, &map_y));
54+
msa.insert(*node, Self::clear_common_gaps(msa_len, &map_x, &map_y));
55+
}
56+
Leaf(_) => {
57+
let seq = self.seqs.get_by_id(self.tree.node_id(node)).seq();
58+
stack.insert(*node, align!(seq).clone());
59+
}
60+
}
61+
}
62+
let leaf_maps = stack
63+
.iter()
64+
.filter_map(|(idx, map)| match idx {
65+
Leaf(_) => Some((*idx, map.clone())),
66+
_ => None,
67+
})
68+
.collect();
69+
Ok(Alignment {
70+
seqs: self.seqs.without_gaps(),
71+
leaf_map: leaf_maps,
72+
node_map: msa,
73+
})
74+
}
75+
76+
fn build_from_unaligned(self) -> Result<Alignment> {
77+
// TODO: use parsimony to align the sequences.
78+
bail!("Unaligned sequences are not yet supported.")
79+
}
80+
81+
/// This assumes that the tree structure matches the alignment structure.
82+
fn build_from_map(self) -> Result<Alignment> {
83+
let mut alignment = Alignment {
84+
seqs: Sequences::new(Vec::new()),
85+
leaf_map: HashMap::new(),
86+
node_map: self.node_map,
87+
};
88+
let leaf_map = alignment.compile_leaf_map(self.tree.root, self.tree)?;
89+
alignment.leaf_map = leaf_map;
90+
alignment.seqs = self.seqs.without_gaps();
91+
Ok(alignment)
92+
}
93+
94+
fn stack_maps(msa_len: usize, map_x: &Mapping, map_y: &Mapping) -> Mapping {
95+
let mut map = Vec::with_capacity(msa_len);
96+
let mut ind: usize = 0;
97+
for (x, y) in map_x.iter().zip(map_y.iter()) {
98+
if x.is_none() && y.is_none() {
99+
map.push(None);
100+
} else {
101+
map.push(Some(ind));
102+
ind += 1;
103+
}
104+
}
105+
map
106+
}
107+
108+
fn clear_common_gaps(msa_len: usize, map_x: &Mapping, map_y: &Mapping) -> PairwiseAlignment {
109+
let mut upd_map_x = Vec::with_capacity(msa_len);
110+
let mut upd_map_y = Vec::with_capacity(msa_len);
111+
for (x, y) in map_x.iter().zip(map_y.iter()) {
112+
if x.is_none() && y.is_none() {
113+
continue;
114+
} else {
115+
upd_map_x.push(*x);
116+
upd_map_y.push(*y);
117+
}
118+
}
119+
PairwiseAlignment::new(upd_map_x, upd_map_y)
120+
}
121+
}

phylo/src/alignment/alignment_tests.rs

-100
This file was deleted.

0 commit comments

Comments
 (0)