+ Overview
+ ========
+
+ FLEA is a bioinformatics pipeline for analyzing longitudinal
+ sequencing data from the Pacific Biosciences RS-II or Sequel. It
+ currently supports full-length HIV *env* sequences.
+
+ The pipeline takes a set of FASTQ files, one per time point,
+ containing circular consensus sequence (CCS) reads, which can be
+ obtained using the "Reads of Insert" protocol on PacBio's SMRTportal
+ or SMRTanalysis tools. It produces a JSON file containing the
+ following results:
+
+ - a multiple sequence alignment of high-quality consensus sequences
+   for each time point
+
+ - a maximum-likelihood phylogenetic tree, inferred using
+   [FastTree](http://www.microbesonline.org/fasttree/)
+
+ - the most recent common ancestor (MRCA) and other inferred ancestor
+   sequences
+
+ - a two-dimensional embedding that respects TN93 sequence distances
+
+ - per-site selection pressure, inferred using
+   [FUBAR](https://veg.github.io/hyphy-site/methods/selection-methods/),
+   and other per-site evolutionary metrics
+
+ - per-segment evolutionary and phenotypic metrics, inferred using
+   [HyPhy](http://www.hyphy.org/)
+
+ The pipeline logic is implemented in
+ [Nextflow](https://www.nextflow.io/). A full description of the
+ pipeline has been submitted for publication. A link to the journal
+ article will be added here when it is available.
+
+ Setup
+ =====
+
Dependencies
------------
- - Nextflow
- - Python
- - usearch
- - mafft
- - HyPhy
- - TN93
- - GNU parallel
+ - [Nextflow](https://www.nextflow.io/)
+ - [Python](https://www.python.org/)
+ - [USEARCH](https://www.drive5.com/usearch/)
+ - [MAFFT](https://mafft.cbrc.jp/alignment/software/)
+ - [HyPhy](http://www.hyphy.org/)
+ - [FastTree](http://www.microbesonline.org/fasttree/)
+ - [TN93](https://github.com/veg/tn93)
+ - [GNU parallel](https://www.gnu.org/software/parallel/)
+ - Python dependencies (see below)

Install Python scripts
----------------------
@@ -24,17 +65,50 @@ To test:
    python setup.py nosetests


+ Configuration
+ -------------
+
+ The default config file is `nextflow.config`. It is recommended that
+ you make a separate config file that overrides any options that need
+ to be customized. For more information on Nextflow-specific
+ configuration, see [the Nextflow
+ documentation](https://www.nextflow.io/docs/latest/config.html).
+
+ At the very least, `params.reference_dir` and the parameters that
+ depend on it need to point to the various reference files used by the
+ pipeline:
+
+ - `params.reference_db`: FASTA file of reference sequences
+ - `params.contaminants_db`: FASTA file of contaminant sequences
+ - `params.reference_dna`: reference DNA sequence
+ - `params.reference_protein`: reference amino acid sequence
+ - `params.reference_coordinates`:
+
+
Usage
- -----
- Write a control file containing a list of fastq files, their sequence ids, and
- their dates, separated by spaces.
+ =====
+
+ Write a control file containing a list of FASTQ files, visit codes,
+ and dates, separated by spaces.

-     <file> <label> <date>
-     <file> <label> <date>
+     <file> <visit code> <date>
+     <file> <visit code> <date>
    ....

Dates must be in 'YYYYMMDD' format.
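
The control file format described above can be checked before launching a run. The following is a minimal sketch, not part of FLEA itself: the names `ControlEntry` and `parse_control_file` are hypothetical, and it assumes that blank lines may be skipped and that neither the file path nor the visit code contains whitespace. Whether the pipeline performs the same checks is not stated here.

```python
from datetime import datetime
from typing import List, NamedTuple


class ControlEntry(NamedTuple):
    """One row of the control file (hypothetical helper type)."""
    fastq_path: str
    visit_code: str
    date: datetime


def parse_control_file(text: str) -> List[ControlEntry]:
    """Parse '<file> <visit code> <date>' rows separated by whitespace.

    Raises ValueError if a row does not have exactly three fields or
    if its date is not a valid calendar date in YYYYMMDD format.
    """
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 3:
            raise ValueError(
                f"line {lineno}: expected 3 fields, got {len(fields)}"
            )
        path, visit_code, date_str = fields
        # strptime alone tolerates some shorter strings, so check the
        # shape explicitly before parsing.
        if len(date_str) != 8 or not date_str.isdigit():
            raise ValueError(f"line {lineno}: date must be YYYYMMDD")
        date = datetime.strptime(date_str, "%Y%m%d")
        entries.append(ControlEntry(path, visit_code, date))
    return entries


# Example rows; the file names and visit codes are made up.
example = """\
data/visit1.fastq V01 20130114
data/visit2.fastq V02 20130722
"""
for entry in parse_control_file(example):
    print(entry.fastq_path, entry.visit_code, entry.date.date())
```

Parsing the dates into `datetime` objects also makes it easy to confirm that the time points are in the intended order before they are passed to `--infile`.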

Run the pipeline with Nextflow:

-     nextflow path/to/flea.nf --infile path/to/metadata --results_dir path/to/results
+     nextflow path/to/flea.nf -c path/to/custom/config/file \
+     --infile path/to/metadata \
+     --results_dir path/to/results
+
+ The results directory will contain output from many pipeline
+ steps. The two files that contain the final results are:
+
+ - `session.json`: a JSON file to be visualized with
+   [`flea-web-app`](https://github.com/veg/flea-web-app).
+
+ - `session.zip`: a zip file with FASTA files for the consensus
+   sequences, ancestors, and MRCA, and a Newick file containing the
+   rooted phylogenetic tree.
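
To get a quick look at the two final result files without the web app, something like the sketch below may help. It uses only the Python standard library. The function name `summarize_results` is hypothetical, and because the internal layout of `session.json` and the member names inside `session.zip` are not documented here, the sketch only lists whatever it finds rather than assuming specific keys or file names.

```python
import json
import sys
import zipfile
from pathlib import Path


def summarize_results(results_dir):
    """Return (top-level JSON keys, zip member names) for a results dir.

    Assumes only that session.json is valid JSON and session.zip is a
    valid zip archive, as described above.
    """
    results_dir = Path(results_dir)

    with open(results_dir / "session.json") as handle:
        session = json.load(handle)
    # Only report keys if the top level is a JSON object.
    keys = sorted(session) if isinstance(session, dict) else []

    with zipfile.ZipFile(results_dir / "session.zip") as archive:
        members = archive.namelist()

    return keys, members


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize.py path/to/results
    keys, members = summarize_results(sys.argv[1])
    print("session.json top-level keys:", keys)
    print("session.zip members:", members)
```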