
Commit 8622e0c

converter done
Former-commit-id: f512bea019378a3d133faaca57d5f4fef448d4c3
1 parent 8a69b96 commit 8622e0c

3 files changed (+42, -13 lines)

README.md

+13 -4
@@ -1,7 +1,7 @@
 Brisera
 =======
 
-A Python implementation of a distributed seed and reduce algorithm (similar to BlastReduce and CloudBurst) that utilizes RDDs (resilient distributed datasets) to perform fast iterative analyses and dynamic programming without relying on chained MapReduce jobs.
+A Python implementation of a distributed seed and reduce algorithm (similar to BlastReduce and CloudBurst) that utilizes RDDs (resilient distributed datasets) to perform fast iterative analyses and dynamic programming without relying on chained MapReduce jobs.
 
 Quick Start
 -----------
@@ -20,9 +20,18 @@ To install the required dependencies:
 
 The code for Brisera is found in the `brisera` Python module. This module must be available to the Spark applications (i.e. importable), either by running them locally from the working directory that contains `brisera` or by using a virtual environment (recommended). To install `brisera` and all dependencies, use the `setup.py` script:
 
-$ python setup.py install
+$ python setup.py install
 
-But note that you will still have to have access to the Spark applications that are in the `apps/` directory - don't delete them out of hand!
+But note that you will still have to have access to the Spark applications that are in the `apps/` directory - don't delete them out of hand!
+
+Usage
+-----
+
+To read a burst sequence file (e.g. `fixtures/cloudburst/100k.br`) in order to compare results from CloudBurst to Brisera, you can use the `read_burst.py` Spark application as follows:
+
+$ spark-submit --master local[*] apps/read_burst.py <sequence_file> <output_dir>
+
+This will write out each record (or chunk) from the sequence file to a text file on disk.
 
 Other Details
 -------------
@@ -37,4 +46,4 @@ Brisera means to "explode" or to "burst" in Swedish. Since I'm reworking CloudBu
 
 1. X\. Li, W. Jiang, Y. Jiang, and Q. Zou, “Hadoop Applications in Bioinformatics,” in Open Cirrus Summit (OCS), 2012 Seventh, 2012, pp. 48–52.
 
-1. R\. K. Menon, G. P. Bhat, and M. C. Schatz, “Rapid parallel genome indexing with MapReduce,” in Proceedings of the second international workshop on MapReduce and its applications, 2011, pp. 51–58.
+1. R\. K. Menon, G. P. Bhat, and M. C. Schatz, “Rapid parallel genome indexing with MapReduce,” in Proceedings of the second international workshop on MapReduce and its applications, 2011, pp. 51–58.

apps/convert_fasta.py

+22 -1
@@ -7,9 +7,12 @@
 ## Imports
 ##########################################################################
 
+import os
 import sys
+import shutil
+import tempfile
 
-from operator import add
+from brisera.convert import FastaChunker
 from pyspark import SparkConf, SparkContext
 
 if __name__ == "__main__":
@@ -23,7 +26,25 @@
 
     infile = sys.argv[1]
     outfile = sys.argv[2]
+    tempdir = tempfile.mkdtemp(prefix="fasta")
+    tempout = os.path.join(tempdir, "output")
 
     print "Converting FASTA %s to Sequence %s" % (infile, outfile)
 
+    chunker = FastaChunker(infile)
+    chunks = sc.parallelize(chunker.convert()).coalesce(1, shuffle=True)
+    chunks.saveAsSequenceFile(tempout)
 
+    partfile = None
+    for name in os.listdir(tempout):
+        if name.startswith('part-'):
+            partfile = os.path.join(tempout, name)
+            break
+
+    if partfile is None:
+        raise Exception("Could not find partition file in %s!" % tempout)
+
+    shutil.move(partfile, outfile)
+    shutil.rmtree(tempdir)
+
+    assert not os.path.exists(tempdir)
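
The second hunk above saves the RDD into a scratch directory, picks out the single `part-` file, moves it to the requested output path, and deletes the scratch directory. A minimal sketch of that pattern in plain Python 3 follows. It is not the committed code: the helper names `collect_single_part` and `write_single_file` are hypothetical, and the `save` callable is an assumed stand-in for a call such as `chunks.saveAsSequenceFile(path)`. The sketch wraps the work in `try`/`finally`, so the scratch directory is removed even when no part file is found, which the committed version does not do.

```python
import os
import shutil
import tempfile


def collect_single_part(job_output_dir, destination):
    """Move the first 'part-*' file in job_output_dir to destination."""
    for name in sorted(os.listdir(job_output_dir)):
        if name.startswith("part-"):
            shutil.move(os.path.join(job_output_dir, name), destination)
            return destination
    raise IOError("Could not find partition file in %s!" % job_output_dir)


def write_single_file(save, destination, prefix="fasta"):
    """
    Call save(path) on a fresh scratch path, then keep only the part file.

    `save` is a hypothetical callable that writes a job-style output
    directory at the path it is given (for example, a wrapper around
    rdd.saveAsSequenceFile). The scratch directory is always removed.
    """
    tempdir = tempfile.mkdtemp(prefix=prefix)
    try:
        tempout = os.path.join(tempdir, "output")
        save(tempout)
        return collect_single_part(tempout, destination)
    finally:
        # Runs on success and on failure, so no scratch data is left behind.
        shutil.rmtree(tempdir, ignore_errors=True)
```

With Spark, a call might look like `write_single_file(chunks.saveAsSequenceFile, outfile)`, assuming `chunks` has already been coalesced to one partition as in the diff.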

brisera/convert.py

+7 -8
@@ -2,6 +2,8 @@
 Handles the conversion of a FASTA sequence into a sequence format
 """
 
+import cPickle
+
 from brisera.utils import fasta
 from brisera.config import settings
 
@@ -50,18 +52,15 @@ def chunk(self, sequence):
             else:
                 offset = end - settings.overlap
 
-    def convert(self, writer):
+    def convert(self):
         """
-        The main entry point, convert the FASTA file and output it by
-        writing it to the given stream (the writer).
+        The main entry point: converts the FASTA file, yielding pairs
+        where the key is the index and the value is the record.
         """
 
         for idx, seq in self:
-            for chunk in self.chunk(seq):
-                record = str((idx, chunk))
-                writer.write(record+"\n")
-                break
-            break
+            for record in self.chunk(seq):
+                yield (idx, cPickle.dumps(record, cPickle.HIGHEST_PROTOCOL))
 
     def __iter__(self):
         """

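The rewritten `convert` turns each sequence into overlapping chunks and yields `(index, pickled record)` pairs instead of writing to a stream. The sketch below illustrates that idea in Python 3, where `cPickle` has been folded into `pickle`. Most of it is assumed rather than taken from the commit: the diff shows only the `offset = end - settings.overlap` step of `chunk`, so the `size` and `overlap` parameters, the `(offset, piece)` record layout, and the free-function form are all illustrative.

```python
import pickle


def chunk(sequence, size, overlap):
    """Yield (offset, piece) windows; neighbouring windows share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not sequence:
        return
    offset = 0
    while True:
        end = min(offset + size, len(sequence))
        yield (offset, sequence[offset:end])
        if end == len(sequence):
            break
        # Step back by `overlap` so seeds spanning a boundary are not lost.
        offset = end - overlap


def convert(sequences, size, overlap):
    """Yield (index, pickled record) pairs, one pair per chunk."""
    for idx, seq in sequences:
        for record in chunk(seq, size, overlap):
            yield (idx, pickle.dumps(record, pickle.HIGHEST_PROTOCOL))
```

A reader recovers each record with `pickle.loads(value)`. Because `convert` is a generator, a caller such as `sc.parallelize` consumes the pairs lazily until it materialises them.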