
Commit 144f9f1

De-smart quotes, minor formatting fixes.

1 parent 153c62f

25 files changed, +137 -131 lines changed

docs/algorithms/dm.rst

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ while on the sequencer. To identify duplicated reads, we apply a
 heuristic algorithm that looks at read fragments that have a consistent
 mapping signature. First, we bucket together reads that are from the
 same sequenced fragment by grouping reads together on the basis of read
-name and record group. Per read bucket, we then identify the 5 mapping
+name and record group. Per read bucket, we then identify the 5' mapping
 positions of the primarily aligned reads. We mark as duplicates all read
 pairs that have the same pair alignment locations, and all unpaired
 reads that map to the same sites. Only the highest scoring read/read
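
The heuristic above is compact in prose, so a minimal single-node sketch of the bucketing order may help. This is an illustration only, with a hypothetical Fragment type standing in for ADAM's internal read representation, and paired-end positions collapsed to one coordinate for brevity; it is not the actual implementation.

case class Fragment(readName: String, recordGroup: String,
                    fivePrimePos: Long, score: Int)

// Bucket reads from the same sequenced fragment by (record group, read
// name), group the buckets by their 5' mapping position, and flag every
// fragment at a shared position except those in the highest scoring bucket.
def flagDuplicates(frags: Seq[Fragment]): Map[Fragment, Boolean] = {
  val buckets = frags.groupBy(f => (f.recordGroup, f.readName)).values.toSeq
  buckets.groupBy(_.head.fivePrimePos).values.flatMap { colliding =>
    val best = colliding.maxBy(_.map(_.score).sum) // highest scoring survives
    colliding.flatMap(b => b.map(f => f -> (b ne best)))
  }.toMap
}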

docs/algorithms/reads.rst

Lines changed: 2 additions & 2 deletions

@@ -22,7 +22,7 @@ distributed system. These pre-processing stages include:
 2014), which is used by the GATK for Marking Duplicates. Our
 implementation is fully concordant with the Picard/GATK duplicate
 removal engine, except we are able to perform duplicate marking for
-chimeric read pairs. [2]_ Specifically, because Picards traversal
+chimeric read pairs. [2]_ Specifically, because Picard's traversal
 engine is restricted to processing linearly sorted alignments, Picard
 mishandles these alignments. Since our engine is not constrained by
 the underlying layout of data on disk, we are able to properly handle
@@ -60,7 +60,7 @@ distributed system. These pre-processing stages include:
 distribution of regions in mapped reads, joining two genomic datasets
 can be difficult or impossible when neither dataset fits completely
 on a single node. To reduce the impact of data skew on the runtime of
-joins, we implemented a load balancing engine in ADAMs
+joins, we implemented a load balancing engine in ADAM's
 ShuffleRegionJoin core. This load balancing is a preprocessing step
 to the ShuffleRegionJoin and improves performance by 10–100x. The
 first phase of the load balancer is to sort and repartition the left
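
For context, the load balancing described here runs transparently inside the shuffle join itself. A minimal usage sketch, assuming a SparkContext named sc and hypothetical input paths:

import org.bdgenomics.adam.rdd.ADAMContext._

// The load balancer sorts and repartitions the left-hand dataset before
// the join executes; no extra user-facing call is needed.
val reads    = sc.loadAlignments("sample.bam")
val features = sc.loadFeatures("genes.bed")
val joined   = reads.shuffleRegionJoin(features)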

docs/algorithms/ri.rst

Lines changed: 2 additions & 2 deletions

@@ -40,7 +40,7 @@ set of regions. For genomics, the convexity constraint is trivial to
 check: specifically, the genome is assembled out of reference contigs
 that define disparate 1-D coordinate spaces. If two regions exist on
 different contigs, they are known not to overlap. If two regions are on
-a single contig, we simply check to see if they overlap on that contigs
+a single contig, we simply check to see if they overlap on that contig's
 1-D coordinate plane.
 
 Given this realization, we can define the convex hull Algorithm, which is a data parallel
@@ -88,7 +88,7 @@ Candidate Generation and Realignment
 Once we have generated the target set, we map across all the reads and
 check to see if the read overlaps a realignment target. We then group
 together all reads that map to a given realignment target; reads that do
-not map to a target are randomly assigned to a ``null’’ target. We do
+not map to a target are randomly assigned to a "null" target. We do
 not attempt realignment for reads mapped to null targets.
 
 To process non-null targets, we must first generate candidate haplotypes
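
Since the first hunk leans on the per-contig 1-D overlap check, a minimal sketch of the hull merge may be useful. Region is a hypothetical type for illustration, not ADAM's ReferenceRegion API, and abutting regions are merged along with overlapping ones:

case class Region(contig: String, start: Long, end: Long) {
  // Regions on different contigs never overlap, matching the text above.
  def overlaps(that: Region): Boolean =
    contig == that.contig && start <= that.end && that.start <= end
  def hull(that: Region): Region =
    Region(contig, start min that.start, end max that.end)
}

// Fold each contig's regions in sorted order, extending the running hull
// while regions overlap and starting a new hull otherwise.
def convexHulls(regions: Seq[Region]): Seq[Region] =
  regions.groupBy(_.contig).values.toSeq.flatMap { onContig =>
    onContig.sortBy(_.start).foldLeft(List.empty[Region]) {
      case (current :: done, r) if current.overlaps(r) => current.hull(r) :: done
      case (acc, r) => r :: acc
    }.reverse
  }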

docs/api/adamContext.rst

Lines changed: 2 additions & 2 deletions

@@ -108,11 +108,11 @@ With an ``ADAMContext``, you can load:
 ``loadReferenceFile``, which supports 2bit files, FASTA, and Parquet
 (Scala only)
 
-The methods labeled Scala only may be usable from Java, but may not be
+The methods labeled "Scala only" may be usable from Java, but may not be
 convenient to use.
 
 The ``JavaADAMContext`` class provides Java-friendly methods that are
 equivalent to the ``ADAMContext`` methods. Specifically, these methods
 use Java types, and do not make use of default parameters. In addition
 to the load/save methods described above, the ``ADAMContext`` adds the
-implicit methods needed for using ADAMs pipe_ API.
+implicit methods needed for using ADAM's pipe_ API.
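
As a brief usage sketch of the load methods this hunk documents, assuming a SparkContext named sc and hypothetical file paths; the loaders pick a parser from the file extension and fall back to Parquet:

import org.bdgenomics.adam.rdd.ADAMContext._

// The import decorates sc with the ADAMContext load methods.
val reads    = sc.loadAlignments("sample.bam")
val variants = sc.loadVariants("calls.vcf")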

docs/api/genomicRdd.rst

Lines changed: 4 additions & 4 deletions

@@ -4,7 +4,7 @@ Working with genomic data using GenomicRDDs
 As described in the section on using the
 `ADAMContext <#adam-context>`__, ADAM loads genomic data into a
 ``GenomicRDD`` which is specialized for each datatype. This
-``GenomicRDD`` wraps Apache Sparks Resilient Distributed Dataset (RDD,
+``GenomicRDD`` wraps Apache Spark's Resilient Distributed Dataset (RDD,
 (Zaharia et al. 2012)) API with genomic metadata. The ``RDD``
 abstraction presents an array of data which is distributed across a
 cluster. ``RDD``\ s are backed by a computational lineage, which allows
@@ -49,7 +49,7 @@ round trip between Parquet and VCF.
 Transforming GenomicRDDs
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-Although ``GenomicRDD``\ s do not extend Apache Sparks ``RDD`` class,
+Although ``GenomicRDD``\ s do not extend Apache Spark's ``RDD`` class,
 ``RDD`` operations can be performed on them using the ``transform``
 method. Currently, we only support ``RDD`` to ``RDD`` transformations
 that keep the same type as the base type of the ``GenomicRDD``. To apply
@@ -132,8 +132,8 @@ to load the data directly using the Spark SQL APIs, instead of loading
 the data as an RDD, and then transforming that RDD into a SQL Dataset.
 
 The functionality of the ``adam-codegen`` package is simple. The goal of
-this package is to take ADAMs Avro schemas and to remap them into
-classes that implement Scalas ``Product`` interface, and which have a
+this package is to take ADAM's Avro schemas and to remap them into
+classes that implement Scala's ``Product`` interface, and which have a
 specific style of constructor that is expected by Spark SQL.
 Additionally, we define functions that translate between these Product
 classes and the bdg-formats Avro models. Parquet files written with
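
The transform method discussed in the second hunk is the escape hatch to plain RDD operations. A hedged sketch, assuming a SparkContext named sc and a hypothetical input path; the null check guards the boxed Integer returned by the Avro getter:

import org.bdgenomics.adam.rdd.ADAMContext._

val reads = sc.loadAlignments("sample.bam")

// transform applies an RDD => RDD function of the same record type,
// while the returned GenomicRDD keeps the original sequence dictionary
// and record-group metadata.
val highQuality = reads.transform(rdd =>
  rdd.filter(r => r.getMapq != null && r.getMapq > 30))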

docs/api/joins.rst

Lines changed: 2 additions & 2 deletions

@@ -1,4 +1,4 @@
-Using ADAMs RegionJoin API
+Using ADAM's RegionJoin API
 ---------------------------
 
 Another useful API implemented in ADAM is the RegionJoin API, which
@@ -75,7 +75,7 @@ A subset of these joins are depicted in Figure 2 below.
 
 One common pattern involves joining a single dataset against many
 datasets. An example of this is joining an RDD of features (e.g.,
-gene/exon coordinates) against many different RDD’s of reads. If the
+gene/exon coordinates) against many different RDDs of reads. If the
 object that is being used many times (gene/exon coordinates, in this
 case), we can force that object to be broadcast once and reused many
 times with the ``broadcast()`` function. This pairs with the
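
A hedged sketch of the one-against-many pattern described in this hunk, assuming a SparkContext named sc and hypothetical paths. The broadcast()-and-reuse form referenced in the text avoids re-broadcasting across joins; the plain form below broadcasts the left-hand dataset on each call:

import org.bdgenomics.adam.rdd.ADAMContext._

val genes  = sc.loadFeatures("genes.gff3")
val reads1 = sc.loadAlignments("sample1.bam")
val reads2 = sc.loadAlignments("sample2.bam")

// broadcastRegionJoin collects and broadcasts the gene coordinates, then
// streams each read set past them.
val hits1 = genes.broadcastRegionJoin(reads1)
val hits2 = genes.broadcastRegionJoin(reads2)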

docs/api/overview.rst

Lines changed: 6 additions & 6 deletions

@@ -4,9 +4,9 @@ API Overview
 The main entrypoint to ADAM is the `ADAMContext <#adam-context>`__,
 which allows genomic data to be loaded in to Spark as
 `GenomicRDD <#genomic-rdd>`__. GenomicRDDs can be transformed using
-ADAMs built in `pre-processing algorithms <#algorithms>`__, `Sparks
+ADAM's built in `pre-processing algorithms <#algorithms>`__, `Spark's
 RDD primitives <#transforming>`__, the `region join <#join>`__
-primitive, and ADAMs `pipe <#pipes>`__ APIs. GenomicRDDs can also be
+primitive, and ADAM's `pipe <#pipes>`__ APIs. GenomicRDDs can also be
 interacted with as `Spark SQL tables <#sql>`__.
 
 In addition to the Scala/Java API, ADAM can be used from
@@ -54,16 +54,16 @@ changes in ADAM.
 The ADAM Python API
 -------------------
 
-ADAMs Python API wraps the `ADAMContext <#adam-context>`__ and
+ADAM's Python API wraps the `ADAMContext <#adam-context>`__ and
 `GenomicRDD <#genomic-rdd>`__ APIs so they can be used from PySpark. The
-Python API is feature complete relative to ADAMs Java API, with the
+Python API is feature complete relative to ADAM's Java API, with the
 exception of the `region join <#join>`__ API, which is not supported.
 
 The ADAM R API
 --------------
 
-ADAMs R API wraps the `ADAMContext <#adam-context>`__ and
+ADAM's R API wraps the `ADAMContext <#adam-context>`__ and
 `GenomicRDD <#genomic-rdd>`__ APIs so they can be used from SparkR. The
-R API is feature complete relative to ADAMs Java API, with the
+R API is feature complete relative to ADAM's Java API, with the
 exception of the `region join <#join>`__ API, which is not supported.

docs/api/pipes.rst

Lines changed: 6 additions & 6 deletions

@@ -1,10 +1,10 @@
-Using ADAMs Pipe API
+Using ADAM's Pipe API
 ---------------------
 
-ADAMs ``GenomicRDD`` API provides support for piping the underlying
+ADAM's ``GenomicRDD`` API provides support for piping the underlying
 genomic data out to a single node process through the use of a ``pipe``
-API. This builds off of Apache Sparks ``RDD.pipe`` API. However,
-``RDD.pipe`` prints the objects as strings to the pipe. ADAMs pipe API
+API. This builds off of Apache Spark's ``RDD.pipe`` API. However,
+``RDD.pipe`` prints the objects as strings to the pipe. ADAM's pipe API
 adds several important functions:
 
 - It supports on-the-fly conversions to widely used genomic file
@@ -75,7 +75,7 @@ is being read into or out of the pipe. We support the following:
 - We do not support piping CRAM due to complexities around the
   reference-based compression.
 - ``FeatureRDD``:
-  - ``InForamtter``\ s: ``BEDInFormatter``, ``GFF3InFormatter``,
+  - ``InFormatter``\ s: ``BEDInFormatter``, ``GFF3InFormatter``,
     ``GTFInFormatter``, and ``NarrowPeakInFormatter`` for writing
     features out to a pipe in BED, GFF3, GTF/GFF2, or NarrowPeak format,
     respectively.
@@ -163,7 +163,7 @@ each machine in our cluster. We suggest several different approaches:
 Using the Pipe API from Java
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The pipe API example above uses Scalas implicit system and type
+The pipe API example above uses Scala's implicit system and type
 inference to make it easier to use the pipe API. However, we also
 provide a Java equivalent. There are several changes:
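
A hedged sketch of the Scala pipe usage this file documents, assuming a SparkContext named sc; the formatter and package names follow the ADAM version current at this commit and may differ in later releases, and "my_command" stands in for any single-node tool that reads SAM on stdin and writes SAM on stdout:

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.read.{ AlignmentRecordRDD, AnySAMOutFormatter, SAMInFormatter }

val reads = sc.loadAlignments("sample.bam")

// The InFormatter/OutFormatter pair fixes the on-the-wire format: records
// are serialized to the subprocess as SAM and parsed back from SAM.
implicit val tFormatter = SAMInFormatter
implicit val uFormatter = new AnySAMOutFormatter

val piped: AlignmentRecordRDD = reads.pipe("my_command")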

docs/architecture/evidence.rst

Lines changed: 3 additions & 3 deletions

@@ -1,13 +1,13 @@
-Interacting with data through ADAMs evidence access layer
+Interacting with data through ADAM's evidence access layer
 ----------------------------------------------------------
 
 ADAM exposes access to distributed datasets of genomic data through the
 `ADAMContext <#adam-context>`__ entrypoint. The ADAMContext wraps Apache
-Sparks SparkContext, which tracks the configuration and state of the
+Spark's SparkContext, which tracks the configuration and state of the
 current running Spark application. On top of the SparkContext, the
 ADAMContext provides data loading functions which yield
 `GenomicRDD <#genomic-rdd>`__\ s. The GenomicRDD classes provide a
-wrapper around Apache Sparks two APIs for manipulating distributed
+wrapper around Apache Spark's two APIs for manipulating distributed
 datasets: the legacy Resilient Distributed Dataset (Zaharia et al. 2012)
 and the new Spark SQL Dataset/DataFrame API (Armbrust et al. 2015).
 Additionally, the GenomicRDD is enriched with genomics-specific metadata

docs/architecture/overview.rst

Lines changed: 6 additions & 6 deletions

@@ -10,7 +10,7 @@ wide range of data formats and optimized query patterns without changing
 the data structures and query patterns that users are programming
 against.
 
-ADAMs architecture was introduced as a response to the challenges
+ADAM's architecture was introduced as a response to the challenges
 processing the growing volume of genomic sequencing data in a reasonable
 timeframe (Schadt et al. 2010). While the per-run latency of current
 genomic pipelines such as the GATK could be improved by manually
@@ -24,13 +24,13 @@ make it difficult for bioinformatics developers to create novel
 distributed genomic analyses, and does little to attack sources of
 inefficiency or incorrectness in distributed genomics pipelines.
 
-ADAMs architecture reconsiders how we build software for processing
+ADAM's architecture reconsiders how we build software for processing
 genomic data by eliminating the monolithic architectures that are driven
 by the underlying flat file formats used in genomics. These
 architectures impose significant restrictions, including:
 
 - These implementations are locked to a single node processing model.
-  Even the GATK’s “map-reduce styled walker API (McKenna et al. 2010)
+  Even the GATK's "map-reduce" styled walker API (McKenna et al. 2010)
   is limited to natively support processing on a single node. While
   these jobs can be manually partitioned and run in a distributed
   setting, manual partitioning can lead to imbalance in work
@@ -39,8 +39,8 @@ architectures impose significant restrictions, including:
   provided by modern distributed systems such as Apache Hadoop or Spark
   (Zaharia et al. 2012).
 - Most of these implementations assume
-  invariants about the sorted order of records on disk. This stack
-  smashing (specifically, the layout of data is used to accelerate a
+  invariants about the sorted order of records on disk. This "stack
+  smashing" (specifically, the layout of data is used to accelerate a
   processing stage) can lead to bugs when data does not cleanly map to
   the assumed sort order. Additionally, since these sort order
   invariants are rarely explicit and vary from tool to tool, pipelines
@@ -50,7 +50,7 @@ architectures impose significant restrictions, including:
   this at the cost of opacity. If we can express the query patterns
   that are accelerated by these invariants at a higher level, then we
   can achieve both a better programming environment and enable various
-  query optimizations. \end{itemize}
+  query optimizations.
 
 At the core of ADAM, users use the `ADAMContext <#adam-context>`__ to
 load data as `GenomicRDDs <#genomic-rdd>`__, which they can then
