---
title: "2a: Data Tables and Arrow files"
author: "Douglas Bates and Claudia Solis-Lemus"
jupyter: julia-1.8
---
## Load packages to be used
```{julia}
#| code-fold: show
using Arrow # Arrow storage and file format
using CategoricalArrays # similar to the factor type in R
using CSV # read/write CSV and similar formats
using Downloads # file downloads
using DataFrames # versatile tabular data format
using GZip # utilities for compressed files
using RCall # run R within Julia
using Tar # tar archive utilities
```
- `Downloads`, `GZip` and `Tar` are included just to demonstrate downloading and extracting tar files within Julia.
- You may find it easier to simply click on the download link and extract the files by hand.
- Downloading and extracting within Julia, as shown here, has a greater chance of working across various operating systems and environments in a workshop like this.
- `RCall` is included to show the use of other systems running within Julia.
- You can instead use your favorite environment, such as [jupyterlab](https://jupyter.org) or [RStudio](https://www.rstudio.com/products/rstudio/), to run Python or R
- Note that the [quarto notebooks](https://quarto.org) for these notes are easily converted, e.g. `quarto convert notebookname.qmd`, to Jupyter notebooks.
::: {.callout-note collapse="false"}
### Notes on Julia syntax
Boxes like this contain comments on Julia syntax and semantics in code examples.
The character in the upper right corner of the box is a toggle to expand or collapse the contents of the box.
:::
# Objectives
- Use an example of computing interval overlaps to introduce Julia facilities for working with tabular data.
- Introduce the [Arrow](https://arrow.apache.org) format for tabular data and demonstrate its use in Julia, Python/Pandas and R.
- Show a naive approach to computing overlaps.
- Modify the approach to use [RangeTrees.jl](https://github.com/dmbates/RangeTrees.jl) or [IntervalTrees.jl](https://github.com/BioJulia/IntervalTrees.jl) representation and to take advantage of multiple threads.
This notebook covers the first two objectives.
## Task and sample data
- [Li Heng](http://liheng.org) provides benchmark code and sample data for comparing programming languages on Bioinformatics tasks in his [biofast](https://github.com/lh3/biofast) repository.
- One of these tasks, an [interval query](https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html), takes two [.bed files](https://en.wikipedia.org/wiki/BED_(file_format)) to compare.
- One file, `ex-anno.bed`, contains a reference set of intervals; the other, `ex-rna.bed`, contains target intervals.
- For each target interval, determine which reference intervals overlap with it.
- In the benchmark both the number of reference intervals that overlap with a target and the proportion of the target covered by the overlap are computed.
- Note that the calculation of the proportion of overlap must allow for overlapping intervals in the reference set, as shown in this [figure](https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html#default-behavior).
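To make the task concrete, here is a minimal sketch of the basic overlap test for two half-open intervals (BED coordinates are half-open). This is illustration only, not the benchmark code, which also accumulates counts and coverage proportions.
```{julia}
# two half-open intervals [astart, astop) and [bstart, bstop) overlap
# exactly when each one starts before the other one ends
overlaps(astart, astop, bstart, bstop) = astart < bstop && bstart < astop

overlaps(10, 20, 15, 30), overlaps(10, 20, 20, 30)  # (true, false): touching endpoints do not overlap
```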
# Downloading the sample data
- The data sets for the benchmark are available in the compressed archive `biofast-data-v1.tar.gz` at [biofast-data-v1](https://github.com/lh3/biofast/releases/tag/biofast-data-v1).
- The following code chunks download the tarball, if necessary, and extract the two bedfiles to a directory `biofast-data-v1`, if necessary.
(Alternatively, you can just click on the link to the tarball in the github tag page and extract the files by hand.)
```{julia}
datadir = "biofast-data-v1"
tarball = "$datadir.tar.gz"
if !isfile(tarball)
    dataurl = join(
        ["https://github.com/lh3/biofast/releases/download", datadir, tarball],
        '/',
    )
    Downloads.download(dataurl, tarball)
end
run(`ls -lh $tarball`);
```
::: {.callout-note collapse="true"}
### String interpolation
An expression like `"$datadir.tar.gz"` interpolates the value of `datadir` into the string, producing the name shown in the output.
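For example, with the variables defined above:
```{julia}
"$datadir.tar.gz" == tarball   # interpolation reproduces the tarball name
```
Arbitrary expressions can be interpolated by wrapping them in parentheses, e.g. `"$(3 + 4) files"` evaluates to `"7 files"`.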
:::
::: {.callout-note collapse="true"}
### Redundant commas in argument lists
Redundant commas are allowed at the end of the list of arguments before the `)` in a function call.
:::
- The `.tar.gz` file is about 0.5 GB, but most of that is the data for the FASTQ parsing test.
## Extract the files of interest (if not already present)
```{julia}
isdir(datadir) || mkdir(datadir)
bedfnms = joinpath.(datadir, ["ex-anno.bed", "ex-rna.bed"])
toextract = filter(!isfile, bedfnms) # don't overwrite existing files
if !isempty(toextract)
    tmpdir = gzopen(tarball, "r") do io
        Tar.extract(h -> in(h.path, toextract), io)
    end
    for pathnm in toextract
        mv(joinpath(tmpdir, pathnm), pathnm; force = true)
    end
end
filter(endswith(".bed"), readdir(datadir))
```
::: {.callout-note collapse="true"}
### Dot vectorization
The call `joinpath.(datadir, ["ex-anno.bed", "ex-rna.bed"])` is an example of [dot vectorization](https://docs.julialang.org/en/v1/manual/functions/#man-vectorized).
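The dot applies the function elementwise to array arguments (broadcasting scalars such as `datadir` across the vector). A small illustration:
```{julia}
uppercase.(["ex-anno", "ex-rna"])   # applies `uppercase` to each element of the vector
```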
:::
::: {.callout-note collapse="true"}
### Stabby lambda syntax for anonymous functions
The expression `h -> in(h.path, toextract)` defines an anonymous function, in the "stabby lambda" syntax, to be used as a predicate in `Tar.extract`.
```{julia}
methods(Tar.extract)
```
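A simpler example of the same syntax:
```{julia}
filter(x -> x > 2, 1:5)   # anonymous predicate keeping the elements greater than 2
```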
:::
::: {.callout-note collapse="true"}
### do/end blocks
The 'do/end' block is yet another way of writing an anonymous function passed as the first argument in the call to `gzopen`, even though it occurs after that call in the code.
```{julia}
methods(gzopen)
```
The effect is to uncompress the file into a stream, process the stream in this anonymous function, then close the stream.
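A small illustration of the equivalence:
```{julia}
map(1:3) do i   # the do block becomes the first argument, i.e. map(i -> i^2, 1:3)
    i^2
end
```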
:::
::: {.callout-note collapse="true"}
### Extract to temporary directory
Because `Tar.extract` is conservative about overwriting files and requires that the directory into which the files are extracted be empty, we extract to a freshly-created temporary directory then move the files to the desired location.
:::
::: {.callout-note collapse="true"}
### Fully qualified names for many packages of utilities
- It is common for packages providing utilities to avoid name conflicts by not exporting any names from their namespace (or Module).
- The fully qualified name, `Tar.extract`, can always be used - similar to Python naming conventions.
- If a package exports a name, say `foo`, then after the `using FooPackage` directive, the unqualified name `foo` can be used.
- The `varinfo` function provides a listing of the names exported by a Package (formally the package's `Module`).
- Compare the result below with that of, say, `varinfo(DataFrames)`.
```{julia}
varinfo(Tar)
```
:::
# Initial processing
- The `.bed` files, in this case, are simple tab-separated-value files.
- Each of the language implementations in the `biofast` benchmark contains code to parse lines of the `.bed` files producing a `String` and two `Int32` values.
- Furthermore, the results of these benchmarks are written out as very large text files. I don't plan to read a several-million-line text file to check whether a program is working properly.
- Why go through this parsing of text files to create a numeric representation in each language?
- Writing code to parse a CSV or TSV file is tedious and error prone.
- But skilled people have put a lot of work into creating packages to do just that.
- More importantly they have tested, debugged, and documented their packages.
- Within the `CSV.jl` package, the `CSV.read` function reads and parses a file and converts it to a type specified by the second argument.
- As is common for such functions, there are many, many optional named arguments.
- We read `ex-anno.bed` and create a `DataFrame` as
```{julia}
annodf = CSV.read(
    joinpath(datadir, "ex-anno.bed"),
    DataFrame;
    delim = '\t',
    types = [String, Int32, Int32],
    header = ["chromo", "start", "stop"],
)
```
::: {.callout-note collapse="true"}
### Positional versus named arguments
- Positional arguments must come before named arguments.
- Optionally, the comma after the last positional argument can be replaced by a semicolon, as shown above.
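For example, both forms are accepted here:
```{julia}
round(1.23456, digits = 2) == round(1.23456; digits = 2)   # `1.23456` is positional, `digits` is named
```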
:::
- It turns out that both of the `.bed` files contain many duplicate rows - probably not intentionally. We eliminate the duplicates with `unique!`.
::: {.callout-note collapse="true"}
### Names of mutating functions
A function's name ending in `!` is a hint that it is a *mutating* function, which can modify one or more of its arguments.
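Compare `sort` and `sort!`:
```{julia}
v = [3, 1, 2]
sort(v)    # returns a sorted copy; `v` is unchanged
sort!(v)   # sorts `v` in place
v          # now [1, 2, 3]
```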
:::
- We also change the tags `"chr1"` through `"chr9"` to `"chr01"` through `"chr09"` so that later sorting by these strings will also sort by chromosome number.
- This is done with a `replace!` method, which takes the object to modify and one or more `Pair`s of values of the form `from => to`. These pairs can be generated using string interpolation as
```{julia}
replacements = ["chr$i" => "chr0$i" for i = 1:9]
```
::: {.callout-note collapse="true"}
### Comprehensions
The expression on the right of the `=` is called a "comprehension". I think of it as an inside-out loop.
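The comprehension above is equivalent to this explicit loop:
```{julia}
looped = Pair{String,String}[]   # an empty vector of String => String pairs
for i = 1:9
    push!(looped, "chr$i" => "chr0$i")
end
looped == replacements
```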
:::
```{julia}
show(unique(replace!(annodf.chromo, replacements...)))
```
::: {.callout-note collapse="true"}
### Ellipsis in argument list
The ellipsis (`...`) after `replacements` in this call "splats" the single name into, in this case, 9 different arguments of pairs. That is, this call is the equivalent of a call to `replace!` with 10 arguments: `annodf.chromo`, `"chr1" => "chr01"`, `"chr2" => "chr02"`, and so on.
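A small illustration with a two-element vector of pairs:
```{julia}
prs = ["a" => "A", "b" => "B"]
replace(["a", "b", "a"], prs...)   # same as replace(["a", "b", "a"], "a" => "A", "b" => "B")
```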
:::
```{julia}
unique!(annodf)
annodf.chromo = categorical(
    annodf.chromo;
    ordered = true,
    levels = sort(levels(annodf.chromo)),
)
annodf
```
```{julia}
show(levels(annodf.chromo))
```
::: {.callout-note collapse="true"}
### Categorical arrays
Representing the `chromo` column as a `CategoricalArray` converts the individual values to indices into a vector of `String`s, like a `factor` in `R`.
The set of possible levels of `chromo` is given the natural ordering.
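A toy example showing the levels and the integer codes stored for each element:
```{julia}
cv = categorical(["chr2", "chr1", "chr2"]; ordered = true)
levels(cv), levelcode.(cv)   # the sorted levels and the code for each element
```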
:::
# Arrow file format
- The [Arrow project](https://arrow.apache.org) defines a memory layout and file format for storing and manipulating column-oriented, static tables (i.e. like data frames in R, Python/Pandas, and Julia).
- Either 'lz4' or 'zstd' compression can be used when creating an Arrow file.
- Metadata on the names and types of columns is automatically stored. Additional column or table metadata can be specified.
```{julia}
Arrow.write(
"./biofast-data-v1/ex-anno.arrow",
annodf;
compress = :lz4,
metadata = [
"url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1",
],
)
```
- `bed2arrow` encapsulates reading the `.bed` file, replacing the strings in the `chromo` column, reducing the data frame to unique, sorted rows, and writing the Arrow file.
- Named arguments with defaults are used to configure the "usual" call but allow for variations.
- The Boolean `overwrite` named argument, which defaults to `false`, controls overwriting of existing files.
```{julia}
metadata =
    ["url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"]
function bed2arrow(
    fnroot::AbstractString;
    datadir = datadir,
    delim = '\t',
    header = [:chromo, :start, :stop],
    types = [String, Int32, Int32],
    compress = :lz4,
    metadata = metadata,
    replacements = replacements,
    ordered = true,
    overwrite::Bool = false,
)
    bednm = joinpath(datadir, "$fnroot.bed")
    arrownm = joinpath(datadir, "$fnroot.arrow")
    if overwrite || !isfile(arrownm)
        df = unique!(CSV.read(bednm, DataFrame; header, delim, types))
        replace!(df.chromo, replacements...)
        df.chromo =
            categorical(df.chromo; ordered, levels = sort(unique(df.chromo)))
        Arrow.write(arrownm, df; compress, metadata)
    end
    return arrownm
end
rnaarrownm = bed2arrow("ex-rna")
```
```{julia}
run(`ls -lh $datadir`);
```
:::{.callout-note collapse="true"}
### Semicolon separating positional and named arguments
In the calls to `CSV.read` and `Arrow.write`, a bare name appearing after the semicolon that separates positional from named arguments indicates that the value passed for that named argument is the object of the same name in the current namespace. That is, `Arrow.write(arrownm, df; compress, metadata)` is equivalent to `Arrow.write(arrownm, df; compress=compress, metadata=metadata)`.
:::
## Reading Arrow files in Julia
```{julia}
annotbl = Arrow.Table(joinpath(datadir, "ex-anno.arrow"))
```
- Although the schema describes the `chromo` column as `String`s, the values are dictionary encoded so that each value is represented by a single byte.
```{julia}
typeof(annotbl.chromo)
```
```{julia}
@time rnatbl = Arrow.Table(rnaarrownm)
```
- We can use operations like split-apply-combine on these tables to summarize their properties.
```{julia}
annogdf = groupby(DataFrame(annotbl), :chromo)
rnagdf = groupby(DataFrame(rnatbl), :chromo)
innerjoin(
    combine(rnagdf, nrow => :nrna),
    combine(annogdf, nrow => :nanno);
    on = :chromo,
)
```
- In the next section we will use data from `chr01` for comparative timings. It has the greatest number of intervals in both the reference and target groups.
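One way to extract that subset is sketched below (a minimal illustration using the tables created above; the comparative-timing code may do this differently):
```{julia}
chr01rna = subset(DataFrame(rnatbl), :chromo => ByRow(==("chr01")))  # keep only chr01 target intervals
nrow(chr01rna)
```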
## Reading Arrow files in R
- In R (and in Python) the Arrow file format is often conflated with an earlier file format called Feather and is referred to as `Feather V2`.
- In R the `arrow::read_feather` function returns a tibble. In an R session it looks like
```r
> library(tibble)
> arrow::read_feather("biofast-data-v1/ex-rna.arrow")
# A tibble: 4,685,080 × 3
chromo start stop
<fct> <int> <int>
1 chr02 216499331 216501458
2 chr07 101239611 101245071
3 chr19 49487626 49491841
4 chr10 80155590 80169336
5 chr17 76270411 76271290
6 chr06 31268756 31272069
7 chr05 170083214 170083368
8 chr19 51989731 51989996
9 chr18 55225980 55226732
10 chr16 84565611 84566066
# … with 4,685,070 more rows
```
- The `RCall` package in Julia allows for running an R process within a Julia session.
- One way of executing R code with `RCall` is to prepend `R` to a string. This causes the string to be evaluated in R.
- `$`-interpolation in the string causes a Julia object to be copied into the R environment and its name in R interpolated.
```{julia}
R"""
library(tibble)
glimpse(rnatbl <- arrow::read_feather($rnaarrownm))
""";
```
## Reading Arrow files in Python
- The `pyarrow` package includes `pyarrow.feather`. Its use in a Python session looks like
```python
>>> import pyarrow.feather as fea
>>> fea.read_table('./biofast-data-v1/ex-rna.arrow')
pyarrow.Table
chromo: dictionary<values=string, indices=int8, ordered=0> not null
start: int32 not null
stop: int32 not null
----
chromo: [ -- dictionary:
["chr01","chr02","chr03","chr04","chr05",...,"chr20","chr21","chr22","chrX","chrY"] -- indices:
[1,6,18,9,16,...,16,15,10,22,0]]
start: [[216499331,101239611,49487626,80155590,76270411,...,7014179,75627747,59636724,153785767,182839364]]
stop: [[216501458,101245071,49491841,80169336,76271290,...,7014515,75631483,59666963,153787586,182887745]]
>>> fea.read_feather('./biofast-data-v1/ex-rna.arrow')
chromo start stop
0 chr02 216499331 216501458
1 chr07 101239611 101245071
2 chr19 49487626 49491841
3 chr10 80155590 80169336
4 chr17 76270411 76271290
... ... ... ...
4685075 chr17 7014179 7014515
4685076 chr16 75627747 75631483
4685077 chr11 59636724 59666963
4685078 chrX 153785767 153787586
4685079 chr01 182839364 182887745
[4685080 rows x 3 columns]
```
- `read_table` returns a `Table` object, whereas `read_feather` returns a Pandas dataframe.
- The `PyCall` package for Julia starts a Python process and allows communication with it, including data transfer.
- I use this instead of the Python REPL when working with both Julia and Python.
- Configuring Python, Conda, pyarrow, pandas, and PyCall across platforms is sufficiently complicated that it would almost surely cause failures for some workshop participants. Instead of evaluating this code chunk, we quote the results.
```julia
julia> using PyCall
julia> fea = pyimport("pyarrow.feather");
julia> fea.read_table(rnaarrownm)
PyObject pyarrow.Table
chromo: dictionary<values=string, indices=int8, ordered=0> not null
start: int32 not null
stop: int32 not null
----
chromo: [ -- dictionary:
["chr01","chr02","chr03","chr04","chr05",...,"chr20","chr21","chr22","chrX","chrY"] -- indices:
[1,6,18,9,16,...,16,15,10,22,0]]
start: [[216499331,101239611,49487626,80155590,76270411,...,7014179,75627747,59636724,153785767,182839364]]
stop: [[216501458,101245071,49491841,80169336,76271290,...,7014515,75631483,59666963,153787586,182887745]]
```
# Conclusions
- Julia provides an impressive array of tools for Bioinformatics and similar tasks.
- We have shown the use of
[Arrow.jl](https://github.com/apache/arrow-julia),
[CategoricalArrays.jl](https://github.com/JuliaData/CategoricalArrays.jl),
[CSV.jl](https://github.com/JuliaData/CSV.jl),
and
[DataFrames.jl](https://github.com/JuliaData/DataFrames.jl)
for data input, storage, and manipulation.
- Although not shown here [DataFrameMacros.jl](https://github.com/jkrumbiegel/DataFrameMacros.jl) (or [DataFramesMeta.jl](https://github.com/JuliaData/DataFramesMeta.jl)) and [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) are worth considering for more sophisticated work with DataFrames.
- A general framework for working with tabular data is provided in [Tables.jl](https://github.com/JuliaData/Tables.jl) (not shown here). Alternatively [TypedTables.jl](https://github.com/JuliaData/TypedTables.jl) provides a "lean and mean" implementation of both row- and column-oriented tabular data structures.
- We have shown the use of
[PyCall.jl](https://github.com/JuliaPy/PyCall.jl) and
[RCall.jl](https://github.com/JuliaInterop/RCall.jl) for running and communicating with other language systems within Julia.
- We have shown the use of utility packages [Downloads.jl](https://github.com/JuliaLang/Downloads.jl), [GZip.jl](https://github.com/JuliaIO/GZip.jl.git) and [Tar.jl](https://github.com/JuliaIO/Tar.jl) for scripting, say within [Quarto](https://quarto.org) documents like these.
- Take a moment to look at the repositories for some of these Julia packages. Most (all?) of them are 100% Julia code.
## Version information
```{julia}
versioninfo()
```