
Commit a64ac40 (1 parent: e244510)

Update docs, fix LFQ-parquet, release v0.14.0

File tree: 5 files changed (+122 −111 lines)

CHANGELOG.md (+3 −3)

@@ -4,11 +4,11 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [Unreleased]
+## [v0.14.0]
 
 ### Added
-- Support for parquet file format output. Search results and reporter ion quantification will be written to one file (`results.sage.parquet`) and label-free quant will be written to another (`lfq.parquet`)
+- Support for parquet file format output. Search results and reporter ion quantification will be written to one file (`results.sage.parquet`) and label-free quant will be written to another (`lfq.parquet`). Parquet files tend to be significantly smaller than TSV files, faster to parse, and are compatible with a variety of distributed SQL engines.
 
 ### Changed
-- Implement heapselect algorithm for faster sorting of candidate matches (#80)
+- Implement heapselect algorithm for faster sorting of candidate matches (#80). This is a backwards-incompatible change with respect to output - small changes in PSM ranks will be present between v0.13.4 and v0.14.0
 
 ## [v0.13.4]
 ### Fixed
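The heapselect entry in the changelog refers to selecting the top-k candidate matches without fully sorting them. Sage's actual implementation (#80) is not shown in this commit; the sketch below only illustrates the general technique with a bounded min-heap, which costs O(n log k) instead of the O(n log n) of a full sort:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Return the `k` largest values in descending order. A min-heap of size `k`
/// holds the current best candidates; each new value either displaces the
/// smallest member or is discarded.
fn heapselect_top_k(values: &[i32], k: usize) -> Vec<i32> {
    let mut heap: BinaryHeap<Reverse<i32>> = BinaryHeap::with_capacity(k + 1);
    for &v in values {
        heap.push(Reverse(v));
        if heap.len() > k {
            heap.pop(); // evict the smallest of the current top-k
        }
    }
    let mut top: Vec<i32> = heap.into_iter().map(|Reverse(v)| v).collect();
    top.sort_unstable_by(|a, b| b.cmp(a));
    top
}

fn main() {
    let scores = vec![3, 41, 7, 29, 15, 8, 22];
    assert_eq!(heapselect_top_k(&scores, 3), vec![41, 29, 22]);
}
```

Because only the top-k elements are ever ordered, ties and near-ties can land in slightly different ranks than a full stable sort would produce, which is consistent with the note about small PSM rank changes between v0.13.4 and v0.14.0.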

Cargo.toml (+2 −2)

@@ -6,6 +6,6 @@ members = [
 ]
 
 [profile.release]
-#lto = "fat"
-#codegen-units = 1
+lto = "fat"
+codegen-units = 1
 panic = "abort"

DOCS.md (+2 −2)

@@ -240,7 +240,7 @@ The "results.sage.tsv" file contains the following columns (headers):
 - `spectrum_q`: Assigned spectrum-level q-value.
 - `peptide_q`: Assigned peptide-level q-value.
 - `protein_q`: Assigned protein-level q-value.
-- `ms1_intensity`: Intensity of the MS1 precursor ion
+- `ms1_intensity`: Intensity of the selected MS1 precursor ion (not label-free quant)
 - `ms2_intensity`: Total intensity of MS2 spectrum
 
-These columns provide comprehensive information about each candidate peptide spectrum match (PSM) identified by the Sage search engine, enabling users to assess the quality and characteristics of the results.
+These columns provide comprehensive information about each candidate peptide spectrum match (PSM) identified by the Sage search engine.

README.md (+10 −8)

@@ -22,8 +22,10 @@ Check out the [blog post introducing Sage](https://lazear.github.io/sage/) for m
 - Incredible performance out of the box
 - Effortlessly cross-platform (Linux/MacOS/Windows), effortlessly parallel (uses all of your CPU cores)
 - Fragment indexing strategy allows for blazing fast narrow and open searches (> 500 Da precursor tolerance)
-- MS3-TMT quantification (R-squared of 0.999 with Proteome Discoverer)
+- Isobaric quantification (MS2/MS3-TMT, or custom reporter ions)
+- Label-free quantification: consider all charge states & isotopologues *a la* FlashLFQ
 - Capable of searching for chimeric/co-fragmenting spectra
+- Wide-window (dynamic precursor tolerance) search mode - enables WWA/PRM/DIA searches
 - Retention time prediction models fit to each LC/MS run
 - PSM rescoring using built-in linear discriminant analysis (LDA)
 - PEP calculation using a non-parametric model (KDE)
@@ -33,15 +35,11 @@ Check out the [blog post introducing Sage](https://lazear.github.io/sage/) for m
 - Built-in support for reading gzipped-mzML files
 - Support for reading/writing directly from AWS S3
 
-### Experimental features
-
-- Label-free quantification: consider all charge states & isotopologues *a la* FlashLFQ
-
 ### Assign multiple peptides to complex spectra
 
 <img src="figures/chimera_27525.png" width="800">
 
-- When chimeric searching is turned on, 2 peptide identifications will be reported for each MS2 scan, both with `rank=1`
+- When chimeric searching is enabled, multiple peptide identifications can be reported for each MS2 scan
 
 ### Sage trains machine learning models for FDR refinement and posterior error probability calculation
 
@@ -113,6 +111,8 @@ Options:
       Path where search and quant results will be written. Overrides the directory specified in the configuration file.
   --batch-size <batch-size>
       Number of files to search in parallel (default = number of CPUs/2)
+  --parquet
+      Write parquet files instead of tab-separated files
   --write-pin
       Write percolator-compatible `.pin` output files
   -h, --help
@@ -127,7 +127,7 @@ Example usage: `sage config.json`
 
 Some options in the parameters file can be over-written using the command line interface. These are:
 
-1. The paths to the raw mzML data
+1. The paths to the mzML data
 2. The path to the database (fasta file)
 3. The output directory
 
@@ -149,12 +149,14 @@ Running Sage will produce several output files (located in either the current di
 - MS2 search results will be stored as a tab-separated file (`results.sage.tsv`) file - this is a tab-separated file, which can be opened in Excel/Pandas/etc
 - MS2 and MS3 quantitation results will be stored as a tab-separated file (`tmt.tsv`, `lfq.tsv`) if `quant.tmt` or `quant.lfq` options are used in the parameter file
 
+If `--parquet` is passed as a command line argument, `results.sage.parquet` (and optionally, `lfq.parquet`) will be written. These have a similar set of columns, but TMT values are stored as a nested array alongside PSM features
+
 ## Configuration file schema
 
 ### Notes
 
 - The majority of parameters are optional - only "database.fasta", "precursor_tol", and "fragment_tol" are required. Sage will try and use reasonable defaults for any parameters not supplied
-- Tolerances are specified on the *experimental* m/z values. To perform a -100 to +500 Da open search (mass window applied to *precursor*), you would use `"da": [-500, 100]`
+- Tolerances are specified on the *experimental* m/z values. To perform a -100 to +500 Da open search (mass window applied to *theoretical*), you would use `"da": [-500, 100]`
 
 ### Decoys
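As a worked example of the corrected tolerance note in the README diff, a minimal open-search configuration might look like the fragment below. The fasta path and fragment tolerance values are illustrative placeholders, not values taken from this commit; only the `"da": [-500, 100]` precursor window comes from the documentation above:

```json
{
  "database": { "fasta": "human.fasta" },
  "precursor_tol": { "da": [-500, 100] },
  "fragment_tol": { "ppm": [-10, 10] }
}
```

Note the asymmetry: a -100 to +500 Da search on the *theoretical* mass becomes `[-500, 100]` when expressed as a window around the *experimental* m/z.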

crates/sage-cloudpath/src/parquet.rs (+105 −96)
@@ -34,7 +34,7 @@ pub fn build_schema() -> Result<Type, parquet::errors::ParquetError> {
     required byte_array proteins (utf8);
     required int32 num_proteins;
     required int32 rank;
-    required int32 label;
+    required boolean is_decoy;
     required float expmass;
     required float calcmass;
     required int32 charge;
@@ -131,97 +131,99 @@ pub fn serialize_features(
 
     let buf = Vec::new();
     let mut writer = SerializedFileWriter::new(buf, schema.into(), options.into())?;
-    let mut rg = writer.next_row_group()?;
 
-    macro_rules! write_col {
-        ($field:ident, $ty:ident) => {
-            if let Some(mut col) = rg.next_column()? {
-                col.typed::<$ty>().write_batch(
-                    &features
-                        .iter()
-                        .map(|f| f.$field as <$ty as DataType>::T)
-                        .collect::<Vec<_>>(),
-                    None,
-                    None,
-                )?;
-                col.close()?;
-            }
-        };
-        ($lambda:expr, $ty:ident) => {
-            if let Some(mut col) = rg.next_column()? {
-                col.typed::<$ty>().write_batch(
-                    &features.iter().map($lambda).collect::<Vec<_>>(),
-                    None,
-                    None,
-                )?;
-                col.close()?;
-            }
-        };
-    }
+    for features in features.chunks(65536) {
+        let mut rg = writer.next_row_group()?;
+        macro_rules! write_col {
+            ($field:ident, $ty:ident) => {
+                if let Some(mut col) = rg.next_column()? {
+                    col.typed::<$ty>().write_batch(
+                        &features
+                            .iter()
+                            .map(|f| f.$field as <$ty as DataType>::T)
+                            .collect::<Vec<_>>(),
+                        None,
+                        None,
+                    )?;
+                    col.close()?;
+                }
+            };
+            ($lambda:expr, $ty:ident) => {
+                if let Some(mut col) = rg.next_column()? {
+                    col.typed::<$ty>().write_batch(
+                        &features.iter().map($lambda).collect::<Vec<_>>(),
+                        None,
+                        None,
+                    )?;
+                    col.close()?;
+                }
+            };
+        }
 
-    write_col!(
-        |f: &Feature| filenames[f.file_id].as_str().into(),
-        ByteArrayType
-    );
-    write_col!(|f: &Feature| f.spec_id.as_str().into(), ByteArrayType);
-    write_col!(
-        |f: &Feature| database[f.peptide_idx].to_string().as_bytes().into(),
-        ByteArrayType
-    );
-    write_col!(
-        |f: &Feature| database[f.peptide_idx].sequence.as_ref().into(),
-        ByteArrayType
-    );
-    write_col!(
-        |f: &Feature| database[f.peptide_idx]
-            .proteins(&database.decoy_tag, database.generate_decoys)
-            .as_str()
-            .into(),
-        ByteArrayType
-    );
-    write_col!(
-        |f: &Feature| database[f.peptide_idx].proteins.len() as i32,
-        Int32Type
-    );
-    write_col!(rank, Int32Type);
-    write_col!(label, Int32Type);
-    write_col!(expmass, FloatType);
-    write_col!(calcmass, FloatType);
-    write_col!(charge, Int32Type);
-    write_col!(peptide_len, Int32Type);
-    write_col!(missed_cleavages, Int32Type);
-    write_col!(isotope_error, FloatType);
-    write_col!(delta_mass, FloatType);
-    write_col!(average_ppm, FloatType);
-    write_col!(hyperscore, FloatType);
-    write_col!(delta_next, FloatType);
-    write_col!(delta_best, FloatType);
-    write_col!(rt, FloatType);
-    write_col!(aligned_rt, FloatType);
-    write_col!(predicted_rt, FloatType);
-    write_col!(delta_rt_model, FloatType);
-    write_col!(matched_peaks, Int32Type);
-    write_col!(longest_b, Int32Type);
-    write_col!(longest_y, Int32Type);
-    write_col!(longest_y_pct, FloatType);
-    write_col!(matched_intensity_pct, FloatType);
-    write_col!(scored_candidates, Int32Type);
-    write_col!(poisson, FloatType);
-    write_col!(discriminant_score, FloatType);
-    write_col!(posterior_error, FloatType);
-    write_col!(spectrum_q, FloatType);
-    write_col!(peptide_q, FloatType);
-    write_col!(protein_q, FloatType);
+        write_col!(
+            |f: &Feature| filenames[f.file_id].as_str().into(),
+            ByteArrayType
+        );
+        write_col!(|f: &Feature| f.spec_id.as_str().into(), ByteArrayType);
+        write_col!(
+            |f: &Feature| database[f.peptide_idx].to_string().as_bytes().into(),
+            ByteArrayType
+        );
+        write_col!(
+            |f: &Feature| database[f.peptide_idx].sequence.as_ref().into(),
+            ByteArrayType
+        );
+        write_col!(
+            |f: &Feature| database[f.peptide_idx]
+                .proteins(&database.decoy_tag, database.generate_decoys)
+                .as_str()
+                .into(),
+            ByteArrayType
+        );
+        write_col!(
+            |f: &Feature| database[f.peptide_idx].proteins.len() as i32,
+            Int32Type
+        );
+        write_col!(rank, Int32Type);
+        write_col!(|f: &Feature| f.label == -1, BoolType);
+        write_col!(expmass, FloatType);
+        write_col!(calcmass, FloatType);
+        write_col!(charge, Int32Type);
+        write_col!(peptide_len, Int32Type);
+        write_col!(missed_cleavages, Int32Type);
+        write_col!(isotope_error, FloatType);
+        write_col!(delta_mass, FloatType);
+        write_col!(average_ppm, FloatType);
+        write_col!(hyperscore, FloatType);
+        write_col!(delta_next, FloatType);
+        write_col!(delta_best, FloatType);
+        write_col!(rt, FloatType);
+        write_col!(aligned_rt, FloatType);
+        write_col!(predicted_rt, FloatType);
+        write_col!(delta_rt_model, FloatType);
+        write_col!(matched_peaks, Int32Type);
+        write_col!(longest_b, Int32Type);
+        write_col!(longest_y, Int32Type);
+        write_col!(longest_y_pct, FloatType);
+        write_col!(matched_intensity_pct, FloatType);
+        write_col!(scored_candidates, Int32Type);
+        write_col!(poisson, FloatType);
+        write_col!(discriminant_score, FloatType);
+        write_col!(posterior_error, FloatType);
+        write_col!(spectrum_q, FloatType);
+        write_col!(peptide_q, FloatType);
+        write_col!(protein_q, FloatType);
 
-    if let Some(col) = rg.next_column()? {
-        if reporter_ions.is_empty() {
-            write_null_column(col, features.len())?;
-        } else {
-            write_reporter_ions(col, features, reporter_ions)?;
+        if let Some(col) = rg.next_column()? {
+            if reporter_ions.is_empty() {
+                write_null_column(col, features.len())?;
+            } else {
+                write_reporter_ions(col, features, reporter_ions)?;
+            }
         }
-    }
 
-    rg.close()?;
+        rg.close()?;
+    }
     writer.into_inner()
 }
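The key structural change in `serialize_features` above is wrapping the column writes in `features.chunks(65536)`, so each parquet row group holds at most 65536 rows instead of the whole result set. A self-contained sketch of the chunking arithmetic (the 200,000-row input here is illustrative, not from the commit):

```rust
fn main() {
    // 200_000 synthetic "features" split into row groups of at most 65_536 rows
    let rows: Vec<u32> = (0..200_000).collect();
    let chunks: Vec<&[u32]> = rows.chunks(65536).collect();

    // ceil(200_000 / 65_536) = 4 row groups; only the last one is partial
    assert_eq!(chunks.len(), 4);
    assert!(chunks.iter().all(|c| c.len() <= 65536));
    assert_eq!(chunks.last().unwrap().len(), 200_000 - 3 * 65536);
}
```

Bounded row groups keep per-group memory predictable and let parquet readers skip whole groups using their column statistics.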

@@ -231,7 +233,7 @@ pub fn build_lfq_schema() -> parquet::errors::Result<Type> {
     required byte_array peptide (utf8);
     required byte_array stripped_peptide (utf8);
     required byte_array proteins (utf8);
-    required boolean decoy;
+    required boolean is_decoy;
     required float q_value;
     required byte_array filename (utf8);
     required float intensity;
@@ -258,7 +260,10 @@ pub fn serialize_lfq<H: BuildHasher>(
     if let Some(mut col) = rg.next_column()? {
         let values = areas
             .iter()
-            .map(|((peptide_idx, _), _)| database[*peptide_idx].to_string().as_bytes().into())
+            .flat_map(|((peptide_idx, _), _)| {
+                let val = database[*peptide_idx].to_string().as_bytes().into();
+                std::iter::repeat(val).take(filenames.len())
+            })
             .collect::<Vec<_>>();
 
         col.typed::<ByteArrayType>()
@@ -269,7 +274,10 @@ pub fn serialize_lfq<H: BuildHasher>(
     if let Some(mut col) = rg.next_column()? {
         let values = areas
             .iter()
-            .map(|((peptide_idx, _), _)| database[*peptide_idx].sequence.as_ref().into())
+            .flat_map(|((peptide_idx, _), _)| {
+                let val = database[*peptide_idx].sequence.as_ref().into();
+                std::iter::repeat(val).take(filenames.len())
+            })
             .collect::<Vec<_>>();
 
         col.typed::<ByteArrayType>()
@@ -280,11 +288,12 @@ pub fn serialize_lfq<H: BuildHasher>(
     if let Some(mut col) = rg.next_column()? {
         let values = areas
             .iter()
-            .map(|((peptide_idx, _), _)| {
-                database[*peptide_idx]
+            .flat_map(|((peptide_idx, _), _)| {
+                let val = database[*peptide_idx]
                     .proteins(&database.decoy_tag, database.generate_decoys)
                     .as_str()
-                    .into()
+                    .into();
+                std::iter::repeat(val).take(filenames.len())
             })
             .collect::<Vec<_>>();
 
@@ -296,7 +305,7 @@ pub fn serialize_lfq<H: BuildHasher>(
     if let Some(mut col) = rg.next_column()? {
         let values = areas
             .iter()
-            .map(|((_, decoy), _)| *decoy)
+            .flat_map(|((_, decoy), _)| std::iter::repeat(*decoy).take(filenames.len()))
             .collect::<Vec<_>>();
 
         col.typed::<BoolType>().write_batch(&values, None, None)?;
@@ -306,7 +315,7 @@ pub fn serialize_lfq<H: BuildHasher>(
     if let Some(mut col) = rg.next_column()? {
         let values = areas
             .iter()
-            .map(|((_, _), (peak, _))| peak.q_value)
+            .flat_map(|((_, _), (peak, _))| std::iter::repeat(peak.q_value).take(filenames.len()))
            .collect::<Vec<_>>();
 
         col.typed::<FloatType>().write_batch(&values, None, None)?;
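The LFQ fix in the hunks above (the "fix LFQ-parquet" of the commit title) replaces `map` with `flat_map` plus `repeat(..).take(filenames.len())`, so that each per-peptide value is repeated once per input file and every column ends up with `peptides × filenames` rows, matching the per-file intensity column. A self-contained sketch of that broadcasting pattern (the helper name and the sample data are illustrative, not Sage's actual code):

```rust
/// Broadcast one per-peptide value across `n_files` filenames so every
/// column in the long-format LFQ table has peptides.len() * n_files rows.
fn broadcast<T: Clone>(per_peptide: &[T], n_files: usize) -> Vec<T> {
    per_peptide
        .iter()
        .flat_map(|v| std::iter::repeat(v.clone()).take(n_files))
        .collect()
}

fn main() {
    let peptides = ["PEPTIDE", "LESLIEK"];
    let filenames = ["a.mzML", "b.mzML", "c.mzML"];
    let col = broadcast(&peptides, filenames.len());
    // each peptide repeated once per file, in file order
    assert_eq!(
        col,
        ["PEPTIDE", "PEPTIDE", "PEPTIDE", "LESLIEK", "LESLIEK", "LESLIEK"]
    );
}
```

Without the repetition, the peptide, protein, decoy, and q-value columns would be `filenames.len()` times shorter than the intensity column, and the row group would fail to serialize as a rectangular table.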
