Conversation
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
// inputs
Maybe it will be more clear if the input parameters are also in an inputs block or something similar? It's pretty easy to migrate the current schema to that
That was the original plan, but we wanted to avoid breaking the JSON-schema validity of the file. Basically because it's useful to be able to load it and throw it into any JSON-schema validation library (unrecognised keys are typically ignored).
Maybe a higher level use of definitions could help here? Like having one definitions for inputs, one for outputs, one for required...
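A rough sketch of what that higher-level use of definitions might look like (hypothetical layout, not part of any agreed spec):

```json
{
  "$defs": {
    "inputs": {
      "type": "object",
      "properties": {
        "input": { "type": "string", "format": "file-path" }
      }
    },
    "outputs": {
      "type": "object",
      "properties": {
        "multiqc_report": { "type": "string" }
      }
    }
  },
  "allOf": [
    { "$ref": "#/$defs/inputs" }
  ]
}
```

Because unrecognised keys are ignored by standard validators, such a layout would stay a valid JSON schema while grouping inputs and outputs under separate definitions.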
Depending on how the auto-generation works out, it might be feasible to maintain an actual inputs section that mirrors the outputs, while also generating the parameter schema
In theory, the user wouldn't need to worry about the resulting duplication because they shouldn't modify those bits anyway. They would only be adding things to the JSON-schema part
Sounds good, let me know if you have something for me to test. I'd love to give it a go
- The `manifest` config options are effectively converted directly to JSON with only nominal changes, such as `manifest.name` -> `title` (preserving the structure of the original nf-core schema) and `nextflowVersion` -> `requires.nextflow` (leaving space for module versions in the future).
- The parameter schema follows the structure of the nf-core schema, which defines *parameter groups* under `$defs` and combines them using JSON schema properties such as `allOf`. This section should be generated with sensible defaults, since some properties (e.g. group name) cannot be specified in pipeline code.
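For reference, the grouped nf-core structure being described looks roughly like this (group and property names are illustrative):

```json
{
  "$defs": {
    "input_output_options": {
      "title": "Input/output options",
      "type": "object",
      "properties": {
        "input": { "type": "string", "format": "file-path" },
        "outdir": { "type": "string" }
      },
      "required": ["input"]
    }
  },
  "allOf": [
    { "$ref": "#/$defs/input_output_options" }
  ]
}
```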
Ungrouped parameters are also allowed in the schema, do we still want to support these or would you prefer to have everything in definitions?
Good catch - I was thinking this yesterday and meant to leave a comment but forgot. Yes we need to accept top-level ungrouped properties (ungrouped params).
{
"$schema": "http://json-schema.org/draft-07/schema",
"$id": "https://raw.githubusercontent.com/YOUR_PIPELINE/master/nextflow_schema.json",
"title": "Nextflow pipeline parameters",
"description": "This pipeline uses Nextflow and processes some kind of data. The JSON Schema was built using the nf-core pipeline schema builder.",
"type": "object",
"properties": {
"some_parameter": {
"type": "string"
}
}
}
Good to know. In that case I think Nextflow will generate ungrouped parameters by default and then preserve whatever groups the user adds
Yes that's how we do it in nf-core too
nvnieuwk left a comment
I really, really like this!
Will this be easy to automate using a built-in Nextflow command? Or will this require some manual work for the users? How will we make it easy for developers in that case?
Yes, the idea is that Nextflow will validate it at runtime and fail with an error if the schema doesn't match the internally defined spec. It will only check basic keys and it'll allow customisation. @bentsherman should we add this to the ADR?
Perfect, that would make the transition between nf-schema and built-in Nextflow function pretty seamless. I assume a migration script will also be available to migrate the JSON schema one way or another
Yes, there is a basic description of the spec generation / sync in the ADR but it would be good to flesh it out a bit more
{
  // metadata
I'm personally not a fan of this kind of information duplication in what are essentially all source files (assuming the jsonschema is still providing some of the validation logic and is not just a representation of it). If this is the route we are headed down I think it would be much more preferable for nextflow_spec.json to become the source of truth and this information be pulled over by nextflow when it's needed in the compiled source code itself (consider how e.g. Python's pyproject.toml or Rust's Cargo.toml work).
I'd almost go further and question a bit the logic of combining metadata and source-code validation logic (the params schema) into a single document? I guess it's mostly about a convenient single source for e.g. Seqera Platform launch forms?
The pipeline spec would be a source of truth for external systems, but not for Nextflow itself. The source-of-truth for Nextflow is the pipeline code
I suppose Nextflow could use the pipeline spec to perform additional validation that goes beyond what can be defined in the params block, but I would rather avoid that if possible
E.g. instead of using the pipeline spec to validate file extensions, we could add something like "blob types" to the Nextflow language that allow you to define specializations of the Path type such as Fastq, Bam, etc
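A sketch of how such "blob types" might read in the params block (entirely hypothetical syntax; `Fastq` and `Bam` are not existing Nextflow types):

```
params {
    // hypothetical: Fastq and Bam as specializations of Path,
    // letting Nextflow check file extensions from the type alone
    reads: Fastq
    alignment: Bam = null
}
```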
The problem as I see it right now is that using jsonschema via nf-schema allows us to validate cleanly things beyond the simple type of the param. We can for example use min and max to bound an integer param. We can check string lengths. We can check things like the length of an array.
Unless the intention is to add ways to constrain the param types in Nextflow like that, this validation will need to continue, and doing it through nf-schema will require the params part of this new spec document?
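For context, the constraints mentioned above map onto standard JSON Schema validation keywords (`minimum`/`maximum` rather than literal `min`/`max`), e.g.:

```json
{
  "num_iterations": {
    "type": "integer",
    "minimum": 1,
    "maximum": 1000
  },
  "sample_name": {
    "type": "string",
    "minLength": 1,
    "maxLength": 64
  },
  "tools": {
    "type": "array",
    "maxItems": 5
  }
}
```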
We have considered adding declarative ways to validate those kinds of things, for example:
params {
    num_iterations: Integer = 100 {
        min 1
        max 1000
    }
}

But Paolo and I weren't too keen on doing this, at least not yet, since it would add a lot of noise to the params block. Meanwhile, you can always do this kind of validation with regular code:
workflow {
    if( params.num_iterations < 1 || params.num_iterations > 1000 ) {
        error "Parameter `num_iterations` should be between 1 and 1000"
    }
    // ...
}

So it seems like we have the tools to validate everything that nf-schema validates through Nextflow code. But I'd like to wait and see how people use these features before we make any more drastic changes
The declarative method would be much better in my personal opinion.
I don't know how others feel but I was under the impression that part of the benefit of nf-schema was not having to use these kind of conditional checks in code - prior to the original nf-validation you would see huge chunks of if-else blocks at the start of pipelines.
I think this is a bigger consideration too if there is any intention for nextflow to use record types for native samplesheet validation and conversion to lists of records (which I got the sense from some teasers from Phil is something that could be coming) because there is often more of this sort of validation needed there, and do we really want to be doing something that potentially looks like:
inputs = samplesheetToRecords('samplesheet.json')
def errors = [:]
inputs.map { rec ->
    errors["${rec.id}"] = []
    if( rec.intval < 0 ) {
        errors["${rec.id}"].push("intval '${rec.intval}' must be greater than 0")
    }
    if( !(rec.someotherval in ["a", "b", "c"]) ) {
        errors["${rec.id}"].push("someotherval '${rec.someotherval}' must be one of 'a', 'b', or 'c'")
    }
    if( rec.finalval instanceof Integer && rec.finalval > 0 ) {
        errors["${rec.id}"].push("finalval '${rec.finalval}' must be less than 0 if an integer")
    } else if( rec.finalval instanceof String && !(rec.finalval in ["ont", "pacbio"]) ) {
        errors["${rec.id}"].push("finalval '${rec.finalval}' must be one of 'ont' or 'pacbio' if a string")
    }
    // <repeat for every necessary complex check>
}
This also has the effect of de-coupling the type level checking from the value level checking, meaning if you ever modified the type you'd need to then locate the value check and ensure it was still compatible.
Note also how in the last example where if a value can take multiple types (not sure if type unions are actually supported in the native nextflow type-casting yet) you would need to check the type before checking the value.
The above issues will also apply to params just without needing to be in the map call.
All fair points. We're just trying to take it one step at a time. The params block will provide much of the type-level validation, including samplesheets. But we have to find the appropriate line between declarative vs imperative validation in Nextflow, and that will take time, so I don't want to over-commit on anything yet
// outputs
"output": {
I might be a bit fuzzy here, but this looks like a sub-schema (multiqc_report looks to have standard keys for an item in properties) but only semi-defined as such - no top-level object type, no properties, etc.
What is the purpose of this section beyond documentation? If it's intended for e.g. validating outputs then it would be better if it was a properly defined sub-schema I think?
Right now, each output can declare either a type (for simple outputs like numbers or files) or a schema (for complex outputs like a collection of samples)
In theory, we could embed the schema (e.g. for samples) directly in the pipeline spec. But if it is useful for that schema to be used on its own then it might make more sense to keep it in a separate file, as we do for samplesheet schemas.
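A sketch of the two styles of output declaration being described, a simple type versus a reference to an external schema file (key names and paths are assumptions, not the finalized spec):

```json
"output": {
  "multiqc_report": {
    "description": "MultiQC report",
    "type": "string"
  },
  "samples": {
    "description": "Processed samples",
    "schema": "assets/schema_samples.json"
  }
}
```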
This PR adds an ADR for the pipeline spec based on the latest investigations.
Note that the schema for pipeline specs has already been bootstrapped based on the original nf-core schema: https://github.com/nextflow-io/schemas/blob/main/pipeline/v1/schema.json
nextflow-io/schemas: adding new properties to pipeline schema