ADR: Unified record syntax for process inputs/outputs#6912
pditommaso wants to merge 4 commits into master
Conversation
Propose using the record() function-call notation uniformly for both
process inputs and outputs, replacing the asymmetric Record { ... }
block syntax currently used in inputs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
```nextflow
process FASTQC {
    input:
        sample: Sample = record(id: String, fastq_1: Path, fastq_2: Path)
}
```

Worth mentioning that the `= record(...)` part here is redundant if you already declare the input with a record type (`Sample`)

From our discussion -- it could be useful to have both as a way to future-proof against changes to the `Sample` type. For example, if someone adds some `Path` fields to `Sample`, you don't necessarily want `FASTQC` to automatically start staging them as inputs

But at that point, there is no point in using `Sample` at all -- you might as well just use the generic `Record` type:

```nextflow
sample: Record = record(id: String, fastq_1: Path, fastq_2: Path)
```
> This asymmetry means the same concept (a record) is expressed with two different syntactic forms depending on context. The block syntax `Record { ... }` exists only in process input declarations and has no counterpart elsewhere in the language. Meanwhile, the `record()` function call used in outputs is already a general-purpose construct usable in any expression context.
> This asymmetry means the same concept (a record) is expressed with two different syntactic forms depending on context.

I think the asymmetry is intentional. It helps distinguish between two slightly different concepts -- an input vs an output. Inputs and outputs have slightly different behaviors, especially in a process.

- Inputs are receiving values from an external source, validating them against a declared structure, and staging them into the task environment
- Outputs are collecting values from the task environment and pushing them into an output structure

So, process inputs and outputs are similar in some ways and different in others. The question is whether it is better to highlight their similarities or their differences
For what it's worth, the language server actually does highlight their similarities. When hovering on a process call, the process hover hint will be rendered as:

```nextflow
process FASTQC {
    input:
        sample: Record {
            id: String
            fastq_1: Path
            fastq_2: Path
        }
    output:
        result: Record {
            id: String
            html: Path
            zip: Path
        }
}
```
I always parsed the difference as: the input is a type declaration (it has the same format, just without a leading `record`, I think?) that creates an anonymous record type for the input, while the output instantiates a generic record instance, which can optionally be assigned.

This makes sense to me, I think.
Agreed, I think this syntax is clearer and nicely separates out typing from instantiation.
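A minimal sketch of that separation, for a hypothetical process (the process name, fields, and the output assignment form are illustrative, assumed from the examples in this thread -- the input block declares a type, while the output instantiates a record value):

```nextflow
process ALIGN {
    input:
        sample: Record {       // type declaration: field names and types only
            id: String
            reads: Path
        }
    output:
        // instantiation: a generic record value, optionally assigned
        result = record(id: sample.id, bam: file("${sample.id}.bam"))

    script:
    """
    align ${sample.reads} > ${sample.id}.bam
    """
}
```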
I understand that's intentional, but my claim is that it's unneeded cognitive load. Why, as a user, should I have to think in two different notations to express the structure of inputs and outputs?

Above all, the central point is how the syntax can evolve while keeping some structural continuity with the existing syntax, so that as a Nextflow developer I feel comfortable with it without needing to learn new concepts.
> I feel comfortable with it without the need to learn new concepts.
Static typing and records are new concepts. There is no getting around that.
The question is, given that users will have to learn the new syntax for parameters, type annotations, record types, etc, is it better for the process inputs/outputs to be consistent with the new syntax everywhere else or consistent with the legacy process syntax?
If legacy continuity gets in the way of expressing inputs/outputs in the new system, surely the latter must take precedence
At the same time, we might be able to achieve both...
The best case I can see is to ditch the assignment and just have a destructor -> constructor pattern:
**legacy**

```nextflow
input:
    tuple val(id), path(fastq_1), path(fastq_2)
output:
    tuple val(id), path("fastqc_${id}_logs")
```

**typed (tuple)**

```nextflow
input:
    tuple(id: String, fastq_1: Path, fastq_2: Path)
output:
    tuple(id, file("fastqc_${id}_logs"))
```

**typed (record)**

```nextflow
input:
    record(
        id: String,
        fastq_1: Path,
        fastq_2: Path
    )
output:
    record(
        id: id,
        fastqc: file("fastqc_${id}_logs")
    )
```

This works well with static typing and the legacy continuity is decent.
My only concerns at that point would be:
- The commas can get ugly with type annotations, for example:

  ```nextflow
  input:
      record(
          id: String,
          single_end: Boolean,
          reads: List<Path>,  // ugly
          args: String?,      // ugly
          prefix: String?
      )
  ```

  This is partly what led me to the block syntax, which doesn't require commas.
- If the new syntax is too similar, that could cause its own confusion. It might be hard to distinguish between typed vs legacy syntax. This is the concern we had with typed workflows.
> - Good, because reuses existing assignment and type annotation patterns.
> - Good, because `record()` is already a general-purpose function, no new syntax needed.
> - Good, because type annotations follow standard rules — `sample: Sample = record(...)` works like any typed assignment.
> - Bad, because input `record()` arguments are types rather than values, which is a different usage of the function.
This is my main issue with Option 3. The double usage of record() here is subtle and more likely to confuse users and agents
Whereas the syntax I ended up using for record inputs is easy to explain as an "inline record type"
So I think with Option 3 you are ultimately trading one form of double usage for another, without much benefit
My counter-argument is that it's aligned with the semantics of the record constructor. Also, the same double usage is done for `file()`. Above all, the central point is the continuity with the existing notation: `val`, `path`, `tuple`, etc.
Function overloading like with file() is fine if done cautiously, but what I'm talking about is syntax overloading. You are hijacking the assignment and function call syntax for different purposes, which creates unnecessary cognitive load
The function call by itself could make sense as a reverse constructor pattern:

```nextflow
input:
    tuple(id: String, fastq_1: Path, fastq_2: Path)
    record(id: String, fastq_1: Path, fastq_2: Path)
```

That would also have better continuity with the legacy syntax
Great discussion overall. I would be keen to incorporate this somehow into the record type ADR, since we did discuss some of these options throughout the process but I didn't include them as alternatives in the original ADR. It would be good to document these alternatives so that we have a clear rationale for the final syntax.
> - Good, because symmetric — same block form on both sides.
> - Bad, because the output block mixes type declarations with value assignments (`Path = file(...)`).
Is this not more akin to a type definition with some defaults? I know this may not be something that really exists in Java/Groovy but it makes perfect sense when thinking of record types as akin to something like Pydantic, where you would declare a record with a default using similar syntax.
e.g.

```python
class DemoModel(BaseModel):
    ts: datetime = Field(default_factory=datetime.now)
```
Pydantic is an interesting comparison here. I think option 3 is essentially the same as Pydantic -- using function calls to create type definitions
Pydantic is constrained by Python syntax, so for them the best option is to use assignments and function calls, even though it conflates the meaning of this syntax (creating a value vs declaring a type). But this is probably still much better than creating a custom DSL that users would have to learn alongside Python
We have no such constraint in Nextflow, so we can differentiate these type declarations with different syntax
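The conflation is visible even with Python's standard-library dataclasses, which use the same pattern Pydantic borrows: `name: Type = expr` inside a class body is a field declaration, while the identical syntax at module level is an ordinary assignment that creates a value. A stdlib sketch (standing in for the Pydantic example above, so it runs without third-party dependencies):

```python
from dataclasses import dataclass, field, fields

# Inside a class body, "name: Type = expr" declares a typed field with a
# default; at module level the same syntax is a plain annotated assignment.
# Same syntax, two meanings -- the conflation discussed above.

@dataclass
class Sample:
    sample_id: str = "s1"
    reads: list = field(default_factory=list)

# field() did not produce a value for `reads` -- it configured the field
# declaration; the dataclass machinery supplies the actual list per instance.
s = Sample()
field_names = [f.name for f in fields(Sample)]
print(field_names)   # ['sample_id', 'reads']
print(s.reads)       # []
```

Nextflow, not being constrained by an existing expression grammar, can give the type-declaration case its own syntax instead of reusing assignment.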
Extend Option 3 to establish a uniform constructor notation `name = constructor(...)` that applies to both `record()` and `tuple()` across process inputs and outputs. Highlights the migration path from classic tuple syntax through typed tuples to records.

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
4d22995 to 30b8440
Updated ADR: unified constructor notation for process inputs/outputs. Key changes in this update:

The core value: users learn one pattern and apply it everywhere. The only choice is positional (`tuple()`) vs named (`record()`).
Co-authored-by: Ben Sherman <bentshermann@gmail.com> Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Summary
- Propose using the `record()` function-call notation uniformly for both process inputs and outputs
- Replaces the asymmetric `Record { ... }` block syntax currently used only in input declarations
- Reuses the assignment (`=`) and type annotation (`: Type`) patterns already present in process I/O

Current (asymmetric):
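A sketch of the current asymmetric form, reconstructed from the examples earlier in this thread (field names and the output assignment form are illustrative):

```nextflow
input:
    sample: Record {
        id: String
        fastq_1: Path
        fastq_2: Path
    }
output:
    result = record(id: sample.id, html: file("*.html"))
```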
Proposed (uniform):
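A sketch of the proposed uniform form, assuming the `record()` call notation discussed in this thread on both sides (field names illustrative):

```nextflow
input:
    sample: Record = record(id: String, fastq_1: Path, fastq_2: Path)
output:
    result: Record = record(id: sample.id, html: file("*.html"))
```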
Test plan
🤖 Generated with Claude Code