382 changes: 382 additions & 0 deletions adr/20260312-record-syntax-unification.md
@@ -0,0 +1,382 @@
# Unified constructor notation for process inputs and outputs

- Authors: Paolo Di Tommaso
- Status: proposed
- Deciders: Paolo Di Tommaso, Ben Sherman
- Date: 2026-03-12
- Updated: 2026-03-20
- Tags: lang, records, tuples, syntax

Technical Story: Follow-up to [Record types ADR](20260306-record-types.md)

## Summary

The current record types implementation uses two different syntactic forms for records in process inputs (block syntax) vs outputs (function-call syntax). This RFC proposes a **uniform constructor notation** — `name = constructor(...)` with optional type annotation — that applies to both `record()` and `tuple()` across inputs and outputs. This establishes a single syntactic pattern for all structured types in process definitions, provides a natural migration path from tuples to records, and ensures consistency across the language.

## Problem Statement

The accepted record types ADR ([20260306-record-types](20260306-record-types.md)) introduces two distinct syntactic forms for records within process definitions:

**Input** — a `Record { ... }` block syntax unique to inputs:
```nextflow
process FASTQC {
    input:
    sample: Record {
        id: String
        fastq_1: Path
        fastq_2: Path
    }
    ...
}
```

**Output** — a `record()` function call:
```nextflow
process FASTQC {
    ...
    output:
    record(id: sample.id, html: file('*.html'), zip: file('*.zip'))
}
```

This asymmetry means the same concept (a record) is expressed with two different syntactic forms depending on context. The block syntax `Record { ... }` exists only in process input declarations and has no counterpart elsewhere in the language. Meanwhile, the `record()` function call used in outputs is already a general-purpose construct usable in any expression context.
Member

@bentsherman bentsherman Mar 12, 2026

> This asymmetry means the same concept (a record) is expressed with two different syntactic forms depending on context.

I think the asymmetry is intentional. It helps distinguish between two slightly different concepts -- an input vs an output. Inputs and outputs have slightly different behaviors, especially in a process.

  • Inputs are receiving values from an external source, validating them against a declared structure, and staging them into the task environment

  • Outputs are collecting values from the task environment and pushing them into an output structure

So, process inputs and outputs are similar in some ways and different in others. The question is whether it is better to highlight their similarities or their differences.

Member

@bentsherman bentsherman Mar 12, 2026

For what it's worth, the language server actually does highlight their similarities. When hovering on a process call, the process hover hint will be rendered as:

process FASTQC {
    input:
    sample: Record {
        id: String
        fastq_1: Path
        fastq_2: Path
    }

    output:
    result: Record {
        id: String
        html: Path
        zip: Path
    }
}


I always parsed the difference as being the input was a type declaration (it has the same format except without a leading record I think?) that would be used to create an anonymous record type for input, and the output was instantiating a generic record instance, which can optionally be assigned?

This makes sense to me I think.


Agreed, I think this syntax is clearer and nicely separates out typing from instantiation.

Member Author

I understand that it's intentional, but my claim is that it adds unnecessary cognitive load. Why should I, as a user, have to think in two different notations to express the structure of inputs and outputs?

Above all, the central point is how the syntax can evolve while keeping some structural continuity with the existing syntax, so that as a Nextflow developer I feel comfortable with it without the need to learn new concepts.

Member

> I feel comfortable with it without the need to learn new concepts.

Static typing and records are new concepts. There is no getting around that.

The question is, given that users will have to learn the new syntax for parameters, type annotations, record types, etc, is it better for the process inputs/outputs to be consistent with the new syntax everywhere else or consistent with the legacy process syntax?

If legacy continuity gets in the way of expressing inputs/outputs in the new system, surely the latter must take precedence.

At the same time, we might be able to achieve both...

The best case I can see is to ditch the assignment and just have a destructor -> constructor pattern:

legacy

    input:
    tuple val(id), path(fastq_1), path(fastq_2)

    output:
    tuple val(id), path("fastqc_${id}_logs")

typed (tuple)

    input:
    tuple(id: String, fastq_1: Path, fastq_2: Path)

    output:
    tuple(id, file("fastqc_${id}_logs"))

typed (record)

    input:
    record(
        id: String,
        fastq_1: Path,
        fastq_2: Path
    )

    output:
    record(
        id: id,
        fastqc: file("fastqc_${id}_logs")
    )

This works well with static typing and the legacy continuity is decent.

Member

My only concerns at that point would be:

  • The commas can get ugly with type annotations, for example:
    input:
    record(
        id: String,
        single_end: Boolean,
        reads: List<Path>, // ugly
        args: String?, // ugly
        prefix: String?
    )

This is partly what led me to the block syntax which doesn't require commas.

  • If the new syntax is too similar, that could cause its own confusion. It might be hard to distinguish between typed and legacy syntax. This is the concern we had with typed workflows.


## Goals

- **Uniform constructor notation** — establish `name = constructor(...)` as the single syntactic pattern for all structured types (`record()` and `tuple()`) in process inputs and outputs.
- **Syntactic consistency** — use the same notation for records and tuples across inputs and outputs, eliminating context-dependent forms.
- **Alignment with existing syntax** — reuse assignment (`=`) and type annotation (`: Type`) patterns already present in process I/O, rather than introducing new block syntax.
- **Migration continuity** — provide a natural upgrade path from `tuple()` to `record()` by keeping the notation identical, so users only change the keyword to gain named-field semantics.
- **Standard type semantics** — record and tuple assignments should follow the same type compatibility rules as any other typed assignment in the language.

## Non-goals

- Changing the top-level `record` type definition syntax — the `record Name { field: Type }` declaration form is a type-level construct and is not affected by this proposal.
- Changing the `record()` function runtime behavior or the `RecordMap` implementation.
- Removing support for external type references (e.g. `sample: Sample`).
- Changing the runtime behavior of tuples — tuples retain their positional semantics.

## Considered Options

### Option 1: Current syntax (status quo)

Input uses a dedicated block syntax, output uses the `record()` function call:

```nextflow
process FASTQC {
    input:
    sample: Record {
        id: String
        fastq_1: Path
        fastq_2: Path
    }

    output:
    record(id: sample.id, html: file('*.html'), zip: file('*.zip'))
}
```

- Good, because input block syntax mirrors the top-level `record` definition.
- Bad, because it uses two different notations for the same concept in the same process definition.

### Option 2: Block syntax for both inputs and outputs

Use `record { ... }` blocks in both input and output:

```nextflow
process FASTQC {
    input:
    record sample {
        id: String
        fastq_1: Path
        fastq_2: Path
    }

    output:
    record {
        id: String = sample.id
        html: Path = file('*.html')
        zip: Path = file('*.zip')
    }
}
```

- Good, because symmetric — same block form on both sides.
- Bad, because the output block mixes type declarations with value assignments (`Path = file(...)`).

Is this not more akin to a type definition with some defaults? I know this may not be something that really exists in Java/Groovy but it makes perfect sense when thinking of record types as akin to something like Pydantic, where you would declare a record with a default using similar syntax.

e.g.

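from datetime import datetime
from pydantic import BaseModel, Field

class DemoModel(BaseModel):
    # default_factory is called at instantiation time to produce the default
    ts: datetime = Field(default_factory=datetime.now)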

Member

Pydantic is an interesting comparison here. I think option 3 is essentially the same as Pydantic -- using function calls to create type definitions

Pydantic is constrained by Python syntax, so for them the best option is to use assignments and function calls, even though it conflates the meaning of this syntax (creating a value vs declaring a type). But this is probably still much better than creating a custom DSL that users would have to learn alongside Python

We have no such constraint in Nextflow, so we can differentiate these type declarations with different syntax

- Bad, because block syntax in process I/O diverges from the function-call style already established for `record()`.

### Option 3: Uniform constructor notation for `record()` and `tuple()`

Establish `name = constructor(...)` as the single syntactic pattern for all structured types in process I/O. Both `record()` and `tuple()` follow the same three-tier notation — bare, assignment, and typed assignment:

**Record:**

```nextflow
// bare — anonymous output
record(id: sample.id, html: file('*.html'))

// assignment
result = record(id: sample.id, html: file('*.html'))

// typed assignment
result: QcResult = record(id: sample.id, html: file('*.html'))
```

**Tuple:**

```nextflow
// bare — anonymous output
tuple(id, file('*.bam'))

// assignment
out = tuple(id, file('*.bam'))

// typed assignment
out: Tuple<String,Path> = tuple(id, file('*.bam'))
```

The same pattern applies uniformly to inputs:

```nextflow
process FASTQC {
    input:
    sample = record(id: String, fastq_1: Path, fastq_2: Path)

    output:
    result = record(id: sample.id, html: file('*.html'), zip: file('*.zip'))
}
```

```nextflow
process ALIGN {
    input:
    in = tuple(id: String, fastq: Path)

    output:
    out = tuple(id, file('*.bam'))
}
```

With optional explicit type annotations:

```nextflow
process FASTQC {
    input:
    sample: Sample = record(id: String, fastq_1: Path, fastq_2: Path)
Member

@bentsherman bentsherman Mar 12, 2026

Worth mentioning that the = record(...) part here is redundant if you already declare the input with a record type (Sample)

Member

From our discussion -- it could be useful to have both as a way to future-proof against changes to the Sample type. For example, if someone adds some Path fields to Sample, you don't necessarily want FASTQC to automatically start staging them as inputs

But at that point, there is no point to using Sample at all -- you might as well just use the generic Record type:

    sample: Record = record(id: String, fastq_1: Path, fastq_2: Path)


    output:
    result: QcResult = record(id: sample.id, html: file('*.html'), zip: file('*.zip'))
}
```

- Good, because same notation on both sides — `name = constructor(...)` — for both `record()` and `tuple()`.
- Good, because establishes a uniform constructor notation across all structured types.
- Good, because reuses existing assignment and type annotation patterns.
- Good, because `record()` and `tuple()` are already general-purpose functions, no new syntax needed.
- Good, because type annotations follow standard rules — `sample: Sample = record(...)` works like any typed assignment.
- Good, because the migration from `tuple()` to `record()` requires only changing the keyword — the notation is identical.
- Bad, because input `record()` and `tuple()` arguments are types rather than values, which is a different usage of the function.
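
One point raised in review is that pairing a named type with an inline `record(...)` declaration could pin down which fields a process actually stages, so that adding `Path` fields to `Sample` later does not silently change what `FASTQC` receives. A rough Python model of that idea follows. `fields_to_stage` is a hypothetical helper, `PurePosixPath` stands in for Nextflow's `Path`, and the selection rule (stage only the `Path` fields named inline) is an assumption drawn from that discussion, not specified behavior.

```python
from pathlib import PurePosixPath


def fields_to_stage(value: dict, declared: dict) -> dict:
    """Select only the Path-typed fields named in the inline declaration.

    Hypothetical model: fields present on the value but absent from the
    declaration are ignored, even when they hold paths.
    """
    return {
        name: value[name]
        for name, typ in declared.items()
        if typ is PurePosixPath and name in value
    }


if __name__ == "__main__":
    # The upstream Sample type has grown an extra Path field ("index").
    sample = {
        "id": "s1",
        "fastq_1": PurePosixPath("s1_1.fastq"),
        "fastq_2": PurePosixPath("s1_2.fastq"),
        "index": PurePosixPath("s1.idx"),
    }
    inline = {"id": str, "fastq_1": PurePosixPath, "fastq_2": PurePosixPath}
    print(sorted(fields_to_stage(sample, inline)))
```

Under this assumption the inline declaration acts as an allow-list: the process keeps receiving the two FASTQ files regardless of how the shared type evolves.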

## Solution or decision outcome

**Option 3**: Establish a **uniform constructor notation** — `name: Type = constructor(...)` — that applies to both `record()` and `tuple()` across process inputs and outputs. This eliminates the need for context-specific syntax forms and provides a natural migration path from tuples to records.

## Rationale & discussion

### Uniform constructor notation

The key insight is that both `record()` and `tuple()` are constructors, and everything else is standard Nextflow assignment and type annotation. This establishes a single syntactic pattern for all structured types in process definitions:

```
name = constructor(...) // assignment
name: Type = constructor(...) // typed assignment
constructor(...) // bare (anonymous output)
```

This pattern applies uniformly regardless of:
- **Constructor type** — `record()` or `tuple()`
- **Context** — input or output
- **Whether a type annotation is present**

No dedicated block syntax is needed. No context-dependent forms exist. Every structured input or output follows the same shape.

### Syntax pattern

The unified pattern is `name: Type = constructor(...)` for both inputs and outputs, for both records and tuples:

- **Record input**: `sample = record(id: String, fastq_1: Path, fastq_2: Path)` — declares the fields and their types being received.
- **Record output**: `result = record(id: sample.id, html: file('*.html'))` — declares the fields and their values being produced.
- **Tuple input**: `in = tuple(id: String, fastq: Path)` — declares the components and their types being received.
- **Tuple output**: `out = tuple(id, file('*.bam'))` — declares the components and their values being produced.

The only difference is what goes inside the constructor call — types on input (declaring structure), expressions on output (producing values). This parallels how assignment works elsewhere: the left side declares, the right side provides.

### Tuple and record: same notation, different semantics

The notation is identical for both constructors. The semantic difference is positional vs named:

| | `tuple()` | `record()` |
|---|---|---|
| Field access | Positional (`in[0]`) and named (`in.id`) | Named only (`sample.id`) |
| Order | Significant | Not significant |
| Duck typing | No | Yes |
| Extra fields | No | Yes (structural subtyping) |

This means migrating from tuple to record requires only changing the keyword — the surrounding notation stays the same:

```nextflow
// Tuple — positional semantics
in = tuple(id: String, fastq: Path)

// Record — named semantics (just change the keyword)
in = record(id: String, fastq: Path)
```
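
The semantic differences in the table above can be modelled outside Nextflow. The Python sketch below only illustrates positional versus named matching. It assumes that a record is compatible when every declared field is present with the declared type, and that a tuple is compatible only when arity, order, and types all line up. `record_matches` and `tuple_matches` are hypothetical helper names, and `PurePosixPath` stands in for Nextflow's `Path`; none of this reflects the actual type checker.

```python
from pathlib import PurePosixPath

# Hypothetical model of the table's rules; not Nextflow's implementation.


def record_matches(value: dict, fields: dict) -> bool:
    """Named matching: every declared field must be present with the
    declared type. Extra fields and field order are ignored."""
    return all(
        name in value and isinstance(value[name], typ)
        for name, typ in fields.items()
    )


def tuple_matches(value: tuple, components: list) -> bool:
    """Positional matching: same length, same order, same types."""
    return len(value) == len(components) and all(
        isinstance(item, typ) for item, typ in zip(value, components)
    )


if __name__ == "__main__":
    fastq = PurePosixPath("s1.fastq")

    # Record: order is not significant and an extra field is tolerated.
    print(record_matches({"fastq": fastq, "id": "s1", "lane": 2},
                         {"id": str, "fastq": PurePosixPath}))
    # Tuple: swapping the components breaks the match.
    print(tuple_matches((fastq, "s1"), [str, PurePosixPath]))
```

In this model, changing the keyword from `tuple` to `record` corresponds to switching from `tuple_matches` to `record_matches`: the declaration stays the same while the matching rule becomes more permissive.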

### Continuity with current tuple syntax

The typed process syntax already uses `tuple()` as a function-call constructor in outputs:

```nextflow
// Current typed output syntax
bam = tuple(id, file('*.bam'))
bai = tuple(id, file('*.bai'))
```

Option 3 extends this established pattern to inputs and applies the same pattern to `record()`. Users who already write `tuple()` in outputs understand the idiom — `record()` works the same way.

The migration path from classic DSL2 through the unified notation is:

```nextflow
// Classic DSL2
tuple val(id), path(fastq)

// Typed — uniform constructor notation
in = tuple(id: String, fastq: Path)

// Record — upgrade to named semantics when ready
in = record(id: String, fastq: Path)
```

Each step adds expressiveness without breaking the previous mental model.

### Type annotations

Type annotations are optional and follow standard semantics:

```nextflow
// Inferred type from record fields
sample = record(id: String, fastq_1: Path, fastq_2: Path)

// Explicit type — compiler checks compatibility with Sample
sample: Sample = record(id: String, fastq_1: Path, fastq_2: Path)

// Inferred type from tuple components
in = tuple(id: String, fastq: Path)

// Explicit type
in: Tuple<String,Path> = tuple(id: String, fastq: Path)
```

This is the same as writing `x: Integer = 42` vs `x = 42` — nothing constructor-specific about the assignment semantics.

### Alignment with existing process syntax

The proposed syntax reuses patterns that already exist in Nextflow process definitions:

| Existing pattern | Example | Constructor equivalent |
|-----------------|---------|----------------------|
| Scalar type annotation | `id: String` | `sample: Sample` |
| Assignment in output | `id = sample.id` | `result = record(...)` / `out = tuple(...)` |
| Typed assignment in output | `id: String = sample.id` | `result: QcResult = record(...)` / `out: Tuple<String,Path> = tuple(...)` |

### External type reference

When using a pre-defined record type, the syntax naturally simplifies:

```nextflow
// With inline fields
sample: Sample = record(id: String, fastq_1: Path, fastq_2: Path)

// With external type only (no inline fields needed)
sample: Sample
```

The `sample: Sample` shorthand remains valid — the `record()` call is only needed when defining fields inline.

### Full example

```nextflow
nextflow.preview.types = true

record Sample {
    id: String
    fastq_1: Path
    fastq_2: Path
}

process TOUCH {
    input:
    id: String

    output:
    result = record(id: id, fastq_1: file('*_1.fastq'), fastq_2: file('*_2.fastq'))

    script:
    """
    touch ${id}_1.fastq
    touch ${id}_2.fastq
    """
}

process FASTQC {
    input:
    sample: Sample = record(id: String, fastq_1: Path, fastq_2: Path)

    output:
    result = record(id: sample.id, html: file('*.html'), zip: file('*.zip'))

    script:
    """
    touch ${sample.id}.html
    touch ${sample.id}.zip
    """
}

workflow {
    ch_samples = TOUCH(channel.of('a', 'b', 'c'))
    ch_fastqc = FASTQC(ch_samples)
    ch_fastqc.view()
}
```

### Tuple and record coexistence

A process can use both tuples and records, with the same notation throughout:

```nextflow
process ALIGN {
    input:
    sample = record(id: String, fastq_1: Path, fastq_2: Path)

    output:
    result = record(id: sample.id, bam: file('*.bam'), bai: file('*.bai'))
}

process QUANT {
    input:
    in = tuple(id: String, bam: Path, bai: Path)

    output:
    out = tuple(id, file('quant'))

    script:
    """
    quant ${bam} ${bai} -o quant
    """
}
```

## Links

- Supersedes input syntax in [Record types ADR](20260306-record-types.md)
- Related: [Record types syntax summary](../plans/record-types-syntax-new.md)