diff --git a/docs/cache-and-resume.md b/docs/cache-and-resume.md index 97c0606e58..273da8eb3e 100644 --- a/docs/cache-and-resume.md +++ b/docs/cache-and-resume.md @@ -12,6 +12,8 @@ All task executions are automatically saved to the task cache, regardless of the The task cache is used in conjunction with the [work directory](#work-directory) to recover cached tasks in a resumed run. It is also used by the {ref}`cli-log` sub-command to query task metadata. +(cache-resume-task-hash)= + ### Task hash The task hash is computed from the following metadata: @@ -227,3 +229,7 @@ diff run_1.tasks.log run_2.tasks.log ``` You can then view the `diff` output or use a graphical diff viewer to compare `run_1.tasks.log` and `run_2.tasks.log`. + +:::{versionadded} 25.04.0 +Nextflow now has a built-in way to compare two task runs. See the {ref}`data-lineage-page` guide for details. +::: diff --git a/docs/data-lineage.md b/docs/data-lineage.md new file mode 100644 index 0000000000..63855445a8 --- /dev/null +++ b/docs/data-lineage.md @@ -0,0 +1,346 @@ +(data-lineage-page)= + +# Data lineage + +This guide shows how to get started with native provenance tracking, also known as *data lineage*, introduced in {ref}`Nextflow 25.04 `. + +:::{warning} +Data lineage is an experimental feature. It may change in future releases. +::: + +## Overview + +The *provenance* or *lineage* of a data entity, such as a file or record, is the history of computations and intermediate data that produced the entity. Data lineage is useful for verifying the integrity and reproducibility of pipeline results. + +Nextflow's built-in data lineage, when enabled, tracks all of your workflow runs, task runs, and outputs in a single place as *lineage records*. You can then query these records from the command line, or use them in a Nextflow script. Every lineage record has a unique hash called a *lineage ID* (LID) by which it is accessed. + +## Enable data lineage + +To get started, enable data lineage in your Nextflow configuraiton: + +```groovy +lineage.enabled = true +``` + +Optionally, set the location of the lineage store: + +```groovy +lineage.store.location = '.lineage' +``` + +It defaults to `.lineage` in the current directory. + +:::{tip} +Place these settings in `~/.nextflow/config` to apply them globally. +::: + +See the {ref}`config-lineage` config scope for details. + +## Generate lineage metadata + +Run a Nextflow pipeline to generate some lineage metadata. For example: + +```bash +nextflow run rnaseq-nf -profile conda +``` + +This pipeline will execute several tasks and publish several output files, all of which will be recorded in the lineage store. + +## Explore lineage + +Now that you have generated some lineage metadata, you can explore it from the command line using the {ref}`cli-lineage` command. + +First, use the `list` subcommand to list the workflow runs in the lineage store: + +```console +$ nextflow lineage list +TIMESTAMP RUN NAME SESSION ID LINEAGE ID +2025-05-09 13:28:30 CDT peaceful_blackwell 065bdc6b-89b4-42ee-92c1-2a5af37f2c50 lid://16b31030474f2e96c55f4940bca3ab64 +``` + +The *lineage ID* (LID) is the unique identifier for the workflow run and the entrypoint for exploring the lineage. + +Use the `view` subcommand to view the lineage record for the workflow run: + +```console +$ nextflow lineage view lid://16b31030474f2e96c55f4940bca3ab64 +{ + "type": "WorkflowRun", + "workflow": { + "scriptFiles": [ + ... + ], + "repository": "https://github.com/nextflow-io/rnaseq-nf", + "commitId": "86165b8c81d43a1f57363964431395152e353e56" + }, + "sessionId": "065bdc6b-89b4-42ee-92c1-2a5af37f2c50", + "name": "peaceful_blackwell", + "params": [ + ... + ], + "config": { + ... + } +} +``` + +Every workflow run is represented in the lineage store as a `WorkflowRun` record, which includes information such as the pipeline repository, revision, run name, parameters, and resolved config. + +:::{note} +The data model for every lineage record is defined in the Nextflow [source code](https://github.com/nextflow-io/nextflow/tree/master/modules/nf-lineage/src/main/nextflow/lineage/model). +::: + +The output files of a workflow run can be accessed as `lid:///`, where `` is the file path relative to the workflow output directory. + +:::{note} +Files must be published to the workflow output directory as defined by the `outputDir` config option (or `-output-dir` command line option) in order to be recorded as workflow outputs in the lineage store. +::: + +List the output directory to see the available files: + +```console +$ find results +results +results/fastqc_ggal_gut_logs +results/fastqc_ggal_gut_logs/ggal_gut_1_fastqc.html +results/fastqc_ggal_gut_logs/ggal_gut_1_fastqc.zip +results/fastqc_ggal_gut_logs/ggal_gut_2_fastqc.html +results/fastqc_ggal_gut_logs/ggal_gut_2_fastqc.zip +results/multiqc_report.html +``` + +Now, use the workflow LID and relative path to view the lineage record for an output file: + +```console +$ nextflow lineage view lid://16b31030474f2e96c55f4940bca3ab64/multiqc_report.html +{ + "type": "FileOutput", + "path": "/results/multiqc_report.html", + "checksum": { + "value": "03fd5ed150c7862e1fad5efd4f574a47", + "algorithm": "nextflow", + "mode": "standard" + }, + "source": "lid://862df53160e07cd823c0c3960545e747/multiqc_report.html", + "workflowRun": "lid://16b31030474f2e96c55f4940bca3ab64", + "taskRun": null, + "size": 5079806, + "createdAt": "2025-05-09T13:27:34.576590545-05:00", + "modifiedAt": "2025-05-09T13:27:34.586590551-05:00", + "labels": null +} +``` + +Every output file is represented in the lineage store as a `FileOutput` record, which includes basic file information such as the real path, checksum, file size and created/modified timestamps. It also includes lineage information, such as the workflow run and task run that produced it. + +Since this record is a workflow output, it is not linked directly to a task run, but rather to the original task output. + +Any LID in a lineage record can itself be viewed, allowing you to traverse the lineage metadata interactively. Use the value of `source` to view the original task output: + +```console +$ nextflow lineage view lid://862df53160e07cd823c0c3960545e747/multiqc_report.html +{ + "type": "FileOutput", + "path": "/work/86/2df53160e07cd823c0c3960545e747/multiqc_report.html", + "checksum": { + "value": "b14f5171a48ce5c22ea27d7b8e57b6c4", + "algorithm": "nextflow", + "mode": "standard" + }, + "source": "lid://862df53160e07cd823c0c3960545e747", + "workflowRun": "lid://16b31030474f2e96c55f4940bca3ab64", + "taskRun": "lid://862df53160e07cd823c0c3960545e747", + "size": 5079806, + "createdAt": "2025-05-09T13:27:34.236590379-05:00", + "modifiedAt": "2025-05-09T13:27:34.246590383-05:00", + "labels": null +} +``` + +This record is the task output for the same file -- it has a value for `taskRun` which is the same as its `source`. + +View the lineage record for the task that produced this file: + +```console +$ nextflow lineage view lid://862df53160e07cd823c0c3960545e747 +{ + "type": "TaskRun", + "sessionId": "065bdc6b-89b4-42ee-92c1-2a5af37f2c50", + "name": "MULTIQC", + "codeChecksum": { + "value": "edf2e9f84cd3a18ee9259012b660f2dd", + "algorithm": "nextflow", + "mode": "standard" + }, + "script": "\n cp multiqc/* .\n echo \"custom_logo: $PWD/nextflow_logo.png\" \u003e\u003e multiqc_config.yaml\n multiqc -n multiqc_report.html .\n ", + "input": [ + { + "type": "path", + "name": "*", + "value": [ + "lid://eff8846883b46c5a76f11e7e4480a6c8/ggal_gut", + "lid://2d8bd92c69f732605bc99941e60d5319/fastqc_ggal_gut_logs" + ] + }, + { + "type": "path", + "name": "config", + "value": [ + { + "path": "https://github.com/nextflow-io/rnaseq-nf/tree/86165b8c81d43a1f57363964431395152e353e56/multiqc", + "checksum": { + "value": "2aac500cdfb292e961e678433e7dc3d8", + "algorithm": "nextflow", + "mode": "standard" + } + } + ] + } + ], + "container": null, + "conda": "file:///conda/env-4a436c230263dfdbbf4dddd0623505d1", + "spack": null, + "architecture": null, + "globalVars": {}, + "binEntries": [], + "workflowRun": "lid://16b31030474f2e96c55f4940bca3ab64" +} +``` + +Every task run is represented in the lineage store as a `TaskRun`, which includes information such as the name, script, inputs, and software dependencies. From here, you can continue traversing through the file inputs to view upstream tasks. + +Finally, use the `render` subcommand to render the entire lineage of the MULTIQC report as an HTML report: + +```console +$ nextflow lineage render lid://16b31030474f2e96c55f4940bca3ab64/multiqc_report.html +Rendered lineage graph for lid://16b31030474f2e96c55f4940bca3ab64/multiqc_report.html to lineage.html +``` + +Open the HTML report in a web browser to view the lineage graph. + +## Query lineage records + +To find a lineage record, you normally have to know the LID of the record, or you have to know the LID of a downstream record (such as a workflow run) from which you can traverse to the desired record. However, you can also query the entire lineage store by fields, allowing you to quickly find relevant records and aggregate records from different runs. + +Use the `find` subcommand to find all tasks executed by a workflow run: + +```console +$ nextflow lineage find type=TaskRun workflowRun=lid://16b31030474f2e96c55f4940bca3ab64 +[ + "lid://2d8bd92c69f732605bc99941e60d5319", + "lid://eff8846883b46c5a76f11e7e4480a6c8", + "lid://862df53160e07cd823c0c3960545e747", + "lid://6d3bff36bf2c3c14c2d383384621e8ca" +] +``` + +You can use any field defined the [lineage data model](https://github.com/nextflow-io/nextflow/tree/master/modules/nf-lineage/src/main/nextflow/lineage/model). + +:::{tip} +Since the `find` and `view` subcommands always output JSON, you can use JSON processing tools such as [jq](https://jqlang.org/) to further query and transform results. +::: + +## Compare two task runs + +Since task run LIDs are based on the standard {ref}`task hash `, it is easy to compare two task runs in the lineage metadata. This is especially useful when a task is unexpectedly re-executed during a resumed run. As long as lineage is enabled for the initial and resumed runs, the two tasks can be compared without any additional runs. + +This section builds on the previous `rnaseq-nf` example to demonstrate how to compare two task runs in the event of a cache invalidation. + +First, modify the pipeline in a way that invalidates the cache, such as modifying the script of the `MULTIQC` process. + +Resume the pipeline, which should re-execute the `MULTIQC` task: + +```console +$ nextflow run rnaseq-nf -profile conda -resume + + ... + +[6d/3bff36] process > RNASEQ:INDEX (ggal_1_48850000_49020000) [100%] 1 of 1, cached: 1 ✔ +[2d/8bd92c] process > RNASEQ:FASTQC (FASTQC on ggal_gut) [100%] 1 of 1, cached: 1 ✔ +[ef/f88468] process > RNASEQ:QUANT (ggal_gut) [100%] 1 of 1, cached: 1 ✔ +[94/33dda7] process > MULTIQC [100%] 1 of 1 ✔ +``` + +Retrieve the hash of the MULTIQC run from the log file or work directory -- in this case it is `9433dda73f2193491f9a26e3e23cd8a1`. + +Finally, compare the task hash of the initial run (taken from the original example) to that of the resumed run: + +```console +$ nextflow lineage diff lid://862df53160e07cd823c0c3960545e747 lid://9433dda73f2193491f9a26e3e23cd8a1 +diff --git 862df53160e07cd823c0c3960545e747 9433dda73f2193491f9a26e3e23cd8a1 +--- 862df53160e07cd823c0c3960545e747 ++++ 9433dda73f2193491f9a26e3e23cd8a1 +@@ -3,11 +3,11 @@ + "sessionId": "065bdc6b-89b4-42ee-92c1-2a5af37f2c50", + "name": "MULTIQC", + "codeChecksum": { +- "value": "edf2e9f84cd3a18ee9259012b660f2dd", ++ "value": "9615a8da3a3f9e935cfc8e4042cdf5e0", + "algorithm": "nextflow", + "mode": "standard" + }, +- "script": "\n cp multiqc/* .\n echo \"custom_logo: $PWD/nextflow_logo.png\" \u003e\u003e multiqc_config.yaml\n multiqc -n multiqc_report.html .\n ", ++ "script": "\n cp multiqc/* . # hello!\n echo \"custom_logo: $PWD/nextflow_logo.png\" \u003e\u003e multiqc_config.yaml\n multiqc -n multiqc_report.html .\n ", + "input": [ + { + "type": "path", +@@ -38,5 +38,5 @@ + "architecture": null, + "globalVars": {}, + "binEntries": [], +- "workflowRun": "lid://16b31030474f2e96c55f4940bca3ab64" ++ "workflowRun": "lid://65044872aad36f97e42336b9ba0dee57" + } +``` + +Note the difference between the task scripts, highlighting the change that caused the task to be re-executed. + +## Use lineage with workflow outputs + +Workflow outputs declared in the `output` block are also recorded in the lineage store. The output of a workflow run can be accessed as `lid://#output`. + +Run the `rnaseq-nf` pipeline using the `preview-25-04` branch, which uses the `output` block to publish outputs: + +```console +$ nextflow -r preview-25-04 -profile conda +``` + +View the workflow output in the lineage metadata: + +```console +$ nextflow lineage view lid://9410d13abeec617640b5fe9735ba12fc#output +[ + { + "type": "Collection", + "name": "samples", + "value": "lid://9410d13abeec617640b5fe9735ba12fc/samples.json" + }, + { + "type": "Path", + "name": "summary", + "value": "lid://9410d13abeec617640b5fe9735ba12fc/multiqc_report.html" + } +] +``` + +This view can be used to traverse output files directly instead of inferring LIDs from the workflow output directory. + +See {ref}`workflow-output-def` for more information about the `output` block. + +## Use lineage in a Nextflow script + +Since lineage IDs are valid URIs, output files in the lineage store can be accessed by their LID in a Nextflow script, like any other path. The LID path returns the *real* path as defined by the `path` field in the `FileOutput` record. + +The following script uses the `samples.json` from the previous example as an input samplesheet: + +```nextflow +channel.fromPath('lid://9410d13abeec617640b5fe9735ba12fc/samples.json') + .splitJson() + .view() +``` + +```console +[id:gut, quant:/results/gut/quant, fastqc:/results/gut/fastqc] +``` + +The `fromLineage` channel factory can also be used to query lineage records in a similar manner as the `find` subcommand. See {ref}`channel-from-lineage` for details. diff --git a/docs/index.md b/docs/index.md index bac4fa08b8..2014a48e5d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -161,6 +161,7 @@ developer/plugins :caption: Guides :maxdepth: 1 +data-lineage updating-spot-retries metrics flux diff --git a/docs/migrations/25-04.md b/docs/migrations/25-04.md index fa676fac0c..57cb5febd3 100644 --- a/docs/migrations/25-04.md +++ b/docs/migrations/25-04.md @@ -1,3 +1,4 @@ +(migrating-25-04-page)= # Migrating to 25.04 (preview) @@ -47,7 +48,7 @@ This release introduces built-in provenance tracking, also known as *data lineag You can explore this lineage from the command line using the {ref}`cli-lineage` command. Additionally, you can refer to files in the lineage store from a Nextflow script using the `lid://` path prefix as well as the {ref}`channel-from-lineage` channel factory. -See the {ref}`cli-lineage` command and {ref}`config-lineage` config scope for details. +See the {ref}`data-lineage-page` guide to get started. ## Enhancements diff --git a/docs/reference/cli.md b/docs/reference/cli.md index a53cbf1b35..5a4e717f52 100644 --- a/docs/reference/cli.md +++ b/docs/reference/cli.md @@ -703,67 +703,32 @@ $ nextflow lineage SUBCOMMAND [arg ..] **Description** -The `lineage` command is used to inspect lineage metadata. Data lineage can be enabled by setting `lineage.enabled` to `true` in your Nextflow configuration (see the {ref}`config-lineage` config scope for details). +The `lineage` command is used to inspect lineage metadata. + +See the {ref}`data-lineage-page` guide to learn how to get started with data lineage. **Options** `-h, -help` : Print the command usage. -**Examples** - -List the Nextflow runs with lineage metadata enabled, printing the corresponding lineage ID (LID) for each run. - -```console -$ nextflow lineage list -TIMESTAMP RUN NAME SESSION ID LINEAGE ID -2025-04-22 14:45:43 backstabbing_heyrovsky 21bc4fad-e8b8-447d-9410-388f926a711f lid://c914d714877cc5c882c55a5428b510b1 -``` - -View a lineage record. - -```console -$ nextflow lineage view -``` - -The output of a workflow run can be shown by appending `#output` to the workflow run LID: - -```console -$ nextflow lineage view lid://c914d714877cc5c882c55a5428b510b1#output -``` - -:::{tip} -You can use the [jq](https://jqlang.org/) command-line tool to apply further queries and transformations on the resulting lineage record. -::: - -Find all lineage records that match a set of key-value pairs: +**Subcommands** -```console -$ nextflow lineage find = = ... -``` +`diff ` +: Display a git-style diff between two lineage records. -Use any object property defined in the [Lineage metadata model](https://github.com/nextflow-io/nextflow/tree/master/modules/nf-lineage/src/main/nextflow/lineage/model) as a key. Use the `type` key to refer to a metadata object class: +`find = [= ...]` +: Find all lineage records that match the given field values. -```console -$ nextflow lineage find type=FileOutput workflowRun=lid://c914d714877cc5c882c55a5428b510b1 label=foo -``` -Find all tasks executed by a workflow: +`list` +: List the Nextflow runs with lineage enabled, printing the corresponding lineage ID (LID) for each run. -```console -$ nextflow lineage find type=TaskRun workflowRun=lid://c914d714877cc5c882c55a5428b510b1 -``` +`render [path]` +: Render the lineage graph for a lineage record as an HTML file (default output path: `./lineage.html`). +: The lineage record should be of type `FileOutput`, `TaskRun`, or `WorkflowRun`. -Display a git-style diff between two lineage records. - -```console -$ nextflow lineage diff -``` - -Render the lineage graph for a workflow or task output as an HTML file. (default file path: `./lineage.html`). - -```console -$ nextflow lineage render [html-file-path] -``` +`view ` +: View a lineage record. (cli-lint)= diff --git a/plugins/nf-google/src/main/nextflow/cloud/google/batch/logging/BatchLogging.groovy b/plugins/nf-google/src/main/nextflow/cloud/google/batch/logging/BatchLogging.groovy index ab0014967c..f00be651db 100644 --- a/plugins/nf-google/src/main/nextflow/cloud/google/batch/logging/BatchLogging.groovy +++ b/plugins/nf-google/src/main/nextflow/cloud/google/batch/logging/BatchLogging.groovy @@ -76,7 +76,7 @@ class BatchLogging implements Closeable { return [ stdout.toString(), stderr.toString() ] } - protected void parseOutput(LogEntry logEntry, StringBuilder stdout, StringBuilder stderr) { + protected static void parseOutput(LogEntry logEntry, StringBuilder stdout, StringBuilder stderr) { final output = logEntry.payload.data.toString() if (logEntry.severity == Severity.ERROR) { stderr.append(output) diff --git a/plugins/nf-google/src/test/nextflow/cloud/google/batch/logging/BatchLoggingTest.groovy b/plugins/nf-google/src/test/nextflow/cloud/google/batch/logging/BatchLoggingTest.groovy index 271f2b86ec..395dbe98e5 100644 --- a/plugins/nf-google/src/test/nextflow/cloud/google/batch/logging/BatchLoggingTest.groovy +++ b/plugins/nf-google/src/test/nextflow/cloud/google/batch/logging/BatchLoggingTest.groovy @@ -48,28 +48,24 @@ class BatchLoggingTest extends Specification { def OUT_ENTRY2 = LogEntry.newBuilder(StringPayload.of('Hello world')).setSeverity(Severity.INFO).build() def ERR_ENTRY1 = LogEntry.newBuilder(StringPayload.of('Oops something has failed. We are sorry.\n')).setSeverity(Severity.ERROR).build() def ERR_ENTRY2 = LogEntry.newBuilder(StringPayload.of('blah blah')).setSeverity(Severity.ERROR).build() - and: - def sess = Mock(Session) { getConfig() >> [:] } - def config = BatchConfig.create(sess) - def client = new BatchLogging(config) when: def stdout = new StringBuilder() def stderr = new StringBuilder() and: - client.parseOutput(OUT_ENTRY1, stdout, stderr) + BatchLogging.parseOutput(OUT_ENTRY1, stdout, stderr) then: stdout.toString() == 'No user sessions are running outdated binaries.\n' and: stderr.toString() == '' when: - client.parseOutput(ERR_ENTRY1, stdout, stderr) + BatchLogging.parseOutput(ERR_ENTRY1, stdout, stderr) then: stderr.toString() == 'Oops something has failed. We are sorry.\n' when: - client.parseOutput(ERR_ENTRY2, stdout, stderr) + BatchLogging.parseOutput(ERR_ENTRY2, stdout, stderr) then: // the message is appended to the stderr because not prefix is provided stderr.toString() == 'Oops something has failed. We are sorry.\nblah blah' @@ -78,7 +74,7 @@ class BatchLoggingTest extends Specification { stdout.toString() == 'No user sessions are running outdated binaries.\n' when: - client.parseOutput(OUT_ENTRY2, stdout, stderr) + BatchLogging.parseOutput(OUT_ENTRY2, stdout, stderr) then: // the message is added to the stdout stdout.toString() == 'No user sessions are running outdated binaries.\nHello world'