Direct Access

Paul Nilsson edited this page Apr 15, 2026 · 2 revisions

Direct I/O (Remote I/O)

Overview

The PanDA Pilot supports two distinct modes for delivering input files to a payload:

Copy-to-scratch is the default. The pilot uses a copytool (normally Rucio) to physically download each input file to the worker node's local scratch disk before the payload starts. The payload reads the files as ordinary local files. The transfer protocol used is transparent to the payload.

Direct I/O (also called remote I/O or remoteIO internally) skips the local copy. Instead, the pilot resolves a Transfer URL (TURL) for each input file and hands the list of TURLs to the payload. The payload opens the files directly over the network at runtime, typically using ROOT's TFile::Open(). No local disk space is consumed for the input data, but the payload must be able to speak the protocol embedded in each TURL.

The choice between these modes has significant implications. Direct I/O reduces local disk pressure and stage-in time, but it requires the payload to tolerate network latency and depends on the storage endpoint being reachable from the worker node at job execution time. Protocol compatibility between the payload and the TURL is essential, because a mismatch is harder to diagnose than a stage-in error.


When direct I/O is used

Whether direct I/O is attempted for a given job depends on two independent conditions, both of which must be satisfied:

1. Queue-level permission (CRIC / queuedata)

The PanDA queue must have direct I/O enabled. This is controlled by two boolean fields read from CRIC and stored in queuedata:

| Field | Meaning |
|---|---|
| direct_access_lan | Allow direct I/O when the storage endpoint is on the LAN |
| direct_access_wan | Allow direct I/O when the storage endpoint is on the WAN |

If neither field is set, direct I/O is never attempted for jobs at that queue, regardless of the job's transfertype.

2. Job-level permission (transfertype)

For analysis jobs, direct I/O is enabled whenever the queue permits it. No explicit transfertype is required.

For production jobs, direct I/O is gated on the transfertype field received from the PanDA server. The permitted values are:

| transfertype | Effect |
|---|---|
| (empty / Null) | Copy-to-scratch. Direct I/O is disabled. |
| direct | Direct I/O enabled. Protocol preference follows the queue default (root:// first). |
| root | Direct I/O enabled. root:// protocol explicitly preferred. |
| davs | Direct I/O enabled. davs:// protocol preferred, with fallback to other available protocols. |
| davs,root | Direct I/O enabled. davs:// tried first, then root://, then remaining fallbacks. |
| root,davs | Direct I/O enabled. root:// tried first, then davs://. |
| file | Copy-to-scratch via a POSIX filesystem link (Rucio --protocol=file). Direct I/O is not used. |

Comma-separated lists must contain only recognised direct-I/O keywords (direct, root, davs). Any list that includes file is treated as a copy-to-scratch instruction.

Backward compatibility: The direct / empty / Null / file behaviours are unchanged from before the protocol-selection feature was introduced. Existing jobs are unaffected.
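
The transfertype rules above can be sketched as a small parser. This is an illustrative sketch, not the pilot's actual code: the function name `parse_transfertype` is hypothetical, and two behaviours are assumptions because the text does not specify them: the literal strings "null" and "none" are treated like an empty value, and a list containing an unrecognised keyword falls back to copy-to-scratch.

```python
DIRECT_KEYWORDS = {"direct", "root", "davs"}


def parse_transfertype(transfertype):
    """Return (use_direct_io, preferred_protocols) for a production job."""
    # Empty / Null means copy-to-scratch.
    # Assumption: the strings "null" and "none" count as Null.
    if not transfertype or transfertype.strip().lower() in ("null", "none"):
        return False, []

    tokens = [t.strip().lower() for t in transfertype.split(",") if t.strip()]

    # Any list that includes "file" is a copy-to-scratch instruction.
    if "file" in tokens:
        return False, []

    # Assumption: unrecognised keywords disable direct I/O rather than raise.
    if not tokens or any(t not in DIRECT_KEYWORDS for t in tokens):
        return False, []

    # "direct" carries no protocol preference of its own.
    preferred = [t for t in tokens if t != "direct"]
    return True, preferred


print(parse_transfertype("davs,root"))
print(parse_transfertype("file"))
```

The returned protocol list is empty for `direct`, which signals that the queue default ordering applies unchanged.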


How a file is resolved for direct I/O

When both conditions above are met, the pilot selects a TURL for each input file through the following ordered steps.

Replica selection

The pilot queries Rucio for all available replicas of each input file. Replicas are sorted with LAN endpoints (those in inputddms, derived from the read_lan activity in CRIC) ranked above WAN endpoints. Within each domain the ordering respects Rucio's configured priorities.
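
As a rough illustration of this ordering, the sketch below ranks LAN replicas first while preserving the incoming order within each domain. The replica dictionaries, the `rse` key, the endpoint names and the function name `rank_replicas` are all hypothetical; the real pilot data structures may differ.

```python
def rank_replicas(replicas, inputddms):
    """Order replicas LAN-first, keeping the incoming order within each domain."""
    lan = set(inputddms)
    # sorted() is stable, so replicas with the same key keep their incoming
    # (Rucio-priority) order. False sorts before True, so LAN comes first.
    return sorted(replicas, key=lambda replica: replica["rse"] not in lan)


replicas = [
    {"rse": "REMOTE_DATADISK", "turl": "root://remote.example.org//f1"},
    {"rse": "LOCAL_DATADISK", "turl": "root://local.example.org//f1"},
    {"rse": "OTHER_DATADISK", "turl": "https://other.example.org/f1"},
]
ranked = rank_replicas(replicas, inputddms=["LOCAL_DATADISK"])
print([r["rse"] for r in ranked])
```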

For each file, the pilot selects the best replica by trying schemas in a priority order. This order depends on transfertype:

  • For transfertype=direct or unset, the default priority order is used: ['root', 'dcache', 'dcap', 'file', 'https'] for LAN and ['root', 'https'] for WAN.
  • For transfertype=davs, davs is moved to the front of whichever list applies, and the remaining entries follow in their original order.
  • For a comma-separated list such as davs,root, the listed protocols are placed at the front in the given order, with remaining entries following.

This means the TURL handed to the payload will use the preferred protocol where a replica is available, and will fall back to the next protocol in the list if not.
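
The reordering and fallback described above might look like the following sketch. The default lists are taken from the text; the function names `schema_priority` and `pick_turl`, and the example hostnames, are hypothetical.

```python
DEFAULT_SCHEMAS = {
    "lan": ["root", "dcache", "dcap", "file", "https"],
    "wan": ["root", "https"],
}


def schema_priority(domain, preferred):
    """Put the preferred protocols first, then the remaining defaults in order."""
    ordered = []
    for schema in list(preferred) + DEFAULT_SCHEMAS[domain]:
        if schema not in ordered:
            ordered.append(schema)
    return ordered


def pick_turl(turls, priority):
    """Return the first TURL matching the highest-priority schema, or None."""
    for schema in priority:
        for turl in turls:
            if turl.startswith(schema + "://"):
                return turl
    return None


priority = schema_priority("wan", ["davs", "root"])
turls = ["https://se.example.org/f1", "root://se.example.org//f1"]
print(priority)
print(pick_turl(turls, priority))
```

In this example no davs:// replica exists, so the selection falls through to root://, the next entry in the list.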

Direct-access eligibility

A file is only eligible for direct I/O if:

  1. Its accessmode is direct (set during job initialisation based on transfertype and prodDBlockToken).
  2. The resolved TURL's schema is in the allowed set for its domain (direct_localinput_allowed_schemas for LAN, direct_remoteinput_allowed_schemas for WAN).
  3. The file's status is not local (i.e. prodDBlockToken != local).

If a file is not eligible — for example, because no replica with a permitted schema was found — it falls back to copy-to-scratch.
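
The three conditions can be combined into a single predicate, sketched below. The function name and argument layout are hypothetical, and the allowed-schema lists passed in the example are placeholders, since the actual contents of direct_localinput_allowed_schemas and direct_remoteinput_allowed_schemas are not given here.

```python
from urllib.parse import urlsplit


def is_direct_eligible(accessmode, turl, domain, allowed_lan, allowed_wan,
                       prod_dblock_token=""):
    """Return True if a file may be read via direct I/O."""
    if accessmode != "direct":
        return False                 # condition 1: accessmode must be direct
    if prod_dblock_token == "local":
        return False                 # condition 3: file is pinned to local copy
    if not turl:
        return False                 # no replica was resolved at all
    schema = urlsplit(turl).scheme   # e.g. "root" for root://host//path
    allowed = allowed_lan if domain == "lan" else allowed_wan
    return schema in allowed         # condition 2: schema allowed for domain


# Placeholder allow-lists for illustration only.
lan_ok = ["root", "davs"]
wan_ok = ["root"]
print(is_direct_eligible("direct", "davs://se.example.org/f1", "lan", lan_ok, wan_ok))
print(is_direct_eligible("direct", "davs://se.example.org/f1", "wan", lan_ok, wan_ok))
```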

Status assignment

Eligible files are marked with status = "remote_io". Files that do not meet the eligibility criteria retain their status and are transferred normally by the copytool.


What the payload receives

Once replica selection is complete, the pilot modifies the payload command to inform the transformation script that direct I/O is in use:

  • --usePFCTurl is appended, instructing the payload to use the TURLs from the Pool File Catalog (PFC) rather than local paths.
  • --directIn is appended, instructing the payload to open files directly over the network.

The PFC (PoolFileCatalog.xml) is populated with the resolved TURLs for each remote_io file. The payload uses these TURLs with TFile::Open() at runtime.
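
A minimal sketch of the command modification follows. The function name, the way file statuses are passed in, and the example command are hypothetical; only the two flag names come from the text above.

```python
def add_direct_io_flags(cmd, statuses):
    """Append the direct-I/O flags if at least one input file is remote_io."""
    if "remote_io" not in statuses:
        return cmd
    present = cmd.split()
    for flag in ("--usePFCTurl", "--directIn"):
        # Skip flags that are already on the command line.
        if flag not in present:
            cmd = f"{cmd} {flag}"
    return cmd


print(add_direct_io_flags("transform.py --maxEvents=10", ["remote_io", None]))
```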

XCache proxy

If the environment variable ALRB_XCACHE_PROXY is set, the pilot prepends its value to each LAN TURL before writing the PFC. This routes reads through a local caching proxy, which can reduce latency and storage-endpoint load for sites that have one deployed.
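
The prefixing step could be sketched as below, assuming plain string concatenation of the proxy value and the TURL. The function name and the example proxy URL are hypothetical.

```python
import os


def apply_xcache_proxy(turl, domain, env=None):
    """Prepend ALRB_XCACHE_PROXY to LAN TURLs when the variable is set."""
    env = os.environ if env is None else env
    proxy = env.get("ALRB_XCACHE_PROXY")
    if proxy and domain == "lan":
        return proxy + turl
    return turl


# Example proxy value; the real format depends on the site's XCache setup.
env = {"ALRB_XCACHE_PROXY": "root://xcache.example.org:1094//"}
print(apply_xcache_proxy("root://se.example.org:1094//data/f1.root", "lan", env))
```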


Pre-flight file open verification

Before launching the payload, the pilot can optionally verify that each direct-I/O TURL is actually openable. This is controlled by config.Pilot.remotefileverification_log. When enabled:

  1. The pilot runs open_remote_file.py (a small script that calls TFile::Open() in parallel threads) against all remote_io TURLs.
  2. TURLs that cannot be opened are recorded in a verification dictionary.
  3. The pilot sends a Rucio trace for each file: FOUND_ROOT for successfully opened files, FAILED_REMOTE_OPEN for those that could not be opened.
  4. TURLs that fail the pre-flight check are removed from the PFC, and the job may be aborted or the relevant files demoted to copy-to-scratch depending on configuration.
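
The steps above can be sketched with a thread pool and an injected opener. This is not the actual open_remote_file.py: the function name `verify_turls` is hypothetical, and `try_open` stands in for whatever performs the real open call, so the example uses a fake opener instead of ROOT.

```python
from concurrent.futures import ThreadPoolExecutor


def verify_turls(turls, try_open, max_workers=4):
    """Open each TURL in parallel and map it to a trace state."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outcomes line up with turls.
        outcomes = list(pool.map(try_open, turls))
    traces = {}
    for turl, opened in zip(turls, outcomes):
        traces[turl] = "FOUND_ROOT" if opened else "FAILED_REMOTE_OPEN"
    failed = [t for t, state in traces.items() if state == "FAILED_REMOTE_OPEN"]
    return traces, failed


def fake_open(turl):
    """Stand-in opener for the example: fails for one hard-coded host."""
    return "broken.example.org" not in turl


turls = ["root://se.example.org//f1", "root://broken.example.org//f2"]
traces, failed = verify_turls(turls, fake_open)
print(traces)
print(failed)
```

The `failed` list corresponds to the TURLs that would be removed from the PFC in step 4.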

When the TURL list exceeds 500 entries (e.g. on large merge jobs), the list is written to a file (turls.txt) and passed via --turl-file rather than on the command line, to avoid OS argument-length limits.
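
The threshold logic might be sketched as follows. The function name is hypothetical, and two details are assumptions: the file holds one TURL per line, and --turl-file takes the path as a separate argument. How the TURLs are passed inline below the threshold is not specified here, so the sketch simply returns the list in that case.

```python
import os
import tempfile

MAX_INLINE_TURLS = 500


def turl_arguments(turls, workdir):
    """Return inline TURLs, or a --turl-file argument pair for long lists."""
    if len(turls) > MAX_INLINE_TURLS:
        path = os.path.join(workdir, "turls.txt")
        with open(path, "w", encoding="utf-8") as handle:
            # Assumption: one TURL per line.
            handle.write("\n".join(turls) + "\n")
        return ["--turl-file", path]
    return list(turls)


with tempfile.TemporaryDirectory() as workdir:
    many = [f"root://se.example.org//data/file{i}.root" for i in range(501)]
    args = turl_arguments(many, workdir)
    with open(args[1], encoding="utf-8") as handle:
        written = handle.read().splitlines()
print(args[0], len(written))
```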


Failure handling

If a direct-I/O file open fails inside the payload after the pre-flight check has passed, the error will appear in payload.stdout as an XRootD or ROOT error message rather than as a pilot stage-in error. The pilot's diagnose module scans the payload stdout for known patterns (such as TNetXNGFile::Open ERROR, No servers available, Unable to open ROOT file) for jobs where has_remoteio() is true, and classifies matching failures as STAGEINFAILED with the error line recorded as diagnostics. Without this, such failures would be reported as the less informative UNKNOWNPAYLOADFAILURE.
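
A simplified sketch of this classification is shown below. The patterns and error names come from the text; the function name, the return shape and the sample log line are hypothetical, and the real diagnose module may match patterns differently.

```python
REMOTE_IO_PATTERNS = (
    "TNetXNGFile::Open ERROR",
    "No servers available",
    "Unable to open ROOT file",
)


def classify_payload_failure(stdout_text, has_remoteio):
    """Return (error_name, diagnostics) for a failed payload."""
    if has_remoteio:
        for line in stdout_text.splitlines():
            if any(pattern in line for pattern in REMOTE_IO_PATTERNS):
                # Record the first matching line as the diagnostics.
                return "STAGEINFAILED", line.strip()
    return "UNKNOWNPAYLOADFAILURE", ""


# Made-up log text for illustration.
log = "starting payload\n  Unable to open ROOT file root://se.example.org//f1\nexit 1\n"
print(classify_payload_failure(log, has_remoteio=True))
print(classify_payload_failure(log, has_remoteio=False))
```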


Configuration reference

| Source | Field | Type | Description |
|---|---|---|---|
| CRIC / queuedata | direct_access_lan | bool | Enable direct I/O for LAN replicas at this queue |
| CRIC / queuedata | direct_access_wan | bool | Enable direct I/O for WAN replicas at this queue |
| Job definition | transferType | string | Protocol preference and direct I/O mode (see table above) |
| Job definition | prodDBlockToken | string | Set to local to force copy-to-scratch for a specific file |
| Job parameters | --accessmode=direct | flag | Override to force direct I/O (analysis jobs) |
| Job parameters | --accessmode=copy / --useLocalIO | flag | Override to force copy-to-scratch |
| Environment | ALRB_XCACHE_PROXY | string | XCache proxy URL prepended to LAN TURLs |
| Pilot config | remotefileverification_log | string | Enables pre-flight TURL verification when set |

Summary of transfertype values

transfertype=Null / ""   → copy-to-scratch (Rucio, default protocol selection)
transfertype=file        → copy-to-scratch via POSIX link (Rucio --protocol=file)
transfertype=direct      → direct I/O, root:// preferred (existing default)
transfertype=root        → direct I/O, root:// explicitly preferred
transfertype=davs        → direct I/O, davs:// preferred (useful for ML/HDF5 payloads)
transfertype=davs,root   → direct I/O, davs:// first then root://
transfertype=root,davs   → direct I/O, root:// first then davs://

In all direct I/O cases, the pilot falls back to secondary protocols (/1, /2, … in CRIC priority order) if no replica is available for the preferred protocol.
