Direct Access
The PanDA Pilot supports two distinct modes for delivering input files to a payload:
Copy-to-scratch is the default. The pilot uses a copytool (normally Rucio) to physically download each input file to the worker node's local scratch disk before the payload starts. The payload reads the files as ordinary local files. The transfer protocol used is transparent to the payload.
Direct I/O (also called remote I/O or remoteIO internally) skips the local copy. Instead, the pilot resolves a Transfer URL (TURL) for each input file and hands the list of TURLs to the payload. The payload opens the files directly over the network at runtime, typically using ROOT::TFile::Open(). No local disk space is consumed for the input data, and the payload must be able to speak the protocol embedded in each TURL.
The choice between these modes has significant implications. Direct I/O reduces local disk pressure and stage-in time, but it requires the payload to tolerate network latency and depends on the storage endpoint being reachable from the worker node at job execution time. Protocol compatibility between the payload and the TURL is essential, and a protocol mismatch is harder to diagnose than a stage-in error.
Whether direct I/O is attempted for a given job depends on two independent conditions, both of which must be satisfied:
The PanDA queue must have direct I/O enabled. This is controlled by two boolean fields read from CRIC and stored in queuedata:
| Field | Meaning |
|---|---|
| `direct_access_lan` | Allow direct I/O when the storage endpoint is on the LAN |
| `direct_access_wan` | Allow direct I/O when the storage endpoint is on the WAN |
If neither field is set, direct I/O is never attempted for jobs at that queue, regardless of the job's transfertype.
For analysis jobs, direct I/O is enabled whenever the queue permits it. No explicit transfertype is required.
For production jobs, direct I/O is gated on the transfertype field received from the PanDA server. The permitted values are:
| `transfertype` | Effect |
|---|---|
| (empty / `Null`) | Copy-to-scratch. Direct I/O is disabled. |
| `direct` | Direct I/O enabled. Protocol preference follows the queue default (`root://` first). |
| `root` | Direct I/O enabled. `root://` protocol explicitly preferred. |
| `davs` | Direct I/O enabled. `davs://` protocol preferred, with fallback to other available protocols. |
| `davs,root` | Direct I/O enabled. `davs://` tried first, then `root://`, then remaining fallbacks. |
| `root,davs` | Direct I/O enabled. `root://` tried first, then `davs://`. |
| `file` | Copy-to-scratch via a POSIX filesystem link (Rucio `--protocol=file`). Direct I/O is not used. |
Comma-separated lists must contain only recognised direct-I/O keywords (direct, root, davs). Any list that includes file is treated as a copy-to-scratch instruction.
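The two conditions and the `transfertype` table can be summarised in a short sketch. This is illustrative only and is not the pilot's actual code: the function names `parse_transfertype` and `direct_io_attempted`, the mode labels, and the treatment of unrecognised keywords as copy-to-scratch are all assumptions made for the example.

```python
# Hypothetical helper names; only the keywords and rules come from the text above.
DIRECT_KEYWORDS = {"direct", "root", "davs"}


def parse_transfertype(transfertype):
    """Return (mode, preferred_protocols) for a transfertype value."""
    value = (transfertype or "").strip()
    if value.lower() in ("", "null"):
        return "copy", []
    tokens = [t.strip().lower() for t in value.split(",") if t.strip()]
    # Any list that includes 'file' is a copy-to-scratch instruction.
    if "file" in tokens:
        return "copy-file", []
    # Assumption: an empty or unrecognised list falls back to copy-to-scratch.
    if not tokens or any(t not in DIRECT_KEYWORDS for t in tokens):
        return "copy", []
    # 'direct' enables direct I/O but carries no protocol preference itself.
    return "direct", [t for t in tokens if t != "direct"]


def direct_io_attempted(queue_lan, queue_wan, is_analysis, transfertype):
    """Combine the queue condition with the job condition."""
    if not (queue_lan or queue_wan):
        return False  # neither direct_access_lan nor direct_access_wan is set
    if is_analysis:
        return True  # analysis jobs need no explicit transfertype
    return parse_transfertype(transfertype)[0] == "direct"
```

For example, `parse_transfertype("davs,root")` yields direct I/O with `davs` ahead of `root`, while `parse_transfertype("davs,file")` yields copy-to-scratch because the list contains `file`.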
Backward compatibility: The `direct` / empty / `Null` / `file` behaviours are unchanged from before the protocol-selection feature was introduced. Existing jobs are unaffected.
When both conditions above are met, the pilot selects a TURL for each input file through the following ordered steps.
The pilot queries Rucio for all available replicas of each input file. Replicas are sorted with LAN endpoints (those in inputddms, derived from the read_lan activity in CRIC) ranked above WAN endpoints. Within each domain the ordering respects Rucio's configured priorities.
For each file, the pilot selects the best replica by trying schemas in a priority order. This order depends on transfertype:
- For `transfertype=direct` or unset, the default priority order is used: `['root', 'dcache', 'dcap', 'file', 'https']` for LAN and `['root', 'https']` for WAN.
- For `transfertype=davs`, `davs` is moved to the front of whichever list applies, and the remaining entries follow in their original order.
- For a comma-separated list such as `davs,root`, the listed protocols are placed at the front in the given order, with remaining entries following.
This means the TURL handed to the payload will use the preferred protocol where a replica is available, and will fall back to the next protocol in the list if not.
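The reordering and fallback described above might look roughly like the following sketch. The default lists are taken from the text; the function names `schema_priority` and `pick_turl`, and the representation of replicas as a plain list of URLs, are assumptions for illustration and do not reflect the pilot's internal data structures.

```python
from urllib.parse import urlsplit

# Default schema priorities quoted in the text above.
LAN_DEFAULT = ["root", "dcache", "dcap", "file", "https"]
WAN_DEFAULT = ["root", "https"]


def schema_priority(preferred, domain="lan"):
    """Place the preferred protocols first, then the remaining defaults."""
    defaults = LAN_DEFAULT if domain == "lan" else WAN_DEFAULT
    front = []
    for schema in preferred:
        if schema not in front:  # ignore duplicates, keep the given order
            front.append(schema)
    return front + [s for s in defaults if s not in front]


def pick_turl(replica_urls, priority):
    """Return the first replica whose schema matches the highest priority."""
    for schema in priority:
        for url in replica_urls:
            if urlsplit(url).scheme == schema:
                return url
    return None  # no replica with a listed schema
```

With `transfertype=davs` on a WAN domain the priority becomes `['davs', 'root', 'https']`, so a file that has only `root://` and `https://` replicas is served over `root://`.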
A file is only eligible for direct I/O if:
- Its `accessmode` is `direct` (set during job initialisation based on `transfertype` and `prodDBlockToken`).
- The resolved TURL's schema is in the allowed set for its domain (`direct_localinput_allowed_schemas` for LAN, `direct_remoteinput_allowed_schemas` for WAN).
- The file's `status` is not `local` (i.e. `prodDBlockToken != local`).
If a file is not eligible, for example because no replica with a permitted schema was found, it falls back to copy-to-scratch.
Eligible files are marked with status = "remote_io". Files that do not meet the eligibility criteria retain their status and are transferred normally by the copytool.
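As a rough illustration of the eligibility rules, the sketch below marks qualifying files as `remote_io`. The `InputFile` dataclass and the `mark_remote_io` function are hypothetical stand-ins, not the pilot's file specification class; only the three criteria and the `remote_io` status value come from the text.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class InputFile:
    """Hypothetical, simplified stand-in for the pilot's per-file record."""
    lfn: str
    accessmode: str
    turl: str
    domain: str  # "lan" or "wan"
    status: str = ""


def mark_remote_io(files, lan_allowed, wan_allowed):
    """Set status to 'remote_io' on eligible files and return them."""
    for f in files:
        allowed = lan_allowed if f.domain == "lan" else wan_allowed
        schema = urlsplit(f.turl).scheme if f.turl else ""
        if f.accessmode == "direct" and f.status != "local" and schema in allowed:
            f.status = "remote_io"
    # Everything else keeps its status and is left to the copytool.
    return [f for f in files if f.status == "remote_io"]
```

Here `lan_allowed` and `wan_allowed` play the role of `direct_localinput_allowed_schemas` and `direct_remoteinput_allowed_schemas` respectively.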
Once replica selection is complete, the pilot modifies the payload command to inform the transformation script that direct I/O is in use:
- `--usePFCTurl` is appended, instructing the payload to use the TURLs from the Pool File Catalog (PFC) rather than local paths.
- `--directIn` is appended, instructing the payload to open files directly over the network.
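A minimal sketch of this command modification is shown below. The two flags are those named above; the function name `add_direct_io_flags`, the string-based command handling, and the example transform name are assumptions for illustration.

```python
def add_direct_io_flags(cmd, has_remote_io):
    """Append the direct I/O flags to a payload command string if needed."""
    if not has_remote_io:
        return cmd
    present = cmd.split()
    for flag in ("--usePFCTurl", "--directIn"):
        if flag not in present:  # avoid adding a flag twice
            cmd = f"{cmd} {flag}"
    return cmd
```

For example, `add_direct_io_flags("transform.py --inputFile=a.root", True)` returns the same command with `--usePFCTurl --directIn` appended, while a job without `remote_io` files keeps its command unchanged.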
The PFC (PoolFileCatalog.xml) is populated with the resolved TURLs for each remote_io file. The payload uses these TURLs with ROOT::TFile::Open() at runtime.
If the environment variable ALRB_XCACHE_PROXY is set, the pilot prepends its value to each LAN TURL before writing the PFC. This routes reads through a local caching proxy, which can reduce latency and storage-endpoint load for sites that have one deployed.
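The proxy handling can be pictured as a simple string prefix, as in the sketch below. The function name `apply_xcache` is hypothetical, and plain concatenation of the proxy value and the TURL is an assumption about the joining convention; the example proxy URL is made up.

```python
import os


def apply_xcache(turl, domain, environ=os.environ):
    """Prefix a LAN TURL with the XCache proxy, if one is configured."""
    proxy = environ.get("ALRB_XCACHE_PROXY", "")
    # Assumption: the proxy value is prepended verbatim, and only once.
    if proxy and domain == "lan" and not turl.startswith(proxy):
        return proxy + turl
    return turl  # WAN TURLs, or no proxy configured
```

Passing `environ` explicitly keeps the helper easy to exercise without touching the real process environment.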
Before launching the payload, the pilot can optionally verify that each direct-I/O TURL is actually openable. This is controlled by config.Pilot.remotefileverification_log. When enabled:
- The pilot runs `open_remote_file.py` (a small script that calls `ROOT::TFile::Open()` in parallel threads) against all `remote_io` TURLs.
- TURLs that cannot be opened are recorded in a verification dictionary.
- The pilot sends a Rucio trace for each file: `FOUND_ROOT` for successfully opened files, `FAILED_REMOTE_OPEN` for those that could not be opened.
- TURLs that fail the pre-flight check are removed from the PFC, and the job may be aborted or the relevant files demoted to copy-to-scratch depending on configuration.
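The shape of the pre-flight check can be sketched as follows. This is not the contents of `open_remote_file.py`: the `verify_turls` function and its injectable `opener` callable are assumptions, chosen so that the example runs without ROOT installed. In a real check the opener would attempt to open the TURL with ROOT.

```python
from concurrent.futures import ThreadPoolExecutor


def verify_turls(turls, opener, max_workers=4):
    """Try to open each TURL in parallel; return (trace states, failed TURLs)."""

    def safe_open(turl):
        try:
            return bool(opener(turl))
        except Exception:
            return False  # any error counts as a failed open

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(safe_open, turls))  # preserves input order

    traces = {
        turl: ("FOUND_ROOT" if ok else "FAILED_REMOTE_OPEN")
        for turl, ok in zip(turls, outcomes)
    }
    failed = [turl for turl, ok in zip(turls, outcomes) if not ok]
    return traces, failed
```

The returned `failed` list corresponds to the TURLs that would be removed from the PFC, and `traces` to the per-file trace state.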
When the TURL list exceeds 500 entries (e.g. on large merge jobs), the list is written to a file (turls.txt) and passed via --turl-file rather than on the command line, to avoid OS argument-length limits.
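The threshold logic might be sketched as below. The 500-entry limit, the `turls.txt` file name, and the `--turl-file` option are from the text; the function name, the one-TURL-per-line file format, and the inline `--turls` option with comma separation are assumptions made for the example.

```python
import os

MAX_INLINE_TURLS = 500  # threshold quoted in the text above


def turl_arguments(turls, workdir):
    """Return the argument list used to hand TURLs to the verification script."""
    if len(turls) > MAX_INLINE_TURLS:
        path = os.path.join(workdir, "turls.txt")
        with open(path, "w", encoding="utf-8") as fh:
            # Assumption: one TURL per line.
            fh.write("\n".join(turls) + "\n")
        return ["--turl-file", path]
    # Assumption: hypothetical inline option, comma-separated.
    return ["--turls", ",".join(turls)]
```

Writing the list to a file keeps the command line short regardless of how many inputs a merge job has.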
If a direct-I/O file open fails inside the payload after the pre-flight check has passed, the error will appear in payload.stdout as an XRootD or ROOT error message rather than as a pilot stage-in error. The pilot's diagnose module scans the payload stdout for known patterns (such as TNetXNGFile::Open ERROR, No servers available, Unable to open ROOT file) for jobs where has_remoteio() is true, and classifies matching failures as STAGEINFAILED with the error line recorded as diagnostics. Without this, such failures would be reported as the less informative UNKNOWNPAYLOADFAILURE.
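The classification step can be illustrated with a small sketch. The three patterns and the `STAGEINFAILED` label are taken from the text; the function name `classify_remoteio_failure`, the plain substring matching, and the example log line are assumptions and do not reproduce the pilot's diagnose module.

```python
# Patterns quoted in the text above; the real list may be longer.
REMOTE_IO_PATTERNS = (
    "TNetXNGFile::Open ERROR",
    "No servers available",
    "Unable to open ROOT file",
)


def classify_remoteio_failure(stdout_lines, has_remoteio):
    """Return ('STAGEINFAILED', line) for the first matching line, else None."""
    if not has_remoteio:
        return None  # only jobs with remote_io inputs are scanned
    for line in stdout_lines:
        if any(pattern in line for pattern in REMOTE_IO_PATTERNS):
            return "STAGEINFAILED", line.strip()
    return None
```

A caller would fall back to its generic payload-failure handling when the function returns `None`.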
| Source | Field | Type | Description |
|---|---|---|---|
| CRIC / queuedata | `direct_access_lan` | bool | Enable direct I/O for LAN replicas at this queue |
| CRIC / queuedata | `direct_access_wan` | bool | Enable direct I/O for WAN replicas at this queue |
| Job definition | `transferType` | string | Protocol preference and direct I/O mode (see table above) |
| Job definition | `prodDBlockToken` | string | Set to `local` to force copy-to-scratch for a specific file |
| Job parameters | `--accessmode=direct` | flag | Override to force direct I/O (analysis jobs) |
| Job parameters | `--accessmode=copy` / `--useLocalIO` | flag | Override to force copy-to-scratch |
| Environment | `ALRB_XCACHE_PROXY` | string | XCache proxy URL prepended to LAN TURLs |
| Pilot config | `remotefileverification_log` | string | Enables pre-flight TURL verification when set |
transfertype=Null / "" → copy-to-scratch (Rucio, default protocol selection)
transfertype=file → copy-to-scratch via POSIX link (Rucio --protocol=file)
transfertype=direct → direct I/O, root:// preferred (existing default)
transfertype=root → direct I/O, root:// explicitly preferred
transfertype=davs → direct I/O, davs:// preferred (useful for ML/HDF5 payloads)
transfertype=davs,root → direct I/O, davs:// first then root://
transfertype=root,davs → direct I/O, root:// first then davs://
In all direct I/O cases, the pilot falls back to secondary protocols (/1, /2, … in CRIC priority order) if no replica is available for the preferred protocol.