Add SOPClassUID and TransferSyntaxUID to the main index #125

Merged
fedorov merged 6 commits into main from add-sopclass-transfersyntax on Mar 24, 2026

@fedorov fedorov commented Mar 23, 2026

No description provided.

fedorov and others added 6 commits March 23, 2026 17:13
Add two series-level DICOM attributes to the main index query:

- SOPClassUID: unambiguous object type identifier, more specific than
  Modality for distinguishing object types (e.g., Enhanced CT vs legacy
  CT, parametric maps, structured reports)
- TransferSyntaxUID: encoding/compression of stored instances (e.g.,
  Explicit VR Little Endian, JPEG 2000, HTJ2K), useful for tool
  compatibility and performance planning

Both are mandatory DICOM attributes with very low cardinality, so the
size impact on the parquet file is negligible (< 1MB combined).

Refs #124

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SM series almost always contain instances with mixed transfer syntaxes
(93.7% of SM series) — e.g., uncompressed for thumbnails/labels and
JPEG/JPEG2000 for tiles. ANY_VALUE would arbitrarily pick one, which
is misleading. STRING_AGG(DISTINCT ...) captures all values as a
comma-separated string while behaving identically to ANY_VALUE for
non-SM series (single value, no comma).
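
The aggregation behavior described above can be sketched in plain Python. This is an illustration only, not the query from this PR: the series identifiers and rows are hypothetical, and the sorting is a choice made here for deterministic output (STRING_AGG does not guarantee an order without ORDER BY).

```python
from collections import defaultdict

# Hypothetical (SeriesInstanceUID, TransferSyntaxUID) rows, one per instance.
# 1.2.840.10008.1.2.1 is Explicit VR Little Endian,
# 1.2.840.10008.1.2.4.50 is JPEG Baseline (Process 1).
instances = [
    ("series-ct", "1.2.840.10008.1.2.1"),
    ("series-ct", "1.2.840.10008.1.2.1"),
    ("series-sm", "1.2.840.10008.1.2.1"),
    ("series-sm", "1.2.840.10008.1.2.4.50"),
    ("series-sm", "1.2.840.10008.1.2.4.50"),
]


def aggregate_transfer_syntaxes(rows):
    """Emulate STRING_AGG(DISTINCT TransferSyntaxUID) grouped by series."""
    per_series = defaultdict(set)
    for series_uid, transfer_syntax_uid in rows:
        per_series[series_uid].add(transfer_syntax_uid)
    # Sorted only so this sketch produces a stable string.
    return {
        series_uid: ",".join(sorted(uids))
        for series_uid, uids in per_series.items()
    }


aggregated = aggregate_transfer_syntaxes(instances)
for series_uid, value in aggregated.items():
    print(series_uid, value)
```

A series with a single transfer syntax yields a plain UID with no comma, so consumers that only handle non-SM data see the same value ANY_VALUE would have produced.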

Refs #124

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add sop_class_name and transfer_syntax_name columns derived from CASE
mappings of SOPClassUID and TransferSyntaxUID respectively. Names are
verified against pydicom's UID dictionary.

The ELSE clause uses ERROR() so the query fails loudly if an unmapped
UID appears in a future IDC version, forcing the mapping to be updated
rather than silently falling back to the raw UID.
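
The fail-loudly idea can be sketched in Python as a lookup that raises on an unmapped UID, which is the same role ERROR() plays in the ELSE clause. The helper name `uid_name` and the two small mappings are hypothetical and cover only a few well-known UIDs; the mapping in this PR is a SQL CASE expression and is presumably much longer.

```python
# Hypothetical, deliberately tiny mappings (the real CASE covers more UIDs).
SOP_CLASS_NAMES = {
    "1.2.840.10008.5.1.4.1.1.2": "CT Image Storage",
    "1.2.840.10008.5.1.4.1.1.2.1": "Enhanced CT Image Storage",
}
TRANSFER_SYNTAX_NAMES = {
    "1.2.840.10008.1.2.1": "Explicit VR Little Endian",
    "1.2.840.10008.1.2.4.90": "JPEG 2000 Image Compression (Lossless Only)",
}


def uid_name(uid, mapping, kind):
    """Return the human-readable name for a UID, or fail loudly.

    Raising here mirrors ERROR() in the ELSE branch: an unmapped UID
    stops the build instead of leaking a raw UID into the name column.
    """
    try:
        return mapping[uid]
    except KeyError:
        raise ValueError(f"Unmapped {kind}: {uid}") from None


print(uid_name("1.2.840.10008.5.1.4.1.1.2.1", SOP_CLASS_NAMES, "SOPClassUID"))
print(uid_name("1.2.840.10008.1.2.1", TRANSFER_SYNTAX_NAMES, "TransferSyntaxUID"))
```

The trade-off is that a new IDC version containing an unseen UID breaks the index build until the mapping is extended, which is the intended behavior here.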

Validated against BigQuery: 994,073 series, zero differences in all
original columns, no series lost.

Refs #124

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The INNER JOIN with dicom_metadata_curated (used only for
BodyPartExamined) would silently drop series if a future IDC version
has incomplete curated coverage. LEFT JOIN preserves all series,
with BodyPartExamined as NULL when curated data is unavailable.
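
The difference between the two joins can be reproduced with a small in-memory SQLite database. This is a sketch under stated assumptions: the table names `series_metadata` and `curated`, the join key, and the rows are hypothetical, and SQLite stands in for BigQuery only to show the join semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE series_metadata (SeriesInstanceUID TEXT, Modality TEXT);
    CREATE TABLE curated (SeriesInstanceUID TEXT, BodyPartExamined TEXT);
    INSERT INTO series_metadata VALUES ('series-1', 'CT'), ('series-2', 'MR');
    -- 'series-2' has no curated row, simulating incomplete curated coverage.
    INSERT INTO curated VALUES ('series-1', 'CHEST');
    """
)


def joined_rows(join_type):
    """Run the same query with the given join type ('INNER' or 'LEFT')."""
    query = f"""
        SELECT m.SeriesInstanceUID, c.BodyPartExamined
        FROM series_metadata AS m
        {join_type} JOIN curated AS c
          ON m.SeriesInstanceUID = c.SeriesInstanceUID
        ORDER BY m.SeriesInstanceUID
    """
    return conn.execute(query).fetchall()


inner_rows = joined_rows("INNER")  # drops the series without curated data
left_rows = joined_rows("LEFT")    # keeps it, with BodyPartExamined as NULL
print(inner_rows)
print(left_rows)
```

With full curated coverage both joins return the same rows, which is consistent with the unchanged output reported for v23.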

No change in output for v23 (both tables have identical instance
coverage: 46,870,903 instances, zero orphans in either direction).

Refs #124

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The step summary was not visible on PR-triggered CD runs. Using
tee -a ensures the report appears in both the step log output and
the GitHub step summary, making it reliably visible regardless of
event type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update all actions to versions that use Node.js 24, ahead of the
June 2, 2026 deprecation deadline for Node.js 20:

- actions/checkout: v4 → v6
- actions/setup-python: v5 → v6
- actions/upload-artifact: v4 → v7
- actions/download-artifact: v4 → v8
- google-github-actions/auth: v2 → v3
- google-github-actions/upload-cloud-storage: v2 → v3

Also add google-cloud-bigquery-storage to pip install in the CD
workflow to silence the "BigQuery Storage module not found" warning
and use the faster gRPC-based API instead of the REST fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@fedorov fedorov merged commit eae624d into main Mar 24, 2026
10 checks passed
@fedorov fedorov deleted the add-sopclass-transfersyntax branch March 24, 2026 13:24