ENH: deterministic ordering for better Parquet compression by fedorov · Pull Request #128 · ImagingDataCommons/idc-index-data

fedorov · 2026-03-27T20:17:40Z

Summary

- Add `ORDER BY` to all `ARRAY_AGG`/`STRING_AGG` calls that lacked deterministic ordering (21 aggregations across 7 files), ensuring identical logical arrays are always encoded the same way across query runs.
- Add compression-friendly final `ORDER BY` to all queries (9 files), clustering rows by semantically meaningful columns (anatomy, staining, collection, algorithm type) rather than arbitrary UIDs.
- Replace `STRING_AGG(DISTINCT collection_id)` with `ANY_VALUE(collection_id)` in `sm_index.sql`, since a series always belongs to one collection.
- Document that independently-aggregated `DISTINCT` columns in `seg_index.sql` do not preserve positional correspondence.

These changes improve Parquet dictionary encoding and run-length encoding without removing any content from query results.
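
As a rough illustration of why ordering matters for these encodings, the sketch below (Python, not part of this PR) counts the runs a simple run-length encoder would emit for a column before and after sorting. It also shows that deduplicating and sorting aggregated values makes equal sets produce identical arrays. The column values and the helper names `count_runs` and `canonical_array` are hypothetical. Real Parquet writers apply their own encoding heuristics, so this is meant for intuition only.

```python
from itertools import groupby


def count_runs(column):
    """Number of runs a simple run-length encoder would emit for this column."""
    return sum(1 for _ in groupby(column))


def canonical_array(values):
    """Rough analog of ARRAY_AGG(DISTINCT x ORDER BY x): dedupe, then sort."""
    return sorted(set(values))


# Hypothetical 'staining' column with rows in arbitrary (UID-like) order.
unsorted_rows = ["H&E", "IHC", "H&E", "IHC", "H&E", "IHC"]
sorted_rows = sorted(unsorted_rows)

runs_unsorted = count_runs(unsorted_rows)  # every row starts a new run
runs_sorted = count_runs(sorted_rows)      # one run per distinct value

# The same logical set arriving in different orders yields one canonical array.
a = canonical_array(["CT", "MR", "CT"])
b = canonical_array(["MR", "CT"])
```

Fewer runs and fewer distinct array encodings are what run-length and dictionary encoding can exploit. The actual size reduction depends on the data and the writer settings.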

Files changed

| File | Changes |
| --- | --- |
| `scripts/sql/idc_index.sql` | Add `ORDER BY collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID` |
| `scripts/sql/prior_versions_index.sql` | Add `ORDER BY version` to `STRING_AGG` + final `ORDER BY collection_id, min_idc_version, Modality` |
| `scripts/sql/collections_index.sql` | Add `ORDER BY collection_id` |
| `scripts/sql/analysis_results_index.sql` | Add `ORDER BY analysis_result_id` |
| `assets/sm_index.sql` | Fix 7 aggregations + `ORDER BY primaryAnatomicStructure, staining, collection_id` |
| `assets/sm_instance_index.sql` | Fix 3 aggregations + reorder to `staining, TransferSyntaxUID, SeriesInstanceUID` |
| `assets/seg_index.sql` | Fix 5 aggregations + expand to `ORDER BY SegmentationType, AlgorithmType, AlgorithmName` |
| `assets/contrast_index.sql` | Fix 3 aggregations + `ORDER BY ContrastBolusAgent, Ingredient, Route` |
| `assets/ann_index.sql` | Add `ORDER BY AnnotationCoordinateType, referenced_SeriesInstanceUID` |
| `assets/ann_group_index.sql` | Add `ORDER BY category, type, algorithm` |
| `scripts/gdc/idc_gdc_selection.sql` | Fix 1 `ARRAY_AGG` |

Test plan

- Verify queries execute without errors against BigQuery.
- Compare row counts before/after to confirm no data loss.
- Compare Parquet file sizes before/after to measure compression improvement.
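
The row-count check could be scripted roughly as follows, assuming both query results have already been loaded into lists of tuples. The loading step, the sample rows, and the names `normalize_row` and `same_rows_ignoring_order` are hypothetical and not part of this PR. Because the PR changes the order of rows and of array elements, the sketch compares rows as a multiset and sorts list-valued cells before comparing.

```python
from collections import Counter


def normalize_row(row):
    """Make a row hashable; sort list cells so array order does not matter."""
    return tuple(
        tuple(sorted(cell)) if isinstance(cell, list) else cell for cell in row
    )


def same_rows_ignoring_order(before, after):
    """True when both results hold the same rows, regardless of row order."""
    if len(before) != len(after):
        return False
    return Counter(map(normalize_row, before)) == Counter(map(normalize_row, after))


# Hypothetical rows: (collection_id, Modality, list of versions).
before = [("tcga_luad", "SM", ["v2", "v1"]), ("nlst", "CT", ["v1"])]
after = [("nlst", "CT", ["v1"]), ("tcga_luad", "SM", ["v1", "v2"])]

rows_match = same_rows_ignoring_order(before, after)
```

File sizes could be compared separately, for example with `os.path.getsize` on the Parquet files produced before and after the change.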

🤖 Generated with Claude Code

… Parquet compression

Add ORDER BY to all ARRAY_AGG/STRING_AGG calls that lacked it, ensuring identical logical arrays are always encoded the same way (improving dictionary encoding).

Add compression-friendly final ORDER BY to all queries, clustering rows by semantically meaningful columns (anatomy, staining, collection) rather than arbitrary UIDs to maximize run-length and dictionary encoding in the output Parquet files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ibility

BigQuery does not support positional ORDER BY inside aggregate functions. Replace all ORDER BY 1 with the actual aggregated expression.

Also fix final ORDER BY clauses that referenced ARRAY columns (not sortable in BQ) by using [SAFE_OFFSET(0)] to extract the first element.

Add BQ dry-run validation guidance to CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fedorov · 2026-03-27T20:59:56Z

original	updated

fedorov and others added 2 commits March 27, 2026 16:17

fedorov merged commit 4490925 into main Mar 27, 2026
10 checks passed

fedorov deleted the enh/deterministic-ordering-for-parquet-compression branch March 27, 2026 21:00
