ENH: deterministic ordering for better Parquet compression#128

Merged: fedorov merged 2 commits into main from enh/deterministic-ordering-for-parquet-compression on Mar 27, 2026

fedorov commented Mar 27, 2026

Summary

  • Add ORDER BY to all ARRAY_AGG/STRING_AGG calls that lacked deterministic ordering (21 aggregations across 7 files), ensuring identical logical arrays are always encoded the same way across query runs
  • Add compression-friendly final ORDER BY to all queries (9 files), clustering rows by semantically meaningful columns (anatomy, staining, collection, algorithm type) rather than arbitrary UIDs
  • Replace STRING_AGG(DISTINCT collection_id) with ANY_VALUE(collection_id) in sm_index.sql since a series always belongs to one collection
  • Document that independently-aggregated DISTINCT columns in seg_index.sql do not preserve positional correspondence

These changes improve Parquet dictionary encoding and run-length encoding without removing any content from query results.
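The effect of the final ORDER BY on run-length encoding can be illustrated with a small Python sketch. It is a toy model, not the Parquet encoder: the rows, UIDs, and staining values are made up for illustration, and `run_count` is a hypothetical helper that counts how many runs a run-length encoder would have to emit for one column.

```python
from itertools import groupby


def run_count(values):
    """Count the runs of consecutive equal values in a column."""
    return sum(1 for _ in groupby(values))


# Hypothetical rows: (SeriesInstanceUID, staining). The UIDs are arbitrary,
# so ordering by UID scatters equal staining values across the output.
rows = [
    ("1.2.840.0", "H&E"),
    ("1.2.840.1", "IHC"),
    ("1.2.840.2", "H&E"),
    ("1.2.840.3", "IHC"),
    ("1.2.840.4", "H&E"),
    ("1.2.840.5", "IHC"),
]

# Ordering by the arbitrary identifier versus by the low-cardinality column first.
by_uid = sorted(rows, key=lambda r: r[0])
by_staining = sorted(rows, key=lambda r: (r[1], r[0]))

runs_by_uid = run_count(r[1] for r in by_uid)
runs_by_staining = run_count(r[1] for r in by_staining)

print(runs_by_uid, runs_by_staining)
```

Both orderings contain exactly the same rows; only the number of runs in the `staining` column changes, which is the property the PR relies on.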

Files changed

| File | Changes |
| --- | --- |
| scripts/sql/idc_index.sql | Add ORDER BY collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID |
| scripts/sql/prior_versions_index.sql | Add ORDER BY version to STRING_AGG + final ORDER BY collection_id, min_idc_version, Modality |
| scripts/sql/collections_index.sql | Add ORDER BY collection_id |
| scripts/sql/analysis_results_index.sql | Add ORDER BY analysis_result_id |
| assets/sm_index.sql | Fix 7 aggregations + ORDER BY primaryAnatomicStructure, staining, collection_id |
| assets/sm_instance_index.sql | Fix 3 aggregations + reorder to staining, TransferSyntaxUID, SeriesInstanceUID |
| assets/seg_index.sql | Fix 5 aggregations + expand to ORDER BY SegmentationType, AlgorithmType, AlgorithmName |
| assets/contrast_index.sql | Fix 3 aggregations + ORDER BY ContrastBolusAgent, Ingredient, Route |
| assets/ann_index.sql | Add ORDER BY AnnotationCoordinateType, referenced_SeriesInstanceUID |
| assets/ann_group_index.sql | Add ORDER BY category, type, algorithm |
| scripts/gdc/idc_gdc_selection.sql | Fix 1 ARRAY_AGG |

Test plan

  • Verify queries execute without errors against BigQuery
  • Compare row counts before/after to confirm no data loss
  • Compare Parquet file sizes before/after to measure compression improvement

🤖 Generated with Claude Code

fedorov and others added 2 commits March 27, 2026 16:17
… Parquet compression

Add ORDER BY to all ARRAY_AGG/STRING_AGG calls that lacked it, ensuring
identical logical arrays are always encoded the same way (improving
dictionary encoding). Add compression-friendly final ORDER BY to all
queries, clustering rows by semantically meaningful columns (anatomy,
staining, collection) rather than arbitrary UIDs to maximize run-length
and dictionary encoding in the output Parquet files.
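Why ordering inside an aggregation helps dictionary encoding can be sketched in Python. This is a simplified model, assuming that without ORDER BY the aggregate may emit elements in whatever order the rows happen to arrive; `aggregate` is a hypothetical stand-in for STRING_AGG(DISTINCT ...), and the version strings are invented.

```python
def aggregate(values, ordered):
    """Model STRING_AGG(DISTINCT v, ',') with or without an ORDER BY."""
    # dict.fromkeys drops duplicates while keeping arrival order.
    distinct = list(dict.fromkeys(values))
    if ordered:
        distinct.sort()
    return ",".join(distinct)


# The same logical set of values, arriving in a different order on two runs.
run_a = ["v17", "v16", "v18", "v17"]
run_b = ["v18", "v17", "v16"]

unordered_outputs = {aggregate(run_a, False), aggregate(run_b, False)}
ordered_outputs = {aggregate(run_a, True), aggregate(run_b, True)}

print(len(unordered_outputs), len(ordered_outputs))
```

Without ordering, the same set yields two different strings, so a dictionary encoder needs two entries; with ordering it yields one.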

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ibility

BigQuery does not support positional ORDER BY inside aggregate functions.
Replace all ORDER BY 1 with the actual aggregated expression. Also fix
final ORDER BY clauses that referenced ARRAY columns (not sortable in BQ)
by using [SAFE_OFFSET(0)] to extract the first element. Add BQ dry-run
validation guidance to CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
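
The `[SAFE_OFFSET(0)]` workaround from the second commit can be modeled in Python as sorting by the first array element, with an empty array yielding a null key. This is a sketch of the idea only: `safe_offset`, `first_element_key`, and the `uid`/`labels` fields are hypothetical names, and placing null keys first mirrors BigQuery's default for ascending sorts.

```python
def safe_offset(arr, index):
    """Model arr[SAFE_OFFSET(index)]: None instead of an error when out of range."""
    return arr[index] if 0 <= index < len(arr) else None


def first_element_key(arr):
    """Sort key on the first element; rows with an empty array sort first."""
    first = safe_offset(arr, 0)
    return (first is not None, first if first is not None else "")


# Hypothetical rows with an array-valued column that cannot be sorted directly.
rows = [
    {"uid": "a", "labels": ["IHC"]},
    {"uid": "b", "labels": []},
    {"uid": "c", "labels": ["H&E", "IHC"]},
]

ordered_rows = sorted(rows, key=lambda r: first_element_key(r["labels"]))
print([r["uid"] for r in ordered_rows])
```

Only the first element takes part in the ordering, so rows whose arrays share a first element are not ordered further by this key.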

fedorov commented Mar 27, 2026

[Two images labeled "original" and "updated"; images not preserved]

fedorov merged commit 4490925 into main on Mar 27, 2026
10 checks passed
fedorov deleted the enh/deterministic-ordering-for-parquet-compression branch on March 27, 2026 21:00