Add hash operator class for cross-document equality #164
Conversation
Add 20 tests for the hash operator class covering hash joins, GROUP BY, DISTINCT, UNION, IN/NOT IN, STE vec unwrapping, NULL handling, cross-type equality, forced hash join strategy, and error paths for missing index terms and multi-element STE vectors.
Introduce eql_v2.hash_encrypted() which computes a 32-bit integer hash from HMAC-256 or Blake3 index terms, enabling PostgreSQL hash-based operations (hash joins, GROUP BY, DISTINCT, UNION) on encrypted columns. Register the hash operator class with the same-type equality operator so the planner can choose hash join strategy for encrypted = encrypted.
Remove the HASHES flag from cross-type (encrypted/jsonb) equality, inequality, and LIKE operators. Hash joins require independent hashing of each side, which is only possible for same-type operators registered in the hash operator class. Cross-type operators retain MERGES for merge join support.
Change the Supabase build exclusion pattern from `**/operator_class.sql` to `**/*operator_class.sql` so it matches both the btree operator_class.sql and the new hash_operator_class.sql. Include updated protect dependency files.
Replace raw ORE-only test data with create_encrypted_json() which includes hmac/blake3 index terms required for hash-based GROUP BY aggregation now that the hash operator class is registered.
Force-pushed from a676717 to 2042d4d.

@CodeRabbit full review

✅ Actions performed: full review triggered.
📝 Walkthrough
Adds hash support for encrypted values via the new function `eql_v2.hash_encrypted()` and a hash operator class for the encrypted type.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client
    participant PG as PostgreSQL
    participant OpClass as "eql_v2.encrypted_hash_operator_class"
    participant HashFn as "eql_v2.hash_encrypted"

    Client->>PG: Submit query using hash operation (HASH JOIN / GROUP BY / DISTINCT)
    PG->>OpClass: Resolve hash operator class for encrypted type
    PG->>HashFn: Call eql_v2.hash_encrypted(encrypted_val)
    HashFn->>HashFn: Convert value to STE Vec
    alt Blake3 index exists
        HashFn->>HashFn: Return hashtext(blake3_value)
    else HMAC-256 index exists
        HashFn->>HashFn: Return hashtext(hmac_256_value)
    else Neither exists
        HashFn->>HashFn: Raise exception
    end
    HashFn->>PG: Return integer hash
    PG->>PG: Use hash for join/grouping/distinct
    PG->>Client: Return results
```
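The fallback chain in the diagram can be modeled outside the database. Below is a minimal Rust sketch; the `fnv1a32` helper is a stand-in for PostgreSQL's `hashtext()`, and the term values are illustrative assumptions, not real index terms:

```rust
// Toy model of the fallback chain in eql_v2.hash_encrypted.
// fnv1a32 stands in for PostgreSQL's hashtext().
fn fnv1a32(s: &str) -> u32 {
    s.bytes()
        .fold(0x811c9dc5u32, |h, b| (h ^ b as u32).wrapping_mul(0x0100_0193))
}

// Pick the Blake3 term if present, else the HMAC-256 term, else error out,
// mirroring the alt/else branches in the diagram.
fn hash_encrypted(blake3: Option<&str>, hmac_256: Option<&str>) -> Result<i32, &'static str> {
    match blake3.or(hmac_256) {
        Some(term) => Ok(fnv1a32(term) as i32),
        None => Err("no blake3 or hmac_256 index term"),
    }
}

fn main() {
    assert!(hash_encrypted(Some("b3-term"), None).is_ok());
    assert!(hash_encrypted(None, Some("hm-term")).is_ok());
    assert!(hash_encrypted(None, None).is_err()); // error path: no hashable term
    println!("ok");
}
```

The error branch matters for the operator class: a value with neither index term cannot participate in hash-based plans at all.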
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed.
hash_encrypted previously preferred HMAC over Blake3, but compare() uses the first index term present in BOTH operands (ORE > HMAC > Blake3). When value A has hm+b3 and value B has only b3, compare uses Blake3 (equal), but hash_encrypted(A) used HMAC while hash_encrypted(B) used Blake3 — producing different hashes for equal values. Reversed priority to Blake3-first so any two values that compare equal will always hash identically. Updated tests to reflect new priority.
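The hash/compare mismatch described above can be illustrated with a small Rust model. The struct and term strings below are hypothetical stand-ins for real index terms; the point is only the selection order:

```rust
// Toy model of why hash and compare must agree on index-term priority.
struct Enc {
    hmac: Option<&'static str>,
    blake3: Option<&'static str>,
}

// Old selection order: HMAC first, then Blake3.
fn term_hmac_first(v: &Enc) -> &'static str {
    v.hmac.or(v.blake3).expect("no hashable index term")
}

// New selection order: Blake3 first, then HMAC.
fn term_blake3_first(v: &Enc) -> &'static str {
    v.blake3.or(v.hmac).expect("no hashable index term")
}

fn main() {
    let a = Enc { hmac: Some("hm-1"), blake3: Some("b3-1") }; // hm + b3
    let b = Enc { hmac: None, blake3: Some("b3-1") };         // b3 only
    // compare() matches a and b via Blake3, so their hash inputs must agree.
    assert_ne!(term_hmac_first(&a), term_hmac_first(&b));     // old: equal values hashed differently
    assert_eq!(term_blake3_first(&a), term_blake3_first(&b)); // new: consistent hashing
    println!("ok");
}
```

This is exactly the hash-join invariant PostgreSQL relies on: any two values the equality operator reports as equal must hash to the same bucket.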
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/sqlx/tests/hash_operator_tests.rs`:
- Around line 586-600: The SET LOCAL statements and the join query must run on
the same connection inside an explicit transaction: begin a transaction (e.g.,
let mut tx = pool.begin().await?), run sqlx::query("SET LOCAL enable_nestloop =
off").execute(&mut tx).await?, sqlx::query("SET LOCAL enable_mergejoin =
off").execute(&mut tx).await?, then run the join with
sqlx::query_scalar(...).fetch_one(&mut tx).await?. Finally commit the
transaction (tx.commit().await?). Keep the existing error context ("forced hash
join failed") on the fetch_one call.
🧹 Nitpick comments (1)
tests/sqlx/tests/hash_operator_tests.rs (1)
183-213: Avoid probabilistic hash inequality assertions.
`hash1 != hash3` is extremely likely but not guaranteed for 32-bit hashes. Consider removing the probabilistic assertion to prevent rare flakes. Proposed change:

```diff
-    // Different encrypted values may produce different hashes (very likely but not guaranteed)
-    let hash3: i32 = sqlx::query_scalar("SELECT eql_v2.hash_encrypted(create_encrypted_json(2))")
-        .fetch_one(&pool)
-        .await
-        .context("hash_encrypted call 3 failed")?;
-
-    // While hash collisions are theoretically possible, these test values should differ
-    assert_ne!(
-        hash1, hash3,
-        "Different encrypted values should (almost certainly) produce different hashes"
-    );
```
Add 3 end-to-end tests exercising real SQL query paths with mixed-index rows (hm+b3 vs b3-only for the same logical value):
- mixed_index_hash_join: forced hash join finds the match
- mixed_index_group_by_dedup: GROUP BY merges into one group
- mixed_index_union_dedup: UNION deduplicates to one row

These cover the execution paths that the P1 bug would break, beyond the direct function-level test in hash_function_uses_blake3_first.
- Wrap SET LOCAL + join query in explicit transactions in forced_hash_join_via_planner_hints and mixed_index_hash_join so planner hints are guaranteed to be active for the query.
- Remove the probabilistic hash inequality assertion in hash_function_directly that could flake on 32-bit hash collisions.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/sqlx/tests/hash_operator_tests.rs`:
- Around line 391-414: The test hmac_and_blake3_produce_different_hashes
currently uses assert_ne! on two i32 hashes returned by
eql_v2.hash_encrypted(create_encrypted_json(...)) which leaves a tiny flake risk
from 32‑bit collisions; fix by replacing the non-deterministic inequality check
with a deterministic check against fixed expected values (compute the expected
hash outputs for create_encrypted_json(1, 'hm') and create_encrypted_json(1,
'b3') once and use assert_eq! on the scalar results), or if a longer digest is
available, change the SQL to return the full digest/bytea and compare those
bytes instead; if neither is possible, add a clear comment in
hmac_and_blake3_produce_different_hashes acknowledging the tiny collision risk
and keep the assertion.
```rust
#[sqlx::test(fixtures(path = "../fixtures", scripts("encrypted_json")))]
async fn hmac_and_blake3_produce_different_hashes(pool: PgPool) -> Result<()> {
    // Test: HMAC and Blake3 code paths produce different hashes for same id
    // Catches regression where the wrong branch is taken

    let hash_hmac: i32 =
        sqlx::query_scalar("SELECT eql_v2.hash_encrypted(create_encrypted_json(1, 'hm'))")
            .fetch_one(&pool)
            .await
            .context("hash with hmac-only failed")?;

    let hash_b3: i32 =
        sqlx::query_scalar("SELECT eql_v2.hash_encrypted(create_encrypted_json(1, 'b3'))")
            .fetch_one(&pool)
            .await
            .context("hash with blake3-only failed")?;

    assert_ne!(
        hash_hmac, hash_b3,
        "HMAC and Blake3 should produce different hashes for same id"
    );

    Ok(())
}
```
Rare flake risk from 32‑bit hash collision.
assert_ne! on two 32‑bit hashes can collide (very low probability). If you want to eliminate even tiny flake risk, consider asserting against a fixed test vector or adding a note acknowledging the collision risk.
freshtonic left a comment:
I'm concerned about the 32 bit hash size due to weak collision resistance. ~80K items is all that is required for a 50% chance of a collision.
This might not be a problem if the operator class falls back to an equality check on the full index term when the smaller hash collides.
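The ~80K figure follows from the standard birthday-bound approximation. A quick sketch (the function and numbers below are illustrative, not part of the PR):

```rust
// Birthday-bound approximation: probability that at least one pair of n
// uniformly distributed b-bit hash values collides.
fn collision_probability(n: u64, bits: u32) -> f64 {
    let space = (2f64).powi(bits as i32);
    1.0 - (-(n as f64) * (n as f64 - 1.0) / (2.0 * space)).exp()
}

fn main() {
    // Roughly 77K uniform 32-bit hashes give a ~50% chance of a collision.
    println!("{:.2}", collision_probability(77_163, 32)); // ≈ 0.50
    println!("{:.2}", collision_probability(80_000, 32)); // ≈ 0.53
}
```

Note that collisions in a hash join bucket are normally resolved by re-checking the join qual (the equality operator) on the candidate pair, so a 32-bit hash affects performance rather than correctness as long as that recheck happens.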

Problem

PostgreSQL hash-based operations (hash joins, GROUP BY, DISTINCT, UNION) on eql_v2_encrypted columns were not supported. The planner could only use nested loop or merge join strategies for encrypted column equality, and the HASHES flag was incorrectly set on cross-type operators where independent hashing of each side is not possible.
Solution

- Introduce eql_v2.hash_encrypted(), which computes a 32-bit integer hash from HMAC-256 or Blake3 index terms, with HMAC priority
- Register a hash operator class for the encrypted type with the same-type equality operator
- Remove the HASHES flag from cross-type operators (<>, ~~, ~~*), since hash joins require independent hashing of each side
- Update the Supabase build exclusion pattern to also match hash_operator_class.sql