[BREAKING] Unify hashing across all adapters, fix hash case sensitivity, add disable_hwm & source_is_single_batch to ma_sat, other bugfixes #325

Open
wants to merge 41 commits into
base: main

tkiehn
Collaborator

@tkiehn tkiehn commented Feb 18, 2025

!CONTAINS BREAKING CHANGES!

To be released in v2.0.0

Breaking changes

The changes introduced here produce different hash results on almost all adapters.
We recommend reloading the Data Vault from the PSA.
Otherwise, expect fake deltas in the form of new hashkeys for existing business keys or new hashdiffs for unchanged data.

  • Case (in-)sensitivity for hashkeys and hashdiffs was implemented the wrong way around. This is now fixed.
    • !This leads to breaking changes in the hash results
      • If you want or need to keep your hash results with the old case sensitivity, switch the corresponding variables from true to false and vice versa. Otherwise you might (depending on your data) receive fake deltas.
      • This applies to all adapters except Fabric & Synapse
  • Unify hashing across all adapters
    • To ensure consistency across all adapters, the concatenation methods were aligned
    • The standardization of strings had to be changed for Exasol & Oracle
      • !All hash results will change for these adapters
    • The base-64 result of MD5() on BigQuery was lowercased before being converted to a string. This made the hash results deviate from the correct value. This is now fixed.
      • !All hash results will change for BigQuery
    • The concatenation had to be changed on Redshift
      • !All hash results except single-column hashkeys will change for Redshift
    • On Fabric & Synapse, the columns to hash are now converted to VARCHAR instead of NVARCHAR.
      • !This leads to changed hash results
    • The hash standardization for multi-active hashdiffs was changed on BigQuery, Exasol, Snowflake, Synapse, Redshift, Fabric, Databricks & Oracle.
      • !This will change hash results for the same data
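To make the impact of the breaking changes above more tangible, here is a minimal Python sketch of three of the effects: the case-sensitivity switch, a changed concatenation delimiter, and lowercasing a base-64 digest. It is not the package's SQL. The helper names, the trim/upper-case standardization, and the delimiters "||" and ";" are assumptions made purely for illustration.

```python
import base64
import hashlib


def hash_columns(values, case_sensitive, delimiter="||"):
    """Hex MD5 over trimmed, delimiter-joined values (illustrative standardization)."""
    parts = [v.strip() for v in values]
    if not case_sensitive:
        # Case-insensitive hashing: normalize the case before hashing.
        parts = [p.upper() for p in parts]
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).hexdigest()


# 1) Case sensitivity: flipping the flag decides whether two spellings collapse.
same = hash_columns(["Customer-1"], False) == hash_columns(["CUSTOMER-1"], False)
different = hash_columns(["Customer-1"], True) != hash_columns(["CUSTOMER-1"], True)

# 2) Concatenation: a new delimiter changes multi-column hashes only,
#    because a single column is never joined with anything.
multi_changed = hash_columns(["A", "B"], True, "||") != hash_columns(["A", "B"], True, ";")
single_unchanged = hash_columns(["A"], True, "||") == hash_columns(["A"], True, ";")

# 3) Encoding: hex digits are case-insensitive, base-64 characters are not.
digest = hashlib.md5(b"CUSTOMER-1").digest()
b64 = base64.b64encode(digest).decode("ascii")
hex_roundtrip = bytes.fromhex(digest.hex().upper().lower()) == digest
# Lowercasing a base-64 string generally decodes to different bytes.
b64_roundtrip = base64.b64decode(b64.lower()) == digest
```

Any of these effects is enough to produce a new hashkey or hashdiff for unchanged source data, which is why a reload from the PSA is the safest migration path.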

New Features

  • Add disable_hwm and source_is_single_batch parameters to multi-active-satellites

Other bugfixes

  • Make the satellite high-water mark (HWM) safe against an external truncate by wrapping MAX(src_ldts) in COALESCE
  • Fix the Fabric ledts calculation by changing the offset from -100 nanoseconds to -1 microsecond
  • Distinguish between hash_default_values for HASHTYPE_MD5 and HASH_MD5 on Exasol
  • Include hash_datatype parameter in calls to hash_default_datatypes in ghost_record_per_datatype where applicable
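The COALESCE fix for the high-water mark can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module rather than any of the supported warehouses, and the table names and fallback timestamp are hypothetical stand-ins, not the macro's actual SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE satellite (hk TEXT, src_ldts TEXT)")  # empty, as after a truncate
conn.execute("CREATE TABLE stage (hk TEXT, src_ldts TEXT)")
conn.execute("INSERT INTO stage VALUES ('a', '2025-02-18 17:20:00')")

# Hypothetical fallback standing in for a "beginning of all times" value.
FALLBACK = "0001-01-01 00:00:00"

# MAX() over an empty table yields NULL ...
raw_hwm = conn.execute("SELECT MAX(src_ldts) FROM satellite").fetchone()[0]

# ... so a filter against it silently drops every staged row.
unsafe_rows = conn.execute(
    "SELECT COUNT(*) FROM stage "
    "WHERE src_ldts > (SELECT MAX(src_ldts) FROM satellite)"
).fetchone()[0]

# Wrapping MAX() in COALESCE supplies a comparable lower bound instead.
safe_rows = conn.execute(
    "SELECT COUNT(*) FROM stage "
    "WHERE src_ldts > (SELECT COALESCE(MAX(src_ldts), ?) FROM satellite)",
    (FALLBACK,),
).fetchone()[0]
```

In this sketch the unguarded filter loads nothing after a truncate, while the guarded one picks up the staged row again.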

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation or included information that needs updates (e.g. in the Wiki)

tkiehn added 30 commits February 3, 2025 15:10
…-active standardizations to use a comma as a delimiter between rows for the block hashdiff
…d-source_is_single_batch' into ma_sat_hwm+hash-unification
in some implementations of hash_default_values() the datatype needs to be passed
@tkiehn tkiehn requested a review from tkirschke February 18, 2025 17:20