
Conversation

nj1973 (Contributor) commented Oct 13, 2025

Description of changes

Use connection names in the result handler config instead of the actual connection content. This prevents credentials from being written to config files and is also a cleaner approach.

Issues to be closed

Closes #1608

Checklist

  • I have followed the CONTRIBUTING Guide.
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated any relevant documentation to reflect my changes, if applicable
  • I have added unit and/or integration tests relevant to my change as needed
  • I have already checked locally that all unit tests and linting are passing (use the tests/local_check.sh script)
  • I have manually executed end-to-end testing (E2E) with the affected databases/engines

nj1973 marked this pull request as ready for review October 14, 2025 08:51

nj1973 commented Oct 14, 2025

/gcbrun

@sundar-mudupalli-work

Hi,

I am not clear on how to define and use the Postgres results table. Can you provide the DDL?

Here is what I tried and the error I got:

data-validation validate row -tc postgres -sc bq -tbls=data_validator_issues.issue1610=public.dvt_core_types -pk id -hash='col_datetime' -rh postgres.public.results
...
File "/home/user/professional-services-data-validator/env/lib/python3.12/site-packages/data_validation/result_handlers/postgres.py", line 162, in _insert_postgres
    _ = clients.get_ibis_table(self._client, schema_name, table_name)
...
    raise NotImplementedError(
NotImplementedError: Could not find signature for to_sqla_type: <PGDialect_psycopg2, NoneType>

Here is the results table definition I used; maybe it is not correct:

CREATE TYPE PUBLIC.err_record AS ( code bigint, message text, details text );

CREATE TYPE PUBLIC.label_rec AS ( KEY text, VALUE text );

CREATE TABLE PUBLIC.results
  (
     run_id             TEXT,
     validation_name    TEXT,
     validation_type    TEXT,
     start_time         TIMESTAMP,
     end_time           TIMESTAMP,
     source_table_name  TEXT,
     target_table_name  TEXT,
     source_column_name TEXT,
     target_column_name TEXT,
     aggregation_type   TEXT,
     group_by_columns   TEXT,
     primary_keys       TEXT,
     num_random_rows    BIGINT,
     source_agg_value   TEXT,
     target_agg_value   TEXT,
     difference         DOUBLE PRECISION,
     pct_difference     DOUBLE PRECISION,
     pct_threshold      DOUBLE PRECISION,
     validation_status  TEXT,
     labels             LABEL_REC[],
     configuration_json TEXT,
     error_result       ERR_RECORD
  );


nj1973 commented Oct 16, 2025

You don't need to pre-create the table. The docs only ask you to create a schema:

https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/docs/installation.md#postgresql-1
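
If it helps, the prerequisite really is just the schema. A minimal sketch, assuming the same schema name used in the command below and that the connecting account is allowed to create objects in it (the GRANT and the dvt_user role name are hypothetical, only needed if a separate account runs DVT):

CREATE SCHEMA pso_data_validator_results;

-- Optional: allow a dedicated (hypothetical) DVT account to create and use
-- objects in the results schema.
-- GRANT USAGE, CREATE ON SCHEMA pso_data_validator_results TO dvt_user;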

I tested it and it works for me.

postgres=# \d pso_data_validator_results.results
Did not find any relation named "pso_data_validator_results.results".
data-validation validate row -sc ora_local -tc pg_local -tbls=pso_data_validator.dvt_core_types  --hash="*" -rh=pg_rh.pso_data_validator_results.results --labels=tag1=neil-dvt1

After running the command, the table exists and holds the results:

postgres=# \d pso_data_validator_results.results
                   Table "pso_data_validator_results.results"
       Column       |           Type            | Collation | Nullable | Default
--------------------+---------------------------+-----------+----------+---------
 validation_name    | character varying(3000)   |           | not null |
 validation_type    | character varying(3000)   |           | not null |
 aggregation_type   | character varying(3000)   |           |          |
 source_table_name  | character varying(3000)   |           |          |
 source_column_name | character varying(3000)   |           |          |
 source_agg_value   | character varying(3000)   |           |          |
 target_table_name  | character varying(3000)   |           |          |
 target_column_name | character varying(3000)   |           |          |
 target_agg_value   | character varying(3000)   |           |          |
 group_by_columns   | character varying(3000)   |           |          |
 primary_keys       | character varying(3000)   |           |          |
 num_random_rows    | bigint                    |           |          |
 difference         | double precision          |           |          |
 pct_difference     | double precision          |           |          |
 pct_threshold      | double precision          |           |          |
 validation_status  | character varying(3000)   |           | not null |
 run_id             | character varying(3000)   |           | not null |
 labels             | character varying(3000)[] |           |          |
 start_time         | timestamp with time zone  |           | not null |
 end_time           | timestamp with time zone  |           | not null |

postgres=# select validation_name,labels from pso_data_validator_results.results;
 validation_name |       labels
-----------------+--------------------
 hash__all       | {'tag1=neil-dvt1'}
 hash__all       | {'tag1=neil-dvt1'}
 hash__all       | {'tag1=neil-dvt1'}
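
For anyone who prefers to create the table up front rather than letting DVT create it, here is a DDL sketch reconstructed from the \d output above. It assumes the same schema name and mirrors the structure DVT auto-created in this test; the DDL DVT itself issues may differ slightly.

CREATE TABLE pso_data_validator_results.results
  (
     validation_name    VARCHAR(3000) NOT NULL,
     validation_type    VARCHAR(3000) NOT NULL,
     aggregation_type   VARCHAR(3000),
     source_table_name  VARCHAR(3000),
     source_column_name VARCHAR(3000),
     source_agg_value   VARCHAR(3000),
     target_table_name  VARCHAR(3000),
     target_column_name VARCHAR(3000),
     target_agg_value   VARCHAR(3000),
     group_by_columns   VARCHAR(3000),
     primary_keys       VARCHAR(3000),
     num_random_rows    BIGINT,
     difference         DOUBLE PRECISION,
     pct_difference     DOUBLE PRECISION,
     pct_threshold      DOUBLE PRECISION,
     validation_status  VARCHAR(3000) NOT NULL,
     run_id             VARCHAR(3000) NOT NULL,
     labels             VARCHAR(3000)[],
     start_time         TIMESTAMP WITH TIME ZONE NOT NULL,
     end_time           TIMESTAMP WITH TIME ZONE NOT NULL
  );

And a simple follow-up query to list failing validations (assuming DVT writes 'success'/'fail' values into validation_status):

SELECT run_id, validation_name, validation_status
FROM   pso_data_validator_results.results
WHERE  validation_status = 'fail';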


sundar-mudupalli-work commented Oct 17, 2025

Neil,

Yes, it does work. However, we need some doc changes:

  1. The validation reports section should clearly mention that, for the Postgres result handler, if the table does not exist then the account must have CREATE TABLE privileges in order to create the table and write the results. The docs already say that you need to create a table for BigQuery and provide the schema for the results table. We should also provide the schema in case the customer wants to create the table before using DVT.
  2. I am not clear on the right schema. This line says the schema has a column called error_result, but I don't see that in the Postgres table. Does this JSON schema need to be updated?
  3. Part of the problem is that we are using two different mechanisms to write results: one for Postgres and one for BigQuery. BigQuery has a limitation on the number of mutations per table per day, so for customers running a large number of Cloud Run jobs, using pandas_gbq.to_gbq (which is equivalent to <dataframe>.to_sql) will not work.

Given that there is a customer waiting and the remaining items are only doc changes, I have approved the PR.

Thank you for the excellent work.

Sundar Mudupalli

sundar-mudupalli-work left a comment


Please change the documentation as requested.


nj1973 commented Oct 17, 2025

Please change the documentation as requested.

For item 1 in your list, I have added a link from the validation report section to the existing result handler setup documentation instead of duplicating the setup notes.

Item 2: I believe the error_result field is part of the BigQuery streaming schema and is not applicable to a PostgreSQL table.

Item 3: Yes, it is unfortunate that we can't reliably use load jobs for BigQuery (which would be a big performance improvement, see #1572). I've added a small note to the installation doc.

nj1973 merged commit ef3df2e into develop Oct 17, 2025
4 of 5 checks passed
nj1973 deleted the 1608-result-handler-parameters-in-config-files branch October 17, 2025 08:22

Successfully merging this pull request may close these issues.

Specifying CONNECTION_NAME.DATASET.TABLE for result handler results in parameters including passwords in config files
