Commit 0ea519a

Jim Barlow committed: added functions
1 parent b67e2ea commit 0ea519a

2 files changed: +346 −0 lines changed

pages/functions/data_quality.md

Lines changed: 148 additions & 0 deletions
---
sidebar_position: 2
---

# Data Quality Assurance Functions

## table_date_partitions_query

The `table_date_partitions_query` function is used to validate date partition existence across all tables in a single project, across multiple datasets. It also flags tables with gaps in their date partitions and identifies the specific missing dates.

Argument | Data Type | Description
--- | --- | ---
`project_id` | STRING | The ID of the project to be profiled.
`dataset_names` | ARRAY<STRING> | An array containing the names of the datasets to be profiled.

It can be executed using the following syntax:

```sql
DECLARE project_id STRING;
DECLARE dataset_names ARRAY<STRING>;

SET project_id = 'project_a';
SET dataset_names = ['dataset_a', 'dataset_b', 'dataset_c'];

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.table_date_partitions_query`(
    project_id,
    dataset_names));
```
## row_duplicate_query

The `row_duplicate_query` function is used to identify duplicate rows in any table.

Argument | Data Type | Description
--- | --- | ---
`table_id` | STRING | Fully qualified ID of the table to be profiled.

It can be executed using the following syntax:

```sql
DECLARE table_id STRING;

SET table_id = 'project_a.dataset_a.table_a';

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.row_duplicate_query`(table_id));
```

The result is the original table with some additional metadata columns; the boolean column `duplicate_row_flag` identifies the duplicate rows in the data.
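A minimal follow-up sketch for isolating the flagged rows, assuming the profiled output is first materialised into a temporary table (the `CREATE TEMP TABLE` wrapper and the `profiled_rows` name are illustrative assumptions, not part of the function):

```sql
DECLARE table_id STRING;

SET table_id = 'project_a.dataset_a.table_a';

-- Materialise the profiled output, then keep only the flagged rows.
-- CREATE TEMP TABLE and profiled_rows are assumptions for illustration.
EXECUTE IMMEDIATE CONCAT(
  'CREATE TEMP TABLE profiled_rows AS ',
  (SELECT `datatovalue-tools.us_west1.row_duplicate_query`(table_id)));

SELECT * FROM profiled_rows WHERE duplicate_row_flag;
```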
## column_profile_query

The `column_profile_query` function is used to compute column metrics such as minimum and maximum values, null counts, null percentage, distinct value counts and uniqueness percentage.

Argument | Data Type | Description
--- | --- | ---
`table_id` | STRING | Fully qualified ID of the table to be profiled.

Note that since this function needs to query the `INFORMATION_SCHEMA.COLUMNS` view to get the precise table columns, executing the returned query will return _another_ SQL query. This must in turn be executed in order to obtain the result.

This can be achieved using the following syntax:

```sql
DECLARE table_id, query STRING;

SET table_id = 'project_a.dataset_a.table_a';

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.column_profile_query`(table_id)
) INTO query;

EXECUTE IMMEDIATE (query);
```
## unique_combination_query

The `unique_combination_query` function is used to validate whether combinations of columns are unique in any table. This supports validation of granularity assumptions, as well as unique key development, testing and monitoring.

Argument | Data Type | Description
--- | --- | ---
`table_id` | STRING | Fully qualified ID of the table to be profiled.
`column_names` | ARRAY<STRING> | An array containing the names of the columns to be profiled.

It can be executed using the following syntax:

```sql
DECLARE table_id STRING;
DECLARE column_names ARRAY<STRING>;

SET table_id = 'project_a.dataset_a.table_a';
SET column_names = ['column_a', 'column_b', 'column_c'];

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.unique_combination_query`(table_id, column_names)
);
```
## unique_combination_multi_query

The `unique_combination_multi_query` function is used to validate whether different combinations of columns are unique across multiple tables. This supports validation of granularity assumptions, as well as unique key development, testing and monitoring, and is an extension of the `unique_combination_query` function.

Argument | Data Type | Description
--- | --- | ---
`table_column_combinations` | ARRAY<STRUCT<table_id STRING, column_names ARRAY<STRING>>> | Struct array of inputs containing multiple `table_id` and `column_names` combinations.

It can be executed using the following syntax:

```sql
DECLARE table_column_combinations ARRAY<STRUCT<table_id STRING, column_names ARRAY<STRING>>>;

SET table_column_combinations = [
  ('project_a.dataset_a.table_a', ['column_a']),
  ('project_a.dataset_a.table_a', ['column_a', 'column_b']),
  ('project_a.dataset_a.table_a', ['column_a', 'column_b', 'column_c']),
  ('project_a.dataset_a.table_a', ['column_a', 'column_b', 'column_c', 'column_d'])
];

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.unique_combination_multi_query`(table_column_combinations)
);
```
## metric_sum_query

The `metric_sum_query` function enables tracking of metric sum values for specific numeric columns across all tables in multiple datasets.

Argument | Data Type | Description
--- | --- | ---
`project_id` | STRING | The project ID in which tables are located.
`region` | STRING | The region in which tables are located.
`dataset_names` | ARRAY<STRING> | Dataset names in which tables are located. This argument supports LIKE matching using the '%' wildcard. A NULL or empty array will result in a null query.
`column_names` | ARRAY<STRING> | Column names to include. This argument supports LIKE matching using the '%' wildcard. A NULL or empty array will result in all columns being returned.
`rounding_digits` | INT64 | The number of decimal places to which the result will be rounded. A NULL value will default to 0 decimal places.

Note that the result of this query is another SQL query, which must be executed to obtain the final result.

```sql
DECLARE project_id, region, query STRING;
DECLARE dataset_names, column_names ARRAY<STRING>;
DECLARE rounding_digits INT64;

SET project_id = "project_a";
SET region = "us-west1";
SET dataset_names = ["dataset_prefix_%", "dataset_a", "dataset_b", "dataset_c"];
SET column_names = ["column_prefix_%", "column_a", "column_b"];
SET rounding_digits = 0;

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.metric_sum_query`(project_id, region, dataset_names, column_names, rounding_digits)
) INTO query;

EXECUTE IMMEDIATE (query);
```
Lines changed: 198 additions & 0 deletions
---
sidebar_position: 1
---

# Information Schema Functions

The `infoschema` functions make BigQuery [INFORMATION_SCHEMA](https://cloud.google.com/bigquery/docs/information-schema-intro) metadata accessible via parameterised table functions, without requiring hard-coding of region, project, dataset or table identifiers. This enables them to be used in programmatic, metadata-driven use-cases such as data profiling, monitoring and templated dashboard development.

Functions are available in all Google Cloud regions. Note that in the examples below, the `my_region` placeholder needs to be replaced with a region or multi-region identifier from any [BigQuery region](https://cloud.google.com/bigquery/docs/locations#regions), replacing any dash ('-') with an underscore ('_').

All columns are retained from the associated `INFORMATION_SCHEMA` view, with an additional `id` added as the first column based on the metadata granularity of the response (i.e. `dataset_id` or `table_id`) for ease of subsequent manipulation and integration. Data types and structures are corrected where necessary, and column names for projects, datasets and tables are standardised to `project_id`, `dataset_name` and `table_name`.

- `dataset_id` = `project_id.dataset_name`
- `table_id` = `project_id.dataset_name.table_name`

Additional fields are added in the following scenarios:

- Unix timestamp fields have a `TIMESTAMP` equivalent field added.
- Bytes values have a GiB field added (divided by $1024^3$).
- Labels are converted to a JSON object, which can be queried directly using the label key: `SELECT JSON_VALUE(label_column_name, '$.label_key') AS label_key` (see the sketch below).
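For instance, a minimal sketch of label extraction (`metadata_output` is a placeholder for the materialised result of any function below whose source view carries labels, and `env` is an assumed label key; both are illustrative assumptions):

```sql
-- Hypothetical: metadata_output stands in for a materialised function result
-- exposing a JSON labels column; env is an assumed label key.
SELECT
  table_id,
  JSON_VALUE(labels, '$.env') AS env_label
FROM metadata_output
WHERE JSON_VALUE(labels, '$.env') = 'production';
```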
## Function Reference

### Query Options

Where functions include the JSON argument `query_options`, the following options can be set:

JSON Path | Data Type | Description
--- | --- | ---
`where_clause` | `STRING` | Optional WHERE clause to filter response rows
`except_columns` | `ARRAY<STRING>` | Column values to exclude from the response

For example, the following `query_options` structure would exclude the `project_number` column and filter rows to those where the `table_schema` (`dataset_name`) begins with "00":

```sql
DECLARE query_options JSON;

SET query_options = JSON """
{
  "where_clause": "WHERE table_schema LIKE '00%'",
  "except_columns": ["project_number"]
}
""";
```

Note that an excluded column _will_ still be present in the output of the table function; however, it will contain only null values.
### infoschema.datasets

This is a functional implementation of the [INFORMATION_SCHEMA.SCHEMATA](https://cloud.google.com/bigquery/docs/information-schema-datasets-schemata) view.

Argument | Data Type | Description
--- | --- | ---
`region` | `STRING` | The region from which to retrieve dataset-level metadata

Each function follows the same three-step pattern: build the metadata query, execute it into a JSON variable, then pass that JSON to the corresponding table function:

```sql
DECLARE region, datasets_query STRING;
DECLARE datasets_json JSON;

SET region = "my_region";

SET datasets_query = (SELECT `datatovalue-tools.my_region`.datasets_query(region));
EXECUTE IMMEDIATE (datasets_query) INTO datasets_json;
SELECT * FROM `datatovalue-tools.my_region`.datasets(datasets_json);
```
### infoschema.tables

This is a functional implementation of the [INFORMATION_SCHEMA.TABLES](https://cloud.google.com/bigquery/docs/information-schema-tables) view.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve table-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE tables_query STRING;
DECLARE tables_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET tables_query = (SELECT `datatovalue-tools.my_region`.tables_query(dataset_ids));
EXECUTE IMMEDIATE (tables_query) INTO tables_json;
SELECT * FROM `datatovalue-tools.my_region`.tables(tables_json);
```
### infoschema.table_metadata

This is a functional implementation of the BigQuery `__TABLES__` metadata view.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve table-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE table_metadata_query STRING;
DECLARE table_metadata_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET table_metadata_query = (SELECT `datatovalue-tools.my_region`.table_metadata_query(dataset_ids));
EXECUTE IMMEDIATE (table_metadata_query) INTO table_metadata_json;
SELECT * FROM `datatovalue-tools.my_region`.table_metadata(table_metadata_json);
```
### infoschema.table_options

This is a functional implementation of the [INFORMATION_SCHEMA.TABLE_OPTIONS](https://cloud.google.com/bigquery/docs/information-schema-table-options) view. The response is restructured from key-value pairs into columns for ease of subsequent manipulation.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve table-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE table_options_query STRING;
DECLARE table_options_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET table_options_query = (SELECT `datatovalue-tools.my_region`.table_options_query(dataset_ids));
EXECUTE IMMEDIATE (table_options_query) INTO table_options_json;
SELECT * FROM `datatovalue-tools.my_region`.table_options(table_options_json);
```
### infoschema.table_storage

This is a functional implementation of the [INFORMATION_SCHEMA.TABLE_STORAGE](https://cloud.google.com/bigquery/docs/information-schema-table-storage) view.

Argument | Data Type | Description
--- | --- | ---
`project_id` | `STRING` | The project from which to retrieve table-level storage metadata.
`region` | `STRING` | The region from which to retrieve table-level storage metadata.

```sql
DECLARE region, project_id, table_storage_query STRING;
DECLARE table_storage_json JSON;

SET project_id = "my_project_id";
SET region = "my_region";

SET table_storage_query = (SELECT `datatovalue-tools.my_region`.table_storage_query(project_id, region));
EXECUTE IMMEDIATE (table_storage_query) INTO table_storage_json;
SELECT * FROM `datatovalue-tools.my_region`.table_storage(table_storage_json);
```
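Since `table_storage` is a table function, its output can be filtered and ordered directly. A small sketch for surfacing the largest tables follows; the `total_logical_bytes_gib` column name is an assumption based on the GiB convention described in the introduction:

```sql
-- Continues from the example above (table_storage_json already populated).
-- total_logical_bytes_gib is an assumed name for one of the derived GiB columns.
SELECT
  table_id,
  total_logical_bytes_gib
FROM `datatovalue-tools.my_region`.table_storage(table_storage_json)
ORDER BY total_logical_bytes_gib DESC
LIMIT 10;
```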
### infoschema.partitions

This is a functional implementation of the [INFORMATION_SCHEMA.PARTITIONS](https://cloud.google.com/bigquery/docs/information-schema-partitions) view.

Argument | Data Type | Description
--- | --- | ---
`table_ids` | `ARRAY<STRING>` | The tables from which to retrieve partition-level metadata.

```sql
DECLARE table_ids ARRAY<STRING>;
DECLARE partitions_query STRING;
DECLARE partitions_json JSON;

SET table_ids = ["table_id_a", "table_id_b", ... ];

SET partitions_query = (SELECT `datatovalue-tools.my_region`.partitions_query(table_ids));
EXECUTE IMMEDIATE (partitions_query) INTO partitions_json;
SELECT * FROM `datatovalue-tools.my_region`.partitions(partitions_json);
```
### infoschema.columns

This is a functional implementation of the [INFORMATION_SCHEMA.COLUMNS](https://cloud.google.com/bigquery/docs/information-schema-columns) view.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve column-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE columns_query STRING;
DECLARE columns_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET columns_query = (SELECT `datatovalue-tools.my_region`.columns_query(dataset_ids));
EXECUTE IMMEDIATE (columns_query) INTO columns_json;
SELECT * FROM `datatovalue-tools.my_region`.columns(columns_json);
```
### infoschema.column_field_paths

This is a functional implementation of the [INFORMATION_SCHEMA.COLUMN_FIELD_PATHS](https://cloud.google.com/bigquery/docs/information-schema-column-field-paths) view.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve column-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE column_field_paths_query STRING;
DECLARE column_field_paths_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET column_field_paths_query = (SELECT `datatovalue-tools.my_region`.column_field_paths_query(dataset_ids));
EXECUTE IMMEDIATE (column_field_paths_query) INTO column_field_paths_json;
SELECT * FROM `datatovalue-tools.my_region`.column_field_paths(column_field_paths_json);
```
