Commit 0ea519a

Jim Barlow committed: added functions
1 parent b67e2ea commit 0ea519a

2 files changed: +346 −0 lines changed

pages/functions/data_quality.md

Lines changed: 148 additions & 0 deletions
---
sidebar_position: 2
---

# Data Quality Assurance Functions

## table_date_partitions_query

The `table_date_partitions_query` function is used to validate date partition existence across all tables in a single project, across multiple datasets. It also flags tables with gaps in their date partitions and identifies the specific missing dates.

Argument | Data Type | Description
--- | --- | ---
`project_id` | STRING | The ID of the project to be profiled.
`dataset_names` | ARRAY<STRING> | An array containing the names of the datasets to be profiled.

It can be executed using the following syntax:

```sql
DECLARE project_id STRING;
DECLARE dataset_names ARRAY<STRING>;

SET project_id = 'project_a';
SET dataset_names = ['dataset_a', 'dataset_b', 'dataset_c'];

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.table_date_partitions_query`(
    project_id,
    dataset_names));
```
## row_duplicate_query

The `row_duplicate_query` function is used to identify duplicate rows in any table.

Argument | Data Type | Description
--- | --- | ---
`table_id` | STRING | Fully qualified ID of the table to be profiled.

It can be executed using the following syntax:

```sql
DECLARE table_id STRING;

SET table_id = 'project_a.dataset_a.table_a';

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.row_duplicate_query`(table_id));
```

The result is the original table with some additional metadata columns; the boolean column `duplicate_row_flag` identifies the duplicate rows in the data.
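A minimal follow-up sketch for isolating the flagged rows, assuming the profiled output is first materialised into a temporary table (the `CREATE TEMP TABLE` wrapper and the `profiled_rows` name are illustrative assumptions, not part of the function):

```sql
DECLARE table_id STRING;

SET table_id = 'project_a.dataset_a.table_a';

-- Materialise the profiled output, then keep only the flagged rows.
-- CREATE TEMP TABLE and profiled_rows are assumptions for illustration.
EXECUTE IMMEDIATE CONCAT(
  'CREATE TEMP TABLE profiled_rows AS ',
  (SELECT `datatovalue-tools.us_west1.row_duplicate_query`(table_id)));

SELECT * FROM profiled_rows WHERE duplicate_row_flag;
```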
## column_profile_query

The `column_profile_query` function is used to compute column metrics such as minimum and maximum values, null counts, null percentage, distinct value counts and uniqueness percentage.

Argument | Data Type | Description
--- | --- | ---
`table_id` | STRING | Fully qualified ID of the table to be profiled.

Note that since this function needs to query the `INFORMATION_SCHEMA.COLUMNS` view to get the precise table columns, executing the returned query will return _another_ SQL query. This must in turn be executed in order to obtain the result.

This can be achieved using the following syntax:

```sql
DECLARE table_id, query STRING;

SET table_id = 'project_a.dataset_a.table_a';

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.column_profile_query`(table_id)
) INTO query;

EXECUTE IMMEDIATE (query);
```
## unique_combination_query

The `unique_combination_query` function is used to validate whether combinations of columns are unique in any table. This supports validation of granularity assumptions, as well as unique key development, testing and monitoring.

Argument | Data Type | Description
--- | --- | ---
`table_id` | STRING | Fully qualified ID of the table to be profiled.
`column_names` | ARRAY<STRING> | An array containing the names of the columns to be profiled.

It can be executed using the following syntax:

```sql
DECLARE table_id STRING;
DECLARE column_names ARRAY<STRING>;

SET table_id = 'project_a.dataset_a.table_a';
SET column_names = ['column_a', 'column_b', 'column_c'];

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.unique_combination_query`(table_id, column_names)
);
```
## unique_combination_multi_query

The `unique_combination_multi_query` function is used to validate whether different combinations of columns are unique across multiple tables. This supports validation of granularity assumptions, as well as unique key development, testing and monitoring, and is an extension of the `unique_combination_query` function.

Argument | Data Type | Description
--- | --- | ---
`table_column_combinations` | ARRAY<STRUCT<table_id STRING, column_names ARRAY<STRING>>> | Struct array of inputs containing multiple `table_id` and `column_names` combinations.

It can be executed using the following syntax:

```sql
DECLARE table_column_combinations ARRAY<STRUCT<table_id STRING, column_names ARRAY<STRING>>>;

SET table_column_combinations = [
  ('project_a.dataset_a.table_a', ['column_a']),
  ('project_a.dataset_a.table_a', ['column_a', 'column_b']),
  ('project_a.dataset_a.table_a', ['column_a', 'column_b', 'column_c']),
  ('project_a.dataset_a.table_a', ['column_a', 'column_b', 'column_c', 'column_d'])
];

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.unique_combination_multi_query`(table_column_combinations)
);
```
## metric_sum_query

The `metric_sum_query` function enables tracking of metric sum values for specific numeric columns across all tables in multiple datasets.

Argument | Data Type | Description
--- | --- | ---
`project_id` | STRING | The project ID in which tables are located.
`region` | STRING | The region in which tables are located.
`dataset_names` | ARRAY<STRING> | Dataset names in which tables are located. This argument supports LIKE matching using the '%' wildcard. A NULL or empty array will result in a null query.
`column_names` | ARRAY<STRING> | Column names to include. This argument supports LIKE matching using the '%' wildcard. A NULL or empty array will result in all columns being returned.
`rounding_digits` | INT64 | The number of decimal places to which the result will be rounded. A NULL value will default to 0 decimal places.

Note that the result of this query is another SQL query, which must be executed to obtain the final result.

```sql
DECLARE project_id, region, query STRING;
DECLARE dataset_names, column_names ARRAY<STRING>;
DECLARE rounding_digits INT64;

SET project_id = "project_a";
SET region = "us-west1";
SET dataset_names = ["dataset_prefix_%", "dataset_a", "dataset_b", "dataset_c"];
SET column_names = ["column_prefix_%", "column_a", "column_b"];
SET rounding_digits = 0;

EXECUTE IMMEDIATE (
  SELECT `datatovalue-tools.us_west1.metric_sum_query`(project_id, region, dataset_names, column_names, rounding_digits)
) INTO query;

EXECUTE IMMEDIATE (query);
```
Lines changed: 198 additions & 0 deletions
---
sidebar_position: 1
---

# Information Schema Functions

The `infoschema` functions make BigQuery [INFORMATION_SCHEMA](https://cloud.google.com/bigquery/docs/information-schema-intro) metadata accessible via parameterised table functions, without requiring hard-coding of region, project, dataset or table identifiers. This enables them to be used in programmatic, metadata-driven use-cases such as data profiling, monitoring and templated dashboard development.

Functions are available in all Google Cloud regions. Note that in the examples below, the `my_region` placeholder needs to be replaced with a region or multi-region identifier from any [BigQuery region](https://cloud.google.com/bigquery/docs/locations#regions), replacing any dash ('-') with an underscore ('_').

All columns are retained from the associated `INFORMATION_SCHEMA` view, with an additional `id` added as the first column based on the metadata granularity of the response (i.e. `dataset_id` or `table_id`) for ease of subsequent manipulation and integration. Data types and structures are corrected where necessary, and column names for projects, datasets and tables are standardised to `project_id`, `dataset_name` and `table_name`.

- `dataset_id` = `project_id.dataset_name`
- `table_id` = `project_id.dataset_name.table_name`

Additional fields are added in the following scenarios:

- Unix timestamp fields have a `TIMESTAMP` equivalent field added.
- Bytes values have a GiB field added (divided by $1024^3$).
- Labels are converted to a JSON object, which can be queried directly using the label key: `SELECT JSON_VALUE(label_column_name, '$.label_key') AS label_key` (see the sketch below).
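For instance, a minimal sketch of label extraction (`metadata_output` is a placeholder for the materialised result of any function below whose source view carries labels, and `env` is an assumed label key; both are illustrative assumptions):

```sql
-- Hypothetical: metadata_output stands in for a materialised function result
-- exposing a JSON labels column; env is an assumed label key.
SELECT
  table_id,
  JSON_VALUE(labels, '$.env') AS env_label
FROM metadata_output
WHERE JSON_VALUE(labels, '$.env') = 'production';
```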
## Function Reference

### Query Options

Where functions include the JSON argument `query_options`, the following options can be set:

JSON Path | Data Type | Description
--- | --- | ---
`where_clause` | `STRING` | Optional WHERE clause to filter response rows
`except_columns` | `ARRAY<STRING>` | Column values to exclude from the response

For example, the following `query_options` structure would exclude the `project_number` column and filter rows to those where the `table_schema` (`dataset_name`) begins with "00":

```sql
DECLARE query_options JSON;

SET query_options = JSON """
{
  "where_clause": "WHERE table_schema LIKE '00%'",
  "except_columns": ["project_number"]
}
""";
```

Note that an excluded column _will_ still be present in the output of the table function; however, it will contain only null values.
### infoschema.datasets

This is a functional implementation of the [INFORMATION_SCHEMA.SCHEMATA](https://cloud.google.com/bigquery/docs/information-schema-datasets-schemata) view.

Argument | Data Type | Description
--- | --- | ---
`region` | `STRING` | The region from which to retrieve dataset-level metadata

Each function follows the same three-step pattern: build the metadata query, execute it into a JSON variable, then pass that JSON to the corresponding table function:

```sql
DECLARE region, datasets_query STRING;
DECLARE datasets_json JSON;

SET region = "my_region";

SET datasets_query = (SELECT `datatovalue-tools.my_region`.datasets_query(region));
EXECUTE IMMEDIATE (datasets_query) INTO datasets_json;
SELECT * FROM `datatovalue-tools.my_region`.datasets(datasets_json);
```
### infoschema.tables

This is a functional implementation of the [INFORMATION_SCHEMA.TABLES](https://cloud.google.com/bigquery/docs/information-schema-tables) view.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve table-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE tables_query STRING;
DECLARE tables_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET tables_query = (SELECT `datatovalue-tools.my_region`.tables_query(dataset_ids));
EXECUTE IMMEDIATE (tables_query) INTO tables_json;
SELECT * FROM `datatovalue-tools.my_region`.tables(tables_json);
```
### infoschema.table_metadata

This is a functional implementation of the BigQuery `__TABLES__` metadata view.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve table-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE table_metadata_query STRING;
DECLARE table_metadata_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET table_metadata_query = (SELECT `datatovalue-tools.my_region`.table_metadata_query(dataset_ids));
EXECUTE IMMEDIATE (table_metadata_query) INTO table_metadata_json;
SELECT * FROM `datatovalue-tools.my_region`.table_metadata(table_metadata_json);
```
### infoschema.table_options

This is a functional implementation of the [INFORMATION_SCHEMA.TABLE_OPTIONS](https://cloud.google.com/bigquery/docs/information-schema-table-options) view. The response is restructured from key-value pairs into columns for ease of subsequent manipulation.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve table-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE table_options_query STRING;
DECLARE table_options_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET table_options_query = (SELECT `datatovalue-tools.my_region`.table_options_query(dataset_ids));
EXECUTE IMMEDIATE (table_options_query) INTO table_options_json;
SELECT * FROM `datatovalue-tools.my_region`.table_options(table_options_json);
```
### infoschema.table_storage

This is a functional implementation of the [INFORMATION_SCHEMA.TABLE_STORAGE](https://cloud.google.com/bigquery/docs/information-schema-table-storage) view.

Argument | Data Type | Description
--- | --- | ---
`project_id` | `STRING` | The project from which to retrieve table-level storage metadata.
`region` | `STRING` | The region from which to retrieve table-level storage metadata.

```sql
DECLARE region, project_id, table_storage_query STRING;
DECLARE table_storage_json JSON;

SET project_id = "my_project_id";
SET region = "my_region";

SET table_storage_query = (SELECT `datatovalue-tools.my_region`.table_storage_query(project_id, region));
EXECUTE IMMEDIATE (table_storage_query) INTO table_storage_json;
SELECT * FROM `datatovalue-tools.my_region`.table_storage(table_storage_json);
```
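Since `table_storage` is a table function, its output can be filtered and ordered directly. A small sketch for surfacing the largest tables follows; the `total_logical_bytes_gib` column name is an assumption based on the GiB convention described in the introduction:

```sql
-- Continues from the example above (table_storage_json already populated).
-- total_logical_bytes_gib is an assumed name for one of the derived GiB columns.
SELECT
  table_id,
  total_logical_bytes_gib
FROM `datatovalue-tools.my_region`.table_storage(table_storage_json)
ORDER BY total_logical_bytes_gib DESC
LIMIT 10;
```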
### infoschema.partitions

This is a functional implementation of the [INFORMATION_SCHEMA.PARTITIONS](https://cloud.google.com/bigquery/docs/information-schema-partitions) view.

Argument | Data Type | Description
--- | --- | ---
`table_ids` | `ARRAY<STRING>` | The tables from which to retrieve partition-level metadata.

```sql
DECLARE table_ids ARRAY<STRING>;
DECLARE partitions_query STRING;
DECLARE partitions_json JSON;

SET table_ids = ["table_id_a", "table_id_b", ... ];

SET partitions_query = (SELECT `datatovalue-tools.my_region`.partitions_query(table_ids));
EXECUTE IMMEDIATE (partitions_query) INTO partitions_json;
SELECT * FROM `datatovalue-tools.my_region`.partitions(partitions_json);
```
### infoschema.columns

This is a functional implementation of the [INFORMATION_SCHEMA.COLUMNS](https://cloud.google.com/bigquery/docs/information-schema-columns) view.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve column-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE columns_query STRING;
DECLARE columns_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET columns_query = (SELECT `datatovalue-tools.my_region`.columns_query(dataset_ids));
EXECUTE IMMEDIATE (columns_query) INTO columns_json;
SELECT * FROM `datatovalue-tools.my_region`.columns(columns_json);
```
### infoschema.column_field_paths

This is a functional implementation of the [INFORMATION_SCHEMA.COLUMN_FIELD_PATHS](https://cloud.google.com/bigquery/docs/information-schema-column-field-paths) view.

Argument | Data Type | Description
--- | --- | ---
`dataset_ids` | `ARRAY<STRING>` | The datasets from which to retrieve column-level metadata.

```sql
DECLARE dataset_ids ARRAY<STRING>;
DECLARE column_field_paths_query STRING;
DECLARE column_field_paths_json JSON;

SET dataset_ids = ["dataset_id_a", "dataset_id_b", ... ];

SET column_field_paths_query = (SELECT `datatovalue-tools.my_region`.column_field_paths_query(dataset_ids));
EXECUTE IMMEDIATE (column_field_paths_query) INTO column_field_paths_json;
SELECT * FROM `datatovalue-tools.my_region`.column_field_paths(column_field_paths_json);
```
