From cb7b0e51189b50616da65530dad3c59d324cc271 Mon Sep 17 00:00:00 2001
From: DhananjayMukhedkar
Date: Tue, 11 Mar 2025 15:12:57 +0100
Subject: [PATCH 1/7] add pre-insert schema validation errors

---
 docs/user_guides/fs/feature_group/data_types.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md
index 86271ac44..0b28eb196 100644
--- a/docs/user_guides/fs/feature_group/data_types.md
+++ b/docs/user_guides/fs/feature_group/data_types.md
@@ -143,6 +143,21 @@ The byte size of each column is determined by its data type and calculated as fo
 | BLOB | 256 |
 | other | 8 |
 
+
+#### Pre-insert schema validation for online feature groups
+
+The input dataframe can be validated for schema as per the valid online schema data types before online ingestion. The most important checks are mentioned below along with possible corrective actions. It is enabled by setting the keyword argument `validation_options={'run_validation':True}` in the `insert()` API of feature groups.
+
+
+
+| Error type | Requirement | Suggested corrections |
+|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
+| Primary key contains null values | Primary key column should not cannot any null values. If the primary key is composite key then all columns of primary key are checked for null. | Remove the null rows from dataframe. OR impute the null values as applicable. |
+| Primary key column is missing | The dataframe to be inserted must contain all the features of the defined the primary key as per the feature group schema. | Add all the primary key columns in the dataframe. |
+| Event time column is missing | The dataframe to be inserted must contain an event time column if it was specified in the schema while feature group creation. | Add the event time column in the dataframe. |
+| String length exceeded | The character length of a string row exceeds the maximum length specified in feature online schema. However, if the feature group is not created and if no explicit schema was provided during feature group creation, then the length will be auto-increased to the maximum length found in a string column. This is handled during the first data ingestion and no user action is needed in this case. **Note:** The maximum row size in bytes should be less than 30000. | Trim the string values to fit within maximum set during feature group creation. OR remove the invalid rows. If the lengths are very long consider changing the feature schema to **TEXT** or **BLOB.** |
+
+
 ### Timestamps and Timezones
 
 All timestamp features are stored in Hopsworks in UTC time. Also, all timestamp-based functions (such as [point-in-time joins](../../../concepts/fs/feature_view/offline_api.md#point-in-time-correct-training-data)) use UTC time.
From eb1f35f5e48c4783eb0b0ea67052c065760c2fd6 Mon Sep 17 00:00:00 2001
From: Dhananjay Mukhedkar <55157590+dhananjay-mk@users.noreply.github.com>
Date: Tue, 11 Mar 2025 15:18:19 +0100
Subject: [PATCH 2/7] Update docs/user_guides/fs/feature_group/data_types.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
 docs/user_guides/fs/feature_group/data_types.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md
index 0b28eb196..73ce5b60a 100644
--- a/docs/user_guides/fs/feature_group/data_types.md
+++ b/docs/user_guides/fs/feature_group/data_types.md
@@ -152,7 +152,7 @@ The input dataframe can be validated for schema as per the valid online schema d
 
 | Error type | Requirement | Suggested corrections |
 |-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
-| Primary key contains null values | Primary key column should not cannot any null values. If the primary key is composite key then all columns of primary key are checked for null. | Remove the null rows from dataframe. OR impute the null values as applicable. |
+| Primary key contains null values | Primary key columns must not contain any null values. For composite keys, all primary key columns are checked for nulls. | Remove the null rows from dataframe. OR impute the null values as applicable. |
 | Primary key column is missing | The dataframe to be inserted must contain all the features of the defined the primary key as per the feature group schema. | Add all the primary key columns in the dataframe. |
 | Event time column is missing | The dataframe to be inserted must contain an event time column if it was specified in the schema while feature group creation. | Add the event time column in the dataframe. |
 | String length exceeded | The character length of a string row exceeds the maximum length specified in feature online schema. However, if the feature group is not created and if no explicit schema was provided during feature group creation, then the length will be auto-increased to the maximum length found in a string column. This is handled during the first data ingestion and no user action is needed in this case. **Note:** The maximum row size in bytes should be less than 30000. | Trim the string values to fit within maximum set during feature group creation. OR remove the invalid rows. If the lengths are very long consider changing the feature schema to **TEXT** or **BLOB.** |

From 0b3efbe7af7a95d2ef279760940d30b055417418 Mon Sep 17 00:00:00 2001
From: Dhananjay Mukhedkar <55157590+dhananjay-mk@users.noreply.github.com>
Date: Tue, 11 Mar 2025 15:18:28 +0100
Subject: [PATCH 3/7] Update docs/user_guides/fs/feature_group/data_types.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
 docs/user_guides/fs/feature_group/data_types.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md
index 73ce5b60a..1e718c13d 100644
--- a/docs/user_guides/fs/feature_group/data_types.md
+++ b/docs/user_guides/fs/feature_group/data_types.md
@@ -153,7 +153,7 @@ The input dataframe can be validated for schema as per the valid online schema d
 | Error type | Requirement | Suggested corrections |
 |-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
 | Primary key contains null values | Primary key columns must not contain any null values. For composite keys, all primary key columns are checked for nulls. | Remove the null rows from dataframe. OR impute the null values as applicable. |
-| Primary key column is missing | The dataframe to be inserted must contain all the features of the defined the primary key as per the feature group schema. | Add all the primary key columns in the dataframe. |
+| Primary key column is missing | The dataframe to be inserted must contain all the features defined in the primary key as per the feature group schema. | Add all the primary key columns in the dataframe. |
 | Event time column is missing | The dataframe to be inserted must contain an event time column if it was specified in the schema while feature group creation. | Add the event time column in the dataframe. |
 | String length exceeded | The character length of a string row exceeds the maximum length specified in feature online schema. However, if the feature group is not created and if no explicit schema was provided during feature group creation, then the length will be auto-increased to the maximum length found in a string column. This is handled during the first data ingestion and no user action is needed in this case. **Note:** The maximum row size in bytes should be less than 30000. | Trim the string values to fit within maximum set during feature group creation. OR remove the invalid rows. If the lengths are very long consider changing the feature schema to **TEXT** or **BLOB.** |
 

From 756b2f246f214a37c36218064cb17bd9c2e2d3bb Mon Sep 17 00:00:00 2001
From: DhananjayMukhedkar
Date: Thu, 13 Mar 2025 13:35:15 +0100
Subject: [PATCH 4/7] updates

---
 .../fs/feature_group/data_types.md | 82 ++++++++++++++++---
 1 file changed, 69 insertions(+), 13 deletions(-)

diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md
index 1e718c13d..e82b09a36 100644
--- a/docs/user_guides/fs/feature_group/data_types.md
+++ b/docs/user_guides/fs/feature_group/data_types.md
@@ -120,7 +120,7 @@ When a feature is being used as a primary key, certain types are not allowed.
 Examples of such types are *FLOAT*, *DOUBLE*, *TEXT* and *BLOB*.
 Additionally, the size of the sum of the primary key online data types storage requirements **should not exceed 4KB**.
 
-#### Online restrictions for row size
+#### Online restrictions for row size
 
 The online feature store supports **up to 500 columns** and all column types combined **should not exceed 30000 Bytes**.
 The byte size of each column is determined by its data type and calculated as follows:
@@ -145,18 +145,74 @@ The byte size of each column is determined by its data type and calculated as fo
 
 
 #### Pre-insert schema validation for online feature groups
-
-The input dataframe can be validated for schema as per the valid online schema data types before online ingestion. The most important checks are mentioned below along with possible corrective actions. It is enabled by setting the keyword argument `validation_options={'run_validation':True}` in the `insert()` API of feature groups.
-
-
-
-| Error type | Requirement | Suggested corrections |
-|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
-| Primary key contains null values | Primary key columns must not contain any null values. For composite keys, all primary key columns are checked for nulls. | Remove the null rows from dataframe. OR impute the null values as applicable. |
-| Primary key column is missing | The dataframe to be inserted must contain all the features defined in the primary key as per the feature group schema. | Add all the primary key columns in the dataframe. |
-| Event time column is missing | The dataframe to be inserted must contain an event time column if it was specified in the schema while feature group creation. | Add the event time column in the dataframe. |
-| String length exceeded | The character length of a string row exceeds the maximum length specified in feature online schema. However, if the feature group is not created and if no explicit schema was provided during feature group creation, then the length will be auto-increased to the maximum length found in a string column. This is handled during the first data ingestion and no user action is needed in this case. **Note:** The maximum row size in bytes should be less than 30000. | Trim the string values to fit within maximum set during feature group creation. OR remove the invalid rows. If the lengths are very long consider changing the feature schema to **TEXT** or **BLOB.** |
-
+For online enabled feature groups, the dataframe to be ingested needs to adhere to the online schema definitions. The input dataframe is validated for schema checks accordingly.
+The validation is enabled by setting below property when calling `insert()`
+=== "Python"
+    ```python
+    feature_group.insert(df, validation_options={'run_validation':True})
+    ```
+The most important validation checks or error messages are mentioned below along with possible corrective actions.
+
+1. Primary key contains null values
+
+    - **Rule** Primary key column should not contain any null values.
+    - **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign them a unique value as per the preferred data imputation strategy.
+
+    === "Pandas"
+        ```python
+        # Assuming 'id' is the primary key column
+        df = df.dropna(subset=['id'])
+        # For composite keys
+        df = df.dropna(subset=['id1', 'id2'])
+        ```
+
+2. Primary key column missing
+
+    - **Rule** The dataframe to be inserted must contain all the columns defined as primary key(s) in the feature group.
+    - **Example correction** Add all the primary key columns in the dataframe.
+
+    === "Pandas"
+        ```python
+        # Add missing primary key column
+        df['id'] = some_value
+        # If primary key is an auto-incrementing
+        df['id'] = range(1, len(df) + 1)
+        ```
+
+3. String length exceeded
+
+    - **Rule** The character length of a string should be within the maximum length capacity in the online schema type of a feature. If the feature group is not created and explicit feature schema was not provided, the limit will be auto-increased to the maximum length found in a string column in the dataframe.
+    - **Example correction**
+    Trim the string values to fit within maximum limit set during feature group creation.
+
+    === "Pandas"
+        ```python
+        max_length = 100
+        df['text_column'] = df['text_column'].str.slice(0, max_length)
+        ```
+
+    !!!note
+        The total row size limit should be less than 30kb as per [row size restrictions](#online-restrictions-for-row-size). In such cases it is possible to define the feature as **TEXT** or **BLOB**.
+        Below is an example of explicitly defining the string column as TEXT as online type.
+
+        === "Pandas"
+            ```python
+            import pandas as pd
+            # example dummy datafrane with the string column
+            df = pd.DataFrame(columns=['id', 'string_col'])
+            from hsfs.feature import Feature
+            features = [
+                Feature(name="id",type="bigint",online_type="bigint"),
+                Feature(name="string_col",type="string",online_type="text")
+            ]
+
+            fg = fs.get_or_create_feature_group(name="fg_manual_text_schema",
+                                                version=1,
+                                                features=features,
+                                                online_enabled=True,
+                                                primary_key=['id'])
+            fg.insert(df)
+            ```
 
 ### Timestamps and Timezones
 

From 853f762fe8e9a3d8cb3fde331e0de3d5386f5ef9 Mon Sep 17 00:00:00 2001
From: DhananjayMukhedkar
Date: Fri, 11 Apr 2025 11:11:27 +0200
Subject: [PATCH 5/7] updates

---
 .../fs/feature_group/data_types.md | 24 ++++++++++++++-----
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md
index e82b09a36..079fdecb4 100644
--- a/docs/user_guides/fs/feature_group/data_types.md
+++ b/docs/user_guides/fs/feature_group/data_types.md
@@ -160,10 +160,20 @@ The most important validation checks or error messages are mentioned below along
 
     === "Pandas"
         ```python
-        # Assuming 'id' is the primary key column
+        # Drop rows: assuming 'id' is the primary key column
         df = df.dropna(subset=['id'])
         # For composite keys
         df = df.dropna(subset=['id1', 'id2'])
+
+        # Data imputation: replace null values with incrementing last integer id
+        # existing max id
+        max_id = df['id'].max()
+        # counter to generate new id
+        next_id = max_id + 1
+        # for each null id, assign the next id incrementally
+        for idx in df[df['id'].isna()].index:
+            df.loc[idx, 'id'] = next_id
+            next_id += 1
         ```
 
 2. Primary key column missing
@@ -173,9 +183,7 @@ The most important validation checks or error messages are mentioned below along
 
     === "Pandas"
         ```python
-        # Add missing primary key column
-        df['id'] = some_value
-        # If primary key is an auto-incrementing
+        # incrementing primary key up to the length of the dataframe
         df['id'] = range(1, len(df) + 1)
         ```
 
@@ -183,13 +191,17 @@ The most important validation checks or error messages are mentioned below along
 
     - **Rule** The character length of a string should be within the maximum length capacity in the online schema type of a feature. If the feature group is not created and explicit feature schema was not provided, the limit will be auto-increased to the maximum length found in a string column in the dataframe.
     - **Example correction**
-    Trim the string values to fit within maximum limit set during feature group creation.
+
+        - Trim the string values to fit within maximum limit set during feature group creation.
 
     === "Pandas"
         ```python
         max_length = 100
         df['text_column'] = df['text_column'].str.slice(0, max_length)
         ```
+
+        - Another option is to simply [create a new version of the feature group](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_or_create_feature_group) and insert the dataframe.
+
 
     !!!note
         The total row size limit should be less than 30kb as per [row size restrictions](#online-restrictions-for-row-size). In such cases it is possible to define the feature as **TEXT** or **BLOB**.
@@ -198,7 +210,7 @@ The most important validation checks or error messages are mentioned below along
         === "Pandas"
             ```python
             import pandas as pd
-            # example dummy datafrane with the string column
+            # example dummy dataframe with the string column
             df = pd.DataFrame(columns=['id', 'string_col'])
             from hsfs.feature import Feature
             features = [

From ae1a7f73ab627cd2c2e91cc3913a9b30bd0bbc1d Mon Sep 17 00:00:00 2001
From: DhananjayMukhedkar
Date: Mon, 14 Apr 2025 12:01:18 +0200
Subject: [PATCH 6/7] update new flag

---
 docs/user_guides/fs/feature_group/data_types.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md
index 079fdecb4..b32b3c443 100644
--- a/docs/user_guides/fs/feature_group/data_types.md
+++ b/docs/user_guides/fs/feature_group/data_types.md
@@ -149,7 +149,7 @@ For online enabled feature groups, the dataframe to be ingested needs to adhere
 The validation is enabled by setting below property when calling `insert()`
 === "Python"
     ```python
-    feature_group.insert(df, validation_options={'run_validation':True})
+    feature_group.insert(df, validation_options={'online_schema_validation':True})
     ```
 The most important validation checks or error messages are mentioned below along with possible corrective actions.
 
From 03cd2197d776783973a63f9bc5e14cea37c7b02b Mon Sep 17 00:00:00 2001
From: DhananjayMukhedkar
Date: Thu, 24 Apr 2025 14:23:02 +0200
Subject: [PATCH 7/7] default to on

---
 docs/user_guides/fs/feature_group/data_types.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md
index b32b3c443..6c1d03e72 100644
--- a/docs/user_guides/fs/feature_group/data_types.md
+++ b/docs/user_guides/fs/feature_group/data_types.md
@@ -146,10 +146,10 @@ The byte size of each column is determined by its data type and calculated as fo
 
 #### Pre-insert schema validation for online feature groups
 For online enabled feature groups, the dataframe to be ingested needs to adhere to the online schema definitions. The input dataframe is validated for schema checks accordingly.
-The validation is enabled by setting below property when calling `insert()`
+The validation is enabled by default and can be disabled by setting the following keyword argument when calling `insert()`
 === "Python"
     ```python
-    feature_group.insert(df, validation_options={'online_schema_validation':True})
+    feature_group.insert(df, validation_options={'online_schema_validation':False})
     ```
 The most important validation checks or error messages are mentioned below along with possible corrective actions.
 
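
Reviewer note (not part of the patch series): the three rules documented in these patches (primary key columns present, no null primary keys, string lengths within the online limit) can be mirrored locally before calling `insert()`. The following is a minimal sketch under stated assumptions, not the validation Hopsworks itself runs: `precheck_rows` and its parameters are hypothetical names, the per-column limits are supplied by the caller, and rows are plain dictionaries such as those produced by pandas' `df.to_dict("records")`.

```python
from typing import Any, Dict, List, Optional


def _is_null(value: Any) -> bool:
    # Treat None and float NaN (the only float not equal to itself) as null.
    return value is None or (isinstance(value, float) and value != value)


def precheck_rows(
    rows: List[Dict[str, Any]],
    primary_key: List[str],
    max_string_lengths: Optional[Dict[str, int]] = None,
) -> List[str]:
    """Hypothetical helper: return a list of problems found in `rows`."""
    problems: List[str] = []
    limits = max_string_lengths or {}

    # Collect every column name that appears in at least one row.
    columns = set()
    for row in rows:
        columns.update(row)

    # Rule: all primary key columns must be present.
    for col in primary_key:
        if col not in columns:
            problems.append(f"primary key column '{col}' is missing")

    for i, row in enumerate(rows):
        # Rule: primary key columns must not contain null values.
        for col in primary_key:
            if col in columns and _is_null(row.get(col)):
                problems.append(f"row {i}: primary key '{col}' is null")
        # Rule: string values must fit within the caller-supplied limit.
        for col, limit in limits.items():
            value = row.get(col)
            if isinstance(value, str) and len(value) > limit:
                problems.append(
                    f"row {i}: '{col}' has length {len(value)} > {limit}"
                )
    return problems


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "alice"},
        {"id": None, "name": "a" * 120},
    ]
    for problem in precheck_rows(sample, ["id"], {"name": 100}):
        print(problem)
```

A sketch like this only reports problems; the corrective actions (dropping rows, imputing ids, trimming strings, or switching the online type to TEXT) remain the ones described in the documentation added by the patches.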