add pre-insert schema validation errors (#453)

dhananjay-mk · Copilot · web-flow · commit 81fa642a33b6 · 2025-04-24T20:47:10.000+02:00
* add pre-insert schema validation errors

* Update docs/user_guides/fs/feature_group/data_types.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/user_guides/fs/feature_group/data_types.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* updates

* updates

* update new flag

* default to on

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
diff --git a/docs/user_guides/fs/feature_group/data_types.md b/docs/user_guides/fs/feature_group/data_types.md
@@ -120,7 +120,7 @@ When a feature is being used as a primary key, certain types are not allowed.
 Examples of such types are *FLOAT*, *DOUBLE*, *TEXT* and *BLOB*.
 Additionally, the size of the sum of the primary key online data types storage requirements **should not exceed 4KB**.
 
-#### Online restrictions for row size
+####  Online restrictions for row size
 
 The online feature store supports **up to 500 columns** and all column types combined **should not exceed 30000 Bytes**.
 The byte size of each column is determined by its data type and calculated as follows:
@@ -143,6 +143,89 @@ The byte size of each column is determined by its data type and calculated as fo
 | BLOB                            | 256          |
 | other                           | 8            |
 
+
+#### Pre-insert schema validation for online feature groups
+For online-enabled feature groups, the dataframe to be ingested must conform to the online schema definitions, so the input dataframe is validated against the schema before insertion.
+Validation is enabled by default and can be disabled by passing the following keyword argument when calling `insert()`:
+=== "Python"
+    ```python
+    feature_group.insert(df, validation_options={'online_schema_validation':False})
+    ```
+The most important validation checks and their error messages are listed below, along with possible corrective actions.
+
+1. Primary key contains null values 
+
+    - **Rule** Primary key columns must not contain null values.
+    - **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign each a unique value according to your preferred data imputation strategy.
+        
+        === "Pandas"
+        ```python
+        # Drop rows: assuming 'id' is the primary key column
+        df = df.dropna(subset=['id'])
+        # For composite keys
+        df = df.dropna(subset=['id1', 'id2'])
+
+        # Data imputation: replace null ids with integers incrementing from the current max id
+        # existing max id 
+        max_id = df['id'].max()
+        # counter to generate new id
+        next_id = max_id + 1
+        # for each null id, assign the next id incrementally
+        for idx in df[df['id'].isna()].index:
+            df.loc[idx, 'id'] = next_id
+            next_id += 1
+        ```
+
+2. Primary key column missing
+
+    - **Rule** The dataframe to be inserted must contain all the columns defined as primary key(s) in the feature group.
+    - **Example correction** Add all the primary key columns to the dataframe.
+        
+        === "Pandas"
+        ```python
+        # incrementing primary key up to the length of the dataframe
+        df['id'] = range(1, len(df) + 1)
+        ```
+
+3. String length exceeded
+
+    - **Rule** The character length of a string must not exceed the maximum length allowed by the online schema type of the feature. If the feature group has not been created yet and no explicit feature schema was provided, the limit is automatically increased to the maximum length found in a string column of the dataframe.
+    - **Example correction**
+    
+        - Trim the string values to fit within the maximum limit set during feature group creation.
+        
+        === "Pandas"
+        ```python
+        max_length = 100
+        df['text_column'] = df['text_column'].str.slice(0, max_length)
+        ```
+        
+        - Another option is to simply [create a new version of the feature group](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_or_create_feature_group) and insert the dataframe.
+
+
+        !!!note  
+            The total row size must not exceed 30000 bytes, as per the [row size restrictions](#online-restrictions-for-row-size). When longer strings would exceed this limit, the feature can be defined as **TEXT** or **BLOB** instead.
+            Below is an example that explicitly defines the string column with TEXT as its online type.
+
+        === "Pandas"
+        ```python
+        import pandas as pd
+        from hsfs.feature import Feature
+        # example dummy dataframe with the string column
+        df = pd.DataFrame(columns=['id', 'string_col'])
+        features = [
+            Feature(name="id", type="bigint", online_type="bigint"),
+            Feature(name="string_col", type="string", online_type="text"),
+        ]
+
+        fg = fs.get_or_create_feature_group(name="fg_manual_text_schema",
+                                    version=1,
+                                    features=features,
+                                    online_enabled=True,
+                                    primary_key=['id'])
+        fg.insert(df)
+        ```
+
 ### Timestamps and Timezones
 
 All timestamp features are stored in Hopsworks in UTC time. Also, all timestamp-based functions (such as [point-in-time joins](../../../concepts/fs/feature_view/offline_api.md#point-in-time-correct-training-data)) use UTC time.
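
The three pre-insert rules documented in this change (null primary keys, missing primary key columns, string length exceeded) can also be approximated client-side before calling `insert()`. The following is a minimal sketch using only pandas. The helper name `precheck_online_schema` and its `max_lengths` argument are hypothetical and introduced here for illustration. It is not the validation that Hopsworks runs internally, and its messages will not match the library's error messages.

```python
import pandas as pd


def precheck_online_schema(df, primary_key, max_lengths):
    """Hypothetical helper: return a list of problems found in df.

    primary_key: list of primary key column names.
    max_lengths: dict mapping string column name -> maximum allowed characters.
    """
    problems = []

    # Primary key column missing: every key column must be present.
    missing = [col for col in primary_key if col not in df.columns]
    if missing:
        problems.append(f"missing primary key columns: {missing}")

    # Primary key contains null values.
    for col in primary_key:
        if col in df.columns and df[col].isna().any():
            problems.append(f"primary key column '{col}' contains null values")

    # String length exceeded: compare the longest non-null value to the limit.
    for col, limit in max_lengths.items():
        if col not in df.columns:
            continue
        longest = df[col].dropna().astype(str).str.len().max()
        if pd.notna(longest) and longest > limit:
            problems.append(
                f"column '{col}' has a value of length {int(longest)} > {limit}"
            )

    return problems


# Example dataframe that violates all three rules.
df = pd.DataFrame({"id": [1, None, 3], "text_column": ["ok", "x" * 120, None]})
problems = precheck_online_schema(
    df, primary_key=["id", "id2"], max_lengths={"text_column": 100}
)
for problem in problems:
    print(problem)
```

An empty list means none of these three checks failed. The server-side validation may still reject the dataframe for other reasons, such as the row size restrictions.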