[FSTORE-1708] add pre-insert schema validation errors #453

Merged
merged 7 commits into from
Apr 24, 2025
Changes from 6 commits
85 changes: 84 additions & 1 deletion docs/user_guides/fs/feature_group/data_types.md
@@ -120,7 +120,7 @@ When a feature is being used as a primary key, certain types are not allowed.
Examples of such types are *FLOAT*, *DOUBLE*, *TEXT* and *BLOB*.
Additionally, the combined storage size of the online data types of all primary key columns **should not exceed 4KB**.

#### Online restrictions for row size

The online feature store supports **up to 500 columns** and all column types combined **should not exceed 30000 Bytes**.
The byte size of each column is determined by its data type and calculated as follows:
@@ -143,6 +143,89 @@ The byte size of each column is determined by its data type and calculated as fo
| BLOB | 256 |
| other | 8 |


#### Pre-insert schema validation for online feature groups
For online-enabled feature groups, the dataframe to be ingested must adhere to the online schema definitions, so the input dataframe is validated against that schema before insertion.
The validation is enabled by setting the following option when calling `insert()`:
=== "Python"
```python
feature_group.insert(df, validation_options={'online_schema_validation':True})
```
The most important validation checks are listed below, each with possible corrective actions.

1. Primary key contains null values

- **Rule** Primary key column should not contain any null values.
- **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign each of them a unique value, following your preferred data imputation strategy.

=== "Pandas"
```python
# Drop rows: assuming 'id' is the primary key column
df = df.dropna(subset=['id'])
# For composite keys
df = df.dropna(subset=['id1', 'id2'])

# Data imputation (alternative to dropping rows):
# replace null values by incrementing the last integer id
# existing max id
max_id = df['id'].max()
# counter to generate new id
next_id = max_id + 1
# for each null id, assign the next id incrementally
for idx in df[df['id'].isna()].index:
    df.loc[idx, 'id'] = next_id
    next_id += 1
```
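
The null-key condition can also be checked locally before calling `insert()`. The following is a minimal sketch in plain pandas; the helper name `find_null_primary_keys` and the example data are hypothetical and not part of the Hopsworks API.

```python
import pandas as pd


def find_null_primary_keys(df: pd.DataFrame, primary_key: list) -> pd.DataFrame:
    """Return the rows in which at least one primary key column is null."""
    # isna() marks nulls per column; any(axis=1) flags a row
    # as soon as one part of the (possibly composite) key is null
    mask = df[primary_key].isna().any(axis=1)
    return df[mask]


# Illustrative data: the second row has a null 'id'
example = pd.DataFrame({"id": [1, None, 3], "value": ["a", "b", "c"]})
bad_rows = find_null_primary_keys(example, ["id"])
print(len(bad_rows))
```

Inspecting the returned rows first makes it easier to decide between dropping them and imputing new keys.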

2. Primary key column missing

- **Rule** The dataframe to be inserted must contain all the columns defined as primary key(s) in the feature group.
- **Example correction** Add all the primary key columns in the dataframe.

=== "Pandas"
```python
# incrementing primary key up to the length of the dataframe
df['id'] = range(1, len(df) + 1)
```
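
To see which key columns are missing before deciding how to add them, a small local check may help. This is a sketch in plain pandas; the helper name `missing_primary_key_columns` and the column names are assumptions for illustration only.

```python
import pandas as pd


def missing_primary_key_columns(df: pd.DataFrame, primary_key: list) -> list:
    """Return the primary key column names absent from the dataframe, in order."""
    return [col for col in primary_key if col not in df.columns]


# Illustrative data: 'id2' is part of the key but not in the dataframe
example = pd.DataFrame({"id1": [1, 2], "value": [0.5, 0.7]})
missing = missing_primary_key_columns(example, ["id1", "id2"])
print(missing)
```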

3. String length exceeded

- **Rule** The character length of a string must not exceed the maximum length allowed by the online schema type of the feature. If the feature group has not been created yet and no explicit feature schema is provided, the limit is automatically raised to the maximum length found in the string column of the dataframe.
- **Example correction**

- Trim the string values to fit within maximum limit set during feature group creation.

=== "Pandas"
```python
max_length = 100
df['text_column'] = df['text_column'].str.slice(0, max_length)
```

- Another option is to simply [create a new version of the feature group](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_or_create_feature_group) and insert the dataframe.
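
Before trimming or creating a new version, it can be useful to find out which rows actually exceed the limit. The sketch below uses plain pandas; the helper name `rows_exceeding_length`, the column names, and the limit of 100 are assumptions for illustration and are not taken from the Hopsworks API.

```python
import pandas as pd


def rows_exceeding_length(df: pd.DataFrame, column: str, max_length: int) -> pd.DataFrame:
    """Return the rows whose string value in `column` has more than max_length characters."""
    # .str.len() counts characters; nulls have no length and are treated as 0 here
    lengths = df[column].str.len().fillna(0)
    return df[lengths > max_length]


# Illustrative data: only the second row is longer than 100 characters
example = pd.DataFrame({"id": [1, 2, 3], "text_column": ["short", "x" * 150, None]})
too_long = rows_exceeding_length(example, "text_column", 100)
print(too_long["id"].tolist())
```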


!!!note
The total row size must stay below 30 KB as per the [row size restrictions](#online-restrictions-for-row-size). If long string columns would exceed this limit, it is possible to define the feature as **TEXT** or **BLOB** instead.
Below is an example of explicitly defining a string column with TEXT as its online type.

=== "Pandas"
```python
import pandas as pd
from hsfs.feature import Feature

# `fs` is assumed to be an existing feature store handle,
# e.g. fs = hopsworks.login().get_feature_store()

# example dummy dataframe with the string column
df = pd.DataFrame(columns=['id', 'string_col'])

# explicit schema: the string column uses TEXT as its online type
features = [
    Feature(name="id", type="bigint", online_type="bigint"),
    Feature(name="string_col", type="string", online_type="text"),
]

fg = fs.get_or_create_feature_group(
    name="fg_manual_text_schema",
    version=1,
    features=features,
    online_enabled=True,
    primary_key=['id'],
)
fg.insert(df)
```

### Timestamps and Timezones

All timestamp features are stored in Hopsworks in UTC time. Also, all timestamp-based functions (such as [point-in-time joins](../../../concepts/fs/feature_view/offline_api.md#point-in-time-correct-training-data)) use UTC time.