Commit 81fa642

add pre-insert schema validation errors (#453)
* add pre-insert schema validation errors
* Update docs/user_guides/fs/feature_group/data_types.md
  Co-authored-by: Copilot <[email protected]>
* Update docs/user_guides/fs/feature_group/data_types.md
  Co-authored-by: Copilot <[email protected]>
* updates
* updates
* update new flag
* default to on

Co-authored-by: Copilot <[email protected]>
1 parent 95eeaf4 commit 81fa642

1 file changed, +84 -1 lines changed

docs/user_guides/fs/feature_group/data_types.md

+84-1
@@ -120,7 +120,7 @@ When a feature is being used as a primary key, certain types are not allowed.
120120
Examples of such types are *FLOAT*, *DOUBLE*, *TEXT* and *BLOB*.
121121
Additionally, the combined storage size of the online data types of all primary key columns **should not exceed 4KB**.
122122

123-
#### Online restrictions for row size
123+
#### Online restrictions for row size
124124

125125
The online feature store supports **up to 500 columns** and all column types combined **should not exceed 30000 Bytes**.
126126
The byte size of each column is determined by its data type and calculated as follows:
@@ -143,6 +143,89 @@ The byte size of each column is determined by its data type and calculated as fo
143143
| BLOB | 256 |
144144
| other | 8 |
145145
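The column count and byte limits above can be checked locally before a feature group is created. The following is a minimal sketch, not part of the Hopsworks API: `estimate_row_bytes` and `fits_online_limits` are hypothetical helper names, and `TYPE_BYTES` contains only the `BLOB` row visible in this excerpt, so the remaining rows of the table need to be filled in before the estimate is meaningful.

```python
MAX_COLUMNS = 500      # column limit stated in the text above
MAX_ROW_BYTES = 30000  # row size limit stated in the text above

# Assumption: only the BLOB row is filled in here; add the other table rows as needed.
TYPE_BYTES = {"blob": 256}
DEFAULT_BYTES = 8  # the "other" row of the table


def estimate_row_bytes(online_types):
    """Sum the per-column byte sizes for a list of online type names."""
    return sum(TYPE_BYTES.get(t.lower(), DEFAULT_BYTES) for t in online_types)


def fits_online_limits(online_types):
    """Check both the column count and the combined byte size."""
    return (
        len(online_types) <= MAX_COLUMNS
        and estimate_row_bytes(online_types) <= MAX_ROW_BYTES
    )


# "some_other_type" is a placeholder that falls into the "other" row.
example_types = ["blob", "blob", "some_other_type"]
row_bytes = estimate_row_bytes(example_types)
```

Keeping the size table in a plain dictionary makes it easy to extend the sketch once the full table is at hand.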

146+
147+
#### Pre-insert schema validation for online feature groups
148+
For online-enabled feature groups, the dataframe to be ingested must adhere to the online schema definition. The input dataframe is therefore validated against the online schema before it is inserted.
149+
The validation is enabled by default and can be disabled by passing the following keyword argument when calling `insert()`:
150+
=== "Python"
151+
```python
152+
feature_group.insert(df, validation_options={'online_schema_validation': False})
153+
```
154+
The most important validation checks and their error messages are listed below, along with possible corrective actions.
155+
156+
1. Primary key contains null values
157+
158+
- **Rule** Primary key columns must not contain any null values.
159+
- **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign each of them a unique value according to your preferred data imputation strategy.
160+
161+
=== "Pandas"
162+
```python
163+
# Drop rows: assuming 'id' is the primary key column
164+
df = df.dropna(subset=['id'])
165+
# For composite keys
166+
df = df.dropna(subset=['id1', 'id2'])
167+
168+
# Data imputation: replace null values by incrementing the last integer id
169+
# existing max id
170+
max_id = df['id'].max()
171+
# counter to generate new id
172+
next_id = max_id + 1
173+
# for each null id, assign the next id incrementally
174+
for idx in df[df['id'].isna()].index:
175+
    df.loc[idx, 'id'] = next_id
176+
    next_id += 1
177+
```
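As an alternative to the row-by-row loop above, the same imputation can be written in a vectorized form. This is a minimal sketch, assuming the `id` column holds integer ids and at least one non-null value; `fill_null_ids` is a hypothetical helper name, not part of the Hopsworks API.

```python
import pandas as pd


def fill_null_ids(df: pd.DataFrame, key: str = "id") -> pd.DataFrame:
    """Return a copy where null values in `key` receive fresh incrementing ids."""
    out = df.copy()
    null_mask = out[key].isna()
    # Assumption: at least one non-null id exists, so max() is a valid number.
    start = int(out[key].max()) + 1
    out.loc[null_mask, key] = list(range(start, start + int(null_mask.sum())))
    # Nulls force a float dtype; restore integers now that none remain.
    out[key] = out[key].astype("int64")
    return out


df = pd.DataFrame({"id": [1, 2, None, 5, None], "value": ["a", "b", "c", "d", "e"]})
fixed = fill_null_ids(df)
```

Working on a copy leaves the original dataframe untouched, which makes it easier to compare the rows before and after imputation.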
178+
179+
2. Primary key column missing
180+
181+
- **Rule** The dataframe to be inserted must contain all the columns defined as primary key(s) in the feature group.
182+
- **Example correction** Add all the primary key columns to the dataframe.
183+
184+
=== "Pandas"
185+
```python
186+
# incrementing primary key up to the length of the dataframe
187+
df['id'] = range(1, len(df) + 1)
188+
```
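Before adding a surrogate key, it can help to see which primary key columns are actually absent from the dataframe. The following is a minimal sketch in plain pandas; `missing_primary_keys` is a hypothetical helper name, and the key list is assumed to match the primary key defined on the feature group.

```python
import pandas as pd


def missing_primary_keys(df: pd.DataFrame, primary_key: list) -> list:
    """Return the primary key columns that are not present in the dataframe."""
    return [col for col in primary_key if col not in df.columns]


df = pd.DataFrame({"id1": [1, 2], "value": [0.1, 0.2]})
missing = missing_primary_keys(df, ["id1", "id2"])
print(missing)  # ['id2']
```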
189+
190+
3. String length exceeded
191+
192+
- **Rule** The character length of a string must not exceed the maximum length allowed by the online schema type of the feature. If the feature group has not been created yet and no explicit feature schema was provided, the limit is automatically increased to the maximum length found in the string column of the dataframe.
193+
- **Example correction**
194+
195+
- Trim the string values to fit within the maximum limit set during feature group creation.
196+
197+
=== "Pandas"
198+
```python
199+
max_length = 100
200+
df['text_column'] = df['text_column'].str.slice(0, max_length)
201+
```
202+
203+
- Another option is to simply [create a new version of the feature group](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_or_create_feature_group) and insert the dataframe.
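To decide between trimming and creating a new version, it can be useful to see which rows exceed the limit in the first place. The following is a minimal sketch in plain pandas, assuming a limit of 100 characters and a column named `text_column` as in the trimming example; neither value comes from a real feature group.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "text_column": ["short", "x" * 150, None]})
max_length = 100  # assumed limit, taken from the trimming example

# Character length per row; nulls are treated as length 0.
lengths = df["text_column"].str.len().fillna(0)
too_long = df[lengths > max_length]
longest = int(lengths.max())
```

Here `too_long` holds the offending rows and `longest` gives the length a new schema version would need to accommodate.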
204+
205+
206+
!!!note
207+
The total row size must not exceed 30000 bytes, as per the [row size restrictions](#online-restrictions-for-row-size). If long string columns would push a row over this limit, the feature can be defined as **TEXT** or **BLOB** instead.
208+
Below is an example of explicitly defining the online type of a string column as TEXT.
209+
210+
=== "Pandas"
211+
```python
212+
import pandas as pd
213+
from hsfs.feature import Feature
214+
# example dummy dataframe with the string column
215+
df = pd.DataFrame(columns=['id', 'string_col'])
216+
features = [
217+
    Feature(name="id", type="bigint", online_type="bigint"),
218+
    Feature(name="string_col", type="string", online_type="text")
219+
]
220+
221+
fg = fs.get_or_create_feature_group(name="fg_manual_text_schema",
222+
                                    version=1,
223+
                                    features=features,
224+
                                    online_enabled=True,
225+
                                    primary_key=['id'])
226+
fg.insert(df)
227+
```
228+
146229
### Timestamps and Timezones
147230

148231
All timestamp features are stored in Hopsworks in UTC time. Also, all timestamp-based functions (such as [point-in-time joins](../../../concepts/fs/feature_view/offline_api.md#point-in-time-correct-training-data)) use UTC time.
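Since timestamps are stored in UTC, one way to avoid ambiguity is to convert local times explicitly before inserting. The following is a minimal sketch in plain pandas, assuming the naive values in a hypothetical `event_time` column were recorded in the `Europe/Stockholm` timezone; how Hopsworks interprets timezone-naive values is not covered by this sketch.

```python
import pandas as pd

df = pd.DataFrame({"event_time": pd.to_datetime(["2024-01-01 12:00:00"])})

# Assumption: the naive values were recorded in Europe/Stockholm local time.
local = df["event_time"].dt.tz_localize("Europe/Stockholm")
# Convert to UTC so the stored instant is unambiguous.
df["event_time"] = local.dt.tz_convert("UTC")
```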
