docs/user_guides/fs/feature_group/data_types.md
+84-1
@@ -120,7 +120,7 @@ When a feature is being used as a primary key, certain types are not allowed.
Examples of such types are *FLOAT*, *DOUBLE*, *TEXT* and *BLOB*.
Additionally, the combined online storage size of the primary key columns **should not exceed 4KB**.

#### Online restrictions for row size

The online feature store supports **up to 500 columns** and all column types combined **should not exceed 30000 Bytes**.
The byte size of each column is determined by its data type and calculated as follows:
@@ -143,6 +143,89 @@ The byte size of each column is determined by its data type and calculated as fo
| BLOB | 256 |
| other | 8 |
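The row limits can be checked before creating a feature group with a small helper. The following is a minimal sketch, not part of the Hopsworks API: `estimate_row_size` and `check_row_limits` are hypothetical names, and `TYPE_SIZES` contains only the BLOB size visible in the table excerpt above, with every other type falling back to the 8 bytes of the "other" row. Extend the mapping with the remaining table rows before relying on the result.

```python
# Hypothetical helper: estimate the online row size of a schema.
# Limits are taken from the text above.
MAX_COLUMNS = 500
MAX_ROW_BYTES = 30000

# Assumption: only the BLOB row of the size table is reproduced here.
TYPE_SIZES = {"BLOB": 256}
DEFAULT_SIZE = 8  # the "other" row of the table


def estimate_row_size(online_types, type_sizes=TYPE_SIZES):
    """Return the estimated row size in bytes for a list of online type names."""
    return sum(type_sizes.get(t.upper(), DEFAULT_SIZE) for t in online_types)


def check_row_limits(online_types):
    """Return a list of violations; the list is empty if the schema fits."""
    problems = []
    if len(online_types) > MAX_COLUMNS:
        problems.append(
            f"{len(online_types)} columns exceeds the limit of {MAX_COLUMNS}"
        )
    size = estimate_row_size(online_types)
    if size > MAX_ROW_BYTES:
        problems.append(
            f"estimated row size {size} bytes exceeds {MAX_ROW_BYTES}"
        )
    return problems


# "OTHER_TYPE" is a placeholder for any type not listed in TYPE_SIZES.
schema = ["OTHER_TYPE", "BLOB", "BLOB"]
print(estimate_row_size(schema))
print(check_row_limits(schema))
```

Returning a list of violations instead of raising lets the caller report both the column count and the byte size in one pass.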
#### Pre-insert schema validation for online feature groups
For online-enabled feature groups, the dataframe to be ingested must adhere to the online schema definitions, so the input dataframe is validated against the schema before insertion.

The validation is enabled by default and can be disabled by setting the keyword argument below when calling `insert()`.

The most important validation checks and error messages are listed below, along with possible corrective actions.

1. Primary key contains null values

    - **Rule** Primary key columns must not contain null values.
    - **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign each a unique value, following your preferred data imputation strategy.

    === "Pandas"
        ```python
        # Drop rows: assuming 'id' is the primary key column
        df = df.dropna(subset=['id'])
        # For composite keys
        df = df.dropna(subset=['id1', 'id2'])

        # Data imputation: replace null values by incrementing from the last integer id
        # existing max id
        max_id = df['id'].max()
        # counter to generate new ids
        next_id = max_id + 1
        # for each null id, assign the next id incrementally
        for idx in df[df['id'].isna()].index:
            df.loc[idx, 'id'] = next_id
            next_id += 1
        ```

2. Primary key column missing

    - **Rule** The dataframe to be inserted must contain all the columns defined as primary key(s) in the feature group.
    - **Example correction** Add all the primary key columns to the dataframe.

    === "Pandas"
        ```python
        # incrementing primary key up to the length of the dataframe
        df['id'] = range(1, len(df) + 1)
        ```

3. String length exceeded

    - **Rule** The character length of a string must not exceed the maximum length allowed by the online schema type of the feature. If the feature group has not been created yet and no explicit feature schema was provided, the limit is automatically increased to the maximum length found in a string column of the dataframe.
    - **Example correction**

        - Trim the string values to fit within the maximum limit set during feature group creation.
        - Another option is to simply [create a new version of the feature group](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_or_create_feature_group) and insert the dataframe.
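The three checks above can also be approximated on the client before calling `insert()`. The following is a minimal sketch and not the library's own validation logic: `precheck`, `primary_keys` and `max_lengths` are hypothetical names introduced here, and the per-column length limits are assumed to be known to the caller.

```python
import pandas as pd


def precheck(df, primary_keys, max_lengths):
    """Return a list of problems found in `df` before insertion.

    primary_keys: names of the primary key columns (assumed known to the caller).
    max_lengths: hypothetical mapping of string column name -> max characters.
    """
    problems = []
    # Check 2: every primary key column must be present.
    missing = [c for c in primary_keys if c not in df.columns]
    if missing:
        problems.append(f"missing primary key columns: {missing}")
    # Check 1: primary key columns must not contain nulls.
    for col in primary_keys:
        if col in df.columns and df[col].isna().any():
            problems.append(f"null values in primary key column '{col}'")
    # Check 3: strings must fit within the configured length.
    for col, limit in max_lengths.items():
        if col in df.columns:
            longest = df[col].dropna().astype(str).str.len().max()
            if pd.notna(longest) and longest > limit:
                problems.append(
                    f"'{col}' has strings longer than {limit} characters"
                )
    return problems


df = pd.DataFrame({"id": [1, None, 3], "name": ["ab", "abcdefgh", None]})
print(precheck(df, primary_keys=["id", "id2"], max_lengths={"name": 5}))

# One possible correction: drop null keys, trim long strings, add the missing key.
fixed = df.dropna(subset=["id"]).copy()
fixed["name"] = fixed["name"].str.slice(0, 5)
fixed["id2"] = range(1, len(fixed) + 1)
print(precheck(fixed, primary_keys=["id", "id2"], max_lengths={"name": 5}))
```

Trimming with `str.slice` silently discards data, so creating a new feature group version may be preferable when the full string matters.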

!!! note
    The total row size must stay below 30 KB, as per the [row size restrictions](#online-restrictions-for-row-size). If long strings would push a row past this limit, the feature can be defined as **TEXT** or **BLOB**.
    Below is an example of explicitly defining the string column with TEXT as its online type.
All timestamp features are stored in Hopsworks in UTC time. Also, all timestamp-based functions (such as [point-in-time joins](../../../concepts/fs/feature_view/offline_api.md#point-in-time-correct-training-data)) use UTC time.