add pre-insert schema validation errors #453
base: main
Conversation
Pull Request Overview
This PR adds documentation detailing pre-insert schema validations for online feature groups with common error types and corrective actions.
- Adds a new documentation section on pre-insert schema validation for online feature groups.
- Introduces a table outlining key error types, requirements, and suggestions for correction.
Co-authored-by: Copilot <[email protected]>
Pull Request Overview
This PR adds documentation for pre-insert schema validation errors for online feature groups, providing guidance on how to handle common validation issues.
- Introduces a new section on pre-insert schema validation.
- Provides Python and Pandas code examples for handling primary key null values, missing primary key columns, and string length issues.
- **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign them an unique value as per preferred strategy for data imputation.
Typo found: 'an unique value' should be 'a unique value'.
Suggested change:
- **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign them an unique value as per preferred strategy for data imputation.
+ **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign them a unique value as per preferred strategy for data imputation.
=== "Pandas" | ||
```python | ||
import pandas as pd | ||
# example dummy datafrane with the string column |
Typo found: 'datafrane' should be corrected to 'dataframe'.
Suggested change:
- # example dummy datafrane with the string column
+ # example dummy dataframe with the string column
#### Pre-insert schema validation for online feature groups

The input dataframe can be validated against the valid online schema data types before online ingestion. The most important checks are listed below along with possible corrective actions. Validation is enabled by setting the keyword argument `validation_options={'run_validation': True}` in the `insert()` API of feature groups.
The validation is enabled by default so maybe you should say how to disable it instead
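For example, a minimal sketch of toggling the check on an existing feature group handle `fg` (assuming the same `run_validation` flag also accepts `False` to switch the check off, as this comment implies):

```python
# fg is an online-enabled feature group handle, df the dataframe to ingest

# explicitly enable the pre-insert schema validation (reportedly the default)
fg.insert(df, validation_options={'run_validation': True})

# disable it, e.g. when the dataframe is already known to match the schema
fg.insert(df, validation_options={'run_validation': False})
```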
| Error type | Requirement | Suggested corrections |
|---|---|---|
| Primary key contains null values | Primary key columns must not contain any null values. For composite keys, all primary key columns are checked for nulls. | Remove the null rows from the dataframe, or impute the null values as applicable. |
Add a snippet to remove rows with at least one null entry in the PK column. You can do Spark, Polars and Pandas. And impute -> input.
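A minimal sketch of what such a snippet could look like; `pk_cols`, the column name, and the `"unknown"` placeholder are assumptions, not part of the proposed docs:

```python
import pandas as pd

pk_cols = ["customer_id"]  # hypothetical primary key column(s)

# Pandas: drop rows where any primary key column is null ...
df = df.dropna(subset=pk_cols)

# ... or impute the missing keys with a placeholder value instead
df = df.fillna({col: "unknown" for col in pk_cols})

# Polars equivalent (assuming a polars.DataFrame `pl_df`):
#   pl_df = pl_df.drop_nulls(subset=pk_cols)

# PySpark equivalent (assuming a pyspark.sql.DataFrame `spark_df`):
#   spark_df = spark_df.dropna(subset=pk_cols)
```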
| Error type | Requirement | Suggested corrections |
|---|---|---|
| Primary key column is missing | The dataframe to be inserted must contain all the features defined in the primary key as per the feature group schema. | Add all the primary key columns in the dataframe. |
| Event time column is missing | The dataframe to be inserted must contain an event time column if it was specified in the schema during feature group creation. | Add the event time column in the dataframe. |
I think here you need to be clear that the type of this column potentially needs to match what is in the FG; I think we support both string and bigint. Give a snippet to put the date of the insertion.
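A possible pandas sketch for stamping the insertion date, covering both variants depending on how the event time feature is typed in the FG schema; the column name `event_time` is an assumption:

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)

# variant 1: the event time feature is typed as a string/timestamp
df["event_time"] = now.strftime("%Y-%m-%d %H:%M:%S")

# variant 2: the event time feature is typed as bigint -> epoch milliseconds
# df["event_time"] = int(now.timestamp() * 1000)
```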
| Error type | Requirement | Suggested corrections |
|---|---|---|
| String length exceeded | The character length of a string row exceeds the maximum length specified in the feature online schema. However, if the feature group is not yet created and no explicit schema was provided during feature group creation, the length is auto-increased to the maximum length found in the string column. This is handled during the first data ingestion and no user action is needed in that case. **Note:** The maximum row size in bytes should be less than 30000. | Trim the string values to fit within the maximum set during feature group creation, or remove the invalid rows. If the lengths are very long, consider changing the feature schema to **TEXT** or **BLOB**. |
This is unclear. The explanation that this error is not thrown on first creation does not belong in this place. You need to link to the online data types. And "very long" is not precise enough; either insert a link or write things explicitly. Provide a snippet for trimming.
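The "remove the invalid rows" option from the table could be sketched like this in pandas (a truncation variant already appears in the hunk quoted further down); `text_column` and the 100-character limit are placeholders:

```python
max_length = 100  # placeholder: the limit defined in the online schema

# keep only the rows whose string value fits within the online column size
df = df[df["text_column"].str.len().fillna(0) <= max_length]
```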
=== "Pandas" | ||
```python | ||
# Add missing primary key column | ||
df['id'] = some_value |
What are you talking about? all entries would have the same PK if you do that
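If the intent is to backfill a genuinely missing key column, something that guarantees a distinct value per row would be needed instead; a hypothetical sketch:

```python
import uuid

# assign a distinct surrogate key to every row rather than one constant value
df["id"] = [str(uuid.uuid4()) for _ in range(len(df))]
```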
=== "Pandas" | ||
```python | ||
max_length = 100 | ||
df['text_column'] = df['text_column'].str.slice(0, max_length) |
Is this not polars?
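Both libraries expose a slice helper; a sketch of the two forms for comparison (the polars dataframe name is an assumption, and note that polars takes an offset and a length rather than start/stop positions):

```python
max_length = 100

# pandas: Series.str.slice(start, stop)
df["text_column"] = df["text_column"].str.slice(0, max_length)

# polars: expression-based, pl.col(...).str.slice(offset, length)
# import polars as pl
# pl_df = pl_df.with_columns(pl.col("text_column").str.slice(0, max_length))
```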
```python
df['id'] = range(1, len(df) + 1)
```

3. String length exceeded
I already asked to point to the docs for enlarging the table (which will create a new version of the FG).
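As a starting point, a small pandas check to find the length the enlarged column would need; the actual resize, which creates a new feature group version, should follow the procedure in the online data types docs (the column name is a placeholder):

```python
# find the longest string currently present so the enlarged online column
# of the new feature group version can be sized safely
required_length = int(df["text_column"].str.len().max())
print(f"text_column needs at least VARCHAR({required_length})")
```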