add pre-insert schema validation errors #453

Open
wants to merge 4 commits into base: main

Conversation

dhananjay-mk
Contributor

No description provided.

dhananjay-mk requested review from Copilot and vatj on March 11, 2025 14:13
Copilot AI left a comment

Pull Request Overview

This PR adds documentation detailing pre-insert schema validations for online feature groups with common error types and corrective actions.

  • Adds a new documentation section on pre-insert schema validation for online feature groups.
  • Introduces a table outlining key error types, requirements, and suggestions for correction.

dhananjay-mk requested a review from Copilot on March 13, 2025 15:07
Copilot AI left a comment

Pull Request Overview

This PR adds documentation for pre-insert schema validation errors for online feature groups, providing guidance on how to handle common validation issues.

  • Introduces a new section on pre-insert schema validation.
  • Provides Python and Pandas code examples for handling primary key null values, missing primary key columns, and string length issues.

Comment on lines +159 to +160
- **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign them an unique value as per preferred strategy for data imputation.

Copilot AI Mar 13, 2025

Typo found: 'an unique value' should be 'a unique value'.

Suggested change
- **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign them an unique value as per preferred strategy for data imputation.
- **Example correction** Drop the rows containing null primary keys. Alternatively, find the null values and assign them a unique value as per preferred strategy for data imputation.


=== "Pandas"
```python
import pandas as pd
# example dummy datafrane with the string column
```
Copilot AI Mar 13, 2025

Typo found: 'datafrane' should be corrected to 'dataframe'.

Suggested change
# example dummy datafrane with the string column
# example dummy dataframe with the string column



#### Pre-insert schema validation for online feature groups

The input dataframe can be validated against the valid online schema data types before online ingestion. The most important checks are listed below along with possible corrective actions. Validation is enabled by setting the keyword argument `validation_options={'run_validation':True}` in the `insert()` API of feature groups.
Contributor

The validation is enabled by default so maybe you should say how to disable it instead
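
For reference, a minimal sketch of how the option might be passed (here `fg` is assumed to be an already-created, online-enabled feature group handle; per the comment above, passing `False` would be the way to skip the check if validation is on by default):

```python
import pandas as pd

# Placeholder dataframe; "fg" is assumed to be an existing online-enabled
# feature group object obtained from the feature store.
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Pass the validation option on insert, as described in the docs text above.
# If validation is on by default, 'run_validation': False would disable it.
fg.insert(df, validation_options={"run_validation": True})
```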


| Error type | Requirement | Suggested corrections |
|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| Primary key contains null values | Primary key columns must not contain any null values. For composite keys, all primary key columns are checked for nulls. | Remove the null rows from dataframe. OR impute the null values as applicable. |
Contributor

Add a snippet to remove rows with at least one null entry in the PK column. You can do spark, polars and pandas.
And impute -> input
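
A minimal pandas sketch of such a snippet (the column names `id` and `ts` are placeholders for the feature group's actual primary key columns):

```python
import pandas as pd

# Placeholder dataframe and primary key columns; use the feature group's real keys.
df = pd.DataFrame({"id": [1, None, 3], "ts": [100, 200, None], "value": [1.0, 2.0, 3.0]})
pk_columns = ["id", "ts"]

# Keep only rows that have no nulls in any primary key column.
df = df.dropna(subset=pk_columns)
```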

| Primary key column is missing | The dataframe to be inserted must contain all the features defined in the primary key as per the feature group schema. | Add all the primary key columns in the dataframe. |
| Event time column is missing | The dataframe to be inserted must contain an event time column if it was specified in the schema while feature group creation. | Add the event time column in the dataframe. |
Contributor

I think here you need to be clear about the type of this column potentially needing to match what is in the FG; I think we support both string and bigint. Give a snippet to put the date of the insertion.
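
A minimal pandas sketch of stamping the insertion date into the event time column (the column name `event_time` is a placeholder, and the dtype would need to match whatever the feature group schema declares, e.g. timestamp/string vs. bigint):

```python
import pandas as pd

# Placeholder dataframe; the event time column name must match the FG schema.
df = pd.DataFrame({"id": [1, 2, 3], "value": [1.0, 2.0, 3.0]})

# Stamp every row with the insertion time as the event time.
now = pd.Timestamp.now(tz="UTC")
df["event_time"] = now

# If the schema stores the event time as a bigint (epoch milliseconds) instead:
# df["event_time"] = int(now.timestamp() * 1000)
```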

| String length exceeded | The character length of a string row exceeds the maximum length specified in feature online schema. However, if the feature group is not created and if no explicit schema was provided during feature group creation, then the length will be auto-increased to the maximum length found in a string column. This is handled during the first data ingestion and no user action is needed in this case. **Note:** The maximum row size in bytes should be less than 30000. | Trim the string values to fit within maximum set during feature group creation. OR remove the invalid rows. If the lengths are very long consider changing the feature schema to **TEXT** or **BLOB.** |
Contributor

This is unclear. You cannot have the "on creation it does not throw this error" explanation in this place. You need to link to the online data types. And "are very long" is not precise enough. Either insert a link or write things explicitly. Provide a snippet for trimming.
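
A minimal pandas sketch of the "remove the invalid rows" option mentioned in the table (the column name `text_column` and the 100-character limit are placeholders; the real limit comes from the online schema). A trimming variant appears in the diff further down:

```python
import pandas as pd

# Placeholder data and limit; use the length defined in the online schema.
df = pd.DataFrame({"id": [1, 2], "text_column": ["short", "x" * 500]})
max_length = 100

# Drop rows whose string exceeds the allowed length
# (rows with a null in text_column are dropped by this filter as well).
df = df[df["text_column"].str.len() <= max_length]
```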

=== "Pandas"
```python
# Add missing primary key column
df['id'] = some_value
```
Contributor

What are you talking about? All entries would have the same PK if you do that.

=== "Pandas"
```python
max_length = 100
df['text_column'] = df['text_column'].str.slice(0, max_length)
```
Contributor

Is this not polars?
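
For comparison, a polars version of the same truncation might look like the sketch below (an illustration only; `text_column` and the limit are placeholders):

```python
import polars as pl

# Placeholder data; trim text_column to at most max_length characters.
df = pl.DataFrame({"id": [1, 2], "text_column": ["short", "x" * 500]})
max_length = 100

df = df.with_columns(pl.col("text_column").str.slice(0, max_length))
```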

```python
df['id'] = range(1, len(df) + 1)
```

3. String length exceeded
Contributor

I already asked to point to the docs on enlarging the table (which will create a new version of the FG).
