| 1 | +# Lazy Validation |
| 2 | + |
| 3 | +In many cases, dataframely's capability to validate and filter input data is used at core application boundaries. |
| 4 | +As a result, `validate` and `filter` are generally expected to be used at points where `collect` is called on a lazy |
| 5 | +frame. However, there may be situations where validation or filtering should simply be added to the lazy computation |
| 6 | +graph. Starting in dataframely v2, this is supported via a custom polars plugin. |
| 7 | + |
| 8 | +## The `eager` parameter |
| 9 | + |
| 10 | +All of the following methods expose an `eager: bool` parameter: |
| 11 | + |
| 12 | +- {meth}`Schema.validate() <dataframely.Schema.validate>` |
| 13 | +- {meth}`Schema.filter() <dataframely.Schema.filter>` |
| 14 | +- {meth}`Collection.validate() <dataframely.Collection.validate>` |
| 15 | +- {meth}`Collection.filter() <dataframely.Collection.filter>` |
| 16 | + |
| 17 | +By default, `eager=True`. However, users may decide to set `eager=False` in order to simply append the validation or |
| 18 | +the filtering operation to the lazy frame. For example, one might decide to run validation lazily: |
| 19 | + |
| 20 | +```python |
| 21 | +def validate_lf(lf: pl.LazyFrame) -> pl.LazyFrame: |
| 22 | +    return lf.pipe(MySchema.validate, eager=False) |
| 23 | +``` |
| 24 | + |
| 25 | +When `eager=False`, validation runs only once the lazy frame is collected. If the input data does not satisfy the |
| 26 | +schema, the call to `validate` itself raises no error; the error surfaces later, when `collect` is called. |
| 27 | + |
| 28 | +## Error Types |
| 29 | + |
| 30 | +Due to current limitations in polars plugins, the type of error raised by `validate` (for both schemas and |
| 31 | +collections) depends on the value of the `eager` parameter: |
| 32 | + |
| 33 | +- When `eager=True`, a {class}`~dataframely.ValidationError` is raised from the `validate` function |
| 34 | +- When `eager=False`, a {class}`~polars.exceptions.ComputeError` is raised from the `collect` function |
| 35 | + |
| 36 | +```{note} |
| 37 | +For schemas, the error _message_ itself is equivalent. |
| 38 | +For collections, the error message for `eager=False` is limited and non-deterministic: the error message only includes |
| 39 | +information about a single member and, if multiple members fail validation, the member that the error message refers to |
| 40 | +may vary across executions. |
| 41 | +``` |
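
To make the difference between the two modes concrete, here is a minimal pure-Python sketch of the behaviour described in this page. It is a toy model, not dataframely's or polars' implementation: `ToyLazyFrame`, `check_non_null`, and the two exception classes are hypothetical stand-ins, assumed only for illustration. It shows that a deferred check raises nothing when it is attached, and that the failure surfaces at `collect` with a different error type.

```python
from dataclasses import dataclass, field

Rows = list[dict]


class ValidationError(Exception):
    """Stand-in for the error raised by an eager validation (hypothetical)."""


class ComputeError(Exception):
    """Stand-in for the error the engine raises when a deferred step fails."""


def check_non_null(rows: Rows) -> Rows:
    # A single toy rule: column "x" must not contain None.
    bad = [r for r in rows if r["x"] is None]
    if bad:
        raise ValidationError(f"{len(bad)} row(s) failed rule 'x is not null'")
    return rows


@dataclass
class ToyLazyFrame:
    source: Rows
    steps: list = field(default_factory=list)

    def with_step(self, step) -> "ToyLazyFrame":
        # Appending a step only extends the plan; nothing is executed yet.
        return ToyLazyFrame(self.source, [*self.steps, step])

    def collect(self) -> Rows:
        rows = self.source
        for step in self.steps:
            try:
                rows = step(rows)
            except ValidationError as exc:
                # The engine only knows its own error type, so the original
                # exception class is lost and only the message survives.
                raise ComputeError(str(exc)) from exc
        return rows


def validate(lf: ToyLazyFrame, *, eager: bool = True) -> ToyLazyFrame:
    if eager:
        check_non_null(lf.collect())  # fails right here with ValidationError
        return lf
    return lf.with_step(check_non_null)  # fails later, inside collect()


lf = ToyLazyFrame([{"x": 1}, {"x": None}])
deferred = validate(lf, eager=False)  # no error is raised at this point
try:
    deferred.collect()
except ComputeError as exc:
    print("raised at collect:", exc)
```

In this model, code that needs to handle validation failures uniformly would have to catch both error types, depending on which mode was used upstream.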