[RFC] New @strict decorator for dataclass validation #2895

Open: wants to merge 13 commits into base: main
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -86,3 +86,5 @@
  title: Webhooks server
- local: package_reference/serialization
  title: Serialization
- local: package_reference/dataclasses
  title: Strict dataclasses
146 changes: 146 additions & 0 deletions docs/source/en/package_reference/dataclasses.md
@@ -0,0 +1,146 @@
# Strict Dataclasses

The `huggingface_hub` package provides a utility to create **strict dataclasses**. These are enhanced versions of Python's standard `dataclass` with additional validation features. Strict dataclasses ensure that fields are validated both during initialization and assignment, making them ideal for scenarios where data integrity is critical.

## Overview

Strict dataclasses are created using the `@strict` decorator. They extend the functionality of regular dataclasses by:

- Validating field types based on type hints
- Supporting custom validators for additional checks
- Optionally allowing arbitrary keyword arguments in the constructor
- Validating fields both at initialization and during assignment

## Benefits

- **Data Integrity**: Ensures fields always contain valid data
- **Ease of Use**: Integrates seamlessly with Python's `dataclass` module
- **Flexibility**: Supports custom validators for complex validation logic
- **Lightweight**: Requires no additional dependencies such as Pydantic, attrs, or similar libraries

## Usage

### Basic Example

```python
from dataclasses import dataclass
from huggingface_hub.dataclasses import strict, validated_field

# Custom validator to ensure a value is positive
def positive_int(value: int):
    # Reject zero as well: zero is not a positive number
    if value <= 0:
        raise ValueError(f"Value must be positive, got {value}")

@strict
@dataclass
class Config:
    model_type: str
    hidden_size: int = validated_field(validator=positive_int)
    vocab_size: int = 16  # Default value
```

Fields are validated during initialization:

```python
config = Config(model_type="bert", hidden_size=768) # Valid
config = Config(model_type="bert", hidden_size=-1) # Raises StrictDataclassFieldValidationError
```

Fields are also validated during assignment:

```python
config.hidden_size = 512 # Valid
config.hidden_size = -1 # Raises StrictDataclassFieldValidationError
```
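
One way to get validation at both initialization and assignment with only the standard library is to hook `__setattr__`: the `__init__` generated for a non-frozen dataclass sets each field through ordinary attribute assignment, so a single hook covers both paths. The sketch below illustrates that idea under stated assumptions; it is not the `huggingface_hub` implementation. The decorator name `validate_on_assignment`, the `"validator"` metadata key, and the class `SketchConfig` are hypothetical, and the sketch raises a plain `ValueError` rather than `StrictDataclassFieldValidationError`.

```python
from dataclasses import dataclass, field, fields


def positive_int(value: int) -> None:
    if value <= 0:
        raise ValueError(f"Value must be positive, got {value}")


def validate_on_assignment(cls):
    """Hypothetical decorator: run a field's validator on every assignment."""
    # Collect validators stored under the (assumed) "validator" metadata key.
    validators = {
        f.name: f.metadata["validator"]
        for f in fields(cls)
        if "validator" in f.metadata
    }
    original_setattr = cls.__setattr__

    def __setattr__(self, name, value):
        validator = validators.get(name)
        if validator is not None:
            validator(value)  # raises before the value is stored
        original_setattr(self, name, value)

    cls.__setattr__ = __setattr__
    return cls


@validate_on_assignment
@dataclass
class SketchConfig:
    model_type: str
    hidden_size: int = field(metadata={"validator": positive_int})


sketch = SketchConfig(model_type="bert", hidden_size=768)  # validated in __init__
sketch.hidden_size = 512  # validated again on assignment
```

Because the validator runs before the value is stored, a rejected assignment leaves the previous value in place.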

### Custom Validators

You can attach one or more custom validators to a field by passing a single callable or a list of callables to `validated_field`. A validator is a callable that takes the field value as its single argument and raises an exception if the value is invalid.

```python
def multiple_of_64(value: int):
    if value % 64 != 0:
        raise ValueError(f"Value must be a multiple of 64, got {value}")

@strict
@dataclass
class Config:
    hidden_size: int = validated_field(validator=[positive_int, multiple_of_64])
```

In this example, both validators are applied to the `hidden_size` field.
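
To make "both validators are applied" concrete, here is a minimal standard-library sketch of accepting either one validator or a list and running each in turn. The helper `run_validators` and the `Validator` alias are hypothetical names for illustration. Running validators in list order and stopping at the first failure is an assumption of this sketch, not a documented guarantee of `@strict`.

```python
from typing import Any, Callable, List, Union

Validator = Callable[[Any], None]


def positive_int(value: int) -> None:
    if value <= 0:
        raise ValueError(f"Value must be positive, got {value}")


def multiple_of_64(value: int) -> None:
    if value % 64 != 0:
        raise ValueError(f"Value must be a multiple of 64, got {value}")


def run_validators(value: Any, validators: Union[Validator, List[Validator]]) -> None:
    """Hypothetical helper: apply one validator or a list of validators."""
    if callable(validators):
        validators = [validators]  # normalize the single-validator form
    for validator in validators:
        validator(value)  # the first failing validator raises


run_validators(128, [positive_int, multiple_of_64])  # passes both checks
run_validators(128, positive_int)  # the single-callable form also works
```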

### Additional Keyword Arguments

By default, strict dataclasses only accept fields defined in the class. You can allow additional keyword arguments by setting `accept_kwargs=True` in the `@strict` decorator.

```python
@strict(accept_kwargs=True)
@dataclass
class ConfigWithKwargs:
    model_type: str
    vocab_size: int = 16

config = ConfigWithKwargs(model_type="bert", vocab_size=30000, extra_field="extra_value")
print(config) # ConfigWithKwargs(model_type='bert', vocab_size=30000, *extra_field='extra_value')
```

Additional keyword arguments appear in the string representation of the dataclass but are prefixed with `*` to highlight that they are not validated.
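
For intuition, the behavior can be approximated with the standard library by wrapping the generated `__init__`: declared fields go to the original constructor, and anything else is stored as a plain attribute. This is a rough sketch only. The decorator name `accept_extra_kwargs` and the class `SketchWithKwargs` are hypothetical, the code is not how `@strict(accept_kwargs=True)` is implemented, and it does not reproduce the `*`-prefixed representation described above.

```python
from dataclasses import dataclass, fields


def accept_extra_kwargs(cls):
    """Hypothetical decorator: let the constructor accept undeclared keyword arguments."""
    field_names = {f.name for f in fields(cls)}
    original_init = cls.__init__

    def __init__(self, *args, **kwargs):
        # Split keyword arguments into declared fields and extras.
        known = {k: v for k, v in kwargs.items() if k in field_names}
        extra = {k: v for k, v in kwargs.items() if k not in field_names}
        original_init(self, *args, **known)
        # Extras are attached as-is, without any validation.
        for name, value in extra.items():
            setattr(self, name, value)

    cls.__init__ = __init__
    return cls


@accept_extra_kwargs
@dataclass
class SketchWithKwargs:
    model_type: str
    vocab_size: int = 16


sketch_kwargs = SketchWithKwargs(model_type="bert", vocab_size=30000, extra_field="extra_value")
```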

### Integration with Type Hints

Strict dataclasses check each field value against its type hint automatically, with no custom validator required. For example:

```python
from typing import List

@strict
@dataclass
class Config:
    layers: List[int]

config = Config(layers=[64, 128]) # Valid
config = Config(layers="not_a_list") # Raises StrictDataclassFieldValidationError
```

Supported types include:
- Any
- Union
- Optional
- Literal
- List
- Dict
- Tuple
- Set

And any combination of these types.
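
Checking nested combinations such as `Dict[str, List[int]]` is naturally recursive. The sketch below shows one way to do this with `typing.get_origin` and `typing.get_args` from the standard library. The function name `matches` is hypothetical, only a subset of the types listed above is handled, and the code is an illustration of the general technique rather than the checker used by `@strict`.

```python
from typing import Any, Dict, List, Literal, Optional, Union, get_args, get_origin


def matches(value: Any, hint: Any) -> bool:
    """Hypothetical helper: return True if `value` conforms to `hint`."""
    if hint is Any:
        return True
    origin = get_origin(hint)
    args = get_args(hint)
    if origin is None:
        # Plain class such as int, str, or NoneType
        return isinstance(value, hint)
    if origin is Union:
        # Optional[X] is Union[X, None], so it is covered here as well
        return any(matches(value, arg) for arg in args)
    if origin is Literal:
        return value in args
    if origin is list:
        if not isinstance(value, list):
            return False
        return not args or all(matches(item, args[0]) for item in value)
    if origin is dict:
        if not isinstance(value, dict):
            return False
        return not args or all(
            matches(k, args[0]) and matches(v, args[1]) for k, v in value.items()
        )
    raise TypeError(f"Type hint not supported by this sketch: {hint!r}")


ok = matches([64, 128], List[int])
bad = matches("not_a_list", List[int])
```

Here `ok` is `True` and `bad` is `False`, mirroring the valid and invalid `Config(layers=...)` calls above.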

## API Reference

### `@strict`

The `@strict` decorator enhances a dataclass with strict validation.

[[autodoc]] dataclasses.strict

### `validated_field`

Creates a dataclass field with custom validation.

[[autodoc]] dataclasses.validated_field

### Errors

[[autodoc]] errors.StrictDataclassError

[[autodoc]] errors.StrictDataclassDefinitionError

[[autodoc]] errors.StrictDataclassFieldValidationError

## Why Not Use `pydantic`? (or `attrs`? or `marshmallow_dataclass`?)

- See discussion in https://github.com/huggingface/transformers/issues/36329 regarding adding Pydantic as a dependency. It would be a heavy addition and require careful logic to support both v1 and v2.
- We don't need most of Pydantic's features, especially those related to automatic casting, jsonschema, serialization, aliases, etc.
- We don't need the ability to instantiate a class from a dictionary.
- We don't want to mutate data. In `@strict`, "validation" means "checking if a value is valid." In Pydantic, "validation" means "casting a value, possibly mutating it, and then checking if it's valid."
- We don't need blazing-fast validation. `@strict` isn't designed for heavy loads where performance is critical. Common use cases involve validating a model configuration (performed once and negligible compared to running a model). This allows us to keep the code minimal.