Question about RNG state in NNX checkpointing with Orbax #4514

conorhassan · 2025-01-29T14:22:43Z

When attempting to save NNX models using Orbax checkpointing, certain models containing RNG state (from Dropout/MultiHeadAttention) failed with:

TypeError: Cannot interpret 'key<fry>' as a data type

A simple MLP with dropout works fine with standard checkpointing:

class SimpleModelWithDropout(nnx.Module):
    def __init__(self, rngs: nnx.Rngs):
        self.linear = nnx.Linear(64, 64, rngs=rngs)
        self.dropout = nnx.Dropout(0.2)
    
    def __call__(self, x):
        x = self.linear(x)
        return self.dropout(x)

model = SimpleModelWithDropout(rngs=nnx.Rngs(0))
_, state = nnx.split(model)
checkpointer.save(path, state)  # This works!

However, more complex models like this transformer implementation fail:

class TransformerBlock(nnx.Module):
    def __init__(self, input_dim: int, rngs: nnx.Rngs):
        self.attention = nnx.MultiHeadAttention(
            num_heads=4,
            in_features=input_dim,
            dropout_rate=0.2,
            rngs=rngs
        )
        self.mlp = MLP(...)  # Similar to SimpleModelWithDropout above

model = TransformerBlock(64, rngs=nnx.Rngs(0))
_, state = nnx.split(model)
checkpointer.save(path, state)  # This fails!

Full implementation of the failing model available here: https://github.com/conorhassan/tnp/blob/main/tnp/models/nnx_models/layers.py

The solution I found was to save only the parameters:

# This fails with complex models:
_, state = nnx.split(model)
checkpointer.save(path, state)

# This works for all models:
_, params, _ = nnx.split(model, nnx.Param, ...)
checkpointer.save(path, params)

Example of what the State that fails looks like:

State({
    'attention': {
        'rngs': {
            'default': {
                'count': VariableState(type=RngCount, value=Array(6, dtype=uint32)),
                'key': VariableState(
                    type=RngKey,
                    value=Array((), dtype=key<fry>) overlaying: [0 0]
                )
            }
        }
    }
})

Questions

Is this the intended way to handle checkpointing with NNX and Orbax?
Why does a simple MLP with dropout work with standard split() but more complex models fail?
Is there something specific about MultiHeadAttention's implementation that causes it to store RNG state differently from simple Dropout?

Thanks for any help or thoughts!

cgarciae · 2025-02-03T19:17:07Z

Maintainer

Hey, this is a known Orbax issue; see google/orbax#1105.
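Until typed keys are handled by the checkpointer, one possible workaround (an assumption on my part, not something stated in the linked issue) is to convert typed PRNG keys to their raw `uint32` data before saving and wrap them again after loading. The sketch uses a plain dict shaped like the failing State; `keys_to_data` and `data_to_keys` are hypothetical helper names.

```python
import jax
import jax.numpy as jnp


def is_key(x):
    # True for new-style typed PRNG keys such as dtype key<fry>.
    return hasattr(x, "dtype") and jnp.issubdtype(x.dtype, jax.dtypes.prng_key)


def keys_to_data(tree):
    # Replace every typed key with its raw uint32 representation.
    return jax.tree.map(
        lambda x: jax.random.key_data(x) if is_key(x) else x, tree
    )


def data_to_keys(template, raw):
    # Use the original tree as a template to know which leaves were keys.
    return jax.tree.map(
        lambda t, r: jax.random.wrap_key_data(r) if is_key(t) else r,
        template,
        raw,
    )


# Plain-dict stand-in for the 'rngs' -> 'default' entry of the failing State.
rng_state = {"count": jnp.uint32(6), "key": jax.random.key(0)}

raw = keys_to_data(rng_state)            # only ordinary integer arrays now
restored = data_to_keys(rng_state, raw)  # typed key is back
```

`wrap_key_data` uses the default PRNG implementation here, so this round trip assumes the key was created with the default implementation as well.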

10 replies

DiagRisker Feb 18, 2025

@cgarciae Sure, at least I can share this:

import jax, os
from jax import numpy as jnp, scipy as jsp
from jax import lax

from typing import Sequence, Any, Callable, Union, NamedTuple 
from jax.tree_util import Partial
from flax import nnx
cls = lambda : os.system('cls'); cls()

seed = 42

# ___ Random number routine
class Rng(): # for manual initialization
    def __init__(self, seed = seed):
        self.key = jax.random.key(seed)
        # print(self.key)
    def __call__(self):
        self.key = jax.random.split(self.key)[1]
        return self.key

# ___ TestBlock
class Examblock(nnx.Module):
    def __init__(self,
               Operator: nnx.Module = nnx.Conv,
               dropout : float = 0.,
               rngs = nnx.Rngs(default = seed), # default statement is crucial (if not manually adapt to the whole model sub arguments)
               **kwargs):
        if dropout > 0:
            self.Drop = nnx.Dropout( rate = dropout, rngs = rngs)
        else:
            self.Drop = jax.jit( lambda x : x)
        self.Operator = Operator( **kwargs)
    # Partial( jax.jit ... will create Dropout hurdles: state jit compilation each time the rng states changes..
    def __call__(self, x: jax.Array):
        return self.Operator( self.Drop(x))

# __Parameter regrouping functions

import inspect
def get_class_from_module(module):
    classes = []
    for name, obj in inspect.getmembers(module):
        if inspect.isclass(obj) and obj.__module__ == module.__name__:
            classes.append(obj)
    return classes
from flax import nnx
AdmC = get_class_from_module(nnx.variablelib) #example of admissible classes for the naming

def NaPytree( pytree, name = ''):
    """Return a flat list of all tree branch names, keeping the hierarchical name info."""
    if any(isinstance(pytree, cls) for cls in AdmC):
        return [name]
    LD = dict(pytree)
    if not len(LD):
        raise Exception( f"not a proper Pytree len(dict(pytree)) : {len(LD)}")
    K = LD.keys()
    names_list = []
    for el in K:
        names_list.extend( NaPytree(LD[el], name = name + '_' + el))
    return names_list

def ParamGp(params):
    Names = NaPytree(params) # flattened list of parameters name (hierarchical/ cascaded naming)
    PV = jax.tree_util.tree_flatten(params)[0] # state first
    Weights = { f'{name}' : val for name,val in zip(Names,PV) }
    return Weights

# __ Manual Gradient
def Gjacob(f):
    """General gradient with respect to each argument (the Jacobian projected onto an all-ones cotangent).
        Usage:
            direct gradient:
                Gjacob(fun)(*args)
    """
    def Gacob(*x):
        y, vjp_fn = jax.vjp(f, *x)
        return vjp_fn(jnp.ones_like(y)) # cotangent of all ones
    return Gacob

if __name__ == "__main__":
    batch_dim = 7
    inC, outC = 1, 3
    #
    spatial_dim = (10,10)
    kernel_dim = (3,3)
    #
    rng = Rng()
    input = jax.random.uniform( key = rng(), shape = (batch_dim,)+ spatial_dim+( inC,) )
    print( 'input.shape : ',input.shape)
    Layer2 = Examblock( Partial(nnx.Conv, in_features = inC, out_features = outC, kernel_size = kernel_dim, use_bias = False, rngs = nnx.Rngs(default = seed ) )
                        , dropout = .37654)
    Y = Layer2(input)
    Var = nnx.variables( Layer2)
    def loss( state, graphdef, input):
        model = nnx.merge(graphdef, state)
        return ((model(input))**2).mean()
    graphdef, state = nnx.split(Layer2)
    grads = Gjacob( Partial(loss, graphdef = graphdef, input = input))(state)
    params = grads[0]
    VParadict = ParamGp(Var)
    GParadict = ParamGp(params)
    print(GParadict.keys())
    print(VParadict.keys())
    print(type( VParadict['_Drop_rngs_default_count']) , VParadict['_Drop_rngs_default_count'].dtype )
    print(type( GParadict['_Drop_rngs_default_count']), GParadict['_Drop_rngs_default_count'].dtype)

GParadict holds a flat dict of parameter names (with cascaded, tree-like naming).
It includes '_Drop_rngs_default_count', the same as VParadict; the difference is:

print(type( VParadict['_Drop_rngs_default_count']) , VParadict['_Drop_rngs_default_count'].dtype )
<class 'flax.nnx.rnglib.RngCount'> uint32
print(type( GParadict['_Drop_rngs_default_count']), GParadict['_Drop_rngs_default_count'].dtype)
<class 'numpy.ndarray'> [('float0', 'V')]

The problem is that an entry such as '_Drop_rngs_default_count' is something I don't want to pass along during optimization/learning, and I also don't want it when inspecting my weight distributions. My goal is to use the ParamGp output to create grid plots with the distribution of each parameter (that part of the code is not needed for this minimal example).

In the gradient case, I could explicitly filter out anything with a certain string in its name (related to Dropout, for instance).
But I would rather use a jax-like filter that keeps Dropout state out of both the gradient and the output of nnx.variables.
For example, nnx.state(Layer2, nnx.Dropout) returns an empty dict.
Furthermore, using:

G = jax.grad(Partial(loss, graphdef = graphdef, input = input))( state)

fails because it finds an integer dtype in the pytree (it comes from the Dropout state). So how do we proceed in this case?
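The `float0` dtype shown above is what `jax.vjp` returns as the cotangent of an integer input, so one way to drop such entries without matching on names is to filter on that dtype. The sketch below is a toy reproduction with a hypothetical two-argument function, not the model from the post; the dict keys merely mimic the ParamGp naming.

```python
import jax
import jax.numpy as jnp


def f(weights, count):
    # `count` is an integer input, like the RngCount inside the Dropout state.
    return (weights ** 2).sum() * count


weights = jnp.arange(3.0)
count = jnp.uint32(6)

y, vjp_fn = jax.vjp(f, weights, count)
g_weights, g_count = vjp_fn(jnp.ones_like(y))

# Names chosen to resemble the flattened naming used above.
grads = {"_Operator_kernel": g_weights, "_Drop_rngs_default_count": g_count}

# Keep only entries that carry a real gradient.
real_grads = {
    name: g for name, g in grads.items() if g.dtype != jax.dtypes.float0
}
print(list(real_grads))
```

This only cleans up the result after the fact; it does not help with `jax.grad`, which rejects integer inputs before any gradient is computed.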

DiagRisker Feb 27, 2025

I edited out my manual implementation of nnx.Conv (see my first post in the discussion section), since the problem does not depend on it, and reduced the amount of code.
@cgarciae, any idea how to keep the Dropout state out of the gradient and out of the state output by filtering (given this example), and how to make jax.grad work with nnx.Dropout?

cgarciae Mar 1, 2025
Maintainer

@DiagRisker thanks for providing the example! How about this approach?

def loss(state, rng_state, graphdef, input):
    model = nnx.merge(graphdef, rng_state, state)
    return ((model(input))**2).mean()
graphdef, rng_state, state = nnx.split(Layer2, nnx.RngState, ...)
grads = Gjacob( Partial(loss, rng_state=rng_state, graphdef=graphdef, input = input))(state)

Basically, use nnx.split to filter out the RngState (this catches both the RngKey and RngCount Variables inside Rngs objects) and pass it as a capture to loss.

BTW: even better, we should be able to implement nnx.vjp soon so you don't have to do split / merge at all.

DiagRisker Mar 1, 2025

@cgarciae Problem solved, I can't thank you enough 👍. Now I can complete a lot of other tests!

NB: I was not familiar with the nnx.split(Layer2, nnx.RngState, ...) form (your Orbax answer hinted at it, but I thought the ellipsis was figurative), as opposed to nnx.split(Layer2, nnx.RngState), which does not work.
Great to hear about nnx.vjp.

DiagRisker Mar 3, 2025

A small side question that comes up when multiple arguments are used with split and merge (say I want to filter additional terms).
Is it normal that when I define:

def loss( graphdef = [], rng_state = [], state = [], input = []):

or

def loss( graphdef, rng_state , state , input ):

the vjp call does not work?

grads = Gjacob( Partial(loss, graphdef = graphdef, rng_state = rng_state, input = input))(state)

gives the following message:

    y, vjp_fn = jax.vjp(f, *x)
                ~~~~~~~^^^^^^^
TypeError: loss() got multiple values for argument 'graphdef'

Or should I use a lambda in general?

EDIT:

Playing with the parameter order, it seems the only thing that matters is that state must be the first argument in the definition of loss (still, that is not very convenient).
Writing grads = Gjacob( Partial(loss, graphdef = graphdef, rng_state = rng_state, input = input))(state = state) does not change the outcome; the keyword is not recognized:
Gjacob.<locals>.Gacob() got an unexpected keyword argument 'state'
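The behaviour described here follows from how `functools.partial` works (`jax.tree_util.Partial` is a subclass of it): keyword arguments are bound by name, and a positional argument supplied later fills the first positional parameter, which is `graphdef` when `state` is not listed first. The sketch below reproduces the error in plain Python with placeholder strings instead of real NNX objects and shows two ways around it.

```python
from functools import partial


def loss(graphdef, rng_state, state, x):
    # Placeholder body; only the argument binding matters here.
    return f"graphdef={graphdef}, rng_state={rng_state}, state={state}, x={x}"


bound = partial(loss, graphdef="G", rng_state="R", x="X")

try:
    # "S" is positional, so Python assigns it to the first parameter, graphdef,
    # which was already bound by keyword.
    bound("S")
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Option 1: a lambda places the differentiated argument explicitly.
via_lambda = (lambda state: loss("G", "R", state, "X"))("S")


# Option 2: put the differentiated argument first in the signature.
def loss_state_first(state, graphdef, rng_state, x):
    return loss(graphdef, rng_state, state, x)


via_partial = partial(loss_state_first, graphdef="G", rng_state="R", x="X")("S")
```

Passing `state=state` to the function returned by Gjacob fails for a different reason: `Gacob(*x)` accepts positional arguments only.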
