-
Thanks for posting this update. I remembered you commenting on this performance consideration somewhere months back but couldn't find it. I've been doing this split/merge with standard JAX transforms in my code just in case (and to stay as pure-JAX as possible). If a PR goes through to address this, I'll try switching to the NNX transforms! 👍
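For concreteness, the pattern I mean is roughly this (a minimal sketch with arbitrary layer sizes, not my actual code):

```python
import jax
import jax.numpy as jnp
from flax import nnx

model = nnx.Linear(4, 8, rngs=nnx.Rngs(0))
graphdef, state = nnx.split(model)

@jax.jit
def forward(graphdef, state, x):
  # Rebuild the module from its pure pytree state inside the traced function.
  model = nnx.merge(graphdef, state)
  return model(x)

y = forward(graphdef, state, jnp.ones((2, 4)))
```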
-
Thanks for posting this! Is there a way to do the update with metrics? Or only

```python
graphdef, state = nnx.split((model, optimizer, metrics))
...
nnx.update((model, optimizer, metrics), state)
```

because right now it raises an error:

Also, it would be great to create a page with speed-up tips for the NNX API!
-
@cgarciae As I mentioned above, I've been sticking to split/merge + JAX transforms to future-proof against any performance hits. However, I would consider switching to NNX transforms for my current dev work if the expectation is that the Rust extension will definitively close the performance gap. Can you comment on the expected gains with flaxlib?
-
@cgarciae in your example, at the end,
-
Big fan of NNX! I personally think there are reasons other than performance to use split/merge and standard JAX transforms. It's "closer to the metal," if you will -- once you understand the split/merge API and JAX's core APIs, you're empowered to do pretty much anything, with a little more boilerplate (holding on to the graphdef), which is not too bad in my opinion (especially since y'all have done such a great job with the static typing!). You can mix NNX's mutable reference semantics with JAX's pure functional semantics to write both convenient and bug-free code.

I worry that encouraging NNX transforms only, while sweeping split/merge under the rug, would be especially bad for newer JAX users. NNX transforms add a layer of abstraction that completely hides the underlying JAX abstractions, which may make it harder to pick up important concepts like tracing/staging out, PyTrees, sharding, etc. As a more experienced JAX user, I've definitely been finding split/merge with explicit state management more comfortable and legible. Another argument for encouraging this pattern is that, at least right now, you must understand split/merge and explicit state management to save and load checkpoints.

I realize not everyone will agree with me! My vote would be to document both split/merge and NNX transforms side-by-side as equivalent ways of doing things, even after flaxlib is complete. That way, even if people do want to use NNX transforms to save on boilerplate, they can still acquire a mental model of what is happening under the hood.
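To make the checkpointing point concrete: saving and restoring today goes through the explicit state, roughly like this (a sketch assuming Orbax's `StandardCheckpointer`; the path and `create_model` constructor are placeholders):

```python
import orbax.checkpoint as ocp
from flax import nnx

def create_model():
  # Placeholder constructor for whatever model you actually use.
  return nnx.Linear(4, 8, rngs=nnx.Rngs(0))

model = create_model()
_, state = nnx.split(model)

checkpointer = ocp.StandardCheckpointer()
checkpointer.save('/tmp/ckpt/state', state)  # save the pure pytree state

# Restoring: build an abstract model to get the state structure, then merge.
abstract_model = nnx.eval_shape(create_model)
graphdef, abstract_state = nnx.split(abstract_model)
restored_state = checkpointer.restore('/tmp/ckpt/state', abstract_state)
model = nnx.merge(graphdef, restored_state)
```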
-
FYI: pinning this discussion to google/flax/discussions/ @cgarciae #nnx
-
Hello, I'm using the following model:

```python
from typing import Any, Tuple

import jax
import jax.numpy as jnp
from flax import nnx


class GaussianFourierProjection(nnx.Module):
  """Gaussian random features for encoding time steps."""
  # embed_dim: int
  # scale: float = 30.

  def __init__(self, embed_dim: int, scale: float, *, rngs: nnx.Rngs):
    key = rngs.params()
    dout = embed_dim // 2
    self.W = nnx.Variable(jax.random.normal(key, (dout,)) * scale)

  def __call__(self, x):
    x_proj = x[:, None] * self.W[None, :] * 2 * jnp.pi
    return jnp.concatenate([jnp.sin(x_proj), jnp.cos(x_proj)], axis=-1)


class ScoreNet(nnx.Module):
  channels: Tuple[int, ...] = (32, 64, 128, 256)
  embed_dim: int = 256
  scale: float = 30.

  def __init__(self, marginal_prob_std: Any, din_t: int, rngs: nnx.Rngs):
    self.act = nnx.swish
    self.marginal_prob_std = marginal_prob_std
    # time embedding
    self.embed = GaussianFourierProjection(embed_dim=self.embed_dim,
                                           scale=self.scale,
                                           rngs=rngs)
    self.LayerEmb = nnx.Linear(self.embed_dim, self.embed_dim, rngs=rngs)
    # encoding part
    # `args` is a config dict (with 'x_dim') defined elsewhere in my script
    self.Layer1 = nnx.Linear(args['x_dim'], self.channels[0], use_bias=False, rngs=rngs)
    self.Layer1e = nnx.Linear(self.embed_dim, self.channels[0], rngs=rngs)
    self.Norm1 = nnx.GroupNorm(self.channels[0], num_groups=4, rngs=rngs)
    self.Layer2 = nnx.Linear(self.channels[0], self.channels[1], use_bias=False, rngs=rngs)
    self.Layer2e = nnx.Linear(self.embed_dim, self.channels[1], rngs=rngs)
    self.Norm2 = nnx.GroupNorm(self.channels[1], rngs=rngs)  # num_groups=32 by default
    self.Layer3 = nnx.Linear(self.channels[1], self.channels[2], use_bias=False, rngs=rngs)
    self.Layer3e = nnx.Linear(self.embed_dim, self.channels[2], rngs=rngs)
    self.Norm3 = nnx.GroupNorm(self.channels[2], rngs=rngs)  # num_groups=32 by default
    self.Layer4 = nnx.Linear(self.channels[2], self.channels[3], use_bias=False, rngs=rngs)
    self.Layer4e = nnx.Linear(self.embed_dim, self.channels[3], rngs=rngs)
    self.Norm4 = nnx.GroupNorm(self.channels[3], rngs=rngs)  # num_groups=32 by default
    # decoding part
    self.Layer5 = nnx.Linear(self.channels[3], self.channels[2], use_bias=False, rngs=rngs)
    self.Layer5e = nnx.Linear(self.embed_dim, self.channels[2], rngs=rngs)
    self.Norm5 = nnx.GroupNorm(self.channels[2], rngs=rngs)  # num_groups=32 by default
    self.Layer6 = nnx.Linear(2 * self.channels[2], self.channels[1], use_bias=False, rngs=rngs)
    self.Layer6e = nnx.Linear(self.embed_dim, self.channels[1], rngs=rngs)
    self.Norm6 = nnx.GroupNorm(self.channels[1], rngs=rngs)  # num_groups=32 by default
    self.Layer7 = nnx.Linear(2 * self.channels[1], self.channels[0], use_bias=False, rngs=rngs)
    self.Layer7e = nnx.Linear(self.embed_dim, self.channels[0], rngs=rngs)
    self.Norm7 = nnx.GroupNorm(self.channels[0], rngs=rngs)  # num_groups=32 by default
    self.Layer8 = nnx.Linear(2 * self.channels[0], args['x_dim'], rngs=rngs)

  def __call__(self, x, t):
    # time embedding
    embed = self.act(self.LayerEmb(self.embed(t)))
    # encoding
    h1 = self.Layer1(x)
    h1 += self.Layer1e(embed)
    h1 = self.Norm1(h1)
    h1 = self.act(h1)
    h2 = self.Layer2(h1)
    h2 += self.Layer2e(embed)
    h2 = self.Norm2(h2)
    h2 = self.act(h2)
    h3 = self.Layer3(h2)
    h3 += self.Layer3e(embed)
    h3 = self.Norm3(h3)
    h3 = self.act(h3)
    h4 = self.Layer4(h3)
    h4 += self.Layer4e(embed)
    h4 = self.Norm4(h4)
    h4 = self.act(h4)
    # decoding
    h = self.Layer5(h4)
    h += self.Layer5e(embed)
    h = self.Norm5(h)
    h = self.act(h)
    h = self.Layer6(jnp.concatenate([h, h3], axis=-1))
    h += self.Layer6e(embed)
    h = self.Norm6(h)
    h = self.act(h)
    h = self.Layer7(jnp.concatenate([h, h2], axis=-1))
    h += self.Layer7e(embed)
    h = self.Norm7(h)
    h = self.act(h)
    h = self.Layer8(jnp.concatenate([h, h1], axis=-1))
    # normalisation
    h = h / self.marginal_prob_std(t)[:, None]
    return h
```

I was wondering whether the performance difference is still valid, or whether my code is just not perfect (btw I'm not an expert...)?
-
Currently `nnx.jit` traverses the object graph in Python. This is slow and primarily affects the small-model regime, as the Python overhead starts to disappear as the model's width grows. To solve this in general, we will be developing a Rust extension called `flaxlib` (see first steps in #4196) to speed up some of the traversal logic in `graph.py`, similar to how JAX solved the same issue with `jaxlib` for standard pytrees.

UPDATE: see the full Performance Considerations guide.
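In the meantime, the pattern several commenters above describe keeps the graph traversal out of the jitted hot path: split once before the loop, run a pure `jax.jit`-ed step on the state, and write back with `nnx.update` at the end. A minimal sketch (assuming the `nnx.Optimizer(model, tx)` / `optimizer.update(grads)` API current at the time of this thread, with a made-up MSE loss and toy data):

```python
import jax
import jax.numpy as jnp
import optax
from flax import nnx

model = nnx.Linear(4, 1, rngs=nnx.Rngs(0))
optimizer = nnx.Optimizer(model, optax.adam(1e-3))

# Split once, outside the training loop.
graphdef, state = nnx.split((model, optimizer))

@jax.jit
def train_step(graphdef, state, x, y):
  model, optimizer = nnx.merge(graphdef, state)

  def loss_fn(model):
    return jnp.mean((model(x) - y) ** 2)

  loss, grads = nnx.value_and_grad(loss_fn)(model)
  optimizer.update(grads)
  # Re-split so the updated state flows out of the pure function.
  _, state = nnx.split((model, optimizer))
  return state, loss

x = jnp.ones((8, 4))
y = jnp.zeros((8, 1))
for _ in range(100):
  state, loss = train_step(graphdef, state, x, y)

# Sync the mutable Python objects with the final state.
nnx.update((model, optimizer), state)
```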