Batch Normalization in Flax #2080
-
Batch Normalization is more complicated than most layers because of the mutation of moving averages during training. The BatchNorm module is in normalization.py. The canonical example using it is ImageNet. In a multi-device setting, every device updates its own running statistics. If they aren't synced, they can theoretically diverge, but if your data is fairly uniform across shards they're likely to trend towards similar values. Syncing before eval is definitely a good idea though, since otherwise your eval results will depend on which devices process which examples.
Also take a look at the comments at #1489.

Let's consider an example. We define a trivial conv + BN layer:

```python
import jax.numpy as jnp
from flax import linen as nn
from jax import random

class Foo(nn.Module):
  train: bool
  filters: int

  @nn.compact
  def __call__(self, x):
    x = nn.Conv(self.filters, (1, 1), use_bias=False, dtype=jnp.float32)(x)
    x = nn.BatchNorm(use_running_average=not self.train,
                     momentum=0.9,
                     epsilon=1e-5,
                     dtype=jnp.float32)(x)
    return x
```
```python
key = random.PRNGKey(0)
x = jnp.ones((5, 4, 4, 3))

# We instantiate the layer, then call its init function to get the initial variable collections.
foo_vars = Foo(filters=7, train=True).init(key, x)
foo_vars
```

This returns the following:
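The returned FrozenDict has two collections, 'params' and 'batch_stats'. Roughly (shapes only; the actual values depend on the random initialization):

```python
# FrozenDict({
#   'params': {
#     'Conv_0': {'kernel': f32[1, 1, 3, 7]},
#     'BatchNorm_0': {'scale': f32[7], 'bias': f32[7]},
#   },
#   'batch_stats': {
#     'BatchNorm_0': {'mean': f32[7], 'var': f32[7]},
#   },
# })
```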
We explicitly say which variable collections are to be mutated by the apply function; those are then returned as auxiliary variables.

```python
y1, new_batch_stats = Foo(filters=7, train=True).apply(foo_vars, x, mutable=['batch_stats'])
new_batch_stats
```

This returns the following:
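Note that the mutated collections come back keyed by collection name; schematically (again shapes only):

```python
# FrozenDict({
#   'batch_stats': {
#     'BatchNorm_0': {'mean': f32[7], 'var': f32[7]},
#   },
# })
```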
We stitch together the params and batch stats collections to evaluate again:

```python
# new_batch_stats is keyed by collection name, so we take its 'batch_stats' entry.
new_foo_vars = {'params': foo_vars['params'], 'batch_stats': new_batch_stats['batch_stats']}
y2, even_newer_batch_stats = Foo(filters=7, train=True).apply(new_foo_vars, x, mutable=['batch_stats'])
```
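To connect this to a full training step: a minimal sketch that threads the batch statistics alongside an optax parameter update (the optimizer choice and the squared-output loss are placeholders, not part of the original post):

```python
import jax
import optax

tx = optax.sgd(learning_rate=0.1)
opt_state = tx.init(foo_vars['params'])

@jax.jit
def train_step(params, batch_stats, opt_state, x):
  def loss_fn(params):
    y, mutated = Foo(filters=7, train=True).apply(
        {'params': params, 'batch_stats': batch_stats}, x,
        mutable=['batch_stats'])
    # Dummy loss; a real model would compare y against labels.
    return jnp.mean(y ** 2), mutated['batch_stats']

  (loss, new_batch_stats), grads = jax.value_and_grad(loss_fn, has_aux=True)(params)
  updates, new_opt_state = tx.update(grads, opt_state)
  new_params = optax.apply_updates(params, updates)
  return new_params, new_batch_stats, new_opt_state, loss

params, batch_stats, opt_state, loss = train_step(
    foo_vars['params'], foo_vars['batch_stats'], opt_state, x)
```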
-
Hey, thanks for the guide. This is useful. I found something unintuitive:
and
Why are they different? BTW, I found this usage here: flax/examples/imagenet/train.py, line 69 in d068512.
-
Hi @cccntu -- what do you mean by "different"?

```python
variables = model.init(...)
# assume variables['params'] and variables['batch_stats'] are present here
other_variables, params = variables.pop('params')
# here params == variables['params'], and other_variables['batch_stats'] == variables['batch_stats']
```

Maybe the docstring for FrozenDict.pop should make this clearer. Does this help? What could we improve in our documentation so that this would be less confusing?
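For anyone else landing here, a tiny self-contained illustration of the pop semantics on a FrozenDict (toy values, not taken from the example above):

```python
from flax.core import freeze

variables = freeze({'params': {'w': 1.0}, 'batch_stats': {'mean': 0.0}})
other_variables, params = variables.pop('params')
print(params)           # FrozenDict({'w': 1.0})
print(other_variables)  # FrozenDict({'batch_stats': {'mean': 0.0}})
```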
-
@avital Thanks for the reply. I couldn't find the docs for FrozenDict via Google search, and the readthedocs search is extremely slow, so I didn't read the doc.
-
I don't think we even have reference docs for FrozenDict on RTD; we probably should. I filed an issue: #969
-
Hey! I am very interested in the best practices for BatchNorm (e.g. when combined with pmap). Thanks!
-
See the ImageNet example for the canonical example of combining BatchNorm and pmap. There we indeed sync the statistics before evaluation with sync_batch_stats().
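For readers who want to see what that looks like, here is a sketch along the lines of the ImageNet example's utility (assuming a replicated TrainState-like object with a batch_stats field; see examples/imagenet/train.py for the real code):

```python
import jax
from jax import lax

# Average every leaf of the batch_stats pytree over the device axis.
cross_replica_mean = jax.pmap(lambda x: lax.pmean(x, 'x'), 'x')

def sync_batch_stats(state):
  """Make all replicas agree on the batch statistics before evaluation."""
  return state.replace(batch_stats=cross_replica_mean(state.batch_stats))
```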
-
Since @marcvanzee already provides an answer to "how to use batchnorm" in his original post above, I'm only going to cover the question "when and how to normalize statistics" asked by @cgarciae and more recently by @laoreja (on an internal forum), and answered above by @jheek and by @levskaya (on an internal forum).

Our examples/imagenet/ uses nn.BatchNorm in the ResNet model:

flax/examples/imagenet/models.py, lines 95 to 99 in b3236ce

Then we define a utility function sync_batch_stats()...

flax/examples/imagenet/train.py, lines 211 to 218 in b3236ce

...which we then call before evaluating...

flax/examples/imagenet/train.py, lines 357 to 361 in b3236ce

...and before writing a checkpoint:

flax/examples/imagenet/train.py, lines 369 to 371 in b3236ce

Alternatively, we could have specified axis_name on nn.BatchNorm (matching the axis name passed to pmap), so that the batch statistics are computed across all devices during training. Syncing would then happen here:

flax/flax/linen/normalization.py, lines 84 to 90 in b3236ce
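To make the axis_name alternative concrete, here is a minimal sketch (the 'batch' axis name, the SyncedFoo module, and the shapes are illustrative, not code from the ImageNet example):

```python
import functools
import jax
import jax.numpy as jnp
from flax import linen as nn

class SyncedFoo(nn.Module):
  train: bool
  filters: int

  @nn.compact
  def __call__(self, x):
    x = nn.Conv(self.filters, (1, 1), use_bias=False)(x)
    # axis_name must match the axis name given to pmap below. With it set,
    # BatchNorm averages its batch statistics across that mapped axis, so
    # every device normalizes with the same mean/var during training.
    x = nn.BatchNorm(use_running_average=not self.train,
                     momentum=0.9, epsilon=1e-5,
                     axis_name='batch')(x)
    return x

n_dev = jax.local_device_count()
x = jnp.ones((n_dev, 5, 4, 4, 3))  # leading axis = devices

# Because BatchNorm with axis_name calls lax.pmean internally, both init and
# apply run under pmap so that the 'batch' axis name is bound.
@functools.partial(jax.pmap, axis_name='batch')
def init_fn(rng, x):
  return SyncedFoo(filters=7, train=True).init(rng, x)

@functools.partial(jax.pmap, axis_name='batch')
def forward(variables, x):
  return SyncedFoo(filters=7, train=True).apply(variables, x, mutable=['batch_stats'])

rng = jax.random.PRNGKey(0)
variables = init_fn(jnp.stack([rng] * n_dev), x)  # same key on every device
y, new_batch_stats = forward(variables, x)
```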