Training NesT from scratch (on CIFAR10) #751

cjsg · 2021-07-12T14:20:12Z

Hi everyone,

I am trying to replicate the results of the NesT paper (https://arxiv.org/abs/2105.12723) using timm's implementation (see timm/models/nest.py), adapted to CIFAR10 (see details below). I tried to use the parameters of the original JAX code given here, which resulted in this .yaml parameter file.

Unfortunately, with these parameters, I only get around 89.5% test accuracy for a NesT-Tiny (S=1, depth=4), which is far below the 95-96% promised by the paper (Table 1 & Fig. 6).

Any ideas what may explain these differences?

(Sorry in advance for any trivial differences between the original JAX code and the current implementation/parameters that I have surely overlooked.
BTW: I would also be happy if you could find the right parameters for ImageNet-1k. I just thought that CIFAR10 would be easier to debug.)

Thanks!
CJ

Main changes to adapt timm's code to CIFAR-10:

in timm/data/dataset_factory.py: added the following two lines

elif name == 'cifar10':
    ds = torchvision.datasets.CIFAR10(root=root, train=is_training, download=False)

in timm/models/nest.py, added the following:

CIFAR10_DEFAULT_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_DEFAULT_STD = (0.2023, 0.1994, 0.2010)

def _cfg_cifar(url='', **kwargs):
    return {
        'url': url,
        'num_classes': 10, 'input_size': (3, 32, 32), 'pool_size': [4, 4],
        'crop_pct': .875, 'interpolation': 'bicubic', 'fixed_input_size': True,
        'mean': CIFAR10_DEFAULT_MEAN, 'std': CIFAR10_DEFAULT_STD,
        'first_conv': 'patch_embed.proj', 'classifier': 'head',
        **kwargs
    }

default_cfgs['nest_tiny_cifar'] = _cfg_cifar()

@register_model
def nest_tiny_cifar(pretrained=False, **kwargs):
    model_kwargs = dict(
        img_size=32, patch_size=4, num_levels=4, embed_dims=192, num_heads=3, depths=3,
        num_classes=10)
    model = _create_nest('nest_tiny_cifar', pretrained=pretrained, **model_kwargs)
    return model

Answered by alexander-soare

alexander-soare · 2021-07-12T17:15:24Z

@cjsg having had a glance at what you've done, here are some things that might get you a lot closer:

1. Probably most importantly, you need patch_size=1. It's a little confusing when comparing to the original implementation, as there patch_size refers to the size of a "block" (as the term is used in the paper) in units of "patches". To be clear, in terms of official -> timm it's:

   patch_size -> block_size as determined here
   init_patch_embed_size -> patch_size as set in the model_kwargs

   Some sanity checks to make sure we're on the same page: your image size is 32x32. You set num_levels to 4, which means the first hierarchical level has 8x8 "blocks", each with 4x4 pixels. With a patch size of 1x1, each block has a seq length of 16.

2. If 1) is not enough to get the accuracy up, you might try using MultiQueryAttention, which is not implemented in nest.py as it wasn't designed with CIFAR in mind. All the CIFAR models use it, and the official implementation is here. BUT I'm not sure this will actually help increase your accuracy. Maybe it will help avoid overfitting? Not sure.
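
The sanity check in point 1 can be reproduced with a few lines of arithmetic. The sketch below mirrors how the timm `Nest` constructor derives the block grid from `img_size`, `patch_size` and `num_levels`. The helper name `nest_geometry` is made up for illustration and is not part of timm.

```python
import math


def nest_geometry(img_size: int, patch_size: int, num_levels: int) -> dict:
    """Block layout of the first (finest) NesT level.

    Hypothetical helper for illustration only. It follows the arithmetic in
    timm's Nest.__init__, where each successive level has 4x fewer blocks.
    """
    # Number of blocks per level, finest level first, e.g. [64, 16, 4, 1]
    num_blocks = [4 ** i for i in reversed(range(num_levels))]
    patches_per_edge = img_size // patch_size
    blocks_per_edge = math.isqrt(num_blocks[0])
    assert patches_per_edge % blocks_per_edge == 0, "blocks must tile the patch grid"
    block_size = patches_per_edge // blocks_per_edge     # block edge, in patches
    seq_length = patches_per_edge ** 2 // num_blocks[0]  # tokens per block
    return {
        "num_blocks": num_blocks,
        "blocks_per_edge": blocks_per_edge,
        "block_size": block_size,
        "seq_length": seq_length,
    }


# patch_size=1: 8x8 blocks of 4x4 patches, i.e. 16 tokens per block
print(nest_geometry(img_size=32, patch_size=1, num_levels=4))
# patch_size=4 (the setting from the question): a single token per block
print(nest_geometry(img_size=32, patch_size=4, num_levels=4))
```

With patch_size=4 and num_levels=4, every first-level block holds a single token, so the local self-attention has nothing to attend over. That would be consistent with patch_size=1 being the most important change.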

Hope that helps, and curious to know how it goes.

7 replies

alexander-soare Jul 14, 2021

@cjsg Nice work! Regarding your very last point about num params, for this sort of stuff I'd consult the jax code directly. If you look at table A.1 in the paper the network hyperparameters for nest_tiny_cifar are actually different than the ones in the code. In fact, I'd go as far as saying you can't be too tied to the top-1 acc in the paper for this one. Better to just run the jax code.

BTW to get close to the 6.2M params I follow the setup in table A.1, and use normal multi-head attention (the paper never mentions multi-query attention, it's just in the code).

model_kwargs = dict(
    img_size=32, patch_size=1, num_levels=3, embed_dims=192, num_heads=3, depths=4,
    num_classes=10, qkv_bias=True)

This results in 6_264_202 trainable params.
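
As a cross-check, the same number can be reached by counting parameters by hand. The sketch below assumes the stock timm NesT layout: a conv patch embedding, a positional embedding at every level, a 3x3 conv plus LayerNorm between levels, and standard multi-head attention with an MLP ratio of 4. `count_nest_params` is a hypothetical helper, not a timm function.

```python
def count_nest_params(img_size=32, patch_size=1, num_levels=3, dim=192,
                      depth=4, num_classes=10, in_chans=3, mlp_ratio=4,
                      qkv_bias=True):
    """Hand count of trainable parameters for a constant-width NesT.

    Illustrative only: assumes the same embed dim and depth at every level
    and the layer layout described in the text above.
    """
    def linear(n_in, n_out, bias=True):
        return n_in * n_out + (n_out if bias else 0)

    layer_norm = 2 * dim  # weight + bias

    # Patch embedding: Conv2d(in_chans, dim, kernel_size=patch_size) with bias
    total = in_chans * dim * patch_size ** 2 + dim

    num_blocks = [4 ** i for i in reversed(range(num_levels))]
    seq_length = (img_size // patch_size) ** 2 // num_blocks[0]

    # One transformer layer: norm1 + attention (qkv, proj) + norm2 + MLP (fc1, fc2)
    hidden = dim * mlp_ratio
    layer = (layer_norm
             + linear(dim, 3 * dim, qkv_bias) + linear(dim, dim)
             + layer_norm
             + linear(dim, hidden) + linear(hidden, dim))

    for level, blocks in enumerate(num_blocks):
        total += blocks * seq_length * dim    # positional embedding
        if level > 0:
            total += dim * dim * 3 * 3 + dim  # 3x3 conv aggregation (with bias)
            total += layer_norm               # LayerNorm after the conv
        total += depth * layer

    total += layer_norm                       # final norm
    total += linear(dim, num_classes)         # classifier head
    return total


print(f"{count_nest_params():_}")  # 6_264_202
```

Most of the budget sits in the 12 transformer layers (444_864 parameters each). The positional embeddings and the two conv aggregation steps account for most of the remainder.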

alexander-soare Jul 14, 2021

Also note that even table A.1 might have issues. It certainly does for the ImageNet models. See this issue google-research/nested-transformer#2

cjsg Jul 14, 2021
Author

Thanks for testing the number of parameters with 3 levels (instead of 4 for me)! That would have been my next thing to try. I just assumed that in Table 1 they'd used the setting with 4 levels, since that's the one that worked best in all experiments (that use S=1) of Fig.6.
Anyway, you are certainly right that I should use/trust the jax repo more than the paper for the hyper-parameters/settings...

BTW: I managed to get 95.13% accuracy by essentially running the command that gave 94.55% twice (using the result from the first run as a "warm-start" for the second run). That starts to get close to the paper's claims.

Thanks again for all your help, and for converting this architecture to PyTorch!

charchit7 May 5, 2022

Hi @cjsg can you please share the code with us for CIFAR-10 :)

cjsg Jun 9, 2022
Author

Sorry for my late answer.
Here it is. It comes with no guarantees whatsoever, since I didn't even take the time to clean / review it before uploading it here. It is a bare copy-paste of the code in whatever state I left it almost a year ago.
Hope that helps.

""" Nested Transformer (NesT) in PyTorch

A PyTorch implementation of Aggregating Nested Transformers as described in:

'Aggregating Nested Transformers'
    - https://arxiv.org/abs/2105.12723

The official Jax code is released and available at https://github.com/google-research/nested-transformer. The weights
have been converted with convert/convert_nest_flax.py

Based on Alexander Soare's implementation of NesT in the timm package.
Carl-Johann Simon-Gabriel added some changes to include CIFAR10 specific architectures
and to play around with MultiQueryAttention (instead of the usual MultiHeadAttention).
See https://github.com/rwightman/pytorch-image-models/discussions/751 .

Acknowledgments:
* The paper authors for sharing their research, code, and model weights
* Ross Wightman's existing code off which I based this
* Alexander Soare for his answers on github

Copyrights and license as in the original timm code.
"""

import collections.abc
import logging
import math
from functools import partial

import torch
import torch.nn.functional as F
from torch import nn

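# NOTE: the CIFAR10_* constants are not defined in this file; this import assumes they were added to
# timm.data (the first post in this thread defines them directly in nest.py instead).
from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD, CIFAR10_DEFAULT_MEAN, CIFAR10_DEFAULT_STD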
from .helpers import build_model_with_cfg, named_apply
from .layers import PatchEmbed, Mlp, DropPath, create_classifier, trunc_normal_
from .layers import create_conv2d, create_pool2d, to_ntuple
from .registry import register_model

_logger = logging.getLogger(__name__)
_logger.setLevel(logging.WARNING)


def _cfg(url='', **kwargs):
    return {
        'url': url,
        'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': [14, 14],
        'crop_pct': .875, 'interpolation': 'bicubic', 'fixed_input_size': True,
        'mean': IMAGENET_DEFAULT_MEAN, 'std': IMAGENET_DEFAULT_STD,
        'first_conv': 'patch_embed.proj', 'classifier': 'head',
        **kwargs
    }

def _cfg_cifar(url='', **kwargs):
    return {
        'url': url,
        'num_classes': 10, 'input_size': (3, 32, 32), 'pool_size': [4, 4],
        'crop_pct': .875, 'interpolation': 'bicubic', 'fixed_input_size': True,
        'mean': CIFAR10_DEFAULT_MEAN, 'std': CIFAR10_DEFAULT_STD,
        'first_conv': 'patch_embed.proj', 'classifier': 'head',
        **kwargs
    }


default_cfgs = {
    # (weights from official Google JAX impl)
    'nest_mini_cifar': _cfg_cifar(),
    'nest_tiny_cifar': _cfg_cifar(),
    'jx_nest_tiny_cifar': _cfg_cifar(),
    'nest_tiny_cifar_sharekv': _cfg_cifar(),
    'jx_nest_tiny_cifar_sharekv': _cfg_cifar(),
    'nest_base': _cfg(),
    'nest_small': _cfg(),
    'nest_mini': _cfg(),
    'nest_tiny': _cfg(),
    'jx_nest_base': _cfg(
        url='https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/jx_nest_base-8bc41011.pth'),
    'jx_nest_small': _cfg(
        url='https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/jx_nest_small-422eaded.pth'),
    'jx_nest_tiny': _cfg(
        url='https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/jx_nest_tiny-e3428fb9.pth'),
}


from einops import rearrange as ra

class MultiQueryAttention(nn.Module):
    """
    Same as MultiHeadAttention except that the weights of kv are shared across the multiple heads
    of self-attention. (Queries q remain unshared.)
    See Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150
    """
    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.kv = nn.Linear(dim, 2*self.head_dim, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        """
        x is shape: B (batch_size), T (image blocks), N (seq length per image block), C (embed dim)
        """ 
        B, T, N, C = x.shape  # C = dim  ; C // num_heads = head_dim ; H := num_heads
        # q keeps one projection per head; k and v use a single shared projection that broadcasts over the H heads
        q = self.q(x).reshape(B, T, N, self.num_heads, C // self.num_heads).permute(0, 3, 1, 2, 4)  # B H T N head_dim
        kv = self.kv(x).reshape(B, T, N, 2, 1, self.head_dim).permute(3, 0, 4, 1, 2, 5)  # 2 B 1 T N head_dim
        k, v = kv[0], kv[1]   # make torchscript happy (cannot use tensor as tuple)

        attn = (q @ k.transpose(-2, -1)) * self.scale # (B, H, T, N, N)
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        # (B, H, T, N, C'), permute -> (B, T, N, C', H)
        x = (attn @ v).permute(0, 2, 3, 4, 1).reshape(B, T, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x  # (B, T, N, C)


class MultiHeadAttention(nn.Module):
    """
    This is much like `.vision_transformer.Attention` but uses *localised* self attention by accepting an input with
     an extra "image block" dim
    """
    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, 3*dim, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        """
        x is shape: B (batch_size), T (image blocks), N (seq length per image block), C (embed dim)
        """ 
        B, T, N, C = x.shape
        # result of next line is (qkv, B, num (H)eads, T, N, (C')hannels per head)
        qkv = self.qkv(x).reshape(B, T, N, 3, self.num_heads, C // self.num_heads).permute(3, 0, 4, 1, 2, 5)
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

        attn = (q @ k.transpose(-2, -1)) * self.scale # (B, H, T, N, N)
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        # (B, H, T, N, C'), permute -> (B, T, N, C', H)
        x = (attn @ v).permute(0, 2, 3, 4, 1).reshape(B, T, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x  # (B, T, N, C)


class TransformerLayer(nn.Module):
    """
    This is much like `.vision_transformer.Block` but:
        - Called TransformerLayer here to allow for "block" as defined in the paper ("non-overlapping image blocks")
        - Uses modified MultiHeadAttention layer that handles the "block" dimension
    """
    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0., drop_path=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm, attn_layer=MultiHeadAttention):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = attn_layer(dim, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def forward(self, x):
        """
        expects x of dim B T N C
        """
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x


class ConvPool(nn.Module):
    def __init__(self, in_channels, out_channels, norm_layer, pad_type='', original=True):
        # TODO: changed here to original=False
        super().__init__()
        self.conv = create_conv2d(in_channels, out_channels, kernel_size=3, padding=pad_type, bias=True)
        if original:
            self.norm = norm_layer(out_channels)
            self.pool = create_pool2d('max', kernel_size=3, stride=2, padding=pad_type)
        else:
            self.norm = None
            self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)  # BEWARE: previous version used kernel_size=4
            # self.pool = nn.Sequential(
            #     nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            #     nn.MaxPool2d(kernel_size=2, stride=2, padding=0))

    def forward(self, x):
        """
        x is expected to have shape (B, C, H, W)
        """
        assert x.shape[-2] % 2 == 0, 'BlockAggregation requires even input spatial dims'
        assert x.shape[-1] % 2 == 0, 'BlockAggregation requires even input spatial dims'
        x = self.conv(x)
        if self.norm is not None:  # Layer norm done over channel dim only
            x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = self.pool(x)
        return x  # (B, C, H//2, W//2)


def blockify(x, block_size: int):
    """image to blocks
    Args:
        x (Tensor): with shape (B, H, W, C)
        block_size (int): edge length of a single square block in units of H, W
    """
    B, H, W, C  = x.shape
    assert H % block_size == 0, '`block_size` must divide input height evenly'
    assert W % block_size == 0, '`block_size` must divide input width evenly'
    grid_height = H // block_size
    grid_width = W // block_size
    x = x.reshape(B, grid_height, block_size, grid_width, block_size, C)
    x = x.transpose(2, 3).reshape(B, grid_height * grid_width, -1, C)
    return x  # (B, T, N, C)


def deblockify(x, block_size: int):
    """blocks to image
    Args:
        x (Tensor): with shape (B, T, N, C) where T is number of blocks and N is sequence size per block
        block_size (int): edge length of a single square block in units of desired H, W
    """
    B, T, _, C = x.shape
    grid_size = int(math.sqrt(T))
    height = width = grid_size * block_size
    x = x.reshape(B, grid_size, grid_size, block_size, block_size, C)
    x = x.transpose(2, 3).reshape(B, height, width, C)
    return x  # (B, H, W, C)


class NestLevel(nn.Module):
    """ Single hierarchical level of a Nested Transformer
    """
    def __init__(
            self, num_blocks, block_size, seq_length, num_heads, depth, embed_dim, prev_embed_dim=None,
            mlp_ratio=4., qkv_bias=True, drop_rate=0., attn_drop_rate=0., drop_path_rates=[],
            norm_layer=None, act_layer=None, attn_layer=None, pad_type='', original=True):
        super().__init__()
        self.block_size = block_size
        if original:
            self.pos_embed = nn.Parameter(torch.zeros(1, num_blocks, seq_length, embed_dim))
            self.has_pos_embed = True
            if prev_embed_dim is not None:
                self.pool = ConvPool(prev_embed_dim, embed_dim, norm_layer=norm_layer, pad_type=pad_type, original=original)
            else:
                self.pool = nn.Identity()
        else:
            if prev_embed_dim is not None:
                self.pool = ConvPool(prev_embed_dim, embed_dim, norm_layer=norm_layer, pad_type=pad_type, original=original)
                self.has_pos_embed = False
            else:
                self.pool = nn.Identity()
                self.pos_embed = nn.Parameter(torch.zeros(1, num_blocks, seq_length, embed_dim))
                self.has_pos_embed = True

        # Transformer encoder
        if len(drop_path_rates):
            assert len(drop_path_rates) == depth, 'Must provide as many drop path rates as there are transformer layers'
        self.transformer_encoder = nn.Sequential(*[
            TransformerLayer(
                dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias,
                drop=drop_rate, attn_drop=attn_drop_rate, drop_path=drop_path_rates[i],
                norm_layer=norm_layer, act_layer=act_layer, attn_layer=attn_layer)
            for i in range(depth)])

    def forward(self, x):
        """
        expects x as (B, C, H, W)
        """
        x = self.pool(x)
        x = x.permute(0, 2, 3, 1)  # (B, H', W', C), switch to channels last for transformer
        x = blockify(x, self.block_size)  # (B, T, N, C')
        if self.has_pos_embed:
            x = x + self.pos_embed
        x = self.transformer_encoder(x)  # (B, T, N, C')
        x = deblockify(x, self.block_size)  # (B, H', W', C')
        # Channel-first for block aggregation, and generally to replicate convnet feature map at each stage
        return x.permute(0, 3, 1, 2)  # (B, C, H', W')


class Nest(nn.Module):
    """ Nested Transformer (NesT)

    A PyTorch impl of : `Aggregating Nested Transformers`
        - https://arxiv.org/abs/2105.12723
    """
    def __init__(self, img_size=224, in_chans=3, patch_size=4, num_levels=3, embed_dims=(128, 256, 512),
                 num_heads=(4, 8, 16), depths=(2, 2, 20), num_classes=1000, mlp_ratio=4., qkv_bias=True,
                 drop_rate=0., attn_drop_rate=0., drop_path_rate=0.5, norm_layer=None, act_layer=None,
                 attn_layer=None, pad_type='', weight_init='', global_pool='avg', original=True):
        """
        Args:
            img_size (int, tuple): input image size
            in_chans (int): number of input channels
            patch_size (int): patch size (S in paper)
            num_levels (int): number of block hierarchies (T_d in the paper)
            embed_dims (int, tuple): embedding dimensions of each level
            num_heads (int, tuple): number of attention heads for each level
            depths (int, tuple): number of transformer layers for each level
            num_classes (int): number of classes for classification head
            mlp_ratio (int): ratio of mlp hidden dim to embedding dim for MLP of transformer layers
            qkv_bias (bool): enable bias for qkv if True
            drop_rate (float): dropout rate for MLP of transformer layers, MSA final projection layer, and classifier
            attn_drop_rate (float): attention dropout rate
            drop_path_rate (float): stochastic depth rate
            norm_layer: (nn.Module): normalization layer for transformer layers
            act_layer: (nn.Module): activation layer in MLP of transformer layers
            pad_type: str: Type of padding to use '' for PyTorch symmetric, 'same' for TF SAME
            weight_init: (str): weight init scheme
            global_pool: (str): type of pooling operation to apply to final feature map
            attn_layer: (nn.Module): attention layer class, defaults to MultiHeadAttention
            original (bool): use original NesT implementation if True; otherwise use CJ's improvements

        Notes:
            - Default values follow NesT-B from the original Jax code.
            - `embed_dims`, `num_heads`, `depths` should be ints or tuples with length
              `num_levels`, with values going from input to output layers.
            - For those following the paper, Table A1 may have errors!
                - https://github.com/google-research/nested-transformer/issues/2
        """
        super().__init__()

        for param_name in ['embed_dims', 'num_heads', 'depths']:
            param_value = locals()[param_name]
            if isinstance(param_value, collections.abc.Sequence):
                assert len(param_value) == num_levels, f'Require `len({param_name}) == num_levels`'

        embed_dims = to_ntuple(num_levels)(embed_dims)
        num_heads = to_ntuple(num_levels)(num_heads)
        depths = to_ntuple(num_levels)(depths)
        self.num_classes = num_classes
        self.num_features = embed_dims[-1]
        self.feature_info = []
        norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
        act_layer = act_layer or nn.GELU
        # attn_layer = MultiQueryAttention if share_kv else MultiHeadAttention
        attn_layer = attn_layer or MultiHeadAttention
        self.drop_rate = drop_rate
        self.num_levels = num_levels
        if isinstance(img_size, collections.abc.Sequence):
            assert img_size[0] == img_size[1], 'Model only handles square inputs'
            img_size = img_size[0]
        assert img_size % patch_size == 0, '`patch_size` must divide `img_size` evenly'
        self.patch_size = patch_size

        # Number of blocks at each level
        self.num_blocks = (4 ** torch.arange(num_levels)).flip(0).tolist()
        assert (img_size // patch_size) % math.sqrt(self.num_blocks[0]) == 0, \
            'First level blocks don\'t fit evenly. Check `img_size`, `patch_size`, and `num_levels`'

        # Block edge size in units of patches
        # Hint: (img_size // patch_size) gives number of patches along edge of image. sqrt(self.num_blocks[0]) is the
        #  number of blocks along edge of image
        self.block_size = int((img_size // patch_size) // math.sqrt(self.num_blocks[0]))

        # Patch embedding
        self.patch_embed = PatchEmbed(
            img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dims[0], flatten=False)
        self.num_patches = self.patch_embed.num_patches
        self.seq_length = self.num_patches // self.num_blocks[0]

        # Build up each hierarchical level
        levels = []
        dp_rates = [x.tolist() for x in torch.linspace(0, drop_path_rate, sum(depths)).split(depths)]
        prev_dim = None
        curr_stride = 4
        for i in range(len(self.num_blocks)):
            dim = embed_dims[i]
            levels.append(NestLevel(
                self.num_blocks[i], self.block_size, self.seq_length, num_heads[i], depths[i], dim, prev_dim,
                mlp_ratio, qkv_bias, drop_rate, attn_drop_rate, dp_rates[i], norm_layer, act_layer,
                attn_layer, pad_type=pad_type, original=original))
            self.feature_info += [dict(num_chs=dim, reduction=curr_stride, module=f'levels.{i}')]
            prev_dim = dim
            curr_stride *= 2
        self.levels = nn.Sequential(*levels)

        # Final normalization layer
        self.norm = norm_layer(embed_dims[-1])

        # Classifier
        self.global_pool, self.head = create_classifier(self.num_features, self.num_classes, pool_type=global_pool)

        self.init_weights(weight_init)

    def init_weights(self, mode=''):
        assert mode in ('nlhb', '')
        head_bias = -math.log(self.num_classes) if 'nlhb' in mode else 0.
        for level in self.levels:  # BEWARE: no init in a previous version
            if hasattr(level, 'pos_embed'):
                trunc_normal_(level.pos_embed, std=.02, a=-2, b=2)
        named_apply(partial(_init_nest_weights, head_bias=head_bias), self)

    @torch.jit.ignore
    def no_weight_decay(self):
        # Parameter names are prefixed with `levels.` (the attribute is self.levels)
        return {f'levels.{i}.pos_embed' for i in range(len(self.levels))}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool='avg'):
        self.num_classes = num_classes
        self.global_pool, self.head = create_classifier(
            self.num_features, self.num_classes, pool_type=global_pool)

    def forward_features(self, x):
        """ x shape (B, C, H, W)
        """
        x = self.patch_embed(x)
        x = self.levels(x)
        # Layer norm done over channel dim only (to NHWC and back)
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x

    def forward(self, x):
        """ x shape (B, C, H, W)
        """
        x = self.forward_features(x)
        x = self.global_pool(x)
        if self.drop_rate > 0.:
            x = F.dropout(x, p=self.drop_rate, training=self.training)
        return self.head(x)


def _init_nest_weights(module: nn.Module, name: str = '', head_bias: float = 0.):
    """ NesT weight initialization
    Can replicate Jax implementation. Otherwise follows vision_transformer.py
    """
    if isinstance(module, nn.Linear):
        if name.startswith('head'):
            trunc_normal_(module.weight, std=.02, a=-2, b=2)
            nn.init.constant_(module.bias, head_bias)
        else:
            trunc_normal_(module.weight, std=.02, a=-2, b=2)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        trunc_normal_(module.weight, std=.02, a=-2, b=2)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, (nn.LayerNorm, nn.GroupNorm, nn.BatchNorm2d)):
        nn.init.zeros_(module.bias)
        nn.init.ones_(module.weight)


def resize_pos_embed(posemb, posemb_new):
    """
    Rescale the grid of position embeddings when loading from state_dict
    Expected shape of position embeddings is (1, T, N, C), and considers only square images
    """
    _logger.info('Resized position embedding: %s to %s', posemb.shape, posemb_new.shape)
    seq_length_old = posemb.shape[2]
    num_blocks_new, seq_length_new = posemb_new.shape[1:3]
    size_new = int(math.sqrt(num_blocks_new*seq_length_new))
    # First change to (1, C, H, W)
    posemb = deblockify(posemb, int(math.sqrt(seq_length_old))).permute(0, 3, 1, 2)
    posemb = F.interpolate(posemb, size=[size_new, size_new], mode='bicubic', align_corners=False)
    # Now change to new (1, T, N, C)
    posemb = blockify(posemb.permute(0, 2, 3, 1), int(math.sqrt(seq_length_new)))
    return posemb


def checkpoint_filter_fn(state_dict, model):
    """ resize positional embeddings of pretrained weights """
    pos_embed_keys = [k for k in state_dict.keys() if k.startswith('pos_embed_')]
    for k in pos_embed_keys:
        if state_dict[k].shape != getattr(model, k).shape:
            state_dict[k] = resize_pos_embed(state_dict[k], getattr(model, k))
    return state_dict


def _create_nest(variant, pretrained=False, default_cfg=None, **kwargs):
    default_cfg = default_cfg or default_cfgs[variant]
    model = build_model_with_cfg(
        Nest, variant, pretrained,
        default_cfg=default_cfg,
        feature_cfg=dict(out_indices=(0, 1, 2), flatten_sequential=True),
        pretrained_filter_fn=checkpoint_filter_fn,
        **kwargs)

    return model


@register_model
def nest_mini_cifar(pretrained=False, **kwargs):
    model_kwargs = dict(
        img_size=32, patch_size=1, num_levels=3, embed_dims=192, num_heads=3, depths=1,
        num_classes=10, original=False, **kwargs)
    model = _create_nest('nest_mini_cifar', pretrained=pretrained, **model_kwargs)
    return model

@register_model
def nest_mini2_cifar(pretrained=False, **kwargs):
    model_kwargs = dict(
        img_size=32, patch_size=1, num_levels=3, embed_dims=192, num_heads=3, depths=2,
        num_classes=10, original=False, **kwargs)
    model = _create_nest('nest_mini_cifar', pretrained=pretrained, **model_kwargs)
    return model

@register_model
def nest_tiny_cifar(pretrained=False, **kwargs):
    model_kwargs = dict(
        img_size=32, patch_size=1, num_levels=3, embed_dims=192, num_heads=3, depths=4,
        num_classes=10, original=False, **kwargs)
    model = _create_nest('nest_tiny_cifar', pretrained=pretrained, **model_kwargs)
    return model

@register_model
def jx_nest_tiny_cifar(pretrained=False, **kwargs):
    kwargs['pad_type'] = 'same'
    model_kwargs = dict(
        img_size=32, patch_size=1, num_levels=4, embed_dims=192, num_heads=3, depths=3,
        num_classes=10, original=False, **kwargs)
    model = _create_nest('jx_nest_tiny_cifar', pretrained=pretrained, **model_kwargs)
    return model

@register_model
def nest_tiny_cifar_sharekv(pretrained=False, **kwargs):
    model_kwargs = dict(
        img_size=32, patch_size=1, num_levels=4, embed_dims=192, num_heads=3, depths=3,
        num_classes=10, attn_layer=MultiQueryAttention, original=False, **kwargs)
    model = _create_nest('nest_tiny_cifar_sharekv', pretrained=pretrained, **model_kwargs)
    return model

@register_model
def jx_nest_tiny_cifar_sharekv(pretrained=False, **kwargs):
    kwargs['pad_type'] = 'same'
    model_kwargs = dict(
        img_size=32, patch_size=1, num_levels=4, embed_dims=192, num_heads=3, depths=3,
        num_classes=10, attn_layer=MultiQueryAttention, original=False, **kwargs)
    model = _create_nest('jx_nest_tiny_cifar_sharekv', pretrained=pretrained, **model_kwargs)
    return model



@register_model
def nest_base(pretrained=False, **kwargs):
    """ Nest-B @ 224x224
    """
    model_kwargs = dict(
        embed_dims=(128, 256, 512), num_heads=(4, 8, 16), depths=(2, 2, 20), **kwargs)
    model = _create_nest('nest_base', pretrained=pretrained, **model_kwargs)
    return model


@register_model
def nest_small(pretrained=False, **kwargs):
    """ Nest-S @ 224x224
    """
    model_kwargs = dict(embed_dims=(96, 192, 384), num_heads=(3, 6, 12), depths=(2, 2, 20), **kwargs)
    model = _create_nest('nest_small', pretrained=pretrained, **model_kwargs)
    return model

@register_model
def nest_mini(pretrained=False, **kwargs):
    """ NesT mini variant (embed_dims=192, num_heads=4, depths=2) @ 224x224
    """
    model_kwargs = dict(embed_dims=192, num_heads=4, depths=2, **kwargs)
    model = _create_nest('nest_mini', pretrained=pretrained, **model_kwargs)
    return model

@register_model
def nest_mini_modified(pretrained=False, **kwargs):
    """ NesT mini variant @ 224x224, built with original=False
    """
    model_kwargs = dict(embed_dims=192, num_heads=4, depths=2, **kwargs)
    model = _create_nest('nest_mini', pretrained=pretrained, original=False, **model_kwargs)
    return model

@register_model
def nest_tiny(pretrained=False, **kwargs):
    """ Nest-T @ 224x224
    """
    model_kwargs = dict(embed_dims=(96, 192, 384), num_heads=(3, 6, 12), depths=(2, 2, 8), **kwargs)
    model = _create_nest('nest_tiny', pretrained=pretrained, **model_kwargs)
    return model

@register_model
def nest_tiny_modified(pretrained=False, **kwargs):
    """ Nest-T @ 224x224, built with original=False
    """
    model_kwargs = dict(embed_dims=(96, 192, 384), num_heads=(3, 6, 12), depths=(2, 2, 8), **kwargs)
    model = _create_nest('nest_tiny', pretrained=pretrained, original=False, **model_kwargs)
    return model


@register_model
def jx_nest_base(pretrained=False, **kwargs):
    """ Nest-B @ 224x224, Pretrained weights converted from official Jax impl.
    """
    kwargs['pad_type'] = 'same'
    model_kwargs = dict(embed_dims=(128, 256, 512), num_heads=(4, 8, 16), depths=(2, 2, 20), **kwargs)
    model = _create_nest('jx_nest_base', pretrained=pretrained, **model_kwargs)
    return model


@register_model
def jx_nest_small(pretrained=False, **kwargs):
    """ Nest-S @ 224x224, Pretrained weights converted from official Jax impl.
    """
    kwargs['pad_type'] = 'same'
    model_kwargs = dict(embed_dims=(96, 192, 384), num_heads=(3, 6, 12), depths=(2, 2, 20), **kwargs)
    model = _create_nest('jx_nest_small', pretrained=pretrained, **model_kwargs)
    return model


@register_model
def jx_nest_tiny(pretrained=False, **kwargs):
    """ Nest-T @ 224x224, Pretrained weights converted from official Jax impl.
    """
    kwargs['pad_type'] = 'same'
    model_kwargs = dict(embed_dims=(96, 192, 384), num_heads=(3, 6, 12), depths=(2, 2, 8), **kwargs)
    model = _create_nest('jx_nest_tiny', pretrained=pretrained, **model_kwargs)
    return model
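
For readers tracing shapes through `blockify` and `deblockify` above, here is a small pure-Python sketch of the same index mapping on a single-channel grid. It uses nested lists instead of tensors, and the `_2d` function names are made up for this illustration.

```python
import math


def blockify_2d(grid, block_size):
    """Split an H x W grid into blocks (row-major), each flattened row-major."""
    h, w = len(grid), len(grid[0])
    assert h % block_size == 0 and w % block_size == 0
    return [
        [grid[by + dy][bx + dx] for dy in range(block_size) for dx in range(block_size)]
        for by in range(0, h, block_size)
        for bx in range(0, w, block_size)
    ]


def deblockify_2d(blocks, block_size):
    """Inverse of blockify_2d, assuming a square grid of blocks."""
    grid_size = math.isqrt(len(blocks))  # blocks per edge
    side = grid_size * block_size
    grid = [[None] * side for _ in range(side)]
    for t, block in enumerate(blocks):
        by, bx = divmod(t, grid_size)        # block row / column
        for n, value in enumerate(block):
            dy, dx = divmod(n, block_size)   # position inside the block
            grid[by * block_size + dy][bx * block_size + dx] = value
    return grid


image = [[4 * r + c for c in range(4)] for r in range(4)]  # 4x4 grid holding 0..15
blocks = blockify_2d(image, block_size=2)
print(blocks)  # [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
assert deblockify_2d(blocks, block_size=2) == image
```

Each inner list corresponds to one entry along the T axis of the (B, T, N, C) layout, and its length is the per-block sequence length N.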
