# A memory-balanced and communication-efficient FullyConnected layer: a model-parallel implementation in PyTorch

## Why do we need model parallelism? Why not just use DataParallel?

In face recognition and re-id (person re-identification), the number of labels in some private datasets can exceed 1 million, 10 million, or even 100 million. The parameters of the fully connected layer alone can occupy the whole GPU memory, so we can only use a small batch size, which leads to slow training and poor evaluation performance.
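To get a feel for the scale, here is a back-of-the-envelope estimate (the numbers are illustrative, not taken from the repository):

```python
# Rough fp32 memory estimate for a plain fully connected classifier.
in_dim, num_classes, batch = 512, 10_000_000, 256

weight = in_dim * num_classes * 4      # ~20.5 GB for the weight matrix alone
gradient = weight                      # same size again for .grad
logits = batch * num_classes * 4       # ~10.2 GB of activations per step
print(f"{(weight + gradient + logits) / 1e9:.1f} GB, before any optimizer state")
```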

## A fully connected layer with model parallelism? It's simple!

```python
import torch
import torch.nn as nn


class FullyConnected(nn.Module):
    """Fully connected classifier whose weight matrix is split column-wise across GPUs."""

    def __init__(self, in_dim, out_dim, num_gpu, model_parallel=False):
        super(FullyConnected, self).__init__()
        self.num_gpu = num_gpu
        self.model_parallel = model_parallel
        if model_parallel:
            # Split the out_dim classes as evenly as possible across num_gpu GPUs.
            self.fc_chunks = nn.ModuleList()
            for i in range(num_gpu):
                _class_num = out_dim // num_gpu
                if i < (out_dim % num_gpu):
                    _class_num += 1
                self.fc_chunks.append(
                    nn.Linear(in_dim, _class_num, bias=False).cuda(i)
                )
        else:
            self.classifier = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):
        if self.model_parallel:
            x_list = []
            for i in range(self.num_gpu):
                # Copy the features to GPU i and compute the local slice of the logits.
                _x = self.fc_chunks[i](x.cuda(i))
                # Gather every slice back on GPU 0 so they can be concatenated.
                x_list.append(_x.cuda(0))
            x = torch.cat(x_list, dim=1)
            return x
        else:
            return self.classifier(x)
```
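For example, on a machine with 4 GPUs, usage might look like this (the feature dimension, batch size, and label count below are illustrative):

```python
import torch
import torch.nn.functional as F

num_classes = 1_000_000
fc = FullyConnected(in_dim=512, out_dim=num_classes, num_gpu=4, model_parallel=True)

features = torch.randn(256, 512).cuda(0)                # backbone output for a batch of 256
labels = torch.randint(0, num_classes, (256,)).cuda(0)

logits = fc(features)                                   # (256, 1_000_000), gathered on GPU 0
loss = F.cross_entropy(logits, labels)                  # also computed on GPU 0
```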

A similar implementation can also be found here.

This implementation only solves part of the problem, and it introduces a new issue: imbalanced GPU memory usage across GPUs. Because all the results are concatenated on GPU 0 and the loss is also computed on GPU 0, the memory usage and computation load on GPU 0 are much higher than on the other GPUs, so we still cannot use a large batch size.

## Does this repository solve the problem?

Yes, and it extends to more scenarios, such as margin losses, mixed precision training, and distributed training.

Some advantages:

- GPU memory usage and computation load are balanced across all GPUs, so we can use a large batch size and life becomes easier :-)
- Supports most of the margin losses used in face and re-id, such as ArcFace, SphereFace, CosFace, AM-Softmax, and so on
- Training with model parallelism does not affect your evaluation results
- Sometimes speeds up training, thanks to the lower communication cost of the optimized CrossEntropyLoss (see the sketch after this list)
- Supports mixed precision training and distributed training
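The key idea behind such an optimized CrossEntropyLoss is that each GPU only ever needs its own logit shard: the GPUs exchange per-sample scalars (the row max and the sum of exponentials) instead of the full `(batch, num_classes)` logit matrix. Below is a minimal sketch of that idea, not the repository's actual code; it assumes it receives the per-GPU chunks from the forward pass before any concatenation, and the function name and `class_offsets` argument are made up for illustration.

```python
import torch


def model_parallel_cross_entropy(logit_chunks, labels, class_offsets):
    """Sketch of a model-parallel softmax cross-entropy.

    logit_chunks  : list of (batch, classes_i) tensors, one per GPU
    labels        : (batch,) LongTensor of global class indices, on GPU 0
    class_offsets : global index of the first class held by each chunk
    """
    batch = labels.size(0)

    # 1) Numerical stability: per-GPU row max, reduced to a global max on GPU 0.
    #    Only (batch,)-sized tensors cross device boundaries here.
    local_max = [chunk.max(dim=1).values for chunk in logit_chunks]
    global_max = torch.stack([m.cuda(0) for m in local_max]).max(dim=0).values

    # 2) Per-GPU sum of exponentials, again reduced to a (batch,) tensor on GPU 0.
    local_sum = [
        (chunk - global_max.to(chunk.device).unsqueeze(1)).exp().sum(dim=1)
        for chunk in logit_chunks
    ]
    global_sum = torch.stack([s.cuda(0) for s in local_sum]).sum(dim=0)

    # 3) Fetch the target-class logit from whichever chunk owns that class.
    target_logit = torch.zeros(batch, device="cuda:0")
    for chunk, offset in zip(logit_chunks, class_offsets):
        mask = (labels >= offset) & (labels < offset + chunk.size(1))
        if mask.any():
            rows = mask.nonzero(as_tuple=True)[0]
            cols = labels[rows] - offset
            target_logit[rows] = chunk[rows.to(chunk.device), cols.to(chunk.device)].cuda(0)

    # -log softmax(target) = log(sum_j exp(x_j - max)) + max - x_target
    return (global_sum.log() + global_max - target_logit).mean()
```

The full logit matrix is never gathered on a single GPU, which is what keeps the memory and compute balanced.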

## How can I use this?

First, make sure you really need model parallelism:

- Does the number of labels in your dataset exceed 1 million?
- Is the last layer of your model a fully connected layer, and do you use CrossEntropyLoss?
- Do you have enough GPUs (at least 4~8)?

If the answer to all of the above questions is yes, you can consider using model parallelism. However, since it requires hacking into the model and the optimizer, you will need to migrate the code into your own repository yourself; a rough sketch of the wiring follows.
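As an illustration of what that wiring involves (the `backbone`, the shapes, and the hyperparameters below are placeholders, not from the repository): a single standard optimizer can hold the backbone parameters and the per-GPU chunk parameters together, because each parameter's gradient and optimizer state stay on that parameter's own device.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A toy backbone standing in for your real feature extractor (placeholder).
backbone = nn.Sequential(nn.Linear(3 * 112 * 112, 512), nn.ReLU()).cuda(0)
fc = FullyConnected(in_dim=512, out_dim=1_000_000, num_gpu=4, model_parallel=True)

# One optimizer holds the backbone and every per-GPU chunk; the gradients and
# momentum buffers of each chunk stay on that chunk's own GPU.
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(fc.parameters()),
    lr=0.1, momentum=0.9, weight_decay=5e-4,
)

images = torch.randn(128, 3 * 112 * 112).cuda(0)
labels = torch.randint(0, 1_000_000, (128,)).cuda(0)

optimizer.zero_grad()
loss = F.cross_entropy(fc(backbone(images)), labels)
loss.backward()
optimizer.step()
```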

## What about other deep learning frameworks?

The principle is the same. Other frameworks such as MXNet have better built-in support (kvstore) for distributed training.

## Chinese blogs