lmdeploy support parallel embedding #4192
Conversation
Force-pushed 26666f5 to 2bcced7 (compare)
Please fix the lint issue. You can set up the …

It's done. And now deepseek_v2 will use ParallelEmbedding instead of nn.Embedding.
Pull request overview
This PR introduces tensor parallelism (TP) support for embedding layers in lmdeploy to reduce GPU memory consumption. The implementation uses rowwise tensor parallelism to shard the vocabulary dimension across multiple GPUs.
Key changes:
- Added `ParallelEmbedding` module with configurable tensor parallelism support via the `is_tp` flag
- Implemented backend abstraction for embedding operations with a default CUDA implementation
- Updated the `deepseek_v2` model to use the new parallel embedding layer
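For context, a minimal sketch of how a rowwise (vocab-parallel) embedding lookup typically works; the function and variable names below are illustrative and are not lmdeploy's actual API:

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist


def vocab_parallel_embedding(input_ids: torch.Tensor,
                             local_weight: torch.Tensor,
                             start_index: int,
                             end_index: int,
                             group=None) -> torch.Tensor:
    """Look up embeddings for the vocab shard [start_index, end_index) owned by this rank."""
    # Tokens outside this rank's shard are remapped to local index 0 and zeroed out below.
    vocab_mask = (input_ids >= start_index) & (input_ids < end_index)
    local_ids = (input_ids - start_index) * vocab_mask
    out = F.embedding(local_ids, local_weight)
    out = out * vocab_mask.unsqueeze(-1)
    # Each token id falls into exactly one shard, so summing the partial results
    # across the TP group reconstructs the full embedding output.
    if group is not None:
        dist.all_reduce(out, group=group)
    return out
```

Each GPU then stores only a `(vocab_size / tp, hidden_dim)` slice of the weight, which is where the memory saving comes from.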
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 17 comments.
Summary per file:
| File | Description |
|---|---|
| lmdeploy/pytorch/nn/embedding.py | Core implementation of ParallelEmbedding with TP support and weight loading logic |
| lmdeploy/pytorch/backends/embedding.py | Abstract base classes for embedding implementation and builder |
| lmdeploy/pytorch/backends/default/embedding.py | Default embedding implementation with masking and all-reduce for TP |
| lmdeploy/pytorch/backends/default/op_backend.py | Registers embedding builder in the default backend |
| lmdeploy/pytorch/backends/base.py | Adds Embedding to OpType enum |
| lmdeploy/pytorch/nn/init.py | Exports ParallelEmbedding for public API |
| lmdeploy/pytorch/models/deepseek_v2.py | Replaces nn.Embedding with ParallelEmbedding in deepseek_v2 model |
| tests/pytorch/kernel/test_embedding.py | Unit tests for parallel embedding with multi-GPU setup |
tests/pytorch/nn/test_embedding.py (Outdated)
    result_queue = mp.Queue()

    for rank in range(world_size):
        p = mp.Process(target=parrall_emb,
The function call contains a typo: 'parrall_emb' should be 'parallel_emb'.
        device=torch.device(type='cuda', index=0))
    token_emb.weight.data.copy_(weight)
    token_emb._fill_padding_idx_with_zero()
    input = x.to(torch.device(type='cuda', index=0))
The variable 'input' shadows the built-in Python function 'input'. Consider renaming it to 'inputs', 'input_tensor', or 'x_cuda' to avoid shadowing built-ins.
    class DefaultEmbeddingImpl(EmbeddingImpl):
        """Embedding implementation api."""
The comment says "Embedding implementation api" but it should be "Embedding implementation API" (API should be uppercase as it's an acronym).
| """Embedding implementation api.""" | |
| """Embedding implementation API.""" |
    @pytest.mark.parametrize('seqlen', [1024, 1011, 128], indirect=True)
    @pytest.mark.parametrize('tp', [2], indirect=True)
    @pytest.mark.parametrize('dtype', [torch.bfloat16], indirect=True)
    def test_embedding(self, vocab_size, feat_size, padding_idx, seqlen, tp, dtype, x, weight, gt):
The test only covers the tensor parallel case (is_tp=True). To ensure the ParallelEmbedding module works correctly in non-TP mode, add test cases with is_tp=False to verify that the module behaves like a standard embedding layer when tensor parallelism is disabled.
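A rough sketch of what such a non-TP case could look like, reusing the fixtures visible in this test file; the `ParallelEmbedding` constructor arguments here are assumptions, not copied from the PR:

```python
@pytest.mark.parametrize('dtype', [torch.bfloat16], indirect=True)
def test_embedding_no_tp(self, vocab_size, feat_size, padding_idx, dtype, x, weight, gt):
    # With is_tp=False the module should behave like a plain nn.Embedding.
    emb = ParallelEmbedding(vocab_size, feat_size, padding_idx=padding_idx,
                            dtype=dtype, device='cuda', is_tp=False)
    emb.weight.data.copy_(weight)
    out = emb(x.cuda())
    torch.testing.assert_close(out, gt.to(out.device), rtol=1e-3, atol=1e-3)
```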
lmdeploy/pytorch/nn/embedding.py (Outdated)
    dist_cfg = get_dist_manager().current_config()
    _, self.rank = get_tp_world_rank(layer_type)
    self.tp, tp_mode = dist_cfg.get_tp_by_layer(layer_type)
Variable tp_mode is not used.
Suggested change:

    - self.tp, tp_mode = dist_cfg.get_tp_by_layer(layer_type)
    + self.tp, _ = dist_cfg.get_tp_by_layer(layer_type)
    @@ -0,0 +1,98 @@
    # Copyright (c) OpenMMLab. All rights reserved.
    import torch
Module 'torch' is imported with both 'import' and 'import from'.
Module 'lmdeploy.pytorch.check_env.torch' is imported with both 'import' and 'import from'.
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch import nn
Module 'torch' is imported with both 'import' and 'import from'.
Module 'lmdeploy.pytorch.check_env.torch' is imported with both 'import' and 'import from'.
We need to check if it affects the inference performance, especially for larger …
    @@ -0,0 +1,125 @@
    import os
This is not a kernel; the unit test should not be placed here.
I’m not quite sure where to place the unit test files—could you give me a suggestion?
Create a new folder under pytorch, maybe pytorch/nn. Or just forget about the unit test; we have a daily e2e test.
    dtype: torch.dtype = None,
    device: torch.device = None,
    is_tp: bool = False,
    padding_size: int = DEFAULT_VOCAB_PADDING_SIZE,
Different layer_type values have different behaviour when dp > 1. Since you want to gather inputs within TP groups, I think the default value should be 'attn'.
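Illustratively, the suggested default could be expressed as below; only the parameters visible in the snippet above are from the PR, the rest of the signature is assumed:

```python
def __init__(self,
             num_embeddings: int,      # assumed parameter name
             embedding_dim: int,       # assumed parameter name
             dtype: torch.dtype = None,
             device: torch.device = None,
             is_tp: bool = False,
             padding_size: int = DEFAULT_VOCAB_PADDING_SIZE,
             layer_type: str = 'attn'):  # default to 'attn' so dp>1 gathers inputs in TP groups
    ...
```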
    out = F.embedding(x, weight)

    if all_reduce:
        dist.all_reduce(out, group=group)
The all-reduce can be placed in the branch above.
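In other words, the all-reduce is only needed when the lookup was done on a vocab shard, so it could live inside the TP branch. A sketch of the idea, assuming the surrounding forward roughly looks like this (the helper's return convention is assumed):

```python
if tp > 1:
    local_ids, inv_vocab_mask = get_masked_input_and_mask(x, start_index, end_index)
    out = F.embedding(local_ids, weight)
    out = out.masked_fill(inv_vocab_mask.unsqueeze(-1), 0)
    dist.all_reduce(out, group=group)  # reduce is only needed on the sharded path
else:
    out = F.embedding(x, weight)
```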
    def get_masked_input_and_mask(input: torch.Tensor, start_index: int, end_index: int):
        vocab_mask = (input >= start_index) & (input < end_index)
This can be done with fewer ops:

    masked_input = (input - start_index).clamp(0, end_index - start_index)
    inv_vocab_mask = masked_input != input
The code may not be right. `masked_input = (input - start_index).clamp(0, end_index - start_index)` will modify the input values, causing even the unmasked values in masked_input to differ from their original values.
What about:

    input = input - start_index
    masked_input = input.clamp(0, end_index - start_index)
    inv_vocab_mask = masked_input != input
It's done. Because it is a right-open interval, the code is `masked_input = input.clamp(0, end_index - start_index - 1)`.
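Putting the thread together, the helper could end up looking roughly like this (a sketch of the agreed logic, not the merged code; the return convention is assumed):

```python
def get_masked_input_and_mask(input_ids: torch.Tensor, start_index: int, end_index: int):
    """Map token ids into the local shard range [0, end_index - start_index)."""
    shifted = input_ids - start_index
    # Right-open interval: the largest valid local id is end_index - start_index - 1.
    masked_input = shifted.clamp(0, end_index - start_index - 1)
    # True where the token does NOT belong to this rank's shard.
    inv_vocab_mask = masked_input != shifted
    return masked_input, inv_vocab_mask
```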
Any questions about this PR? @grimoire

You can rebase onto master to resolve the failed tests.
Hi @Tsundoku958, could you fix the typo specified by Copilot?
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The typo specified by Copilot is fixed.
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from the maintainers.
Motivation
I noticed that the current lmdeploy does not use tensor parallelism for the embedding layer and lm_head, yet they consume nearly as much GPU memory as the linear layers. This PR adds support for tensor parallelism in the embedding layer.
Modification
Perhaps TP (tensor parallelism) for the embedding and lm_head could be enabled by default in lmdeploy, or a new argument could be added to let users control whether to enable or disable embedding parallelism?
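As a concrete illustration of the second option, the switch could surface on the engine config; note that `enable_embedding_tp` below is a hypothetical name and does not exist in lmdeploy:

```python
from lmdeploy import PytorchEngineConfig, pipeline

backend_config = PytorchEngineConfig(
    tp=2,
    # enable_embedding_tp=True,  # hypothetical flag for embedding / lm_head TP
)
pipe = pipeline('deepseek-ai/DeepSeek-V2-Lite', backend_config=backend_config)
```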
@grimoire @lvhan028