RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_CUDA_bmm) #34695

Open · ra-MANUJ-an opened this issue on Nov 12, 2024 · 0 comments
I am trying to finetune the Qwen2-0.5B model on some training data using a multi-GPU setup. The same code (given further below) works in a single-GPU setting (when I set CUDA_VISIBLE_DEVICES=0) but fails in the multi-GPU setting.
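For reference, the only difference between the working and the failing run is GPU visibility (a minimal sketch; the env-var handling is the only relevant part and must happen before CUDA is initialized):

```python
import os

# Single-GPU run (works): hide all but one GPU before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Multi-GPU run (fails): leave both GPUs visible, e.g. "0,1", or unset entirely.
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # 1 in the working setup, 2 in the failing one
```

With both GPUs visible, the forward pass fails with the traceback below: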
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[15], line 4
      2 import torch
      3 torch.autograd.set_detect_anomaly(True)
----> 4 main()

Cell In[12], line 15, in main()
      8 trainer = Trainer(env_params=env_params,
      9                   model_params=model_params,
     10                   optimizer_params=optimizer_params,
     11                   trainer_params=trainer_params)
     13 copy_all_src(trainer.result_folder)
---> 15 trainer.run()

File ~/neuralcombinatorialoptimization/NCO-master/NEW_py_ver/TSP/POMO/TSPTrainerTransformer.py:82, in TSPTrainer.run(self)
     79 self.scheduler.step()
     81 # Train
---> 82 train_score, train_loss = self._train_one_epoch(epoch)
     83 self.result_log.append('train_score', epoch, train_score)
     84 self.result_log.append('train_loss', epoch, train_loss)

File ~/neuralcombinatorialoptimization/NCO-master/NEW_py_ver/TSP/POMO/TSPTrainerTransformer.py:139, in TSPTrainer._train_one_epoch(self, epoch)
    136 remaining = train_num_episode - episode
    137 batch_size = min(self.trainer_params['train_batch_size'], remaining)
--> 139 avg_score, avg_loss = self._train_one_batch(batch_size)
    140 score_AM.update(avg_score, batch_size)
    141 loss_AM.update(avg_loss, batch_size)

File ~/neuralcombinatorialoptimization/NCO-master/NEW_py_ver/TSP/POMO/TSPTrainerTransformer.py:175, in TSPTrainer._train_one_batch(self, batch_size)
    173 # print(4, type(state), state)
    174 while not done:
--> 175     selected, prob = self.model.module(state)
    176     # print(3, selected.shape)
    177     state, reward, done = self.env.step(selected)

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/neuralcombinatorialoptimization/NCO-master/NEW_py_ver/TSP/POMO/TSPTransformerModelQuant_b.py:52, in TSPTransformer.forward(self, state)
     50     return self._init_sequence(batch_size, pomo_size)
     51 else:
---> 52     return self._continue_sequence(state, batch_size, pomo_size)

File ~/neuralcombinatorialoptimization/NCO-master/NEW_py_ver/TSP/POMO/TSPTransformerModelQuant_b.py:100, in TSPTransformer._continue_sequence(self, state, batch_size, pomo_size)
     96 state.ninf_mask = state.ninf_mask.to(self.device)
     98 # Get probabilities from decoder
--> 100 probs = self.decoder(self.seq_so_far, self.input_mask, state.ninf_mask)
    102 # Select next node
    103 if self.training or self.model_params['eval_type'] == 'softmax':

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/neuralcombinatorialoptimization/NCO-master/NEW_py_ver/TSP/POMO/TSPTransformerModelQuant_b.py:261, in Decoder.forward(self, seq_so_far, inp_mask, ninf_mask)
    258 flat_mask = flat_mask.to(self.device)
    260 # Forward pass through model
--> 261 outputs = self.model(inputs_embeds=flat_seq, attention_mask=flat_mask)
    262 logits = outputs.logits.to(self.device)
    264 # Get last valid position

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/second/lib/python3.10/site-packages/peft/peft_model.py:1644, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs)
   1642 with self._enable_peft_forward_hooks(**kwargs):
   1643     kwargs = {k: v for k, v in kwargs.items() if k not in self.special_peft_forward_args}
-> 1644     return self.base_model(
   1645         input_ids=input_ids,
   1646         attention_mask=attention_mask,
   1647         inputs_embeds=inputs_embeds,
   1648         labels=labels,
   1649         output_attentions=output_attentions,
   1650         output_hidden_states=output_hidden_states,
   1651         return_dict=return_dict,
   1652         **kwargs,
   1653     )
   1655 batch_size = _get_batch_size(input_ids, inputs_embeds)
   1656 if attention_mask is not None:
   1657     # concat prompt attention mask

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/second/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:197, in BaseTuner.forward(self, *args, **kwargs)
    196 def forward(self, *args: Any, **kwargs: Any):
--> 197     return self.model.forward(*args, **kwargs)

File ~/second/lib/python3.10/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168     output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/second/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:1164, in Qwen2ForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep, **loss_kwargs)
   1161 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   1163 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1164 outputs = self.model(
   1165     input_ids=input_ids,
   1166     attention_mask=attention_mask,
   1167     position_ids=position_ids,
   1168     past_key_values=past_key_values,
   1169     inputs_embeds=inputs_embeds,
   1170     use_cache=use_cache,
   1171     output_attentions=output_attentions,
   1172     output_hidden_states=output_hidden_states,
   1173     return_dict=return_dict,
   1174     cache_position=cache_position,
   1175 )
   1177 hidden_states = outputs[0]
   1178 # Only compute necessary logits, and do not upcast them to float if we are not computing the loss

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/second/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:871, in Qwen2Model.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, cache_position)
    868 hidden_states = inputs_embeds
    870 # create position embeddings to be shared across the decoder layers
--> 871 position_embeddings = self.rotary_emb(hidden_states, position_ids)
    873 # decoder layers
    874 all_hidden_states = () if output_hidden_states else None

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/second/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/second/lib/python3.10/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168     output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/second/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/second/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:163, in Qwen2RotaryEmbedding.forward(self, x, position_ids)
    161 device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
    162 with torch.autocast(device_type=device_type, enabled=False):
--> 163     freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
    164     emb = torch.cat((freqs, freqs), dim=-1)
    165     cos = emb.cos()

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
```
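The failing line is inside Qwen2RotaryEmbedding.forward, where inv_freq_expanded and position_ids_expanded apparently sit on different GPUs. Since load_model (below) loads the base model with device_map="auto", I suspect accelerate is sharding the Qwen2 layers across cuda:0 and cuda:1 while the rest of my wrapper keeps its tensors on a single self.device. A quick way to check the placement (a sketch; qwen_model stands for the object returned by load_model):

```python
# Inspect how accelerate placed the Qwen2 submodules when the model was loaded
# with device_map="auto". hf_device_map is only set when a device_map was used.
print(getattr(qwen_model, "hf_device_map", None))
# e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'model.layers.23': 1, 'lm_head': 1}

# And where the parameters actually live:
print({p.device for p in qwen_model.parameters()})
# More than one entry here would explain the cuda:0 / cuda:1 mismatch in rotary_emb.
```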
Code for the above error is given below (Model.py):

```python
# Model.py
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel, TaskType
from typing import Optional, Dict, Any, Tuple


class TSPTransformer(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
        self.model_params = kwargs
        # Set device first
        self.device = kwargs.get('device', torch.device("cuda" if torch.cuda.is_available() else "cpu"))
        # Initialize components
        self.encoder = Encoder(**kwargs).to(self.device)
        self.embedding_size = kwargs.get('embedding_dim', 896)
        # Load the model with LoRA and 4-bit quantization if needed
        self.model = load_model(kwargs)
        kwargs['device'] = self.device  # Ensure decoder gets the same device
        self.decoder = Decoder(self.model, **kwargs)
        # Initialize state storage
        self.encoded_nodes = None
        self.seq_so_far = None
        self.input_mask = None
        self.t = None

    def pre_forward(self, reset_state):
        """Initialize model state for new sequence"""
        # Move input to correct device
        problems = reset_state.problems.to(self.device)
        self.encoded_nodes = self.encoder(problems)
        self.problem_size = problems.size(1)
        self.batch_size = problems.size(0)

    def forward(self, state) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        # Ensure state tensors are on correct device
        state.BATCH_IDX = state.BATCH_IDX.to(self.device)
        state.POMO_IDX = state.POMO_IDX.to(self.device)
        if state.ninf_mask is not None:
            state.ninf_mask = state.ninf_mask.to(self.device)
        if state.current_node is not None:
            state.current_node = state.current_node.to(self.device)

        batch_size = state.BATCH_IDX.size(0)
        pomo_size = state.BATCH_IDX.size(1)

        if state.current_node is None:
            return self._init_sequence(batch_size, pomo_size)
        else:
            return self._continue_sequence(state, batch_size, pomo_size)

    def _init_sequence(self, batch_size: int, pomo_size: int) -> Tuple[torch.Tensor, torch.Tensor]:
        """Initialize sequence state"""
        self.t = 0  # Start at 0 instead of -1
        # Create new tensors instead of modifying in place
        selected = torch.arange(pomo_size, device=self.device).expand(batch_size, pomo_size)
        prob = torch.ones(size=(batch_size, pomo_size), device=self.device)
        # Initialize sequence storage with proper dimensions
        self.seq_so_far = torch.zeros(
            (batch_size, pomo_size, self.problem_size, self.embedding_size),
            device=self.device
        )
        self.input_mask = torch.zeros(
            (batch_size, pomo_size, self.problem_size),
            dtype=torch.bool,
            device=self.device
        )
        return selected, prob

    def _continue_sequence(self, state, batch_size: int, pomo_size: int) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        """Continue sequence generation"""
        # Get encoded representation of current node
        encoded_current = self._get_encoded_node(state.current_node)
        # Move tensors to correct device
        encoded_current = encoded_current.to(self.device)
        # Create new tensor for updated sequence
        new_seq = self.seq_so_far.clone().to(self.device)
        new_seq[:, :, self.t, :] = encoded_current
        self.seq_so_far = new_seq
        # Create new tensor for updated mask
        new_mask = self.input_mask.clone().to(self.device)
        new_mask[:, :, self.t] = True
        self.input_mask = new_mask
        # Ensure state.ninf_mask is on correct device
        if state.ninf_mask is not None:
            state.ninf_mask = state.ninf_mask.to(self.device)
        # Get probabilities from decoder
        probs = self.decoder(self.seq_so_far, self.input_mask, state.ninf_mask)
        # Select next node
        if self.training or self.model_params['eval_type'] == 'softmax':
            selected, prob = self._sample_node(probs, state, batch_size, pomo_size)
        else:
            selected = probs.argmax(dim=2)
            prob = None
        self.t += 1
        return selected, prob

    def _get_encoded_node(self, node_indices: torch.Tensor) -> torch.Tensor:
        """Get encoded representation of nodes safely"""
        batch_size, pomo_size = node_indices.shape
        embedding_dim = self.encoded_nodes.size(2)
        # Create gathering index
        gather_idx = node_indices[:, :, None].expand(batch_size, pomo_size, embedding_dim)
        # gather_idx = gather_idx.to(self.encoded_nodes.device)
        self.encoded_nodes = self.encoded_nodes.to(self.device)
        gather_idx = gather_idx.to(self.device)
        # Ensure gather_idx is within the range of self.encoded_nodes.size(1)
        max_valid_index = self.encoded_nodes.size(1) - 1
        gather_idx = torch.clamp(gather_idx, min=0, max=max_valid_index)
        # assert gather_idx.max() <= max_valid_index, "gather_idx contains indices out of bounds"
        # Gather encoded representations
        return self.encoded_nodes.gather(dim=1, index=gather_idx)

    def _sample_node(self, probs: torch.Tensor, state, batch_size: int, pomo_size: int) -> Tuple[torch.Tensor, torch.Tensor]:
        """Sample next node with retry logic"""
        max_attempts = 100
        for _ in range(max_attempts):
            # probs = probs.to(self.device)
            # Reshape for sampling
            flat_probs = probs.reshape(batch_size * pomo_size, -1)
            # Sample indices
            selected = flat_probs.multinomial(1, replacement=True)
            selected = selected.reshape(batch_size, pomo_size)
            # Calculate probabilities
            prob = probs[state.BATCH_IDX, state.POMO_IDX, selected]
            prob = prob.reshape(batch_size, pomo_size)
            if (prob > 0).all():
                return selected, prob
        raise RuntimeError(f"Failed to sample valid nodes after {max_attempts} attempts")


class Encoder(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
        self.embedding_dim = kwargs.get('embedding_dim', 896)
        self.device = kwargs.get('device', torch.device("cuda" if torch.cuda.is_available() else "cpu"))
        # Linear layer to embed node coordinates
        self.embed_layer = nn.Linear(2, self.embedding_dim)
        # Multi-head self-attention layer
        self.num_heads = kwargs.get('num_attention_heads', 8)
        self.attention_layer = nn.MultiheadAttention(
            embed_dim=self.embedding_dim,
            num_heads=self.num_heads,
            batch_first=True
        )
        # Register positional encoding as a buffer so it's not updated during training
        self.register_buffer(
            "positional_encoding",
            self._generate_positional_encoding(kwargs.get('problem_size', 20), self.embedding_dim)
        )

    def _generate_positional_encoding(self, problem_size: int, embed_dim: int) -> torch.Tensor:
        """Generate sinusoidal positional encoding for input sequences."""
        # Create a matrix of shape (problem_size, embed_dim) to hold the positional encodings
        encoding = torch.zeros(problem_size, embed_dim)
        position = torch.arange(0, problem_size, dtype=torch.float).unsqueeze(1)  # Shape: (problem_size, 1)
        # Compute the division terms for sine and cosine functions
        div_term = torch.exp(torch.arange(0, embed_dim, 2).float() * (-torch.log(torch.tensor(10000.0)) / embed_dim))
        # Apply sine to even indices and cosine to odd indices
        encoding[:, 0::2] = torch.sin(position * div_term)  # Sine for even dimensions
        encoding[:, 1::2] = torch.cos(position * div_term)  # Cosine for odd dimensions
        return encoding.unsqueeze(0)  # Shape: (1, problem_size, embed_dim) to allow broadcasting

    def forward(self, problems):
        # Ensure `problems` is on the same device as the embedding layer
        problems = problems.to(self.embed_layer.weight.device)
        batch_size, problem_size = problems.shape[:2]
        # Embed node coordinates
        embedded = self.embed_layer(problems.reshape(-1, 2))
        embedded = embedded.reshape(batch_size, problem_size, self.embedding_dim)
        # Align positional encoding to the device of `embedded`
        pos_encoding = self.positional_encoding[:, :problem_size, :].to(embedded.device)
        embedded = embedded + pos_encoding  # Broadcast positional encoding across the batch
        # Apply self-attention
        attention_output, _ = self.attention_layer(embedded, embedded, embedded)
        # Create position indices for concatenation if needed
        # ids = torch.arange(problem_size, device=self.device).expand(batch_size, problem_size)
        # Concatenate position indices with attention output (optional, depends on the architecture)
        # return torch.cat([ids.unsqueeze(-1).float(), attention_output], dim=-1)
        return attention_output


class Decoder(nn.Module):
    def __init__(self, model: nn.Module, **kwargs):
        super().__init__()
        self.model = model
        self.problem_size = kwargs.get('problem_size', 20)
        self.use_lora = kwargs.get('use_lora', True)
        self.device = kwargs.get('device', torch.device("cuda" if torch.cuda.is_available() else "cpu"))
        self._setup_model()

    def _setup_model(self):
        """Configure model architecture"""
        # Get base model if wrapped in DataParallel
        base_model = self.model
        # Modify output size
        base_model.lm_head = nn.Linear(
            base_model.config.hidden_size,
            self.problem_size
        ).to(self.device)
        # Apply LoRA if requested
        if self.use_lora:
            lora_config = LoraConfig(
                r=4,
                lora_alpha=32,
                target_modules=["q_proj", "v_proj"],
                lora_dropout=0.1,
                bias="none",
                task_type=TaskType.CAUSAL_LM
            )
            self.model = get_peft_model(base_model, lora_config)

    def forward(self, seq_so_far: torch.Tensor, inp_mask: torch.Tensor, ninf_mask: torch.Tensor) -> torch.Tensor:
        batch_size, pomo_size, problem_size, embedding_dim = seq_so_far.shape
        # Reshape inputs
        flat_seq = seq_so_far.reshape(batch_size * pomo_size, problem_size, embedding_dim)
        flat_mask = inp_mask.reshape(batch_size * pomo_size, problem_size)
        try:
            # Ensure inputs are on the correct device
            flat_seq = flat_seq.to(self.device)
            flat_mask = flat_mask.to(self.device)
            # Forward pass through model
            outputs = self.model(inputs_embeds=flat_seq, attention_mask=flat_mask)
            logits = outputs.logits.to(self.device)
            # Get last valid position
            last_positions = flat_mask.sum(dim=1).long() - 1
            # Gather logits for last positions
            batch_indices = torch.arange(batch_size * pomo_size, device=self.device)
            gathered_logits = logits[batch_indices, last_positions]
            # Reshape and apply mask
            logits = gathered_logits.reshape(batch_size, pomo_size, problem_size)
            ninf_mask = ninf_mask.to(self.device)
            masked_logits = logits + ninf_mask.float()
            # Return probabilities
            return torch.softmax(masked_logits, dim=2)
        except Exception as e:
            print(f"Error in decoder forward pass: {e}")
            print(f"Device info - Model device: {self.device}, Input: {flat_seq.device}, Mask: {flat_mask.device}")
            raise


def load_model(config: Dict[str, Any]) -> nn.Module:
    """Load model with proper configuration"""
    # print(config)
    device = config.get('device', torch.device("cuda" if torch.cuda.is_available() else "cpu"))

    if config.get('checkpoint_path'):
        # print('checkpoint_path')
        try:
            return PeftModel.from_pretrained(
                config['model_name'],
                config['checkpoint_path'],
                is_trainable=True
            ).to(device)
        except Exception as e:
            print(f"Error loading checkpoint: {e}")
            print("Falling back to base model...")

    if config.get('use_4bit', True):
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_quant_type="nf4",
            llm_int8_threshold=6.0,
            bnb_4bit_use_double_quant=True,
        )
        model = AutoModelForCausalLM.from_pretrained(
            config['model_name'],
            trust_remote_code=True,
            device_map="auto",
            torch_dtype=torch.bfloat16,
            quantization_config=bnb_config
        )
        model = prepare_model_for_kbit_training(model)
        model.config.use_cache = False
    else:
        # print('else')
        model = AutoModelForCausalLM.from_pretrained(
            config['model_name'],
            torch_dtype=torch.bfloat16,
            trust_remote_code=True
        ).to(device)

    return model
```
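For what it's worth, one change I would expect to avoid the layer sharding altogether (not part of the script above, just a sketch reusing the same config keys as load_model) is to pin the whole quantized model to a single device instead of passing device_map="auto", since the 0.5B model easily fits on one GPU:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training


def load_model_single_device(config):
    """Variant of load_model that keeps every Qwen2 submodule on one GPU.

    device_map={"": idx} places the entire model on that single device, so the
    rotary-embedding matmul never mixes cuda:0 and cuda:1 tensors.
    """
    idx = torch.cuda.current_device()  # or whichever index this process should own
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        config['model_name'],
        trust_remote_code=True,
        device_map={"": idx},          # single device instead of "auto"
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config,
    )
    model = prepare_model_for_kbit_training(model)
    model.config.use_cache = False
    return model
```

Multi-GPU training would then have to come from replicating the model per device (e.g. one process per GPU) rather than from splitting its layers across devices, but I am not sure whether that is the intended fix here.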
Expected behavior
The model should train in a multi-GPU setting without throwing any errors. The same script works in the single-GPU setting but throws the above error in the multi-GPU setting.