Description
Hello,
I am learning to use nanotron. I ran into a few issues while running the example training with tiny Llama, which I was able to resolve with some modifications to nanotron's source code.
For context, I am running the same installation scripts as provided, and I slightly modified the tiny Llama config so that it runs on 1 GPU by reducing parameters such as the number of heads and layers. Everything else is as suggested in the quick start guide.
- There is an issue with `llama.py` on the following line (`nanotron/src/nanotron/models/llama.py`, line 1095 at c737f00):

```python
parametrizator = parametrizator_cls(config=config.model)
```
It should instead be as in `qwen.py` (`nanotron/src/nanotron/models/qwen.py`, line 899 at c737f00):

```python
parametrizator = parametrizator_cls(config=config)
```
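To illustrate why this matters, here is a minimal, self-contained sketch with hypothetical stand-in classes (`ModelArgs`, `Config`, `Parametrizator`, and the `init_method_std` field are assumptions for illustration, not nanotron's actual definitions): a parametrizator that reads a field living on the top-level config raises `AttributeError` when handed only the `config.model` sub-config.

```python
from dataclasses import dataclass, field

@dataclass
class ModelArgs:
    """Stand-in for the model sub-config (hypothetical)."""
    hidden_size: int = 64

@dataclass
class Config:
    """Stand-in for the top-level config (hypothetical)."""
    model: ModelArgs = field(default_factory=ModelArgs)
    init_method_std: float = 0.02  # assumed to live on the top-level config

class Parametrizator:
    """Reads a top-level field, so it needs the full config object."""
    def __init__(self, config):
        self.std = config.init_method_std

config = Config()

# llama.py style: passing the sub-config fails
try:
    Parametrizator(config=config.model)
    raised = False
except AttributeError:
    raised = True
assert raised

# qwen.py style: passing the full config works
assert Parametrizator(config=config).std == 0.02
```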
- The following code snippet in `trainer.py` doesn't work with huggingface datasets (`nanotron/src/nanotron/trainer.py`, lines 574 to 590 at c737f00):

```python
if hasattr(self.current_base_dl, "dataset"):
    self.current_base_dl.dataset.update_consumption_metrics(
        start_idx=(self.iteration_step - 1) * self.global_batch_size,  # assumes we start from iteration_step=1
        end_idx=self.iteration_step * self.global_batch_size,
        sequence_length=self.sequence_length,
    )

# Training Logs
# Track consumed tokens for all dataset folders in current stage
if hasattr(self.current_base_dl, "dataset"):
    consumption_stats = self.current_base_dl.dataset.get_consumption_stats()
    current_stage = self.metadata.data_stages[self.metadata.last_stage_idx]
    # Update consumed tokens for all folders in the consumption stats
    for folder_path, stats in consumption_stats.items():
        current_stage.consumed_tokens_per_dataset_folder[folder_path] = stats["tokens"]
```
It assumes the `update_consumption_metrics()` method exists, but as far as I know it only exists for Nanosets.
I had to comment out that part plus the following:
`nanotron/src/nanotron/trainer.py`, lines 878 to 885 at c737f00:

```python
# Log consumption statistics
if hasattr(self.current_base_dl, "dataset"):
    for dataset_name, stats in self.current_base_dl.dataset.get_consumption_stats().items():
        basic_log_entries.extend(
            [
                LogItem(f"dataloader/consumed_tokens/{dataset_name}", stats["tokens"], "human_format"),
            ]
        )
```
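Instead of commenting the code out, one possible fix is to check for the method itself rather than only for a `.dataset` attribute, so datasets that lack consumption tracking are skipped gracefully. Below is a minimal, self-contained sketch of that pattern; `HFDatasetStub`, `Nanoset`, and `maybe_update_consumption` are hypothetical stand-ins, not nanotron's actual code.

```python
class HFDatasetStub:
    """Stands in for a huggingface dataset with no consumption tracking."""

class Nanoset:
    """Stands in for a Nanoset-like dataset that does track consumption."""
    def __init__(self):
        self.calls = []
    def update_consumption_metrics(self, start_idx, end_idx, sequence_length):
        self.calls.append((start_idx, end_idx, sequence_length))

def maybe_update_consumption(dataset, start_idx, end_idx, sequence_length):
    # getattr + callable check instead of assuming the method exists
    update = getattr(dataset, "update_consumption_metrics", None)
    if callable(update):
        update(start_idx=start_idx, end_idx=end_idx, sequence_length=sequence_length)
        return True
    return False

# HF dataset: skipped without raising AttributeError
assert maybe_update_consumption(HFDatasetStub(), 0, 8, 128) is False

# Nanoset: metrics recorded as before
ns = Nanoset()
assert maybe_update_consumption(ns, 0, 8, 128) is True
assert ns.calls == [(0, 8, 128)]
```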
- The example .yaml configs set `lighteval: null`. However, this causes an issue in `trainer.py`:
`nanotron/src/nanotron/trainer.py`, line 1178 at c737f00:

```python
eval_interval_file = self.config.lighteval.eval_interval_file
```
There should be a null check for `lighteval` before that line, similar to the existing one (`nanotron/src/nanotron/trainer.py`, line 320 at c737f00):

```python
if self.config.lighteval is not None:
```
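The guard can be sketched in isolation as follows; `LightEvalArgs`, `Config`, and `get_eval_interval_file` are hypothetical stand-ins used only to demonstrate the null-check pattern against a `lighteval: null` config.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LightEvalArgs:
    """Stand-in for the lighteval section of the config (hypothetical)."""
    eval_interval_file: Optional[str] = None

@dataclass
class Config:
    """Stand-in for the parsed YAML config; lighteval may be null."""
    lighteval: Optional[LightEvalArgs] = None

def get_eval_interval_file(config: Config) -> Optional[str]:
    # Guard against "lighteval: null" before dereferencing the field
    if config.lighteval is not None:
        return config.lighteval.eval_interval_file
    return None

# lighteval: null -> no AttributeError, just None
assert get_eval_interval_file(Config()) is None
# lighteval section present -> field is read normally
assert get_eval_interval_file(Config(lighteval=LightEvalArgs("intervals.txt"))) == "intervals.txt"
```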
With the changes above, I was able to run a small debug training run end to end.
While some of these three issues may stem from incorrect configuration or script arguments on my part, as I am still learning how to use Nanotron, I believe other people in a similar position could run into the same problems.