Issues running tiny Llama quick start example #362

@marluxiaboss

Description

Hello,

I am learning to use nanotron. I encountered a few issues while running the example training with tiny Llama, which I was able to resolve with some modifications to nanotron's source code.

For context, I am using the installation scripts as provided, and I modified the tiny Llama config slightly so that it runs on a single GPU, reducing parameters such as the number of heads and layers (see the sketch below). Everything else is as suggested in the quick start guide.
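For reference, a sketch of the kind of edits I mean, assuming the key layout of examples/config_tiny_llama.yaml (the values here are illustrative, not the exact ones I used):

    # Illustrative single-GPU tweak; key names assumed from
    # examples/config_tiny_llama.yaml, values are placeholders.
    model:
      model_config:
        num_hidden_layers: 2     # reduced
        num_attention_heads: 4   # reduced
    parallelism:
      dp: 1
      tp: 1
      pp: 1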

1. There is an issue in llama.py on the following line:

    parametrizator = parametrizator_cls(config=config.model)

It should instead be, as in qwen.py:

    parametrizator = parametrizator_cls(config=config)
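Presumably the parametrizator now reads fields from the full training config (which is what qwen.py already passes), so handing it only the model sub-config breaks the first nested attribute access. A minimal stand-in sketch of the failure mode, not nanotron's actual classes:

    from types import SimpleNamespace

    # Stand-in objects mimicking the config layout (shape assumed, not nanotron's real types):
    full_config = SimpleNamespace(
        model=SimpleNamespace(model_config=SimpleNamespace(num_hidden_layers=2))
    )

    class FakeParametrizator:
        # Same shape of lookup the real parametrizator seems to perform:
        # it expects the full config, not the model sub-config.
        def __init__(self, config):
            self.num_layers = config.model.model_config.num_hidden_layers

    FakeParametrizator(config=full_config)        # works
    FakeParametrizator(config=full_config.model)  # AttributeError: no .model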

2. The following code snippet in trainer.py doesn't work with Hugging Face datasets:

    if hasattr(self.current_base_dl, "dataset"):
        self.current_base_dl.dataset.update_consumption_metrics(
            start_idx=(self.iteration_step - 1) * self.global_batch_size,  # assumes we start from iteration_step=1
            end_idx=self.iteration_step * self.global_batch_size,
            sequence_length=self.sequence_length,
        )

    # Training Logs
    # Track consumed tokens for all dataset folders in current stage
    if hasattr(self.current_base_dl, "dataset"):
        consumption_stats = self.current_base_dl.dataset.get_consumption_stats()
        current_stage = self.metadata.data_stages[self.metadata.last_stage_idx]
        # Update consumed tokens for all folders in the consumption stats
        for folder_path, stats in consumption_stats.items():
            current_stage.consumed_tokens_per_dataset_folder[folder_path] = stats["tokens"]

It assumes that the update_consumption_metrics() method exists, but as far as I know it only exists for Nanosets. I had to comment out that part, plus the following:

    # Log consumption statistics
    if hasattr(self.current_base_dl, "dataset"):
        for dataset_name, stats in self.current_base_dl.dataset.get_consumption_stats().items():
            basic_log_entries.extend(
                [
                    LogItem(f"dataloader/consumed_tokens/{dataset_name}", stats["tokens"], "human_format"),
                ]
            )
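Rather than commenting these out, a guard on the methods themselves might be cleaner (a sketch of the idea, not tested against every dataset type):

    # Only call the consumption-tracking API when the dataset provides it
    # (Nanosets do; plain Hugging Face datasets don't).
    dataset = getattr(self.current_base_dl, "dataset", None)
    if dataset is not None and hasattr(dataset, "update_consumption_metrics"):
        dataset.update_consumption_metrics(
            start_idx=(self.iteration_step - 1) * self.global_batch_size,
            end_idx=self.iteration_step * self.global_batch_size,
            sequence_length=self.sequence_length,
        )
    if dataset is not None and hasattr(dataset, "get_consumption_stats"):
        consumption_stats = dataset.get_consumption_stats()
        current_stage = self.metadata.data_stages[self.metadata.last_stage_idx]
        for folder_path, stats in consumption_stats.items():
            current_stage.consumed_tokens_per_dataset_folder[folder_path] = stats["tokens"]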

3. The example .yaml configs have "lighteval: null". However, this causes an issue in trainer.py:

    eval_interval_file = self.config.lighteval.eval_interval_file

There should be a null check on lighteval beforehand, similar to the following:

    if self.config.lighteval is not None:
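Concretely, something along these lines (a sketch; defaulting eval_interval_file to None when lighteval is unset is my assumption about what the surrounding code tolerates):

    # Guard the attribute access so configs with `lighteval: null` still load.
    eval_interval_file = None
    if self.config.lighteval is not None:
        eval_interval_file = self.config.lighteval.eval_interval_file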

With the above changes, I was able to run a small debug training run end to end.
While it is possible that some of these three issues are due to incorrect configuration or script arguments on my side, as I am still discovering how to use nanotron, I believe other people in a similar position could face the same issues.
