[WIP] Write to mlperf run_history bigquery table for mlperf runs #120
base: master
Conversation
Force-pushed from 9e2597e to ecb254e
Force-pushed from ecb254e to 50d7d39
Thank you! Could you also add a run from our team's tests to ensure this does not break the post_process step?
@@ -29,6 +29,7 @@
 BENCHMARK_BQ_JOB_TABLE_NAME = "job_history"
 BENCHMARK_BQ_METRIC_TABLE_NAME = "metric_history"
 BENCHMARK_BQ_METADATA_TABLE_NAME = "metadata_history"
+BENCHMARK_BQ_RUN_TABLE_NAME = "run_history"
Do we want to name it more specifically for mlperf, e.g. mlperf_history or mlperf_result?
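A minimal sketch of how the new constant sits next to the existing table-name constants and how a fully qualified table id might be built from it; the full_table_id helper and its project/dataset arguments are hypothetical, and if the rename suggested above happens, only the constant's value would change.

```python
BENCHMARK_BQ_JOB_TABLE_NAME = "job_history"
BENCHMARK_BQ_METRIC_TABLE_NAME = "metric_history"
BENCHMARK_BQ_METADATA_TABLE_NAME = "metadata_history"
BENCHMARK_BQ_RUN_TABLE_NAME = "run_history"  # could become "mlperf_history" / "mlperf_result"


def full_table_id(project: str, dataset: str, table: str = BENCHMARK_BQ_RUN_TABLE_NAME) -> str:
  """Builds a fully qualified `project.dataset.table` id for BigQuery."""
  return f"{project}.{dataset}.{table}"
```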
num_chips: int
step_time: float
throughput: float
per_device_tflops_per_sec: float
I've seen slight variations of this name, so I'd like to double check: should it spell out teraflop?
multislice_topology: str
num_params: int
global_batch_size: int
per_device_batch_size: float
Per core or per device? I feel "per core" is more widely used.
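For context, a minimal sketch of how the run_history row could be modeled as a dataclass, using only the field names visible in the diff above; the class name RunHistoryRow is a hypothetical stand-in, not the PR's actual type.

```python
import dataclasses


@dataclasses.dataclass
class RunHistoryRow:
  """One benchmark run, mirroring the fields shown in the diff."""
  num_chips: int
  step_time: float
  throughput: float
  per_device_tflops_per_sec: float
  multislice_topology: str
  num_params: int
  global_batch_size: int
  per_device_batch_size: float
```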
  summary_config: metric_config.SummaryConfig,
-) -> (List[List[bigquery.MetricHistoryRow]], List[List[bigquery.MetadataHistoryRow]],):
+) -> (Dict[str, Any], Dict[str, Any],):
  """Process metrics and dimensions from TensorBoard file.

  Args:
    base_id: The unique ID for this test job.
Nit: remove base_id from the docstring.
  summary_config: metric_config.SummaryConfig,
-) -> (List[List[bigquery.MetricHistoryRow]], List[List[bigquery.MetadataHistoryRow]],):
+) -> (Dict[str, Any], Dict[str, Any],):
Could these be Dict[str, float] and Dict[str, str]? Same comment for the function below.
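A minimal sketch of the signature with the tighter types suggested above (Dict[str, float] for aggregated metrics, Dict[str, str] for metadata); the function name process_tensorboard_summary is a hypothetical stand-in, and metric_config is assumed to be imported elsewhere in the module.

```python
from typing import Dict, Tuple


def process_tensorboard_summary(
    summary_config: "metric_config.SummaryConfig",  # metric_config assumed imported elsewhere
) -> Tuple[Dict[str, float], Dict[str, str]]:
  """Processes metrics and dimensions from a TensorBoard file."""
  ...
```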
    * int(metadata["max_target_length/text_summary"])
) / aggregated_metrics["perf/step_time_seconds"]
precision = (
    "bfloat16"
Will we always have either bf16 or int8, or are other quantizations available?
For now, only bf16 and int8 are available.
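A minimal sketch (not the PR's code) of the arithmetic in the diff above: tokens processed per step divided by step time, plus the bf16/int8 choice mentioned in this thread. Dictionary keys not visible in the diff, such as "quantization/text_summary", are guesses.

```python
from typing import Dict


def derive_throughput_and_precision(
    metadata: Dict[str, str],
    aggregated_metrics: Dict[str, float],
    global_batch_size: int,
) -> Dict[str, object]:
  # Tokens per step = global batch size * sequence length (key from the diff).
  tokens_per_step = global_batch_size * int(metadata["max_target_length/text_summary"])
  throughput = tokens_per_step / aggregated_metrics["perf/step_time_seconds"]
  # Only bf16 and int8 are in use for now; the quantization key is hypothetical.
  precision = "int8" if metadata.get("quantization/text_summary") == "int8" else "bfloat16"
  return {"throughput": throughput, "precision": precision}
```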
description=task_test_config.test_name,
platform="Cloud",
date=datetime.datetime.now(),
base_cl="",
What's the plan for this metadata?
I'm not quite sure yet; I think I'll need to add a change in MaxText to get this value into the TensorBoard file. For now I'll leave it blank.
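A tiny sketch of the plan described above: read base_cl from TensorBoard metadata once MaxText exports it, falling back to an empty string until then. The metadata key name is a guess.

```python
from typing import Dict


def get_base_cl(metadata: Dict[str, str]) -> str:
  # Hypothetical key; returns "" until MaxText writes this value to TensorBoard.
  return metadata.get("base_cl/text_summary", "")
```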
ici_mesh_shape=f"[{ici_data_parallelism}, {ici_fsdp_parallelism}, {ici_sequence_parallelism}, {ici_tensor_parallelism}, {ici_autoregressive_parallelism}]",
dcn_mesh_shape=f"[{dcn_data_parallelism}, {dcn_fsdp_parallelism}, {dcn_sequence_parallelism}, {dcn_tensor_parallelism}, {dcn_autoregressive_parallelism}]",
xprof="",
mfu=0.0,
Will this be extracted from TensorBoard?
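If mfu is eventually computed rather than read, a minimal sketch: achieved per-device TFLOP/s divided by the chip's peak TFLOP/s. The default peak of 275 TFLOP/s (TPU v4, bf16) is only an illustrative assumption, not a value taken from this PR.

```python
def compute_mfu(per_device_tflops_per_sec: float, peak_tflops_per_sec: float = 275.0) -> float:
  """Model FLOPs utilization: achieved TFLOP/s over peak TFLOP/s for the chip."""
  return per_device_tflops_per_sec / peak_tflops_per_sec
```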
Description

Parse the TensorBoard file and write to a new table run_history in the mlperf_dataset, only for mlperf runs. The schema is as discussed in http://shortn/_Hr3sWp59UK.

Tests

Ran maxtext_sweep_gke_example_dag (http://shortn/_NApHfEATw2) and wrote metrics to mlperf_dataset (http://shortn/_3CfsiDPr6v).
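A minimal sketch (not this PR's implementation) of inserting one parsed run into the run_history table with the google-cloud-bigquery client; the project id is a placeholder.

```python
from google.cloud import bigquery


def write_run_row(row: dict, project: str = "my-project", dataset: str = "mlperf_dataset") -> None:
  """Streams a single run row into project.dataset.run_history."""
  client = bigquery.Client(project=project)
  errors = client.insert_rows_json(f"{project}.{dataset}.run_history", [row])
  if errors:
    raise RuntimeError(f"BigQuery insert failed: {errors}")
```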
Checklist

Before submitting this PR, please make sure (put X in square brackets):