[WIP] Write to mlperf run_history bigquery table for mlperf runs #120
base: master
Conversation
Force-pushed from 9e2597e to ecb254e
Force-pushed from ecb254e to 50d7d39
Thank you! Could you also add a run from our team's tests to ensure this does not break the post_process step?
@@ -29,6 +29,7 @@
 BENCHMARK_BQ_JOB_TABLE_NAME = "job_history"
 BENCHMARK_BQ_METRIC_TABLE_NAME = "metric_history"
 BENCHMARK_BQ_METADATA_TABLE_NAME = "metadata_history"
+BENCHMARK_BQ_RUN_TABLE_NAME = "run_history"
Do we want to name it more specifically for mlperf, e.g. mlperf_history or mlperf_result?
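A minimal sketch of how the new constant sits next to the existing table-name constants and how a fully qualified table id might be built from it; the full_table_id helper and its project/dataset arguments are hypothetical, and if the rename suggested above happens, only the constant's value would change.

```python
BENCHMARK_BQ_JOB_TABLE_NAME = "job_history"
BENCHMARK_BQ_METRIC_TABLE_NAME = "metric_history"
BENCHMARK_BQ_METADATA_TABLE_NAME = "metadata_history"
BENCHMARK_BQ_RUN_TABLE_NAME = "run_history"  # could become "mlperf_history" / "mlperf_result"


def full_table_id(project: str, dataset: str, table: str = BENCHMARK_BQ_RUN_TABLE_NAME) -> str:
  """Builds a fully qualified `project.dataset.table` id for BigQuery."""
  return f"{project}.{dataset}.{table}"
```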
num_chips: int
step_time: float
throughput: float
per_device_tflops_per_sec: float
I've seen slight variations of this name, so I'd like to double check: should it spell out teraflop?
multislice_topology: str
num_params: int
global_batch_size: int
per_device_batch_size: float
Per core or per device? I feel "per core" is more widely used.
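For context, a minimal sketch of how the run_history row could be modeled as a dataclass, using only the field names visible in the diff above; the class name RunHistoryRow is a hypothetical stand-in, not the PR's actual type.

```python
import dataclasses


@dataclasses.dataclass
class RunHistoryRow:
  """One benchmark run, mirroring the fields shown in the diff."""
  num_chips: int
  step_time: float
  throughput: float
  per_device_tflops_per_sec: float
  multislice_topology: str
  num_params: int
  global_batch_size: int
  per_device_batch_size: float
```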
  summary_config: metric_config.SummaryConfig,
-) -> (List[List[bigquery.MetricHistoryRow]], List[List[bigquery.MetadataHistoryRow]],):
+) -> (Dict[str, Any], Dict[str, Any],):
  """Process metrics and dimensions from TensorBoard file.

  Args:
    base_id: The unique ID for this test job.
Nit: remove base_id from the docstring.
  summary_config: metric_config.SummaryConfig,
-) -> (List[List[bigquery.MetricHistoryRow]], List[List[bigquery.MetadataHistoryRow]],):
+) -> (Dict[str, Any], Dict[str, Any],):
Could these be Dict[str, float] and Dict[str, str]? Same comment for the function below.
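A minimal sketch of the signature with the tighter types suggested above (Dict[str, float] for aggregated metrics, Dict[str, str] for metadata); the function name process_tensorboard_summary is a hypothetical stand-in, and metric_config is assumed to be imported elsewhere in the module.

```python
from typing import Dict, Tuple


def process_tensorboard_summary(
    summary_config: "metric_config.SummaryConfig",  # metric_config assumed imported elsewhere
) -> Tuple[Dict[str, float], Dict[str, str]]:
  """Processes metrics and dimensions from a TensorBoard file."""
  ...
```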
    * int(metadata["max_target_length/text_summary"])
) / aggregated_metrics["perf/step_time_seconds"]
precision = (
    "bfloat16"
Will we always have either bf16 or int8, or are other quantizations available?
For now, only bf16 and int8 are available.
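A minimal sketch (not the PR's code) of the arithmetic in the diff above: tokens processed per step divided by step time, plus the bf16/int8 choice mentioned in this thread. Dictionary keys not visible in the diff, such as "quantization/text_summary", are guesses.

```python
from typing import Dict


def derive_throughput_and_precision(
    metadata: Dict[str, str],
    aggregated_metrics: Dict[str, float],
    global_batch_size: int,
) -> Dict[str, object]:
  # Tokens per step = global batch size * sequence length (key from the diff).
  tokens_per_step = global_batch_size * int(metadata["max_target_length/text_summary"])
  throughput = tokens_per_step / aggregated_metrics["perf/step_time_seconds"]
  # Only bf16 and int8 are in use for now; the quantization key is hypothetical.
  precision = "int8" if metadata.get("quantization/text_summary") == "int8" else "bfloat16"
  return {"throughput": throughput, "precision": precision}
```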
description=task_test_config.test_name,
platform="Cloud",
date=datetime.datetime.now(),
base_cl="",
What's the plan for this metadata?
I'm not quite sure yet; I think I'll need to add a change in MaxText to get this value into the TensorBoard file. For now I'll leave it blank.
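A tiny sketch of the plan described above: read base_cl from TensorBoard metadata once MaxText exports it, falling back to an empty string until then. The metadata key name is a guess.

```python
from typing import Dict


def get_base_cl(metadata: Dict[str, str]) -> str:
  # Hypothetical key; returns "" until MaxText writes this value to TensorBoard.
  return metadata.get("base_cl/text_summary", "")
```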
ici_mesh_shape=f"[{ici_data_parallelism}, {ici_fsdp_parallelism}, {ici_sequence_parallelism}, {ici_tensor_parallelism}, {ici_autoregressive_parallelism}]",
dcn_mesh_shape=f"[{dcn_data_parallelism}, {dcn_fsdp_parallelism}, {dcn_sequence_parallelism}, {dcn_tensor_parallelism}, {dcn_autoregressive_parallelism}]",
xprof="",
mfu=0.0,
Will this be extracted from TensorBoard?
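If mfu is eventually computed rather than read, a minimal sketch: achieved per-device TFLOP/s divided by the chip's peak TFLOP/s. The default peak of 275 TFLOP/s (TPU v4, bf16) is only an illustrative assumption, not a value taken from this PR.

```python
def compute_mfu(per_device_tflops_per_sec: float, peak_tflops_per_sec: float = 275.0) -> float:
  """Model FLOPs utilization: achieved TFLOP/s over peak TFLOP/s for the chip."""
  return per_device_tflops_per_sec / peak_tflops_per_sec
```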
Description

Parse the TensorBoard file and write to a new table run_history in the mlperf_dataset, only for mlperf runs. The schema is as discussed in http://shortn/_Hr3sWp59UK.

Tests

Ran maxtext_sweep_gke_example_dag (http://shortn/_NApHfEATw2) and wrote metrics to mlperf_dataset (http://shortn/_3CfsiDPr6v).
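A minimal sketch (not this PR's implementation) of inserting one parsed run into the run_history table with the google-cloud-bigquery client; the project id is a placeholder.

```python
from google.cloud import bigquery


def write_run_row(row: dict, project: str = "my-project", dataset: str = "mlperf_dataset") -> None:
  """Streams a single run row into project.dataset.run_history."""
  client = bigquery.Client(project=project)
  errors = client.insert_rows_json(f"{project}.{dataset}.run_history", [row])
  if errors:
    raise RuntimeError(f"BigQuery insert failed: {errors}")
```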
Checklist

Before submitting this PR, please make sure (put X in square brackets):