How to determine replication factors #37
My environment: Server1: 8× V100, Server2: 8× V100.
I think the replication factor of stage 0 is 7 and the replication factor of stage 1 is 1, but that way only 8 GPUs are used. I don't know if my understanding is correct.
So the configuration here is a (7, 1) configuration replicated twice. In other words, ranks 0-6 and 8-14 run stage 1, and ranks 7 and 15 run stage 2. You need to run the configurations from Level 1 to Level n (the number of levels corresponds to the depth of your network hierarchy -- 2 in your example).
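The rank-to-stage assignment described above can be sketched as follows (a minimal illustration with made-up function names, not PipeDream's actual runtime code):

```python
# Sketch of how a (7, 1) configuration replicated twice maps
# 16 ranks onto 2 stages. Illustrative only -- not PipeDream code.

def assign_ranks(replication_factors, num_replicas):
    """Assign consecutive ranks to stages, repeating the whole
    pipeline num_replicas times (one replica per server here)."""
    stages = {i: [] for i in range(len(replication_factors))}
    rank = 0
    for _ in range(num_replicas):
        for stage, factor in enumerate(replication_factors):
            for _ in range(factor):
                stages[stage].append(rank)
                rank += 1
    return stages

# (7, 1) replicated twice across 16 GPUs:
print(assign_ranks([7, 1], 2))
# stage 0 -> ranks 0-6 and 8-14, stage 1 -> ranks 7 and 15
```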
I think I understand what you mean, but I don't know what to do next. I still don't know how to pass the parameters (--stage_to_num_ranks) or how to use convert.py to generate models.
I run the following command:
python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b 4294967296 2147483648
I get the output below. Can you tell me, for this example, what the stages and replication factors are?
How are they computed from the following output?
Total number of states: 40
Solving optimization problem with 8 machines with inter-machine bandwidth of 4.29 GB/s
[[0.04692 0.056919000000000004 0.21645 ... 0.6715939999999999
0.6715939999999999 0.6725349999999999]
[None 0.009999000000000001 0.16953000000000001 ... 0.624674 0.624674
0.6256149999999999]
[None None 0.159531 ... 0.6146749999999999 0.6146749999999999
0.6156159999999998]
...
[None None None ... None 0.0 0.0009409999999999696]
[None None None ... None None 0.0009409999999999696]
[None None None ... None None None]]
Solving optimization problem with 2 machines with inter-machine bandwidth of 2.15 GB/s
[[0.005865730156898499 0.007115605156898499 0.027072026604413987 ...
0.09235717243739539 0.09235717243739539 0.09235717243739539]
[None 0.0012498750000000001 0.02120629644751549 ... 0.0856534978594099
0.0856534978594099 0.0856534978594099]
[None None 0.01995642144751549 ... 0.08422506928798132
0.08422506928798132 0.08422506928798132]
...
[None None None ... None 0.0 0.0009765625]
[None None None ... None None 0.001786962507337328]
[None None None ... None None None]]
[[0.002933282310962677 0.0035582198109626773 0.013545028504729271 ...
0.07743855509553638 0.07743855509553638 0.07839246224258628]
[None 0.0006249375000000001 0.010611746193766595 ... 0.0740863005740302
0.0740863005740302 0.07504020772108011]
[None None 0.009986808693766594 ... 0.07337208628831592
0.07337208628831592 0.07432599343536582]
...
[None None None ... None 0.0 0.0014421883970499039]
[None None None ... None None 0.0018473884007185679]
[None None None ... None None None]]
Level 2
Number of machines used: 2...
Compute time = 0.335797, Data-parallel communication time = 0.250080...
Number of machines in budget not used: 0...
(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 40) 0.6725349999999999 2
Total number of stages: 1
Level 1
Number of machines used: 1...
Split between layers 23 and 24...
Split before antichain ['node26']...
Compute time = 0.049474, Data-parallel communication time = 0.000000, Pipeline-parallel communication time = 0.023926...
Number of machines used: 7...
Compute time = 0.088874, Data-parallel communication time = 0.003483...
Number of machines in budget not used: 0...
(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 24) 0.6221200000000001 7
(24, 40) 0.050414999999999766 1
Total number of stages: 2
Time taken by single-stage pipeline: 0.6725349999999999
Time per stage in pipeline: 0.07839246224258628
Throughput increase (compared to single machine): 8.579077385257188
[Note that single-machine and (8,2)-machine DP might not fit given memory constraints]
Throughput increase of (8,2)-machine DP compared to single machine: 6.5655154772476045
Throughput increase (compared to (8,2)-machine DP): 1.3066875578905357
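For what it's worth, the summary numbers at the end of the log are internally consistent: the throughput figures are ratios of the reported times, and multiplying the Level 1 replication factors (7 and 1) by the Level 2 replication factor (2) gives the rank count per stage. A small check (variable names are my own; the values are copied from the log above):

```python
# Sanity-check the numbers in the optimizer output above.
# Variable names are mine; values are copied from the log.

single_stage_time = 0.6725349999999999   # "Time taken by single-stage pipeline"
time_per_stage    = 0.07839246224258628  # "Time per stage in pipeline"

# Throughput increase vs. a single machine is the ratio of the two times.
speedup_vs_single = single_stage_time / time_per_stage
print(speedup_vs_single)   # ~8.579, matching the log

# Speedup relative to (8,2)-machine data parallelism.
dp_speedup = 6.5655154772476045
print(speedup_vs_single / dp_speedup)  # ~1.307, matching the log

# Ranks per stage: Level 1 replication factors (7, 1) times the
# Level 2 replication factor (2) -> 14 ranks for stage 0, 2 for stage 1.
level1_factors = [7, 1]
level2_factor = 2
print([f * level2_factor for f in level1_factors])  # [14, 2]
```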