How to determine replication factors #37
My environment: Server1: 8× V100, Server2: 8× V100.
I think the replication factor of stage 0 is 7 and the replication factor of stage 1 is 1, but that way only 8 GPUs are used. I don't know if my understanding is correct.
So the configuration here is a (7, 1) configuration replicated twice. In other words, ranks 0-6 and 8-14 run stage 1, and ranks 7 and 15 run stage 2. You need to run the configurations from Level 1 to Level n (the number of levels corresponds to the depth of your network hierarchy -- 2 in your example).
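The rank-to-stage assignment described above can be sketched as follows (a minimal illustration with made-up function names, not PipeDream's actual runtime code):

```python
# Sketch of how a (7, 1) configuration replicated twice maps
# 16 ranks onto 2 stages. Illustrative only -- not PipeDream code.

def assign_ranks(replication_factors, num_replicas):
    """Assign consecutive ranks to stages, repeating the whole
    pipeline num_replicas times (one replica per server here)."""
    stages = {i: [] for i in range(len(replication_factors))}
    rank = 0
    for _ in range(num_replicas):
        for stage, factor in enumerate(replication_factors):
            for _ in range(factor):
                stages[stage].append(rank)
                rank += 1
    return stages

# (7, 1) replicated twice across 16 GPUs:
print(assign_ranks([7, 1], 2))
# stage 0 -> ranks 0-6 and 8-14, stage 1 -> ranks 7 and 15
```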
I think I understand what you mean, but I don't know what to do next. I still don't know how to pass the parameters (--stage_to_num_ranks) or how to use convert.py to generate models.
I run the following command:
python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b 4294967296 2147483648
I get the output below. Can you tell me, for this example, what the stages and replication factors are?
How are they computed from the following output?
Total number of states: 40
Solving optimization problem with 8 machines with inter-machine bandwidth of 4.29 GB/s
[[0.04692 0.056919000000000004 0.21645 ... 0.6715939999999999
0.6715939999999999 0.6725349999999999]
[None 0.009999000000000001 0.16953000000000001 ... 0.624674 0.624674
0.6256149999999999]
[None None 0.159531 ... 0.6146749999999999 0.6146749999999999
0.6156159999999998]
...
[None None None ... None 0.0 0.0009409999999999696]
[None None None ... None None 0.0009409999999999696]
[None None None ... None None None]]
Solving optimization problem with 2 machines with inter-machine bandwidth of 2.15 GB/s
[[0.005865730156898499 0.007115605156898499 0.027072026604413987 ...
0.09235717243739539 0.09235717243739539 0.09235717243739539]
[None 0.0012498750000000001 0.02120629644751549 ... 0.0856534978594099
0.0856534978594099 0.0856534978594099]
[None None 0.01995642144751549 ... 0.08422506928798132
0.08422506928798132 0.08422506928798132]
...
[None None None ... None 0.0 0.0009765625]
[None None None ... None None 0.001786962507337328]
[None None None ... None None None]]
[[0.002933282310962677 0.0035582198109626773 0.013545028504729271 ...
0.07743855509553638 0.07743855509553638 0.07839246224258628]
[None 0.0006249375000000001 0.010611746193766595 ... 0.0740863005740302
0.0740863005740302 0.07504020772108011]
[None None 0.009986808693766594 ... 0.07337208628831592
0.07337208628831592 0.07432599343536582]
...
[None None None ... None 0.0 0.0014421883970499039]
[None None None ... None None 0.0018473884007185679]
[None None None ... None None None]]
Level 2
Number of machines used: 2...
Compute time = 0.335797, Data-parallel communication time = 0.250080...
Number of machines in budget not used: 0...
(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 40) 0.6725349999999999 2
Total number of stages: 1
Level 1
Number of machines used: 1...
Split between layers 23 and 24...
Split before antichain ['node26']...
Compute time = 0.049474, Data-parallel communication time = 0.000000, Pipeline-parallel communication time = 0.023926...
Number of machines used: 7...
Compute time = 0.088874, Data-parallel communication time = 0.003483...
Number of machines in budget not used: 0...
(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 24) 0.6221200000000001 7
(24, 40) 0.050414999999999766 1
Total number of stages: 2
Time taken by single-stage pipeline: 0.6725349999999999
Time per stage in pipeline: 0.07839246224258628
Throughput increase (compared to single machine): 8.579077385257188
[Note that single-machine and (8,2)-machine DP might not fit given memory constraints]
Throughput increase of (8,2)-machine DP compared to single machine: 6.5655154772476045
Throughput increase (compared to (8,2)-machine DP): 1.3066875578905357
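For what it's worth, the summary numbers at the end of the log are internally consistent: the throughput figures are ratios of the reported times, and multiplying the Level 1 replication factors (7 and 1) by the Level 2 replication factor (2) gives the rank count per stage. A small check (variable names are my own; the values are copied from the log above):

```python
# Sanity-check the numbers in the optimizer output above.
# Variable names are mine; values are copied from the log.

single_stage_time = 0.6725349999999999   # "Time taken by single-stage pipeline"
time_per_stage    = 0.07839246224258628  # "Time per stage in pipeline"

# Throughput increase vs. a single machine is the ratio of the two times.
speedup_vs_single = single_stage_time / time_per_stage
print(speedup_vs_single)   # ~8.579, matching the log

# Speedup relative to (8,2)-machine data parallelism.
dp_speedup = 6.5655154772476045
print(speedup_vs_single / dp_speedup)  # ~1.307, matching the log

# Ranks per stage: Level 1 replication factors (7, 1) times the
# Level 2 replication factor (2) -> 14 ranks for stage 0, 2 for stage 1.
level1_factors = [7, 1]
level2_factor = 2
print([f * level2_factor for f in level1_factors])  # [14, 2]
```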