
How to determine replication factors #37

Open · ADAM-CT opened this issue Jan 14, 2020 · 4 comments

ADAM-CT commented Jan 14, 2020

I ran the following command:

python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b 4294967296 2147483648

and got the output below. Can you tell me what the stages and replication factors are in this example, and how they are derived from this output?

Total number of states: 40
Solving optimization problem with 8 machines with inter-machine bandwidth of 4.29 GB/s
[[0.04692 0.056919000000000004 0.21645 ... 0.6715939999999999
0.6715939999999999 0.6725349999999999]
[None 0.009999000000000001 0.16953000000000001 ... 0.624674 0.624674
0.6256149999999999]
[None None 0.159531 ... 0.6146749999999999 0.6146749999999999
0.6156159999999998]
...
[None None None ... None 0.0 0.0009409999999999696]
[None None None ... None None 0.0009409999999999696]
[None None None ... None None None]]
Solving optimization problem with 2 machines with inter-machine bandwidth of 2.15 GB/s
[[0.005865730156898499 0.007115605156898499 0.027072026604413987 ...
0.09235717243739539 0.09235717243739539 0.09235717243739539]
[None 0.0012498750000000001 0.02120629644751549 ... 0.0856534978594099
0.0856534978594099 0.0856534978594099]
[None None 0.01995642144751549 ... 0.08422506928798132
0.08422506928798132 0.08422506928798132]
...
[None None None ... None 0.0 0.0009765625]
[None None None ... None None 0.001786962507337328]
[None None None ... None None None]]
[[0.002933282310962677 0.0035582198109626773 0.013545028504729271 ...
0.07743855509553638 0.07743855509553638 0.07839246224258628]
[None 0.0006249375000000001 0.010611746193766595 ... 0.0740863005740302
0.0740863005740302 0.07504020772108011]
[None None 0.009986808693766594 ... 0.07337208628831592
0.07337208628831592 0.07432599343536582]
...
[None None None ... None 0.0 0.0014421883970499039]
[None None None ... None None 0.0018473884007185679]
[None None None ... None None None]]

Level 2
Number of machines used: 2...
Compute time = 0.335797, Data-parallel communication time = 0.250080...

Number of machines in budget not used: 0...

(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 40) 0.6725349999999999 2
Total number of stages: 1
Level 1
Number of machines used: 1...
Split between layers 23 and 24...
Split before antichain ['node26']...
Compute time = 0.049474, Data-parallel communication time = 0.000000, Pipeline-parallel communication time = 0.023926...
Number of machines used: 7...
Compute time = 0.088874, Data-parallel communication time = 0.003483...
Number of machines in budget not used: 0...
(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 24) 0.6221200000000001 7
(24, 40) 0.050414999999999766 1

Total number of stages: 2
Time taken by single-stage pipeline: 0.6725349999999999
Time per stage in pipeline: 0.07839246224258628
Throughput increase (compared to single machine): 8.579077385257188
[Note that single-machine and (8,2)-machine DP might not fit given memory constraints]
Throughput increase of (8,2)-machine DP compared to single machine: 6.5655154772476045
Throughput increase (compared to (8,2)-machine DP): 1.3066875578905357
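For reference, the summary lines at the end follow directly from the per-stage times printed above: pipeline throughput is limited by the slowest stage, so each speedup is a ratio of times. A minimal sketch of the arithmetic, using values copied from the output (the variable names are mine, not the optimizer's):

```python
# Reproducing the optimizer's summary arithmetic from the printed values.
# All numbers are copied from the output above; names are illustrative only.

single_stage_time = 0.6725349999999999    # "Time taken by single-stage pipeline"
pipeline_stage_time = 0.07839246224258628  # "Time per stage in pipeline" (slowest stage)
dp_speedup = 6.5655154772476045            # "(8,2)-machine DP compared to single machine"

# Speedup over a single machine is the ratio of the two times:
speedup_vs_single = single_stage_time / pipeline_stage_time
print(speedup_vs_single)               # ~8.579077385257188

# Speedup over (8,2)-machine data parallelism is the ratio of the two speedups:
print(speedup_vs_single / dp_speedup)  # ~1.3066875578905357
```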

ADAM-CT commented Jan 14, 2020

My environment: Server 1 with 8 V100 GPUs; Server 2 with 8 V100 GPUs.
Model: VGG16

ADAM-CT commented Jan 14, 2020

I think the replication factor of stage 0 is 7 and the replication factor of stage 1 is 1, but that way only 8 GPUs are used. I don't know if my understanding is correct.

deepakn94 (Collaborator) commented
So the configuration here is a (7, 1) configuration replicated twice. In other words, ranks 0-6 and 8-14 run stage 0, and ranks 7 and 15 run stage 1.

You need to run the configurations from Level 1 to Level n (the number of levels corresponds to the depth of your network hierarchy -- 2 in your example).
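To make that rank assignment concrete, here is a small illustrative sketch (my own, not code from the repo) that expands per-stage replication factors into the rank-to-stage mapping described above, assuming ranks are assigned contiguously within each replica, stage by stage:

```python
# Illustrative sketch (not from the PipeDream source): expand a (7, 1)
# configuration replicated twice into a rank -> stage assignment.

replication_factors = [7, 1]  # stage 0 on 7 GPUs, stage 1 on 1 GPU
num_replicas = 2              # the whole pipeline is replicated twice
ranks_per_replica = sum(replication_factors)  # 8 GPUs per server

stage_to_ranks = {s: [] for s in range(len(replication_factors))}
for replica in range(num_replicas):
    rank = replica * ranks_per_replica
    for stage, factor in enumerate(replication_factors):
        for _ in range(factor):
            stage_to_ranks[stage].append(rank)
            rank += 1

print(stage_to_ranks)
# {0: [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14], 1: [7, 15]}
```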

ADAM-CT commented Jan 20, 2020

I think I understand what you mean, but I don't know what to do next. I still don't know how to pass the --stage_to_num_ranks parameter, or how to use convert_graph_to_model.py to generate models.

python convert_graph_to_model.py -f vgg16_partitioned/gpus=16.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=16 --stage_to_num_ranks 0:14,1:2
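For what it's worth, the --stage_to_num_ranks value appears to be a comma-separated list of stage:count pairs, so 0:14,1:2 gives stage 0 fourteen ranks and stage 1 two ranks, matching the (7, 1) configuration replicated twice. A hedged sketch of what that string encodes (my own helper, not from convert_graph_to_model.py):

```python
# Hypothetical helper (not from the repo) showing what the
# --stage_to_num_ranks string encodes: stage index -> number of ranks.

def parse_stage_to_num_ranks(spec):
    """Parse a spec like '0:14,1:2' into {0: 14, 1: 2}."""
    mapping = {}
    for pair in spec.split(","):
        stage, num_ranks = pair.split(":")
        mapping[int(stage)] = int(num_ranks)
    return mapping

print(parse_stage_to_num_ranks("0:14,1:2"))  # {0: 14, 1: 2}
# 14 + 2 = 16 ranks total, matching gpus=16 in the partitioned graph file.
```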
