Material Form Model V2 | Resnet | saved model | [click here](https://storage.goo
Material Type Model V2 | MobileNet | saved model | [click here](https://storage.googleapis.com/tf_model_garden/vision/waste_identification_ml/mobilenet_material.zip)
Material Form Model V2 | MobileNet | saved model | [click here](https://storage.googleapis.com/tf_model_garden/vision/waste_identification_ml/mobilenet_material_form.zip)

## Training Guide

1.  Create a VM instance in Compute Engine on Google Cloud Platform with the
    desired number of GPUs.
2.  Make sure a compatible CUDA version is installed. Check your GPU devices
    with the `nvidia-smi` command.
3.  SSH into your VM instance in Compute Engine and create a conda environment:
    `conda create -n circularnet-train python=3.11`
4.  Activate your conda environment:
    `conda activate circularnet-train`
5.  Install the following libraries:
    `pip install tensorflow[and-cuda] tf-models-official`
6.  Move your data into a GCP bucket or onto the VM instance (in this guide it
    was moved onto the VM instance). Your data should be in the TFRecord
    format.
7.  Move the configuration file for model training onto the VM as well.
8.  Your configuration file contains all the parameters and the paths to your
    datasets. An example configuration file for GPU training has been uploaded
    to the same directory with the name `config.yaml`.
9.  Create a directory where you want to save the output checkpoints.
10. Run the following command to initiate the training:
    `python -m official.vision.train --experiment="maskrcnn_resnetfpn_coco" --mode="train_and_eval" --model_dir="output_directory" --config_file="config.yaml"`
11. You can also start a screen session and run the training in the background.

## Config file parameters

-   `annotation_file` - path to the validation file in COCO JSON format.
-   `init_checkpoint` - path to the checkpoints used for transfer learning.
-   `init_checkpoint_modules` - whether to load the backbone, the decoder, or
    both from the checkpoint.
-   `freeze_backbone` - whether or not to freeze the backbone during training.
-   `input_size` - image size on which the model is trained.
-   `num_classes` - total number of classes + 1 (background).
-   `per_category_metrics` - set this in case you need metrics for each class.
-   `global_batch_size` - batch size.
-   `input_path` - path to the dataset.
-   `parser` - contains the data augmentation operations.
-   `steps_per_loop` - number of steps to complete one epoch. It is usually
    `training data size / batch size`.
-   `summary_interval` - how often to log the metrics.
-   `train_steps` - total steps for training. It is equal to
    `steps_per_loop x epochs`.
-   `validation_interval` - how often to evaluate on the validation data.
-   `validation_steps` - steps to cover the validation data. It is equal to
    `validation data size / batch size`.
-   `warmup_learning_rate` - a strategy that gradually increases the learning
    rate from a very low value to the desired initial learning rate over a
    predefined number of iterations or epochs. It stabilizes training in the
    early stages by letting the model adapt to the data slowly before a higher
    learning rate is used.
-   `warmup_steps` - number of steps for the warmup learning rate.
-   `initial_learning_rate` - the value of the learning rate at the very start
    of the training process.
-   `checkpoint_interval` - number of steps between checkpoint exports.
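
These parameters live together in a single YAML experiment config. The fragment below is only an illustrative sketch of how they are commonly nested in a TF Model Garden config; the nesting, paths, and values here are placeholders, so treat the uploaded `config.yaml` as the authoritative reference:

```yaml
# Illustrative fragment only - paths and numbers are placeholders.
task:
  init_checkpoint: 'gs://your-bucket/pretrained_checkpoint/'  # hypothetical path
  init_checkpoint_modules: 'backbone'
  annotation_file: 'instances_val.json'
  freeze_backbone: false
  model:
    num_classes: 10               # your classes + 1 for background
    input_size: [1024, 1024, 3]
  train_data:
    input_path: 'train*.tfrecord'
    global_batch_size: 512
  validation_data:
    input_path: 'val*.tfrecord'
    global_batch_size: 128
trainer:
  train_steps: 5600
  steps_per_loop: 8
  summary_interval: 8
  validation_interval: 8
  validation_steps: 3
  checkpoint_interval: 40
  optimizer_config:
    warmup:
      type: linear
      linear:
        warmup_learning_rate: 0.0001
        warmup_steps: 80
```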

A common practice for calculating these parameters is shown below:

```python
total_training_samples = 4389
total_validation_samples = 485

train_batch_size = 512
val_batch_size = 128
num_epochs = 700
warmup_learning_rate = 0.0001
initial_learning_rate = 0.001

steps_per_loop = total_training_samples // train_batch_size
summary_interval = steps_per_loop
train_steps = num_epochs * steps_per_loop
validation_interval = steps_per_loop
validation_steps = total_validation_samples // val_batch_size
warmup_steps = steps_per_loop * 10
checkpoint_interval = steps_per_loop * 5
decay_steps = int(train_steps)

print(f'steps_per_loop: {steps_per_loop}')
print(f'summary_interval: {summary_interval}')
print(f'train_steps: {train_steps}')
print(f'validation_interval: {validation_interval}')
print(f'validation_steps: {validation_steps}')
print(f'warmup_steps: {warmup_steps}')
print(f'warmup_learning_rate: {warmup_learning_rate}')
print(f'initial_learning_rate: {initial_learning_rate}')
print(f'decay_steps: {decay_steps}')
print(f'checkpoint_interval: {checkpoint_interval}')
```

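As a concrete illustration of the warmup behavior described under `warmup_learning_rate`, the sketch below (a hypothetical helper, not part of the training code) linearly ramps the learning rate from `warmup_learning_rate` to `initial_learning_rate` over `warmup_steps`:

```python
def warmup_lr(step, warmup_steps, warmup_learning_rate, initial_learning_rate):
    """Linearly ramp the learning rate during the warmup phase."""
    if step >= warmup_steps:
        # Warmup is over; the schedule hands off to the initial learning rate.
        return initial_learning_rate
    frac = step / warmup_steps
    return warmup_learning_rate + frac * (initial_learning_rate - warmup_learning_rate)

# Using the example values above: ramp from 0.0001 to 0.001 over 80 steps.
print(warmup_lr(0, 80, 0.0001, 0.001))   # 0.0001 (start of warmup)
print(warmup_lr(40, 80, 0.0001, 0.001))  # ~0.00055 (halfway)
print(warmup_lr(80, 80, 0.0001, 0.001))  # 0.001 (handoff to main schedule)
```

The actual schedule is defined by the trainer's optimizer config; this only shows why a low `warmup_learning_rate` with a sufficient `warmup_steps` keeps early updates small.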
## Authors and Maintainers

- Umair Sabir
- Sujit Sanjeev