
Commit 3a8b8bc

Initial commit
1 parent 0f72028 commit 3a8b8bc

28 files changed: +7,237 −2 lines

LICENSE

+1,115

README.md

+103-2
# Scene Consistency Representation Learning for Video Scene Segmentation (CVPR 2022)

This is the official PyTorch implementation of SCRL. The CVPR 2022 paper is available [here](https://openaccess.thecvf.com/content/CVPR2022/html/Wu_Scene_Consistency_Representation_Learning_for_Video_Scene_Segmentation_CVPR_2022_paper.html).

# Getting Started

## Data Preparation

### MovieNet Dataset
Download the MovieNet dataset from its [official website](https://movienet.github.io/).

### SceneSeg318 Dataset
Download the annotations of [SceneSeg318](https://drive.google.com/drive/folders/1NFyL_IZvr1mQR3vR63XMYITU7rq9geY_?usp=sharing); the download instructions can be found in the [LGSS](https://github.com/AnyiRao/SceneSeg/blob/master/docs/INSTALL.md) repository.

### Make Puzzles for pre-training
To reduce the number of I/O accesses and perform data augmentation (a.k.a. *Scene Agnostic Clip-Shuffling* in the paper) at the same time, we suggest stitching 16 shots into one image (a puzzle) during the pre-training stage. You can make the data yourself:
```
python ./data/data_preparation.py
```
The processed data will be saved in `./compressed_shot_images/`; see an example puzzle [figure](./figures/puzzle_example.jpg).
<!-- Or download the processed data in [here](). -->
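For reference, the sketch below shows one way such a puzzle could be assembled from 16 shot keyframes with OpenCV. The file names, keyframe size, and 4x4 grid are illustrative assumptions; the actual format is defined by `./data/data_preparation.py`.
```
# A minimal sketch of the puzzle-stitching idea, assuming 16 keyframes,
# a 224x224 tile size, and a 4x4 grid; ./data/data_preparation.py defines
# the actual layout used by the pre-training code.
import cv2
import numpy as np

def make_puzzle(keyframe_paths, tile_size=(224, 224), grid=(4, 4)):
    assert len(keyframe_paths) == grid[0] * grid[1]
    tiles = [cv2.resize(cv2.imread(p), tile_size) for p in keyframe_paths]
    rows = [np.concatenate(tiles[r * grid[1]:(r + 1) * grid[1]], axis=1)
            for r in range(grid[0])]
    return np.concatenate(rows, axis=0)  # one stitched puzzle image

# Example (hypothetical file names):
# puzzle = make_puzzle([f"shot_{i:04d}.jpg" for i in range(16)])
# cv2.imwrite("./compressed_shot_images/puzzle_0000.jpg", puzzle)
```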

### Load the Data into Memory [Optional]
We **strongly recommend** loading data into memory to speed up pre-training, which additionally requires your device to have at least 100 GB of RAM.
```
mkdir /tmpdata
mount tmpfs /tmpdata -t tmpfs -o size=100G
cp -r ./compressed_shot_images/ /tmpdata/
```

## Initialization Weights Preparation
Download the ResNet-50 weights trained on ImageNet-1k ([resnet50-19c8e357.pth](https://download.pytorch.org/models/resnet50-19c8e357.pth)) and save the file in the `./pretrain/` folder.
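If you prefer to script the download, a minimal sketch (assuming network access; the URL is the standard torchvision link given above) is:
```
# Minimal sketch: fetch the ResNet-50 weights into ./pretrain/.
import os
import urllib.request

URL = "https://download.pytorch.org/models/resnet50-19c8e357.pth"
os.makedirs("./pretrain", exist_ok=True)
urllib.request.urlretrieve(URL, os.path.join("./pretrain", "resnet50-19c8e357.pth"))
```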

## Prerequisites

* python >= 3.6
* pytorch >= 1.6
* cv2
* pickle
* numpy
* yaml
* sklearn

## Usage

### STEP 1: Encoder Pre-training
Use the default configuration to pretrain the model. Make sure the data path is correct and that the GPUs are sufficient (e.g. 8 NVIDIA V100 GPUs):
```
python pretrain_main.py --config ./config/SCRL_pretrain_default.yaml
```
The checkpoint, a copy of the config, and the log will be saved in `./output/`.
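Before launching, it can help to sanity-check the paths inside the config. A minimal sketch (the key names are defined by the YAML file itself and are not assumed here):
```
# Minimal sketch: load and inspect the pre-training config.
# Only a generic dump is shown; the key names are defined by the repository's YAML file.
import yaml

with open("./config/SCRL_pretrain_default.yaml") as f:
    cfg = yaml.safe_load(f)
print(yaml.dump(cfg, default_flow_style=False))  # check data paths, GPU settings, etc.
```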

### STEP 2: Feature Extraction
```
python extract_embeddings.py $CKP_PATH --shot_img_path $SHOT_PATH --Type all --gpu-id 0
```
`$CKP_PATH` is the path to an encoder checkpoint, and `$SHOT_PATH` is the keyframe path of MovieNet.
The extracted embeddings (in pickle format) and the log will be saved in `./embeddings/`.
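The exact layout inside each pickle file is defined by `extract_embeddings.py`; a minimal sketch for inspecting one file (the file name is hypothetical) is:
```
# Minimal sketch: open an extracted embedding file and inspect its contents.
# The internal structure (dict vs. array, key names) should be checked
# against extract_embeddings.py.
import pickle

with open("./embeddings/example.pkl", "rb") as f:  # hypothetical file name
    data = pickle.load(f)
print(type(data))
```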

### STEP 3: Video Scene Segmentation Evaluation
```
cd SceneSeg

python main.py \
    -train $TRAIN_PKL_PATH \
    -test $TEST_PKL_PATH \
    -val $VAL_PKL_PATH \
    --seq-len 40 \
    --gpu-id 0
```
The checkpoints and log will be saved in `./SceneSeg/output/`.
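The AP and F1 numbers reported in the next section are boundary-level metrics. As a rough illustration of how such metrics can be computed with scikit-learn (the labels and scores below are placeholders; the evaluation code in `./SceneSeg` is authoritative):
```
# Rough illustration: boundary-level AP and F1 with scikit-learn.
# gt and scores are placeholder values, not outputs of the repository.
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

gt = np.array([0, 1, 0, 0, 1])                 # 1 = scene boundary after this shot
scores = np.array([0.2, 0.9, 0.4, 0.1, 0.7])   # predicted boundary probabilities

print("AP =", average_precision_score(gt, scores))
print("F1 =", f1_score(gt, scores >= 0.5))
```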

## Models
We provide checkpoints, logs, and results under two different pre-training settings, i.e. with and without ImageNet-1k initialization.

| Initialization | AP | F1 | Config File | Pre-training <br> STEP 1 | Embeddings <br> STEP 2 | Fine-tuning <br> STEP 3 |
| :----- | :---- | :---- | :---- | :----- | :---- | :---- |
| w/o ImageNet-1k | 55.16 | 51.32 | `SCRL_pretrain_without_imagenet1k.yaml` | [ckp and log](https://drive.google.com/drive/folders/1ZYg9PFRU_lt3G5qJrldkguA52T2oxErR?usp=sharing) | [embeddings](https://drive.google.com/drive/folders/1uen_HP3BZu8bcrPBikkgV3j9wzUjQ0C1?usp=sharing) | [log](https://drive.google.com/drive/folders/1rJbOnVbqTdPmnh2grIkePXOmwpNELnrK?usp=sharing) |
| w/ ImageNet-1k | 56.65 | 52.45 | `SCRL_pretrain_with_imagenet1k.yaml` | [ckp and log](https://drive.google.com/drive/folders/1BG5ZLqrPKKGTtDIZj8aps_QuWc6K3c3V?usp=sharing) | [embeddings](https://drive.google.com/drive/folders/1NFvGhkvRxpmEJYNjRnwp3ybuHQaG25gW?usp=sharing) | [log](https://drive.google.com/drive/folders/1dE0JFi-MDua70_CgI1CvyLNRnhwLjaUV?usp=sharing) |

## License
Please see the [LICENSE](./LICENSE) file for details.

## Acknowledgments
Parts of the code are borrowed from the following repositories:
* [MoCo](https://github.com/facebookresearch/moco)
* [LGSS](https://github.com/AnyiRao/SceneSeg)

## Citation
Please cite our work if it's useful for your research.
```
@InProceedings{Wu_2022_CVPR,
    author    = {Wu, Haoqian and Chen, Keyu and Luo, Yanan and Qiao, Ruizhi and Ren, Bo and Liu, Haozhe and Xie, Weicheng and Shen, Linlin},
    title     = {Scene Consistency Representation Learning for Video Scene Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {14021-14030}
}
```

SceneSeg/BiLSTM_protocol.py

+68
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTM(nn.Module):
    """Bi-directional LSTM head for the scene segmentation protocol.

    Takes a sequence of per-shot embeddings of shape (B, seq_len, input_feature_dim)
    and predicts a 2-way (boundary vs. non-boundary) label for every position.
    """

    def __init__(self, input_feature_dim=2048, fc_dim=1024, hidden_size=512,
                 input_drop_rate=0.3, lstm_drop_rate=0.6, fc_drop_rate=0.7, use_bn=True):
        super(BiLSTM, self).__init__()

        input_size = input_feature_dim
        output_size = fc_dim
        self.embed_sizes = input_feature_dim
        self.embed_fc = nn.Linear(input_size, output_size)
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(
            input_size=output_size,
            hidden_size=self.hidden_size,
            num_layers=2,
            batch_first=True,
            dropout=lstm_drop_rate,
            bidirectional=True
        )
        self.input_dropout = nn.Dropout(p=input_drop_rate)
        self.fc_dropout = nn.Dropout(p=fc_drop_rate)
        # The LSTM is bidirectional, so its output dimension is 2 * hidden_size.
        self.fc1 = nn.Linear(self.hidden_size * 2, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 2)
        self.softmax = nn.Softmax(2)
        self.use_bn = use_bn

        if self.use_bn:
            self.bn1 = nn.BatchNorm1d(output_size)
            self.bn2 = nn.BatchNorm1d(hidden_size)

    def forward(self, x):
        x = self.input_dropout(x)
        x = self.embed_fc(x)

        if self.use_bn:
            # BatchNorm1d expects (N, C); fold the batch and sequence dimensions together.
            seq_len, C = x.shape[1:3]
            x = x.view(-1, C)
            x = self.bn1(x)
            x = x.view(-1, seq_len, C)

        x = self.fc_dropout(x)
        self.lstm.flatten_parameters()
        out, (_, _) = self.lstm(x, None)
        out = self.fc1(out)
        if self.use_bn:
            seq_len, C = out.shape[1:3]
            out = out.view(-1, C)
            out = self.bn2(out)
            out = out.view(-1, seq_len, C)
        out = self.fc_dropout(out)
        out = F.relu(out)
        out = self.fc2(out)
        if not self.training:
            # Probabilities are only produced at inference; training returns raw logits.
            out = self.softmax(out)
        return out


if __name__ == '__main__':
    B, seq_len, C = 10, 20, 2048
    input = torch.randn(B, seq_len, C)
    model = BiLSTM()
    out = model(input)
    # torch.Size([10, 20, 2])
    print(out.size())
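At inference time the module is in eval mode and already returns class probabilities, so a caller can threshold them to obtain boundary positions. A short hypothetical usage sketch (the assumption that class index 1 means "scene boundary" should be verified against the label convention used in `./SceneSeg/main.py`):
```
# Hypothetical usage sketch: turn BiLSTM outputs into boundary decisions.
# Assumes class index 1 corresponds to "scene boundary"; verify against the
# labels used during fine-tuning.
import torch
from SceneSeg.BiLSTM_protocol import BiLSTM

model = BiLSTM()
model.eval()
with torch.no_grad():
    probs = model(torch.randn(1, 40, 2048))  # (1, seq_len, 2), softmax applied in eval mode
boundary_prob = probs[0, :, 1]               # per-shot boundary probability
boundaries = (boundary_prob >= 0.5).nonzero().squeeze(-1)
print(boundaries)
```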

SceneSeg/__init__.py

Whitespace-only changes.
