
Commit 3a8b8bc

Initial commit
1 parent 0f72028 commit 3a8b8bc

28 files changed: +7,237 −2 lines

LICENSE

+1,115

README.md

+103-2
# Scene Consistency Representation Learning for Video Scene Segmentation (CVPR 2022)

This is the official PyTorch implementation of SCRL. The CVPR 2022 paper is available [here](https://openaccess.thecvf.com/content/CVPR2022/html/Wu_Scene_Consistency_Representation_Learning_for_Video_Scene_Segmentation_CVPR_2022_paper.html).

# Getting Started

## Data Preparation

### MovieNet Dataset
Download the MovieNet dataset from its [official website](https://movienet.github.io/).

### SceneSeg318 Dataset
Download the annotations of [SceneSeg318](https://drive.google.com/drive/folders/1NFyL_IZvr1mQR3vR63XMYITU7rq9geY_?usp=sharing); the download instructions can be found in the [LGSS](https://github.com/AnyiRao/SceneSeg/blob/master/docs/INSTALL.md) repository.

### Make Puzzles for pre-training
To reduce the number of I/O accesses and perform data augmentation (a.k.a. *Scene Agnostic Clip-Shuffling* in the paper) at the same time, we suggest stitching 16 shots into one image (a puzzle) during the pre-training stage. You can make the data yourself:
```
python ./data/data_preparation.py
```
The processed data will be saved in `./compressed_shot_images/`; see an example puzzle [figure](./figures/puzzle_example.jpg).
<!-- Or download the processed data in [here](). -->
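For reference, the sketch below shows one way such a puzzle could be assembled from 16 shot keyframes with OpenCV. The file names, keyframe size, and 4x4 grid are illustrative assumptions; the actual format is defined by `./data/data_preparation.py`.
```
# A minimal sketch of the puzzle-stitching idea, assuming 16 keyframes,
# a 224x224 tile size, and a 4x4 grid; ./data/data_preparation.py defines
# the actual layout used by the pre-training code.
import cv2
import numpy as np

def make_puzzle(keyframe_paths, tile_size=(224, 224), grid=(4, 4)):
    assert len(keyframe_paths) == grid[0] * grid[1]
    tiles = [cv2.resize(cv2.imread(p), tile_size) for p in keyframe_paths]
    rows = [np.concatenate(tiles[r * grid[1]:(r + 1) * grid[1]], axis=1)
            for r in range(grid[0])]
    return np.concatenate(rows, axis=0)  # one stitched puzzle image

# Example (hypothetical file names):
# puzzle = make_puzzle([f"shot_{i:04d}.jpg" for i in range(16)])
# cv2.imwrite("./compressed_shot_images/puzzle_0000.jpg", puzzle)
```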

### Load the Data into Memory [Optional]
We **strongly recommend** loading data into memory to speed up pre-training, which additionally requires your device to have at least 100 GB of RAM.
```
mkdir /tmpdata
mount tmpfs /tmpdata -t tmpfs -o size=100G
cp -r ./compressed_shot_images/ /tmpdata/
```

## Initialization Weights Preparation
Download the ResNet-50 weights trained on ImageNet-1k ([resnet50-19c8e357.pth](https://download.pytorch.org/models/resnet50-19c8e357.pth)) and save the file in the `./pretrain/` folder.
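If you prefer to script the download, a minimal sketch (assuming network access; the URL is the standard torchvision link given above) is:
```
# Minimal sketch: fetch the ResNet-50 weights into ./pretrain/.
import os
import urllib.request

URL = "https://download.pytorch.org/models/resnet50-19c8e357.pth"
os.makedirs("./pretrain", exist_ok=True)
urllib.request.urlretrieve(URL, os.path.join("./pretrain", "resnet50-19c8e357.pth"))
```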

## Prerequisites

* python >= 3.6
* pytorch >= 1.6
* cv2
* pickle
* numpy
* yaml
* sklearn

## Usage

### STEP 1: Encoder Pre-training
Use the default configuration to pretrain the model. Make sure the data path is correct and that the GPUs are sufficient (e.g. 8 NVIDIA V100 GPUs):
```
python pretrain_main.py --config ./config/SCRL_pretrain_default.yaml
```
The checkpoint, a copy of the config, and the log will be saved in `./output/`.
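Before launching, it can help to sanity-check the paths inside the config. A minimal sketch (the key names are defined by the YAML file itself and are not assumed here):
```
# Minimal sketch: load and inspect the pre-training config.
# Only a generic dump is shown; the key names are defined by the repository's YAML file.
import yaml

with open("./config/SCRL_pretrain_default.yaml") as f:
    cfg = yaml.safe_load(f)
print(yaml.dump(cfg, default_flow_style=False))  # check data paths, GPU settings, etc.
```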

### STEP 2: Feature Extraction
```
python extract_embeddings.py $CKP_PATH --shot_img_path $SHOT_PATH --Type all --gpu-id 0
```
`$CKP_PATH` is the path to an encoder checkpoint, and `$SHOT_PATH` is the keyframe path of MovieNet.
The extracted embeddings (in pickle format) and the log will be saved in `./embeddings/`.
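The exact layout inside each pickle file is defined by `extract_embeddings.py`; a minimal sketch for inspecting one file (the file name is hypothetical) is:
```
# Minimal sketch: open an extracted embedding file and inspect its contents.
# The internal structure (dict vs. array, key names) should be checked
# against extract_embeddings.py.
import pickle

with open("./embeddings/example.pkl", "rb") as f:  # hypothetical file name
    data = pickle.load(f)
print(type(data))
```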

### STEP 3: Video Scene Segmentation Evaluation
```
cd SceneSeg

python main.py \
    -train $TRAIN_PKL_PATH \
    -test $TEST_PKL_PATH \
    -val $VAL_PKL_PATH \
    --seq-len 40 \
    --gpu-id 0
```
The checkpoints and log will be saved in `./SceneSeg/output/`.
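The AP and F1 numbers reported in the next section are boundary-level metrics. As a rough illustration of how such metrics can be computed with scikit-learn (the labels and scores below are placeholders; the evaluation code in `./SceneSeg` is authoritative):
```
# Rough illustration: boundary-level AP and F1 with scikit-learn.
# gt and scores are placeholder values, not outputs of the repository.
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

gt = np.array([0, 1, 0, 0, 1])                 # 1 = scene boundary after this shot
scores = np.array([0.2, 0.9, 0.4, 0.1, 0.7])   # predicted boundary probabilities

print("AP =", average_precision_score(gt, scores))
print("F1 =", f1_score(gt, scores >= 0.5))
```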

## Models
We provide checkpoints, logs, and results under two different pre-training settings, i.e. with and without ImageNet-1k initialization.

| Initialization | AP | F1 | Config File | Pre-training <br> STEP 1 | Embeddings <br> STEP 2 | Fine-tuning <br> STEP 3 |
| :----- | :---- | :---- | :---- | :----- | :---- | :---- |
| w/o ImageNet-1k | 55.16 | 51.32 | `SCRL_pretrain_without_imagenet1k.yaml` | [ckp and log](https://drive.google.com/drive/folders/1ZYg9PFRU_lt3G5qJrldkguA52T2oxErR?usp=sharing) | [embeddings](https://drive.google.com/drive/folders/1uen_HP3BZu8bcrPBikkgV3j9wzUjQ0C1?usp=sharing) | [log](https://drive.google.com/drive/folders/1rJbOnVbqTdPmnh2grIkePXOmwpNELnrK?usp=sharing) |
| w/ ImageNet-1k | 56.65 | 52.45 | `SCRL_pretrain_with_imagenet1k.yaml` | [ckp and log](https://drive.google.com/drive/folders/1BG5ZLqrPKKGTtDIZj8aps_QuWc6K3c3V?usp=sharing) | [embeddings](https://drive.google.com/drive/folders/1NFvGhkvRxpmEJYNjRnwp3ybuHQaG25gW?usp=sharing) | [log](https://drive.google.com/drive/folders/1dE0JFi-MDua70_CgI1CvyLNRnhwLjaUV?usp=sharing) |

## License
Please see the [LICENSE](./LICENSE) file for details.

## Acknowledgments
Parts of the code are borrowed from the following repositories:
* [MoCo](https://github.com/facebookresearch/moco)
* [LGSS](https://github.com/AnyiRao/SceneSeg)

## Citation
Please cite our work if it's useful for your research.
```
@InProceedings{Wu_2022_CVPR,
    author    = {Wu, Haoqian and Chen, Keyu and Luo, Yanan and Qiao, Ruizhi and Ren, Bo and Liu, Haozhe and Xie, Weicheng and Shen, Linlin},
    title     = {Scene Consistency Representation Learning for Video Scene Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {14021-14030}
}
```

SceneSeg/BiLSTM_protocol.py

+68
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTM(nn.Module):
    """Bi-directional LSTM head for the scene segmentation protocol.

    Takes a sequence of per-shot embeddings of shape (B, seq_len, input_feature_dim)
    and predicts a 2-way (boundary vs. non-boundary) label for every position.
    """

    def __init__(self, input_feature_dim=2048, fc_dim=1024, hidden_size=512,
                 input_drop_rate=0.3, lstm_drop_rate=0.6, fc_drop_rate=0.7, use_bn=True):
        super(BiLSTM, self).__init__()

        input_size = input_feature_dim
        output_size = fc_dim
        self.embed_sizes = input_feature_dim
        self.embed_fc = nn.Linear(input_size, output_size)
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(
            input_size=output_size,
            hidden_size=self.hidden_size,
            num_layers=2,
            batch_first=True,
            dropout=lstm_drop_rate,
            bidirectional=True
        )
        self.input_dropout = nn.Dropout(p=input_drop_rate)
        self.fc_dropout = nn.Dropout(p=fc_drop_rate)
        # The LSTM is bidirectional, so its output dimension is 2 * hidden_size.
        self.fc1 = nn.Linear(self.hidden_size * 2, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 2)
        self.softmax = nn.Softmax(2)
        self.use_bn = use_bn

        if self.use_bn:
            self.bn1 = nn.BatchNorm1d(output_size)
            self.bn2 = nn.BatchNorm1d(hidden_size)

    def forward(self, x):
        x = self.input_dropout(x)
        x = self.embed_fc(x)

        if self.use_bn:
            # BatchNorm1d expects (N, C); fold the batch and sequence dimensions together.
            seq_len, C = x.shape[1:3]
            x = x.view(-1, C)
            x = self.bn1(x)
            x = x.view(-1, seq_len, C)

        x = self.fc_dropout(x)
        self.lstm.flatten_parameters()
        out, (_, _) = self.lstm(x, None)
        out = self.fc1(out)
        if self.use_bn:
            seq_len, C = out.shape[1:3]
            out = out.view(-1, C)
            out = self.bn2(out)
            out = out.view(-1, seq_len, C)
        out = self.fc_dropout(out)
        out = F.relu(out)
        out = self.fc2(out)
        if not self.training:
            # Probabilities are only produced at inference; training returns raw logits.
            out = self.softmax(out)
        return out


if __name__ == '__main__':
    B, seq_len, C = 10, 20, 2048
    input = torch.randn(B, seq_len, C)
    model = BiLSTM()
    out = model(input)
    # torch.Size([10, 20, 2])
    print(out.size())
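At inference time the module is in eval mode and already returns class probabilities, so a caller can threshold them to obtain boundary positions. A short hypothetical usage sketch (the assumption that class index 1 means "scene boundary" should be verified against the label convention used in `./SceneSeg/main.py`):
```
# Hypothetical usage sketch: turn BiLSTM outputs into boundary decisions.
# Assumes class index 1 corresponds to "scene boundary"; verify against the
# labels used during fine-tuning.
import torch
from SceneSeg.BiLSTM_protocol import BiLSTM

model = BiLSTM()
model.eval()
with torch.no_grad():
    probs = model(torch.randn(1, 40, 2048))  # (1, seq_len, 2), softmax applied in eval mode
boundary_prob = probs[0, :, 1]               # per-shot boundary probability
boundaries = (boundary_prob >= 0.5).nonzero().squeeze(-1)
print(boundaries)
```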

SceneSeg/__init__.py

Whitespace-only changes.
