
Commit 79d26b3: Add files via upload (1 parent: 79487b9)

File tree: 6 files changed, +2058 -2 lines


LICENSE (+1,401 lines)

Large diffs are not rendered by default.

README.md (+164 -2)
@@ -1,2 +1,164 @@
- # CrowdCounting-P2PNet
- The official codes for the ICCV2021 Oral presentation "Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework"

# P2PNet (ICCV2021 Oral Presentation)

This repository contains the official PyTorch implementation of **P2PNet**, as described in [Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework](https://arxiv.org/abs/2107.12746).

A brief introduction to P2PNet can be found at [机器之心 (almosthuman)](https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650827826&idx=3&sn=edd3d66444130fb34a59d08fab618a9e&chksm=84e5a84cb392215a005a3b3424f20a9d24dc525dcd933960035bf4b6aa740191b5ecb2b7b161&mpshare=1&scene=1&srcid=1004YEOC7HC9daYRYeUio7Xn&sharer_sharetime=1633675738338&sharer_shareid=7d375dccd3b2f9eec5f8b27ee7c04883&version=3.1.16.5505&platform=win#rd).

The code has been tested with PyTorch 1.5.0 and may not run with other versions.
## Visualized demos for P2PNet

<img src="vis/congested1.png" width="1000"/>
<img src="vis/congested2.png" width="1000"/>
<img src="vis/congested3.png" width="1000"/>
## The network

The overall architecture of P2PNet. Built upon a VGG16 backbone, it first introduces an upsampling path to obtain a fine-grained feature map, and then exploits two branches to simultaneously predict a set of point proposals and their confidence scores. A minimal code sketch of this two-branch head is given below the figure.

<img src="vis/net.png" width="1000"/>
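For readers who prefer code, here is a minimal sketch of such a two-branch head in PyTorch. The class name, channel sizes, and the number of anchor points per feature location are illustrative assumptions, not the exact configuration used in this repository; only the output shapes are chosen to match what `engine.py` consumes.

```python
import torch
import torch.nn as nn

# Illustrative two-branch head: one branch regresses point coordinates per
# anchor point, the other predicts a 2-class confidence score per proposal.
# Channel sizes and layer counts are assumptions, not the repo's exact config.
class PointHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_anchor_points: int = 4):
        super().__init__()
        self.regression = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_anchor_points * 2, 3, padding=1),  # (x, y) per anchor point
        )
        self.classification = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_anchor_points * 2, 3, padding=1),  # 2 logits (background / person)
        )

    def forward(self, feat: torch.Tensor):
        b = feat.size(0)
        # reshape to (batch, num_proposals, 2), the layout softmax'd in engine.py
        points = self.regression(feat).permute(0, 2, 3, 1).reshape(b, -1, 2)
        logits = self.classification(feat).permute(0, 2, 3, 1).reshape(b, -1, 2)
        return points, logits


if __name__ == "__main__":
    head = PointHead()
    points, logits = head(torch.randn(1, 256, 32, 32))
    print(points.shape, logits.shape)  # both torch.Size([1, 4096, 2])
```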
## Comparison with state-of-the-art methods

P2PNet achieves state-of-the-art performance on several challenging datasets with varying densities.

| Methods | Venue | SHTechPartA <br> MAE/MSE | SHTechPartB <br> MAE/MSE | UCF_CC_50 <br> MAE/MSE | UCF_QNRF <br> MAE/MSE |
|:----:|:----:|:----:|:----:|:----:|:----:|
| CAN | CVPR'19 | 62.3/100.0 | 7.8/12.2 | 212.2/**243.7** | 107.0/183.0 |
| Bayesian+ | ICCV'19 | 62.8/101.8 | 7.7/12.7 | 229.3/308.2 | 88.7/154.8 |
| S-DCNet | ICCV'19 | 58.3/95.0 | 6.7/10.7 | 204.2/301.3 | 104.4/176.1 |
| SANet+SPANet | ICCV'19 | 59.4/92.5 | 6.5/**9.9** | 232.6/311.7 | -/- |
| DUBNet | AAAI'20 | 64.6/106.8 | 7.7/12.5 | 243.8/329.3 | 105.6/180.5 |
| SDANet | AAAI'20 | 63.6/101.8 | 7.8/10.2 | 227.6/316.4 | -/- |
| ADSCNet | CVPR'20 | <u>55.4</u>/97.7 | <u>6.4</u>/11.3 | 198.4/267.3 | **71.3**/**132.5** |
| ASNet | CVPR'20 | 57.78/<u>90.13</u> | -/- | <u>174.84</u>/<u>251.63</u> | 91.59/159.71 |
| AMRNet | ECCV'20 | 61.59/98.36 | 7.02/11.00 | 184.0/265.8 | 86.6/152.2 |
| AMSNet | ECCV'20 | 56.7/93.4 | 6.7/10.2 | 208.4/297.3 | 101.8/163.2 |
| DM-Count | NeurIPS'20 | 59.7/95.7 | 7.4/11.8 | 211.0/291.5 | 85.6/<u>148.3</u> |
| **Ours** | - | **52.74**/**85.06** | **6.25**/**9.9** | **172.72**/256.18 | <u>85.32</u>/154.5 |

Comparison on the [NWPU-Crowd](https://www.crowdbenchmark.com/resultdetail.html?rid=81) dataset.

| Methods | MAE[O] | MSE[O] | MAE[L] | MAE[S] |
|:----:|:----:|:----:|:----:|:----:|
| MCNN | 232.5 | 714.6 | 220.9 | 1171.9 |
| SANet | 190.6 | 491.4 | 153.8 | 716.3 |
| CSRNet | 121.3 | 387.8 | 112.0 | <u>522.7</u> |
| PCC-Net | 112.3 | 457.0 | 111.0 | 777.6 |
| CANNet | 110.0 | 495.3 | 102.3 | 718.3 |
| Bayesian+ | 105.4 | 454.2 | 115.8 | 750.5 |
| S-DCNet | 90.2 | 370.5 | **82.9** | 567.8 |
| DM-Count | <u>88.4</u> | 388.6 | 88.0 | **498.0** |
| **Ours** | **77.44** | **362** | <u>83.28</u> | 553.92 |

The overall performance for both counting and localization.

| nAP$_{\delta}$ | SHTechPartA | SHTechPartB | UCF_CC_50 | UCF_QNRF | NWPU_Crowd |
|:----:|:----:|:----:|:----:|:----:|:----:|
| $\delta=0.05$ | 10.9\% | 23.8\% | 5.0\% | 5.9\% | 12.9\% |
| $\delta=0.25$ | 70.3\% | 84.2\% | 54.5\% | 55.4\% | 71.3\% |
| $\delta=0.50$ | 90.1\% | 94.1\% | 88.1\% | 83.2\% | 89.1\% |
| $\delta=\{{0.05:0.05:0.50}\}$ | 64.4\% | 76.3\% | 54.3\% | 53.1\% | 65.0\% |

Comparison of localization performance in terms of F1-measure on NWPU.

| Method | F1-Measure | Precision | Recall |
|:----:|:----:|:----:|:----:|
| FasterRCNN | 0.068 | 0.958 | 0.035 |
| TinyFaces | 0.567 | 0.529 | 0.611 |
| RAZ | 0.599 | 0.666 | 0.543 |
| Crowd-SDNet | 0.637 | 0.651 | 0.624 |
| PDRNet | 0.653 | 0.675 | 0.633 |
| TopoCount | 0.692 | 0.683 | **0.701** |
| D2CNet | <u>0.700</u> | **0.741** | 0.662 |
| **Ours** | **0.712** | <u>0.729</u> | <u>0.695</u> |
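As a rough illustration of how point-level localization metrics of this kind can be computed, the sketch below greedily matches predicted points to ground-truth points within a pixel threshold and reports precision, recall, and F1. The greedy matching rule and the `sigma` threshold are simplifying assumptions; the exact evaluation protocol is defined by the respective benchmarks (e.g. NWPU-Crowd).

```python
import numpy as np

def point_f1(pred, gt, sigma=8.0):
    """Greedy one-to-one matching of predicted points to GT points within a
    distance threshold `sigma` (pixels). Illustrative only, not the official protocol."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    if len(pred) == 0 or len(gt) == 0:
        return 0.0, 0.0, 0.0
    # pairwise distances; repeatedly take the closest still-unmatched pair
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    matched_pred, matched_gt, tp = set(), set(), 0
    pairs = sorted(((i, j) for i in range(len(pred)) for j in range(len(gt))),
                   key=lambda ij: dist[ij])
    for i, j in pairs:
        if dist[i, j] > sigma:
            break
        if i in matched_pred or j in matched_gt:
            continue
        matched_pred.add(i)
        matched_gt.add(j)
        tp += 1
    precision = tp / len(pred)
    recall = tp / len(gt)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return precision, recall, f1
```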
## Installation

* Clone this repo into a directory named P2PNET_ROOT.
* Organize your datasets as required (see the next section).
* Install the Python dependencies. We use Python 3.6.5 and PyTorch 1.5.0.

```
pip install -r requirements.txt
```
## Organize the counting dataset

We use a list file to collect all the images and their ground-truth annotations in a counting dataset. When your dataset is organized as recommended below, the format of this list file is:

```
train/scene01/img01.jpg train/scene01/img01.txt
train/scene01/img02.jpg train/scene01/img02.txt
...
train/scene02/img01.jpg train/scene02/img01.txt
```

### Dataset structures:
```
DATA_ROOT/
|->train/
|   |->scene01/
|   |->scene02/
|   |->...
|->test/
|   |->scene01/
|   |->scene02/
|   |->...
|->train.list
|->test.list
```
DATA_ROOT is your path containing the counting datasets. A helper sketch for generating these list files is given below.
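If your data already follows the layout above, a small helper along the following lines can generate the list files. This script is an illustrative sketch and is not part of the repository; it assumes every `.jpg` has a `.txt` annotation file with the same stem next to it.

```python
# Illustrative helper (not part of this repo): write DATA_ROOT/train.list and
# DATA_ROOT/test.list from the recommended directory layout.
import os

def make_list(data_root, split):
    lines = []
    split_dir = os.path.join(data_root, split)
    for scene in sorted(os.listdir(split_dir)):
        scene_dir = os.path.join(split_dir, scene)
        for name in sorted(os.listdir(scene_dir)):
            if name.endswith('.jpg'):
                img = os.path.join(split, scene, name)   # path relative to DATA_ROOT
                ann = img[:-4] + '.txt'                  # same stem, .txt extension
                lines.append(f'{img} {ann}')
    with open(os.path.join(data_root, f'{split}.list'), 'w') as f:
        f.write('\n'.join(lines) + '\n')

if __name__ == '__main__':
    data_root = '/path/to/DATA_ROOT'  # adjust to your setup
    make_list(data_root, 'train')
    make_list(data_root, 'test')
```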
### Annotations format

For the annotations of each image, we use a single txt file containing one annotation per line. Note that pixel indexing starts at 0. The expected format of each line is:
```
x1 y1
x2 y2
...
```
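A minimal sketch for reading such an annotation file into an N x 2 array of head coordinates (hedged example; the repository's own dataset code handles this internally):

```python
import numpy as np

def load_points(txt_path):
    """Read one 'x y' pair per line into a float array of shape [N, 2].
    Coordinates are 0-indexed pixel positions, as noted above."""
    points = []
    with open(txt_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                points.append([float(parts[0]), float(parts[1])])
    return np.array(points, dtype=np.float32).reshape(-1, 2)
```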
## Training

The network can be trained with the `train.py` script. For training on SHTechPartA, use

```
CUDA_VISIBLE_DEVICES=0 python train.py --data_root $DATA_ROOT \
    --dataset_file SHHA \
    --epochs 3500 \
    --lr_drop 3500 \
    --output_dir ./logs \
    --checkpoints_dir ./weights \
    --tensorboard_dir ./logs \
    --lr 0.0001 \
    --lr_backbone 0.00001 \
    --batch_size 8 \
    --eval_freq 1 \
    --gpu_id 0
```
By default, a periodic evaluation is conducted on the validation set.
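Curves written to `--tensorboard_dir` can be inspected with TensorBoard, assuming the `tensorboard` package is installed in addition to the `tensorboardX` dependency listed in requirements.txt:

```
tensorboard --logdir ./logs
```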
## Testing

A trained model on SHTechPartA (with an MAE of **51.96**) is provided in ./weights. Run the following command to launch a visualization demo:

```
CUDA_VISIBLE_DEVICES=0 python run_test.py --weight_path ./weights/SHTechA.pth --output_dir ./logs/
```
## Acknowledgements

- Part of the code is borrowed from the [C^3 Framework](https://github.com/gjy3035/C-3-Framework).
- We refer to [DETR](https://github.com/facebookresearch/detr) for the implementation of our matching strategy.
## Citing P2PNet

If you find P2PNet useful in your project, please consider citing us:

```BibTeX
@inproceedings{song2021rethinking,
  title={Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework},
  author={Song, Qingyu and Wang, Changan and Jiang, Zhengkai and Wang, Yabiao and Tai, Ying and Wang, Chengjie and Li, Jilin and Huang, Feiyue and Wu, Yang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2021}
}
```
## Related works from Tencent Youtu Lab

- [AAAI2021] To Choose or to Fuse? Scale Selection for Crowd Counting. ([paper link](https://ojs.aaai.org/index.php/AAAI/article/view/16360) & [codes](https://github.com/TencentYoutuResearch/CrowdCounting-SASNet))
- [ICCV2021] Uniformity in Heterogeneity: Diving Deep into Count Interval Partition for Crowd Counting. ([paper link](https://arxiv.org/abs/2107.12619) & [codes](https://github.com/TencentYoutuResearch/CrowdCounting-UEPNet))

engine.py (+159)
@@ -0,0 +1,159 @@
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Train and eval functions used in main.py
Mostly copy-paste from DETR (https://github.com/facebookresearch/detr).
"""
import math
import os
import sys
from typing import Iterable

import torch

import util.misc as utils
from util.misc import NestedTensor
import numpy as np
import time
import torchvision.transforms as standard_transforms
import cv2


class DeNormalize(object):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, tensor):
        for t, m, s in zip(tensor, self.mean, self.std):
            t.mul_(s).add_(m)
        return tensor


def vis(samples, targets, pred, vis_dir, des=None):
    '''
    samples -> tensor: [batch, 3, H, W]
    targets -> list of dict: [{'points':[], 'image_id': str}]
    pred -> list: [num_preds, 2]
    '''
    gts = [t['point'].tolist() for t in targets]

    pil_to_tensor = standard_transforms.ToTensor()

    restore_transform = standard_transforms.Compose([
        DeNormalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        standard_transforms.ToPILImage()
    ])
    # draw one by one
    for idx in range(samples.shape[0]):
        sample = restore_transform(samples[idx])
        sample = pil_to_tensor(sample.convert('RGB')).numpy() * 255
        sample_gt = sample.transpose([1, 2, 0])[:, :, ::-1].astype(np.uint8).copy()
        sample_pred = sample.transpose([1, 2, 0])[:, :, ::-1].astype(np.uint8).copy()

        max_len = np.max(sample_gt.shape)

        size = 2
        # draw gt
        for t in gts[idx]:
            sample_gt = cv2.circle(sample_gt, (int(t[0]), int(t[1])), size, (0, 255, 0), -1)
        # draw predictions
        for p in pred[idx]:
            sample_pred = cv2.circle(sample_pred, (int(p[0]), int(p[1])), size, (0, 0, 255), -1)

        name = targets[idx]['image_id']
        # save the visualized images
        if des is not None:
            cv2.imwrite(os.path.join(vis_dir, '{}_{}_gt_{}_pred_{}_gt.jpg'.format(int(name),
                        des, len(gts[idx]), len(pred[idx]))), sample_gt)
            cv2.imwrite(os.path.join(vis_dir, '{}_{}_gt_{}_pred_{}_pred.jpg'.format(int(name),
                        des, len(gts[idx]), len(pred[idx]))), sample_pred)
        else:
            cv2.imwrite(
                os.path.join(vis_dir, '{}_gt_{}_pred_{}_gt.jpg'.format(int(name), len(gts[idx]), len(pred[idx]))),
                sample_gt)
            cv2.imwrite(
                os.path.join(vis_dir, '{}_gt_{}_pred_{}_pred.jpg'.format(int(name), len(gts[idx]), len(pred[idx]))),
                sample_pred)


# the training routine
def train_one_epoch(model: torch.nn.Module, criterion: torch.nn.Module,
                    data_loader: Iterable, optimizer: torch.optim.Optimizer,
                    device: torch.device, epoch: int, max_norm: float = 0):
    model.train()
    criterion.train()
    metric_logger = utils.MetricLogger(delimiter=" ")
    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    # iterate all training samples
    for samples, targets in data_loader:
        samples = samples.to(device)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        # forward
        outputs = model(samples)
        # calc the losses
        loss_dict = criterion(outputs, targets)
        weight_dict = criterion.weight_dict
        losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)

        # reduce all losses
        loss_dict_reduced = utils.reduce_dict(loss_dict)
        loss_dict_reduced_unscaled = {f'{k}_unscaled': v
                                      for k, v in loss_dict_reduced.items()}
        loss_dict_reduced_scaled = {k: v * weight_dict[k]
                                    for k, v in loss_dict_reduced.items() if k in weight_dict}
        losses_reduced_scaled = sum(loss_dict_reduced_scaled.values())

        loss_value = losses_reduced_scaled.item()

        if not math.isfinite(loss_value):
            print("Loss is {}, stopping training".format(loss_value))
            print(loss_dict_reduced)
            sys.exit(1)
        # backward
        optimizer.zero_grad()
        losses.backward()
        if max_norm > 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        # update logger
        metric_logger.update(loss=loss_value, **loss_dict_reduced_scaled, **loss_dict_reduced_unscaled)
        metric_logger.update(lr=optimizer.param_groups[0]["lr"])
    # gather the stats from all processes
    metric_logger.synchronize_between_processes()
    print("Averaged stats:", metric_logger)
    return {k: meter.global_avg for k, meter in metric_logger.meters.items()}


# the inference routine
@torch.no_grad()
def evaluate_crowd_no_overlap(model, data_loader, device, vis_dir=None):
    model.eval()

    metric_logger = utils.MetricLogger(delimiter=" ")
    metric_logger.add_meter('class_error', utils.SmoothedValue(window_size=1, fmt='{value:.2f}'))
    # run inference on all images to calc MAE
    maes = []
    mses = []
    for samples, targets in data_loader:
        samples = samples.to(device)

        outputs = model(samples)
        outputs_scores = torch.nn.functional.softmax(outputs['pred_logits'], -1)[:, :, 1][0]

        outputs_points = outputs['pred_points'][0]

        gt_cnt = targets[0]['point'].shape[0]
        # 0.5 is used by default
        threshold = 0.5

        points = outputs_points[outputs_scores > threshold].detach().cpu().numpy().tolist()
        predict_cnt = int((outputs_scores > threshold).sum())
        # if specified, save the visualized images
        if vis_dir is not None:
            vis(samples, targets, [points], vis_dir)
        # accumulate MAE, MSE
        mae = abs(predict_cnt - gt_cnt)
        mse = (predict_cnt - gt_cnt) * (predict_cnt - gt_cnt)
        maes.append(float(mae))
        mses.append(float(mse))
    # calc MAE, MSE
    mae = np.mean(maes)
    mse = np.sqrt(np.mean(mses))

    return mae, mse
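For orientation, the following hedged sketch shows how these routines might be wired together from a training script. The model, criterion, and data loaders are assumed to be built elsewhere with the repository's own factories (the actual entry point is `train.py`, which handles this); the optimizer choice here is only illustrative.

```python
# Hypothetical driver sketch: only the epoch loop around the engine.py routines
# is shown; model/criterion/data loaders must come from the repo's own builders.
import torch

from engine import train_one_epoch, evaluate_crowd_no_overlap


def run(model, criterion, train_loader, val_loader, device, epochs=3500):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative optimizer
    for epoch in range(epochs):
        stats = train_one_epoch(model, criterion, train_loader, optimizer, device, epoch)
        # periodic evaluation on the validation set, as described in the README
        mae, mse = evaluate_crowd_no_overlap(model, val_loader, device)
        print(f"epoch {epoch}: loss={stats['loss']:.4f} MAE={mae:.2f} MSE={mse:.2f}")
```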

requirements.txt (+10)
@@ -0,0 +1,10 @@
torch
torchvision
tensorboardX
easydict
pandas
numpy
scipy
matplotlib
Pillow
opencv-python
