[Feature] Add COCO, LVIS and VOC dataset download script based on PR open-mmlab#6715 (open-mmlab#7015)

Czm369 · triple-Mu · q3394101 · web-flow · commit 75f26c8e8f32 · 2022-01-19T22:46:03.000+08:00
* Add coco dataset download script 

Use the command "python tools/download.py --win --unzip" to download the COCO dataset.
On Linux, use the command "python tools/download.py --unzip".

* add some notes and modify dataset urls

* remove some useless lines and modify urls list to dict

* add urls of lvis and voc, and delete --win

* add parse_args()

* Add documentation of this tool in docs/en/1_exist_data_model.md, docs/zh_cn/1_exist_data_model.md and docs/en/useful_tools.md.

* add a link

* Download files regardless of system.

* Use get() of dict

* add empty line above the code block

* Update useful_tools.md

Co-authored-by: q3394101 <92794867+q3394101@users.noreply.github.com>
Co-authored-by: q3394101 <3394101@qq.com>
Co-authored-by: Wenwei Zhang <40779233+ZwwWayne@users.noreply.github.com>
diff --git a/docs/en/1_exist_data_model.md b/docs/en/1_exist_data_model.md
@@ -174,6 +174,10 @@ Public datasets like [Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/index.h
 It is recommended to download and extract the dataset somewhere outside the project directory and symlink the dataset root to `$MMDETECTION/data` as below.
 If your folder structure is different, you may need to change the corresponding paths in config files.
 
+We provide a script to download datasets such as COCO. For example, run `python tools/misc/download_dataset.py --dataset-name coco2017` to download the COCO dataset.
+
+For more usage, please refer to [dataset-download](https://github.com/open-mmlab/mmdetection/tree/master/docs/en/useful_tools.md#dataset-download).
+
 ```text
 mmdetection
 ├── mmdet
diff --git a/docs/en/useful_tools.md b/docs/en/useful_tools.md
@@ -377,6 +377,16 @@ python tools/dataset_converters/cityscapes.py ${CITYSCAPES_PATH} [-h] [--img-dir
 python tools/dataset_converters/pascal_voc.py ${DEVKIT_PATH} [-h] [-o ${OUT_DIR}]
 ```
 
+## Dataset Download
+
+`tools/misc/download_dataset.py` supports downloading datasets such as COCO, VOC, and LVIS.
+
+```shell
+python tools/misc/download_dataset.py --dataset-name coco2017
+python tools/misc/download_dataset.py --dataset-name voc2007
+python tools/misc/download_dataset.py --dataset-name lvis
+```
+
 ## Benchmark
 
 ### Robust Detection Benchmark
diff --git a/docs/zh_cn/1_exist_data_model.md b/docs/zh_cn/1_exist_data_model.md
@@ -172,6 +172,7 @@ asyncio.run(main())
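 Note: In detection tasks, Pascal VOC 2012 is a non-overlapping extension of Pascal VOC 2007, and the two are usually used together.
 We recommend downloading the dataset, extracting it to a folder outside the project, and symlinking the dataset root to the `$MMDETECTION/data` folder, as shown below.
 If your folder structure differs from the one below, you need to change the corresponding paths in the config files.
+We provide a script to download datasets such as COCO. You can run `python tools/misc/download_dataset.py --dataset-name coco2017` to download the COCO dataset.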
 
 ```plain
 mmdetection
diff --git a/tools/misc/download_dataset.py b/tools/misc/download_dataset.py
@@ -0,0 +1,102 @@
+import argparse
+from itertools import repeat
+from multiprocessing.pool import ThreadPool
+from pathlib import Path
+from tarfile import TarFile
+from zipfile import ZipFile
+
+import torch
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description='Download datasets for training')
+    parser.add_argument(
+        '--dataset-name', type=str, help='dataset name', default='coco2017')
+    parser.add_argument(
+        '--save-dir',
+        type=str,
+        help='the dir to save dataset',
+        default='data/coco')
+    parser.add_argument(
+        '--unzip',
+        action='store_true',
+        help='unzip the downloaded archives; the archives are kept')
+    parser.add_argument(
+        '--delete',
+        action='store_true',
+        help='delete the downloaded archives after unzipping')
+    parser.add_argument(
+        '--threads', type=int, help='number of threads', default=4)
+    args = parser.parse_args()
+    return args
+
+
+def download(url, dir, unzip=True, delete=False, threads=1):
+
+    def download_one(url, dir):
+        f = dir / Path(url).name
+        if Path(url).is_file():
+            Path(url).rename(f)
+        elif not f.exists():
+            print('Downloading {} to {}'.format(url, f))
+            torch.hub.download_url_to_file(url, f, progress=True)
+        if unzip and f.suffix in ('.zip', '.tar'):
+            print('Unzipping {}'.format(f.name))
+            # Close the archive before it is (optionally) deleted.
+            opener = ZipFile if f.suffix == '.zip' else TarFile
+            with opener(f) as archive:
+                archive.extractall(path=dir)
+            if delete:
+                f.unlink()
+                print('Deleted {}'.format(f))
+
+    dir = Path(dir)
+    urls = [url] if isinstance(url, (str, Path)) else url
+    if threads > 1:
+        # starmap blocks until done and re-raises worker errors.
+        with ThreadPool(threads) as pool:
+            pool.starmap(download_one, zip(urls, repeat(dir)))
+    else:
+        for u in urls:
+            download_one(u, dir)
+
+
+def main():
+    args = parse_args()
+    path = Path(args.save_dir)
+    if not path.exists():
+        path.mkdir(parents=True, exist_ok=True)
+    data2url = dict(
+        # TODO: Support for downloading Panoptic Segmentation of COCO
+        coco2017=[
+            'http://images.cocodataset.org/zips/train2017.zip',
+            'http://images.cocodataset.org/zips/val2017.zip',
+            'http://images.cocodataset.org/zips/test2017.zip',
+            'http://images.cocodataset.org/annotations/' +
+            'annotations_trainval2017.zip'
+        ],
+        lvis=[
+            'https://s3-us-west-2.amazonaws.com/dl.fbaipublicfiles.com/LVIS/lvis_v1_train.json.zip',  # noqa
+            'https://s3-us-west-2.amazonaws.com/dl.fbaipublicfiles.com/LVIS/lvis_v1_val.json.zip',  # noqa
+        ],
+        voc2007=[
+            'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar',  # noqa
+            'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar',  # noqa
+            'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCdevkit_08-Jun-2007.tar',  # noqa
+        ],
+    )
+    url = data2url.get(args.dataset_name, None)
+    if url is None:
+        print('Only coco2017, voc2007, and lvis are supported now!')
+        return
+    download(
+        url,
+        dir=path,
+        unzip=args.unzip,
+        delete=args.delete,
+        threads=args.threads)
+
+
+if __name__ == '__main__':
+    main()
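
Note on the dispatch logic in `download()`: the multi-threaded branch zips `url` directly, so a single URL string would be iterated character by character, and `pool.imap` never has its results consumed, so errors raised in a worker go unnoticed. The following is a minimal sketch of one way to handle both, not the PR's implementation. The names `as_url_list`, `run_all` and `fake_worker` are hypothetical, and the worker is injected so the dispatch can be tried without network access.

```python
from itertools import repeat
from multiprocessing.pool import ThreadPool
from pathlib import Path


def as_url_list(url):
    """Wrap a single URL (str or Path) in a list; copy other iterables."""
    return [url] if isinstance(url, (str, Path)) else list(url)


def run_all(urls, save_dir, worker, threads=1):
    """Call worker(url, save_dir) for every URL and return the results.

    `worker` is a hypothetical stand-in for download_one; it is passed in
    so that the dispatch logic can be exercised offline.
    """
    urls = as_url_list(urls)
    save_dir = Path(save_dir)
    if threads > 1:
        # starmap blocks until every task has finished and re-raises an
        # exception raised in a worker thread.
        with ThreadPool(threads) as pool:
            return pool.starmap(worker, zip(urls, repeat(save_dir)))
    return [worker(u, save_dir) for u in urls]


def fake_worker(url, save_dir):
    # Hypothetical worker: only computes the target path, downloads nothing.
    return save_dir / Path(url).name
```

With this shape, `run_all('http://example.com/a.zip', 'data', fake_worker, threads=4)` treats the string as one URL rather than a sequence of characters, and a failing worker raises in the caller instead of failing silently.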