All datasets inherit from the `torch_geometric` `Dataset` class, allowing for automated preprocessing and inference-time transforms. See the official documentation for more details.
| Dataset | Download from ? | Which files ? | Where to ? |
|---|---|---|---|
| S3DIS | link | `Stanford3dDataset_v1.2.zip` | `data/s3dis/` |
| KITTI-360 | link | `data_3d_semantics.zip` `data_3d_semantics_test.zip` | `data/kitti360/` |
| DALES | link | `DALESObjects.tar.gz` | `data/dales/` |
S3DIS data directory structure.
```
└── data
    └── s3dis                                                      # Structure for S3DIS
        ├── Stanford3dDataset_v1.2.zip                             # (optional) Downloaded zipped dataset with non-aligned rooms
        ├── raw                                                    # Raw dataset files
        │   └── Area_{{1, 2, 3, 4, 5, 6}}                          # S3DIS's area/room/room.txt structure
        │       ├── Area_{{1, 2, 3, 4, 5, 6}}_alignmentAngle.txt   # Room alignment angles required for entire floor reconstruction
        │       └── {{room_name}}
        │           └── {{room_name}}.txt
        └── processed                                              # Preprocessed data
            └── {{train, val, test}}                               # Dataset splits
                └── {{preprocessing_hash}}                         # Preprocessing folder
                    └── Area_{{1, 2, 3, 4, 5, 6}}.h5               # Preprocessed Area file
```
Warning: Make sure you download `Stanford3dDataset_v1.2.zip` and NOT the aligned version ⛔`Stanford3dDataset_v1.2_Aligned_Version.zip`, which does not contain the `Area_{{1, 2, 3, 4, 5, 6}}_alignmentAngle.txt` files.
KITTI-360 data directory structure.
```
└── data
    └── kitti360                                  # Structure for KITTI-360
        ├── data_3d_semantics_test.zip            # (optional) Downloaded zipped test dataset
        ├── data_3d_semantics.zip                 # (optional) Downloaded zipped train dataset
        ├── raw                                   # Raw dataset files
        │   └── data_3d_semantics                 # Contains all raw train and test sequences
        │       └── {{sequence_name}}             # KITTI-360's sequence/static/window.ply structure
        │           └── static
        │               └── {{window_name}}.ply
        └── processed                             # Preprocessed data
            └── {{train, val, test}}              # Dataset splits
                └── {{preprocessing_hash}}        # Preprocessing folder
                    └── {{sequence_name}}
                        └── {{window_name}}.h5    # Preprocessed window file
```
DALES data directory structure.
```
└── data
    └── dales                                 # Structure for DALES
        ├── DALESObjects.tar.gz               # (optional) Downloaded zipped dataset
        ├── raw                               # Raw dataset files
        │   └── {{train, test}}               # DALES' split/tile.ply structure
        │       └── {{tile_name}}.ply
        └── processed                         # Preprocessed data
            └── {{train, val, test}}          # Dataset splits
                └── {{preprocessing_hash}}    # Preprocessing folder
                    └── {{tile_name}}.h5      # Preprocessed tile file
```
Warning: Make sure you download `DALESObjects.tar.gz` and NOT the ⛔`dales_semantic_segmentation_las.tar.gz` nor ⛔`dales_semantic_segmentation_ply.tar.gz` versions, which do not contain all the required point attributes.
Note: Already have the dataset on your machine ? Save memory 💾 by simply symlinking or copying the files to `data/<dataset_name>/raw/`, following the above-described `data/` structure.
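As an illustration, a minimal sketch of the symlinking approach (the source path and the choice of DALES are purely illustrative, not prescribed by the repository) could look like this:

```python
# Sketch: symlink an existing local copy of the DALES raw splits into
# data/dales/raw/ instead of duplicating or re-downloading the files.
from pathlib import Path

existing = Path("/mnt/storage/DALESObjects")  # hypothetical location of your local copy
raw_dir = Path("data/dales/raw")
raw_dir.mkdir(parents=True, exist_ok=True)

for split_dir in existing.iterdir():  # e.g. the train/ and test/ folders
    link = raw_dir / split_dir.name
    if not link.exists():
        link.symlink_to(split_dir, target_is_directory=True)
```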
Following `torch_geometric`'s `Dataset` behaviour:

1. Dataset instantiation ➡ Load preprocessed data in `data/<dataset_name>/processed`
2. Missing files in the `data/<dataset_name>/processed` structure ➡ Automatic preprocessing using files in `data/<dataset_name>/raw`
3. Missing files in the `data/<dataset_name>/raw` structure ➡ Automatic unzipping of the downloaded dataset in `data/<dataset_name>`
4. Missing downloaded dataset in the `data/<dataset_name>` structure ➡ ~~Automatic~~ Manual download to `data/<dataset_name>`
Warning: We do not support ❌ automatic download, for compliance reasons. Please manually download the required dataset files to the required location as indicated in the above table.
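To illustrate the chain above, instantiating a dataset object is all it takes to trigger it. The snippet below is only a sketch: the class name, import path, and constructor arguments are assumptions, so check `src/datasets` for the actual interface.

```python
# Hedged sketch (names and signature are assumptions, not the exact API):
# instantiating the dataset loads data/dales/processed if it exists, otherwise
# preprocessing runs from data/dales/raw, which itself must have been unzipped
# from the manually downloaded archive in data/dales.
from src.datasets import DALES  # hypothetical import

dataset = DALES(root="data/dales", stage="train")  # hypothetical signature
print(len(dataset))
```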
The `data/` and `logs/` directories will store all your datasets and training logs. By default, these are placed in the repository directory. Since this may take some space, or your heavy data may be stored elsewhere, you may specify other paths for these directories by creating a `configs/local/defaults.yaml` file containing the following:
```yaml
# @package paths

# path to data directory
data_dir: /path/to/your/data/

# path to logging directory
log_dir: /path/to/your/logs/
```
Pre-transforms are the functions making up the preprocessing. These are called only once and their output is saved in `data/<dataset_name>/processed/`. They typically encompass neighbor search and partition construction.
The transforms are called by the `Dataloader`s at batch-creation time. These typically encompass sampling and data augmentations, and are performed on CPU before moving the batch to the GPU.
On-device transforms are transforms to be performed on GPU. These are typically compute-intensive operations that could not be done once and for all at preprocessing time, yet are too slow to be performed on CPU.
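To make the first two categories concrete with plain `torch_geometric` (a generic example, not this repository's code, and the transform choices are arbitrary): the `pre_transform` is computed once and cached under the dataset's `processed/` folder, while the `transform` is re-applied every time a sample is fetched.

```python
# Generic torch_geometric illustration of pre_transform vs transform
import torch_geometric.transforms as T
from torch_geometric.datasets import ShapeNet

dataset = ShapeNet(
    root="data/shapenet",            # illustrative root directory
    categories=["Airplane"],
    pre_transform=T.KNNGraph(k=16),  # run once at preprocessing time, cached on disk
    transform=T.RandomJitter(0.01),  # run at batch-creation time, on CPU
)
```

On-device transforms have no direct `torch_geometric` counterpart: here they are applied only after the batch has been moved to the GPU.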
Unlike in `torch_geometric`, you can have multiple preprocessed versions of each dataset, identified by their preprocessing hash. This hash changes whenever the preprocessing configuration (i.e. the pre-transforms) is modified in an impactful way (e.g. changing the partition regularization). Modifications of the transforms and on-device transforms do not affect the preprocessing hash.
Each dataset has a "mini" version which only processes a portion of the data, to speed up experimentation. To use it, set the following in the dataset config of your choice:
```yaml
mini: True
```
Or, if you are using the CLI, use the following syntax:
```bash
# Train SPT on mini-DALES
python src/train.py experiment=dales +datamodule.mini=True
```
To create your own dataset, you will need to do the following:

- create a `YourDataset` class inheriting from `src.datasets.BaseDataset`
- create a `YourDataModule` class inheriting from `src.datamodules.DataModule`
- create a `configs/datamodule/your_dataset.yaml` config
Instructions are provided in the docstrings of those classes and you can get inspiration from our code for S3DIS, KITTI-360 and DALES to get started.
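A minimal skeleton might look as follows. Apart from `read_single_raw_cloud()` and `num_classes`, which are discussed below, the property names and the body of `YourDataModule` are assumptions, so rely on the `BaseDataset` and `DataModule` docstrings for the exact interface to implement.

```python
# Hypothetical skeleton — check the BaseDataset / DataModule docstrings for the
# actual abstract methods and properties you must implement.
from src.datasets import BaseDataset
from src.datamodules import DataModule


class YourDataset(BaseDataset):

    @property
    def class_names(self):
        # Hypothetical property: one name per label in [0, num_classes - 1]
        return ["ground", "vegetation", "building"]

    @property
    def num_classes(self):
        # Labels in [0, num_classes - 1] are supervised; num_classes means 'void'
        return 3

    def read_single_raw_cloud(self, raw_cloud_path):
        # Read one raw cloud file and return its points, features, and labels
        # in [0, num_classes]
        raise NotImplementedError


class YourDataModule(DataModule):
    # Hook YourDataset into the datamodule here; the attributes and methods to
    # override are described in the DataModule docstrings.
    pass
```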
We suggest that your config inherits from `configs/datamodule/default.yaml`. See `configs/datamodule/s3dis.yaml`, `configs/datamodule/kitti360.yaml`, and `configs/datamodule/dales.yaml` for inspiration.
The semantic labels of your dataset must follow certain rules. Indeed, your points are expected to have labels in $[0, C]$, where $C$ is the `num_classes` you define in your `YourDataset`.

- All labels in $[0, C - 1]$ are assumed to be present in your dataset. As such, they will all be used in metrics and losses computation.
- A point with the label $C$ will be considered void/ignored/unlabeled (whichever you call it). As such, it will be excluded from metrics and losses computation.
Hence, make sure the output of your `YourDataset.read_single_raw_cloud()` reader method never returns labels outside of your $[0, C]$ range (`torch_geometric.nn.pool.consecutive.consecutive_cluster` can help you with that, if need be), while making sure you only use the label $C$ for void/ignored/unlabeled points.
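For instance, if your raw files store arbitrary, non-consecutive label ids, a remapping along these lines (the raw label values below are made up) brings them back to consecutive integers; you then still need to make sure the void class, if any, ends up mapped to $C$:

```python
import torch
from torch_geometric.nn.pool.consecutive import consecutive_cluster

# Made-up raw labels with arbitrary, non-consecutive ids
raw_y = torch.tensor([12, 12, 40, 7, 40])

# Remap to consecutive ids starting at 0 (following the sorted unique values)
y, _ = consecutive_cluster(raw_y)
# y is now tensor([1, 1, 2, 0, 2]): 7 -> 0, 12 -> 1, 40 -> 2
```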