diff --git a/guide/14-deep-learning/point_cloud_classification_using_point_transformer.ipynb b/guide/14-deep-learning/point_cloud_classification_using_point_transformer.ipynb new file mode 100644 index 0000000000..6fbbd9b90a --- /dev/null +++ b/guide/14-deep-learning/point_cloud_classification_using_point_transformer.ipynb @@ -0,0 +1,452 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Point cloud classification using Point Transformer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Table of Contents

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `arcgis.learn` module has a state-of-the-art point cloud classification model based on the now popular transformer architecture, called Point Transformer V3 [1], which can be used to classify a large number of points in a point cloud dataset. In general, point cloud datasets are gathered using LiDAR sensors, which apply a laser beam to sample the Earth's surface and generate high-precision x, y, and z points. These points, known as \"point clouds,\" are commonly generated through the use of terrestrial and airborne LiDAR.\n", + "\n", + "Point clouds are collections of 3D points that carry the location, measured in x, y, and z coordinates. These points also have some additional information like \"GPS timestamps,\" \"intensity,\" and \"number of returns.\" The intensity represents the returning strength from the laser pulse that scanned the area, and the number of returns shows how many times a given pulse returned. LiDAR data can also be fused with RGB (red, green, and blue) bands, derived from imagery taken simultaneously with the LiDAR survey.\n", + "\n", + "Point cloud classification is based on the type of object that reflected the laser pulse. For example, a point that reflects off the ground is classified into the ground category. LiDAR points can be classified into different categories like buildings, trees, highways, water, etc. These different classes have numeric codes assigned to them." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

\n", + "

\n", + "\n", + "
\n", + "

\n", + "
\n", + "
Figure 1. Visualization of a point cloud dataset, with classes represented by different colors.
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Point cloud classification is a task where each point in the point cloud is assigned a label, representing a real-world entity (see Figure 1). And similar to how it's done in traditional methods, for deep learning, the point cloud classification process involves training, where the neural network learns from an already classified (labeled) point cloud dataset, where each point has a unique class code. These class codes are used to represent the features that we want the neural network to recognize." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When it comes to classifying point clouds, deep learning and neural networks are a great choice since they offer a scalable and efficient architecture. They have enormous potential to make manual or semi-assisted classification modes of point clouds a thing of the past. With that in mind, we can take a closer look at the Point Transformer V3 model included in `arcgis.learn` and how it can be used for point cloud classification.\n", + "\n", + "Point Transformer V3 (PTv3) is a new and improved point transformer model that builds upon the successes of its predecessors, PTv1 and PTv2. It's designed with a focus on simplicity, efficiency, and performance. One of the key improvements in PTv3 over PTv1 is the introduction of grouped vector attention (GVA). This mechanism allows for efficient information exchange within the model, leading to better performance. PTv3 also boasts a receptive field that is 64 times wider than PTv1, enabling it to capture a broader context of the point cloud data. [1] It replaces the computationally expensive KNN neighbor search with a more efficient serialized neighbor mapping. The complex attention patch interaction mechanisms of PTv2 are also simplified in PTv3, further enhancing efficiency. Moreover, PTv3 replaces relative positional encoding with a prepositive sparse convolutional layer, contributing to its overall simplicity and performance.\n", + "\n", + "It's worth noting that PTv3's strength also lies in its ability to effectively capture local context and structural information within point clouds. This is crucial for various 3D understanding tasks such as classification, segmentation, and object detection. [1]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Point Transformer architecture" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "PTv3 follows a U-Net framework with four stages of encoders and decoders. It simplifies the block structure using a pre-norm structure and layer normalization. It employs grid pooling for efficient downsampling. The three major aspects of PTv3 are as follows:\n", + "\n", + "1. **Serialization:** Point transformer introduces point cloud serialization, transforming the data into a structured format for efficient processing. It utilizes space-filling curves like the Z-order curve and the Hilbert curve to preserve spatial proximity.\n", + "\n", + "2. **Attention Mechanism:** PTv3 employs a simplified attention mechanism that is tailored for serialized point clouds. It utilizes a patch attention mechanism, grouping points into non-overlapping patches for localized processing.\n", + "\n", + "3. **Positional Encoding:** PTv3 replaces the computationally expensive relative positional encoding with a simpler and more efficient conditional positional encoding (xCPE). This is implemented by a sparse convolutional layer." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Point Cloud Serialization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

\n", + "

\n", + "\n", + "
\n", + "

\n", + "
\n", + "
Figure 2. Four point cloud serialization patterns are shown, each with a triplet visualization. The triplets show the serialization curve, sorting order, and grouped patches for local attention [1].
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Unlike images, which have a natural 2D structure, point clouds are inherently unordered. Point cloud serialization is a crucial step in Point Transformer V3 (PTv3) that transforms the inherently unordered point cloud data into a structured format. This structured format enables the model to process the data more efficiently and leverage the advantages of sequence processing techniques commonly used in natural language processing.\n", + "\n", + "The serialization process utilizes space-filling curves, such as the Z-order curve and the Hilbert curve. These curves traverse the 3D space in a way that preserves spatial locality, meaning that points close together in 3D space are also close together in the serialized sequence (see Figure 2).\n", + "\n", + "PTv3 introduces a novel concept of shifting across different serialization patterns. This shifting allows the attention mechanism (explained later) to capture a wider range of spatial relationships and contexts within the point cloud, leading to improved accuracy and generalization capabilities." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Attention Mechanism" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "PTv3 employs a simplified attention mechanism that is tailored for serialized point clouds. It utilizes a patch attention mechanism, which groups points into non-overlapping patches. Attention is then computed within each patch, allowing for localized processing and reducing computational complexity. This approach contrasts with previous Point Transformer versions that used more computationally expensive global attention mechanisms.\n", + "\n", + "This patch attention mechanism is further enhanced by various patch interaction designs, such as:\n", + "\n", + "- Shift Dilation: Staggering patch grouping by a specific step to extend the receptive field.\n", + "- Shift Patch: Shifting the positions of patches across the serialized point cloud, similar to the shift-window strategy in image transformers.\n", + "- Shift Order: Dynamically varying the serialized order of the point cloud data between attention blocks to prevent overfitting to a single pattern.\n", + "- Shuffle Order: Randomizing the sequence of serialization patterns to further enhance the receptive field of each attention layer.\n", + "\n", + "These designs contribute to the efficiency and effectiveness of the attention mechanism in capturing complex relationships within the point cloud data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Positional Encoding" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Accurate positional information is crucial for point cloud understanding. PTv3 replaces the computationally expensive relative positional encoding (RPE) used in earlier versions with a simpler and more efficient approach. It utilizes a conditional positional encoding (xCPE) implemented by a sparse convolutional layer. This xCPE effectively captures positional information while minimizing computational overhead. The sparse convolutional layer is prepended before the attention layer with a skip connection, further enhancing the efficiency of the positional encoding process. The changes in positional encoding contribute to the overall efficiency and scalability of PTv3, enabling it to handle large-scale point cloud data with improved speed and accuracy." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implementation in `arcgis.learn`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When training a Point Transformer V3 model (`PTv3Seg`) using `arcgis.learn`, the raw point cloud dataset in LAS files is first converted into blocks of points, containing a specific number of points along with their class codes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

\n", + "

\n", + "\n", + "
\n", + "

\n", + "
\n", + "
Figure 3. Prepare Point Cloud Training Data tool in ArcGIS Pro.
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this step of exporting the data into an intermediate format, use Prepare Point Cloud Training Data tool, in the 3D Analyst extension (see Figure 3)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These exported blocks are used to create a `data bunch` object that is passed into the `PTv3Seg` model for training.\n", + "\n", + "```python\n", + "output_path=r'C:/project/training_data.pctd'\n", + "data = prepare_data(output_path, dataset_type='PointCloud', batch_size=2)\n", + "rl = PTv3Seg(data)\n", + "rl.fit(20)\n", + "```\n", + "After training the `PTv3Seg` model, `compute_precision_recall()` method can be used to compute, per-class metrics (precision, recall, and f1-score) with respect to validation data. And `save()` method can be used to save the model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

\n", + "

\n", + "\n", + "
\n", + "

\n", + "
Figure 4. Classify Points Using Trained Model tool.
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For inferencing, use Classify Points Using Trained Model tool, in the 3D Analyst extension (see Figure 4).\n", + "\n", + "Main features available during the inferencing step:\n", + " \n", + "- _Target classification:_ selective classification for flexibility and control in trained model's predictions.\n", + "\n", + "\n", + "- _Preserving specific classes in input data from modification:_ this can be used for updating old datasets and for noise control in model's prediction." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Detailed tool references and resources for point cloud classification using deep learning in ArcGIS Pro can be found here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### For advanced users " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also specify, additional parameters that directly affect the properties of the architecture itself, this can be done while initializing the `PTv3` model, by using the following parameters.\n", + "\n", + "- `sub_sampling_ratio` Sampling ratio of points in each layer.\n", + "\n", + "\n", + "- `seq_len` Sequence length for transformer.\n", + "\n", + "\n", + "- `voxel_size` Defines the size of voxels in meters for a block.\n", + "\n", + "\n", + "A typical usage with respect to API looks like:\n", + "\n", + "For Point Cloud Classification:\n", + "```python\n", + "pt = PTv3Seg(data=data, \n", + " sub_sampling_ratio=2,\n", + " seq_len=1200,\n", + " \n", + " )\n", + "```\n", + "\n", + "For 3D Object Detection:\n", + "\n", + "```python\n", + "ptdet = PTv3Det(data=data)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Best practices for Point Transformer workflow" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following tips and best practices can be used while using Point Transformer V3:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- The 3D deep learning tools in the 3D Analyst extension, takes care of the coordinate system, and related discrepancies, automatically. So, one can train a model using ArcGIS Pro on a dataset with a metric coordinate system, then use that trained model on a dataset with any other coordinate system, and vice-versa without any need for re-projection." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- High-quality labeled data will result in a better-trained model. For generalization and robustness of the trained model, significant diversity or variety should be present in the training data, in terms of geography, building architectures, terrains, object-related variations, etc." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- If the object of interest is significantly larger or smaller in size than the default value of `Block Size`, then a better value can be used for improving the results further. Like, for a dataset in a metric coordinate system, a _'warehouse'_ won't fit in a '50 meter' x '50 meter' `Block Size`, hence the `Block Size` can be increased in this case. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Through a series of experiments, it was found that an additional one or two `extra_features` apart from X, Y, and Z usually works best, in most cases. Over usage of 'extra attributes' for model training might reduce generalization, i.e. 
_'how generic the trained model will be'_. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Deciding which 'extra attributes' to consider, depends upon the properties of the object of interest, the nature of noise, sensor-specific attributes, etc. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- It is recommended to filter or withheld points that belong to the 'high noise' class from the dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- If the training and validation dataset is very large and each epoch is taking a lot of time to complete, then `iters_per_epoch` can be used to see the epoch/training table quickly by reducing the time taken for the completion of an epoch. This is achieved by a random selection/filtering of fewer batches, governed by the user-provided value of `iters_per_epoch`. So in each epoch, the model is exposed to a lesser number of randomly selected batches, this results in faster completion of an epoch, but it can lead to more numbers of epochs before the model converges. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- `mask_class` functionality in `show_results()` can be used for analyzing any inter-class noises present in the validation output. This can be used to understand which classes need more diversity in training data or need an increase in its number of labeled points _(See Figure 5)_.\n", + "\n", + "\n", + "

\n", + "\n", + "
Figure 5. Class-based masking of points, to understand the nature of noise in the prediction.
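A hypothetical call for the masking shown in Figure 5, reusing the `rl` model object from the training snippet above (the class code and number of rows are assumptions):

```python
# Hide points with class code 2 (ground, in this assumed schema) so that
# inter-class noise among the remaining classes is easier to inspect.
rl.show_results(rows=2, mask_class=[2])
```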
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- The default value of `max_display_point` in `show_batch()` and `show_results()` is set to '20000', keeping the rendering-related browser limitation in mind, which can occur for very dense point clouds. This value can be increased if needed, for detailed visualization, within the browser itself. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- `Target Classification` and `Class Preservation` in Classify Points Using Trained Model tool, can be used in conjunction to combine the knowledge of multiple trained models for a single scene. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Parameters like, `classes_of_interest` and `min_points` are especially useful when training a model for SfM or mobile/terrestrial point clouds. In specific scenarios when the 'training data' is not small, these features can be very useful in speeding up the 'training time', improving the convergence during training, and addressing the class imbalance up to some extent." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Fine-tuning a pretrained model is only preferred if the 'object of interest' is either same or similar, else it is not beneficial. Otherwise, fine-tuning a pretrained model can save cost, time, and compute resources while providing better accuracy/quality in results." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Class codes can be given a meaningful name, using `class_mapping`. The names of the class codes are saved inside the model, which is automatically retrieved by Classify Points Using Trained Model tool and Train Point Cloud Classification Model tool, when a trained model is loaded." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- For fine-tuning a model with default architecture settings; 'Class Structure', 'Extra Attributes', and 'Block Point Limit' should match between the pretrained model and the exported 'training data'." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "[1] Wu, X., Jiang, L., Wang, P.-S., Liu, Z., Liu, X., Qiao, Y., Ouyang, W., He, T., & Zhao, H. (2023). Point Transformer V3: Simpler, Faster, Stronger. 
http://arxiv.org/abs/2312.10035" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:conda-arcgispro-py3-clone] *", + "language": "python", + "name": "conda-env-conda-arcgispro-py3-clone-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + }, + "toc": { + "base_numbering": 1, + "nav_menu": { + "height": "403px", + "width": "385px" + }, + "number_sections": false, + "sideBar": true, + "skip_h1_title": true, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": { + "height": "47.7031px", + "left": "0px", + "top": "111.125px", + "width": "165px" + }, + "toc_section_display": false, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/guide/14-deep-learning/point_cloud_classification_using_randlanet.ipynb b/guide/14-deep-learning/point_cloud_classification_using_randlanet.ipynb index 3e13b63bcc..af9a395925 100644 --- a/guide/14-deep-learning/point_cloud_classification_using_randlanet.ipynb +++ b/guide/14-deep-learning/point_cloud_classification_using_randlanet.ipynb @@ -14,7 +14,7 @@ }, "source": [ "

Table of Contents

\n", - "
" + "
" ] }, { @@ -28,7 +28,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The `arcgis.learn` module has an efficient point cloud classification model called RandLA-Net [1], which can be used to classify a large number of points in a point cloud dataset. In general, point cloud datasets are gathered using LiDAR sensors, which apply a laser beam to sample the earth's surface and generate high-precision x, y, and z points. These points, are known as 'point clouds' and are commonly generated through the use of terrestrial and airborne LiDAR.\n", + "The `arcgis.learn` module has an efficient point cloud classification model called RandLA-Net [1], which can be used to classify a large number of points in a point cloud dataset. In general, point cloud datasets are gathered using LiDAR sensors, which apply a laser beam to sample the earth's surface and generate high-precision x, y, and z points. These points, are known as 'point clouds' and are commonly generated through the use of terrestrial and airborne LiDAR.\n", "\n", "Point clouds are collections of 3D points that carry the location, measured in x, y, and z coordinates. These points also have some additional information like 'GPS timestamps', 'intensity', and 'number of returns'. The intensity represents the returning strength from the laser pulse that scanned the area, and the number of returns shows how many times a given pulse returned. LiDAR data can also be fused with RGB (red, green, and blue) bands, derived from imagery taken simultaneously with the LiDAR survey. \n", "\n", @@ -52,7 +52,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Point cloud classification is a task where each point in the point cloud is assigned a label, representing a real-world entity (see Figure 1.). And similar to how it's done in traditional methods, for deep learning, the point cloud classification process involves training – where the neural network learns from an already classified (labeled) point cloud dataset, where each point has a unique class code. These class codes are used to represent the features that we want the neural network to recognize. \n", + "Point cloud classification is a task where each point in the point cloud is assigned a label, representing a real-world entity (see Figure 1). And similar to how it's done in traditional methods, for deep learning, the point cloud classification process involves training – where the neural network learns from an already classified (labeled) point cloud dataset, where each point has a unique class code. These class codes are used to represent the features that we want the neural network to recognize. \n", "\n", "In deep learning workflows for point cloud classification, one should not use a ‘thinned-out’ representation of a point cloud dataset that preserves only class codes of interest but drops a majority of the undesired return points, as we would like the neural network to learn and be able to differentiate points of interest and those that are not. Likewise, additional attributes that are present in training datasets, for example, Intensity, RGB, number of returns, etc. will improve the model’s accuracy but could inversely affect it if those parameters are not correct in the datasets that are used for inferencing." ] @@ -63,7 +63,7 @@ "source": [ "When it comes to classifying point clouds, deep learning and neural networks are a great choice since they offer a scalable and efficient architecture. 
They have enormous potential to make manual or semi-assisted classification modes of point clouds a thing of the past. With that in mind, we can take a closer look at the RandLA-Net model included in `arcgis.learn` and how it can be used for point cloud classification.\n", "\n", - "RandLA-Net is a unique architecture that utilizes random sampling and a local feature aggregator to improve efficient learning and semantic segmentation on a large-scale for point clouds. Compared to existing approaches, RandLA-Net is up to 200 times faster and surpasses state-of-the-art benchmarks like Semantic3D and SemanticKITTI. Its effective local feature aggregation approach preserves complex local structures and delivers significant memory and computational gains over other methods [1]." + "RandLA-Net is a unique architecture that utilizes random sampling and a local feature aggregator to improve efficient learning and semantic segmentation on a large-scale for point clouds. Compared to existing approaches, RandLA-Net is up to 200 times faster and surpasses state-of-the-art benchmarks like Semantic3D and SemanticKITTI. Its effective local feature aggregation approach preserves complex local structures and delivers significant memory and computational gains over other methods [1]." ] }, { @@ -83,14 +83,14 @@ "\n", "

\n", "
\n", - "
Figure 2. The detailed architecture of RandLA-Net. (N, D) represents the number of points and feature dimension respectively. FC: Fully Connected layer, LFA: Local Feature Aggregation, RS: Random Sampling, MLP: shared Multi-Layer Perceptron, US: Up-sampling, DP: Dropout [1].
" + "
Figure 2. The detailed architecture of RandLA-Net. (N, D) represents the number of points and feature dimension respectively. FC: Fully Connected layer, LFA: Local Feature Aggregation, RS: Random Sampling, MLP: shared Multi-Layer Perceptron, US: Up-sampling, DP: Dropout [1].
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "RandLA-Net is an architecture that allows for the learning of point features within a point cloud by using an encoder-decoder sequence with skip connections. The network applies shared MLP layers along with four encoding and decoding layers, as well as three fully-connected layers and a dropout layer to predict the semantic label of each point (see Figure 2.).\n", + "RandLA-Net is an architecture that allows for the learning of point features within a point cloud by using an encoder-decoder sequence with skip connections. The network applies shared MLP layers along with four encoding and decoding layers, as well as three fully-connected layers and a dropout layer to predict the semantic label of each point (see Figure 2).\n", "\n", "\n", "- The input to the architecture is a large-scale point cloud consisting of N points with feature dimensions of din, where the batch dimension is dropped for simplicity.\n", @@ -105,7 +105,7 @@ "- The final semantic label of each point is predicted by three fully-connected layers, (N, 64) → (N, 32) → (N, nclass), and a dropout layer. The dropout ratio is 0.5.\n", "\n", "\n", - "- The output of RandLA-Net is the predicted semantics of all points, with a size of N × nclass, where nclass is the number of classes [1].\n" + "- The output of RandLA-Net is the predicted semantics of all points, with a size of N × nclass, where nclass is the number of classes [1].\n" ] }, { @@ -125,7 +125,7 @@ "\n", "

\n", "
\n", - "
Figure 3. RandLA-Net utilizes downsampling of point clouds at each layer, while still preserving important features required for precise classification [1].
" + "
Figure 3. RandLA-Net utilizes downsampling of point clouds at each layer, while still preserving important features required for precise classification [1].
" ] }, { @@ -140,7 +140,7 @@ "- attentive pooling,\n", "- and dilated residual block.\n", "\n", - "These units work together to learn complex local structures by preserving local geometric features while progressively increasing the receptive field size in each neural layer (see Figure 3.). The LocSE unit is introduced first to capture the local spatial encoding of the point. Then, the attentive pooling unit is leveraged to select the most useful local features that contribute the most to the classification task. Finally, the multiple LocSE and attentive pooling units are stacked together as a dilated residual block to further enhance the effective receptive field for each point in a computationally efficient way." + "These units work together to learn complex local structures by preserving local geometric features while progressively increasing the receptive field size in each neural layer (see Figure 3). The LocSE unit is introduced first to capture the local spatial encoding of the point. Then, the attentive pooling unit is leveraged to select the most useful local features that contribute the most to the classification task. Finally, the multiple LocSE and attentive pooling units are stacked together as a dilated residual block to further enhance the effective receptive field for each point in a computationally efficient way." ] }, { @@ -153,7 +153,7 @@ "\n", "

\n", "
\n", - "
Figure 4. Illustration of the dilated residual block which significantly increases the receptive field (dotted circle) of each point, colored points represent the aggregated features. L: Local spatial encoding, A: Attentive pooling [1].
" + "
Figure 4. Illustration of the dilated residual block which significantly increases the receptive field (dotted circle) of each point, colored points represent the aggregated features. L: Local spatial encoding, A: Attentive pooling [1].
" ] }, { @@ -164,7 +164,7 @@ "\n", "In an attentive pooling unit, the attention mechanism is used to automatically learn important local features and aggregate neighboring point features while avoiding the loss of crucial information. It also maintains the focus on the overall objective, which is to learn complex local structures in a point cloud by considering the relative importance of neighboring point features.\n", "\n", - "Lastly in the dilated residual block unit, the receptive field is increased for each point by stacking multiple LocSE and Attentive Pooling units. This dilated residual block operates by cheaply dilating the receptive field and expanding the effective neighborhood through feature propagation (see Figure 4.). Stacking more and more units enhances the receptive field and makes the block more powerful, which may compromise the overall computation efficiency and lead to overfitting. Hence, in RandLA-Net, two sets of LocSE and Attentive Pooling are stacked as a standard residual block to achieve a balance between efficiency and effectiveness [1]." + "Lastly in the dilated residual block unit, the receptive field is increased for each point by stacking multiple LocSE and Attentive Pooling units. This dilated residual block operates by cheaply dilating the receptive field and expanding the effective neighborhood through feature propagation (see Figure 4). Stacking more and more units enhances the receptive field and makes the block more powerful, which may compromise the overall computation efficiency and lead to overfitting. Hence, in RandLA-Net, two sets of LocSE and Attentive Pooling are stacked as a standard residual block to achieve a balance between efficiency and effectiveness [1]." ] }, { @@ -198,7 +198,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For this step of exporting the data into an intermediate format, use Prepare Point Cloud Training Data tool, in the 3D Analyst extension, available from ArcGIS Pro 2.8 onwards (see Figure 5.)." + "For this step of exporting the data into an intermediate format, use Prepare Point Cloud Training Data tool, in the 3D Analyst extension (see Figure 5)." ] }, { @@ -232,7 +232,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For inferencing, use Classify Points Using Trained Model tool, in the 3D Analyst extension, available from ArcGIS Pro 2.8 onwards (see Figure 6.).\n", + "For inferencing, use Classify Points Using Trained Model tool, in the 3D Analyst extension (see Figure 6).\n", "\n", "Main features available during the inferencing step:\n", " \n", @@ -285,64 +285,6 @@ "```" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting up the environment" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Make sure to update your 'GPU driver' to a recent version and use 'Administrator Rights' for all the steps, written in this guide.\n", - "\n", - "_**Below, are the instructions to set up the required 'conda environment':**_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### For ArcGIS Pro users:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Deep learning frameworks\n", - "can be used to install all the required dependencies in ArcGIS Pro's default python environment using an MSI installer. 
\n", - "\n", - "Alternatively, \n", - "for a cloned environment of ArcGIS Pro's default environment, `deep-learning-essentials` metapackage can be used to install the required dependencies which can be done using the following command, in the _`Python Command Prompt`_ (included with ArcGIS Pro):\n", - "\n", - "`conda install deep-learning-essentials`" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### For Anaconda users (Windows and Linux platforms):" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`arcgis_learn` metapackage can be used for both `windows` and `linux` installations of `Anaconda` in a new environment.\n", - "\n", - "The following command will update `Anaconda` to the latest version. \n", - "\n", - "`conda update conda`\n", - "\n", - "After that, metapackage can be installed using the command below:\n", - "\n", - "`conda install -c esri arcgis_learn=3.9`" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -410,7 +352,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "- `mask_class` functionality in `show_results()` can be used for analyzing any inter-class noises present in the validation output. This can be used to understand which classes need more diversity in training data or need an increase in its number of labeled points _(As shown below, in Figure 7.)_.\n", + "- `mask_class` functionality in `show_results()` can be used for analyzing any inter-class noises present in the validation output. This can be used to understand which classes need more diversity in training data or need an increase in its number of labeled points _(See Figure 7)_.\n", "\n", "\n", "

\n", @@ -471,16 +413,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", "[1] Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., Trigoni, N., & Markham, A. (2020). Randla-Net: Efficient semantic segmentation of large-scale point clouds. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 11105–11114. https://doi.org/10.1109/CVPR42600.2020.01112" ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python [conda env:conda-arcgispro-py3-clone] *", "language": "python", - "name": "python3" + "name": "conda-env-conda-arcgispro-py3-clone-py" }, "language_info": { "codemirror_mode": { @@ -492,7 +433,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.18" + "version": "3.11.11" }, "toc": { "base_numbering": 1, diff --git a/guide/14-deep-learning/point_cloud_classification_using_sqn.ipynb b/guide/14-deep-learning/point_cloud_classification_using_sqn.ipynb index 3ed5bb7f00..f81ce82afc 100644 --- a/guide/14-deep-learning/point_cloud_classification_using_sqn.ipynb +++ b/guide/14-deep-learning/point_cloud_classification_using_sqn.ipynb @@ -9,12 +9,10 @@ }, { "cell_type": "markdown", - "metadata": { - "toc": true - }, + "metadata": {}, "source": [ "

Table of Contents

\n", - "
" + "
" ] }, { @@ -28,7 +26,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "SQN [1] is a point cloud classification model available in the `arcgis.learn` module, designed to efficiently classify a vast amount of point clouds. Typically, LiDAR sensors use laser technology to survey the earth's surface, generating precise 3D coordinates (x, y, and z) that form point clouds. These points also have some additional information like 'GPS timestamps', 'intensity', and 'number of returns'. The intensity represents the returning strength from the laser pulse that scanned the area, and the number of returns shows how many times a given pulse returned. LiDAR data can also be fused with RGB (red, green, and blue) bands, derived from imagery taken simultaneously with the LiDAR survey. \n", + "SQN [1] is a point cloud classification model available in the `arcgis.learn` module, designed to efficiently classify a vast amount of point clouds. Typically, LiDAR sensors use laser technology to survey the earth's surface, generating precise 3D coordinates (x, y, and z) that form point clouds. These points also have some additional information like 'GPS timestamps', 'intensity', and 'number of returns'. The intensity represents the returning strength from the laser pulse that scanned the area, and the number of returns shows how many times a given pulse returned. LiDAR data can also be fused with RGB (red, green, and blue) bands, derived from imagery taken simultaneously with the LiDAR survey. \n", "\n", "Point cloud classification is based on the type of object that reflected the laser pulse. For example, a point that reflects off the ground is classified into the ground category. LiDAR points can be classified into different categories like buildings, trees, highways, water, etc. These different classes have numeric codes assigned to them." ] @@ -50,7 +48,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Point cloud classification is a task where each point in the point cloud is assigned a label, representing a real-world entity (see Figure 1.). And similar to how it's done in traditional methods, for deep learning, the point cloud classification process involves training – where the neural network learns from an already classified (labeled) point cloud dataset, where each point has a unique class code. These class codes are used to represent the features that we want the neural network to recognize. \n", + "Point cloud classification is a task where each point in the point cloud is assigned a label, representing a real-world entity (see Figure 1). And similar to how it's done in traditional methods, for deep learning, the point cloud classification process involves training – where the neural network learns from an already classified (labeled) point cloud dataset, where each point has a unique class code. These class codes are used to represent the features that we want the neural network to recognize. \n", "\n", "In deep learning workflows for point cloud classification, one should not use a ‘thinned-out’ representation of a point cloud dataset that preserves only class codes of interest but drops a majority of the undesired return points, as we would like the neural network to learn and be able to differentiate points of interest and those that are not. Likewise, additional attributes that are present in training datasets, for example, Intensity, RGB, number of returns, etc. 
will improve the model’s accuracy but could inversely affect it if those parameters are not correct in the datasets that are used for inferencing." ] @@ -61,7 +59,7 @@ "source": [ "When it comes to classifying point clouds, deep learning and neural networks are a great choice since they offer a scalable and efficient architecture. They have enormous potential to make manual or semi-assisted classification modes of point clouds a thing of the past. With that in mind, we can take a closer look at the SQN model included in `arcgis.learn` and how it can be used for point cloud classification.\n", "\n", - "SQN is a novel approach for semantic segmentation of 3D point cloud data, which can achieve high performance even with a small percentage of labeled data for training. It is based on a feature extractor that encodes the raw point cloud into a set of hierarchical latent representations, which can be queried using an arbitrary point position within a local neighborhood. The queried representations are then summarized into a compact vector, which is fed into a multilayer perceptron (MLP) to predict the final semantic label. Additionally, SQN takes into account the semantic similarity between neighboring 3D points, which allows it to back-propagate the sparse training signals to a wider spatial region and hence achieve superior performance under weak supervision [1]." + "SQN is a novel approach for semantic segmentation of 3D point cloud data, which can achieve high performance even with a small percentage of labeled data for training. It is based on a feature extractor that encodes the raw point cloud into a set of hierarchical latent representations, which can be queried using an arbitrary point position within a local neighborhood. The queried representations are then summarized into a compact vector, which is fed into a multilayer perceptron (MLP) to predict the final semantic label. Additionally, SQN takes into account the semantic similarity between neighboring 3D points, which allows it to back-propagate the sparse training signals to a wider spatial region and hence achieve superior performance under weak supervision [1]." ] }, { @@ -81,7 +79,7 @@ "\n", "

\n", "
\n", - "
Figure 2. Architecture of SQN, depicting the training stage with weak supervision. Here, only one query point is shown for simplicity [1].
" + "
Figure 2. Architecture of SQN, depicting the training stage with weak supervision. Here, only one query point is shown for simplicity [1].
" ] }, { @@ -93,7 +91,7 @@ "\n", "The Point Local Feature Extractor component of SQN aims to extract local features for all points using a hierarchical approach, while the Point Feature Query Network is designed to collect relevant features with the help of a 3D query point, using the training signals to be shared and back-propagated for the relevant points.\n", "\n", - "After the features are extracted using the Point Local Feature Extractor and relevant features are collected using the Point Feature Query Network, the unique and representative feature vector for the query point is obtained. This feature vector is then fed into the final component of SQN, which is a series of MLPs that directly infer the point semantic category (see Figure 2.) [1].\n" + "After the features are extracted using the Point Local Feature Extractor and relevant features are collected using the Point Feature Query Network, the unique and representative feature vector for the query point is obtained. This feature vector is then fed into the final component of SQN, which is a series of MLPs that directly infer the point semantic category (see Figure 2.) [1].\n" ] }, { @@ -107,7 +105,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The Point Local Feature Extractor in SQN is a hierarchical approach for extracting local features from all points in an input point cloud. It consists of multiple layers of Local Feature Aggregation (LFA) followed by a Random Sampling (RS) operation. The LFA layers enable the extraction of hierarchical point features, with four levels of feature vectors extracted after each encoding layer. The four levels are N x 32, N x 128, N x 256, and N x 512, where N represents the number of points in the input point cloud. The RS operation preserves the point location data in each feature vector. This component is not restricted to any specific backbone network, and is designed to extract diverse visual patterns from the input point cloud, allowing the network to learn more geometrically meaningful local patterns from sparse training signals (see Figure 2.)." + "The Point Local Feature Extractor in SQN is a hierarchical approach for extracting local features from all points in an input point cloud. It consists of multiple layers of Local Feature Aggregation (LFA) followed by a Random Sampling (RS) operation. The LFA layers enable the extraction of hierarchical point features, with four levels of feature vectors extracted after each encoding layer. The four levels are N x 32, N x 128, N x 256, and N x 512, where N represents the number of points in the input point cloud. The RS operation preserves the point location data in each feature vector. This component is not restricted to any specific backbone network, and is designed to extract diverse visual patterns from the input point cloud, allowing the network to learn more geometrically meaningful local patterns from sparse training signals (see Figure 2)." ] }, { @@ -155,7 +153,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For this step of exporting the data into an intermediate format, use Prepare Point Cloud Training Data tool, in the 3D Analyst extension, available from ArcGIS Pro 2.8 onwards (see Figure 3.)." + "For this step of exporting the data into an intermediate format, use Prepare Point Cloud Training Data tool, in the 3D Analyst extension (see Figure 3)." 
] }, { @@ -190,7 +188,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For inferencing, use Classify Points Using Trained Model tool, in the 3D Analyst extension, available from ArcGIS Pro 2.8 onwards (see Figure 4.).\n", + "For inferencing, use Classify Points Using Trained Model tool, in the 3D Analyst extension (see Figure 4).\n", "\n", "Main features available during the inferencing step:\n", " \n", @@ -243,64 +241,6 @@ "```" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting up the environment" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Make sure to update your 'GPU driver' to a recent version and use 'Administrator Rights' for all the steps, written in this guide.\n", - "\n", - "_**Below, are the instructions to set up the required 'conda environment':**_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### For ArcGIS Pro users:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Deep learning frameworks\n", - "can be used to install all the required dependencies in ArcGIS Pro's default python environment using an MSI installer. \n", - "\n", - "Alternatively, \n", - "for a cloned environment of ArcGIS Pro's default environment, `deep-learning-essentials` metapackage can be used to install the required dependencies which can be done using the following command, in the _`Python Command Prompt`_ (included with ArcGIS Pro):\n", - "\n", - "`conda install deep-learning-essentials`" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### For Anaconda users (Windows and Linux platforms):" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`arcgis_learn` metapackage can be used for both `windows` and `linux` installations of `Anaconda` in a new environment.\n", - "\n", - "The following command will update `Anaconda` to the latest version. \n", - "\n", - "`conda update conda`\n", - "\n", - "After that, metapackage can be installed using the command below:\n", - "\n", - "`conda install -c esri arcgis_learn python=3.9`" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -368,7 +308,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "- `mask_class` functionality in `show_results()` can be used for analyzing any inter-class noises present in the validation output. This can be used to understand which classes need more diversity in training data or need an increase in its number of labeled points _(As shown below, in Figure 5.)_.\n", + "- `mask_class` functionality in `show_results()` can be used for analyzing any inter-class noises present in the validation output. This can be used to understand which classes need more diversity in training data or need an increase in its number of labeled points _(See Figure 5)_.\n", "\n", "\n", "

\n", @@ -440,9 +380,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python [conda env:conda-arcgispro-py3-clone] *", "language": "python", - "name": "python3" + "name": "conda-env-conda-arcgispro-py3-clone-py" }, "language_info": { "codemirror_mode": { @@ -454,7 +394,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.18" + "version": "3.11.11" }, "toc": { "base_numbering": 1, @@ -467,7 +407,7 @@ "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", - "toc_cell": true, + "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", diff --git a/guide/14-deep-learning/point_cloud_object_detection_using_second.ipynb b/guide/14-deep-learning/point_cloud_object_detection_using_second.ipynb new file mode 100644 index 0000000000..b41c91ec6d --- /dev/null +++ b/guide/14-deep-learning/point_cloud_object_detection_using_second.ipynb @@ -0,0 +1,386 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Point cloud object detection using SECOND" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Table of Contents

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `arcgis.learn` module supports point cloud object detection as a downstream task using the Sparsely Embedded Convolutional Detection (SECOND) architecture [1]. This architecture enables the detection of 3D objects within a point cloud dataset, represented by 3D bounding boxes (multipatch feature class). Point cloud datasets are typically acquired using LiDAR sensors, which employ laser beams to sample the Earth's surface and generate precise x, y, and z coordinates. These points, collectively referred to as \"point clouds,\" are commonly generated through terrestrial and airborne LiDAR surveys.\n", + "\n", + "Point clouds consist of collections of 3D points that contain location information (x, y, and z coordinates) along with additional attributes such as GPS timestamps, intensity, and number of returns. Intensity reflects the strength of the returning laser pulse, while the number of returns indicates how many times a given pulse was reflected. LiDAR data can be combined with RGB (red, green, and blue) bands derived from imagery captured concurrently with the LiDAR survey.\n", + "\n", + "Point cloud object detection focuses on identifying the location of an object within a scene, rather than determining the precise class code of each point that constitutes the object. For example, this technique can be used to identify objects like cars, poles, and street furniture, which generally have consistent shapes and sizes. While LiDAR points can be classified into categories such as buildings, trees, highways, and water (each with assigned numeric codes), the object detection workflow doesn't require this class code information during either training or inference. It can detect objects as 3D multipatches even within unclassified point cloud datasets." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

\n", + "

\n", + "\n", + "
\n", + "

\n", + "
\n", + "
Figure 1. Visualization of a point cloud object detection dataset, where 3D bounding boxes represent the objects' classes.
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Point cloud object detection is a task where each bounding box (multipatch) is assigned a label representing a real-world entity (see Figure 1). Similar to traditional methods, deep learning approaches for point cloud object detection involve a training process. During training, a neural network learns from a paired dataset of point clouds (not necessarily classified) and their corresponding bounding boxes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With that in mind, we can take a closer look at the Sparsely Embedded Convolutional Detection (SECOND) model included in `arcgis.learn` and how it can be used for point cloud object detection.[1]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Sparsely Embedded Convolutional Detection (SECOND) architecture" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

\n", + "

\n", + "
\n", + "

\n", + "
\n", + "
Figure 2. Sparsely Embedded Convolutional Detection architecture.
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The SECOND detector is a novel LiDAR-based 3D object detection network that significantly improves upon previous methods by leveraging sparse convolutional networks, a new angle loss regression approach, and a unique data augmentation technique.\n", + "\n", + "The SECOND detector has a three-part architecture (see Figure 2):\n", + "\n", + "- Voxel Feature Extractor: This component converts raw point cloud data into a voxel representation. It uses Voxel Feature Encoding (VFE) layers, which consist of a linear layer, batch normalization, and a ReLU, to extract features from each voxel.\n", + "\n", + "- Sparse Convolutional Middle Extractor: This is the core of the network. It employs spatially sparse convolutional networks to extract information from the z-axis (height) and reduces the 3D data into a 2D bird's-eye view (BEV) representation. This part utilizes both submanifold convolution and standard sparse convolution for downsampling, followed by a conversion of sparse data into a dense feature map. This approach significantly improves processing speed.\n", + "\n", + "- Region Proposal Network (RPN): An SSD-like RPN takes the features from the middle extractor. It uses a series of convolutional and deconvolutional layers to generate bounding box proposals, classify objects, and refine their orientations." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sparse Convolution" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sparse convolution is the key aspect of the SECOND's efficiency. Unlike traditional dense convolutions that compute outputs for every location in a grid, sparse convolutions only compute outputs for locations where there is input data. This is highly advantageous for LiDAR point clouds, which are inherently sparse.\n", + "\n", + "It has an improved sparse convolution algorithm with a GPU-based rule generation method. This method overcomes the performance bottleneck of rule generation (determining which input points contribute to which output points) by performing it in parallel on the GPU, avoiding costly data transfers between the CPU and GPU. The rule generation algorithm first identifies unique output locations and then uses a lookup table to efficiently map input indices to output indices. By performing these computations in parallel, the improved sparse convolution achieves a substantial speedup compared to previous implementations." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Angle Loss Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The angle loss regression is a crucial innovation that enhances the accuracy of orientation estimation. Previous methods for angle regression often suffered from large loss gradients when the predicted angle and the ground truth angle differed, even though these angles represent the same physical orientation of the bounding box.\n", + "\n", + "The SECOND detector introduces a sine-error loss function. This formulation elegantly addresses the problem by using the sine of the angle difference, which naturally handles the periodicity of angles. The sine function ensures that the loss is small when the angle difference is close to 0 or π, reflecting the fact that these angles correspond to similar bounding box orientations. Additionally, an auxiliary direction classifier is used to distinguish between orientations that differ by π. 
This novel angle loss regression leads to improved orientation estimation performance." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implementation in `arcgis.learn`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When training a `Sparsely Embedded Convolutional Detection` (SECOND) model using `arcgis.learn`, the raw point cloud dataset in LAS files is first converted into blocks of points, containing a specific number of points along with corresponding 3D bounding boxes (as a multipatch feature class) for objects. Multiple available GP tools can be used to create 3D multipatch feature class. When creating these from a classified point cloud dataset Extract Objects From Point Cloud tool can be very useful." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

\n", + "

\n", + "
\n", + "

\n", + "
\n", + "
Figure 3. Prepare Point Cloud Object Detection Training Data tool in ArcGIS Pro.
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this step of exporting the data into an intermediate format, use Prepare Point Cloud Object Detection Training Data tool, in the 3D Analyst extension (see Figure 3)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These exported blocks are used to create a `data bunch` object that is passed into the `SECOND` model for training.\n", + "\n", + "```python\n", + "output_path=r'C:/project/training_data.pcotd'\n", + "data = prepare_data(output_path, dataset_type='PointCloudOD', batch_size=2)\n", + "pcd = MMDetection3D(data, model='SECOND')\n", + "pcd.fit(20)\n", + "```\n", + "After training the `SECOND` model, `average_precision_score()` method can be used to compute, per-class metrics with respect to validation data. And `save()` method can be used to save the model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

\n", + "

\n", + "
\n", + "

\n", + "
Figure 4. Detect Objects From Point Cloud Using Trained Model tool.
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For inferencing, use Detect Objects From Point Cloud Using Trained Model tool, in the 3D Analyst extension (see Figure 4).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Detailed tool references and resources for point cloud object detection using deep learning in ArcGIS Pro can be found here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### For advanced users " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also specify, additional parameters that directly affect the properties of the architecture itself, this can be done while initializing the `SECOND` architecture via `MMDetection3D`, by using the following parameters.\n", + "\n", + "- `voxel_size` List of voxel dimensions in meter[x,y,z].\n", + "\n", + "\n", + "- `voxel_points` This parameter controls the maximum number of points per voxel.\n", + "\n", + "\n", + "- `max_voxels` List of maximum number of voxels in [training, validation].\n", + "\n", + "\n", + "A typical usage with respect to API looks like:\n", + "\n", + "```python\n", + "pcd = MMDetection3D(data,\n", + " model='SECOND',\n", + " voxel_parms={'voxel_size': [0.2, 0.2, 0.3],\n", + " 'voxel_points': 10,\n", + " 'max_voxels':(16000, 40000),\n", + " }\n", + " )\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Best practices for SECOND workflow" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following tips and best practices can be used while using SECOND:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- The 3D deep learning tools in the 3D Analyst extension, takes care of the coordinate system, and related discrepancies, automatically. So, one can train a model using ArcGIS Pro on a dataset with a metric coordinate system, then use that trained model on a dataset with any other coordinate system, and vice-versa without any need for re-projection." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- High-quality labeled data will result in a better-trained model. For generalization and robustness of the trained model, significant diversity or variety should be present in the training data, in terms of geography, building architectures, terrains, object-related variations, etc." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- If the object of interest is significantly larger or smaller in size than the default value of `Block Size`, then a better value can be used for improving the results further. Like, for a dataset in a metric coordinate system, a _'warehouse'_ won't fit in a '50 meter' x '50 meter' `Block Size`, hence the `Block Size` can be increased in this case. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Through a series of experiments, it was found that an additional one or two `extra_features` apart from X, Y, and Z usually works best, in most cases. Over usage of 'extra attributes' for model training might reduce generalization, i.e. _'how generic the trained model will be'_. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Deciding which 'extra attributes' to consider, depends upon the properties of the object of interest, the nature of noise, sensor-specific attributes, etc. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- It is recommended to filter or withheld points that belong to the 'high noise' class from the dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- The default value of `max_display_point` in `show_batch()` and `show_results()` is set to '20000', keeping the rendering-related browser limitation in mind, which can occur for very dense point clouds. This value can be increased if needed, for detailed visualization, within the browser itself. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Parameters like, `classes_of_interest` and `min_points` are especially useful when training a model for SfM or mobile/terrestrial point clouds. In specific scenarios when the 'training data' is not small, these features can be very useful in speeding up the 'training time', improving the convergence during training, and addressing the class imbalance up to some extent." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Fine-tuning a pretrained model is only preferred if the 'object of interest' is either same or similar, else it is not beneficial. Otherwise, fine-tuning a pretrained model can save cost, time, and compute resources while providing better accuracy/quality in results." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- Class codes can be given a meaningful name, using `class_mapping`. The names of the class codes are saved inside the model, which is automatically retrieved by Detect Objects From Point Cloud Using Trained Model tool and Train Point Cloud Object Detection Model tool, when a trained model is loaded." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- For fine-tuning a model with default architecture settings; 'Class Structure', 'Extra Attributes', and 'Block Point Limit' should match between the pretrained model and the exported 'training data'. Apart from this the fine-tuning data should have similar sized, 3D bounding boxes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "[1] Yan, Y., Mao, Y., & Li, B. (2018). SECOND: Sparsely Embedded Convolutional Detection. Sensors, 18(10), 3337. 
https://doi.org/10.3390/s18103337" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:conda-arcgispro-py3-clone] *", + "language": "python", + "name": "conda-env-conda-arcgispro-py3-clone-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + }, + "toc": { + "base_numbering": 1, + "nav_menu": { + "height": "403px", + "width": "385px" + }, + "number_sections": false, + "sideBar": true, + "skip_h1_title": true, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": { + "height": "47.7031px", + "left": "0px", + "top": "111.125px", + "width": "165px" + }, + "toc_section_display": false, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/guide/14-deep-learning/point_cloud_segmentation_using_pointcnn.ipynb b/guide/14-deep-learning/point_cloud_segmentation_using_pointcnn.ipynb index c358f63625..24d136b64f 100644 --- a/guide/14-deep-learning/point_cloud_segmentation_using_pointcnn.ipynb +++ b/guide/14-deep-learning/point_cloud_segmentation_using_pointcnn.ipynb @@ -9,12 +9,10 @@ }, { "cell_type": "markdown", - "metadata": { - "toc": true - }, + "metadata": {}, "source": [ "

Table of Contents

\n", - "
" + "
" ] }, { @@ -28,7 +26,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The `arcgis.learn` module has an efficient point cloud classification model called PointCNN [1], which can be used to classify a large number of points in a point cloud dataset. In general, point cloud datasets are gathered using LiDAR sensors, which apply a laser beam to sample the earth's surface and generate high-precision x, y, and z points. These points, are known as 'point clouds' and are commonly generated through the use of terrestrial and airborne LiDAR.\n", + "The `arcgis.learn` module has an efficient point cloud classification model called PointCNN [1], which can be used to classify a large number of points in a point cloud dataset. In general, point cloud datasets are gathered using LiDAR sensors, which apply a laser beam to sample the earth's surface and generate high-precision x, y, and z points. These points, are known as 'point clouds' and are commonly generated through the use of terrestrial and airborne LiDAR.\n", "\n", "Point clouds are collections of 3D points that carry the location, measured in x, y, and z coordinates. These points also have some additional information like 'GPS timestamps', 'intensity', and 'number of returns'. The intensity represents the returning strength from the laser pulse that scanned the area, and the number of returns shows how many times a given pulse returned. LiDAR data can also be fused with RGB (red, green, and blue) bands, derived from imagery taken simultaneously with the LiDAR survey. \n", "\n", @@ -45,14 +43,14 @@ "\n", "

\n", "
\n", - "
Figure 1. Visualization of point cloud dataset with RGB values [3]. The features apart from x, y, and z values, such as intensity and number of returns are quite valuable for the task of classification, but at the same time, they are sensor-dependent and could become the main reasons for loss of generalization.
" + "
Figure 1. Visualization of point cloud dataset with RGB values [3]. The features apart from x, y, and z values, such as intensity and number of returns are quite valuable for the task of classification, but at the same time, they are sensor-dependent and could become the main reasons for loss of generalization.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Point cloud classification is a task where each point in the point cloud is assigned a label, representing a real-world entity (see Figure 1.). And similar to how it's done in traditional methods, for deep learning, the point cloud classification process involves training – where the neural network learns from an already classified (labeled) point cloud dataset, where each point has a unique class code. These class codes are used to represent the features that we want the neural network to recognize." + "Point cloud classification is a task where each point in the point cloud is assigned a label, representing a real-world entity (see Figure 1). And similar to how it's done in traditional methods, for deep learning, the point cloud classification process involves training – where the neural network learns from an already classified (labeled) point cloud dataset, where each point has a unique class code. These class codes are used to represent the features that we want the neural network to recognize." ] }, { @@ -65,14 +63,14 @@ "\n", "

\n", "
\n", - "
Figure 2. On the left side, raw LiDAR points can be seen. And for the same area, on the right side, we have classified points, where class codes are assigned to different colors [2].
" + "
Figure 2. On the left side, raw LiDAR points can be seen. And for the same area, on the right side, we have classified points, where class codes are assigned to different colors [2].
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In deep learning workflows for point cloud classification, one should not use a ‘thinned-out’ representation of a point cloud dataset that preserves only class codes of interest but drops a majority of the undesired return points, as we would like the neural network to learn and be able to differentiate points of interest and those that are not. Likewise, additional attributes that are present in training datasets, for example, Intensity, RGB, number of returns, etc (see Figure 2.). will improve the model’s accuracy but could inversely affect it if those parameters are not correct in the datasets that are used for inferencing." + "In deep learning workflows for point cloud classification, one should not use a ‘thinned-out’ representation of a point cloud dataset that preserves only class codes of interest but drops a majority of the undesired return points, as we would like the neural network to learn and be able to differentiate points of interest and those that are not. Likewise, additional attributes that are present in training datasets, for example, Intensity, RGB, number of returns, etc (see Figure 2). will improve the model’s accuracy but could inversely affect it if those parameters are not correct in the datasets that are used for inferencing." ] }, { @@ -111,7 +109,7 @@ "\n", "

\n", "
\n", - "
Figure 3. A Generalized representation of a PointCNN for classification architecture [1]. In each X-Conv operation, N represents the number of points in the next layer, C represents the number of channels, K represents the number of nearest neighbors and D represents the dilation rate [1].
\n" + "
Figure 3. A Generalized representation of a PointCNN for classification architecture [1]. In each X-Conv operation, N represents the number of points in the next layer, C represents the number of channels, K represents the number of nearest neighbors and D represents the dilation rate [1].
\n" ] }, { @@ -122,8 +120,8 @@ "\n", "To state it succinctly, PointCNN differs from conventional grid-based CNNs primarily due to the application of X-Conv layers. Even then, the general process is like, how CNNs are used in grid-based convolution frameworks. The main differences are with respect to:\n", "\n", - "1. The way the local regions are extracted, K ⇥K patches vs. K neighboring points around representative points (see Figure 3.).\n", - "2. The way the information from local regions is learned (Conv vs. X-Conv) [1]." + "1. The way the local regions are extracted, K ⇥K patches vs. K neighboring points around representative points (see Figure 3).\n", + "2. The way the information from local regions is learned (Conv vs. X-Conv) [1]." ] }, { @@ -150,14 +148,14 @@ "
\n", "

\n", "
\n", - "
Figure 4. A diagram illustrating the differences and similarities of hierarchical convolution and PointCNN. The process above the dotted line denotes CNN in regular grids where convolutions are recursively applied on local grid patches. The process involves grid reductions – as done similarly in raster processing or meshing reducing the grid resolution successively (4X4⇥3X3⇥2X2), while increasing the channel number (visualized by dot thickness). Similarly, in point clouds, X-Conv is recursively applied to “project” or “aggregate” information, from the neighborhoods into fewer representative points (9⇥5⇥2) but each with richer information [1].
" + "
Figure 4. A diagram illustrating the differences and similarities of hierarchical convolution and PointCNN. The process above the dotted line denotes CNN in regular grids where convolutions are recursively applied on local grid patches. The process involves grid reductions – as done similarly in raster processing or meshing reducing the grid resolution successively (4X4⇥3X3⇥2X2), while increasing the channel number (visualized by dot thickness). Similarly, in point clouds, X-Conv is recursively applied to “project” or “aggregate” information, from the neighborhoods into fewer representative points (9⇥5⇥2) but each with richer information [1].
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's illustrate this step with an example. The first step involves sampling several points, let's call it sample P from the input set of points N. Then, for the P number of points, we find K nearest neighbors from N points (see Figure 4.). This process is performed to form a local neighborhood of points for each point in P. This local neighborhood of points is then brought to a local coordinate space for each neighborhood. After these operations, we get an array of points of the shape (P, K, 3+E), where E is the number of extra features present (such as intensity, RGB values, or the number of returns), other than x, y, and z." + "Let's illustrate this step with an example. The first step involves sampling several points, let's call it sample P from the input set of points N. Then, for the P number of points, we find K nearest neighbors from N points (see Figure 4). This process is performed to form a local neighborhood of points for each point in P. This local neighborhood of points is then brought to a local coordinate space for each neighborhood. After these operations, we get an array of points of the shape (P, K, 3+E), where E is the number of extra features present (such as intensity, RGB values, or the number of returns), other than x, y, and z." ] }, { @@ -191,7 +189,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For this step of exporting the data into an intermediate format, use Prepare Point Cloud Training Data tool, in the 3D Analyst extension, available from ArcGIS Pro 2.8 onwards (see Figure 5.)." + "For this step of exporting the data into an intermediate format, use Prepare Point Cloud Training Data tool, in the 3D Analyst extension (see Figure 5)." ] }, { @@ -226,7 +224,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For inferencing, use Classify Points Using Trained Model tool, in the 3D Analyst extension, available from ArcGIS Pro 2.8 onwards (see Figure 6.).\n", + "For inferencing, use Classify Points Using Trained Model tool, in the 3D Analyst extension (see Figure 6).\n", "\n", "Main features available during the inferencing step:\n", " \n", @@ -240,9 +238,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Detailed tool references and resources for point cloud classification using deep learning in ArcGIS Pro can be found here.\n", - "\n", - "_**Note**:_ _API's_ `export_point_dataset()` _function and_ `predict_las()` _method for `PointCNN`, previously used for 'exporting' and 'inferencing' respectively, have been deprecated starting from 'ArcGIS API for Python' version 1.9.0._\n" + "Detailed tool references and resources for point cloud classification using deep learning in ArcGIS Pro can be found here." ] }, { @@ -276,64 +272,6 @@ "```" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting up the environment" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Make sure to update your 'GPU driver' to a recent version and use 'Administrator Rights' for all the steps, written in this guide.\n", - "\n", - "_**Below, are the instructions to set up the required 'conda environment':**_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### For ArcGIS Pro users:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Deep learning frameworks\n", - "can be used to install all the required dependencies in ArcGIS Pro's default python environment using an MSI installer. 
\n", - "\n", - "Alternatively, \n", - "for a cloned environment of ArcGIS Pro's default environment, `deep-learning-essentials` metapackage can be used to install the required dependencies which can be done using the following command, in the _`Python Command Prompt`_ (included with ArcGIS Pro):\n", - "\n", - "`conda install -c esri deep-learning-essentials`" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### For Anaconda users (Windows and Linux platforms):" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`arcgis_learn` metapackage can be used for both `windows` and `linux` installations of `Anaconda` in a new environment.\n", - "\n", - "The following command will update `Anaconda` to the latest version. \n", - "\n", - "`conda update conda`\n", - "\n", - "After that, metapackage can be installed using the command below:\n", - "\n", - "`conda install -c esri arcgis_learn=3.9`" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -401,7 +339,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "- `mask_class` functionality in `show_results()` can be used for analyzing any inter-class noises present in the validation output. This can be used to understand which classes need more diversity in training data or need an increase in its number of labeled points _(As shown below, in Figure 7.)_.\n", + "- `mask_class` functionality in `show_results()` can be used for analyzing any inter-class noises present in the validation output. This can be used to understand which classes need more diversity in training data or need an increase in its number of labeled points _(See Figure 7)_.\n", "\n", "\n", "

\n", @@ -477,9 +415,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python [conda env:conda-arcgispro-py3-clone] *", "language": "python", - "name": "python3" + "name": "conda-env-conda-arcgispro-py3-clone-py" }, "language_info": { "codemirror_mode": { @@ -491,7 +429,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.18" + "version": "3.11.11" }, "toc": { "base_numbering": 1, @@ -504,7 +442,7 @@ "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", - "toc_cell": true, + "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": true