A high-performance GPU-accelerated N-body gravitational simulation implementing the Barnes-Hut algorithm for O(N log N) complexity. This project uses CUDA development techniques and parallel algorithms, capable of simulating up to 1 million particles simultaneously on modern NVIDIA GPUs.
1.Million.bodies.--.144Hz.mp4
This simulation calculates gravitational forces between particles using two approaches:
- Brute Force Method: Direct O(N²) force calculation for baseline comparison
- Barnes-Hut Algorithm: Hierarchical O(N log N) approximation using octree spatial decomposition
The implementation showcases modern GPU computing techniques including Morton code spatial hashing, parallel tree construction, and optimized memory access patterns.
- Morton Code Generation: 3D Z-order curve spatial mapping for efficient particle sorting
- Octree Construction: GPU-parallel hierarchical spatial decomposition
- Barnes-Hut Force Calculation: Tree traversal with configurable approximation parameter
- Velocity Verlet Integration: Stable numerical integration for orbital mechanics
- Memory Coalescing: Optimized global memory access patterns
- Thrust Library Integration: High-performance GPU sorting primitives
- Flat Array Representation: Cache-friendly data structures avoiding pointer chasing
- Stack-based Tree Traversal: Recursion-free GPU kernel design
- Massive Scale Simulation: Handles 100,000 to 1,000,000+ particles simultaneously
- Real-time Performance: Interactive rates for simulations up to 50,000 particles
- Scalable Architecture: Linear memory scaling from thousands to millions of particles
- Multiple Initial Conditions: Random distributions, galaxy disks, colliding clusters
- Configurable Parameters: Gravitational constant, softening factor, approximation threshold
- NVIDIA GPU with Compute Capability 3.5 or higher
- Minimum 2GB GPU memory (6GB+ recommended for large simulations)
- Tested on RTX 3060 Laptop GPU (3840 CUDA cores)
- CUDA Toolkit 11.0 or later (tested with CUDA 12.9)
- CMake 3.18 or later
- C++14 compatible compiler
- Windows: Visual Studio Build Tools 2019/2022
- Linux: GCC 7.0+ or Clang 6.0+
# Clone the repository
git clone
cd N-Body-Simulation
# Build using the provided script
build.bat
# Or use CMake directly
mkdir build
cd build
cmake ..
cmake --build . --config Release# Clone and build
git clone
cd N-Body-Simulation
mkdir build
cd build
cmake ..
make -j$(nproc)The project automatically detects CUDA installation and configures appropriate compiler flags for the target GPU architecture.
# Run the main simulation
./bin/nbody_simulation
# Run performance benchmarks
./bin/benchmark# Small-scale interactive simulation
./bin/nbody_simulation --particles 10000
# Large-scale simulation (100K particles)
./bin/nbody_simulation --particles 100000 --timesteps 500 --dt 0.01
# Maximum scale simulation (1M particles)
./bin/nbody_simulation --particles 1000000 --timesteps 100 --dt 0.005
# Galaxy simulation with realistic particle count
./bin/nbody_simulation --initial galaxy_disk --particles 250000random_sphere: Particles randomly distributed in a sphererandom_cube: Uniform distribution in a cubic volumegalaxy_disk: Rotating disk with central bulgecolliding_galaxies: Two galaxy disks on collision coursebenchmark: Optimized distribution for performance testing
- Particle Count: 1,024
- Precision: Single precision floating point
- Compiler: NVCC with Visual Studio 2022
Component Time (ms) Complexity
Morton Code Generation 1.47 O(N)
Particle Sorting 3.16 O(N log N)
Octree Construction 0.39 O(N)
Barnes-Hut Force Calc 0.27 O(N log N)
Brute Force Calculation 0.70 O(N²)
Total Barnes-Hut Pipeline: 5.29 ms
Force Calculation Speedup: 2.59x
100,000 particles:
- Force calculation: ~27ms (100x particles, log factor)
- Expected speedup: 50-100x over brute force
- Target: 30+ FPS for real-time simulation
1,000,000 particles:
- Force calculation: ~270ms
- Memory usage: ~50MB GPU memory
- Batch processing: 3-4 FPS for million-particle simulations
- 100K particles: ~5MB GPU memory
- 500K particles: ~25MB GPU memory
- 1M particles: ~50MB GPU memory
- RTX 3060 capacity: 6GB total (120x headroom for maximum scale)
- Theoretical maximum: 2-3 million particles on RTX 3060
- Particle Capacity: Up to 1,000,000 particles on RTX 3060 (6GB VRAM)
- Memory Usage: Linear scaling - 50MB for 1M particles
- Computational Complexity: O(N log N) vs O(N²) for brute force
- Expected Performance: 50-100x speedup for 100K+ particles
- Real-time Capability: 30+ FPS for simulations up to 50,000 particles
The implementation uses 30-bit Morton codes (10 bits per dimension) to map 3D particle positions to a 1D space-filling curve. This ensures spatial locality and enables efficient parallel sorting.
// Convert 3D position to Morton code
uint32_t morton_code = worldToMorton(position, bbox_min, bbox_max);The algorithm uses the θ-criterion to determine when to use center-of-mass approximation:
- If s/d < θ, use the node's center of mass
- Otherwise, recursively examine child nodes
- Default θ = 0.5 balances accuracy and performance
- Compute global bounding box using GPU reduction
- Generate Morton codes for all particles
- Sort particles by Morton code using Thrust
- Build octree structure from sorted particles
- Compute center of mass for each node
N-Body-Simulation/
├── include/ # Header files
│ ├── particle.h # Particle data structure
│ ├── nbody_simulation.h # Main simulation class
│ ├── initial_conditions.h # Initial condition generators
│ ├── octree.h # Octree data structures
│ ├── barnes_hut_force.h # Force calculation
│ └── particle_utils.h # Utility functions
├── src/ # Source implementations
│ ├── main.cu # Main simulation entry point
│ ├── particle.cu # Particle system implementation
│ ├── nbody_simulation.cu # Simulation loop and integration
│ ├── initial_conditions.cu # Initial condition implementations
│ ├── morton_code.cu # Morton code algorithms
│ ├── octree.cu # Octree construction
│ ├── barnes_hut_force.cu # Barnes-Hut force calculation
│ ├── particle_utils.cu # Data extraction utilities
│ └── benchmark.cu # Performance benchmarking
├── CMakeLists.txt # Build configuration
└── build.bat # Windows build script
G_CONSTANT: Gravitational constant (default: 6.67430e-11 scaled)SOFTENING_FACTOR: Force softening to prevent singularities (default: 0.1)TIME_STEP: Integration time step (default: 0.01)MAX_TIMESTEPS: Maximum simulation steps (default: 1000)
THETA: Approximation parameter (default: 0.5)MAX_TREE_DEPTH: Maximum octree depth (default: 20)THREADS_PER_BLOCK: CUDA block size (default: 256)
MAX_PARTICLES: Maximum supported particles (configurable)OCTREE_NODE_POOL_SIZE: Pre-allocated node pool size
- Coalesced Memory Access: Structured data layout for optimal bandwidth
- Shared Memory Usage: Block-level data sharing for reduction operations
- Occupancy Optimization: Tuned block sizes for target GPU architecture
- Atomic Operation Minimization: Reduced serialization in tree construction
- Morton Code Caching: Reuse spatial hashes across timesteps when possible
- Adaptive Tree Depth: Dynamic depth limiting based on particle distribution
- Memory Pool Management: Pre-allocated buffers to avoid dynamic allocation
- Vectorized Operations: SIMD-friendly data structures and algorithms
The project includes comprehensive benchmarking tools:
# Run scaling analysis up to 1 million particles
./bin/benchmark --scaling --min-particles 1000 --max-particles 1000000
# Memory usage analysis for large simulations
./bin/benchmark --memory --particles 500000
# Algorithm comparison at various scales
./bin/benchmark --compare --particles 100000Compatible with NVIDIA profiling tools:
# Profile with NSight Compute
ncu --set full ./bin/nbody_simulation --particles 10000
# Profile with NSight Systems
nsys profile ./bin/nbody_simulation --particles 10000- Barnes, J. & Hut, P. (1986). "A hierarchical O(N log N) force-calculation algorithm"
- Warren, M. S. & Salmon, J. K. (1993). "A parallel hashed oct-tree N-body algorithm"