CUDA N-Body Simulation (Barnes-Hut Algorithm)

A high-performance, GPU-accelerated N-body gravitational simulation implementing the Barnes-Hut algorithm for O(N log N) complexity. The project demonstrates CUDA development techniques and parallel algorithms, and can simulate up to 1 million particles simultaneously on modern NVIDIA GPUs.

Demo video: 1.Million.bodies.--.144Hz.mp4 (1 million bodies at 144 Hz)

Overview

This simulation calculates gravitational forces between particles using two approaches:

  • Brute Force Method: Direct O(N²) force calculation for baseline comparison
  • Barnes-Hut Algorithm: Hierarchical O(N log N) approximation using octree spatial decomposition

The implementation showcases modern GPU computing techniques including Morton code spatial hashing, parallel tree construction, and optimized memory access patterns.

Features

Algorithms

  • Morton Code Generation: 3D Z-order curve spatial mapping for efficient particle sorting
  • Octree Construction: GPU-parallel hierarchical spatial decomposition
  • Barnes-Hut Force Calculation: Tree traversal with configurable approximation parameter
  • Velocity Verlet Integration: Stable numerical integration for orbital mechanics
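
The velocity Verlet item above can be summarized by a minimal sketch like the one below. Kernel and variable names (verlet_position_step, pos, vel, acc) are illustrative assumptions, not the repository's actual identifiers; the new acceleration acc_new is assumed to be produced by the force kernel between the two stages.

// Velocity Verlet sketch (illustrative names, single-precision float4 arrays).
__global__ void verlet_position_step(float4* pos, const float4* vel,
                                     const float4* acc, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // x(t+dt) = x(t) + v(t)*dt + 0.5*a(t)*dt^2
    pos[i].x += vel[i].x * dt + 0.5f * acc[i].x * dt * dt;
    pos[i].y += vel[i].y * dt + 0.5f * acc[i].y * dt * dt;
    pos[i].z += vel[i].z * dt + 0.5f * acc[i].z * dt * dt;
}

__global__ void verlet_velocity_step(float4* vel, const float4* acc_old,
                                     const float4* acc_new, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // v(t+dt) = v(t) + 0.5*(a(t) + a(t+dt))*dt
    vel[i].x += 0.5f * (acc_old[i].x + acc_new[i].x) * dt;
    vel[i].y += 0.5f * (acc_old[i].y + acc_new[i].y) * dt;
    vel[i].z += 0.5f * (acc_old[i].z + acc_new[i].z) * dt;
}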

GPU Optimizations

  • Memory Coalescing: Optimized global memory access patterns
  • Thrust Library Integration: High-performance GPU sorting primitives
  • Flat Array Representation: Cache-friendly data structures avoiding pointer chasing (see the layout sketch after this list)
  • Stack-based Tree Traversal: Recursion-free GPU kernel design
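
The memory coalescing and flat-array bullets above boil down to a structure-of-arrays layout. The sketch below is a generic illustration under assumed field names (pos_mass, vel, acc); the repository's actual structs may differ.

// Structure-of-Arrays particle storage: consecutive threads access consecutive
// float4 elements, so global memory reads and writes coalesce into full
// transactions, with no per-particle structs and no pointer chasing.
struct ParticleArrays {
    float4* pos_mass;   // xyz = position, w = mass
    float4* vel;        // xyz = velocity, w unused
    float4* acc;        // xyz = acceleration, w unused
};

__global__ void scale_velocities(ParticleArrays p, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread i owns particle i
    if (i >= n) return;
    p.vel[i].x *= factor;
    p.vel[i].y *= factor;
    p.vel[i].z *= factor;
}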

Performance

  • Massive Scale Simulation: Handles 100,000 to 1,000,000+ particles simultaneously
  • Real-time Performance: Interactive rates for simulations up to 50,000 particles
  • Scalable Architecture: Linear memory scaling from thousands to millions of particles
  • Multiple Initial Conditions: Random distributions, galaxy disks, colliding clusters
  • Configurable Parameters: Gravitational constant, softening factor, approximation threshold

System Requirements

Hardware

  • NVIDIA GPU with Compute Capability 3.5 or higher
  • Minimum 2GB GPU memory (6GB+ recommended for large simulations)
  • Tested on RTX 3060 Laptop GPU (3840 CUDA cores)

Software

  • CUDA Toolkit 11.0 or later (tested with CUDA 12.9)
  • CMake 3.18 or later
  • C++14 compatible compiler
  • Windows: Visual Studio Build Tools 2019/2022
  • Linux: GCC 7.0+ or Clang 6.0+

Building the Project

Windows with Visual Studio

# Clone the repository
git clone https://github.com/Cybiii/N-Body-Simulation.git
cd N-Body-Simulation

# Build using the provided script
build.bat

# Or use CMake directly
mkdir build
cd build
cmake ..
cmake --build . --config Release

Linux/Unix

# Clone and build
git clone https://github.com/Cybiii/N-Body-Simulation.git
cd N-Body-Simulation
mkdir build
cd build
cmake ..
make -j$(nproc)

CMake Configuration

The project automatically detects CUDA installation and configures appropriate compiler flags for the target GPU architecture.

Usage

Basic Simulation

# Run the main simulation
./bin/nbody_simulation

# Run performance benchmarks
./bin/benchmark

Command Line Options

# Small-scale interactive simulation
./bin/nbody_simulation --particles 10000

# Large-scale simulation (100K particles)
./bin/nbody_simulation --particles 100000 --timesteps 500 --dt 0.01

# Maximum scale simulation (1M particles)
./bin/nbody_simulation --particles 1000000 --timesteps 100 --dt 0.005

# Galaxy simulation with realistic particle count
./bin/nbody_simulation --initial galaxy_disk --particles 250000

Available Initial Conditions

  • random_sphere: Particles randomly distributed in a sphere
  • random_cube: Uniform distribution in a cubic volume
  • galaxy_disk: Rotating disk with central bulge (see the sketch after this list)
  • colliding_galaxies: Two galaxy disks on collision course
  • benchmark: Optimized distribution for performance testing
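
As a rough idea of how a galaxy_disk style preset can be generated, the host-side sketch below places particles in a thin disk and assigns approximately circular velocities around a central bulge. The function name, constants, and mass model are assumptions for illustration, not the repository's actual generator.

// Rotating-disk initial condition sketch: v_circ = sqrt(G * M_bulge / r).
#include <cmath>
#include <random>
#include <vector>

struct Body { float x, y, z, vx, vy, vz, mass; };

std::vector<Body> make_galaxy_disk(int n, float radius, float g, float bulge_mass)
{
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    std::vector<Body> bodies(n);
    for (int i = 0; i < n; ++i) {
        float r   = radius * std::sqrt(u(rng));              // uniform over the disk area
        float phi = 6.2831853f * u(rng);                     // random azimuth
        float v   = std::sqrt(g * bulge_mass / (r + 1e-3f)); // softened circular speed
        bodies[i] = { r * std::cos(phi), r * std::sin(phi),
                      0.01f * radius * (u(rng) - 0.5f),      // thin vertical spread
                      -v * std::sin(phi), v * std::cos(phi), 0.0f,
                      1.0f };
    }
    return bodies;
}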

Performance Results

Benchmark Configuration

  • Particle Count: 1,024
  • Precision: Single precision floating point
  • Compiler: NVCC with Visual Studio 2022

Timing Results (1,024 particles baseline)

Component                   Time (ms)    Complexity
Morton Code Generation      1.47         O(N)
Particle Sorting            3.16         O(N log N)
Octree Construction         0.39         O(N)
Barnes-Hut Force Calc       0.27         O(N log N)
Brute Force Calculation     0.70         O(N²)

Total Barnes-Hut Pipeline: 5.29 ms
Force Calculation Speedup: 2.59x

Large-Scale Performance Projections

100,000 particles:

  • Force calculation: ~27ms (100x particles, log factor)
  • Expected speedup: 50-100x over brute force
  • Target: 30+ FPS for real-time simulation

1,000,000 particles:

  • Force calculation: ~270ms
  • Memory usage: ~50MB GPU memory
  • Batch processing: 3-4 FPS for million-particle simulations

Memory Scaling Capacity

  • 100K particles: ~5MB GPU memory
  • 500K particles: ~25MB GPU memory
  • 1M particles: ~50MB GPU memory (see the back-of-envelope estimate below)
  • RTX 3060 capacity: 6GB total (120x headroom for maximum scale)
  • Theoretical maximum: 2-3 million particles on RTX 3060
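
As a back-of-envelope check on the ~50MB figure, assume (purely for illustration) that each particle is stored as three float4 arrays for position+mass, velocity, and acceleration:

  3 arrays x 16 bytes x 1,000,000 particles ≈ 48 MB
  + 4-byte Morton codes and the octree node pool ≈ 50+ MB total

The exact footprint depends on the repository's actual data layout and node pool size.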

Scaling Characteristics

  • Particle Capacity: Up to 1,000,000 particles on RTX 3060 (6GB VRAM)
  • Memory Usage: Linear scaling - 50MB for 1M particles
  • Computational Complexity: O(N log N) vs O(N²) for brute force
  • Expected Performance: 50-100x speedup for 100K+ particles
  • Real-time Capability: 30+ FPS for simulations up to 50,000 particles

Algorithm Details

Morton Code Spatial Hashing

The implementation uses 30-bit Morton codes (10 bits per dimension) to map 3D particle positions to a 1D space-filling curve. This ensures spatial locality and enables efficient parallel sorting.

// Convert 3D position to Morton code
uint32_t morton_code = worldToMorton(position, bbox_min, bbox_max);
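
The worldToMorton call above ultimately relies on the classic "expand and interleave" bit trick. The helpers below are a generic 10-bit-per-axis sketch of that trick, not necessarily the repository's exact implementation; positions are assumed to be normalized to [0, 1) against the global bounding box before quantization.

// Spread the lower 10 bits of v so there are two zero bits between each
// original bit: ...abc -> ..a00b00c (standard magic-number expansion).
__device__ __forceinline__ uint32_t expand_bits_10(uint32_t v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Quantize each normalized coordinate to 10 bits and interleave into a
// 30-bit Morton code (x occupies the highest bit of each 3-bit triple).
__device__ uint32_t morton3D_10bit(float x, float y, float z)
{
    uint32_t xi = (uint32_t)fminf(fmaxf(x * 1024.0f, 0.0f), 1023.0f);
    uint32_t yi = (uint32_t)fminf(fmaxf(y * 1024.0f, 0.0f), 1023.0f);
    uint32_t zi = (uint32_t)fminf(fmaxf(z * 1024.0f, 0.0f), 1023.0f);
    return (expand_bits_10(xi) << 2) | (expand_bits_10(yi) << 1) | expand_bits_10(zi);
}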

Barnes-Hut Approximation

The algorithm uses the θ-criterion to decide when a whole tree node can be replaced by its center of mass (see the traversal sketch below):

  • If s/d < θ, where s is the side length of the node's cell and d is the distance from the particle to the node's center of mass, use the node's center of mass
  • Otherwise, recursively examine the node's children
  • The default θ = 0.5 balances accuracy and performance
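
A stack-based (recursion-free) traversal over a flat octree array might look like the sketch below. The node layout, field names, and stack size are assumptions for illustration and do not necessarily match the repository's octree structures; empty children are assumed to carry zero mass.

// Barnes-Hut acceleration on one particle via an explicit traversal stack.
struct OctreeNodeSketch {
    float4 com_mass;     // xyz = center of mass, w = total mass of the node
    float  half_width;   // half the side length s of the node's cell
    int    first_child;  // index of the first of 8 contiguous children, -1 if leaf
};

__device__ float3 barnes_hut_accel(float3 p, const OctreeNodeSketch* nodes,
                                   float theta, float softening2, float g)
{
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);
    int stack[160];              // ~7 pushes per level; generous for depth 20
    int top = 0;
    stack[top++] = 0;            // start at the root

    while (top > 0) {
        OctreeNodeSketch n = nodes[stack[--top]];
        float dx = n.com_mass.x - p.x;
        float dy = n.com_mass.y - p.y;
        float dz = n.com_mass.z - p.z;
        float dist2 = dx * dx + dy * dy + dz * dz + softening2;
        float s = 2.0f * n.half_width;

        // θ-criterion: accept the node's center of mass if it is a leaf or
        // the cell looks small enough from here (s/d < θ, i.e. s² < θ²·d²).
        if (n.first_child < 0 || s * s < theta * theta * dist2) {
            float inv_d = rsqrtf(dist2);
            float f = g * n.com_mass.w * inv_d * inv_d * inv_d;
            acc.x += f * dx;  acc.y += f * dy;  acc.z += f * dz;
        } else {
            for (int c = 0; c < 8; ++c)   // otherwise descend into the children
                stack[top++] = n.first_child + c;
        }
    }
    return acc;
}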

Tree Construction Pipeline

  1. Compute global bounding box using GPU reduction
  2. Generate Morton codes for all particles
  3. Sort particles by Morton code using Thrust (sketched below)
  4. Build octree structure from sorted particles
  5. Compute center of mass for each node
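
Steps 2-3 can be wired together on the host with Thrust's sort-by-key primitive, roughly as sketched below; compute_morton_codes stands in for this repository's Morton kernel and is only indicated in a comment.

#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void sort_particles_by_morton(thrust::device_vector<uint32_t>& codes,
                              thrust::device_vector<int>& particle_ids)
{
    // Step 2 (kernel omitted): one Morton code per particle, e.g.
    // compute_morton_codes<<<blocks, threads>>>(thrust::raw_pointer_cast(codes.data()), ...);

    // Step 3: sort particle indices by Morton code so spatially nearby
    // particles become adjacent in memory before tree construction.
    thrust::sort_by_key(codes.begin(), codes.end(), particle_ids.begin());
}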

Code Structure

N-Body-Simulation/
├── include/                    # Header files
│   ├── particle.h             # Particle data structure
│   ├── nbody_simulation.h     # Main simulation class
│   ├── initial_conditions.h   # Initial condition generators
│   ├── octree.h              # Octree data structures
│   ├── barnes_hut_force.h    # Force calculation
│   └── particle_utils.h      # Utility functions
├── src/                       # Source implementations
│   ├── main.cu               # Main simulation entry point
│   ├── particle.cu           # Particle system implementation
│   ├── nbody_simulation.cu   # Simulation loop and integration
│   ├── initial_conditions.cu # Initial condition implementations
│   ├── morton_code.cu        # Morton code algorithms
│   ├── octree.cu            # Octree construction
│   ├── barnes_hut_force.cu  # Barnes-Hut force calculation
│   ├── particle_utils.cu    # Data extraction utilities
│   └── benchmark.cu         # Performance benchmarking
├── CMakeLists.txt           # Build configuration
└── build.bat               # Windows build script

Configuration Parameters

Simulation Parameters

  • G_CONSTANT: Gravitational constant (default: 6.67430e-11 scaled)
  • SOFTENING_FACTOR: Force softening to prevent singularities (default: 0.1)
  • TIME_STEP: Integration time step (default: 0.01)
  • MAX_TIMESTEPS: Maximum simulation steps (default: 1000)

Barnes-Hut Parameters

  • THETA: Approximation parameter (default: 0.5)
  • MAX_TREE_DEPTH: Maximum octree depth (default: 20)
  • THREADS_PER_BLOCK: CUDA block size (default: 256)

Memory Parameters

  • MAX_PARTICLES: Maximum supported particles (configurable)
  • OCTREE_NODE_POOL_SIZE: Pre-allocated node pool size

Performance Optimization

GPU-Specific Optimizations

  • Coalesced Memory Access: Structured data layout for optimal bandwidth
  • Shared Memory Usage: Block-level data sharing for reduction operations (see the sketch after this list)
  • Occupancy Optimization: Tuned block sizes for target GPU architecture
  • Atomic Operation Minimization: Reduced serialization in tree construction
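
As one example of the block-level sharing mentioned above, a bounding-box style reduction (step 1 of the tree construction pipeline) can use a classic shared-memory tree reduction. The kernel below is a generic sketch reducing only the minimum x coordinate; names and the follow-up pass that combines per-block results are assumptions.

#include <math_constants.h>   // CUDART_INF_F

// Per-block minimum of the x coordinate; launch with
// blockDim.x * sizeof(float) bytes of dynamic shared memory. A second pass
// (or atomics) combines the per-block results into the global bounding box.
__global__ void block_min_x(const float4* pos, float* block_min, int n)
{
    extern __shared__ float s_min[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Out-of-range threads contribute +infinity so they never win the min.
    s_min[tid] = (i < n) ? pos[i].x : CUDART_INF_F;
    __syncthreads();

    // Tree reduction in shared memory: halve the active range each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s_min[tid] = fminf(s_min[tid], s_min[tid + stride]);
        __syncthreads();
    }
    if (tid == 0)
        block_min[blockIdx.x] = s_min[0];
}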

Algorithm Optimizations

  • Morton Code Caching: Reuse spatial hashes across timesteps when possible
  • Adaptive Tree Depth: Dynamic depth limiting based on particle distribution
  • Memory Pool Management: Pre-allocated buffers to avoid dynamic allocation
  • Vectorized Operations: SIMD-friendly data structures and algorithms

Benchmarking and Profiling

Built-in Benchmarks

The project includes comprehensive benchmarking tools:

# Run scaling analysis up to 1 million particles
./bin/benchmark --scaling --min-particles 1000 --max-particles 1000000

# Memory usage analysis for large simulations
./bin/benchmark --memory --particles 500000

# Algorithm comparison at various scales
./bin/benchmark --compare --particles 100000

Profiling Integration

Compatible with NVIDIA profiling tools:

# Profile with NSight Compute
ncu --set full ./bin/nbody_simulation --particles 10000

# Profile with NSight Systems
nsys profile ./bin/nbody_simulation --particles 10000

References

Scientific Background

  • Barnes, J. & Hut, P. (1986). "A hierarchical O(N log N) force-calculation algorithm"
  • Warren, M. S. & Salmon, J. K. (1993). "A parallel hashed oct-tree N-body algorithm"
