A Crystal library providing a standardized interface for SIMD (Single Instruction, Multiple Data) operations across different CPU architectures. The library automatically selects the best available instruction set at runtime and falls back to scalar operations when SIMD is not available or slower.
- Cross-platform: Supports x86_64 (SSE2, SSE4.1, AVX2, AVX-512) and AArch64 (NEON, SVE, SVE2)
- Automatic dispatch: Detects CPU capabilities at runtime and uses the fastest available implementation
- Smart fallbacks: Operations that are slower with SIMD automatically use scalar implementations
- Zero-copy: Works directly with Crystal `Slice` types for efficient memory usage
- Add the dependency to your `shard.yml`:

  ```yaml
  dependencies:
    simd:
      github: spider-gazelle/simd
  ```

- Run `shards install`
```crystal
require "simd"

# Get the optimal SIMD implementation for this CPU
simd = SIMD.instance

# Check what's supported
puts SIMD.supported_instruction_sets # => SSE2 | SSE41 | AVX2 | AVX512

# Create some data
a = Slice(Float32).new(1024) { |i| i.to_f32 }
b = Slice(Float32).new(1024) { |i| (i * 2).to_f32 }
dst = Slice(Float32).new(1024)

# Perform SIMD-accelerated operations
simd.add(dst, a, b)      # dst[i] = a[i] + b[i]
simd.mul(dst, a, b)      # dst[i] = a[i] * b[i]
simd.fma(dst, a, b, dst) # dst[i] = a[i] * b[i] + dst[i]

# Reductions
sum = simd.sum(a)    # Returns sum of all elements
dot = simd.dot(a, b) # Returns dot product
max = simd.max(a)    # Returns maximum value

# Integer operations
u32_a = Slice(UInt32).new(1024) { |i| i.to_u32 }
u32_b = Slice(UInt32).new(1024) { |i| (i * 3).to_u32 }
u32_dst = Slice(UInt32).new(1024)

simd.xor(u32_dst, u32_a, u32_b) # Bitwise XOR
simd.bswap(u32_dst, u32_a)      # Byte swap (endian conversion)
```

AI preprocessing is one example where you might need to transform data before feeding it into a neural network:
```
f32/convert_u8_scale UInt8 -> Float32 (scale)
  Scalar 739.86k (  1.35µs) (± 0.51%) 0.0B/op 8.76× slower
    SSE2 739.97k (  1.35µs) (± 0.00%) 0.0B/op 8.76× slower
   SSE41 740.11k (  1.35µs) (± 0.00%) 0.0B/op 8.76× slower
    AVX2   3.95M (253.10ns) (± 0.68%) 0.0B/op 1.64× slower
  AVX512   6.48M (154.21ns) (± 1.41%) 0.0B/op      fastest
```
Optimal code paths can considerably speed up this processing.
| Method | Description |
|---|---|
| `add(dst, a, b)` | `dst[i] = a[i] + b[i]` |
| `sub(dst, a, b)` | `dst[i] = a[i] - b[i]` |
| `mul(dst, a, b)` | `dst[i] = a[i] * b[i]` |
| `fma(dst, a, b, c)` | `dst[i] = a[i] * b[i] + c[i]` (fused multiply-add) |
| `clamp(dst, a, lo, hi)` | `dst[i] = clamp(a[i], lo, hi)` |
| `axpby(dst, a, b, alpha, beta)` | `dst[i] = alpha * a[i] + beta * b[i]` |
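As a point of reference, the `axpby` semantics reduce to a plain scalar loop. This is a minimal sketch of what the operation computes, not the library's implementation; `scalar_axpby` is a hypothetical helper name:

```crystal
# Scalar reference for axpby: dst[i] = alpha * a[i] + beta * b[i]
def scalar_axpby(dst : Slice(Float32), a : Slice(Float32), b : Slice(Float32), alpha : Float32, beta : Float32)
  dst.size.times { |i| dst[i] = alpha * a[i] + beta * b[i] }
end

a = Slice(Float32).new(4) { |i| i.to_f32 }       # [0, 1, 2, 3]
b = Slice(Float32).new(4) { |i| (i + 1).to_f32 } # [1, 2, 3, 4]
dst = Slice(Float32).new(4)
scalar_axpby(dst, a, b, 2.0_f32, 3.0_f32)
dst # => [3.0, 8.0, 13.0, 18.0]
```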
| Method | Description |
|---|---|
| `sum(a)` | Returns sum of all elements |
| `dot(a, b)` | Returns dot product `sum(a[i] * b[i])` |
| `max(a)` | Returns maximum value |
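For clarity, the `dot` reduction is equivalent to this scalar loop (a sketch only; `scalar_dot` is a hypothetical name, not part of the library):

```crystal
# Scalar reference for dot: sum of element-wise products
def scalar_dot(a : Slice(Float32), b : Slice(Float32)) : Float32
  sum = 0.0_f32
  a.size.times { |i| sum += a[i] * b[i] }
  sum
end

a = Slice[1.0_f32, 2.0_f32, 3.0_f32]
b = Slice[4.0_f32, 5.0_f32, 6.0_f32]
scalar_dot(a, b) # => 32.0
```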
| Method | Description |
|---|---|
| `bitwise_and(dst, a, b)` | `dst[i] = a[i] & b[i]` |
| `bitwise_or(dst, a, b)` | `dst[i] = a[i] \| b[i]` |
| `xor(dst, a, b)` | `dst[i] = a[i] ^ b[i]` |
| `bswap(dst, a)` | Byte swap each element (endian conversion) |
| `popcount(a)` | Count total bits set across all bytes |
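The `bswap` and `popcount` semantics can be spelled out per element as below. This is a scalar sketch of the operations, not the library code; the helper names are hypothetical:

```crystal
# Scalar reference for bswap on UInt32: reverse the byte order
def scalar_bswap32(v : UInt32) : UInt32
  (v >> 24) | ((v >> 8) & 0x0000FF00_u32) |
    ((v << 8) & 0x00FF0000_u32) | (v << 24)
end

# Scalar reference for popcount over a slice, using Crystal's Int#popcount
def scalar_popcount(a : Slice(UInt32)) : Int32
  count = 0
  a.each { |v| count += v.popcount }
  count
end

scalar_bswap32(0x12345678_u32)           # => 0x78563412
scalar_popcount(Slice[0xF_u32, 0x3_u32]) # => 6
```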
| Method | Description |
|---|---|
| `cmp_gt_mask(mask, a, b)` | `mask[i] = a[i] > b[i] ? 0xFF : 0x00` |
| `blend(dst, t, f, mask)` | `dst[i] = mask[i] ? t[i] : f[i]` |
| `compress(dst, src, mask)` | Copy elements where `mask[i] != 0`, returns count |
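The `compress` behaviour (left-packing selected elements) can be sketched in scalar form. This assumes a `UInt8` mask matching the `0xFF`/`0x00` convention above; the mask element type and helper name are illustrative assumptions, not the library's signature:

```crystal
# Scalar reference for compress: pack elements whose mask byte is
# non-zero to the front of dst, returning how many were copied
def scalar_compress(dst : Slice(Float32), src : Slice(Float32), mask : Slice(UInt8)) : Int32
  count = 0
  src.size.times do |i|
    if mask[i] != 0
      dst[count] = src[i]
      count += 1
    end
  end
  count
end

src  = Slice[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
mask = Slice[0xFF_u8, 0x00_u8, 0xFF_u8, 0x00_u8]
dst  = Slice(Float32).new(4)
scalar_compress(dst, src, mask) # => 2 (dst begins with [1.0, 3.0])
```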
| Method | Description |
|---|---|
| `copy(dst, src)` | Fast memory copy |
| `fill(dst, value)` | Fill memory with value |
| Method | Description |
|---|---|
| `convert(dst, src)` | Convert Int16 to Float32 |
| `convert(dst, src, scale)` | Convert UInt8 to Float32 with scaling |
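The scaled conversion is the operation benchmarked above for neural-network preprocessing: widening bytes to floats while normalizing them. A scalar sketch of the semantics (the helper name is hypothetical):

```crystal
# Scalar reference for convert with scale: widen UInt8 samples to
# Float32 and multiply by a scale factor (e.g. normalize pixels to 0..1)
def scalar_convert_scale(dst : Slice(Float32), src : Slice(UInt8), scale : Float32)
  src.size.times { |i| dst[i] = src[i].to_f32 * scale }
end

pixels = Slice[0_u8, 85_u8, 170_u8, 255_u8]
floats = Slice(Float32).new(4)
scalar_convert_scale(floats, pixels, 1.0_f32 / 255.0_f32)
floats # first element 0.0, last element 1.0
```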
| Method | Description |
|---|---|
| `fir(dst, src, coeff)` | FIR filter: `dst[i] = sum(coeff[k] * src[i+k])` |
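The FIR formula slides the coefficient window across the input. A scalar sketch of the computation, assuming the output has `src.size - coeff.size + 1` elements (the library's boundary handling may differ, and the helper name is hypothetical):

```crystal
# Scalar reference for fir: each output is the dot product of the
# coefficients with a sliding window of the input
def scalar_fir(dst : Slice(Float32), src : Slice(Float32), coeff : Slice(Float32))
  dst.size.times do |i|
    acc = 0.0_f32
    coeff.size.times { |k| acc += coeff[k] * src[i + k] }
    dst[i] = acc
  end
end

src   = Slice[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32, 5.0_f32]
coeff = Slice[0.5_f32, 0.5_f32] # 2-tap moving average
dst   = Slice(Float32).new(4)
scalar_fir(dst, src, coeff)
dst # => [1.5, 2.5, 3.5, 4.5]
```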
| Method | Description |
|---|---|
| `xor_block16(dst, src, key16)` | XOR with repeating 16-byte key |
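The repeating-key XOR is self-inverse: applying it twice restores the original bytes. A scalar sketch of the semantics (the helper name is hypothetical, not the library's API):

```crystal
# Scalar reference for xor_block16: XOR each byte of src with a
# repeating 16-byte key (i & 15 cycles through the key)
def scalar_xor_block16(dst : Slice(UInt8), src : Slice(UInt8), key16 : Slice(UInt8))
  src.size.times { |i| dst[i] = src[i] ^ key16[i & 15] }
end

key  = Slice(UInt8).new(16) { |i| i.to_u8 }
data = "hello world, simd!".to_slice
out  = Slice(UInt8).new(data.size)
scalar_xor_block16(out, data, key)
scalar_xor_block16(out, out, key) # round-trips back to the original bytes
```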
You can also instantiate specific implementations directly:
```crystal
{% if flag?(:x86_64) %}
  sse2 = SIMD::SSE2.new
  avx2 = SIMD::AVX2.new if SIMD.supported_instruction_sets.avx2?
{% elsif flag?(:aarch64) %}
  neon = SIMD::NEON.new if SIMD.supported_instruction_sets.neon?
{% end %}
```

The library includes a comprehensive benchmark suite to compare implementations.
Build the benchmark binary:

```
shards build --release
```

Then run it:

```
# Run all benchmarks
./bin/benchmark

# List available benchmarks
./bin/benchmark --list

# Run specific benchmark(s)
./bin/benchmark -t add
./bin/benchmark -t add -t mul -t dot

# Show help
./bin/benchmark --help
```

Available benchmarks:

```
add         - dst = a + b
mul         - dst = a * b
fma         - dst = a * b + c
sum         - horizontal sum
dot         - dot product
max         - find maximum
clamp       - clamp to range
xor         - bitwise XOR
bswap       - byte swap
popcount    - population count
copy        - memory copy
fill        - memory fill
convert     - int16 to float conversion
xor_block16 - XOR with 16-byte key
fir         - 5-tap FIR filter
```
```
======================================================================
SIMD Benchmark
======================================================================
System Information:
  SIMD Support: SSE2 | SSE41 | AVX2 | AVX512
  Available Implementations: SIMD::SSE2, SIMD::SSE41, SIMD::AVX2, SIMD::AVX512

----------------------------------------------------------------------
add (dst = a + b) - 4096 elements
----------------------------------------------------------------------
  Scalar 957.37k (  1.04µs) (± 0.58%) 0.0B/op 7.81× slower
    SSE2 954.17k (  1.05µs) (± 0.65%) 0.0B/op 7.84× slower
   SSE41 957.07k (  1.04µs) (± 0.47%) 0.0B/op 7.82× slower
    AVX2   4.38M (228.51ns) (± 1.14%) 0.0B/op 1.71× slower
  AVX512   7.48M (133.66ns) (± 0.91%) 0.0B/op      fastest
```
Currently the library focuses on Float32 and UInt32. Future versions could add:
```crystal
# Planned API
simd.add_f64(dst, a, b)
simd.mul_f64(dst, a, b)
simd.fma_f64(dst, a, b, c)
simd.sum_f64(a)
simd.dot_f64(a, b)
```

Note: SIMD width is halved for Float64 (e.g., AVX2 processes 4 doubles vs 8 floats).
```crystal
# Signed integers
simd.add_i32(dst, a, b)
simd.add_i64(dst, a, b)

# Unsigned integers
simd.add_u64(dst, a, b)

# Smaller types (useful for image/audio processing)
simd.add_i16(dst, a, b)
simd.add(dst, a, b) # With saturation options
```

Replace type-suffixed methods with overloaded versions that dispatch based on slice type:
```crystal
# Current API
simd.mul(dst, a, b)
simd.xor(dst, a, b)

# Proposed API with overloading
simd.mul(dst, a, b)             # Dispatches to Float32 version
simd.mul(dst_f64, a_f64, b_f64) # Dispatches to Float64 version
simd.xor(dst, a, b)             # Dispatches to UInt32 version
simd.xor(dst_u64, a_u64, b_u64) # Dispatches to UInt64 version
```

```crystal
simd.div(dst, a, b) # Element-wise division
simd.sqrt(dst, a)   # Square root
simd.rsqrt(dst, a)  # Reciprocal square root (fast approximation)
simd.abs(dst, a)    # Absolute value
simd.neg(dst, a)    # Negation
simd.min(dst, a, b) # Element-wise minimum
simd.max(dst, a, b) # Element-wise maximum (binary version)
simd.floor(dst, a)  # Floor
simd.ceil(dst, a)   # Ceiling
simd.round(dst, a)  # Round to nearest
```

```crystal
simd.min(a)         # Find minimum value
simd.argmax(a)      # Index of maximum value
simd.argmin(a)      # Index of minimum value
simd.any_nonzero(a) # Check if any element is non-zero
simd.all_nonzero(a) # Check if all elements are non-zero
```

```crystal
simd.cmp_eq(mask, a, b) # Equal
simd.cmp_ne(mask, a, b) # Not equal
simd.cmp_lt(mask, a, b) # Less than
simd.cmp_le(mask, a, b) # Less than or equal
simd.cmp_ge(mask, a, b) # Greater than or equal
```

```crystal
simd.gather(dst, src, indices)  # dst[i] = src[indices[i]]
simd.scatter(dst, src, indices) # dst[indices[i]] = src[i]
```

```crystal
simd.gemv(dst, matrix, vector, alpha, beta) # Matrix-vector multiply
simd.gemm(c, a, b, alpha, beta)             # Matrix-matrix multiply
simd.transpose(dst, src, rows, cols)        # Matrix transpose
```

```crystal
simd.find_byte(haystack, needle)      # Find first occurrence of byte
simd.find_any_byte(haystack, needles) # Find first occurrence of any needle
simd.count_byte(haystack, needle)     # Count occurrences
simd.to_lowercase(dst, src)           # ASCII lowercase conversion
simd.to_uppercase(dst, src)           # ASCII uppercase conversion
```

```crystal
simd.rgb_to_grayscale(dst, src)     # Color conversion
simd.premultiply_alpha(dst, src)    # Alpha premultiplication
simd.bilinear_sample(dst, src, ...) # Bilinear interpolation
```

- Prefetching: Add software prefetch hints for large sequential operations
- Non-temporal stores: Use streaming stores for write-only large buffers
- Alignment hints: Optimize for aligned memory access patterns
- Loop unrolling: Process multiple vectors per iteration for better ILP
- Fork it (https://github.com/spider-gazelle/simd/fork)
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request
- Stephen von Takach - creator and maintainer