Performance

This document explains the performance model, benchmark methodology, and results for InlineCollections.

Performance goals

Zero allocations on the hot path for operations within capacity.
Minimal call overhead via aggressive inlining.
Cache locality through contiguous inline storage.
Predictable latency with O(1) or O(n) guarantees.

Zero-allocation philosophy

InlineCollections stores elements inline (8, 16, or 32 elements depending on type), eliminating the need to allocate on the heap.

Note: While heap allocations are removed, the struct's size grows with Capacity * sizeof(T). Passing large value types by value in hot paths can degrade performance and increase stack pressure. Whenever possible, pass collections by ref or in or use a smaller-capacity variant to minimize copying.

Collection	Heap Allocations	Capacity
InlineList8	0	8
InlineList16	0	16
InlineList32	0	32
List(capacity:8)	1	4+
InlineStack8	0	8
InlineStack16	0	16
InlineStack32	0	32
Stack()	1	10+
InlineQueue8	0	8
InlineQueue16	0	16
InlineQueue32	0	32
Queue()	1	10+

Benchmark methodology

BenchmarkDotNet configuration

[SimpleJob(RuntimeMoniker.Net80, invocationCount: 1000)]
[MemoryDiagnoser]
[MinColumn, MaxColumn, MeanColumn, MedianColumn, StdDevColumn]
public class InlineList32Benchmark { ... }

Runtime: .NET 8.0
Invocation count: 1000 per benchmark iteration
Memory diagnoser: tracks allocations and Gen2 collections
Columns: min, max, mean, median, standard deviation

Running benchmarks

cd benchmarks/InlineCollections.Benchmarks
dotnet run -c Release

This produces CSV/JSON output in BenchmarkDotNet.Artifacts/results/.

Benchmark results

InlineList32 vs List

Operation	InlineList32	List	Ratio	Allocation
Constructor (Span/Enum)	13.94 ns	102.46 ns	7.4x ⚡	0 vs 1
AddRange (Bulk Add)	21.21 ns	98.72 ns	4.7x ⚡	0 vs 1
Add (Single)	17.29 ns	60.50 ns	3.5x ⚡	0 vs 1
Contains (Search)	9.11 ns	21.69 ns	2.4x ⚡	0 vs 0
Remove (Middle)	83.58 ns	163.86 ns	2.0x ⚡	0 vs 1
Insert (Middle)	45.33 ns	84.89 ns	1.9x ⚡	0 vs 1
Indexer (Access)	9.43 ns	9.97 ns	1.1x	0 vs 0

Key Insights:

7.4x Speedup in construction: Eliminating heap allocation and using direct span copy.
Bulk Operations: AddRange is significantly faster as it avoids internal resizing and allocations.
Zero GC Pressure: All operations maintain zero bytes allocated.

2. InlineStack32 vs Stack

Operation	InlineStack32	Stack	Ratio	Allocation
Foreach Iteration	20.68 ns	155.01 ns	7.5x ⚡	0 vs 0
Creation (New)	8.36 ns	37.67 ns	4.5x ⚡	0 vs 1
Push	17.29 ns	51.03 ns	3.0x ⚡	0 vs 1
TryPush/Pop Cycle	12.14 ns	15.60 ns	1.3x ⚡	0 vs 0
Pop/Peek	15.59 ns	19.61 ns	1.3x ⚡	0 vs 0

Key Insights:

Iteration Dominance: 7.5x faster foreach loops due to the efficient ref struct enumerator.
Allocation-Free Push: 3x speedup by keeping the storage on the stack.
BCL Overhead: Even simple Peek/Pop are slower in BCL due to array indirection.

3. InlineQueue32 vs Queue

Operation	InlineQueue32	Queue	Ratio	Allocation
Foreach Iteration	2.22 μs	11.99 μs	5.4x ⚡	0 vs 1
Clear	415.7 ns	684.8 ns	1.6x ⚡	0 vs 1
WrapAround Performance	481.0 ns	594.2 ns	1.2x ⚡	0 vs 1
Enqueue/Dequeue Cycle	6.80 μs	8.30 μs	1.2x ⚡	0 vs 1

Key Insights:

Massive Iteration Gains: 5.4x faster iteration over circular buffer.
Efficient Wrapping: Using bitwise masking (& 31) provides a measurable edge over standard modulo operations.
Predictable Performance: Zero allocations even during heavy enqueue/dequeue cycles.

Cache behavior

Inline storage advantage

InlineList32 stores 128 bytes (32 ints) inline:

Cache line: 64 bytes (typical)
Fit: ~2 cache lines
Memory access: 1-2 L1 cache misses (worst case), then cache hits

List has array on heap:

Reference on stack: pointer
Array on heap: separate allocation, different cache line
Memory access: dereference (cache miss to heap), then array access

For small iterations, InlineCollections benefit from spatial locality.

Measurement

// Benchmark: iterate 100 times
var list = new InlineList32<int>(data32);  // 32 ints
int sum = 0;
for (int i = 0; i < 100; i++)
    sum += list[i % 32];

InlineList32 and List perform nearly identically (both direct memory access), but InlineList32 has advantage if the struct is kept in L1 cache.

Inlining and JIT

AggressiveInlining

Hot methods are marked with [MethodImpl(MethodImplOptions.AggressiveInlining)]:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Add(T item)
{
    if ((uint)_count >= Capacity) ThrowFull();
    Unsafe.Add(ref _buffer[0], _count++) = item;
}

This tells the JIT to inline the method even if the heuristics would normally prevent it.

JIT-compiled output

For Add(), the JIT generates (x64 pseudocode):

cmp     [rsp+offset], 32    ; if _count < 32
jge     throw_full
mov     [rsp+rax*4], rcx    ; Write element at index
inc     [rsp+offset]        ; _count++
ret

Nearly branch-free for the common case (no allocation, no indirection).

Struct size

Struct sizes

Collection	Element	Size
InlineList32	4B	132B
InlineList32	8B	260B
InlineStack32	4B	132B
InlineQueue32	4B	140B

Memory movement overhead depends on CPU architecture and scales with struct size.

Recommendation

Avoid passing by value in hot loops. Use ref parameter:

// ❌ Bad: copies struct
void Process(InlineList32<int> list) { ... }

// ✅ Good: no copy
void Process(ref InlineList32<int> list) { ... }

Allocation-free guarantee

The collections guarantee zero allocations for operations within capacity:

GC.TotalMemory(false, out long before);
var list = new InlineList32<int>();
for (int i = 0; i < 32; i++)
    list.Add(i);
GC.TotalMemory(false, out long after);

Debug.Assert(before == after);  // No allocations

When benchmarks mislead

Microbenchmarks measure isolated operations. In real workloads:

CPU caches may be cold
Branch prediction history differs
Interleaving with other operations changes access patterns
Real collections may have better lock-free properties

Guidance: Use benchmarks to validate assumptions, but always profile real workloads.

Comparison with standard collections

Aspect	InlineCollections	BCL (List, Stack, Queue)
Allocation	0	1 per instance
Capacity	Fixed 32	Dynamic
Indexer overhead	None	Bounds check
ref return	Yes	No (List only)
Thread-safe	No	No
Suitable for	Hot paths, small sets	General purpose

Performance summary

Best for InlineCollections:

Add/Enqueue when creating new collections
Tight loops with repeated small collections
Stack frames with limited depth

Best for BCL collections:

Large or unbounded working sets
API compatibility required
Thread-safety needed
Reference type semantics required

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance

Performance goals

Zero-allocation philosophy

Benchmark methodology

BenchmarkDotNet configuration

Running benchmarks

Benchmark results

InlineList32 vs List

2. InlineStack32 vs Stack

3. InlineQueue32 vs Queue

Cache behavior

Inline storage advantage

Measurement

Inlining and JIT

AggressiveInlining

JIT-compiled output

Struct size

Struct sizes

Recommendation

Allocation-free guarantee

When benchmarks mislead

Comparison with standard collections

Performance summary

FilesExpand file tree

performance.md

Latest commit

History

performance.md

File metadata and controls

Performance

Performance goals

Zero-allocation philosophy

Benchmark methodology

BenchmarkDotNet configuration

Running benchmarks

Benchmark results

InlineList32 vs List

2. InlineStack32 vs Stack

3. InlineQueue32 vs Queue

Cache behavior

Inline storage advantage

Measurement

Inlining and JIT

AggressiveInlining

JIT-compiled output

Struct size

Struct sizes

Recommendation

Allocation-free guarantee

When benchmarks mislead

Comparison with standard collections

Performance summary