Simd & fill optimizations #1628

Open
@dhardy

#1579 notes some unfinished business:

The Simd and __m128i etc. type generation should be equivalent, but the code is not: the Simd impls currently use fill to avoid more unsafe code here.

Notice from the above that u32x4, u16x8 and u8x16 are the same size as u128 and __m128i but cost about twice as much to generate here. This indicates the fill code may be sub-optimal.

Additionally, the __m128i impl performed even worse when transmuting a u128 value (~4.3ns or +130%). As far as I can tell, this is purely because the u128 value is returned via rax, rdx while the __m128i value is returned via rdx, r10 (with rax equal to the struct address). I don't understand this.

Optimizing Fill for such cases may not be possible without specialization, and even then it's unclear whether we'd want to, given the implied value-breaking changes.
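To make the value-breaking concern concrete, here is a minimal sketch of how the choice of generation path can change output for the same seed. Everything in it is an assumption for illustration: `SplitMix64` is a stand-in generator and the three `lanes_*` functions are hypothetical, not the code used by Fill or the Simd impls. It only shows that a byte-fill path and a single-u128 path can agree, while a per-lane path consumes the generator differently.

```rust
/// Hypothetical stand-in generator (SplitMix64); an assumption for this sketch.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Truncates a 64-bit output, discarding the upper 32 bits.
    fn next_u32(&mut self) -> u32 {
        self.next_u64() as u32
    }

    /// Byte-oriented fill: every generated byte is used, in little-endian order.
    fn fill_bytes(&mut self, dest: &mut [u8]) {
        for chunk in dest.chunks_mut(8) {
            let bytes = self.next_u64().to_le_bytes();
            chunk.copy_from_slice(&bytes[..chunk.len()]);
        }
    }
}

/// Reinterprets 16 bytes as four little-endian u32 lanes (safe code only).
fn split_lanes(bytes: [u8; 16]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for (lane, chunk) in out.iter_mut().zip(bytes.chunks_exact(4)) {
        *lane = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    out
}

/// Path A: fill a byte buffer, then split it into lanes.
fn lanes_via_fill(rng: &mut SplitMix64) -> [u32; 4] {
    let mut buf = [0u8; 16];
    rng.fill_bytes(&mut buf);
    split_lanes(buf)
}

/// Path B: build one u128 from two 64-bit outputs, then split it into lanes.
fn lanes_via_u128(rng: &mut SplitMix64) -> [u32; 4] {
    let lo = rng.next_u64() as u128;
    let hi = rng.next_u64() as u128;
    split_lanes(((hi << 64) | lo).to_le_bytes())
}

/// Path C: one generator call per lane; uses four 64-bit outputs instead of two.
fn lanes_per_element(rng: &mut SplitMix64) -> [u32; 4] {
    [rng.next_u32(), rng.next_u32(), rng.next_u32(), rng.next_u32()]
}
```

In this sketch, paths A and B yield the same lanes and leave the generator in the same state, so switching between them would not be value-breaking. Path C takes its second lane from a different generator output and advances the state twice as far, so switching to or from it would change results for users who rely on reproducible seeds.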

Optimizing SIMD impls would require either specialization or replacing the generic Simd<$ty, LANES> impls with a (large) number of specific impls.
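The second option can be sketched roughly as follows. This is an illustration under stated assumptions, not the crate's code: `Lanes<T, N>` stands in for `Simd<T, LANES>` (which needs nightly `portable_simd`), and `Generate` and `Counter` are a hypothetical trait and toy generator. A blanket `impl<T, const N: usize> Generate for Lanes<T, N>` would overlap with any type-specific impl unless specialization were available, so the macro instead emits one impl per concrete 128-bit type.

```rust
/// Hypothetical toy generator; an assumption for this sketch only.
struct Counter(u64);

impl Counter {
    fn next_u128(&mut self) -> u128 {
        self.0 = self.0.wrapping_add(1);
        (self.0 as u128).wrapping_mul(0x0123_4567_89AB_CDEF_0123_4567_89AB_CDEF)
    }
}

/// Hypothetical trait standing in for "a type that can be randomly generated".
trait Generate: Sized {
    fn generate(rng: &mut Counter) -> Self;
}

/// Stand-in for a SIMD vector type, so the sketch builds on stable Rust.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Lanes<T, const N: usize>([T; N]);

/// Emits one specific impl per (lane type, lane count) pair. Each impl draws a
/// single u128 and splits it into lanes, rather than going through a generic
/// fill path.
macro_rules! impl_lanes_from_u128 {
    ($($ty:ty, $n:literal);+ $(;)?) => {$(
        impl Generate for Lanes<$ty, $n> {
            fn generate(rng: &mut Counter) -> Self {
                const W: usize = core::mem::size_of::<$ty>();
                let bytes = rng.next_u128().to_le_bytes();
                let mut out = [0 as $ty; $n];
                for (lane, chunk) in out.iter_mut().zip(bytes.chunks_exact(W)) {
                    let mut b = [0u8; W];
                    b.copy_from_slice(chunk);
                    *lane = <$ty>::from_le_bytes(b);
                }
                Lanes(out)
            }
        }
    )+};
}

// Only 128-bit shapes are listed; wider or narrower vectors would each need
// their own entries, which is where the "(large) number of specific impls"
// comes from.
impl_lanes_from_u128! { u32, 4; u16, 8; u8, 16 }
```

The cost of this approach is visible in the macro invocation: every lane type and lane count combination has to be listed explicitly, whereas the generic impl covers them all with one definition.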

Labels: B-compiler (Breakage: needs compiler upgrade), B-value (Breakage: changes output values), C-optimisation, P-low (Priority: Low)
