Description
#1579 notes some unfinished business:

> The `Simd` and `m128i` etc. type generation should be equivalent, but they're not in terms of code; the `Simd` impls currently use `fill` to avoid more `unsafe` code here.
>
> Notice from the above that `u32x4`, `u16x8` and `u8x16` are the same size as `u128` and `m128i` but cost about twice as much to generate here. This indicates the `fill` code may be sub-optimal.
>
> Additionally, the `m128i` impl performed even worse when transmuting a `u128` value (~4.3ns or +%130) which, as far as I can tell, is purely because the `u128` value is returned via `rax, rdx` while the `__m128i` value is returned via `rdx, r10` (with `rax` equal to the struct address). I don't understand this.

Optimizing `Fill` for such cases may not be possible without specialization, and even then it's unclear whether we'd want to, given the implied value-breaking changes.

Optimizing the SIMD impls would require either specialization or replacing the generic `Simd<$ty, LANES>` impls with a (large) number of specific impls.
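
As a rough illustration of the "specific impls" route, the sketch below uses a macro to stamp out one impl per concrete lane type and count. Everything here is hypothetical: `Lanes` stands in for `Simd`, `Generate` for the real distribution trait, and `next_u128` for the RNG. With concrete types `transmute` can verify the sizes at compile time, which it cannot do when the size depends on generic parameters.

```rust
/// Hypothetical stand-in for `Simd<T, N>`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Lanes<T, const N: usize>([T; N]);

/// One step of a toy LCG; an assumption for this sketch only.
fn step(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

/// Hypothetical source of 128 bits built from two LCG steps.
fn next_u128(state: &mut u64) -> u128 {
    let lo = step(state) as u128;
    let hi = step(state) as u128;
    (hi << 64) | lo
}

/// Hypothetical trait standing in for the real generation trait.
trait Generate: Sized {
    fn generate(state: &mut u64) -> Self;
}

/// Emit one concrete impl per `type => lane count` pair.
macro_rules! impl_generate {
    ($($ty:ty => $lanes:literal),+ $(,)?) => {$(
        impl Generate for Lanes<$ty, $lanes> {
            fn generate(state: &mut u64) -> Self {
                let bits = next_u128(state);
                // SAFETY: the array is 16 bytes of plain integers, so any
                // bit pattern is valid; a size mismatch fails to compile.
                Lanes(unsafe {
                    core::mem::transmute::<u128, [$ty; $lanes]>(bits)
                })
            }
        }
    )+};
}

// Only the 128-bit shapes; wider vectors would each need their own entry
// and a wider source type, which is where the "large number" comes from.
impl_generate!(u32 => 4, u16 => 8, u8 => 16);
```

A call such as `<Lanes<u16, 8> as Generate>::generate(&mut seed)` then resolves to a dedicated impl with no intermediate buffer, at the cost of one macro entry per supported type and width.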