Using the _mm_shuffle_epi32 instruction instead of _mm_shuffle_ps wastes xmm7 register allocation. - by clbri

ID: 725139
Status: Active
Type: Bug
Repros: 0
Opened: 2/15/2012 2:30:28 PM
Access Restriction: Public
Moderator Decision: Sent to Engineering Team for consideration


I was implementing fast batched 4x4 matrix * 4x1 vector multiplication for a math library, and was testing whether the shufps or the pshufd SSE instruction would be faster for doing the necessary shuffles.

Profiling showed that the shufps version performed far better, but not because shufps is faster than pshufd (they have the same latency, and Intel Sandy Bridge is specified to have better throughput on pshufd). Instead, the Visual Studio compiler fails to use all 8 SSE registers for the inner loop with the pshufd instruction: it uses only xmm0-xmm6 and leaves xmm7 unused altogether. With the shufps instruction, all 8 available registers are used.
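For illustration, the two shuffle variants being compared could look roughly like the sketch below. This is not the code from the original repro: the function names `mat4_mul_vec4_shufps` and `mat4_mul_vec4_pshufd`, the helper macros, and the column-major matrix layout are all assumptions made for the example. It shows a single matrix * vector product rather than the batched inner loop.

```c
#include <emmintrin.h> /* SSE2 intrinsics; also pulls in the SSE header */

/* Hypothetical helper: broadcast lane i of v using shufps (float domain). */
#define BROADCAST_SHUFPS(v, i) \
    _mm_shuffle_ps((v), (v), _MM_SHUFFLE(i, i, i, i))

/* Hypothetical helper: broadcast lane i of v using pshufd (integer domain).
   The casts only reinterpret the register and generate no instructions. */
#define BROADCAST_PSHUFD(v, i) \
    _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), \
                                       _MM_SHUFFLE(i, i, i, i)))

/* out = M * v, with M stored column-major as 16 floats (assumed layout). */
static void mat4_mul_vec4_shufps(const float *m, const float *v, float *out)
{
    __m128 x = _mm_loadu_ps(v);
    __m128 r = _mm_mul_ps(_mm_loadu_ps(m + 0), BROADCAST_SHUFPS(x, 0));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4),  BROADCAST_SHUFPS(x, 1)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8),  BROADCAST_SHUFPS(x, 2)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), BROADCAST_SHUFPS(x, 3)));
    _mm_storeu_ps(out, r);
}

/* Same computation, but every broadcast goes through pshufd instead. */
static void mat4_mul_vec4_pshufd(const float *m, const float *v, float *out)
{
    __m128 x = _mm_loadu_ps(v);
    __m128 r = _mm_mul_ps(_mm_loadu_ps(m + 0), BROADCAST_PSHUFD(x, 0));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4),  BROADCAST_PSHUFD(x, 1)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8),  BROADCAST_PSHUFD(x, 2)));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), BROADCAST_PSHUFD(x, 3)));
    _mm_storeu_ps(out, r);
}
```

Both functions compute the same result, so any difference in generated code comes from how the compiler handles the two intrinsics. One way to compare register usage would be to inspect the assembly listing the compiler emits for each variant (for example with the `/FAs` option in Visual C++).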
Posted by MS-Moderator08 [Feedback Moderator] on 2/16/2012 at 1:54 AM
Thank you for submitting feedback on Visual Studio 2010 and .NET Framework. Your issue has been routed to the appropriate VS development team for investigation. We will contact you if we require any additional information.
Posted by MS-Moderator01 on 2/15/2012 at 2:45 PM
Thank you for your feedback, we are currently reviewing the issue you have submitted. If this issue is urgent, please contact support directly(