no NRVO for POD with floating-point members - by Krishty

Status : 

  Deferred<br /><br />
		The product team has reviewed this issue and has deferred it for consideration at a later time.<br /><br />
		A more detailed explanation for the resolution of this particular item may have been provided in the comments section.

Sign in
to vote
ID 788471 Comments
Status Closed Workarounds
Type Bug Repros 0
Opened 5/23/2013 1:41:44 AM
Access Restriction Public


The compiler does not make use of named return value optimization if
 (1) the POD's size exceeds 64 bits
 (2) the POD contains floating-point attributes.
Return values are then always spilled to the stack. This causes performance degradation (for some of our inner loops, 20-30 % of instructions are needless MOVs, causing lots of load-hit-stores).

The problem can be solved by
 (1) adding an empty default constructor
 (2) using macros instead of functions.
(1) is no option for us because our data types rely on static initialization and lots of code is built upon initializer lists (std::initializer_list is not yet available in Visual C++).
(2) would introduce serious maintainability issues.

We think this is an optimizer bug.
Sign in to post a comment.
Posted by Eric [MSFT] on 6/10/2013 at 3:04 PM
Hi, thanks for the source code. The codegen differences you're seeing are indeed because we chose a slightly different optimization path due to the identification of POD versus non-POD. If you are curious, register allocation is affected and we introduce more spills -- those are the spills you see in the assembly code.

This is an optimization bug, and we will address that in a future release. It's getting pretty late in the product cycle for the next major release, but we will try to get it in.

In the mean time, if it's possible to use an empty constructor for performance.

Thanks for taking the time to get the repro case for us. I'm closing this MSConnect item. Feel free to re-activate it if you need anything else.

Eric Brumer
Microsoft Visual C++ Team
Posted by Krishty on 5/29/2013 at 7:50 PM
P.S.: The issue you explained (conservative volatile stores) seems very reasonable to me. I don't understand, however, why the problem disappears if I add a constructor, or if the structure is smaller, or if it does not use floating-point numbers. Does the optimizer have dedicated paths for POD, where each optimization needs to be implemented explicitly? Shouldn't both examples produce identical code regardless of the constructor or the size?
Posted by Krishty on 5/29/2013 at 2:29 AM
Hi Eric,

indeed the "true" code looks quite different. But the "true" code is thousands of lines, and I wanted to reduce it for you. I must have hit "the wrong" error.

Please have a look at the following program, which does not use volatile and is more "real world":

struct Vector {
float x;
float y;
float z;

Vector const vectorFromXYZ(float const x, float const y, float const z) {
Vector rv;
rv.x = x;
rv.y = y;
rv.z = z;
return rv;

Vector const operator + (Vector const & a, Vector const & b) {
Vector rv;
rv.x = a.x + b.x;
rv.y = a.y + b.y;
rv.z = a.z + b.z;
return rv;

__declspec(dllexport) float param0 = 1.0;
__declspec(dllexport) float param1 = 2.0;
__declspec(dllexport) float param2 = 3.0;
__declspec(dllexport) float param3 = 4.0;
__declspec(dllexport) float param4 = 5.0;
__declspec(dllexport) float param5 = 6.0;
Vector globalA;

int main() {
globalA = vectorFromXYZ(param0, param1, param2) + vectorFromXYZ(param3, param4, param5);

this yields the following assembly:

int main() {
sub         rsp,18h
globalA = vectorFromXYZ(param0, param1, param2) + vectorFromXYZ(param3, param4, param5);
movss     xmm0,dword ptr [param0 (013F0E3020h)]
movss     xmm1,dword ptr [param1 (013F0E3030h)]
addss     xmm0,dword ptr [param3 (013F0E3028h)]
addss     xmm1,dword ptr [param4 (013F0E302Ch)]
movss     dword ptr [rsp],xmm0
movss     xmm0,dword ptr [param2 (013F0E301Ch)]
movss     dword ptr [rsp+4],xmm1
mov         rax,qword ptr [rsp]
mov         qword ptr [globalA (013F0E35F8h)],rax
addss     xmm0,dword ptr [param5 (013F0E3024h)]
movss     dword ptr [rsp+8],xmm0
mov         eax,dword ptr [rsp+8]
mov         dword ptr [globalA+8h (013F0E3600h)],eax
xor         eax,eax
add         rsp,18h

Adding an empty default constructor to struct Vector:
Vector() { }
yields the assembly:
globalA = vectorFromXYZ(param0, param1, param2) + vectorFromXYZ(param3, param4, param5);
movss     xmm2,dword ptr [param0 (013F383020h)]
movss     xmm1,dword ptr [param1 (013F383030h)]
movss     xmm0,dword ptr [param2 (013F38301Ch)]
xor         eax,eax
globalA = vectorFromXYZ(param0, param1, param2) + vectorFromXYZ(param3, param4, param5);
addss     xmm2,dword ptr [param3 (013F383028h)]
addss     xmm1,dword ptr [param4 (013F38302Ch)]
addss     xmm0,dword ptr [param5 (013F383024h)]
movss     dword ptr [globalA (013F3835F8h)],xmm2
movss     dword ptr [globalA+4h (013F3835FCh)],xmm1
movss     dword ptr [globalA+8h (013F383600h)],xmm0

which does not spill.

Please consider: In any real-world scenario, the resulting vector will cause some global effect (e.g. being parameter for a WinAPI, File, or Direct3D call). So, volatile store information might also be a problem in this case.
Posted by Eric [MSFT] on 5/28/2013 at 12:25 PM
Hi, do you have an example that shows the issue you describe without the use of a volatile global? I ask because your example is hitting a very specific case of conservative optimization, which may not be representative of your real code.

Consider the example you gave (renamed the function test1):

double volatile forceSideEffect;
void test1() {
auto ra = a(1, 2);
forceSideEffect = ra.att0;
forceSideEffect = ra.att1;

We do in fact perform RVO here. The codegen you are seeing is due to an issue in the global optimizer where we are treating volatile stores conservatively. It happens in several steps:

Step 1: before optimizing, we record info about the volatile memory accesses. The compiler is looking at these instructions:
ra.att0 = 1.0;
ra.att1 = 2.0;
forceSideEffect = ra.att0; // volatile store #1
forceSideEffect = ra.att1; // volatile store #2

We record information about the volatile stores:
volatile store #1 reads from ra.att0 and writes to forceSideEffect.
volatile store #2 reads from ra.att1 and writes to forceSideEffect.

Step 2) begin optimizing. We perform copy propagation which turns the instructions into this:
ra.att0 = 1.0;
ra.att1 = 2.0;
forceSideEffect = 1.0; // volatile store #1
forceSideEffect = 2.0; // volatile store #2

Unfortunately, we have not updated the 'volatile store information' to indicate that the sources have actually changed. We still think the first volatile store reads from ra.att0, and we still think the second volatile store reads from ra.att1.

The optimizer is unable to remove the first two instructions (stores to ra.att0 and ra.att1), because it thinks another instruction is actually consuming it -- even though it is not. This is a hole in the optimizer, and I have opened a task to track fixing this in a future release.

If you change your code to not use volatile, you get the right behavior:

double x;
double y;
void test2() {
auto ra = a(1, 2);
x = ra.att0;
y = ra.att1;

movsdx xmm0, QWORD PTR __real@3ff0000000000000
movsdx xmm1, QWORD PTR __real@4000000000000000
movsdx QWORD PTR ?x@@3NA, xmm0
movsdx QWORD PTR ?y@@3NA, xmm1
ret     0

So... after all that: please let me know if you have an example of the issue you are seeing without the use of volatile to simulate side-effects. Hopefully I can then address the true issue.

Eric Brumer - Microsoft Visual C++ Team
Posted by Microsoft on 5/23/2013 at 2:30 AM
Thanks for your feedback.

We are rerouting this issue to the appropriate group within the Visual Studio Product Team for triage and resolution. These specialized experts will follow-up with your issue.
Posted by Microsoft on 5/23/2013 at 1:52 AM
Thank you for your feedback, we are currently reviewing the issue you have submitted. If this issue is urgent, please contact support directly(