IL bytecode method "cpblk" badly implemented by x86 CLR - by Alexandre Mutel

Status: Fixed

This item has been fixed in the current or upcoming version of this product. A more detailed explanation for the resolution of this particular item may have been provided in the comments section.


ID: 766977
Status: Closed
Type: Bug
Repros: 2
Opened: 10/10/2012 5:57:29 PM
Access Restriction: Public

Description

This is a long-running performance caveat: the cpblk IL opcode generates an x86 instruction that operates on a per-byte basis, like this:
rep movsb dest, source

instead of using rep movsd or advanced SSE instructions. Please fix this, as it makes every cpblk run damn slow on x86.

The memcpy implementation used in the CLR must use the same optimized memcpy that we have in MSVCRT.
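
For readers who have not used the opcode directly, here is a minimal sketch of one way to reach cpblk from C#, by emitting it into a DynamicMethod. The class name CpblkCopy and the delegate CopyBlock are hypothetical names chosen for illustration; this is not the wrapper used in the original benchmark, and it assumes a runtime where System.Reflection.Emit is available.

```csharp
using System;
using System.Reflection.Emit;
using System.Runtime.InteropServices;

static class CpblkCopy
{
    // Hypothetical helper: a delegate whose whole body is
    // "ldarg.0, ldarg.1, ldarg.2, cpblk, ret".
    public static readonly Action<IntPtr, IntPtr, uint> CopyBlock = Build();

    static Action<IntPtr, IntPtr, uint> Build()
    {
        var method = new DynamicMethod(
            "CopyBlock",
            typeof(void),
            new[] { typeof(IntPtr), typeof(IntPtr), typeof(uint) },
            typeof(CpblkCopy).Module,
            true);

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0); // destination address
        il.Emit(OpCodes.Ldarg_1); // source address
        il.Emit(OpCodes.Ldarg_2); // number of bytes to copy
        il.Emit(OpCodes.Cpblk);   // the opcode this report is about
        il.Emit(OpCodes.Ret);

        return (Action<IntPtr, IntPtr, uint>)method.CreateDelegate(
            typeof(Action<IntPtr, IntPtr, uint>));
    }
}
```

Calling CpblkCopy.CopyBlock(dest, src, byteCount) with two unmanaged pointers (for example from Marshal.AllocHGlobal) then exercises whatever code the JIT generates for cpblk on the current platform.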
Posted by Deon [MSFT] on 4/29/2014 at 12:31 PM
Thank you for reporting this issue. This issue has been fixed in Visual Studio 2013. You can install a trial version of Visual Studio 2013 with the fix from: http://go.microsoft.com/?linkid=9832436
Posted by Jon Hanna on 1/10/2014 at 3:13 AM
'Performance of "rep movsb" has been greatly improving on recent hardware to the point where it is the best implementation of memcpy'

I'm running on a Core i7 2670QM, which is not the most current chip (October 2011), but it is current enough that one might imagine such chips being in use for some years to come.

In trying to decide upon the best approach to a managed memcpy for a C# project, I compared three options: a highly unwound C# loop (close to the code below, but with further unwinding from a Duff's Device-like approach), P/Invoking into msvcrt.dll, and calling into an IL assembly that had a simple wrapper on cpblk. I then tested across a variety of block sizes, with aligned and unaligned addresses, and with the same or different blocks involved (to get different cache behaviour). I don't have full benchmarks (I just wanted to find which worked better for my particular case, and then get on to the next task), but I did find cpblk to be particularly poor on 32-bit .NET:

I found that on 64-bit, with both .NET on the 4.5.1 runtime and Mono 3.3.0, cpblk was the most performant in pretty much all conditions, though msvcrt.dll begins to catch up as the block size increases. With the Mono 3.3.0 runtime on 32-bit it was a closer call, with msvcrt doing better than cpblk once the block size passed about 400 bytes. (Presumably msvcrt.dll is more performant than their version of cpblk, and that's the point at which the cost of P/Invoke becomes insignificant.)

In .NET running on 32-bit, though, cpblk was the least performant in pretty much all conditions. While msvcrt.dll started becoming more performant than the C# loop when the block size got to around 400 bytes, cpblk was consistently slower than both of the other approaches no matter what the block size, cache conditions, or whether or not the source and/or destination were aligned.

I'd certainly say there's room for improvement with modern chips. In the meantime, I'd say that those who may want to use it should test their workarounds, as code like that suggested in the comment below is only superior in some conditions.
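
For anyone wanting to test their own workaround in that way, a rough timing harness might look like the sketch below. It is not the harness used for the observations above: the names CopyBench and Time are hypothetical, it copies between managed byte arrays for simplicity, and it does not control alignment or cache state.

```csharp
using System;
using System.Diagnostics;

static class CopyBench
{
    // Runs `copy(dst, src, blockSize)` `iterations` times and returns the
    // elapsed wall-clock time in milliseconds.
    public static double Time(Action<byte[], byte[], int> copy, int blockSize, int iterations)
    {
        var src = new byte[blockSize];
        var dst = new byte[blockSize];
        new Random(42).NextBytes(src);

        copy(dst, src, blockSize); // warm-up call so JIT compilation is not timed

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            copy(dst, src, blockSize);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Passing something like (d, s, n) => Buffer.BlockCopy(s, 0, d, 0, n) as the copy routine, and repeating the call for block sizes either side of 400 bytes, gives a first impression of where each approach wins on a given machine and runtime.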
Posted by Jan [MSFT] on 1/30/2013 at 6:00 PM
Discard the example in my previous reply. To make the unwound loop perform well on the current x86 JIT, it has to be coded slightly differently:

     static unsafe void BulkCopy32Bit(int * dest, int * src, int len)
     {
         while (len > 3)
         {
             int a = src[0];
             int b = src[1];
             len -= 4;
             dest[0] = a;
             dest[1] = b;
             a = src[2];
             b = src[3];
             src += 4;
             dest[2] = a;
             dest[3] = b;
             dest += 4;
         }
         while (len > 0)
         {
             *dest = *src;
             src++; dest++;
             len--;
         }
     }

Jan Kotas
CLR
Posted by Jan [MSFT] on 1/30/2013 at 5:41 PM
Managed C++ is not the only emitter of the cpblk instruction. Another (maybe even more common) emitter of cpblk is the IL stub marshallers.

The key factor that made me leave things as they are for cpblk is the good performance of "rep movsb" on processors that are sold these days. Let's assume that I make the fix to call the CRT memcpy equivalent today. It will take years for the CLR with the fix to be installed on the majority of machines. By that time, the typical machine the fixed CLR runs on is likely to have a processor with a performant implementation of "rep movsb", and so the fix is not going to make much positive difference. I hope that makes sense.

I understand that we do not have a solution right now for your problem of copying memory in a portable way. The best workaround I can offer in the meantime is to use a custom portable implementation of memcpy with unwound loops, like the example I have attached below. Of course, it will be slower than the finely tuned CRT implementation, but it will be significantly faster than cpblk on processors with a poorly performing "rep movsb". It is also significantly faster than CustomCopy from your benchmark. For best results, it helps to align the destination before running the unwound loop, and to use 64-bit longs on 64-bit platforms and 32-bit ints on 32-bit platforms.

Jan Kotas
CLR

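        static unsafe void BulkCopy32Bit(int * dest, int * src, int len)
        {
            // len is the number of 32-bit ints to copy, not the number of bytes.
            // Main loop: copy four ints per iteration.
            while (len >= 4)
            {
                int a = src[0];
                int b = src[1];
                int c = src[2];
                int d = src[3];
                src += 4;
                len -= 4;
                dest[0] = a;
                dest[1] = b;
                dest[2] = c;
                dest[3] = d;
                dest += 4;
            }
            // Tail loop: copy the remaining 0 to 3 ints one at a time.
            while (len > 0)
            {
                *dest = *src;
                src++; dest++;
                len--;
            }
        }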
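
The advice above to use 64-bit longs on 64-bit platforms could be sketched as follows. This is an illustrative variant of the same unwound loop, not code from the original reply: the names BulkCopy and BulkCopy64Bit are hypothetical, len is assumed to count 64-bit elements rather than bytes, and the destination alignment step is left out.

```csharp
static class BulkCopy
{
    // Copies `len` 64-bit elements from src to dest, four per iteration.
    // Needs to be compiled with unsafe code enabled.
    public static unsafe void BulkCopy64Bit(long* dest, long* src, int len)
    {
        while (len >= 4)
        {
            long a = src[0];
            long b = src[1];
            long c = src[2];
            long d = src[3];
            src += 4;
            len -= 4;
            dest[0] = a;
            dest[1] = b;
            dest[2] = c;
            dest[3] = d;
            dest += 4;
        }
        // Tail loop: copy the remaining 0 to 3 elements one at a time.
        while (len > 0)
        {
            *dest = *src;
            src++; dest++;
            len--;
        }
    }
}
```

A caller copying a byte count that is not a multiple of 8 would still have to handle the last few bytes separately.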
Posted by Alexandre Mutel on 1/28/2013 at 10:07 PM
Thanks for the feedback and the fixes, though I have some concerns about 'cpblk' not being fixed:

rep movsb is indeed fine for small blocks. The problem, though, is that I'm using cpblk to memcpy large blocks of memory from unmanaged to unmanaged memory, and unfortunately this is the only way to do it in a portable way (to run on Windows Desktop, WinRT and Windows Phone 8). Afaik, all 'memcpy' methods in .NET always work with at least one managed array (System.Buffer.BlockCopy, Marshal.Copy). When using interop with unmanaged code, memcpy from unmanaged to unmanaged is inevitable.

Also, if I remember well, when using C++/CLI the standard memcpy is translated to IL cpblk. A developer would expect its performance to be on par with the real memcpy used in C++, but this is not the case. This is the reason why I was using cpblk in the first place, as it is the actual bytecode used in C++/CLI.

Looking at the old SSCLI20 sources, emit_CPBLK was redirected to a straight CRT memcpy, but it seems that this is not the case here.

Also, in which cases is cpblk explicitly used for small blocks in the CLR? I thought that it was only generated in C++/CLI scenarios... (copying structs? I don't remember struct copies using cpblk; I would rather expect cpobj)
Posted by Jan [MSFT] on 1/25/2013 at 7:01 AM
Thank you for reporting this performance issue, and for including a thorough micro benchmark to demonstrate it.

The next version of the .NET Framework will include fixes that give all APIs exercised by your micro benchmark similar performance for large blocks on new hardware. The performance for small blocks will vary by design, because the different APIs have different fixed costs, like argument validation.

I have not implemented your suggestion to compile the cpblk IL opcode into a call to the CRT memcpy. This optimization would only be beneficial on older hardware, and only for large block sizes that are not common for cpblk. It would hurt performance for the small block sizes that cpblk is typically used for.

Performance of "rep movsb" has been greatly improving on recent hardware to the point where it is the best implementation of memcpy. You can find details in the current version of The Intel Architecture Optimization Manual, or in the current memcpy CRT implementation at "C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\crt\src\intel\memcpy.asm" in the default Visual Studio 11 installation.

Jan Kotas
CLR
Posted by Microsoft on 10/10/2012 at 9:49 PM
Thanks for your feedback.

We are rerouting this issue to the appropriate group within the Visual Studio Product Team for triage and resolution. These specialized experts will follow up on your issue.
Posted by Microsoft on 10/10/2012 at 6:52 PM
Thank you for your feedback. We are currently reviewing the issue you have submitted. If this issue is urgent, please contact support directly (http://support.microsoft.com).
Posted by Alexandre Mutel on 10/10/2012 at 6:28 PM
Forgot to mention that I also published two years ago a benchmark about this problem: http://code4k.blogspot.com/2010/10/high-performance-memcpy-gotchas-in-c.html