Home Dashboard Directory Help
Search

REPLACE function extremely slow with (non-binary) Windows collation by Paul White NZ


Status: 

Closed
 as By Design Help for as By Design


8
0
Sign in
to vote
Type: Bug
ID: 512459
Opened: 11/19/2009 5:44:50 PM
Access Restriction: Public
0
Workaround(s)
view
3
User(s) can reproduce this bug

Description

The REPLACE function appears to be nearly 20x slower with an explicit Windows collation compared to an explicit SQL collation, and nearly 40x slower than a binary collation. See the repro script. The fact that there is a difference isn't surprising, but the magnitude of it is ;c)

The query plan produced using the Windows collation is subtly different from all other cases. The Compute Scalar includes an extra CONVERT (not a CONVERT_IMPLICIT) in every case except the Windows collation.

The scale of the difference makes me wonder whether the plan produced for the Windows collation is failing to take advantage of some optimization available when one of the other collations is specified? It seems slow enough to speculate that a complete copy of the processed string is taken for every row.

While I can certainly work around this, for example by including an explicit COLLATE LATIN1_GENERAL_BIN or COLLATE LATIN1_GENERAL_BIN2, I'd be interested to know a little more about the cause of the difference - the knowledge might come in useful in other situations.

Thanks,

Paul
Details
Sign in to post a comment.
Posted by Paul White NZ on 12/5/2009 at 12:34 AM
Awesome feedback - thanks Jim!
Posted by Microsoft on 12/1/2009 at 2:31 PM
Hi Paul,

Thankyou for this problem report. As you say, we expect BIN to be faster than a Windows collation, but a factor of 20+ is surprisingly huge.

Internally, for the Windows collation, the REPLACE operation converts CodePage to UCS-2; then performs a linguistic pattern match; then converts back. We need to check, but it's likely that the dominant factor is the linguistic match - that's to say, we don't simply perform even a RegEx match; we use collation weights. BIN collation avoids both conversions and makes a simple binary match. Whilst SQL collations, for the most part, are similarly optimized.

I have assigned this report over to the Developer to investigate further. However, after chatting thru the issue, we reckon this perf difference is due to a bug. It's more likely an effect of the hefty cost, in CPU cycles, of linguistic-aware algorithms.

Takeaway (at least, pending deeper investigation): for Apps that use heavy string manipulation, choose your collation carefully and benchmark, as you have done, to validate your choice.

Thanks,

Jim Hogg
Posted by Microsoft on 12/1/2009 at 2:31 PM
Hi Paul,

Thankyou for this problem report. As you say, we expect BIN to be faster than a Windows collation, but a factor of 20+ is surprisingly huge.

Internally, for the Windows collation, the REPLACE operation converts CodePage to UCS-2; then performs a linguistic pattern match; then converts back. We need to check, but it's likely that the dominant factor is the linguistic match - that's to say, we don't simply perform even a RegEx match; we use collation weights. BIN collation avoids both conversions and makes a simple binary match. Whilst SQL collations, for the most part, are similarly optimized.

I have assigned this report over to the Developer to investigate further. However, after chatting thru the issue, we reckon this perf difference is due to a bug. It's more likely an effect of the hefty cost, in CPU cycles, of linguistic-aware algorithms.

Takeaway (at least, pending deeper investigation): for Apps that use heavy string manipulation, choose your collation carefully and benchmark, as you have done, to validate your choice.

Thanks,

Jim Hogg
Sign in to post a workaround.