Search

Replace function does not work with valid unicode characters above 65500 by Richard.S

Closed
as By Design Help for as By Design

1
0
Sign in
to vote
Type: Bug
ID: 385082
Opened: 11/30/2008 9:37:54 PM
Access Restriction: Public
0
Workaround(s)
0
User(s) can reproduce this bug
Using unicode characters above 65500 return the incorrect results when using some string functions such as replace.

The following command:
SELECT REPLACE(N'test' + NCHAR(65500), NCHAR(65500), '')

successfully strips out the added unicode character, however if you use any unicode characters above 65500 (including valid Unicode 1.0 characters), the replace function does not work correctly.

For example if you run:
SELECT REPLACE(N'test' + NCHAR(0xFFFD), NCHAR(0xFFFD), '')

The unicode character 65533 does not get stripped out, unlike all other valid unicode characters.

The workaround is to cast the string as a binary:
SELECT REPLACE(N'test' + NCHAR(0xFFFD) COLLATE Latin1_General_BIN, NCHAR(0xFFFD) COLLATE Latin1_General_BIN, '')

however, this should not be necessary if the function had worked correctly.
Details (expand)
Product Language
English

Version

SQL Server 2008 - Enterprise Edition

Category

SQL Engine

Operating System

Windows Server 2008
Operating System Language
US English
Steps to Reproduce
SELECT REPLACE(N'test' + NCHAR(0xFFFD), NCHAR(0xFFFD), '')
Actual Results
test (followed by the unicode character)
Expected Results
test (without the unicode character)

Platform

X64
File Attachments
0 attachments
Sign in to post a comment.
Posted by Microsoft on 2/2/2009 at 4:33 PM
Hello,

I’m sorry if I wasn’t specific enough. Let me try to explain the behavior in more details.

For non-binary collations REPLACE function in SQL Server uses linguistic comparison to match the pattern with parts of the input text. Even though codes U+FFF9 – U+FFFD are valid UNICODE codes they have no linguistic sort weights assigned to them. Therefore in the context of comparison (and effectively in the context of pattern matching in REPLACE function) they are considered “undefined” and are ignored.

Please note that using such “undefined” UNICODE characters in the pattern for the REPLACE function may bring more surprises. You can see the danger of it on the following example:

SELECT REPLACE(N'test' + NCHAR(0xFFFD), N't' + NCHAR(0xFFFD), 't_')

returns:
---------------------------------
t_est_


So as you see SQL Server not only matched (‘t’+U+FFFD) with itself as one would expect, but also matched it with ‘t’ that one would never expect!
One might also ask, why didn’t SQL Server match a pattern from your example (U+FFFD) with any character from the input text (‘test’)? The reason is that a pattern consisting of “undefined” characters only is considered empty and therefore cannot be matched with any input text.

As a side note: for an example of a valid UNICODE character above 65500 please look at U+FFE9. The following script works as expected:

SELECT REPLACE(N'test' + NCHAR(0xFFE9), NCHAR(0xFFE9), '_')

returning:
------------------------
test_


As you pointed out, to remove unsupported characters from a UNICODE value one needs to use a binary collation.

Thanks,
Krzysztof Kozielczyk (Microsoft)

Posted by Richard.S on 1/21/2009 at 4:25 PM
Keep in mind also that according to the link you had sent, character 0XFFFD is UNICODE 1.0 which SQL Server should support, but does not do so correctly in this case.
Posted by Richard.S on 1/19/2009 at 9:34 PM
Krzysztof Kozielczyk please read the description before dismissing the bug. I mentioned characters ABOVE 65500 and used 65501 only as an example. There are five defined unicode characters in this range. If we wanted to strip out characters using the replace function such as FFFD, they won't be stripped out.

Also, regardless if the example I used is an undefined unicode example, if it is stored in SQL, then it is a valid character for which we would expect these functions to work. In any case, if we wanted to strip these characters out (being invalid by your definition) using the replace function, they will not be stripped out without casting to binary.
Posted by Microsoft on 1/19/2009 at 5:34 PM
Hello

I’m closing the bug as “By Design”. Please let me know if you don’t feel satisfied by the clarifications we provided.

Thank you one more time for reporting this issue. I’m still very interested in learning about your scenario.

Thanks,
Krzysztof Kozielczyk

Posted by Microsoft on 1/19/2009 at 5:34 PM
Hello

I’m closing the bug as “By Design”. Please let me know if you don’t feel satisfied by the clarifications we provided.

Thank you one more time for reporting this issue. I’m still very interested in learning about your scenario.

Thanks,
Krzysztof Kozielczyk

Posted by Microsoft on 1/5/2009 at 10:06 PM
Hello,

Thank you for reporting this issue. The behavior you describe is expected, even though it may seem counter-intuitive. The code U+FFDD (65501) is not defined by Unicode standard as a valid character and SQL Server does not evaluate comparison of undefined characters. Therefore the REPLACE function does not match this character and it’s not replaced.

Here is the reference of Unicode character codes:
http://en.wikibooks.org/wiki/Unicode/Character_reference/F000-FFFF

We would be very interested to learn about the problem you’re trying to solve that requires this functionality. Are you trying to remove unsupported characters or validate the text? We would appreciate it if you could elaborate on this to help us understand the real world scenario behind this issue. This way we could make sure we do provide a way of addressing it.

Thanks,
Krzysztof Kozielczyk

Posted by Microsoft on 1/5/2009 at 10:04 PM
Hello,

Thank you for reporting this issue. The behavior you describe is expected, even though it may seem counter-intuitive. The code U+FFDD (65501) is not defined by Unicode standard as a valid character and SQL Server does not evaluate comparison of undefined characters. Therefore the REPLACE function does not match this character and it’s not replaced.

Here is the reference of Unicode character codes:
http://en.wikibooks.org/wiki/Unicode/Character_reference/F000-FFFF

We would be very interested to learn about the problem you’re trying to solve that requires this functionality. Are you trying to remove unsupported characters or validate the text? We would appreciate it if you could elaborate on this to help us understand the real world scenario behind this issue. This way we could make sure we do provide a way of addressing it.

Thanks,
Krzysztof Kozielczyk

Sign in to post a workaround.