
Extend Regex to process Unicode characters, not UTF-16 code units by Bradley Grainger


Status: Closed as Won't Fix


Type: Suggestion
ID: 357780
Opened: 7/25/2008 10:27:00 AM
Access Restriction: Public

Description

Strings in .NET are, of course, represented as a sequence of UTF-16 code units. Unicode characters that fall outside the Basic Multilingual Plane (BMP) are encoded as a surrogate pair, that is, two .NET chars. For example, U+10123 (Aegean Number Two Thousand) is encoded in UTF-16 as 0xD800 0xDD23. That is, "\U00010123" and "\uD800\uDD23" are strings containing the same sequence of characters.
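The arithmetic behind that example can be sketched as follows. This is a minimal illustration of the standard UTF-16 surrogate calculation, not part of the original report; the class and method names (SurrogateDemo, ToSurrogates) are made up for this sketch.

```csharp
using System;

public static class SurrogateDemo
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate code units.
    // Hypothetical helper; the name is an assumption for illustration.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }
}
```

For U+10123 the offset is 0x0123, so the high surrogate is 0xD800 and the low surrogate is 0xDC00 + 0x123 = 0xDD23. The built-in char.ConvertFromUtf32(0x10123) produces the same two-char string.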

The System.Text.RegularExpressions.Regex class appears to operate on UTF-16 code units, not Unicode characters. For example, Regex.IsMatch("\U00010123", @"^.{2}$") returns true, meaning that it matched two characters, even though the string really contains only one Unicode character. It is clear that the . is matching each of the two surrogate code units used to encode the character.

The problem with this is that it's almost impossible to use regular expressions with strings that contain non-BMP characters. The "." metacharacter will only match half a Unicode character and return an unmatched surrogate. The \p{} character class doesn't actually match characters with the specified Unicode general category if the characters fall outside the BMP. For example, Regex.IsMatch("\U00010001", @"\p{L}") returns false, even though U+10001 (the character in the string) has a Unicode category of Lo. For more, see http://code.logos.com/blog/2008/07/net_regular_expressions_and_unicode.html.
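For comparison, the same questions can be answered correctly outside of Regex by walking the string one code point at a time. The following is only a sketch of that idea, not something proposed in the report; CodePointDemo and its methods are hypothetical names, while char.IsSurrogatePair and CharUnicodeInfo.GetUnicodeCategory(string, int) are existing framework methods.

```csharp
using System;
using System.Globalization;

public static class CodePointDemo
{
    // Counts Unicode code points, treating each valid surrogate pair as one.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate of the pair
            count++;
        }
        return count;
    }

    // True if any code point in s has a letter general category
    // (Lu, Ll, Lt, Lm or Lo), including characters outside the BMP.
    public static bool ContainsLetter(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // This overload reads a full surrogate pair starting at index i.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, i);
            if (category == UnicodeCategory.UppercaseLetter ||
                category == UnicodeCategory.LowercaseLetter ||
                category == UnicodeCategory.TitlecaseLetter ||
                category == UnicodeCategory.ModifierLetter ||
                category == UnicodeCategory.OtherLetter)
            {
                return true;
            }
            if (char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate of the pair
        }
        return false;
    }
}
```

With these helpers, CodePointDemo.CountCodePoints("\U00010123") reports one character and CodePointDemo.ContainsLetter("\U00010001") reports a letter, which is the answer the \p{L} example above fails to give.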

I suggest that a new mode of behavior be added, which changes the regex engine to operate on Unicode characters, not UTF-16 code units. For backwards compatibility, this would need to be an opt-in behavior, probably specified by supplying a new option, for example, RegexOptions.Unicode.

This new mode of behavior would give the following sample output:

Regex.IsMatch("\U00010123", @"^.$", RegexOptions.Unicode) => returns true; there is exactly one character in the string

Regex.IsMatch("\U00010123", @"\p{N}", RegexOptions.Unicode) => returns true; there is a character with a Unicode general category of "Number" in the string

Regex.IsMatch("\U00010123", @"\p{Cs}", RegexOptions.Unicode) => returns false; a match for \p{Cs} would never succeed (in Unicode mode), because there's no such thing as a surrogate character (http://blogs.msdn.com/michkap/archive/2005/07/27/444101.aspx).
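Until such an option exists, the first sample can be approximated in today's code-unit mode by spelling out "one code point" by hand. This is a rough sketch under that assumption, not the requested feature; CodePointPattern, AnyCodePoint and IsSingleCodePoint are hypothetical names. Note that, unlike ".", the hand-written pattern also matches "\n".

```csharp
using System.Text.RegularExpressions;

public static class CodePointPattern
{
    // Matches exactly one Unicode code point: either a high surrogate
    // followed by a low surrogate, or a single non-surrogate code unit.
    public const string AnyCodePoint =
        @"(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDFFF])";

    // True if s consists of exactly one code point. \A and \z anchor to
    // the absolute start and end, so a trailing newline is not ignored.
    public static bool IsSingleCodePoint(string s)
    {
        return Regex.IsMatch(s, @"\A" + AnyCodePoint + @"\z");
    }
}
```

Here CodePointPattern.IsSingleCodePoint("\U00010123") succeeds, while a two-letter string or an unpaired surrogate does not. This only covers "."; it does nothing for \p{} classes, which still see surrogates rather than the character they encode.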

The demo page for the ICU regular expression engine (http://demo.icu-project.org/icu-bin/redemo) shows how regex processing in this new mode would work.
Details
Posted by Microsoft on 5/14/2010 at 2:19 PM
Hi bgrainger,

Thanks again for taking the time to report this suggestion. Unfortunately, we have no plans to add the ability for Regex to process Unicode characters in this way, due to other priorities. One area of Regex that we are considering investing in is better performance. Therefore, I'm going to go ahead and resolve this as Won't Fix.

Please let me know if you have any questions or concerns at justinv at microsoft dot com.

Regards,

Justin Van Patten
Program Manager
CLR Base Class Libraries
Posted by Microsoft on 12/5/2008 at 9:18 PM
Hi bgrainger,

Thank you for the suggestion! Your feedback is very valuable to us. We aren't planning to add support for this in the next major version of the .NET framework, but we will continue to track this for the future.

Regards,

Justin Van Patten
Program Manager
Base Class Library
Posted by Bradley Grainger on 7/25/2008 at 5:14 PM
Essentially what I'm suggesting is that .NET regular expressions have "Basic Unicode Support: Level 1" as defined by Unicode TR18 (http://www.unicode.org/reports/tr18/). The area I've particularly focussed on in the suggestion above is requirement RL1.7 regarding supplementary code points (http://www.unicode.org/reports/tr18/#Supplementary_Characters).
Workaround(s)
Posted by Bradley Grainger on 7/25/2008 at 10:29 AM
Instead of using .NET regular expressions, one can download the latest ICU build (http://www.icu-project.org/) and P/Invoke to the exported C functions that process regular expressions (http://www.icu-project.org/apiref/icu4c/uregex_8h.html). The regex language supported by .NET and ICU is very similar.