String.GetHashCode ignores any characters in the string beyond the first null byte in x64 runtime. - by Marc C Brooks

Status : 

  Won't Fix<br /><br />
		Due to several factors the product team decided to focus its efforts on other items.<br /><br />
		A more detailed explanation for the resolution of this particular item may have been provided in the comments section.


12
0
Sign in
to vote
ID 519104 Comments
Status Closed Workarounds
Type Bug Repros 7
Opened 12/8/2009 4:07:30 PM
Access Restriction Public

Description

The GetHashCode method of the System.String class returns hash code values that do not take into account all the characters the string contains IFF the string contains any embedded NIL characters (e.g. the sequence [low-endian] of 0x00, 0x00).  This results in a lot of hash code collision when working with strings that embed NIL characters (for example as a sub-string seperator). 

This is only an issue in 64-bit compilations as the error is in the #else clause of a #if WIN32 block.

Strings in .Net are not supposed to be nul-terminated (like in native C libraries), they are supposed to be Length-controlled.

The reference-source-server's version of the code is:
        // Gets a hash code for this string.  If strings A and B are such that A.Equals(B), then
        // they will return the same hash code.
        [ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
        public override int GetHashCode() { 
            unsafe {
                fixed (char *src = this) { 
                    BCLDebug.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'"); 
                    BCLDebug.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");
 
#if WIN32
                    int hash1 = (5381<<16) + 5381;
#else
                    int hash1 = 5381; 
#endif
                    int hash2 = hash1; 
 
#if WIN32
                    // 32bit machines. 
                    int* pint = (int *)src;
                    int len = this.Length;
                    while(len > 0) {
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; 
                        if( len <= 2) {
                            break; 
                        } 
                        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                        pint += 2; 
                        len  -= 4;
                    }
#else
                    int     c; 
                    char *s = src;
/// ERROR ON NEXT LINE
                    while ((c = s[0]) != 0) { 
                        hash1 = ((hash1 << 5) + hash1) ^ c; 
                        c = s[1];
                        if (c == 0) 
                            break;
                        hash2 = ((hash2 << 5) + hash2) ^ c;
                        s += 2;
                    } 
#endif
#if DEBUG 
                    // We want to ensure we can change our hash function daily. 
                    // This is perfectly fine as long as you don't persist the
                    // value from GetHashCode to disk or count on String A 
                    // hashing before string B.  Those are bugs in your code.
                    hash1 ^= ThisAssembly.DailyBuildNumber;
#endif
                    return hash1 + (hash2 * 1566083941); 
                }
            } 
        } 
Sign in to post a comment.
Posted by Richard Deeming on 9/11/2014 at 4:13 AM
Still not fixed in .NET 4.5.2, despite adding equivalent "breaking changes" to the framework via security updates to address hash-code collision DoS vulnerabilities.

Time to re-open this one and get it fixed!
Posted by Microsoft on 6/14/2010 at 3:11 PM
Hello IDisposable,

As you have mentioned this would be a breaking change for some programs (even though they shouldn't really be relying on this), the risk of this was deemed too high to fix this in the current release.

I agree that the rate of collisions that this will cause in the default Dictionary<String, Object> will be inflated by this. If this is adversely effecting your applications performance, I would suggest trying to work around it by using one of the Dictionary constructors that takes an IEqualityComparer so you can provide a more appropriate GetHashCode implementation. I know this isn't ideal and would like to get this fixed in a future version of the .NET Framework.

Thanks,
Matt Greig
BCL Team
Posted by Marc C Brooks on 4/2/2010 at 2:51 PM
Woooo! "Resolved" as Won't Fix... again. Thanks for caring, Microsoft.
Posted by Microsoft on 12/11/2009 at 11:04 AM
Hello IDisposable,
Thank you for the feedback. We will look into improving this in a future release of the .NET Framework.

Sincerely,
Josh Free
Base Class Library Development
Posted by Microsoft on 12/11/2009 at 1:51 AM
We were able to reproduce the issue you are seeing. We are escalating this bug to the product unit who works on that specific feature area. The product team will review this issue and make a decision on whether they will fix it or not for the next release.
Thanks!
Posted by Microsoft on 12/9/2009 at 3:39 AM
Thank you for your feedback, we are currently reviewing the issue you have submitted. If this issue is urgent, please contact support directly(http://support.microsoft.com)
Posted by Marc C Brooks on 12/8/2009 at 4:18 PM
And yes, I know that if you change this, it could be a "breaking change" if someone was stupid enough to store the GetHashCode return value on disk somewhere... but they should be spanked anyway. Also note that the return values differ completely from x86 to x64, which is probably a unexpected surprise for most developers. I know I found it very difficult to understand why I was getting HUGE collision rates on my server's use of Dictionary<String, object> (under x64 64-bit runtime), but never saw a single one on my development machine (under x86 32-bit runtime).

In the end, I assume I was simply doing something wrong in synthesizing a composite String key and switched to a specific Tuple of strings... which worked correctly because I was no longer embedding NIL characters in my strings.