Home Dashboard Directory Help
Search

Add support for storing UTF-8 natively in SQL Server. by Erland Sommarskog


Status: 

Active


72
3
Sign in
to vote
Type: Suggestion
ID: 362867
Opened: 8/21/2008 1:52:05 PM
Access Restriction: Public
0
Workaround(s)
view

Description

You can store Unicode data in SQL Server today, but only in UCS-2 encoding. In many environments, UTF-8 is the dominating encoding for Unicode data, and it is the default
encoding for XML documents.

Today UTF-8 applications must convert between UCS-2 and UTF-8 to retrieve and store
data in SQL Server, which adds complexity and overhead to applications.

Another issue is that for languages written in the Latin script, UTF-8 is a much more effective storage format, since most for characters in text, only one byte is needed.
Details
Sign in to post a comment.
Posted by Joshua25640735 on 5/14/2012 at 6:42 PM
We've ran into this too. VARCHAR(6500) cannot be converted to NVARCHAR(6500) to enable Unicode support for the obvious reason. Our usage model says we could convert to VARCHAR(7000) UTF8 and expect it to work.
Posted by 13thdragon on 1/1/2012 at 10:14 PM
Over three years ago and the utf-8 issue appears still not to be addressed. I was hoping to turn the trial into a purchase like many others. But like them, looks like I'm still stuck with the other guy. There is no way I am rewriting 7 years worth of programming to adapt to Microsoft I am only one person that doesn't matter. ( How many others feel this way.)
Posted by Michael_W on 11/14/2011 at 2:37 PM
It is 3 years later since this request was put in. Is there a new Microsoft answer? Anything in the pipeline?
Posted by kishorjha on 7/4/2011 at 2:24 AM
I am facing a very big issue related to the fact that SQL Server does not support UTF-8 at collation level like Sybase and Oracle does.

Storing encoded data only based on a specific collation means that SQL server should be explicitly told that we are sending some Unicode data in order to store national character set. This effectively means that we cannot currently tell SQL server to treat all data as Unicode by default which is possible by setting character set to UTF-8, UTF-16, etc in Oracle and Sybase.

The issue is when we try and internationalize some large old applications which uses inline queries, it's very difficult to port the changed application to SQL Server due to the amount to change involved and just in case you want to avoid the change, you will have to make so many complex string processing logic, that application becomes unusable.

I request Microsoft to provide the feature to set collation setting to formats like UTF-8 so that we solve many such related issues .

Thanks,
Kishor
(kishorjha@msn.com)
Posted by Erland Sommarskog on 3/21/2011 at 2:57 PM
To some extent this requestion has been addressed in SQL 2008 R2 with the addition of Unicode compression. That addresses the space issue. Unfortunately, this is only an Enterprise Edition feature.

There is still need to support UTF-8 in tools and BULK INSERT, and maybe also in TDS to save network bandwidth. But I leaning more and more towards that there should be any new data type, collation or whatever to support UTF-8. It is very difficult to think of a solution that would not bring confusion and make SQL Server more difficult to use.
Posted by BetterToday on 9/17/2010 at 5:10 AM
BULK INSERT should support below mentioned automatic file encoding detection as well.
Posted by BetterToday on 9/17/2010 at 5:08 AM
OSQL should also support UTF-8 as encoding for SQL files.

Visual Studio optionally automatically converts SQL files containing UNICODE characters into UTF-8. If it does that, OSQL does not read nor execute those files correctly.

I suggest to use one of the standard MS text file reader API calls to read SQL files in OSQL. These automatically detect which file encoding is used and extract correct characters from the file stream.

Axel Dahmen
www.axeldahmen.de
Posted by jon__ on 7/16/2010 at 5:55 AM
This is a absolute must have for any RDBMS. Please add this in the coming release of SQL Server.
Posted by RyanBarrett on 7/30/2009 at 6:03 AM
In addition to being CAST-able / CONVERT-able, it could be effectively implemented with the existing datastructures by the SET statements within TSQL. For example:

SET TEXT ENCODING {format | @format}

It's analogous to the existing statement SET DATEFORMAT. As I understand it, it only concerns the output of the text, having little effect on storage format.
Posted by Jason Kresowaty on 8/31/2008 at 5:53 AM
One more thing... when the data type is NVARCHAR, the system could actually look at the string and determine whether UTF-8 or UCS-2 storage is more efficient and then use that as the underlying storage. No need for interaction from the user.
Posted by Jason Kresowaty on 8/31/2008 at 5:48 AM
UTF-8 strings have the same "value space" as UCS-2 strings. I strongly feel that this should be an option on the existing NCHAR/NVARCHAR types. Possible even a database-level default for this option. New data types would complicate interaction with applications when the use case is merely to reduce the storage requirement in half. This is merely a storage option -- should be possible to flip this flag without introducing any changes or incompatibilities with regards to existing applications.

Another use case is interpreting binary data as Unicode by way of UTF-8. For example, with the following syntaxes:

CAST(@Binary UTF8 AS NVARCHAR(100))

- or -

UTF8BytesToNVARCHAR(@Binary)

Currently these kinds of things can be done by SQLCLR UDF, but it would be better if these were in T-SQL because they are so important. This means that @Binary should be converted to Unicode using the rules of UTF-8. The result of this expression would actually be a UCS-2 NVARCHAR string, as I am assuming that strings would stay UCS-2 in memory, like they do in .NET. The UTF-8 would just be an on-disk thing.
Posted by Microsoft on 8/21/2008 at 6:49 PM
Hi Erland,
Thanks for your suggestion. We are considering adding support for UTF8 in the next version of SQL Server. It is not clear at this point if it will be a new type or integrate it with existing types. We understand the pain in terms of integrating with UTF8 data and we are looking at ways to effectively resolve it.

--
Umachandar, SQL Programmability Team
Sign in to post a workaround.