std::wcout is ten times slower than wprintf: Performance bug in C++ library - by Andrew7Webb

Status: By Design

The product team believes this item works according to its intended design. A more detailed explanation for the resolution of this particular item may have been provided in the comments section.

ID 642876 Comments
Status Closed Workarounds
Type Bug Repros 1
Opened 2/11/2011 6:20:58 AM
Access Restriction Public


The code below outputs a string 4000 times using wprintf and wcout. It should take roughly the same amount of time either way, but wcout actually takes more than ten times as long. Please find out why and fix wcout.

#include <windows.h>
#include <tchar.h>
#include <stdio.h>
#include <iostream>

// Prints the test string 4000 times through the C runtime.
void usePrintf()
{
	for( unsigned i=0; i < 4000; i++ ) {
		wprintf( L"%s", L"cupcakes are made of flour, eggs, and sugar" );
	}
}

// Prints the same string 4000 times through the C++ stream.
void useCout()
{
	for( unsigned i=0; i < 4000; i++ ) {
		std::wcout << L"cupcakes are made of flour, eggs, and sugar";
	}
}

int _tmain(int argc, _TCHAR* argv[])
{
	long	tick0= ::GetTickCount();
	usePrintf();
	long	tick1= ::GetTickCount();
	useCout();
	long	tick2= ::GetTickCount();

	//  DEBUG                               RELEASE
	//  printf took   671 milliseconds.     printf took   593 milliseconds.
	//  wcout  took 11653 milliseconds.     wcout  took 10779 milliseconds.
	std::wcout << L"printf took " << tick1 - tick0 << L" milliseconds." << std::endl
	           << L"wcout  took " << tick2 - tick1 << L" milliseconds." << std::endl;
	return 0;
}
Posted by Andrew7Webb on 2/18/2011 at 8:36 PM
I accept these practical solutions, and thank you for responding to the performance bug.

* Using ostringstream/puts() is almost as fast as printf().
* Calling setvbuf() makes everything significantly faster, and erases almost all performance differences.
* Using pubsetbuf() to give cout its own buffer actually makes it faster than everything else, including printf() (!!!).

best regards, Andrew

P.S. since I also took a more detailed look, I might as well report what I found:

// TEMPLATE CLASS basic_filebuf
template<class _Elem,
    class _Traits>
    class basic_filebuf
        : public basic_streambuf<_Elem, _Traits>
    {    // stream buffer associated with a C stream

"basic_filebuf" inherits from "basic_streambuf".

That's fine, except that "basic_filebuf" loses a chance to optimize the unbuffered output of multiple characters, because it uses the generic "basic_streambuf" version of

         virtual streamsize __CLR_OR_THIS_CALL xsputn(const _Elem *_Ptr,
             streamsize _Count)

If "basic_filebuf" implemented its own xsputn, and did something like

     if (0 == _Pnavail() ) {
        fwrite( _Ptr, sizeof (_Elem), _Count, _Myfile );
        return _Count;
     }

It would save lots of time compared to having overflow in a loop call lots of
     _Fputc(wchar_t _Wchar, _Filet *_File)
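
The idea above can be sketched outside the library. This is only an illustration under assumptions, not the actual basic_filebuf: the class name bulk_filebuf and the helper round_trip are hypothetical, the buffer handles narrow char only, and it has no put area at all, so every multi-character write reaches xsputn directly.

```cpp
#include <cstddef>
#include <cstdio>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical illustration, not the library's basic_filebuf: an unbuffered
// stream buffer that forwards output to a C FILE*.
class bulk_filebuf : public std::streambuf {
public:
    explicit bulk_filebuf(std::FILE* file) : file_(file) {}

protected:
    // Single-character path, reached through sputc() because there is no put area.
    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof())) {
            return traits_type::not_eof(ch);
        }
        return std::fputc(traits_type::to_char_type(ch), file_) == EOF
                   ? traits_type::eof()
                   : ch;
    }

    // Multi-character path: hand the whole run to fwrite() in one call
    // instead of looping over overflow() once per character.
    std::streamsize xsputn(const char* ptr, std::streamsize count) override {
        if (count <= 0) {
            return 0;
        }
        return static_cast<std::streamsize>(
            std::fwrite(ptr, sizeof(char), static_cast<std::size_t>(count), file_));
    }

private:
    std::FILE* file_;
};

// Writes text through bulk_filebuf into a temporary file and reads it back.
// Returns an empty string if the temporary file cannot be created.
std::string round_trip(const std::string& text) {
    std::FILE* file = std::tmpfile();
    if (!file) {
        return std::string();
    }
    {
        bulk_filebuf buf(file);
        std::ostream out(&buf);
        out << text;
    }
    std::rewind(file);
    std::string result(text.size(), '\0');
    const std::size_t got = std::fread(&result[0], 1, result.size(), file);
    result.resize(got);
    std::fclose(file);
    return result;
}
```

Constructing bulk_filebuf over stdout and attaching it to a std::ostream would exercise the same path for console output. Whether this closes the gap measured above is not something the sketch demonstrates.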

Posted by Stephan [MSFT] on 2/18/2011 at 5:40 PM

Thanks for reporting this issue. I'm resolving it as By Design, because this is an unfortunate consequence of how our C and C++ Standard Library implementations are designed, but I have a few easy workarounds for you.

The problem is that when printing to the console (instead of, say, being redirected to a file), neither our C nor C++ I/O are buffered by default. This is sometimes concealed by the fact that C I/O functions like printf() and puts() temporarily enable buffering while doing their work.

(As documented at , our C Standard Library implementation isn't capable of line buffering, only full buffering or no buffering. Paragraph 7.19.3/7 of the 1999 C Standard requires that "As initially opened, the standard error stream is not fully buffered; the standard input and standard output streams are fully buffered if and only if the stream can be determined not to refer to an interactive device." Since we don't have line buffering, we must default to no buffering.)

There are a couple of things that you can do. You can call setvbuf() to enable buffering on stdout, which will significantly improve the performance of both printf() and cout. (If you request line buffering, VC will give you full buffering, but other implementations may give you line buffering.) Alternatively, you can give cout a buffer of your own.

Here's my example program:

C:\Temp>type meow.cpp
#include <stdio.h>
#include <stdlib.h>
#include <fstream>
#include <ios>
#include <iostream>
#include <ostream>
#include <sstream>
#include <vector>
#include <windows.h>
using namespace std;

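// Current value of the high-resolution performance counter.
long long counter() {
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return li.QuadPart;
}

// Performance counter ticks per second.
long long frequency() {
    LARGE_INTEGER li;
    QueryPerformanceFrequency(&li);
    return li.QuadPart;
}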

int main() {
    // Map the /DTESTn switch onto an output strategy, with or without the fix.
    #if defined(TEST1)
        #define USE_PRINTF
    #elif defined(TEST2)
        #define USE_COUT
    #elif defined(TEST3)
        #define USE_VECTOR_AND_COUT
    #elif defined(TEST4)
        #define USE_OSS
    #elif defined(TEST5)
        #define USE_FIX
        #define USE_PRINTF
    #elif defined(TEST6)
        #define USE_FIX
        #define USE_COUT
    #elif defined(TEST7)
        #define USE_FIX
        #define USE_VECTOR_AND_COUT
    #elif defined(TEST8)
        #define USE_FIX
        #define USE_OSS
    #endif

    // The fix: turn on buffering for stdout before any output happens.
    #ifdef USE_FIX
        if (setvbuf(stdout, 0, _IOLBF, 4096) != 0) {
            abort(); // buffering could not be enabled
        }
    #endif

    const long long start = counter();

    #if defined(USE_PRINTF)
        for (int i = 0; i < 10000; ++i) {
            printf("[printf] this is string number %d\n", i);
        }
    #elif defined(USE_COUT) || defined(USE_VECTOR_AND_COUT)
        #ifdef USE_VECTOR_AND_COUT
            // Give cout a buffer of its own.
            vector<char> buf(65536);
            cout.rdbuf()->pubsetbuf(&buf[0], buf.size());
        #endif

        for (int i = 0; i < 10000; ++i) {
            cout << "[cout ] this is string number " << i << "\n";
        }

        // Flush inside the timed region, while buf is still alive.
        cout << flush;
    #elif defined(USE_OSS)
        ostringstream oss;

        for (int i = 0; i < 10000; ++i) {
            // Format in memory, then write the finished line with puts().
            oss << "[oss ] this is string number " << i;
            puts(oss.str().c_str());
            oss.str("");
        }
    #endif

    const long long finish = counter();

    // Append the result so that repeated runs accumulate in one file.
    ofstream f("timings.txt", ios_base::app);

    #ifdef USE_FIX
        f << "FIX - ";
    #endif

    #if defined(USE_PRINTF)
        f << "printf";
    #elif defined(USE_COUT)
        f << "cout";
    #elif defined(USE_VECTOR_AND_COUT)
        f << "vector and cout";
    #elif defined(USE_OSS)
        f << "oss";
    #endif

    f << ": " << (finish - start) * 1000.0 / frequency() << " ms" << endl;
}

I compiled and executed it like this:

C:\Temp>for /l %I in (1,1,8) do (cl /EHsc /nologo /W4 /MT /O2 /GL /DTEST%I meow.cpp && for /l %K in (1,1,5) do meow)

This prints a ton of stuff, and records the performance in timings.txt. Here's what I got (on Vista SP2 x64):

C:\Temp>type timings.txt
printf: 11294.8 ms
printf: 3352.87 ms
printf: 5495.42 ms
printf: 5568.7 ms
printf: 3179.67 ms
cout: 22158.9 ms
cout: 25855.5 ms
cout: 23742 ms
cout: 23080.8 ms
cout: 22072.4 ms
vector and cout: 2180.58 ms
vector and cout: 2149.43 ms
vector and cout: 2161.41 ms
vector and cout: 2154 ms
vector and cout: 2152.9 ms
oss: 5594.2 ms
oss: 5477.6 ms
oss: 5422.23 ms
oss: 5535.69 ms
oss: 3175.77 ms
FIX - printf: 2379.71 ms
FIX - printf: 2370.67 ms
FIX - printf: 2358.52 ms
FIX - printf: 2362.13 ms
FIX - printf: 2351.94 ms
FIX - cout: 2376.76 ms
FIX - cout: 2368.34 ms
FIX - cout: 2376.14 ms
FIX - cout: 2349.48 ms
FIX - cout: 2363.16 ms
FIX - vector and cout: 2176.16 ms
FIX - vector and cout: 2157.45 ms
FIX - vector and cout: 2166.77 ms
FIX - vector and cout: 2165.99 ms
FIX - vector and cout: 2163.99 ms
FIX - oss: 2385.05 ms
FIX - oss: 2353.87 ms
FIX - oss: 2398.24 ms
FIX - oss: 2377.62 ms
FIX - oss: 2362.45 ms

It appears that the first run is very slow because the Command Prompt's buffer is being filled up (I use a buffer of 9999 lines). In the interest of paranoia, I threw out each first number and averaged the remaining 4. The averages:

printf: 4399.165 ms
cout: 23687.675 ms
vector and cout: 2154.435 ms
oss: 4902.8225 ms
FIX - printf: 2360.815 ms
FIX - cout: 2364.28 ms
FIX - vector and cout: 2163.55 ms
FIX - oss: 2373.045 ms

This actually demonstrates a third workaround - performing your formatting in an ostringstream, and then blasting the result to the console with puts(). To summarize:

* Using ostringstream/puts() is almost as fast as printf().
* Calling setvbuf() makes everything significantly faster, and erases almost all performance differences.
* Using pubsetbuf() to give cout its own buffer actually makes it faster than everything else, including printf() (!!!).

I suggest calling setvbuf() alone, as that's the simplest thing to do and it's extremely effective.
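
A minimal sketch of the setvbuf() workaround as a reusable helper. The name enable_full_buffering is hypothetical, and it requests _IOFBF explicitly rather than the _IOLBF used in the test program, on the assumption that full buffering is what you want everywhere.

```cpp
#include <cstddef>
#include <cstdio>

// Asks the C runtime to allocate a buffer of the given size for the stream
// and to use full buffering. Returns true on success.
// Call this before any other operation on the stream.
bool enable_full_buffering(std::FILE* stream, std::size_t size) {
    return std::setvbuf(stream, nullptr, _IOFBF, size) == 0;
}
```

In a program like the one in the original report, this would be called as enable_full_buffering(stdout, 4096) at the very top of main(). With full buffering, output may not appear until the buffer fills or is flushed, so interactive prompts need an explicit fflush(stdout) or a std::flush.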

If you have any further questions, feel free to E-mail me at .

Stephan T. Lavavej
Visual C++ Libraries Developer
Posted by jalf_ on 2/17/2011 at 12:56 AM
See also for a detailed analysis and discussion of the performance of Iostreams.

And the TR on C++ performance ( has a whole section dedicated to speeding up Iostreams.

The current speed really is crippling for many purposes.
Posted by Microsoft on 2/14/2011 at 12:59 AM
Thanks for your feedback. We are routing this issue to the appropriate group within the Visual Studio Product Team for triage and resolution. These specialized experts will follow-up with your issue.
Posted by Andrew7Webb on 2/11/2011 at 2:02 PM
I understand a bit more about the problem:

streambuf's xsputn function uses "overflow" to output each character one at a time.

    virtual streamsize __CLR_OR_THIS_CALL xsputn(const _Elem *_Ptr,
        streamsize _Count)
        {    // put _Count characters to stream
        streamsize _Size, _Copied;

        for (_Copied = 0; 0 < _Count; )
            if (0 < (_Size = _Pnavail()))
                {    // copy to write buffer
                if (_Count < _Size)
                    _Size = _Count;
                _Traits::copy(pptr(), _Ptr, (size_t)_Size);
                _Ptr += _Size;
                _Copied += _Size;
                _Count -= _Size;
                pbump((int)_Size);
                }
            else if (_Traits::eq_int_type(_Traits::eof(),
-------->                overflow(_Traits::to_int_type(*_Ptr))))
                break;    // single character put failed, quit
            else
                {    // count character successfully put
                ++_Ptr;
                ++_Copied;
                --_Count;
                }

        return (_Copied);
        }

overflow then eventually calls fputwc.c's

wint_t __cdecl fputwc (
        wchar_t ch,
        FILE *str
        )
{
    REG1 FILE *stream;
    REG2 wint_t retval;

    /* Init stream pointer */
    stream = str;

    _lock_str(stream);
    __try {
        retval = _fputwc_nolock(ch,stream);
    }
    __finally {
        _unlock_str(stream);
    }

    return(retval);
}

I don't know exactly why it takes quite a bit of time, but I do notice that lots of stuff is being done on a per character basis rather than on the whole string.
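
The per-character behavior can be observed without reading the library source. This is a minimal sketch using a hypothetical probe class (counting_buf) that has no put area and simply counts calls to overflow(). It assumes the inherited xsputn behaves as if by repeated sputc calls, which is how the standard describes it.

```cpp
#include <cstddef>
#include <streambuf>
#include <string>

// Hypothetical probe: a stream buffer with no put area that counts how many
// times overflow() is invoked and discards the characters.
class counting_buf : public std::streambuf {
public:
    std::size_t overflow_calls = 0;

protected:
    int_type overflow(int_type ch) override {
        ++overflow_calls;
        return traits_type::not_eof(ch);  // report success
    }
};

// Pushes text through the default xsputn() and reports the overflow() count.
std::size_t overflow_calls_for(const std::string& text) {
    counting_buf buf;
    buf.sputn(text.data(), static_cast<std::streamsize>(text.size()));
    return buf.overflow_calls;
}
```

For an unbuffered stream, the count equals the string length, so each character pays the full cost of whatever overflow() does, including any locking further down.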
Posted by Microsoft on 2/11/2011 at 7:13 AM
Thank you for your feedback, we are currently reviewing the issue you have submitted. If this issue is urgent, please contact support directly(