The x64 (and probably x86) implementation of std::atomic::load in VS 2012 is so bad as to be practically useless. Given a std::atomic<std::size_t> ai, the expression ai.load(std::memory_order_relaxed) ultimately ends up calling the intrinsic _InterlockedOr64(&x, 0). This intrinsic, in turn, emits a cmpxchg loop that repeatedly loads the memory location and then compares the loaded value against the location it was just loaded from. In other words, a pure read is implemented as a locked read-modify-write.
For reference, the correct code to emit for a relaxed load is "mov register, [memory location]".
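To make the difference concrete, here is a minimal sketch of the two ways of reading the value. The helper names read_relaxed and read_via_rmw are made up for illustration and do not appear in the library; fetch_or(0) is only meant as a rough portable stand-in for the _InterlockedOr64(&x, 0) call described above, not as the actual library source.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical helper: the read as the caller wrote it.
// On x64 an aligned 8-byte load is already atomic, so a good
// implementation can compile this to a single mov.
std::size_t read_relaxed(const std::atomic<std::size_t>& ai) {
    // The single-mov expectation only makes sense for a lock-free atomic.
    assert(ai.is_lock_free());
    return ai.load(std::memory_order_relaxed);
}

// Hypothetical helper: the same read expressed as a read-modify-write.
// OR-ing with 0 leaves the value unchanged but still performs a locked
// operation on the cache line, so it needs a non-const atomic.
std::size_t read_via_rmw(std::atomic<std::size_t>& ai) {
    return ai.fetch_or(0, std::memory_order_seq_cst);
}
```

Both functions return the same value. The difference is only in cost: the second one needs exclusive ownership of the cache line even though nothing is modified.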
It seems that all atomic load operations are treated the same no matter what memory ordering is specified; every load is performed as if it were sequentially consistent. This is, strictly speaking, permitted behavior, since an implementation may always provide stronger ordering than was requested. It just makes the weaker orderings completely useless. Atomics are used in performance-critical lock-free data structures, and the abysmal implementation slows our lock-free hash map down by a factor of 10 or even 100 compared to the Intel atomics implementation or Boost.Atomic. The atomics become a major bottleneck and scalability problem in any application that uses them.
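A rough way to check the cost on a given toolchain is to time the two kinds of read side by side. The following is only a sketch under stated assumptions: Timing, time_reads, time_relaxed_loads and time_rmw_loads are hypothetical names, fetch_or(0) again stands in for the read-modify-write path, and the loop is single-threaded, so it will understate the slowdown seen when several threads hit the same cache line.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>

// Hypothetical result type: the accumulated values (so the reads cannot
// be discarded as unused) and the elapsed time in nanoseconds.
struct Timing {
    std::size_t sum;
    long long nanoseconds;
};

// Hypothetical harness: calls `read` `iterations` times and times the loop.
template <typename Read>
Timing time_reads(Read read, std::size_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    std::size_t sum = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        sum += read();
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return Timing{sum, static_cast<long long>(elapsed.count())};
}

// Plain relaxed loads, as the lock-free code wants them.
Timing time_relaxed_loads(const std::atomic<std::size_t>& ai, std::size_t n) {
    return time_reads([&ai] { return ai.load(std::memory_order_relaxed); }, n);
}

// Reads done as a locked read-modify-write (OR with 0).
Timing time_rmw_loads(std::atomic<std::size_t>& ai, std::size_t n) {
    return time_reads([&ai] { return ai.fetch_or(0); }, n);
}
```

Comparing the nanoseconds fields of the two results for the same n gives a first impression; a multi-threaded run against a shared atomic would be closer to what the hash map experiences.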