There is a missed optimization for vbroadcasti128. The problem is that its source operand is an m128, i.e. memory only, so when the intrinsic is given a register or a __m128i value the compiler cannot use it directly and has to store it to the stack first. That is understandable. However, there should be a way to use this instruction directly with a pointer: either the compiler should figure out how to do that, or we need a function with this signature: _mm256_broadcastsi128_si256(const __m128i* m). Notice the pointer type for the input. I don't know who decided on the by-value signature in the first place; whoever it was, probably some folks at Intel, was obviously unaware of the nature of the underlying machine instruction.
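There is also no _mm256_broadcastf128_f256 for vbroadcastf128, but that's basically the same thing. For that one, though, the AVX intrinsics _mm256_broadcast_ps(const __m128*) and _mm256_broadcast_pd(const __m128d*) do map to vbroadcastf128 and already take a pointer, which is the shape I'm asking for on the integer side.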
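As a possible workaround, here is a minimal sketch that wraps the existing intrinsic in a helper taking a pointer and loading through it. The names broadcast128_from_ptr and broadcast16_to_32 are hypothetical, not Intel intrinsics, and the target attribute assumes GCC or Clang. Whether the load actually gets folded into the memory operand of vbroadcasti128 is up to the compiler, so it is worth checking the generated assembly.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper (not an Intel intrinsic): broadcast the 16 bytes
 * at p into both 128-bit lanes of a 256-bit vector. The unaligned load
 * feeds the by-value intrinsic; a compiler may fold it into
 * "vbroadcasti128 ymm, [mem]", but that is not guaranteed. */
__attribute__((target("avx2")))
static inline __m256i broadcast128_from_ptr(const void *p)
{
    return _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)p));
}

/* Hypothetical wrapper: copy 16 bytes from src into both halves of the
 * 32-byte buffer dst. Only call this on a CPU that supports AVX2. */
__attribute__((target("avx2")))
void broadcast16_to_32(uint8_t dst[32], const uint8_t src[16])
{
    _mm256_storeu_si256((__m256i *)dst, broadcast128_from_ptr(src));
}
```

The caller is expected to check for AVX2 first, for example with __builtin_cpu_supports("avx2") on GCC or Clang, before calling broadcast16_to_32.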