There are two ways to put __m128 into __m256 directly.
1. _mm256_castps128_ps256, which is (mostly) free: it generates no instruction of its own and just refers to the same register as ymm. The upper 128 bits of the result are undefined.
2. _mm256_insertf128_ps, which carries a costly RAW (read-after-write) dependency, since it has to merge one half of the register with the existing value.
So casting is generally preferred.
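To make the difference concrete, here is a minimal sketch of both options. The helper names (widen_by_cast, widen_by_insert) and the AVX_FN macro are made up for illustration, and the target attribute is a GCC/Clang extension that I am assuming is available; with -mavx on the command line you can drop it.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. This enables AVX per function so the file
   builds without -mavx; it is not needed when compiling with -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* Option 1: cast. Only the low four floats of the result are defined. */
AVX_FN static void widen_by_cast(const float in[4], float out_low[4])
{
    __m128 v = _mm_loadu_ps(in);
    __m256 wide = _mm256_castps128_ps256(v);  /* upper 128 bits undefined */

    /* Read back only the half that is guaranteed to hold data. */
    _mm_storeu_ps(out_low, _mm256_castps256_ps128(wide));
}

/* Option 2: insert. The low half is replaced by v, the high half is kept
   from base, so the result depends on the previous value of base. */
AVX_FN static void widen_by_insert(const float in[4], const float base[8],
                                   float out[8])
{
    __m128 v = _mm_loadu_ps(in);
    __m256 b = _mm256_loadu_ps(base);
    __m256 wide = _mm256_insertf128_ps(b, v, 0);  /* 0 selects the low half */

    _mm256_storeu_ps(out, wide);
}
```

The sketch uses unaligned loads and stores on purpose, so it makes no assumption about how the caller's arrays are aligned.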
The real showstopper here is that the compiler may spill those __m128 variables, and _mm256_castps128_ps256 then gets compiled to "vmovaps reg, m256", where m256 is only aligned to 16 bytes because it is the address of a __m128 variable. A 256-bit vmovaps requires 32-byte alignment, so that load can fault.