If you recall where we left off in yesterday's post, we compiled a test program with gcc and saw this code for the 'working' part of a loop. (Yes, I will be getting to the Intel C++ compiler in the next post, but I'll stick with what I've got so far so we can take baby steps.)
.LBB52:
    .loc 1 14 0
    movss   (%rbp,%rax,4), %xmm0
    addss   (%rdx,%rax,4), %xmm0
    movss   %xmm0, (%rbp,%rax,4)
    addq    $1, %rax
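For orientation, here is a rough sketch of the kind of C loop that compiles down to a scalar load, add, store sequence like the one above. The function name add_arrays and the parameter names dst, src and n are my own placeholders, not necessarily what the test program used, and the exact registers gcc picks will vary with compiler version and flags. The .loc 1 14 0 directive is just debug information tying these instructions back to line 14 of source file 1.

```c
#include <stddef.h>

/* Hypothetical stand-in for the loop in the test program:
 * dst, src and n are illustrative names only. */
void add_arrays(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* One float per iteration:
         *   movss  loads dst[i] into %xmm0
         *   addss  adds src[i] to it
         *   movss  stores the sum back to dst[i]
         *   addq   advances the index by one */
        dst[i] += src[i];
    }
}
```

The "ss" suffix on movss and addss means "scalar single": each instruction touches a single 32-bit float even though %xmm0 has room for four of them.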