I think the following can explain it:
To copy data from memory to a register and back to memory, you do
mov eax, [address]
mov [address2], eax
This moves 32 bit (4 byte) from address to address2. The same goes with 64 bit in 64 bit mode
mov rax, [address]
mov [address2], rax
This moves 64 bit, 2 byte, from address to address2. "mov" itself, regardless of whether it is 64 bit or 32 bit has a latency of 0.5 and a throughput of 0.5 according to Intel's specs. Latency is how many clock cycles the instruction takes to travel through the pipeline and throughput is how long the CPU has to wait before accepting the same instruction again. As you can see, it can do two mov's per clock cycle, however, it has to wait half a clock cycle between two mov's, thus it can effectively only do one mov per clock cycle (or am I wrong here and misinterpret the terms? See PDF here for details).
Of course a mov reg, mem
can be longer than 0.5 cycles, depending if the data is in 1st or 2nd level cache, or not in cache at all and needs to be grabbed from memory. However, the latency time of above ignores this fact (as the PDF states I linked above), it assumes all data necessary for the mov are present already (otherwise the latency will increase by how long it takes to fetch the data from wherever it is right now - this might be several clock cycles and is completely independent of the command being executed says the PDF on page 482/C-30).
What is interesting, whether the mov is 32 or 64 bit plays no role. That means unless the memory bandwidth becomes the limiting factor, 64 bit mov's are equally fast to 32 bit mov's, and since it takes only half as many mov's to move the same amount of data from A to B when using 64 bit, the throughput can (in theory) be twice as high (the fact that it's not is probably because memory is not unlimited fast).
Okay, now you think when using the larger SSE registers, you should get faster throughput, right? AFAIK the xmm registers are not 256, but 128 bit wide, BTW (reference at Wikipedia). However, have you considered latency and throughput? Either the data you want to move is 128 bit aligned or not. Depending on that, you either move it using
movdqa xmm1, [address]
movdqa [address2], xmm1
or if not aligned
movdqu xmm1, [address]
movdqu [address2], xmm1
Well, movdqa/movdqu has a latency of 1 and a throughput of 1. So the instructions take twice as long to be executed and the waiting time after the instructions is twice as long as a normal mov.
And something else we have not even taken into account is the fact that the CPU actually splits instructions into micro-ops and it can execute these in parallel. Now it starts getting really complicated... even too complicated for me.
Anyway, I know from experience loading data to/from xmm registers is much slower than loading data to/from normal registers, so your idea to speed up transfer by using xmm registers was doomed from the very first second. I'm actually surprised that in the end the SSE memmove is not much slower than the normal one.