windows - memcpy performance differences between 32 and 64 bit processes

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

We have Core2 machines (Dell T5400) with XP64.

We observe that when running 32-bit processes, memcpy performance is on the order of 1.2 GB/s, whereas memcpy in a 64-bit process achieves about 2.2 GB/s (or 2.4 GB/s with the Intel compiler CRT's memcpy). The initial reaction might be to explain this away as due to the wider registers available in 64-bit code. However, our own memcpy-like SSE assembly code, which should be using 128-bit-wide loads and stores regardless of whether the process is 32-bit or 64-bit, shows similar upper limits on the copy bandwidth it achieves.
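For readers who want to reproduce this kind of measurement, here is a minimal sketch of a memcpy bandwidth test in C. It is not the code used for the numbers above; the function name `memcpy_bandwidth_gbps` and its parameters are made up for illustration, and it assumes `clock()` is a good enough timer for a coarse comparison.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from the original post): copies `bytes` bytes
   `reps` times with memcpy and returns the bandwidth in GB/s, where
   1 GB = 1e9 bytes. Returns -1.0 if allocation fails and 0.0 if the timer
   is too coarse to measure the run. */
double memcpy_bandwidth_gbps(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers before timing so page faults are not measured. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        /* Change one source byte and read one destination byte per pass to
           discourage the compiler from dropping the copy. */
        src[(size_t)i % bytes] = (unsigned char)i;
        memcpy(dst, src, bytes);
        sink ^= dst[(size_t)i % bytes];
    }
    clock_t end = clock();

    free(src);
    free(dst);

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return (double)bytes * reps / seconds / 1e9;
}
```

Building the same source once as a 32-bit and once as a 64-bit executable, and calling it with a buffer much larger than the caches (for example `memcpy_bandwidth_gbps((size_t)256 << 20, 20)`), should give a rough version of the comparison described above.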

My question is: what is this difference actually due to? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers or... what?

Thanks for any insight.

Also raised on Intel forums.

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:58:57+0000

I think the following can explain it:

To copy data from memory to a register and back to memory, you do

mov eax, [address]
mov [address2], eax

This moves 32 bits (4 bytes) from address to address2. The same goes for 64 bits in 64 bit mode:

mov rax, [address]
mov [address2], rax

This moves 64 bits (8 bytes) from address to address2. According to Intel's specs, "mov" itself has a latency of 0.5 and a throughput of 0.5, regardless of whether it is 64 bit or 32 bit. Latency is how many clock cycles the instruction takes to travel through the pipeline, and throughput is how long the CPU has to wait before accepting the same instruction again. As you can see, it can do two mov's per clock cycle; however, it has to wait half a clock cycle between two mov's, thus it can effectively only do one mov per clock cycle (or am I wrong here and misinterpreting the terms? See PDF here for details).

Of course, a mov reg, mem can take longer than 0.5 cycles, depending on whether the data is in the 1st or 2nd level cache, or not in cache at all and has to be fetched from memory. However, the latency figure above ignores this (as the PDF I linked above states): it assumes all data necessary for the mov is already present. Otherwise the latency increases by however long it takes to fetch the data from wherever it currently is, which might be several clock cycles and is completely independent of the instruction being executed, says the PDF on page 482/C-30.

What is interesting is that whether the mov is 32 or 64 bit plays no role. That means that unless memory bandwidth becomes the limiting factor, 64 bit mov's are just as fast as 32 bit mov's. Since it takes only half as many mov's to move the same amount of data from A to B when using 64 bit, the throughput can (in theory) be twice as high. The fact that it is not is probably because memory is not infinitely fast.
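The "half as many mov's" arithmetic can be made concrete with a small C sketch. The helper names `copy_words32` and `copy_words64` are made up for illustration. Each loop iteration stands for one load/store pair like the ones shown earlier, although an optimizing compiler is free to emit something else (for example a vectorized loop), so the returned count is only the number of logical moves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies n bytes (n must be a multiple of 4) in 4-byte
   steps and returns how many load/store pairs were performed. */
size_t copy_words32(void *dst, const void *src, size_t n)
{
    assert(n % 4 == 0);
    size_t moves = 0;
    for (size_t off = 0; off < n; off += 4) {
        uint32_t w;
        memcpy(&w, (const unsigned char *)src + off, 4); /* like mov eax, [address]  */
        memcpy((unsigned char *)dst + off, &w, 4);       /* like mov [address2], eax */
        moves++;
    }
    return moves;
}

/* Same idea in 8-byte steps (n must be a multiple of 8). */
size_t copy_words64(void *dst, const void *src, size_t n)
{
    assert(n % 8 == 0);
    size_t moves = 0;
    for (size_t off = 0; off < n; off += 8) {
        uint64_t w;
        memcpy(&w, (const unsigned char *)src + off, 8); /* like mov rax, [address]  */
        memcpy((unsigned char *)dst + off, &w, 8);       /* like mov [address2], rax */
        moves++;
    }
    return moves;
}
```

For a 64-byte buffer, the 4-byte version performs 16 load/store pairs and the 8-byte version performs 8, which is where the theoretical factor of two comes from.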

Okay, now you might think that using the larger SSE registers should give you even higher throughput, right? AFAIK the xmm registers are not 256 but 128 bits wide, BTW (reference at Wikipedia). However, have you considered latency and throughput? The data you want to move is either 128 bit (16 byte) aligned or not. Depending on that, you either move it using

movdqa xmm1, [address]
movdqa [address2], xmm1

or if not aligned

movdqu xmm1, [address]
movdqu [address2], xmm1

Well, movdqa/movdqu has a latency of 1 and a throughput of 1. So these instructions take twice as long to execute, and the waiting time after each instruction is twice as long as for a normal mov.
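For anyone who wants to try the aligned and unaligned variants from C rather than assembly, here is a rough sketch using the SSE2 intrinsics that normally map to movdqa (`_mm_load_si128` / `_mm_store_si128`) and movdqu (`_mm_loadu_si128` / `_mm_storeu_si128`). The function name `copy_sse2` and its return convention are made up for illustration, and on targets without SSE2 it simply falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: copies n bytes (n must be a multiple of 16) in
   16-byte steps. Returns 1 if the aligned path was used, 0 if the unaligned
   path was used, and -1 if SSE2 is unavailable and memcpy was used. */
int copy_sse2(void *dst, const void *src, size_t n)
{
    assert(n % 16 == 0);
#if HAVE_SSE2
    const unsigned char *s = src;
    unsigned char *d = dst;
    /* Both pointers must be 16-byte aligned for the aligned instructions. */
    int aligned = (((uintptr_t)s | (uintptr_t)d) & 15) == 0;

    for (size_t off = 0; off < n; off += 16) {
        if (aligned) {
            __m128i v = _mm_load_si128((const __m128i *)(s + off));  /* movdqa load  */
            _mm_store_si128((__m128i *)(d + off), v);                /* movdqa store */
        } else {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + off)); /* movdqu load  */
            _mm_storeu_si128((__m128i *)(d + off), v);               /* movdqu store */
        }
    }
    return aligned;
#else
    memcpy(dst, src, n);
    return -1;
#endif
}
```

Timing this against a plain memcpy on the same buffers, in both a 32-bit and a 64-bit build, is one way to check whether the instruction choice or the memory system sets the limit.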

And something else we have not even taken into account is the fact that the CPU actually splits instructions into micro-ops and it can execute these in parallel. Now it starts getting really complicated... even too complicated for me.

Anyway, I know from experience loading data to/from xmm registers is much slower than loading data to/from normal registers, so your idea to speed up transfer by using xmm registers was doomed from the very first second. I'm actually surprised that in the end the SSE memmove is not much slower than the normal one.
