Culprit: False Data Dependency (and the compiler isn't even aware of it)
On Sandy/Ivy Bridge and Haswell processors, the instruction:
popcnt src, dest
appears to have a false dependency on the destination register dest.
Even though the instruction only writes to it, the instruction will wait until dest is ready before executing.
This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).
Skylake fixed this for lzcnt and tzcnt.
Cannon Lake (and Ice Lake) fixed this for popcnt.
bsf/bsr have a true output dependency: output unmodified for input=0.
(But no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.)
(Yes, these instructions all run on the same execution unit.)
This dependency doesn't just hold up the 4 popcnts from a single loop iteration.
It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.
The unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator which assigns the registers to the variables.
In your case, the speeds are a direct result of what is stuck to the (false) dependency chain depending on what the register allocator decided to do.
- 13 GB/s has a chain: popcnt - add - popcnt - popcnt → next iteration
- 15 GB/s has a chain: popcnt - add - popcnt - add → next iteration
- 20 GB/s has a chain: popcnt - popcnt → next iteration
- 26 GB/s has a chain: popcnt - popcnt → next iteration
The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.
To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.

Here are the results:
Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)

Different Registers: 18.6195 GB/s
.L4:
movq (%rbx,%rax,8), %r8
movq 8(%rbx,%rax,8), %r9
movq 16(%rbx,%rax,8), %r10
movq 24(%rbx,%rax,8), %r11
addq $4, %rax
popcnt %r8, %r8
add %r8, %rdx
popcnt %r9, %r9
add %r9, %rcx
popcnt %r10, %r10
add %r10, %rdi
popcnt %r11, %r11
add %r11, %rsi
cmpq $131072, %rax
jne .L4
Same Register: 8.49272 GB/s
.L9:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# This time reuse "rax" for all the popcnts.
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L9
Same Register with broken chain: 17.8869 GB/s
.L14:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# Reuse "rax" for all the popcnts.
xor %rax, %rax # Break the cross-iteration dependency by zeroing "rax".
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L14
So what went wrong with the compiler?

It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.
popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.
(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
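The compensation GCC generates is, in spirit, the same dependency-breaking idiom used in the "broken chain" variant above: xor-zeroing the destination right before the popcnt, which the register renamer recognizes as producing a fresh register with no inputs. A hypothetical fragment:

```asm
xor    %eax, %eax    # zeroing idiom: breaks the dependency on the old rax
popcnt %rdx, %rax    # no longer has to wait for a previous rax result
```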
Why does the CPU have such a false dependency?

We can speculate: it runs on the same execution unit as bsf/bsr which do have an output dependency. (How is POPCNT implemented in hardware?)
For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.
Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.

AMD processors do not appear to have this false dependency.