I am evaluating the usage (clearing and querying) of Floating-Point Exceptions in performance-critical/"hot" code. Looking at the binary produced I noticed that neither GCC nor Clang expand the call to an inline sequence of instructions that I would expect; instead they seem to generate a call to the runtime library. This is prohibitively expensive for my application.
Consider the following minimal example:
#include <fenv.h>
#pragma STDC FENV_ACCESS on
inline int fetestexcept_inline(int e)
{
unsigned int mxcsr;
asm volatile ("vstmxcsr" " %0" : "=m" (*&mxcsr));
return mxcsr & e & FE_ALL_EXCEPT;
}
double f1(double a)
{
double r = a * a;
if(r == 0 || fetestexcept_inline(FE_OVERFLOW)) return -1;
else return r;
}
double f2(double a)
{
double r = a * a;
if(r == 0 || fetestexcept(FE_OVERFLOW)) return -1;
else return r;
}
And the output as produced by GCC: https://godbolt.org/z/jxjzYY
The compiler seems to know that he can use the CPU-family-dependent AVX-instructions for the target (it uses "vmulsd" for the multiplication). However, no matter which optimization flags I try, it will always produce the much more expensive function call to glibc rather than the assembly that (as far as I understand) should do what the corresponding glibc function does.
This is not intended as a complaint, I am OK with adding the inline assembly. I just wonder whether there might be a subtle difference that I am overlooking that could be a bug in the inline-assembly-version.
question from:
https://stackoverflow.com/questions/65891873/why-is-fetestexcept-in-c-compiled-to-a-function-call-rather-than-inlined 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…