rounding - How do we reduce the precision of accumulated values in neural networks when using integer/fixed-point arithmetic?

Let's say we have an NN with multiple layers, say a simple MLP (Multi-Layer Perceptron) that runs GEMM1 -> Activation1 -> GEMM2 -> Activation2. Now suppose we are doing inference and using int8 as the precision of the data and the weights.

A GEMM layer involves accumulation, and accumulation is generally done in 32 bits, so every element of GEMM1's output is an int32. Before we start Activation1 we will need to convert them from 32 bits back to 8 bits. Or maybe we don't, in which case we do Activation1 in 32 bits. Either way, at some point we need to come back to 8 bits, say before starting GEMM2.
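To make the setup concrete, here is the kind of thing I mean (the shapes and random values are arbitrary, just for illustration):

```python
import numpy as np

# int8 activations and weights, but accumulate in int32 so the dot products don't overflow
x = np.random.randint(-128, 128, size=(4, 64), dtype=np.int8)    # input to GEMM1
w1 = np.random.randint(-128, 128, size=(64, 32), dtype=np.int8)  # weights of GEMM1
acc = x.astype(np.int32) @ w1.astype(np.int32)                   # GEMM1 output: every element is int32
```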

My question is: how is the conversion from int32 to int8 done? Two things come to mind: rounding and quantization. There are many rounding methods (simple, nearest, convergent, etc.), but this doesn't seem like plain rounding, because we are not losing just a little bit of precision; we are dropping 24 bits. For quantization, we basically take the entire range of numbers in the int32 output matrix and map it onto an 8-bit range. But then we need to know the full output matrix before we can do the conversion; we can't do it element by element.
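For example, the "quantization" option I have in mind would be roughly this per-tensor mapping (the function name and the symmetric [-127, 127] choice are just my assumptions, not something I've confirmed anyone actually does):

```python
import numpy as np

def requantize_per_tensor(acc_int32: np.ndarray):
    """Map an int32 accumulator tensor to int8 by scaling its full range.

    This is only my guess at the 'quantization' option: find the largest
    magnitude in the whole tensor, derive a scale so everything fits in
    [-127, 127], then round and clamp each element.
    """
    max_abs = np.max(np.abs(acc_int32))
    if max_abs == 0:
        return np.zeros_like(acc_int32, dtype=np.int8), 1.0
    scale = max_abs / 127.0                      # one scale for the whole tensor
    q = np.rint(acc_int32 / scale)               # round to nearest
    q = np.clip(q, -127, 127).astype(np.int8)    # saturate into int8
    return q, scale                              # the scale is needed to interpret q later

# Toy usage: a fake GEMM1 output in int32
acc = np.array([[253, -1200, 7], [45000, -3, 88]], dtype=np.int32)
q, scale = requantize_per_tensor(acc)
print(q, scale)
```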

I use int in the text above, but I think fixed point is the same from a rounding/quantization perspective. Floating point is different, and it makes sense that people like BFloat16 (over IEEE half precision/FP16) because it has the same exponent range as FP32. So converting the output of GEMM1 from IEEE single precision (FP32) to BFloat16 is easier: a number like 2.46392 just becomes, say, 2.5. We lose some precision, but the converted result is still close to the original number. With fixed point/int it is confusing, because it looks like we are changing a number from, say, 253 to 56, which is a different scale altogether.
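For instance, the only way I can make sense of the 253 -> 56 example is by attaching a scale to each representation (the numbers below are completely made up):

```python
# Made-up scales, just to check my own understanding.
acc_scale = 0.01          # suppose each int32 accumulator step is worth 0.01
out_scale = 0.045         # suppose the int8 output uses steps of 0.045

acc_int32 = 253
real_value = acc_int32 * acc_scale        # 2.53 -- the value the accumulator represents
out_int8 = round(real_value / out_scale)  # round(56.2...) = 56
reconstructed = out_int8 * out_scale      # 2.52 -- close to 2.53, like the BFloat16 case

print(out_int8, reconstructed)
```

So maybe the int8 value 56 is only meaningful together with its scale, the same way the BFloat16 result stays close to the original because the exponent is preserved? I'm not sure whether that is how it actually works.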

I hope this is making sense. Please correct me if I am wrong somewhere.

Question from: https://stackoverflow.com/questions/65713031/how-do-we-reduce-the-precision-of-accumulated-values-in-neural-networks-when-usi


1 Answer

Waiting for answers
