4 % 3 == 1
, so (4^k * a + b) % 3 == (a + b) % 3
. You can use this fact to evaluate x%3 for a 32-bit x:
x = (x >> 16) + (x & 0xffff);
x = (x >> 10) + (x & 0x3ff);
x = (x >> 6) + (x & 0x3f);
x = (x >> 4) + (x & 0xf);
x = (x >> 2) + (x & 0x3);
x = (x >> 2) + (x & 0x3);
x = (x >> 2) + (x & 0x3);
if (x == 3) x = 0;
(Untested - you might need a few more reductions.) Is this faster than your hardware can do x%3? If it is, it probably isn't by much.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…