32-bit support needed for all gates and derived classes, AVX code, and CPython wrappers

Currently only 64-bit precision is used. On smaller circuits this is at least 2x slower than 32-bit would be, and it is theorized to be as much as 2.5x slower because the problem is I/O bound, while the accuracy is not sufficient.
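
To make the I/O-bound argument concrete, here is a rough sketch of the memory traffic involved. It assumes the data being processed is a dense array of 2^n complex amplitudes and that a gate reads and writes each amplitude once; both are assumptions for illustration, and the function names are hypothetical, not part of this code base. It only accounts for the baseline 2x from halving the element size, not for the theorized 2.5x.

```python
# Sketch only: estimates bytes moved per gate application.
# Assumption: one complex amplitude is stored as two floats (real, imaginary).

def state_vector_bytes(num_qubits: int, float_bits: int) -> int:
    """Size in bytes of a dense array holding 2**num_qubits complex amplitudes."""
    amplitudes = 1 << num_qubits
    bytes_per_amplitude = 2 * (float_bits // 8)
    return amplitudes * bytes_per_amplitude


def gate_pass_traffic(num_qubits: int, float_bits: int) -> int:
    """Bytes read plus bytes written when a gate touches every amplitude once."""
    return 2 * state_vector_bytes(num_qubits, float_bits)


if __name__ == "__main__":
    for n in (10, 20, 30):
        t64 = gate_pass_traffic(n, 64)
        t32 = gate_pass_traffic(n, 32)
        print(f"{n} qubits: 64-bit {t64} B, 32-bit {t32} B, ratio {t64 / t32:.1f}x")
```

If the kernel really is limited by memory bandwidth, halving the bytes per amplitude should roughly halve the run time; any gain beyond that would have to come from other effects, such as more of the working set fitting in cache.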

Furthermore, it raises questions about performance comparisons against the Groq and FPGA implementations, which use float32 or 32-bit fixed point.