I want to use SSE vector instructions to speed up a large number of integer multiplications and additions, with the catch that the arithmetic is performed modulo a fixed prime that fits in fewer than 32 bits. I feel like the code that gcc -O2 generates could be improved by a factor of 10, but I'm not an assembly coder. Thanks,
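For reference, here is a minimal scalar sketch of the kind of loop in question. The post does not show the actual code, so the names `mul_add_mod` and `mul_add_mod_array`, and the exact operation (a * b + c mod p on arrays of 32-bit values), are assumptions made for illustration only. It is meant as a baseline to compare a vectorized version against, not as the SSE version itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: (a * b + c) mod p for a modulus p that fits in 32 bits.
 * The product is widened to 64 bits first. For any 32-bit a, b, c the sum
 * a * b + c is at most 2^64 - 2^32, so the intermediate cannot overflow. */
static uint32_t mul_add_mod(uint32_t a, uint32_t b, uint32_t c, uint32_t p)
{
    assert(p != 0); /* the modulus must be nonzero */
    return (uint32_t)(((uint64_t)a * b + c) % p);
}

/* Hypothetical helper: out[i] = (x[i] * y[i] + z[i]) mod p for i in [0, n). */
static void mul_add_mod_array(uint32_t *out, const uint32_t *x,
                              const uint32_t *y, const uint32_t *z,
                              size_t n, uint32_t p)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = mul_add_mod(x[i], y[i], z[i], p);
    }
}
```

The `%` by a value the compiler cannot see as a constant is likely the expensive part of such a loop, since it typically compiles to a hardware division. That is only a guess about where the time goes here; it would need to be confirmed by profiling the real code.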