Jean-François Michaud wrote:
> > However, when working with more digits and/or parallel operations,
> > straight inline SIMD code tends to beat lookup tables due to the load
> > stage bottleneck.
> Hmm, possibly, but that's assuming a lot ;-). Most processors out there
> don't have the SIMD instruction set/registers.
Actually, most do. But just enough don't that it is dangerous to rely
on SIMD in a universal application.
The experiments I've run lately show a 512-byte lookup table (which
translates one byte/two digits at a time) to be the fastest, but I've
not done careful analysis to make sure the cache isn't getting polluted
by having such a large table. Bertrand posted an SSE version in the
"faster hex to buffer routine" thread. And I've provided a couple of
Note to Terje -- I tried the MUL trick with 32-bit integers (converted
to a decimal string) and wound up having to do a *huge* multiplication
to overcome the loss of precision. For two characters (one byte) you
can get away with this, but not for 32-bit integers. I'd be interested
in seeing a 32-bit conversion to decimal integer string that doesn't
involve at least three MUL instructions. Repeated subtraction for each
digit turned out to be the fastest solution I could come up with.
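
For anyone following along, here is a minimal C sketch of the two approaches mentioned above: a 512-byte table that maps one byte to two digits, and a 32-bit to decimal conversion that uses repeated subtraction instead of MUL or DIV. It is not the code that was timed. It assumes the "digits" are uppercase hexadecimal, and the names hex_pairs, init_hex_pairs, bytes_to_hex and u32_to_dec are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 256 entries x 2 characters = 512 bytes (assumed layout). */
static char hex_pairs[256][2];

/* Fill the table once before the first conversion. */
static void init_hex_pairs(void)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 256; i++) {
        hex_pairs[i][0] = digits[i >> 4];   /* high nibble */
        hex_pairs[i][1] = digits[i & 15];   /* low nibble  */
    }
}

/* One table load per input byte yields two output characters.
   out must have room for 2*len + 1 bytes. */
static void bytes_to_hex(const uint8_t *src, size_t len, char *out)
{
    assert(src != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        memcpy(out + 2 * i, hex_pairs[src[i]], 2);
    out[2 * len] = '\0';
}

/* Decimal conversion by repeated subtraction: no MUL, no DIV.
   out must have room for 11 bytes. Returns the number of digits. */
static size_t u32_to_dec(uint32_t value, char *out)
{
    static const uint32_t pow10[10] = {
        1000000000u, 100000000u, 10000000u, 1000000u, 100000u,
        10000u, 1000u, 100u, 10u, 1u
    };
    char *p = out;
    int started = 0;                /* set once a digit has been emitted */

    assert(out != NULL);
    for (int i = 0; i < 10; i++) {
        char digit = '0';
        while (value >= pow10[i]) { /* at most 9 iterations per digit */
            value -= pow10[i];
            digit++;
        }
        /* Skip leading zeros, but always emit the units digit. */
        if (digit != '0' || started || i == 9) {
            *p++ = digit;
            started = 1;
        }
    }
    *p = '\0';
    return (size_t)(p - out);
}
```

Call init_hex_pairs() once before using bytes_to_hex(). Whether either routine beats a MUL-based or SIMD version depends on the processor and on how the table behaves in the cache, so treat this as a starting point for timing, not a result.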