Transcendental floating-point functions are now unfixably broken on Intel processors

Yousuf Khan

" This error has tragically become un-fixable because of the
compatibility requirements from one generation to the next. The fix for
this problem was figured out quite a long time ago. In the excellent
paper The K5 transcendental functions by T. Lynch, A. Ahmed, M. Schulte,
T. Callaway, and R. Tisdale a technique is described for doing argument
reduction as if you had an infinitely precise value for pi. As far as I
know, the K5 is the only x86 family CPU that did sin/cos accurately. AMD
went back to being bit-for-bit compatibile with the old x87 behavior,
assumably because too many applications broke. Oddly enough, this is
fixed in Itanium.

What we do in the JVM on x86 is moderately obvious: we range check the
argument, and if it's outside the range [-pi/4, pi/4] we do the precise
range reduction by hand, and then call fsin.

So Java is accurate, but slower. I've never been a fan of "fast, but
wrong" when "wrong" is roughly random(). Benchmarks rarely test
accuracy. "double sin(double theta) { return 0; }" would be a great
benchmark-compatible implementation of sin(). For large values of theta,
0 would be arguably more accurate since the absolute error is never
greater than 1. fsin/fcos can have absolute errors as large as 2
(correct answer=1; returned result=-1). "

https://blogs.oracle.com/jag/entry/transcendental_meditation
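For concreteness, here is a minimal sketch in C of the strategy Gosling
describes (the name accurate_sin and the structure are mine, not HotSpot's
actual code): use the hardware instruction only on [-pi/4, pi/4], where it
is accurate, and reduce the argument first everywhere else. The
two-constant Cody-Waite split below is a simplified stand-in for the full
"infinite pi" (Payne-Hanek) reduction from the K5 paper, and is only
adequate for moderately sized arguments:

#include <math.h>

static const double PI_OVER_4   = 0.785398163397448310;   /* pi/4 */
static const double TWO_OVER_PI = 0.636619772367581343;   /* 2/pi */
static const double PIO2_HI = 1.57079632679489655800e+00; /* high bits of pi/2 */
static const double PIO2_LO = 6.12323399573676603587e-17; /* pi/2 - PIO2_HI */

double accurate_sin(double x)
{
    if (fabs(x) <= PI_OVER_4)
        return sin(x);        /* stand-in for fsin; accurate on this range */

    /* Reduce x by the nearest multiple of pi/2, in two pieces, so the
       subtraction keeps bits that a single double's worth of pi/2
       would lose. */
    double n = nearbyint(x * TWO_OVER_PI);
    double r = (x - n * PIO2_HI) - n * PIO2_LO;

    switch ((long)n & 3) {    /* quadrant selects the identity */
    case 0:  return  sin(r);
    case 1:  return  cos(r);
    case 2:  return -sin(r);
    default: return -cos(r);
    }
}

A real fix has to carry hundreds of bits of 2/pi through the reduction,
which is exactly what the K5 paper's technique does.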
 
pedro1492

" This error has tragically become un-fixable because of the
compatibility requirements from one generation to the next. The fix for
this problem was figured out quite a long time ago. In the excellent
paper The K5 transcendental functions by T. Lynch, A. Ahmed, M. Schulte,
T. Callaway, and R. Tisdale a technique is described for doing argument
reduction as if you had an infinitely precise value for pi. As far as I
know, the K5 is the only x86 family CPU that did sin/cos accurately. AMD
went back to being bit-for-bit compatibile with the old x87 behavior,
assumably because too many applications broke. Oddly enough, this is
fixed in Itanium.

What about CUDA and other GPUs?
A company I used to work for tried them for a heavy-duty number-crunching
algorithm (one that took about 4 days on Opterons) and found that they got
different answers. That was a problem for "time-lapse processing", where
data recorded years apart is compared: you would have to reprocess the
old data on the new hardware.
 
Yousuf Khan

What about CUDA and other GPUs?
A company I used to work for tried them for a heavy-duty number-crunching
algorithm (one that took about 4 days on Opterons) and found that they got
different answers. That was a problem for "time-lapse processing", where
data recorded years apart is compared: you would have to reprocess the
old data on the new hardware.

Well, I don't think that CUDA or other GPU-based APIs actually have any
native transcendental functions built in. In fact, as far as I know, most
floating-point hardware doesn't have these functions built in; the x87's
CISC instruction set is the only one that ever had these high-order
functions. All other FPUs implement them in software, using
series-expansion methods.
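As a toy illustration of what "series expansion in software" means here
(a textbook Taylor series, not a tuned production polynomial):

/* Taylor series for sin about 0; software libraries evaluate a
   polynomial like this on a small reduced interval such as
   [-pi/4, pi/4]. */
double sin_series(double r)
{
    double term = r, sum = r;
    for (int k = 1; k <= 7; k++) {
        /* Each step multiplies by -r^2 / ((2k)(2k+1)), producing
           r - r^3/3! + r^5/5! - ... */
        term *= -r * r / ((2.0 * k) * (2.0 * k + 1.0));
        sum  += term;
    }
    return sum;
}

Production libraries use minimax polynomials rather than raw Taylor
terms, but the structure is the same.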

Even x86's newer floating-point instruction sets, such as SSE and AVX,
don't have native support for these high-order functions. So to use
these functions, (1) you would have to implement them in software, or (2)
you would have to do some of your calculations with the new-generation
SSE instructions and pass some of the work to the old x87 hardware.
Interestingly, when AMD created the 64-bit instruction set extensions to
x86, it deprecated the x87 for 64-bit code: the instructions still
execute, but the 64-bit ABIs pass floating-point values in SSE registers
and expect all floating point to be done through SSE2 and later. So,
with no practical access to the x87 hardware in 64-bit programs, you
have little choice but to use software-based techniques.
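You can still see the difference directly. A quick demo sketch of my own
(GCC/Clang-style inline assembly, x86 only) that reaches the x87 fsin
underneath and compares it with libm's software sin() for a large
argument, where fsin's 66-bit approximation of pi falls apart:

#include <math.h>
#include <stdio.h>

static double fsin_x87(double x)
{
    double r;
    __asm__ ("fsin" : "=t" (r) : "0" (x));  /* x87 fsin on the FP stack */
    return r;
}

int main(void)
{
    double theta = 1e10;    /* far outside fsin's accurate range */
    printf("x87 fsin : %.17g\n", fsin_x87(theta));
    printf("libm sin : %.17g\n", sin(theta));
    return 0;
}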

Yousuf Khan
 