How are Math.Cos & Math.Sin implemented?

Morgan Cheng · Oct 13, 2006

Hi,

I am writing a program that will perform a lot of Math.Cos & Math.Sin
operations. I am afraid this will be a source of performance problems.

Does anybody know how Math.Cos & Math.Sin are implemented?
I suppose they just look values up in a huge pre-computed table, which
would be quick. I tried caching the cos/sin of every possible angle in my
own array, but it turned out to be much faster to call Math.Cos & Math.Sin
all the time.

Jon Slaughter · Oct 13, 2006

It uses an algorithm. Tables only give finite precision, and it takes
something like 2*Pi*10^7 values to match the precision a good algorithm
gets for a float. You can reduce the table size by using symmetry and such,
but you introduce overhead when you do that too, and the size is still of
the same order.

Ultimately it's your choice. You choose a table for speed and waste memory,
or you use an algorithm for precision and don't waste memory.
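
A rough sketch of the precision trade-off described above, assuming a plain nearest-entry lookup with no interpolation. The class and method names are made up for illustration. It measures the worst error of a sine table of a given size against Math.Sin; with this scheme the error is bounded by half the angular step, which is why float-level precision needs a table on the order of 10^7 entries.

```csharp
using System;

class TablePrecision
{
    // Largest error seen when sin(x) is replaced by the nearest of 'entries'
    // evenly spaced table values over one full turn (no interpolation).
    public static double MaxLookupError(int entries, int samples)
    {
        double step = 2 * Math.PI / entries;
        double[] table = new double[entries];
        for (int i = 0; i < entries; i++)
            table[i] = Math.Sin(i * step);

        double worst = 0;
        for (int s = 0; s < samples; s++)
        {
            double x = 2 * Math.PI * s / samples;
            // Round to the nearest table slot; wrap the last slot back to 0.
            int index = (int)Math.Round(x / step) % entries;
            worst = Math.Max(worst, Math.Abs(Math.Sin(x) - table[index]));
        }
        return worst;
    }

    static void Main()
    {
        foreach (int entries in new int[] { 1440, 100000, 1000000 })
            Console.WriteLine("{0} entries: max error {1:E2}",
                              entries, MaxLookupError(entries, 1000003));
    }
}
```

Each tenfold increase in table size should buy roughly one more decimal digit, so the memory cost grows quickly if the angles are arbitrary.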

Chris Nahr · Oct 13, 2006

Morgan Cheng said:
Does anybody know how Math.Cos & Math.Sin are implemented?

Not at all. That is, all modern personal computer CPUs have a
built-in math coprocessor that directly provides trigonometric
functions. The Math.* methods simply forward calls to these
optimized hardware facilities. So it's extremely unlikely that you'll
get better speed by writing a software algorithm in C# or even C++.

Jon Skeet [C# MVP] · Oct 13, 2006

Morgan Cheng said:
I am writing a program that will perform a lot of Math.Cos & Math.Sin
operations. I am afraid this will be a source of performance problems.

Whenever you have performance fears, run tests. Most of the time, in my
experience, you'll find that performance fears about individual bits of
code are unfounded.

In this case, a quick test on my laptop showed Math.Sin being called
1,000,000,000 times in less than 2 seconds. Just how often is your
program going to call the trig methods?

Michael A. Covington · Oct 14, 2006

Jon Skeet said:
In this case, a quick test on my laptop showed Math.Sin being called
1,000,000,000 times in less than 2 seconds. Just how often is your
program going to call the trig methods?

With different arguments each time, or were most of the calls optimized
away?

That's 500 sine computations per microsecond. A microsecond is maybe 2400
clock cycles on your Pentium. I don't recall if the Pentium clock is
divided down. Even if it's not, 4.8 clock cycles per sine computation is
not quite credible.

Jon Skeet [C# MVP] · Oct 14, 2006

You're right. Here's a somewhat better test - I suspect things were
being optimised out before and I was too sleepy to notice. Oops!

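using System;

class Test
{
    static void Main()
    {
        // Each call depends on the previous result, so the JIT
        // cannot optimise the trig calls away.
        double total = 0.23;

        DateTime start = DateTime.Now;
        for (int i = 0; i < 100000000; i++)
        {
            total += Math.Sin(total);
            total += Math.Cos(total);
        }
        DateTime end = DateTime.Now;

        Console.WriteLine(end - start);
        // Printing the total keeps the result observable.
        Console.WriteLine(total);
    }
}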

This is harder work, of course: 2 trig operations and 2 additions per
iteration. The timing on my box is 12 seconds for the 100,000,000
iterations. Not as fast as before, but still likely to be fast enough for
the OP not to have to worry.

Willy Denoyette [MVP] · Oct 14, 2006

The following is exactly what the JIT produced for the loop in release
mode; the figures in parentheses are the instruction latencies (here for
AMD64, yours may vary).

dd0424        fld   qword ptr [esp]   (4)
d9fe          fsin                    (93)
dc0424        fadd  qword ptr [esp]   (6)
dd1c24        fstp  qword ptr [esp]   (2)
dd0424        fld   qword ptr [esp]   (4)
d9ff          fcos                    (92)
dc0424        fadd  qword ptr [esp]   (6)
dd1c24        fstp  qword ptr [esp]   (2)
83c601        add   esi,1             (1)
81fe00e1f505  cmp   esi,5F5E100h      (4)
7cda          jl    00cb00a0          (1)

That's a total of 215 clock cycles per loop iteration. On my box, with a
clock cycle of ~0.4329 ns, that would account for ~93 ns per iteration, or
9.3 s for 100,000,000 iterations. The test actually runs in 8.59 s,
because some of the instructions execute in parallel.

Willy.
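
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the latencies, the clock period and the iteration count are the figures quoted in the post, and the class and method names are made up for illustration.

```csharp
using System;

class CycleEstimate
{
    // Sum of the per-instruction latencies, times the clock period,
    // times the number of loop iterations, converted to seconds.
    public static double EstimateSeconds(int[] latencies, double nsPerCycle, long iterations)
    {
        long cycles = 0;
        foreach (int latency in latencies)
            cycles += latency;
        return cycles * nsPerCycle * iterations / 1e9;
    }

    static void Main()
    {
        // Latencies in the order of the disassembly: fld, fsin, fadd, fstp,
        // fld, fcos, fadd, fstp, add, cmp, jl.
        int[] latencies = { 4, 93, 6, 2, 4, 92, 6, 2, 1, 4, 1 };
        double seconds = EstimateSeconds(latencies, 0.4329, 100000000);
        Console.WriteLine(seconds);   // roughly 9.3 seconds
    }
}
```

This treats the instructions as strictly sequential, so it is an upper bound; the measured 8.59 s is lower because of overlapping execution.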

Morgan Cheng · Oct 16, 2006

Thanks for your clarification.
I did some experimentation too. It shows that sin/cos doesn't take many CPU
cycles, but I still prefer to pre-compute the needed sin/cos values in two
arrays and fetch them later, since I am implementing a Hough transform,
which needs cos/sin in an X*Y loop (X & Y are the image width and height).
Accessing an array is always supposed to be faster than a Math.Cos &
Math.Sin function call, right?

Morgan Cheng · Oct 16, 2006

Jon Slaughter said:
It uses an algorithm. Tables only give finite precision, and it takes
something like 2*Pi*10^7 values to match the precision a good algorithm
gets for a float. You can reduce the table size by using symmetry and such,
but you introduce overhead when you do that too, and the size is still of
the same order.

Ultimately it's your choice. You choose a table for speed and waste memory,
or you use an algorithm for precision and don't waste memory.

In my case, I don't need the cos/sin values of arbitrary angles. I just
need 0, 1/4, 2/4, 3/4, ..., 359 3/4 degrees. So I precompute them and put
them in two arrays, cos[360*4] and sin[360*4].
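
A minimal sketch of the two tables described above, assuming every angle is a whole number of quarter-degree steps. The names TrigTables, StepsPerDegree, CosTable and SinTable are made up for illustration.

```csharp
using System;

static class TrigTables
{
    public const int StepsPerDegree = 4;            // quarter-degree resolution
    public const int Size = 360 * StepsPerDegree;   // 1440 entries per table

    public static readonly double[] CosTable = new double[Size];
    public static readonly double[] SinTable = new double[Size];

    static TrigTables()
    {
        for (int i = 0; i < Size; i++)
        {
            // Index i stands for i/4 degrees; Math.Cos and Math.Sin take radians.
            double radians = i * Math.PI / (180.0 * StepsPerDegree);
            CosTable[i] = Math.Cos(radians);
            SinTable[i] = Math.Sin(radians);
        }
    }
}

class Program
{
    static void Main()
    {
        int index = 30 * TrigTables.StepsPerDegree;   // 30 degrees
        Console.WriteLine(TrigTables.SinTable[index]);
        Console.WriteLine(Math.Sin(Math.PI / 6));
    }
}
```

Two tables of 1440 doubles take about 23 KB, so they are cheap to keep around for the lifetime of the program.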

Ben Newsam · Oct 16, 2006

Many years ago, I implemented SIN and COS tables as 16 bit values
rather than floating point. This was in assembler, mind you, not a
modern language. For the graphic resolution required at the time, 16
bits was quite enough. When calculating an X/Y position, I stored the
remainder and used that in the next calculation to cut down on
positional errors. It worked a treat, and was blindingly fast too. I
suspect that nowadays, 24 bit values might be required, but the
principle remains the same.
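
A sketch of the idea in C# rather than assembler, so it is only an approximation of what is described above. It assumes 16-bit table entries scaled by 2^14, one entry per degree, and a per-axis remainder that is carried into the next step. All names are made up for illustration.

```csharp
using System;

static class FixedPointTrig
{
    const int Shift = 14;     // table values are scaled by 2^14 = 16384
    const int Steps = 360;    // one entry per degree (assumed resolution)
    static readonly short[] SinTable = new short[Steps];
    static readonly short[] CosTable = new short[Steps];

    static FixedPointTrig()
    {
        for (int i = 0; i < Steps; i++)
        {
            double radians = i * Math.PI / 180.0;
            SinTable[i] = (short)Math.Round(Math.Sin(radians) * (1 << Shift));
            CosTable[i] = (short)Math.Round(Math.Cos(radians) * (1 << Shift));
        }
    }

    // Moves one coordinate by distance * table value. The fractional part
    // is kept in 'remainder' and added to the next step, so the rounding
    // error does not accumulate.
    static int Advance(int position, int distance, short scaled, ref int remainder)
    {
        int total = distance * scaled + remainder;   // still scaled by 2^14
        int whole = total >> Shift;                  // arithmetic shift, rounds down
        remainder = total - (whole << Shift);        // always in [0, 2^14)
        return position + whole;
    }

    // Takes 'steps' moves of 'distance' units along the heading 'degrees'.
    public static void Walk(int degrees, int distance, int steps, out int x, out int y)
    {
        x = 0;
        y = 0;
        int remX = 0, remY = 0;
        for (int i = 0; i < steps; i++)
        {
            x = Advance(x, distance, CosTable[degrees], ref remX);
            y = Advance(y, distance, SinTable[degrees], ref remY);
        }
    }
}

class Program
{
    static void Main()
    {
        int x, y;
        FixedPointTrig.Walk(30, 3, 100, out x, out y);
        Console.WriteLine("{0}, {1}", x, y);   // exact values would be 259.8 and 150
    }
}
```

Without the carried remainder each step would lose up to one unit to truncation, and the error would grow with the number of steps.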

Willy Denoyette [MVP] · Oct 16, 2006

Morgan Cheng said:
Thanks for your clarification.
I did some experimentation too. It shows that sin/cos doesn't take many CPU
cycles, but I still prefer to pre-compute the needed sin/cos values in two
arrays and fetch them later, since I am implementing a Hough transform,
which needs cos/sin in an X*Y loop (X & Y are the image width and height).
Accessing an array is always supposed to be faster than a Math.Cos &
Math.Sin function call, right?

Could be, but keep in mind that using a table look-up might introduce some
hidden costs.
I wouldn't care about this kind of micro-optimization; it is more important
to take care of good algorithm design and implementation. If it ever turns
out to be a bottleneck, you can switch to a table look-up once proper
profiling has confirmed the performance gain.

Willy.
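
One way to do the measurement suggested above is a small side-by-side timing. This is a sketch only: the table size, the iteration count and all names are assumptions, and a micro-benchmark like this is no substitute for profiling the real program.

```csharp
using System;
using System.Diagnostics;

class LookupBenchmark
{
    const int TableSize = 360 * 4;      // quarter-degree steps
    const int Iterations = 10000000;

    public static double[] BuildSinTable()
    {
        double[] table = new double[TableSize];
        for (int i = 0; i < TableSize; i++)
            table[i] = Math.Sin(i * Math.PI / (180.0 * 4));
        return table;
    }

    // Sums sines computed by Math.Sin on every iteration.
    public static double SumWithMathSin(int iterations)
    {
        double total = 0;
        for (int i = 0; i < iterations; i++)
            total += Math.Sin((i % TableSize) * Math.PI / (180.0 * 4));
        return total;
    }

    // Sums the same sines, fetched from the precomputed table.
    public static double SumWithTable(double[] table, int iterations)
    {
        double total = 0;
        for (int i = 0; i < iterations; i++)
            total += table[i % table.Length];
        return total;
    }

    static void Main()
    {
        double[] table = BuildSinTable();

        Stopwatch watch = Stopwatch.StartNew();
        double direct = SumWithMathSin(Iterations);
        watch.Stop();
        Console.WriteLine("Math.Sin: {0} ms (sum {1})", watch.ElapsedMilliseconds, direct);

        watch = Stopwatch.StartNew();
        double looked = SumWithTable(table, Iterations);
        watch.Stop();
        Console.WriteLine("Table:    {0} ms (sum {1})", watch.ElapsedMilliseconds, looked);
    }
}
```

The sums are printed so that neither loop can be optimised away. Run it in release mode, outside the debugger, and repeat it a few times before trusting the numbers.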

Arne Vajhøj · Oct 17, 2006

Morgan said:
In my case, I don't need the cos/sin values of arbitrary angles. I just
need 0, 1/4, 2/4, 3/4, ..., 359 3/4 degrees. So I precompute them and put
them in two arrays, cos[360*4] and sin[360*4].

In which case the array lookup is obviously faster. But it also
has nothing to do with the general case.

Arne

Morgan Cheng · Oct 17, 2006

My implementation of the Hough transform is something like this:

double angle = 0.0;
double angleStep = Math.PI / (180 * 4);   // quarter-degree step, in radians
for (int x = 0; x < image.Width; ++x)
    for (int y = 0; y < image.Height; ++y)
    {
        if (some condition)
        {
            double radius = x * Math.Cos(angle) + y * Math.Sin(angle);
            ....
        }
        angle += angleStep;
    }

I wrote this code in a rush, so it may not be an accurate Hough transform.
The point is that cos & sin are computed over and over again in a
two-dimensional loop.
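
For comparison, here is a sketch of a Hough line accumulator that computes each cosine and sine once, outside the pixel loops. It is not the poster's implementation: it uses the textbook form in which every edge pixel votes for all angles, and the bool[,] edge map, the quarter-degree angle resolution and all names are assumptions.

```csharp
using System;

class HoughSketch
{
    const int ThetaSteps = 180 * 4;   // quarter-degree steps over [0, 180) degrees

    // edges[x, y] is true where the image has an edge pixel (hypothetical input format).
    public static int[,] Accumulate(bool[,] edges)
    {
        int width = edges.GetLength(0);
        int height = edges.GetLength(1);

        // Compute every cosine and sine once, before touching any pixel.
        double[] cosTable = new double[ThetaSteps];
        double[] sinTable = new double[ThetaSteps];
        for (int t = 0; t < ThetaSteps; t++)
        {
            double theta = t * Math.PI / ThetaSteps;
            cosTable[t] = Math.Cos(theta);
            sinTable[t] = Math.Sin(theta);
        }

        // rho lies in [-maxRho, +maxRho]; adding maxRho gives a non-negative index.
        int maxRho = (int)Math.Ceiling(Math.Sqrt(width * width + height * height));
        int[,] votes = new int[ThetaSteps, 2 * maxRho + 1];

        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
            {
                if (!edges[x, y]) continue;
                for (int t = 0; t < ThetaSteps; t++)
                {
                    double rho = x * cosTable[t] + y * sinTable[t];
                    votes[t, (int)Math.Round(rho) + maxRho]++;
                }
            }
        return votes;
    }

    static void Main()
    {
        // A vertical line of edge pixels at x = 5 in a 20x20 image.
        bool[,] edges = new bool[20, 20];
        for (int y = 0; y < 20; y++)
            edges[5, y] = true;

        int[,] votes = Accumulate(edges);
        int maxRho = (votes.GetLength(1) - 1) / 2;
        // theta = 0 and rho = 5 describe the line x = 5, so all 20 pixels vote here.
        Console.WriteLine(votes[0, 5 + maxRho]);
    }
}
```

The trig work is now 2 * 720 calls in total, regardless of image size; the inner loop only does two multiplications, an addition and a rounding per vote.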

Morgan Cheng · Oct 17, 2006

Willy said:
Could be, but keep in mind that using a table look-up might introduce some
hidden costs.

You mean the cost introduced by array bounds checking?
