Is there a faster hi resolution timer for diy profiling

C

colin

Hi,

Im trying to time some parts of my code using the Stopwatch class,
but it is so slow, I tried the QueryPerformanceFrequency()
but this seems to be just as slow,
if I just simply call this function in a loop it takes 21 ticks
with a reported frequency of 3.5mhz
its is about 6us wich is over 12k instructions on my 2ghz cpu.
this negates the idea of having a high res timer ...

Is there any way to access the timer more directly ?
I know the cpu actually has a very high resolution timer,
it would be nice to be able to just read this value directly,
maybe il have to write a c++ library and do it in asm
thats assuming I can get the OS to grant me acces to that hardware address.

basicaly im making a diy profiler, I have a profile class
wich needs Enter(enum p) funtion at the start and a Leave()
function at the end of some functions of interest.

primarily it keeps tarck of the number of calls made,
and a lists the calling functions counts too.
it also measures the total time spent in the function and the subtracts the
time spent in
subroutins to provide an actual time too.

It works pretty quick for counts but when I use the timer it is like 20
times slower,
wich although I can live with waiting 20 minutes to collect the data it does
mean
that there is a considerable error for short functions,

I have added a fudge wich subtracts a fiddle value from the timer,
wich is increased by 21 ticks every times its called,
this might be close but its far from ideal.

Im trying to find out why/where its spending 75% of the cpu time in
mscorlib and mscorwks libraries, the amd codeanalyst identifies its spending
most of its time in a function called mscorwks::gc_heap::adjust_limit_clr

thanks
Colin =^.^=
 
N

Nicholas Paldino [.NET/C# MVP]

Colin,

There is no way that you are going to get access to the hardware through
user-mode code. You will have to write a driver of some sort which has
kernel mode access in order to access that value. You definitely can't do
it in C#, as you can't write assemblies with exportable functions (which
would most definitely be required).

When you use the Stopwatch class, are you checking the static
IsHighResolution property? If so, does it return true or false? If it
returns true, then your hardware supports the high-resolution performance
counter, if not, then the hardware simply does not support it. The
Stopwatch class calls the QueryPerformanceFrequency method through the
P/Invoke layer to find this out, as well as to get values from the
performance counter, which is always going to add some overhead to the
calls.

You could create a native dll which would eliminate the overhead in
making those calls to the underlying APIs, but then you would have to use
the P/Invoke layer to make calls to YOUR native dll, which is going to
re-add the overhead.

The point is that the Stopwatch class is about as good as you are going
to get for any managed solution.
 
C

colin

thanks Nicholas

yes im sure high is resolution supported, I looked in the debuger
at the flag, and I dump the frequency wich is 3.7~mhz, wich
also means its high resolution.

I dont know much about the layers under c#, but it seems to be spending
about 10k
instructions in the call wich is a lot !

its unfurtunate this p/invoke overhead is there, does unsafe code help
can you have shared memory ?

or maybe the low resolution timer is quicker to look at ?
i might investigate ...

another option would be to do the informtation exchange the other way,
ie. a high priority routine such as a hardware timer interrupt updates
a variable in c#, however although this probably has a similar performance
hit
the timer would only need to update the variable about 100k times a second,
wheres I need to read it many times more than this.

maybe I could get away with a windows form timer wich is set to 1ms.
hardly high resolution but it might average out ok over enough samples.
would have to be in a very high priority thread, would this work or would it
likely get
delayed if other threads running continuosly ?

Colin =^.^=

Nicholas Paldino said:
Colin,

There is no way that you are going to get access to the hardware
through user-mode code. You will have to write a driver of some sort
which has kernel mode access in order to access that value. You
definitely can't do it in C#, as you can't write assemblies with
exportable functions (which would most definitely be required).

When you use the Stopwatch class, are you checking the static
IsHighResolution property? If so, does it return true or false? If it
returns true, then your hardware supports the high-resolution performance
counter, if not, then the hardware simply does not support it. The
Stopwatch class calls the QueryPerformanceFrequency method through the
P/Invoke layer to find this out, as well as to get values from the
performance counter, which is always going to add some overhead to the
calls.

You could create a native dll which would eliminate the overhead in
making those calls to the underlying APIs, but then you would have to use
the P/Invoke layer to make calls to YOUR native dll, which is going to
re-add the overhead.

The point is that the Stopwatch class is about as good as you are going
to get for any managed solution.


--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)

colin said:
Hi,

Im trying to time some parts of my code using the Stopwatch class,
but it is so slow, I tried the QueryPerformanceFrequency()
but this seems to be just as slow,
if I just simply call this function in a loop it takes 21 ticks
with a reported frequency of 3.5mhz
its is about 6us wich is over 12k instructions on my 2ghz cpu.
this negates the idea of having a high res timer ...

Is there any way to access the timer more directly ?
I know the cpu actually has a very high resolution timer,
it would be nice to be able to just read this value directly,
maybe il have to write a c++ library and do it in asm
thats assuming I can get the OS to grant me acces to that hardware
address.

basicaly im making a diy profiler, I have a profile class
wich needs Enter(enum p) funtion at the start and a Leave()
function at the end of some functions of interest.

primarily it keeps tarck of the number of calls made,
and a lists the calling functions counts too.
it also measures the total time spent in the function and the subtracts
the time spent in
subroutins to provide an actual time too.

It works pretty quick for counts but when I use the timer it is like 20
times slower,
wich although I can live with waiting 20 minutes to collect the data it
does mean
that there is a considerable error for short functions,

I have added a fudge wich subtracts a fiddle value from the timer,
wich is increased by 21 ticks every times its called,
this might be close but its far from ideal.

Im trying to find out why/where its spending 75% of the cpu time in
mscorlib and mscorwks libraries, the amd codeanalyst identifies its
spending
most of its time in a function called mscorwks::gc_heap::adjust_limit_clr

thanks
Colin =^.^=
 
W

Willy Denoyette [MVP]

colin said:
Hi,

Im trying to time some parts of my code using the Stopwatch class,
but it is so slow, I tried the QueryPerformanceFrequency()
but this seems to be just as slow,
if I just simply call this function in a loop it takes 21 ticks
with a reported frequency of 3.5mhz
its is about 6us wich is over 12k instructions on my 2ghz cpu.
this negates the idea of having a high res timer ...

Is there any way to access the timer more directly ?
I know the cpu actually has a very high resolution timer,
it would be nice to be able to just read this value directly,
maybe il have to write a c++ library and do it in asm
thats assuming I can get the OS to grant me acces to that hardware
address.

basicaly im making a diy profiler, I have a profile class
wich needs Enter(enum p) funtion at the start and a Leave()
function at the end of some functions of interest.

primarily it keeps tarck of the number of calls made,
and a lists the calling functions counts too.
it also measures the total time spent in the function and the subtracts
the time spent in
subroutins to provide an actual time too.

It works pretty quick for counts but when I use the timer it is like 20
times slower,
wich although I can live with waiting 20 minutes to collect the data it
does mean
that there is a considerable error for short functions,

I have added a fudge wich subtracts a fiddle value from the timer,
wich is increased by 21 ticks every times its called,
this might be close but its far from ideal.

Im trying to find out why/where its spending 75% of the cpu time in
mscorlib and mscorwks libraries, the amd codeanalyst identifies its
spending
most of its time in a function called mscorwks::gc_heap::adjust_limit_clr

thanks
Colin =^.^=



Please post your code, there must be something wrong with it,
QueryPerformanceFrequency()
can't take 6µsec to complete.
Note that there is no need to call QueryPerformanceFrequency(), this value
exposed by the Stopwatch class.
Willy.
 
W

Willy Denoyette [MVP]

Willy Denoyette said:
Please post your code, there must be something wrong with it,
QueryPerformanceFrequency()
can't take 6µsec to complete.
Note that there is no need to call QueryPerformanceFrequency(), this value
exposed by the Stopwatch class.
Willy.



Well, QueryPerformanceFrequency must transition into the kernel, which is
quite expensive, (> 6000 instructions), so this function may well take +4
µsec on a 2Ghz machine. The same is true for QueryPerformanceCounter, so you
need to account for this overhead when profiling.

Windows Vista and WS2008 have a new API QueryThreadCycleTime which returns
the CPU cycles count.
This API calls the __rdtsc(); VC++ compilers intrinsic function, you can
implement a small C function that calls this intrinsic and call this one
from C# using PInvoke.


Willy.
 
W

Willy Denoyette [MVP]

colin said:
I just copied the code from here
http://msdn2.microsoft.com/en-us/library/aa964692(VS.80).aspx

and oops I meant the QueryPerformanceCounter() call is taking so long ...

I just have a loop wich calls it repeatedly 100000 times to measure how
long it takes.

the Stopwatch also took just as long.


Hmmm.. this code is meant for the compact framework (WinCE).
The exact PInvoke declaration for Windows is:

[DllImport("kernel32"),SuppressUnmanagedCodeSecurity]
static extern int QueryPerformanceFrequency(out long freq);

but as I said in another reply QueryPerformanceFrequency is an expensive
call.

Willy.
 
C

colin

Willy Denoyette said:
colin said:
I just copied the code from here
http://msdn2.microsoft.com/en-us/library/aa964692(VS.80).aspx

and oops I meant the QueryPerformanceCounter() call is taking so long ...

I just have a loop wich calls it repeatedly 100000 times to measure how
long it takes.

the Stopwatch also took just as long.


Hmmm.. this code is meant for the compact framework (WinCE).
The exact PInvoke declaration for Windows is:

[DllImport("kernel32"),SuppressUnmanagedCodeSecurity]
static extern int QueryPerformanceFrequency(out long freq);

but as I said in another reply QueryPerformanceFrequency is an expensive
call.

Ah yes i found that out after I copied it and had to go search for
coredll.dll
it seems its in kernel32 for non ce
I aded the supress security too
in the hope it might speed it up ...

seems these two calls are easy to mix up when writing them lol...
i just call frequency func once, and the counter func about 100million times

thanks
Colin =^.^=
 
W

Willy Denoyette [MVP]

Willy Denoyette said:
Well, QueryPerformanceFrequency must transition into the kernel, which is
quite expensive, (> 6000 instructions), so this function may well take +4
µsec on a 2Ghz machine. The same is true for QueryPerformanceCounter, so
you need to account for this overhead when profiling.

Windows Vista and WS2008 have a new API QueryThreadCycleTime which returns
the CPU cycles count.
This API calls the __rdtsc(); VC++ compilers intrinsic function, you can
implement a small C function that calls this intrinsic and call this one
from C# using PInvoke.


Willy.



// C function that uses an intrinsic to retrieve the Processor's cycle
counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.
 
C

colin

Willy Denoyette said:
"Willy Denoyette [MVP]" <[email protected]> wrote in message
// C function that uses an intrinsic to retrieve the Processor's cycle
counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.

thanks that worked, however it stil takes 9000 clock cycles
wich I think is roughly about the same time ...
is that PInvoke realy so time consuming ?
what does it have todo ?

thanks
Colin =^.^=
 
C

colin

colin said:
Willy Denoyette said:
"Willy Denoyette [MVP]" <[email protected]> wrote in message
// C function that uses an intrinsic to retrieve the Processor's cycle
counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.

thanks that worked, however it stil takes 9000 clock cycles
wich I think is roughly about the same time ...
is that PInvoke realy so time consuming ?
what does it have todo ?


Ive just used Environment.TickCount, its only 1ms resolution
but it seems to be instant.

as ive plenty enough samples it seems to give some results still ...

thanks

Colin =^.^=
 
W

Willy Denoyette [MVP]

colin said:
Willy Denoyette said:
"Willy Denoyette [MVP]" <[email protected]> wrote in message
// C function that uses an intrinsic to retrieve the Processor's cycle
counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.

thanks that worked, however it stil takes 9000 clock cycles
wich I think is roughly about the same time ...
is that PInvoke realy so time consuming ?
what does it have todo ?

thanks
Colin =^.^=



What exactly takes 9000 cycles?

Willy.
 
W

Willy Denoyette [MVP]

colin said:
colin said:
Willy Denoyette said:
"Willy Denoyette [MVP]" <[email protected]> wrote in message
// C function that uses an intrinsic to retrieve the Processor's cycle
counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.

thanks that worked, however it stil takes 9000 clock cycles
wich I think is roughly about the same time ...
is that PInvoke realy so time consuming ?
what does it have todo ?


Ive just used Environment.TickCount, its only 1ms resolution
but it seems to be instant.

To do what?
Using TickCount to measure something that takes less that the RTC interval
will return the same count, the reason is that TickCount is updated at the
rithm of the system clock (every 10 or 15.6 msec. or more....).

Willy.


Willy.
Willy.
 
C

colin

Willy Denoyette said:
colin said:
colin said:
// C function that uses an intrinsic to retrieve the Processor's cycle
counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.

thanks that worked, however it stil takes 9000 clock cycles
wich I think is roughly about the same time ...
is that PInvoke realy so time consuming ?
what does it have todo ?


Ive just used Environment.TickCount, its only 1ms resolution
but it seems to be instant.

To do what?
Using TickCount to measure something that takes less that the RTC interval
will return the same count, the reason is that TickCount is updated at the
rithm of the system clock (every 10 or 15.6 msec. or more....).

Willy.

well yes one would think so wich is why I wanted a high res timer,
but if you call it 1000 times and only 1 call it changes by 1 then
you know on average its taken 1us. ofc this means its subject to random
noise in the results.
the usuble resolution is timer_resolution/sqrt(number sampls)

Colin =^.^=
 
C

colin

Willy Denoyette said:
colin said:
Willy Denoyette said:
"Willy Denoyette [MVP]" <[email protected]> wrote in message
// C function that uses an intrinsic to retrieve the Processor's cycle
counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.

thanks that worked, however it stil takes 9000 clock cycles
wich I think is roughly about the same time ...
is that PInvoke realy so time consuming ?
what does it have todo ?

thanks
Colin =^.^=
What exactly takes 9000 cycles?

int cnt=100000;
start = timer.Value;
for (int x = 0; x < cnt; x++)
{
ulong cycles = timer.value
}
timerOverhead = (timer.Value - start)/cnt;

timer.value -> get{return rdtsc();}

timerOverhead comes out as about 9000;

Colin =^.^=
 
W

Willy Denoyette [MVP]

colin said:
Willy Denoyette said:
colin said:
// C function that uses an intrinsic to retrieve the Processor's cycle
counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.

thanks that worked, however it stil takes 9000 clock cycles
wich I think is roughly about the same time ...
is that PInvoke realy so time consuming ?
what does it have todo ?

thanks
Colin =^.^=
What exactly takes 9000 cycles?

int cnt=100000;
start = timer.Value;
for (int x = 0; x < cnt; x++)
{
ulong cycles = timer.value
}
timerOverhead = (timer.Value - start)/cnt;

timer.value -> get{return rdtsc();}

timerOverhead comes out as about 9000;

Colin =^.^=

Following code:

[DllImport("rtc"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

static void Main()

{
int cnt = 100000;
ulong cycles = 0;
ulong start = rdtsc();
for (int x = 0; x < cnt; x++)
{
cycles = rdtsc();
}
ulong timerOverhead = (cycles - start) / (ulong)cnt;
Console.WriteLine(timerOverhead);
}
}

comes out as 13 on my box.
That means 13 cycles per iteration, the total number of instructions
executed is ~35, from which ~30 is Interop call overhead, the remaining in
the increment and compare of cnt.
The called function itself is only two instructions:

10001000 0f31 rdtsc
10001002 c3 ret

Seems like your processor does not support the rdtsc feature, what CPU do
you run this on?

You can call following C function before you are trying to use the Time
Stamp Counter, this function returns true if TSC is supported else false....

extern "C"
__declspec(dllexport) bool __stdcall tsc()
{
bool tscFeature = false;
int CPUInfo[4] = {-1};
__cpuid(CPUInfo, 0x80000001);
int nFeatureInfo = CPUInfo[3];
// check tsc feature bit (bit 27) in CPUInfo[3]
if (nFeatureInfo & 0x4000000)
tscFeature = true;
return tscFeature;
}


[DllImport("yourdll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();


Willy.
 
C

colin

Willy Denoyette said:
colin said:
Willy Denoyette said:
// C function that uses an intrinsic to retrieve the Processor's cycle
counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.

thanks that worked, however it stil takes 9000 clock cycles
wich I think is roughly about the same time ...
is that PInvoke realy so time consuming ?
what does it have todo ?

thanks
Colin =^.^=

What exactly takes 9000 cycles?

int cnt=100000;
start = timer.Value;
for (int x = 0; x < cnt; x++)
{
ulong cycles = timer.value
}
timerOverhead = (timer.Value - start)/cnt;

timer.value -> get{return rdtsc();}

timerOverhead comes out as about 9000;

Colin =^.^=

Following code:

[DllImport("rtc"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

static void Main()

{
int cnt = 100000;
ulong cycles = 0;
ulong start = rdtsc();
for (int x = 0; x < cnt; x++)
{
cycles = rdtsc();
}
ulong timerOverhead = (cycles - start) / (ulong)cnt;
Console.WriteLine(timerOverhead);
}
}

comes out as 13 on my box.
That means 13 cycles per iteration, the total number of instructions
executed is ~35, from which ~30 is Interop call overhead, the remaining in
the increment and compare of cnt.
The called function itself is only two instructions:

10001000 0f31 rdtsc
10001002 c3 ret

Seems like your processor does not support the rdtsc feature, what CPU do
you run this on?

You can call following C function before you are trying to use the Time
Stamp Counter, this function returns true if TSC is supported else
false....

extern "C"
__declspec(dllexport) bool __stdcall tsc()
{
bool tscFeature = false;
int CPUInfo[4] = {-1};
__cpuid(CPUInfo, 0x80000001);
int nFeatureInfo = CPUInfo[3];
// check tsc feature bit (bit 27) in CPUInfo[3]
if (nFeatureInfo & 0x4000000)
tscFeature = true;
return tscFeature;
}


[DllImport("yourdll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

thanks for trying it on your machie :D

hmm seems running from the ide debugger is making it incredibly slow
I just made completly new files and new c# console project,
managed to get 12 too by using ctrl-f5,
but when I run with just f5 its 9000 !

is there a way to avoid the overhead the debugger being attached seems to
impose ?

I only use the timer in debug mode to do profileing anyway,
but I gues I cld detect if the debuger is not atacched and dump it to a file
....

thanks
Colin =^.^=
 
C

colin

colin said:
thanks for trying it on your machie :D

hmm seems running from the ide debugger is making it incredibly slow
I just made completly new files and new c# console project,
managed to get 12 too by using ctrl-f5,
but when I run with just f5 its 9000 !

is there a way to avoid the overhead the debugger being attached seems to
impose ?

I only use the timer in debug mode to do profileing anyway,
but I gues I cld detect if the debuger is not atacched and dump it to a
file ...

I also went back and tried the

Stopwatch.GetTimestamp()

function amd found this is also much faster with no debugger attached,
the overhead is only 3 ticks instead of 13,
incidently it goes upto 39 ticks if optimization is off too.

3 ticks is still 1us and the rdtsc is much much faster,
but I could of lived with 3 ticks, it would of added 100s
to the run time instead of over 600.

thanks again
Colin =^.^=
 
W

Willy Denoyette [MVP]

colin said:
Willy Denoyette said:
colin said:
// C function that uses an intrinsic to retrieve the Processor's
cycle counter register
// Beware your CPU may not support this!
// Compile from the command line: cl /EHsc /O2 <thisfile.cpp>
#include <intrin.h>
#pragma intrinsic(__rdtsc)
extern "C"
__declspec(dllexport) unsigned __int64 __stdcall rdtsc()
{
return __rdtsc();
}


// C#
[DllImport("thisfile.dll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

...
ulong cycles = rdtsc();

Willy.

thanks that worked, however it stil takes 9000 clock cycles
wich I think is roughly about the same time ...
is that PInvoke realy so time consuming ?
what does it have todo ?

thanks
Colin =^.^=

What exactly takes 9000 cycles?

int cnt=100000;
start = timer.Value;
for (int x = 0; x < cnt; x++)
{
ulong cycles = timer.value
}
timerOverhead = (timer.Value - start)/cnt;

timer.value -> get{return rdtsc();}

timerOverhead comes out as about 9000;

Colin =^.^=

Following code:

[DllImport("rtc"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

static void Main()

{
int cnt = 100000;
ulong cycles = 0;
ulong start = rdtsc();
for (int x = 0; x < cnt; x++)
{
cycles = rdtsc();
}
ulong timerOverhead = (cycles - start) / (ulong)cnt;
Console.WriteLine(timerOverhead);
}
}

comes out as 13 on my box.
That means 13 cycles per iteration, the total number of instructions
executed is ~35, from which ~30 is Interop call overhead, the remaining
in the increment and compare of cnt.
The called function itself is only two instructions:

10001000 0f31 rdtsc
10001002 c3 ret

Seems like your processor does not support the rdtsc feature, what CPU do
you run this on?

You can call following C function before you are trying to use the Time
Stamp Counter, this function returns true if TSC is supported else
false....

extern "C"
__declspec(dllexport) bool __stdcall tsc()
{
bool tscFeature = false;
int CPUInfo[4] = {-1};
__cpuid(CPUInfo, 0x80000001);
int nFeatureInfo = CPUInfo[3];
// check tsc feature bit (bit 27) in CPUInfo[3]
if (nFeatureInfo & 0x4000000)
tscFeature = true;
return tscFeature;
}


[DllImport("yourdll"),SuppressUnmanagedCodeSecurity]
static extern ulong rdtsc();

thanks for trying it on your machie :D

hmm seems running from the ide debugger is making it incredibly slow
I just made completly new files and new c# console project,
managed to get 12 too by using ctrl-f5,
but when I run with just f5 its 9000 !

Let me see, you are running a debug version in the IDE debugger and it
returns 9000, right? Well, It will be slower but not that slow, there is
clearly something wrong (this 9000 seems like a magic number, weird) .

is there a way to avoid the overhead the debugger being attached seems to
impose ?

No there is no posibility (other than running outside the debugger) to avoid
the overhead imposed by the fact you are 1) running non optimized code 2)
inside the IDE debugger.
I only use the timer in debug mode to do profileing anyway,
but I gues I cld detect if the debuger is not atacched and dump it to a
file

You should not instrument debug builds, but if you do, you should accept the
overhead imposed by the environment (debugger, non optimized JIT code).

Willy.
 
C

colin

Willy Denoyette said:
Let me see, you are running a debug version in the IDE debugger and it
returns 9000, right? Well, It will be slower but not that slow, there is
clearly something wrong (this 9000 seems like a magic number, weird) .

yeah strange, maybe some who knows the internals can shed some light on
this?
No there is no posibility (other than running outside the debugger) to
avoid the overhead imposed by the fact you are 1) running non optimized
code 2) inside the IDE debugger.

without actually knowing where the code is spending its time during
those 9000 clock cyclces its hard to know whats cuasing it.
You should not instrument debug builds, but if you do, you should accept
the overhead imposed by the environment (debugger, non optimized JIT
code).

yes good point, however I wanted to make sure the realease build was clean.
Theres little difference in the run time beteween debug and release.

I was only looking for the most time consuming parts of the code
I didnt expect a 20 times increase in time just becuase of this timer
tho lol, the increase occurs in the release or debug version,
purly determined by the debugegr being attched or not.

thanks very much for your help finaly got a workable solution anyway :)
Colin =^.^=
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top