Skybuck's RAM Test version 0.07 (For CUDA and CPU) now available !

Skybuck Flying

(To run this test successfully you will probably need an NVIDIA CUDA-enabled
graphics card, probably with compute capability 2.0 as a minimum.)

Hello,

Skybuck's RAM Test version 0.07 is now available at the following link,
either as a WinRAR archive or as loose files (3):

File:

http://www.skybuck.org/CUDA/RAMTest/SkybuckRAMTestV007b.rar

Folder:

http://www.skybuck.org/CUDA/RAMTest/

What the test does is the following:

It creates 20,000 blocks. Each block has 8,000 elements. Each element is a
32-bit integer (4 bytes).

Each block has one execution thread.

The execution thread "travels" through the elements in a random fashion.
(RAM read test).

It stores/writes the last element it processed into BlockResult[BlockIndex]
to verify that it actually did any processing at all.

This test is performed on GPU and CPU. (On the CPU only one thread/core is
used for now, perhaps a future test will include multi-threading).

The timing and performance results are then displayed at the bottom.

The GT 520 GPU and a single thread of the AMD X2 3800+ dual-core CPU
performed as follows:

Kernel execution time in seconds: 25.0870683593750000
CPU execution time in seconds : 11.8696194628088207

Cuda memory transactions per second: 63777878.5898704829000000
CPU memory transactions per second : 134797918.7549603890000000

Conclusion: a single CPU thread is roughly twice as fast as the GPU.

Note: this test requires 611 megabytes (640,000,000 bytes) to be
free/available on both CPU and GPU.

I would be very much interested in how this test performs on your
system/cpu/gpu.

So if you do run this test on your system, please post the results below
(just the 4 lines as above is enough/fine).

(Also additional info about system would be nice too but is not required ;))

You can also e-mail results to:

(e-mail address removed)

Bye,
Skybuck.
 

Skybuck Flying

I did some more tests with different settings after seeing the depressing
random-access results for CUDA, probably caused by register dependencies
and such.

These graphics cards are supposed to be good at linear/vector-like access,
so I tested that somewhat as well.

If the number of elements is just one, CUDA performs extremely well: 10
times as fast as the CPU.

If the number of elements is 10, CUDA still performs roughly 5 times as
fast as the CPU.

So there is still some hope in me that CUDA will be useful for, for
example, video codecs.

I was hoping to use CUDA for something else, but I guess that will have to
go back into the freezer for now.

Or I could give OpenCL a try and see if somebody's ATI card does better,
but OpenCL seems somewhat boring, and there is very little information
about the instruction set used by OpenCL programs.

So perhaps I should spend some time giving my lossless video codec another
try, this time using CUDA to see if it can achieve faster performance and
perhaps even higher resolution, which would be nice.

It needs to be at least twice as fast for somewhat decent frame rates at
normal resolution, and double resolution means 4 times as many pixels...
so in total it needs to be 8 times as fast.

Seeing a speed up of 10 is nice.

However, a parallel algorithm might also require some rounds... but the
test settings already included that somewhat: the loop count was 10. I just
did another test with 100 loops, and CUDA is still 3 times faster than the
CPU.

Time for a more serious test. I set elements to 1, which corresponds to
32-bit colors. I set blocks to 1920x1200, and I set loops to 22 (for a
parallel scan simulation) times 60 (for the video frequency).

I won't reveal the numbers, lol. But I can tell you: the GPU is 40 times as
fast as the CPU ! LOL.

That puts big smile on my face ! ;) =D

CUDA just made my day real happy ! ;) =D

Sigh... so CUDA should be perfectly suited for writing video codecs as long
as the video codecs do their work as sequentially as possible ;)

Bye,
Skybuck.
 

Skybuck Flying

The number of blocks doesn't really matter.

I tested with 2,000 and it gives the same performance results; the test
just finishes sooner, since the rate is, after all, divided by the seconds
taken.

The higher block numbers were just to test if it might help.

Anyway, I have managed to find a little optimization trick via PTX.

It's quite significant too.

By adding a "cop" (cache operation) specifier, CUDA can be made to run
faster:

The following instruction was changed from:

ld.global.s32 %r34, [%r38+0];

to:

ld.global.cg.s32 %r34, [%r38+0];

This seems to give 50% more performance for random memory access with CUDA !

However, care should be taken and more tests should be done to be sure...
maybe it only holds for this particular situation, but the difference is so
big that there is probably something to it ! ;)

Surprisingly, the .cs cache operation, which is what I tried first, did not
give more performance.

I still have others to try, but this is already pretty spectacular ! ;)

Since everything else I tried with code adjustments didn't help ! ;)

So there is still hope yet to squeeze some more performance out of it ! ;)
=D

The CPU is still faster by a large margin though ! ;)

Bye,
Skybuck.
 

Skybuck Flying

The following technique works more or less the same way at the source level:

The following parameter is altered

from:

int *Memory,

to:

volatile int *Memory,

This produces the instruction:

ld.volatile.global.s32 %r34, [%r38+0];

I also tried adding .cg after .global, but that combination is not allowed;
it would be contradictory, since volatile already indicates that no cache
operations may be applied.

This gives the same 50% performance increase which is very nice ! ;)

Bye,
Skybuck.
 

Skybuck Flying

Interesting news in short: the GPU cache is 4 times faster than the CPU
cache ! ;) =D

(Version 0.10 which still uses GPU ram instead of GPU cache also available)

(Version 0.12 is the gpu cache version but still unreleased ;) =D)

Ok, the shared memory kernel is done... it also executes 4000 blocks but
this time sequentially...

These test results made my jaw drop ! LOL... which offers
possibilities/hope for CUDA:

Just a single cuda thread did this:

http://www.skybuck.org/CUDA/RAMTest/version 0.12/SharedMemoryTest.png

Text:

"
Test Cuda Random Memory Access Performance.
version 0.12 created on 21 july 2011 by Skybuck Flying.
program started.
Device[0].Name: GeForce GT 520
Device[0].MemorySize: 1008402432
Device[0].MemoryClockFrequency: 600000000
Device[0].GlobalMemoryBusWidthInBits: 64
Device[0].Level2CacheSize: 65536
Device[0].MultiProcessorCount: 1
Device[0].ClockFrequency: 1620000000
Device[0].MaxWarpSize: 32
Setup...
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize...
LoadModule...
OpenEvents...
OpenStream...
SetupKernel...
mKernel.Parameters.CalculateOptimalDimensions successfull.
mKernel.Parameters.ComputeCapability: 2.1
mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
mKernel.Parameters.OptimalThreadsPerBlock: 256
mKernel.Parameters.OptimalWarpsPerBlock: 6
mKernel.Parameters.ThreadWidth: 256
mKernel.Parameters.ThreadHeight: 1
mKernel.Parameters.ThreadDepth: 1
mKernel.Parameters.BlockWidth: 16
mKernel.Parameters.BlockHeight: 1
mKernel.Parameters.BlockDepth: 1
ExecuteKernel...
ReadBackResults...
DisplayResults...
CloseStream...
CloseEvents...
UnloadModule...
ExecuteCPU...
Kernel execution time in seconds: 0.3385913085937500
CPU execution time in seconds : 1.4263124922301578
Cuda memory transactions per second: 945092186.0015719590000000
CPU memory transactions per second : 224354762.1879504710000000
program finished.
"

Conclusion: shared memory is HELL/SUPER FAST !

Almost 4 times faster than the CPU ?!?!

I am gonna do a little debug test with VS 2010, because this is almost
unbelievable ! LOL. But I believe it... geez ?! Cool.

Though the GPU L1 cache is probably smaller than the CPU L1 cache, which
could explain its higher speed.

For real purposes I might require an even larger cache, and then maybe the
results will be different... but for now it's hopeful.

Bye,
Skybuck.
 

Skybuck Flying

In reality this probably means the GPU is twice as fast as a dual core,
since the dual core will probably also be about twice as fast as a single
core.

So if a quad-core processor faced a GT 520, they would be about the same
speed, would be my estimate, unless newer CPUs have even faster caches ;)

Bye,
Skybuck.
 

Skybuck Flying

Whoops, there was something wrong with the kernel and also with the kernel
launch parameters.

The kernel was doing only 1 block, and the launch parameters specified 4000
threads.

Now the situation has been corrected.

The kernel is doing 4000 blocks and only 1 thread.

It turns out it's fricking slow !


Test Cuda Random Memory Access Performance.
version 0.12 created on 21 july 2011 by Skybuck Flying.
program started.
Device[0].Name: GeForce GT 520
Device[0].MemorySize: 1008402432
Device[0].MemoryClockFrequency: 600000000
Device[0].GlobalMemoryBusWidthInBits: 64
Device[0].Level2CacheSize: 65536
Device[0].SharedMemoryPerMultiProcessor: 49152
Device[0].RegistersPerMultiProcessor: 32768
Device[0].ConstantMemory: 65536
Device[0].MultiProcessorCount: 1
Device[0].ClockFrequency: 1620000000
Device[0].MaxWarpSize: 32
Setup...
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize...
LoadModule...
OpenEvents...
OpenStream...
SetupKernel...
mKernel.Parameters.CalculateOptimalDimensions successfull.
mKernel.Parameters.ComputeCapability: 2.1
mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
mKernel.Parameters.OptimalThreadsPerBlock: 256
mKernel.Parameters.OptimalWarpsPerBlock: 6
mKernel.Parameters.ThreadWidth: 1
mKernel.Parameters.ThreadHeight: 1
mKernel.Parameters.ThreadDepth: 1
mKernel.Parameters.BlockWidth: 1
mKernel.Parameters.BlockHeight: 1
mKernel.Parameters.BlockDepth: 1
ExecuteKernel...
ReadBackResults...
DisplayResults...
CloseStream...
CloseEvents...
UnloadModule...
ExecuteCPU...
Kernel execution time in seconds: 24.2583750000000000
CPU execution time in seconds : 1.4263193366754714
Cuda memory transactions per second: 13191320.5233244183900000
CPU memory transactions per second : 224353685.5819891260000000
program finished.

(Picture already updated above).
 

Skybuck Flying

Just for the record,

I also wrote a CPU test which can run on any x86 CPU.

For the AMD X2 3800+, the results for a single core were as follows:

Test CPU Random Memory Access Performance.
version 0.01 created on 1 august 2011 by Skybuck Flying.
program started.
Setup...
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize...
ExecuteCPU...
CPU execution time in seconds : 0.7778037432131737
CPU memory transactions per second : 411414836.7016757590000000
program finished.

So that's:

411,414,836 random 32-bit integers per second (mostly from CPU cache).

This higher number is due to optimized code (no slow dynamic indexing and
no slow get-element routine / call overhead).

Bye,
Skybuck.
 
