NV40 getting over 7000M pixels/sec (over 7G pixels!)

joe smith

What you have been describing so far seems to assume a pretty complete
GPU hardware implementation of a still-evolving DX9+ standard. Neither
NVidia nor ATI have complete hardware implementations. That situation

Everything else is still "today" except the 3.0 specification shaders, and
yup, it -is- required to leverage different generations of graphics chips. 2.0
is the entry level to 'real' shader programming; 1.0 was just the 'practice
round' that did not offer THAT MUCH functionality over the fixed pipe (except
for the vertex programs, and even there in a somewhat limited manner..
marginally useful for special effects and that sort of thing.. special cases).
With 2.0 and upwards we could begin writing full unified rendering pipes
through on-demand shader compilation, which is pretty easy with the HLSL
runtime compiler in D3DX...

There is still a lot to be desired even with the 3.0 specs, but it's a nice
foundation / entry level for a few years from now. All the same, a properly
leveraged *GPU* doesn't need much spoon-feeding from the CPU at all. The CPU
just has to say when to do what; the GPU will handle the rest.. so to speak.
The case you mention, that older hardware must be catered to, is a source of
much frustration, but no pain no gain... like I ranted earlier, it's easier to
go with the lowest common denominator and screw the scalability; in a
commercial environment Good Enough is often the most economical and therefore
feasible option. If I were doing stuff for my own amusement I could afford all
kinds of crap I cannot do in day-to-day work. Etc.. but to cut the story
short: the CPU isn't a bottleneck for the GPU by default.. it is a bottleneck
due to conscious (or non-conscious :) design choices.
 
joe smith

Read: if we talk about DX9-and-beyond level hardware, the CPU is less and
less of a bottleneck. The further back down the road we go toward DX3, the
more CPU-bound applications become as far as commanding the graphics
processor is concerned.

The ExecuteBuffers were actually a pretty neat-o idea originally: the driver
could dynamically translate the command stream to whatever internal
representation was most convenient and resident in the graphics
processor(s). They were just a plain ****ing annoyance for developers to use,
which was the primary reason a reasonable enough practice was rejected.
Programmable vertex and fragment processor arrays are just another way to do
it.. a way that is efficient enough now that the overhead of this kind of
abstraction is amortized by Moore's Law (how convenient) and very pragmatic
for the developer as well. I don't know ANYONE who wouldn't have liked shaders
once they really got to use them (not just glanced casually and decided they
don't like them because they are different from the way they are used to
doing things.. =)

The direction we seem to be heading in is that the data is resident in GPU
local memory and that we write more, and more flexible, programs to transform
the data from point to point inside the graphics processor.. every new
generation of GPUs is more flexible and homogeneous than the previous one
(i.e. things are 'clicking' into their places..); it would have been neat if
everyone involved in the source end of the business had gotten things 'right'
the first time around, but I think it was transistor-budget related more than
anything initially.. glad they got the process started anyway. Now if only we
could still resolve the age-old problems of z-buffer/triangle rendering, such
as how to best handle translucency globally, and we would be on our way to
better/faster graphics still. But looking just 5 years back, hell, a long way
has been covered already. Fun stuff!

Etc... I'm ranting stupidly about random stuff, I'll stop if this is too
much bullshit and incorrect information. ;-)
 
John Lewis

The direction we seem to be heading is that the data is resident in the GPU
local memory

Things are coming full circle. Remember, 3dfx made the point that
the graphics data should be resident in large memory on the GPU board,
while nVidia in the early days said... no... no... you just need a faster
bus. It was pretty obvious at the time which party was technically
correct, taking a clear message from the Commodore Amiga architecture
-- offload as many tasks from the CPU as possible by having very smart
peripherals with DMA channels to memory. Of course, Intel deliberately
and successfully pulled the PC in the opposite direction to maximise sales
of expensive CPUs. Until 3dfx and Creative came along and began to
break down the barriers with their "intelligent" peripherals.

John Lewis
 
FLY135

"joe smith"
<john.smith@iiuaudhahsyasdy232462643264276asdhfvhdsafhasdgdsagyufasgyufdashu
fdashuyfhuysafhuysafhuydh27324242742647623762667bhfbdsahbvfahds.net> wrote
in message news:[email protected]...
First, the CPU is a limpdick only on non-graphics related stuff, if the GPU
code is properly done, so why mention the CPU at all...? It was in the context
of the CPU *feeding* the GPU; feeding takes minimal CPU if it is done 'right',
so a 286 would do for that. If the CPU is burned on AI and other tasks, that
is not related to the GPU and is irrelevant, etc.

I think you need to consult a know-how-the-GPU-is-programmed specialist;
for instance, I am available for consulting. Ask if there is anything
unclear.

Since you offered, I've got a question.... I'd like to map a decoded video
stream onto a 3D polygon so that I can scale, rotate, skew, etc. the picture
in realtime. I'd also like to have more than one stream at a time.
Ignoring the CPU video decoding aspect, what are the ramifications of mapping
the video to polygons? The video will be 720x480 and natively in YUV
format. I figure I will probably need to convert the 16-bit YUV to 24-bit RGB
and put it into some sort of a Direct3D texture object. 720x480 is 345,600
pixels, and if I need to pack that into 32-bit words (not sure how D3D
wants it) then that will be 1,382,400 bytes per video frame (30Hz) times the
number of active video streams.

Does this sound like something that's doable in your experience? I
installed the DirectX SDK and they had a sample app mapping video onto a 3D
shape, but it was only one video stream and pretty low-res.
 
joe smith

Since you offered, I've got a question.... I'd like to map a decoded video
stream onto a 3D polygon so that I can scale, rotate, skew, etc. the picture
in realtime. I'd also like to have more than one stream at a time.

The scale, rotate, skew, etc. part can be done by changing the texture
coordinates at the vertices, so that is trivial.

Ignoring the CPU video decoding aspect, what are the ramifications of mapping
the video to polygons? The video will be 720x480 and natively in YUV
format. I figure I will probably need to convert the 16-bit YUV to 24-bit RGB

If your hardware does not support sampling the same YUV format your video
stream is in, you have to do the conversion. The most pragmatic way to do the
conversion is to write a fragment program to do it for you; this requires the
least CPU overhead for that part. If your hardware does support the format,
then using it directly is all you need to do.

Actually, depending on the hardware you have installed and the video stream
you are playing back, the level of acceleration varies. If you are not
comfortable using the DX9 interfaces for handling the video you can use VfW,
but that is less economical. The DX9 interfaces leverage the hardware
acceleration better than most other ways, such as decoding with the CPU; even
if you choose to decode with the CPU you can still do the YUV->RGB conversion
with a fragment program (assuming you have at least DX8-level hardware).

and put it into some sort of a Direct3D texture object. 720x480 is 345,600
pixels, and if I need to pack that into 32-bit words (not sure how D3D
wants it) then that will be 1,382,400 bytes per video frame (30Hz) times the
number of active video streams.

D3D doesn't want it in any specific format; the hardware might, though.. you
must check your hardware caps for which texel formats are supported with your
current video mode and choose one of those.

Does this sound like something that's doable in your experience? I
installed the DirectX SDK and they had a sample app mapping video onto a 3D
shape, but it was only one video stream and pretty low-res.

It is doable, but don't be disappointed if the framerate is not what you might
want, depending on the hardware and therefore the rendering path that is
taken. The optimal situation for a GPU behind a low-bandwidth, high-latency
bus is when the data is in GPU local memory and you can write vertex and
fragment programs to synthesize further data and cache it on surfaces in GPU
local memory.

Streaming video is off-context to my performance advice: it requires
transferring the texels from CPU local memory to GPU local memory, which is
precisely the kind of thing that is to be avoided. But in this case there is
no way around it; you can compute the bandwidth requirement yourself.. 30 x
sizeof(unsigned int) x 720 x 480, which is roughly 40 MB/s per stream -- a
piece of cake for the AGP 4-8x bus. Whatever CPU hit there might be depends on
the CPU, the format of the stream and the decoder you are using; I can't
comment on that as it could be anything.

I'm doing 512x512 DivX streaming to texture in YUV format and doing the
YUV->RGB conversion with a fragment program when sampling from the video
stream. If you don't want to specialize all your fixed-pipe or fragment
programs, you can do the conversion in a separate pass to a different surface
and sample from there; that requires the minimum number of changes to the rest
of the code you might have there. I compile the shaders dynamically on demand
with a realtime "JIT" shader compiling system, so I don't have to worry about
details like that. YMMV.

Maybe the Microsoft XNA toolchain has something you might use; check the
GDC'04-related announcements.
 
joe smith

The scale, rotate, skew, etc. part can be done by changing the texture
coordinates at the vertices, so that is trivial.

Or you can of course change the transformation matrices or move the
vertices, whichever effect it is that you want. For the record, there are no
"polygon" primitives in D3D, so strictly speaking what you are asking for is
impossible, but you can get the same effect with two triangles. The primitives
are not 3D either; mathematically they are built from vertices in [x y z w]
format, but I get the general idea of what you are after. :)
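
If it helps, a minimal sketch of a vertex program that positions such a pair
of triangles could look like the one below; the constant name
"world_view_proj", the entry-point name and the output layout are just
assumptions for illustration, and the application is expected to set the
combined world-view-projection matrix itself:

-- begin --

// minimal sketch: position the two triangles of the video quad;
// the scale / rotate / skew all comes from the matrix set by the application

uniform float4x4 world_view_proj;

struct vs_output
{
    float4 position : POSITION;
    float2 texcoord : TEXCOORD0;
};

vs_output vsmain(float4 position : POSITION, float2 texcoord : TEXCOORD0)
{
    vs_output o;

    o.position = mul(position, world_view_proj); // [x y z w] times 4x4 matrix
    o.texcoord = texcoord;                       // passed through to the sampler
    return o;
}

-- end --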
 
joe smith

Does anybody know where the toilets are?

No, sorry, you just have to shit your pants.
 
FLY135

"joe smith"
<john.smith@iiuaudhahsyasdy232462643264276asdhfvhdsafhasdgdsagyufasgyufdashu
fdashuyfhuysafhuysafhuydh27324242742647623762667bhfbdsahbvfahds.net> wrote
in message
Streaming video is off-context to my performance advice: it requires
transferring the texels from CPU local memory to GPU local memory, which is
precisely the kind of thing that is to be avoided. But in this case there is
no way around it; you can compute the bandwidth requirement yourself.. 30 x
sizeof(unsigned int) x 720 x 480, which is roughly 40 MB/s per stream -- a
piece of cake for the AGP 4-8x bus. Whatever CPU hit there might be depends on
the CPU, the format of the stream and the decoder you are using; I can't
comment on that as it could be anything.

Thanks, my major concern was the speed of updating the textures every video
frame. I can specify whatever video card is necessary, but I was hoping that
there would be reasonably priced (< $200) cards that would fit the bill.

I'm doing 512x512 DivX streaming to texture in YUV format and doing the
YUV->RGB conversion with a fragment program when sampling from the video
stream.

Do I understand correctly that "fragment" programs are code that is run by
the GPU? If so, what is the (scripting?) language called? I'm an
experienced programmer, but haven't done any Direct3D.
 
joe smith

Do I understand correctly that "fragment" programs are code that is run by
the GPU? If so, what is the (scripting?) language called? I'm an
experienced programmer, but haven't done any Direct3D.

Yes. They also go by the name "pixel shader", which is the D3D terminology
for the concept. Fragment is more appropriate since a pixel can be composed
of multiple fragments, for example when doing multi-sample antialiasing.. I
tend to use the term for both OpenGL and Direct3D interchangeably, my bad.

It's not a scripting language; it's bytecode that the GPU (caps permitting)
can execute. Direct3D acts as a layer between the application and the
hardware/driver combo and is the entry point to the runtime mechanism which
translates the bytecode into a format the GPU can understand; the code that
is executed on the GPU does not necessarily have to resemble the original at
all.. it's like JIT compilation in Java, just applied in the context of
graphics processors. It's actually pretty good stuff, check it out asap.

HLSL and Cg are nice languages for developing fragment and vertex programs;
they take care of register allocation and other arcane tasks just like C,
C++, Pascal and other higher-level languages did for CPUs. Here's a snippet
of a fragment program to demonstrate the general workflow:

-- begin --

struct fragment
{
    float4 color : COLOR0;
};

// this is semantically the same as having a global variable in C / C++;
// a "sampler2D" object is just an abstraction of a 2D texture surface..

uniform sampler2D mytexture;

// this function can be called anything; you determine the entry-point's name
// when compiling, unlike in C where the entry point is always called "main()"

fragment psmain(const vertex v)
{
    fragment f;

    float4 sample = tex2D(mytexture, v.texcoord);

    f.color = sample * v.diffuse + v.specular;
    return f;
}

-- end --

Voilà: the texture is modulated with an ARGB gradient called "diffuse", and
an ARGB gradient called "specular" is added to the modulated color. It's a
pretty powerful way to express more complex per-pixel arithmetic; this was a
trivial example, but it doesn't get much more complicated than that. I omitted
the vertex struct, but it is basically defined in the same way the fragment
is.. you can define it yourself however you want. The runtimes will then
"connect" the streams so that the fragment (and vertex) programs get the
right data from the right place (which struct member is which is determined
with "semantics"; the documentation is more verbose regarding that).
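
For illustration, the omitted vertex struct could look something like the
sketch below; the member names just match the fragment program above, and the
semantics shown are only one possible choice:

-- begin --

// one possible input struct for the fragment program above; the semantics
// tell the runtime which interpolated values to route into which member
struct vertex
{
    float2 texcoord : TEXCOORD0;
    float4 diffuse  : COLOR0;
    float4 specular : COLOR1;
};

-- end --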

For instance, if in the example above I did not want to modulate the alpha, I
would rewrite the arithmetic:

f.color.rgb = sample.rgb * v.diffuse.rgb + v.specular.rgb;
f.color.a = sample.a;

... just an example, but it shows that a lot of things are possible. The
".rgb" after a variable is called "swizzling"; it is in effect a runtime
permutation of the input vector which is virtually free in GPU fragment
programs. The permutation above just selects the first three components in
their "default" order, so the arithmetic is only done with the first three
components of the float4 vector, and likewise the assignment only writes to
the first three components.
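
The permutation does not have to be in the default order either; a couple of
quick examples (the variable names are just placeholders):

float4 swapped = sample.bgra; // reorder: swap red and blue, keep alpha
float3 triple  = sample.rrr;  // replicate a single component three times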

This is starting to look like an HLSL tutorial.. but that's pretty much the
whole deal; now you have the boost you need to go and write whatever you
want. The YUV->RGB conversion will be quite smooth: think of it as doing a few
dot products and accumulating (mac).. very cheap. If you need the
clamping/saturation, as you well might in that particular piece of code, there
are saturate() and clamp() in HLSL as well, and they are fast, so no fear in
using them (of course every added instruction makes the program slower, but
that is a relative concept... :)
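
To make that concrete, here is a minimal sketch of what such a conversion
could look like as a fragment program, assuming BT.601-style coefficients and
that the decoded frame has been uploaded so the sampler returns Y, U and V in
the .x, .y and .z components; the sampler name, entry-point name and exact
packing are just assumptions for illustration:

-- begin --

// minimal sketch: YUV -> RGB as a matrix multiply, i.e. three dot products.
// assumes the sampler hands back Y in .x, U in .y, V in .z

uniform sampler2D videoframe;

float4 yuv2rgb(float2 texcoord : TEXCOORD0) : COLOR0
{
    float3 yuv = tex2D(videoframe, texcoord).xyz;
    yuv -= float3(0.0, 0.5, 0.5);       // re-center the chroma components

    // approximate BT.601 conversion; each output channel is one dot product
    const float3x3 m =
    {
        1.0,  0.000,  1.402,
        1.0, -0.344, -0.714,
        1.0,  1.772,  0.000
    };

    return float4(saturate(mul(m, yuv)), 1.0);
}

-- end --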

Good luck!
 
