Nvidia plays the meltdown blame game


NV55

Comment: Story doesn't mesh with reality

By Charlie Demerjian: Monday, 07 July 2008, 4:32 PM


NVIDIA'S STOCK TOOK a long overdue beating the other day, more because
Wall Street is collectively horrified that it has been lied to than
any fundamentals that are public. That said, the 8K keeps up the
firm's tradition of honesty and integrity.

The root of the problem is, so far, HP notebooks, but likely others.
You can see the HP page here, and at least one lawsuit about the same
thing here. No mention of this in the Nvidia statement though. Why
would they? If you look at what Nvidia says, it isn't their fault, it
is those damn suppliers.

The official line is: "While we have not been able to determine a root
cause for these failures, testing suggests a weak material set of die/
package combination, system thermal management designs, and customer
use patterns are contributing factors". Parsing that, you see that
they are blaming fabs and packaging suppliers first, OEMs second, and
those damn users third, but they have no fault here, NV can do no
wrong.

This is really dangerous for three reasons: they are annoying
suppliers, annoying OEMs and annoying users. Last we checked, they
need all three to remain in business.

The weak die/packaging excuse doesn't wash at all. Nvidia is blaming
TSMC behind the scenes, trashing them pretty hard through 'unofficial'
channels to deflect blame. They are likely to be doing the same to
packaging suppliers as well, and others. The reason this doesn't wash
is that there are only a handful of suppliers in each of these fields.

If they had a problem with Nvidia, there would be problems with other
companies. ATI, Altera and dozens of others, would have chips crapping
out left and right, especially designs where they are meant to run
24/7 like embedded parts. You would see an industry rife with failures
and warnings like the bad caps problem of a few years ago.

You simply aren't seeing that. Period. No warning from others, no
recalls, no TSMC warnings, no nothing. This is a sham to deflect blame
from Nvidia, they don't want to dent their shiny image, much less slow
down the 'can of whoop-ass' opening. I am calling bullshit on the
supplier-blaming problem.

Suppliers are a problem for Nvidia though, at least they are now.
Trashing your suppliers like this is a dangerous thing to do, Nvidia
needs them more than they need Nvidia. Can you imagine the scene at
the next TSMC planning meeting where they are discussing who gets what
allocation on the next tight process, and how much they pay?

TSMC Planner 1: How many wafers do we allocate for Nvidia a month?
TSMC Planner 2: The 40nm process is looking tight at first, do you
agree?
TSMC Planner 1: Yeah, really tight.
TSMC Planner 2: Remember that time when NV was calling us [male
rooster euphemism][oral suction euphemism]s to anyone who would
listen? Wasn't that a fun time.
TSMC Planner 1: So 4 then?
TSMC Planner 2: 4K? That seems high.
TSMC Planner 1: No, 4.

Blaming your suppliers publicly is bad. When it isn't their fault, it
is worse. Doing so in the sleazy backhanded ways that Nvidia knows so
well is tantamount to corporate suicide. Suppliers will find a way to
make you pay, and they will get the knife in somehow. Nvidia being
bossy and arrogant only makes the situation more enjoyable for them.
Look for this PR blunder to have massive long-term effects that
manifest themselves in dropped margins, critical parts shortages, and
missed deadlines. Bad move #1.

Bad move #2 is blaming the OEMs, this is done with the subtle phrase
"system thermal management designs" in the 8K. This is engineering
code for, "we didn't do anything wrong, those nitwits at HP did". It
works like this, Nvidia makes a part and it has a variety of
constraints it is meant to be used within. Things like power draw,
minimum and maximum temperature, and other things.

NV specs these things, and HP makes a notebook to the specs that NV
gives them, a process that happens long before the chips come out of
the fabs in any decent volume. If the chips are within the promised
specs, things go well. If they are not, there are some tweaks you can
pull, but if they are too far out of spec, you are basically screwed.

Now this assumes both sides are honest, and people are trying to solve
problems, not deflect blame. Nvidia is really good at the latter, bad
at the former. They also can't make a chip that isn't a blast furnace.
Most of their recent woes, including the massively delayed current
round of MCPs, is down to out of control thermals, just like the last
round.

How do you fix a systemic design problem in silicon on a time scale
that doesn't sink an entire season's notebook sales? Easy, you fudge
the spec sheet. If you have a TDP of 20W for a part, and it is coming
in at 25W from the fab, you can lower the speed or change what TDP
means. If you promised HP a chipset that has an 800FSB and it can only
hit 667, well, that is problematic. If you give them a chipset with a
20W TDP, and the definition of TDP changed between the last generation
and this one, well, "that is how we do it now".

If it is HP incompetence as Nvidia is stating, then it would simply be
a case of a line or two of notebooks that went bad. HP system
engineering is one of the very best in the industry, period, subject
to management whims. This is not to say they can't screw up, they most
definitely can, but it is pretty rare on anything major. HP does seem
to have QC process engineering down well.

Does this mean they are perfect? No, not even close. Have they screwed
up on a notebook? Sure, probably several here and there over the past
few years. If you look at the HP page, once again here, you will see
there are 24 models affected. I can believe there are one, two, maybe
four screwups, but 24 model lines all with the same problem? All with
cooling related failures? All with cooling related video failures? All
with cooling related video failures on Nvidia parts?

What NV is doing is smearing the good name of HP and its engineers
here. There is no way in hell that HP totally botched every Nvidia
based notebook for a generation in the same way. Not a chance. This is
once again a smear job, and it will once again come back to bite
Nvidia in the bottom line, give it time. Companies like this have long
memories. The only thing you can say from this is that it is not HP's
fault.

Well, actually, you can say more. If HP specced cooling for a
theoretical 20W, and the Nvidia chip puts out more than 20W, what
happens is you get more heat in the system than you can get rid of,
and temperatures slowly climb. It will either keep climbing, or level
off, but likely it is out of the thermal bounds set by Nvidia. The
system will get really hot or simply crash.
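The heat-budget argument above can be put into numbers with a rough sketch. It assumes a simple lumped thermal model, where steady-state temperature is ambient plus power times thermal resistance. The 20W and 25W figures are the article's own hypothetical; the ambient temperature, the temperature limit, and the function names are illustrative assumptions, not Nvidia or HP data.

```python
def required_theta(power_w, ambient_c, limit_c):
    """Thermal resistance (deg C per W) a cooler needs to hold limit_c at power_w."""
    return (limit_c - ambient_c) / power_w


def steady_state_temp_c(power_w, ambient_c, theta_c_per_w):
    """Lumped model: steady-state temperature = ambient + power * thermal resistance."""
    return ambient_c + power_w * theta_c_per_w


# Assumed values for illustration only.
AMBIENT_C = 35.0   # hypothetical air temperature inside the chassis
LIMIT_C = 95.0     # hypothetical maximum temperature on the spec sheet

# The OEM sizes the cooler for the promised 20W part.
theta = required_theta(20.0, AMBIENT_C, LIMIT_C)

# The same cooler with a part that actually dissipates 25W.
temp_at_spec = steady_state_temp_c(20.0, AMBIENT_C, theta)
temp_actual = steady_state_temp_c(25.0, AMBIENT_C, theta)

print(f"cooler sized for {theta:.1f} C/W")
print(f"at 20W: {temp_at_spec:.0f} C, at 25W: {temp_actual:.0f} C")
print(f"overshoot past the limit: {temp_actual - LIMIT_C:.0f} C")
```

Under these assumptions, a cooler that just meets the limit at 20W settles 15 C over it at 25W, which is the "levels off, but out of bounds" case described above.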

The problem? This puts them out of the thermal tolerances for the
packaging. That is OK for short periods, but repeatedly staying above
the limits causes the packaging material to degrade prematurely. Worse
yet, repeated heating and cooling caused by the laptops heating up and
then crashing, then being left off for a bit to cool and 'work again',
is horrible for the packaging. This is how solder joints and bumps
crack, and substrate warps. Coupled with weakened materials from
overheating, and you have dead GPUs.

This is hugely unlikely to be an HP problem, or a substrate problem. It
is most likely a bad engineering design decision that Nvidia tried to
sweep under the rug. Sometimes it works, other times it doesn't. This
time is an 'other', and companies like TSMC and HP don't like being
publicly crucified for Nvidia's screwups. They really don't like it.

The third bad move is 'customer use patterns': so, it isn't our fault,
it is those crazy kids! A Scooby Doo villain couldn't have said it
better after a failed whoop-ass attempt. From the look of things, the
customers are doing things like turning on and off laptops, something
likely unanticipated by Nvidia product planners. I mean who does that?

Blaming customers would be bad move number three, but I doubt most of
them will realise it is Nvidia's fault, they will blame HP or the host
of other OEMs that haven't been named yet. Either way, if you take bad
move #2 into account, if I were an OEM, I would tell everyone calling
in for warranty support unequivocally that it is Nvidia's fault for
supplying bum chips. In this case, it wouldn't be deflecting blame.

In any case, the 'crazy kids' blame game is pointless and will only
hurt Nvidia if people hear it. They likely won't, but there is no
upside unless they think analysts are several steps dumber than a slow
sheep.

In the end, the whole thing can be summed up by bad engineering,
covering your ass, and hoping it blows over. Nvidia corporate
messaging is pretty much incompetent, more driven by the fact that
they are pawns of people higher up the food chain than anything else,
and they only have one tool, a hammer.

When something goes wrong, they don't know how to solve problems, only
hit things. This situation was dealt with by surprising Wall Street
with a collective kick in the hedge funds. There was no explanation,
no softening of the blow, and no word to the press, just a 'Surprise,
we are tanking' governmental form, followed by stonewalling and finger
pointing at blameless people.

Botched doesn't begin to describe this response, but it is a good
start. They utterly flunked Crisis Management 101. Given the last
sentence of the 8K, "There can be no assurance that we will not
discover defects in other MCP or GPU products," this is far from over.
In fact, we know it is; there are many more lines and products
affected.

Now that you know about how the Nvidia parts failed leading to the
massive loss, plummeting stock, and management fast-talking, what
everyone wants to figure out is where the buck stops. That is not a
simple question, but several industry insiders have told us the same
story, it all depends on who got burned, and how big they are.

The one we know about is HP, here and here, but it is far from over.
Nvidia is chiming in now because it is very likely they are footing
the bill for the class action settlement, or at least a very large
chunk of it. When they gave the prescient advice that, "There can be
no assurance that we will not discover defects in other MCP or GPU
products", they aren't joking, this problem hasn't cropped up in
desktop parts yet, but it most assuredly will. We are getting reports
of other afflicted items, but it is premature to name them.

So, basically, Nvidia totally screwed up, and is blaming everyone but
the one company they should, itself. The OEMs know it, consumers know
it, suppliers know it, and since the "OMFG, our hair is on fire"
performance of last week, just about the entire world knows about it.
Everyone who has one of these parts will be seeking restitution, just
watch the bills mount now that word has spread.

But that brings up the costs and payments. Nvidia took a $150-200
million hit initially over this, but what does that cover? Looking at
Dell's web site, going from an integrated GPU to an external Nvidia
GPU is either a $50 or $130 upgrade, maybe more on a low volume gaming
part. That is what Dell sells the module for, plus profit and
overhead. The chips that Nvidia sells, minus GDDR memory, construction
etc, are probably in the $10-40 range.

If you look at that, there are three million or so parts affected, and
they can likely be fixed by swapping out a PCIe card. With chipsets, well,
things get interesting: they are soldered to the mobo, as are many
CPUs, especially in thinner notebooks. In this case, the replacement
means a new mobo minimum, possibly a CPU thrown in for good measure.

Then there is the cost of fielding the support call, not a trivial
matter for a dead notebook. Shipping the part back to the depot,
labour to replace the mobo, and shipping it back as well. Added
staffing to handle the returns of large portions of 24 notebook lines
adds to the bottom line as well.
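As a back-of-the-envelope check on the figures above, the sketch below divides the $150-200 million charge across the roughly three million affected parts and compares the result with the estimated $10-40 chip price. All inputs are the article's own estimates; the function name is purely illustrative.

```python
def charge_per_part(charge_usd, parts):
    """Spread a one-off charge evenly across the affected parts."""
    return charge_usd / parts


PARTS_AFFECTED = 3_000_000           # "three million or so" from the article
CHARGE_LOW, CHARGE_HIGH = 150e6, 200e6
CHIP_PRICE_LOW, CHIP_PRICE_HIGH = 10.0, 40.0  # estimated bare-chip price range

per_part_low = charge_per_part(CHARGE_LOW, PARTS_AFFECTED)
per_part_high = charge_per_part(CHARGE_HIGH, PARTS_AFFECTED)

print(f"charge per affected part: ${per_part_low:.2f} to ${per_part_high:.2f}")
print(f"left after a ${CHIP_PRICE_HIGH:.0f} chip: "
      f"${per_part_low - CHIP_PRICE_HIGH:.2f} to ${per_part_high - CHIP_PRICE_HIGH:.2f}")
```

On those numbers the charge works out to about $50-67 per part, which leaves little room once a support call, two-way shipping, labour and a replacement mobo are added on top of the chip itself.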

That leads to intangibles like customer ill will, lost productivity,
and the odd executive who gets a bum laptop for their kids. You can't
put a dollar value on these, but they do have an effect, much of
Dell's current woes are due to treating customers like dirt three to five
years ago.

So, once again, who pays for all of these costs? That is an
unequivocal "it depends". Depends on how contracts are written, how
much leverage the OEMs have, and how much good will Nvidia has built
up.

On one side, you have Dell, one-time masters of the supply chain, and
squeezers of every penny they can get. Industry insiders tell us that
Dell will be billing Nvidia for everything, from bad GPUs, mobos,
replacement costs, help desk, lawyers, and every truck roll needed to
fix something in the field. If Nvidia wriggles out of paying for
something, they will pay for it in other ways.

HP is a little more flexible, but since Nvidia has been effectively
blaming their engineering for it, I can see how they would lean a bit
more toward the "right royal bastard" side of things. They are close
to Dell in what they will charge, but may let some minor things slide.

As you move down the food chain to smaller people: mobo makers, Tier 2
computer makers, and even little shops, NV will disclaim more and
more. Asus and Gigabyte will likely not get everything covered, not
even close. Smaller board makers might get credit for the cost of MCPs
and GPUs.

Unhappiness will abound. They will all get their pound of flesh, it
may just take a bit of time. Lawsuits seem to have forced disclosure,
and NV is still trying to spin, minimize the downside, and point
fingers. This, however, is far from over. Look for desktops to be
affected as well as discrete GPUs before this is over, most of them
use the same ICs as the mobile parts.

There seem to be two currently-affected products, the low-end and the
mid-range parts of the last generation. Depending on the failure rate,
Nvidia could be looking to eat the majority of a generation's products
plus the cost of things they were soldered to, and the tech school
dropout used to screw new parts in.

This will be very ugly before it is done, very very ugly. Finger
pointing early on and the blame game will only harden resolve on the
other side, and add to costs. There go their cash reserves, we guess.
It couldn't come at a worse time. Then again, doing everything wrong
does have a cost. µ


http://www.theinquirer.net/gb/inquirer/news/2008/07/07/nvidia-meltdown-blame-game
 
Augustus

Why is it that hammerheads like you will post 100+ lines of unoriginal copy
and paste (mostly bullshit to boot) instead of providing a link for a
pointless post? A link with identical crap. From a 100% reliable source
like theinquirer.net. You're killfile material.
 
rjn

Augustus said:
Do you seriously believe that every single 8300GS,
8400GS, 8500GT, 8600GS, 8600GT and 8600GTS of the how
many millions ever made are faulty?
Unlikely....
Even the Inq, which is not letting go of this story,
is not claiming "every single", but
"... graphics chips fail at alarming rates ..."

from recent story:
"HP pays half for Nvidia's graphic problems"
<http://www.theinquirer.net/gb/inquirer/news/2008/07/31/hp-pays-half-nvidia-problems>
or <http://snipurl.com/39gfy> [www_theinquirer_net]

Independently, The Inq is also reporting that the NV 790i
has a problem:
"Nvidia 790i board pulled by makers"
<http://www.theinquirer.net/gb/inquirer/news/2008/07/31/nvidia-790i-board-pulled-makers>
or <http://snipurl.com/39gh7> [www_theinquirer_net]

Inq reporting can be iffy. Here's a more respected source:
"Nvidia reports problem with laptop chips"
<http://snipurl.com/39gn9> [www_computerworld_com]
"Nvidia Corp. has uncovered a problem with some older
graphics chips that shipped in "significant quantities"
of laptop PCs ...
...
Nvidia will take a charge against second-quarter earnings
of $150 million to $200 million to cover the expected cost
of repairing and replacing the products ...
...
The products have been failing in the field at "higher
than normal rates," Nvidia said.
..."

I'm only following this because I'm getting ready
to build a new PC, and obviously want to avoid any
known problems. My current PC has an older NV chipset,
and frankly it was never very reliable, although the
blue screens may not necessarily be from the chipset.
 
DRS

Augustus said:
A better headline would be "All G84 and G86's are lousy 3D
performers." Do you seriously believe that every single 8300GS,
8400GS, 8500GT, 8600GS, 8600GT and 8600GTS of the how many millions
ever made are faulty? Unlikely....
If the numbers were minimal you'd expect Nvidia to defuse the situation by
providing the data. Instead it has lied at every stage. First it said there
was no problem. Then it said it was only one bad batch that went to HP.
Then HP and Dell said there is a problem (HP Nth America has extended the
warranty on its affected models by 24 months and Dell is under pressure to
follow suit), yet Nvidia won't even publicly identify the problematic GPUs
let alone talk numbers. The collective corporate refusals to reduce
customer angst are indirect evidence the problem is much bigger than anyone
wants to acknowledge.
 
rjn

Mr.E Solved! said:
Mobile units are under frequent hot-cold cycles and generally have
poorer cooling solutions so they run hotter so the change in temps is
greater during their cycle.
Fudzilla has now piled on, with the same perspective:
"Nvidia having issues with desktop GPUs, as well"
<http://www.fudzilla.com/index.php?option=com_content&task=view&id=8730>
"What it boils down to is the solder material between the chip
and the packaging being sub-standard. Issues only occur if the
GPU is heated up and cooled down repeatedly, much as it
would be in a laptop ..."

Let me guess that this is not lead (Pb)-based solder.

Assuming the purported failure reports, and presumed
root cause, are true, is this one of the unintended
(but entirely predicted) side effects of RoHS?
 
rjn

Mr.E Solved! said:
I'm sorry if your nvidia stock is weaker, ...
Could get weaker yet. Fuddy is speculating:
"Nvidia M84 & M86 problems to cost more"
<http://www.fudzilla.com/index.php?option=com_content&task=view&id=8782>
"Monies set aside will not be enough to cover cost
...."

And it's not just the parts, labor, S&H.
If that story of the laptop graphics being excluded from some
laptop warranty is true, class actions are a certainty.
Any number of lawyers will be found to own affected LTs.

Fud closes with:
"All of these issues could be a very big advantage for AMD ..."
Only if the back-alley Chinese fab at the root cause didn't
build stuff the same way for other clients.
 
DRS

[...]
Fud closes with:
"All of these issues could be a very big advantage for AMD ..."
Only if the back-alley Chinese fab at the root cause didn't
build stuff the same way for other clients.
TSMC ain't no back-alley operation! Which makes their side of the story all
the more interesting.
 
kainendless

All interesting points but I don't think that they apply. I think your
position "here are our chips, we wash our hands of them" is not going to
hold up either in the boardroom or courtroom and based on recent events
it has not. Supplier-Integrator responsibilities go far beyond the
"consumer protections" end-user requirements you think I'm confusing
with the 'fitness of purpose' requirements that are usually negotiated,
I haven't seen the contract so I can't comment on specifics, but I've
seen enough of them to know what is typical and manufacturing defects
such as this are typically covered.

Especially when the problem encompasses many vendors in many different
designs and best practice cooling solutions do not result in other chips
failing. But for the bad chip packaging, there would be no fault. No
problems.

Especially when this was discovered in early 2007, a year and a half ago
and reported in April of 2007. Plenty of time to fix things, but nothing
was fixed. This didn't happen over a week or a day, it took months and
months.

http://www.theinquirer.net/gb/inquirer/news/2007/04/12/there-are-no-m...

Oh, you don't think the Inq can report the truth? Why not take it from
nVidia themselves:

NVIDIA president and CEO Jen-Hsun Huang stated:

"Although the failure appears related to the combination of the
interaction between the chip material set and system design, we have a
responsibility to our customers and will take our part in resolving this
problem. The GPU has become an increasingly important part of the
computing experience and we are seeing more interest by PC OEMs to adopt
GPUs in more platforms. Recognizing that the GPU is one of the most
complex processors in the system, it is critical that we now work more
closely with notebook system designers and our chip foundries to ensure
that the GPU and the system are designed collaboratively for the best
performance and robustness."

Again, if nvidia knew about the chip material set's different performance
characteristics and did not tell anyone, that is their fault. If they
did and the makers ignored their warnings, then it's the makers'
fault...for building a broken machine and using a broken component.

HP has split the difference in cost with nvidia, but no one else
has...the rest are demanding recall and restitution and are jumping
ship. Hard to say why, but HP can afford the $150 out of $300 liability
each repair is estimated to cost. Are you saying HP knew the chip was
faulty as well and crossed their fingers?

So far it looks like nVidia left everyone swinging in the breeze and
nothing you have said indicates otherwise, are you SURE you don't have
any nvidia stock?!
My 780i still has vc after 4 bios updates. Failure is nvidia's main
product
 
