Bizarre Overheating Issue

nometa

Hello All,

I have a serious - and very weird - overheating issue with my PC. It's
randomly (and I literally mean randomly: anywhere from 30 seconds after
boot to 3 hours in) reporting that the CPU is overheating and shutting
down. After the machine does a hard shut down, on reboot it reports at
startup that the machine was shut down due to a thermal event
(overheating), and I have to press F4 to continue.

The problem is that the machine ISN'T overheating: my fan app logs show
the machine never goes over 40 degrees C, the case never goes over 35
degrees, and every time it's happened, I've been able to touch the
heatsink for the CPU and hold my hand there with no issues - it's
barely warm.

I've removed all external hardware, checked all of my internal
connections, broken the machine down completely and rebuilt it,
removed and reseated the memory, cards, and processor, cleaned the
processor and heatsink/fan according to Arctic Silver's website
instructions (including using new Arctic Silver thermal paste). I've
updated the BIOS to the most recent release, done all of the Windoze
updates, done spyware scans etc. The only thing I haven't done is a
full virus scan because the machine keeps shutting down before it can
be completed - though it has gotten approx. 3/4 of the way through
(including scanning my entire system drive) and hasn't found anything.

I'm at my wits' end here - I really can't afford to / don't want to buy
a new board and/or processor: does anyone have any idea what this might
be? Is there any way to disable the thermal limit on the processor? (I'm
comfortable keeping an eye on it manually.)

Any input would be hugely appreciated. Specs on my box are below:

P4 1.8 Processor
Intel 865PERLL Board
Antec Sonata Case (350 W Supply)
Zalman CPU Fan
3 HDDs (80 Gig system / 120 Gig and 250 Gig D: and E: drives)
Basic CD Rom drive
Matrox G400 DualHead AGP Video card
Windows XP Pro SP2

If you need any more info, please let me know. Any help would be
appreciated. Thanks!
 
nometa

Sadly, there's no option in the BIOS to change the thresholds - at
least not that I can find. Intel provides a couple of utilities that in
theory allow you to change the temperature thresholds for the fans -
but most of them are locked, including for the CPU. But this doesn't
change the threshold at which the CPU considers itself to be
overheating and shuts down the system.
 
David Maynard

nometa said:
Hello All,

I have a serious - and very weird - overheating issue with my PC. It's
randomly (and I literally mean randomly: anywhere from 30 seconds after
boot to 3 hours in) reporting that the CPU is overheating and shutting
down. After the machine does a hard shut down, on reboot it reports at
startup that the machine was shut down due to a thermal event
(overheating), and I have to press F4 to continue.

That means it's the BIOS that's doing the shutdown. Check your BIOS temp
settings.

The problem is that the machine ISN'T overheating: my fan app logs show
the machine never goes over 40 degrees C, the case never goes over 35
degrees, and every time it's happened, I've been able to touch the
heatsink for the CPU and hold my hand there with no issues - it's
barely warm.

Well, the first thing, after checking the BIOS settings, is to determine
what the "fan log" is monitoring. Core temp, heatsink temp, or a bogus value?

And the finger feel heatsink test doesn't necessarily tell you anything
because it's the processor you care about and if heat is not getting *to*
the heatsink it won't get hot. The outside of your kitchen oven is probably
not as hot as the insides either, because it's insulated, if you see what I
mean.

I've removed all external hardware, checked all of my internal
connections, broken the machine down completely and rebuilt it,
removed, and reseated the memory, cards, and processor, cleaned the
processor and heatsink/fan according to Arctic Silver's website
instructions (including using new Arctic Silver thermal paste). I've
updated the BIOS to the most recent release, done all of the Windoze
updates, done spyware scans etc. The only thing I haven't done is a
full virus scan because the machine keeps shutting down before it can
be completed - though it has gotten approx. 3/4 of the way through
(including scanning my entire system drive) and hasn't found anything.

Checked fan rpm?
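
One way to pin down what the "fan log" is actually monitoring is to summarize it per sensor. A minimal Python sketch, assuming the fan app can export a CSV with time, sensor and celsius columns (that layout, the sample rows and the 70 C limit are all made up for illustration, not taken from any particular utility):

```python
import csv
import io
from collections import defaultdict

# Hypothetical export format: one row per reading, one column per field.
SAMPLE = """time,sensor,celsius
12:00:00,cpu,35.0
12:00:00,case,31.5
12:00:05,cpu,39.5
12:00:05,case,32.0
"""


def summarize(log_text, limit_c=70.0):
    """Return {sensor: (peak reading, readings above limit_c)} for a CSV log."""
    stats = defaultdict(lambda: [float("-inf"), 0])
    for row in csv.DictReader(io.StringIO(log_text)):
        temp = float(row["celsius"])
        entry = stats[row["sensor"]]
        entry[0] = max(entry[0], temp)   # track the highest value seen
        if temp > limit_c:               # limit_c is an assumed threshold
            entry[1] += 1
    return {name: tuple(values) for name, values in stats.items()}


if __name__ == "__main__":
    for name, (peak, over) in summarize(SAMPLE).items():
        print(f"{name}: peak {peak:.1f} C, {over} readings over limit")
```

If the sensor the log labels as the CPU never gets near the limit while the board still reports a thermal event, the log is probably not reading whatever trips the shutdown, or it polls too slowly to catch it.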
 
Ed Medlin

David Maynard said:
That means it's the BIOS that's doing the shutdown. Check your BIOS temp
settings.


Well, the first thing, after checking the BIOS settings, is to determine
what the "fan log" is monitoring. Core temp, heatsink temp, or a bogus
value?

And the finger feel heatsink test doesn't necessarily tell you anything
because it's the processor you care about and if heat is not getting *to*
the heatsink it won't get hot. The outside of your kitchen oven is
probably not as hot as the insides either, because it's insulated, if you
see what I mean.



Checked fan rpm?

That was my first thought. All the symptoms point to a failing CPU HS fan
(other than a bogus temp reading). I have had it happen before, but I was
lucky enough that it made enough noise to make me look for it. A bad or
gunked-up fan bearing can be very intermittent and, after cool-down, can run
normally for a while.

Ed
 
David Maynard

Ed said:
That was my first thought. All the symptoms point to a failing CPU HS fan
(other than a bogus temp reading). I have had it happen before, but I was
lucky enough that it made enough noise to make me look for it. A bad or
gunked-up fan bearing can be very intermittent and, after cool-down, can run
normally for a while.

Yeah. But then there's the matter that apparently his logs aren't showing
the same temperature problem so there is still a mystery there even if our
fan RPM guess has merit, and that causes me to wonder if it's looking at
the right sensor.

And then there's the conundrum that, if everything else was proper, with a
failing, or low RPM, fan the heatsink *would* be 'hot' but he says it's not.

He also doesn't say what processor it is but both the P4 and Athlon64, for
example, have a flat "thermal trip" control line off the processor that's
independent of any 'temp sensor' and the subsequent calibration issues one
might imagine with one. It doesn't matter what any 'sensor' says, whether
it's monitored, or what the trip point is set to, if the processor
internally detects an over temp it'll shut down.

It all tends to point to either poor thermal compound/application or the
heatsink not on right, e.g. cocked or backwards, but his comment about
cleaning everything suggests it's a machine that used to work.

Or a bogus sensor, but that's way down on the probability tree, IMO.

Lastly, he doesn't say anything at all about what 'PC' it is and if it's a
pre-built then there's the possibility that "thermal event" isn't
necessarily, or exclusively, the processor. It's the most likely but it
could, may, possibly, be monitoring other things as well.
 
nometa

Thanks for all the replies - seriously appreciated, since this has been
driving me mad.

To try and address everything in one reply:

1) It's a PC that I built (I've done this tons of times, for the record
- I used to work in a datacentre). It was working perfectly for about 2
months; the temp issues are new. I hadn't changed anything in the
hardware or BIOS since the original build.

2) The BIOS isn't showing overheating either - it says that the temp is
less than 40 Celsius - usually around 35 - for the CPU, and all other
zones are far less than this.

3) I'm confident the issue isn't poor contact with the CPU / the
thermal compound / a failing CPU fan. I have a Zalman CPU fan with an
external controller that I've set to max. The fan is spinning (and
registering as spinning in the BIOS and in the fan monitoring app).

I know it's seated correctly - there is heat transfer (the heatsink
below the fan is warm, just not enough to warrant an overheating
issue).

And I cleaned and reseated the heatsink/fan using Arctic Silver and
following these instructions:
http://www.arcticsilver.com/ceramique_instructions.htm.

I'm very anal-retentive about these kinds of things <grin>

For the record, there are also two case fans with the Antec Sonata case
(specs here: http://www.antec.com/us/productDetails.php?ProdID=15138)
that are both hooked up, spinning, and registering correctly in the
BIOS and fan app.

So I can say with confidence that I honestly don't think it's an
installation or fan issue.

Am I understanding correctly when I say that the CPU itself has a
thermal sensor that can shut it down? Is it possible that sensor itself
is failing? Is it possible it's a virus of some sort (I've read a
couple of random posts about power viruses, but nothing really seemed
credible)?

Again, thank you for the help. If I can at least narrow down the issue,
I'll feel less crazy.

Brad
 
David Maynard

nometa said:
Thanks for all the replies - seriously appreciated, since this has been
driving me mad.

To try and address everything in one reply:

1) It's a PC that I built (I've done this tons of times, for the record
- I used to work in a datacentre). It was working perfectly for about 2
months, the temp issues are new. I hadn't changed anything in the
hardware or BIOS since the original build.

2 months isn't long enough to rule out a construction issue because it
could be the environment has changed. For example, it could have been right
on the edge of doing this same thing all along but now does because room
temp has increased. Overclockers (not that it 'depends' on that) often run
into that problem when summer comes along but it also happens with winter
because, strange as it may seem, people sometimes keep the house *warmer*
in winter. Or the PC might be located where it's warmer because of room
airflow, or lack of it, with an example being near a heating/cooling vent.
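
The "right on the edge" point can be put in numbers with the usual steady-state estimate: die temperature is roughly ambient plus power times the cooler's thermal resistance. A small Python sketch; every figure in it (60 W, 0.5 C/W, a 70 C limit) is a made-up illustration, not a spec for this processor or heatsink:

```python
def die_temp_c(ambient_c, power_w, theta_c_per_w):
    """Steady-state estimate: ambient plus power times thermal resistance."""
    return ambient_c + power_w * theta_c_per_w


def headroom_c(limit_c, ambient_c, power_w, theta_c_per_w):
    """Degrees left before the assumed limit is reached."""
    return limit_c - die_temp_c(ambient_c, power_w, theta_c_per_w)


if __name__ == "__main__":
    # Same hypothetical build in a cool room and in a warm one.
    for ambient in (20.0, 30.0):
        left = headroom_c(70.0, ambient, 60.0, 0.5)
        print(f"ambient {ambient:.0f} C -> headroom {left:.0f} C")
```

Every degree the room warms up comes straight off the headroom, which is how a build that ran fine for months can start tripping without any hardware change.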

2) The BIOS isn't showing overheating either - it says that the temp is
less than 40 celsius - usually around 35 - for the CPU, and all other
zones are far less than this.

3) I'm confident the issue isn't poor contact with the CPU / the
thermal compound / a failing CPU fan. I have a Zalman CPU fan with an
external controller that I've set to max. The fan is spinning (and
registering as spinning in the BIOS and in the fan monitoring app).

I know it's seated correctly - there is heat transfer (the heatsink
below the fan is warm, just not enough to warrant an overheating
issue).

And I cleaned and reseated the heatsink/fan using Arctic Silver and
following these instructions:
http://www.arcticsilver.com/ceramique_instructions.htm.

I hear ya. But then you're getting a "thermal event."

I'm very anal-retentive about these kinds of things <grin>

For the record, there are also two case fans with the Antec Sonata case
(specs here: http://www.antec.com/us/productDetails.php?ProdID=15138)
that are both hooked up, spinning, and registering correctly in the
BIOS and fan app.

So I can say with confidence that I honestly don't think it's an
installation or fan issue.

Am I understanding correctly when I say that the CPU itself has a
thermal sensor that can shut it down?

As I mentioned, the P4 and Athlon64 internally monitor temperature and
generate a thermal trip if exceeded but you've not mentioned anything about
what processor or motherboard you're using, or what else is in it.

In theory the on-die thermal diode provided for external temp monitoring
should give a relatively accurate indication of temperature consistent with
the independent thermal trip, if that's what's being monitored. But, then,
what the 'monitor' is looking at is another question.

Is it possible that sensor itself
is failing?

Highly unlikely that the internal monitor would fail independent of the
processor as it's not a 'sensor' in the sense of being a separate device
but is simply another of the millions upon millions of semiconductor
devices on the same piece of silicon.

Is it possible it's a virus of some sort (I've read a
couple of random posts about power viruses, but nothing really seemed
credible)?

There's one that forces a shutdown but the BIOS would not report a "thermal
event" from such a thing and the one I am aware of pops up a shutdown
warning with a 30 second, or some such, timeout. If I remember correctly it
does so by mangling a critical process XP monitors and it's actually XP
doing an "I'm broken" shutdown.
 
nometa

Replying to David:

Specs are: P4 1.8 Processor, Intel 865PERLL Board (rest of components
listed in original post). Not denying that I'm getting a "thermal
event" - just trying to figure out how it's possible, and what I can do
about it. To be clear:

- I'm not overclocking
- I've got 3 fans in the case, all running full speed
- my room isn't hot (freezing, actually - water heaters + old house =
coooolllddd)
- NONE of the sensors - OS, application, or BIOS - are reporting
overheating

Bottom line is that this machine is shutting down because it's
overheating (or thinks it is), but none of the tools available show
that it's overheating. If anyone can suggest a way to figure out why
nothing in the BIOS or other applications shows overheating and yet
this machine is (or thinks it is) overheating, and more importantly,
how I can fix the issue, I would seriously appreciate it.

Replying to JAD:

Never thought it might be a power supply issue - is there any way to
tell if that's the problem (short of swapping it out)? And out of
curiosity, how might this be causing the issue? I'm not clear on how a
power supply might be causing the machine to report a CPU overheat.

Thanks!
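
One rough check short of swapping the supply is to compare measured rail voltages against the +/-5% tolerance the ATX spec gives for the +3.3V, +5V and +12V rails. A sketch in Python; the readings are hypothetical and would have to come from a multimeter or the board's hardware monitor:

```python
# Nominal voltages for the positive ATX rails.
NOMINAL = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0}
TOLERANCE = 0.05  # +/-5% on the positive rails


def out_of_spec(readings):
    """Return {rail: percent deviation} for rails outside the tolerance."""
    bad = {}
    for rail, volts in readings.items():
        nominal = NOMINAL[rail]
        deviation = (volts - nominal) / nominal
        if abs(deviation) > TOLERANCE:
            bad[rail] = round(deviation * 100, 1)
    return bad


if __name__ == "__main__":
    # Hypothetical readings, not measurements from this machine.
    measured = {"+3.3V": 3.28, "+5V": 4.70, "+12V": 12.1}
    print(out_of_spec(measured))
```

A steady reading inside the range doesn't clear the supply, since brief dips under load won't show up in a spot measurement, so swapping in a known-good unit remains the more conclusive test.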
 
nometa

In case anyone has the same issue and is looking for a resolution: not
sure how or why, but it seems it was the power supply after all. I had
a 2nd Antec Sonata case, so swapped out the power supplies, and have
had zero issues since - the box has been up and running for 6 days
straight.

If anyone has any idea how this might happen, I would love to know for
the sake of curiosity...

Thanks all for your help!
 
