Bad Power Mgmt

don_b_1

New and fully updated install of XP Home SP2 installed to freshly formatted
partition at (hd0,0) for new components. No previous power mgmt difficulties.

New parts are as follows:
MSI MS-7325 v1.0 (K9N4 SLI) NVIDIA nForce 500 SLI chipset
AMI BIOS v. 1.3 (but the problem exists with v.1.4 and v.1.5)
AMD64 X2 5000+ Black Edition
Corsair TWIN 2X2048-6400C4 G (two overnight runs of memtest86 mode 5
produced no errors)
MSI NX7300LE 256 MB - NVIDIA GeForce 7300 LE
New 500w / 34a x 12v power supply
Nothing overclocked

I activated the Home/Office Desk profile in its default configuration and
found the computer had rebooted when I expected it to be hibernating. I then
activated the keyboard sleep button and verified that entering standby mode
triggers a reboot.

I set up a "Hibertest" profile to prevent Standby mode and it entered
hibernation one time only. The computer enters and exits hibernation
perfectly when manually initiated.

All power profiles turn off the display and hard disks properly and on
schedule. These resume properly with any input.

So, two problems: hibernation never starts automatically, and standby mode
causes a reboot whether initiated by the system, the sleep button or the
shutdown menu item.

Power management has been reinstalled. Changing the ACPI functions in the
BIOS has no effect on either situation.

Windows Error Reporting says the standby mode crash is caused by the video
driver and issues advisory "STOP 0x000000EA THREAD_STUCK_IN_DEVICE_DRIVER".
It recommends updating to the most current video driver. I did this.
Alternatively, it suggests disabling hardware acceleration and disabling
write combining. I tried this too. The standby reboot problem still occurs
very reliably.

Not sure I believe the error reporting service since the computer passes all
stability and torture tests. Other than the rebooting initiated by standby
mode, the computer has no problems that I've identified so far.

Any ideas on cause of this? Are there any optional hotfixes issued for
nvidia accelerated graphics driver issues?

Thanks,
Don
 
don_b_1

Update: I did remove hiberfil.sys, run chkdsk /r, defrag and reactivated the
hiberfil. Now I can push my sleep button and put the computer into dead
powered down standby where the lights and fans turn completely off, then
bring it right back up with a key press. I can also manually put it into full
hibernation and bring it back up with the power switch but I still have no
automatic hibernation and have lost automatic standby.
 
PaulMaudib

Update: I did remove hiberfil.sys, run chkdsk /r, defrag and reactivated the
hiberfil. Now I can push my sleep button and put the computer into dead
powered down standby where the lights and fans turn completely off, then
bring it right back up with a key press. I can also manually put it into full
hibernation and bring it back up with the power switch but I still have no
automatic hibernation and have lost automatic standby.

Too bad nobody has any idea what you are responding to.
 
w_tom

Now I can push my sleep button and put the computer into dead
powered down standby where the lights and fans turn completely off, then
bring it right back up with a key press. I can also manually put it into full
hibernation and bring it back up with the power switch but I still have no
automatic hibernation and have lost automatic standby.

The error message is probably correct. What others call stress
testing is not really stress testing and tells you little. The question
is whether the video driver or the video hardware is defective.
Hardware (like so many other devices, including a power supply) can be
defective while still booting and running a computer.

First, does every connector on video card have a corresponding
connection? If not, why not?

Second, what do the video card manufacturer's diagnostics report? If
it does not provide diagnostics for free, well, how responsible was
that manufacturer? Problems like yours are why diagnostics are provided
and executed.

Third, is the power supply really working properly? To find out, the
computer must be doing everything at once: the sound card making noise,
while the CD-ROM is reading, while another program is searching through
the hard drive, while the video processor is doing complex graphics
(e.g. a movie), while IE is downloading a program from the internet.
Now you are ready to measure the four critical voltages, one each on
any orange, red, yellow, and purple wire from the power supply. Those
DC voltage numbers must exceed 3.23, 4.87, and 11.7. The motherboard
monitor is not sufficient.
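
[Editor's sketch] The pass/fail rule above can be written down as a small check. This is only an illustration, not a measurement procedure: the wire-colour-to-rail mapping is the usual ATX convention and is an assumption here, the post gives three thresholds for four wires so the purple (standby) wire is assumed to share the 5 V threshold, and the sample readings are made up.

```python
# Minimum acceptable multimeter readings quoted in the post, by wire colour.
# Colour-to-rail mapping is the common ATX convention (an assumption here).
MIN_VOLTS = {
    "orange": 3.23,   # +3.3 V rail
    "red": 4.87,      # +5 V rail
    "purple": 4.87,   # +5 V standby; ASSUMED to share the 5 V threshold
    "yellow": 11.7,   # +12 V rail
}

def check_rails(readings):
    """Map each wire colour to 'good' or 'bad' for the given readings (volts)."""
    verdicts = {}
    for wire, minimum in MIN_VOLTS.items():
        if wire not in readings:
            raise ValueError(f"no reading supplied for the {wire} wire")
        # The post says readings must exceed the minimum, so use a strict test.
        verdicts[wire] = "good" if readings[wire] > minimum else "bad"
    return verdicts

# Hypothetical readings taken under full load (illustrative values only).
sample = {"orange": 3.31, "red": 4.85, "purple": 5.02, "yellow": 11.92}
for wire, verdict in check_rails(sample).items():
    print(f"{wire}: {sample[wire]:.2f} V -> {verdict}")
```

The point of the strict comparison is the same as in the post: each rail ends up either definitively good or definitively bad, with no 'maybe'.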

In each case, establish each suspect as either definitively good or
definitively bad. 'Maybe' answers such as your stress test tell us
nothing useful.

BTW, while running the video processor diagnostics, heat that card
with a hairdryer on high. If the video card has an intermittent fault,
sometimes the defect can only be identified at the temperature
produced by a hairdryer on its highest heat setting. This determines
something 'definitively'. Then never look back while moving on to
other possible suspects.
 
don_b_1

Thanks for the info and advice w_tom.

At this point, all the new hardware is suspect even though it tests out OK.
The power supply seems good as all voltages come in within 2% above spec
whether under load or not. I verified with two different meters. I suppose I
should have another look at that. The diagnostics on the video card doesn't
show a problem.

Further checking revealed Windows is definitely the culprit here. My new
installation got thoroughly trashed but the problem is I don't know why. I'm
hoping I didn't get a good install when I changed the hardware. I didn't
bother with the power management in the beginning but when I did, it had
problems.

The other night after I ran the chkdsk/defrag, etc. I cured one thing but
more things goofed up. I lost all my restore points and Windows started
handling my processor cores in a very crazy manner. This was definitely a
Windows issue because it wasn't happening in Safe Mode or Linux. I also
couldn't start the Windows recovery manager from the CD.

Big question is: what happened? I'm hoping I got a bad install in the
beginning. Thinking back, I'm not absolutely certain I did a proper format on
the drive when I reinstalled three weeks ago. I know I didn't have to break
out my "proof" disk to allow the XP upgrade to continue and blew through
everything quick and dirty. I'm doing a reinstall right now and definitely
did do a proper format this time. Twice. Once with GParted and a followup
with Windows just to make sure.

We'll see but it already appears as though I have a problem as the display
won't return when recovering from manually induced standby in ACPI S3 mode
and it still reboots instead of recovering when in S1 mode. I think maybe I
have to get the proper hotfixes for the AMD X2 CPU. My AMD X2 laptop with
Vista doesn't have any power mgmt issues.

I also have a weird power options window with no UPS tab but a double entry
screen under the main Power Schemes tab with setup for battery power and AC
power being side by side. Never seen that before.

I'm not sure what you mean when you ask if every connector on video card has
a corresponding connection. There are different video outputs but I'm only
using the D-shell output to run my monitor.
 
w_tom

At this point, all the new hardware is suspect even though it tests out OK.
The power supply seems good as all voltages come in within 2% above spec
whether under load or not. I verified with two different meters. I suppose I
should have another look at that. The diagnostics on the video card doesn't
show a problem.

Further checking revealed Windows is definitely the culprit here. My new
installation got thoroughly trashed but the problem is I don't know why. I'm
hoping I didn't get a good install when I changed the hardware. I didn't
bother with the power management in the beginning but when I did, it had
problems. ....

What needs to be defined is whether all voltages were OK under full
load. To be clear, full load means downloading from the Internet, while
playing complex graphics (e.g. a movie) on the video processor, while
playing sound, while reading a CD-ROM, while doing a defrag on the
disk, while reading a floppy, etc. Then record those numbers. Being
within 2% does not tell as much as what those numbers actually are on
any one of the purple, orange, red, and yellow wires when under that
above load.

Moving on, it is normal for defective hardware to pass the
diagnostic at normal temperature. Some defects only appear in a
diagnostic when hardware (e.g. the video processor and adjacent memory)
is heated by a hairdryer on its highest heat setting. (The same heat is
required to make Memtest86 an effective diagnostic.)

Anything involving disk drive hardware (partitioning or formatting)
is irrelevant to your problem. Hardware associated with your problem
is limited to video processor, sound card, some memory (used by OS),
CPU (and its associated power supply), power supply 'system' (not just
the power supply), and some motherboard functions.

I don't remember if this was discussed. But a visual inspection of
electrolytic capacitors may reveal swelling - which would create a
voltage problem across the motherboard resulting in what appears to be
strange (unrelated), and unique problems.

Whatever the problem is has caused (apparently) the video subsystem
to not work properly as indicated by the BSOD error code and by how
that video processor interferes with power down routines in BIOS.
Hopefully something above will identify a hardware failure. If not,
we can only assume an incompatibility between that video card (both
firmware and driver) and the BIOS (power management) firmware /
hardware.

I am also assuming the video processor manufacturer web site has
been consulted for a latest driver version. Not at all a likely
solution, but ...

Use heat as a diagnostic tool. For example, are problems worse when
only one part of the hardware is heated with that hairdryer? This is a
technique used in aerospace to find defective hardware before that
hardware starts failing. Cold (e.g. 40 degrees F) is also a useful (but
not as effective) diagnostic tool.
 
don_b_1

After downloading Windows Updates all night, the computer is doing better. It
will enter and recover from S1 standby with no problem. I haven't tested
ACPI S3 mode yet. It will also manually and automatically enter hibernation
and my power options control panel applet is now correct.

The previous problems sure seem to be Windows induced rather than hardware
or driver problems. Heat was not an issue. I first became aware of it when
the computer was idle. I would return to it and find it either locked up with
no display (requiring a power down for recovery) or find it rebooted when it
should have been in hibernation. Heat had nothing to do with this. Working
the bejeezus out of the computer has never incited a shutdown or an error
that I can see.

I can drive the GPU very hard for hours running a simulation that operates
at +75 fps and never induce a shutdown of any sort. It will heat up a bit
according to the temperature monitor but that doesn't appear to cause
problems.

Right now I'm still running the original drivers that came with the
components and things seem to be fine so far. Next I'll update the drivers
to nvidia's latest and see what happens.

The thing that bothers me is what caused Windows to become so corrupt. It
was driving my second core to 80% or better at all times, producing an
average CPU Usage of 40-50% even as System Idle Process was clocking in at
97-99%. This was obviously a Windows problem since it didn't occur in Safe
Mode or when running Linux.
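
[Editor's sketch] The figures above hang together arithmetically if the overall CPU Usage number is a plain average of the two cores. That averaging rule is an assumption about how the monitor reports usage, and the function name is made up for illustration.

```python
def implied_first_core(average_pct, second_core_pct):
    """Usage the first core must have for a two-core average to hold.

    ASSUMES overall usage = (core0 + core1) / 2, which is an assumption
    about the monitoring tool, not something stated in the post.
    """
    return 2 * average_pct - second_core_pct

# Second core pinned at 80%, overall reading between 40% and 50%.
for average in (40, 50):
    first = implied_first_core(average, 80)
    print(f"average {average}% with core 1 at 80% -> core 0 at {first}%")
```

Under that assumption the first core sits between 0% and 20%, so nearly all of the load is on the second core even though no process is being charged for it.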

When it was goofed like this, it required 70 seconds to complete a specific
operation that normally required 14.7 seconds. With affinity set to CPU1, the
second core, it required 90 seconds.

At that time, I could complete the same test operation in Linux in 13
seconds flat. Now that Windows is all reinstalled and tamed, it's back to
completing the operation in 14.7 seconds and will do it with affinity set to
CPU1 in 13.5 seconds.

BTW: this test I mention is doing a find and replace operation on a 2.2 MB,
407 page text document that requires almost 69000 replacements of the letter
"a" with "1234567890". I run the Windows test in Word and the Linux test,
using the same document, in Open Office.
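
[Editor's sketch] A rough stand-alone analogue of that test, for anyone who wants a repeatable timing without a word processor. It is not the original benchmark: the text is a synthetic stand-in for the 407 page document, the function name is made up, and a plain string replace is far faster than Word or Open Office, so the absolute times are not comparable with the 13 to 90 second figures above. It is only useful for comparing the same machine before and after a change.

```python
import time

def timed_replace(text, old="a", new="1234567890"):
    """Replace every `old` with `new`; return (result, count, seconds)."""
    count = text.count(old)              # how many replacements will happen
    start = time.perf_counter()
    result = text.replace(old, new)
    elapsed = time.perf_counter() - start
    return result, count, elapsed

# Synthetic stand-in text (NOT the original document); the repeat count is
# chosen so the number of replacements is close to the post's ~69000.
text = "a quick sample phrase. " * 23000
result, count, seconds = timed_replace(text)
print(f"{count} replacements in {seconds:.4f} s")
```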
 
w_tom

At that time, I could complete the same test operation in Linux in 13
seconds flat. Now that Windows is all reinstalled and tamed, it's back to
completing the operation in 14.7 seconds and will do it with affinity set to
CPU1 in 13.5 seconds.

Both your diagnostics and the symptoms after reloading Windows
suggest the hardware has always been good and that Windows was
corrupted. The problem may have been in the HAL; more likely the
interface between the HAL and other related functions.

Sometimes, when reloading system files, some files do not get
updated. A different example: critical files that process IP functions
(networking) would not properly reload until every peripheral that
uses IP (modem, wireless, etc.) was first removed. You may have had the
same.

If, for some reason, one function (or file) was an older version and
the other function (file) was a newer update, then incompatibilities
can exist. Microsoft programmers would have assumed both functions
were either original or updated; they never intended two different
versions to work together. This creates strange problems and could
explain why an interface between the HAL and BIOS power management was
corrupted. How could two 'talking' functions end up at different
(incompatible) rev levels? I can only speculate. However, if a computer
with MS updates was then reloaded with original OS files, then
incompatible revisions might exist.

Symptoms suggest the problem has been eliminated by reloading and
updating all OS files to the same (latest) rev level.

What can cause a processor to work hard while no processes are
executing? An interrupt that the CPU processes but that never gets
cleared. That is one example that might explain 80% CPU time.
Unfortunately we have too little information to answer any better. But
once all files were reloaded to the same rev level, things apparently
work.
 
don_b_1

Here's where it gets weird, Tom. I also use Ubuntu linux. After I built the
new computer, I did the XP reinstall to accommodate the hardware upgrade but I
didn't reinstall the linux since it still worked. I merely installed and
enabled the nvidia restricted drivers. It was so easy it was almost
automatic. After a week or so, about the time I first started noticing
oddities with the XP, my linux started having problems. I assumed it was
something to do with putting the new hardware under it so I reinstalled and
did all my updates on it. (Sixteen hours of downloading those updates) I
never could get the nvidia drivers to install on this copy of Ubuntu and was
stuck with generic vesa or the open source nv driver. I tried everything I
could think of, and hammered on it for a week. I used all the linux methods,
used a third party installer and did the miserable nvidia command line
routine to get it to go. No way. I even put an installation on a different
hard drive but nothing would work.

Subsequent to formatting, reinstalling, updating, testing and verifying my
XP installation as good, I booted up Ubuntu to see if it acted any
differently. Sure enough, I used the Restricted Drivers Manager to install
the proprietary nvidia and it took it immediately.

I never would have dreamed that a corrupted copy of Windows could goof up a
computer so badly that it would prevent a completely different operating
system on a completely different physical drive from functioning properly.

Seems like I learn something new every time I blink.

Thanks for your help amigo.
 
w_tom

I never would have dreamed that a corrupted copy of Windows could goof up a
computer so badly that it would prevent a completely different operating
system on a completely different physical drive from functioning properly.

Software cannot harm hardware. In the rare case where an exception
existed, video monitors have since been redesigned to eliminate that
'software harming hardware' problem.

Software might change CMOS settings. But CMOS (except for the date/time
clock) is only changed by BIOS; not by any other software. This new
symptom suggests the Windows HAL does not explain the problem. Again,
these analog types of problems are why we use heat (and cold) to
aggravate, then locate, the problem. The problem could be a weak
transistor that periodically does and does not conduct enough current
(does not create a logic one or logic zero; only creates an undefined
voltage in between). The problem could be a hardware problem that
causes timing shifts / delays. Again we use temperature to make that
failure obvious to diagnostics. These problems could exist between
CPUs, in buses between essential functions, or even on the video
processor interface.

Do your diagnostics also test multiprocessor arbitration functions
on the motherboard? Probably not if diagnostics are not provided by
the computer manufacturer.

Heat could also cause CPU's power supply to become less stable. Any
power supply voltage can be completely defective and CPU will still
work OK. Again, temperature may aggravate the problem sufficiently
that bad voltages are apparent.

Of course, you have inspected electrolytic capacitors for bulging -
another indication that voltages are no longer stable.

Appreciate the objective: aggravate a marginal and therefore
intermittent hardware problem. Also useful is what those voltage
numbers actually are (not just that voltages are in spec). Switch
between light load and the above described full load testing and record
how the numbers change. Again, we are seeking something marginal.
Voltages can be in spec while the numbers still indicate the problem.
A best tool to find these is a hairdryer on highest heat. Heat is not a
problem as others so mistakenly assume. Heat is a most powerful
diagnostic tool to aggravate, make hard, and therefore find a defect.
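
[Editor's sketch] Recording the change between light load and full load can be as simple as the following. The rail labels and all readings are hypothetical values for illustration, and the function name is made up; the post does not say how much change counts as marginal, so no threshold is applied here.

```python
def load_droop(light, full):
    """Per-rail change in volts (full load minus light load), to the millivolt."""
    return {rail: round(full[rail] - light[rail], 3) for rail in light}

# Hypothetical multimeter readings in volts (illustrative values only).
light = {"3.3V": 3.34, "5V": 5.06, "12V": 12.10}
full = {"3.3V": 3.31, "5V": 5.01, "12V": 11.82}

droop = load_droop(light, full)
for rail, delta in droop.items():
    print(f"{rail}: {light[rail]:.2f} V -> {full[rail]:.2f} V ({delta:+.3f} V)")
```

Keeping the actual numbers, rather than a pass/fail note, is what lets a rail that is still in spec but moving a lot under load stand out.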

Of course, motherboards even have tiny capacitors scattered about to
'decouple' adjacent ICs. If one is missing, then intermittent
failures can occur. Inspection will never find such problems. Just
another reason why we aggravate problems by working hardware at
extremes (full load, highest and lowest temperatures, etc.) to make
a problem hard before trying to find or fix it. Eliminating
intermittent failures is the most difficult task. Suggested above are
some tools to use in that art.
 
don_b_1

Thanks again Tom.

Of course hardware is always a suspect but in this case there really is no
indication of such. Heat and/or working the computer very hard has never
caused a problem.

What I do know for certain is:

1) Since formatting and reinstalling and updating XP, there have still been
no errors or problems with the computer or either operating system.

2) Prior to this reinstallation, there was nothing but rapidly progressing
problems with the computer and both operating systems that finally reached
the point where the computer was out of control.

Logic and common sense have to place the blame on the original install of XP
but I'm watching very carefully.
 
