Hangs randomly and doesn't boot

screwed-up-man

Hi there,
we are having very annoying sporadic problems with a server we bought
recently.
The server hangs up sporadically every few days and if power-cycled
won't complete boot or complete POST for a few minutes (hangs, in some
cases screen has no signal).

I think this might be caused by a previous overheat (in December), but
the symptoms are strange, so please tell me if you have an alternative
diagnosis or a suggestion to fix it.

Unfortunately we bought this server without on-site assistance, so
calling a technician costs extra. Also, since the problems happen very
sporadically and last only a few minutes, it is hard to get assistance
while they are occurring.

The server was assembled by the shop. It is a 24-disk rackmount with two
big turbines on the side and two 60mm fans next to the RAM modules
(16x2GB); the board is a Tyan i5000PW. The two small fans were turned to
push air out, as the turbines do (air comes in through the disks at the
front). However, the negative pressure created by the turbines inside
the chassis almost stopped the air flow from the small fans, and I
believe the RAM modules overheated, or something else in that area. I
think I received some "CRITICAL OVERHEAT FB-DIMM" error messages from
the Linux kernel once in December. We were using it in a normal room at
normal temperature; it had not yet been moved to the air-conditioned
server room.

When this happened I realized that the RAM modules were overheating and
I turned the small 60mm fans around to push air towards the inside. Now
the RAM modules are cool enough: max 67 degrees Celsius (lm-sensors
i5k_amb) at normal room temperature. Then we shut down the server for
the remainder of December.

In January I launched memtest86+ v2.01 and that went for 1 day without
errors! (that's why I am not fully convinced of RAM problems) Is
memtest86+ able to properly disable the ECC in 5000P Blackford? Would
you trust memtest results 100%?

Still, yesterday the computer hung again and when I power-cycled it,
it wouldn't boot or even POST (no screen signal) for 5 minutes, until I
opened it (it probably cooled down a bit, or was it some spurious data
in silicon that cleared?) and then it booted again. Then I closed it and
it is still running today!!

What the heck?

Frankly, we do have an additional identical mainboard. I might swap it
in (supposing the current one is broken, which I am not sure of), but if
I return the "broken" one to the shop they probably won't replace it,
because it runs correctly 99% of the time. This is a problem for us
because we would like to keep a spare mainboard for the time when
nothing compatible is in production anymore. We are a research entity
and the funding for that project has already ended, so we cannot buy
another one and would prefer not to waste a mainboard. OK, if I cannot
get rid of these hangups I will eventually change the mainboard and see.

Another thing I could do is to turn the fans around again and make it
overheat again, to make it (hopefully) fail predictably, but I fear I
would damage something useful (we e.g. do not have spare RAM modules).

I am also thinking about BIOS settings. Is there anything there that
could make it behave like this? That would be the best news.
Yesterday I changed these settings:
Installed OS: Win2K/XP ---> Other (we use Linux Ubuntu 8.04)
Large Disk Access Mode: DOS ---> Other (maybe I did this wrong
according to the manual)
SERR signal condition: Single bit ---> Both
System Event Logging: ---> Reset Log (wouldn't show anything... but it
still doesn't)
Parallel Port: Enabled ---> Disabled
Then in Linux I disabled these modules, which were continuously
reporting spurious errors (it has always been like this, also on an
identical machine we have in another building):
i5000_edac
edac_core


Thank you for any help. As you imagine this was supposed to be our main
server for the foreseeable future and we are not full of money as we are
a research entity, and that funding has ended. We would really need to
have this server working... :-(

I might follow up with more information slowly in this thread (even
across a few weeks) when new things happen.
 
Aragorn

Hi there,
we are having very annoying sporadic problems with a server we bought
recently.
The server hangs up sporadically every few days and if power-cycled
won't complete boot or complete POST for a few minutes (hangs, in some
cases screen has no signal).

Okay... That means it's definitely a hardware problem and not a software
problem.
I think this might be caused by a previous overheat (in December), but
the symptoms are strange, so please tell me if you have an alternative
diagnosis or a suggestion to fix it.

Unfortunately we bought this server without on-site assistance, so
calling a technician costs extra. Also, since the problems happen very
sporadically and last only a few minutes, it is hard to get assistance
while they are occurring.

I wouldn't let that hold me back, though. If there is a problem serious
enough to prevent the machine from being used 24/7 - it *is* after all a
server, right? - then whoever assembled that machine for you and sold it to
you has failed to live up to the quality standards you've paid for, and
thus you are entitled to complain.

Normal warranty is one year on the entire machine with most manufacturers
and shops, plus the extended warranties on the individual components - e.g.
3 or 5 years on hard disks, depending on the type - so you should not be
afraid of contacting the manufacturer.
The server was assembled by the shop. It is a 24-disk rackmount with two
big turbines on the side and two 60mm fans next to the RAM modules
(16x2GB); the board is a Tyan i5000PW.

Tyan is pretty good stuff - I have one myself - but I also know from
experience with a previous Tyan board that they can be tricky to set up if
the person assembling the machine isn't qualified to work on anything more
specialized than a consumer-grade PC.

I've had a machine once - for about three months - built on a Tyan S2642
Thunder K7 board, which *never* ran stable and would sometimes crash up to
10 times in a single hour. The people who built that machine for me
insisted that it was my fault, because I was running GNU/Linux on it and
<quote> "they had tested it with Windows XP and Windows 2000 Server and
they never had any problems with it" </quote>, even though they *knew* in
advance that I would solely be using that machine with GNU/Linux.

The whole thing turned into a flamewar via e-mail, and eventually they
agreed that I would ship the machine back to them. After a whole year -
unbelievable but true - they told me that I had been right and that there
was "something" wrong with the machine. They had tried - or so they say,
but I do believe that they weren't lying about this specific fact - with
different sets of processors, different motherboards and eventually
different sets of everything, and they just couldn't get that machine to
work. Eventually they built me an Intel machine, with which I've been
having loads of other problems since and which by now cannot be relied upon
anymore either - I have already spoken about this elsewhere.

My point is that although those people couldn't get it right, the Tyan
Thunder K7 board more or less has its own fanclub of people who think it's
the greatest thing since sliced bread, and I'm also pretty convinced that
Intel doesn't ship buggy motherboards and buggy processors.

So that makes for two machines built by the same guys - they have in the
meantime gone bankrupt, but that had another reason - that both fail to
live up to what they are intended for, and thus my point is that
professional-grade hardware is best not handled by ordinary consumer-grade
PC assemblers.
The two small fans were turned to push air out, as the turbines do (air
comes in through the disks at the front). However, the negative pressure
created by the turbines inside the chassis almost stopped the air flow
from the small fans, and I believe the RAM modules overheated, or
something else in that area. I think I received some "CRITICAL OVERHEAT
FB-DIMM" error messages from the Linux kernel once in December. We were
using it in a normal room at normal temperature; it had not yet been moved
to the air-conditioned server room.

A conditioned server room is great but shouldn't be a requirement for proper
functioning. The error message you got however should never have been
discarded.
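
To keep an eye on those temperatures so that a warning like that doesn't go unnoticed again, something along these lines could be run from cron. This is only a rough sketch: it assumes the sensor driver (i5k_amb in your case) exposes its readings through the standard hwmon sysfs files, and the 80 degree limit is a placeholder I made up, not a figure from any FB-DIMM datasheet.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {(chip, sensor): degrees_celsius} for every temp*_input file."""
    temps = {}
    root = Path(base)
    if not root.is_dir():
        return temps
    for chip_dir in sorted(root.glob("hwmon*")):
        try:
            chip = (chip_dir / "name").read_text().strip()
        except OSError:
            chip = chip_dir.name  # fall back to the directory name
        for f in sorted(chip_dir.glob("temp*_input")):
            try:
                # hwmon reports temperatures in millidegrees Celsius
                value = int(f.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                continue
            temps[(chip, f.name[: -len("_input")])] = value
    return temps


def over_limit(temps, limit_c):
    """Keep only the readings above limit_c."""
    return {key: t for key, t in temps.items() if t > limit_c}


if __name__ == "__main__":
    LIMIT_C = 80.0  # assumed placeholder, adjust to your hardware
    for (chip, sensor), t in over_limit(read_hwmon_temps(), LIMIT_C).items():
        print(f"WARNING: {chip} {sensor} at {t:.1f} C")
```

If it prints nothing, either everything is below the limit or no sensor driver is loaded, so check once by hand that read_hwmon_temps() actually returns readings on your machine.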
When this happened I realized that the RAM modules were overheating and
I turned the small 60mm fans around to push air towards the inside. Now
the RAM modules are cool enough: max 67 degrees Celsius (lm-sensors
i5k_amb) at normal room temperature. Then we shut down the server for
the remainder of December.

In January I launched memtest86+ v2.01 and that went for 1 day without
errors! (that's why I am not fully convinced of RAM problems) Is
memtest86+ able to properly disable the ECC in 5000P Blackford? Would
you trust memtest results 100%?

/memtest86(+)/ cannot prove to you that there is nothing wrong with the
machine. It can only corroborate the suspicion that there *is* something
wrong with it by listing RAM errors if it encounters them. As far as I can
tell, it also does a check on the ECC.
Still, yesterday the computer hung again and when I power-cycled it,
it wouldn't boot or even POST (no screen signal) for 5 minutes, until I
opened it (it probably cooled down a bit, or was it some spurious data
in silicon that cleared?) and then it booted again. Then I closed it and
it is still running today!!

What the heck?

Did you check */var/log/messages* and the output of *dmesg* for any
warnings? Not that a complete and sudden kernel lock-up would announce
itself, but the kernel ring buffer and the syslog can hold valuable
information pertaining to hardware.
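
If the logs are too long to read by eye, a small filter can help. The sketch below is only an illustration: the keyword list is my own guess at what might be relevant (it is certainly not exhaustive), and the default path is an assumption, since some distributions write to /var/log/syslog instead of /var/log/messages.

```python
import re
from pathlib import Path

# Illustrative keyword list (an assumption, extend it as needed).
HW_PATTERN = re.compile(
    r"overheat|machine check|\bmce\b|edac|\becc\b|thermal|hardware error",
    re.IGNORECASE,
)


def find_hw_warnings(lines):
    """Return the lines that mention one of the hardware-related keywords."""
    return [line.rstrip("\n") for line in lines if HW_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan one log file; return [] if it is missing or unreadable."""
    try:
        with Path(path).open(errors="replace") as fh:
            return find_hw_warnings(fh)
    except OSError:
        return []


if __name__ == "__main__":
    for hit in scan_log():
        print(hit)
```

The output of dmesg can be saved to a file and passed to scan_log() in the same way. Reading the system log usually requires root.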
Frankly, we do have an additional identical mainboard. I might swap it
in (supposing the current one is broken, which I am not sure of), but if
I return the "broken" one to the shop they probably won't replace it,
because it runs correctly 99% of the time.

That may be their policy, but in that case there still are lawyers. <grin>
If you pay for a professional-grade server - particularly one that is
officially supported for running GNU/Linux, as opposed to "it runs Windows
but it'll probably also run GNU/Linux"[1] - then that means that you are
entitled to a flawlessly functioning machine capable of 24/7 uptime.

*[1]* The Linux kernel runs on just about every processor in existence, but
of course the hardware vendor can always claim that their goods have not
been sufficiently tested with GNU/Linux and that therefore your claim is
void. But that is not the case here.
This is a problem for us because we would like to keep a spare mainboard
for the time when nothing compatible is in production anymore. We are a
research entity and the funding for that project has already ended, so we
cannot buy another one and would prefer not to waste a mainboard. OK, if I
cannot get rid of these hangups I will eventually change the mainboard and
see.

Hmm... So what you're telling me is that this machine is used solely for a
single project, for which the funding has ended already, and from which I
then gather that upon the next research project, you will receive funding
for buying additional hardware? In that case, I would try with the
replacement motherboard first and run the limited risk of not having a
spare motherboard until the new funding arrives.

But still, you *should* be entitled to a replacement or a refund from the
manufacturer of that machine.
Another thing I could do is to turn the fans around again and make it
overheat again, to make it (hopefully) fail predictably, [...

Damage it on purpose? That's perverse, in my opinion...
...] but I fear I would damage something useful (we e.g. do not have spare
RAM modules).

I am also thinking about BIOS settings. Is there anything there that
could make it behave like this?

Hmm... I suddenly remember that with my first Tyan machine - the one that
failed - having ECC enabled would cause a 5 to 7 minute delay between the
power-on and system boot because of the ECC scrubbing of the entire memory
in that machine - it had 3 GB of RAM - and during that time, there was no
screen output whatsoever and the monitor stayed off.

Such a delay is long, of course, but on the other hand, such machines are
not intended to be rebooted every couple of hours or even to be shut down
on a daily basis. The only way to get around it was to disable ECC
altogether.

On the other hand, this was an older board - this was in 2001 - and modern
boards should not have such long boot-up delays anymore. And then of
course, that has nothing to do with random hangs, which suggests that your
machine does indeed still have an unresolved hardware problem.
That would be the best news.
Yesterday I changed these settings:
Installed OS: Win2K/XP ---> Other (we use Linux Ubuntu 8.04)

Should indeed be set that way. Some boards also have "Linux" as one of the
choices, or "UNIX".
Large Disk Access Mode: DOS ---> Other (maybe I did this wrong
according to the manual)

This pertains only to IDE hard disks, but if you have those, then setting it
to "Other" prior to OS installation is best if you use GNU/Linux.
SERR signal condition: Single bit ---> Both

Don't really know whether that has any pertinence. When in doubt, stick to
the defaults, I say. ;-)
System Event Logging: ---> Reset Log (wouldn't show anything... but it
still doesn't)

This is a one-time only reset of the event log and does nothing with regard
to stability. It's just there to clean the log if it contains errors, e.g.
from a BIOS upgrade. (Not that I'm sure whether a BIOS upgrade would
legitimately leave errors in the log, but I was told that it did.)
Parallel Port: Enabled ---> Disabled

Frees up an interrupt, but does nothing else.
Then in Linux I disabled these modules, which were continuously
reporting spurious errors (it has always been like this, also on an
identical machine we have in another building):
i5000_edac
edac_core

I have no experience with those, but a quick Google search tells me that
they are Linux's way to report and correct errors, and for stability on a
server, I would definitely keep them running.

I don't know whether the errors you got from these modules were erroneous,
but it is possible that they were only being verbose about very normal ECC
events.

Few people realize the importance (and activity) of ECC. Thunderstorms,
radio waves, cosmic radiation, it all affects your circuits and can
potentially crash your system. That is *why* ECC exists, and so it's only
normal that it reports what it does.
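
If the constant log messages are the real nuisance, one option might be to leave the modules loaded and just look at the error counters from time to time. As far as I can tell the EDAC core exposes per-controller counts under /sys/devices/system/edac/mc, with ce_count for corrected errors and ue_count for uncorrected ones. The sketch below assumes that layout and only works while the EDAC modules are loaded.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int or None, "ue": int or None}}."""
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts  # EDAC modules not loaded, or a different layout
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

A corrected count that creeps up slowly would fit the "normal ECC events" picture, while any uncorrected errors, or a corrected count that climbs quickly, would point back at the RAM or the board.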
Thank you for any help. As you imagine this was supposed to be our main
server for the foreseeable future and we are not full of money as we are
a research entity, and that funding has ended. We would really need to
have this server working... :-(

I might follow up with more information slowly in this thread (even
across a few weeks) when new things happen.

We'll be monitoring... :p
 
Andy

screwed-up-man said:
Hi there,
we are having very annoying sporadic problems with a server we bought
recently.
The server hangs up sporadically every few days and if power-cycled
won't complete boot or complete POST for a few minutes (hangs, in some
cases screen has no signal).

But after a few minutes it's OK?
In January I launched memtest86+ v2.01 and that went for 1 day without
errors! (that's why I am not fully convinced of RAM problems) Is
memtest86+ able to properly disable the ECC in 5000P Blackford? Would
you trust memtest results 100%?

Wait...did memtest actually have enough time to get through testing the
*entire* 32GB?

If not I'd be letting it run several passes (yes this will take a long
time), especially given the "CRITICAL OVERHEAT FB-DIMM" message you said
you'd received.
Still, yesterday the computer hung again and when I power-cycled it,
it wouldn't boot or even POST (no screen signal) for 5 minutes, until I
opened it (it probably cooled down a bit, or was it some spurious data
in silicon that cleared?) and then it booted again. Then I closed it and
it is still running today!!

So the machine _definitely_ booted faster than this previously?

Cheers,
Andy.
 
