F@H Errors and what they mean

muckshifter

EARLY_UNIT_END:
Quite possibly the most common error found today. EARLY_UNIT_END is usually caused by one of two things: a bad WU or an unstable system.

If you get one isolated EARLY_UNIT_END, it's most likely just a WU that is bad. It's not a problem, and you shouldn't worry about it. It's usually caused when atoms in the WU reach impossible positions and Gromacs can't continue.

Multiple EARLY_UNIT_END errors are a sign of a severe problem with your machine. Machines that are clocked too high, have heat problems, or possibly have SSE forced on (AMD only) will generate this error. You should stop F@H if you get more than one EARLY_UNIT_END per week per machine, and certainly if you get two in a row. Make sure your machine is up to spec, with reasonable temperatures, reasonable clocking (CPU, FSB, and memory must all be stable), and a good, powerful PSU. EARLY_UNIT_END is most often caused by problems with a user's machine, and an abnormal number of them certainly merits examining your system.
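
The rule of thumb above (more than one EARLY_UNIT_END per week per machine, or two in a row) can be written down as a quick check. This is only a sketch: it assumes you keep your own list of dated results for one machine, since FAHlog.txt timestamps carry no date. The function name and the "OK" label are made up for the example.

```python
from datetime import datetime, timedelta

BAD = "EARLY_UNIT_END"

def should_stop_folding(results):
    """results: chronological list of (finished_at: datetime, outcome: str)
    for ONE machine. Any outcome other than EARLY_UNIT_END counts as fine
    ("OK" is just a placeholder label, not a real client status)."""
    # Two EARLY_UNIT_ENDs in a row.
    for (_, prev), (_, cur) in zip(results, results[1:]):
        if prev == BAD and cur == BAD:
            return True
    # More than one EARLY_UNIT_END inside any 7-day window.
    times = [when for when, outcome in results if outcome == BAD]
    for earlier, later in zip(times, times[1:]):
        if later - earlier <= timedelta(days=7):
            return True
    return False
```

A single isolated EARLY_UNIT_END returns False, which matches the "don't worry about one bad WU" advice.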

This error may be accompanied by a LINCS WARNING message that gives more specific technical details on exactly what happened.

NOTE: See the description about "-forceasm" (3.x) or "-forceSSE" (4.x) causing SPECIAL_EXIT on certain AMD based systems. If you are running an AMD Athlon XP with the Thoroughbred or Barton cores, you should remove the "-forceasm" or "-forceSSE" switch, most likely fixing your problems.

FILE_IO_ERROR:
An error that occurs when disk operations go bad. This is a fairly general error, having many sub-types. It has plummeted in frequency since the release of Gromacs Core 1.46. Now, this error usually happens when a hardware error occurs: something like "Write 0010, read back 0011". If you experience this error, make sure your hard drives are OK: run ScanDisk, CHKDSK, or fsck, make sure the IDE bus is in spec, make sure you're using good IDE cables, and make sure the drive isn't dying.
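
To make the "Write 0010, read back 0011" idea concrete, here is a rough write-then-read-back check. It is not what the core does internally, just an illustration of the kind of mismatch involved. The function name is made up, and because the OS may serve the read from cache, a pass here does not replace ScanDisk, CHKDSK, or fsck.

```python
import os
import tempfile

def write_read_back_ok(directory, size=1024 * 1024):
    """Write random bytes to a temp file in `directory`, read them back,
    and report whether the two match."""
    payload = os.urandom(size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to disk
        with open(path, "rb") as f:
            return f.read() == payload
    finally:
        os.remove(path)  # always clean up the scratch file
```

Point it at the folder your client runs from; a False result is a strong hint that the drive, cable, or memory needs a closer look.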

FILE_IO_ERROR has also been reported to occur if two Console clients are started while working on the same unit. This can happen on a dual-CPU machine if you accidentally start one client twice instead of starting each of the two clients once.
Thanks to sortofageek for contributing the part about two clients causing this error.

CLIENT_DIED:
This happens when, simply enough, the client dies. The core is still running, and can't find the client, so it shuts down. This is usually related to overclocking and/or overly aggressive memory timings. Back down on these and this error should vanish.

UNKNOWN ERROR:
A now rare Gromacs error that usually occurs if there's a corrupt WU being processed. It is no longer common and any instances should probably be reported (post a log, etc.). You may also want to check your hardware if you've had past errors.

Client-Core Communications Error:
There are several different kinds of this error.

ERROR 0xX is basically another form of an unknown error. On Linux it can appear if you're having Glibc version problems (see the Linux forum for more info). Overclocking is another possible cause, and the fix there is to stop overclocking. ERROR 0x1 has occurred with Gromacs units; its cause is still unknown, and the Pande group has not been able to replicate it. Apart from the overclocking and Glibc cases, there's no known fix. Post the relevant sections of FAHlog.txt (including the client version and type) along with your OS version, and continue folding.

ERROR 0x1 has been reported to occur if the core is killed while the client is processing, though this is a fairly rare occurrence if you are not using scripts that kill the core.
[15:07:06] CoreStatus = 1 (1)
[15:07:06] Client-core communications error: ERROR 0x1
[15:07:06] Deleting current work unit & continuing...
[15:07:26] Trying to send all finished work units
[15:07:26] + No unsent completed units remaining.
[15:07:26] - Preparing to get new work unit...
Thanks to gnewbury for information on this form of ERROR 0x1.
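
If you have a long FAHlog.txt and want to find lines like the one above, something along these lines should work. It's a small sketch that assumes the line format matches the sample log exactly; the function name is made up.

```python
import re

# Matches lines like:
# [15:07:06] Client-core communications error: ERROR 0x1
COMM_ERROR = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] Client-core communications error: "
    r"ERROR (0x[0-9A-Fa-f]+)"
)

def find_comm_errors(log_text):
    """Return a list of (timestamp, error_code) pairs found in the log text."""
    hits = []
    for line in log_text.splitlines():
        match = COMM_ERROR.match(line.strip())
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits
```

Feed it the contents of the log as one string and it returns the timestamp and code for each communications error, which is handy when deciding what to post.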


ERROR 0xC0000005 means there was a memory access violation. This is a standard Windows error code for any program trying to access memory it does not control. This can be a rare hardware error and is not cause for concern. Old versions of clients/cores can also cause this problem.

ERROR 0x________, where the blank is an eight-digit hexadecimal code, is usually a general Windows error. Look up the specific Windows error code (if you need help, just post a thread) and you will most likely find the cause.
Thanks to Bruce and Guha for clarifying 0xX errors.
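
The different 0xX cases described above can be summed up in a few lines. This is only a sketch of the descriptions in this post, not an official table of codes; the function name and the wording of the returned strings are my own.

```python
def describe_error_code(code):
    """Map a code string such as "0x1" or "0xC0000005" to the matching
    description from this guide. Expects a leading "0x"."""
    text = code.lower()
    value = int(text, 16)   # int() accepts the "0x" prefix with base 16
    digits = text[2:]       # the part after "0x"
    if value == 0xC0000005:
        return "memory access violation"
    if value == 0x1:
        return "ERROR 0x1: cause unknown; seen when the core is killed"
    if len(digits) == 8:
        return "general Windows error: look up the specific code"
    return "unknown 0xX error: check overclocking, or Glibc on Linux"
```

For example, `describe_error_code("0xC0000005")` gives the access violation description, while any other eight-digit code falls through to the "look it up" advice.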

BAD_FRAME_CHECKSUM:
You'll see a block in your log that looks something like this:
[hh:mm:ss] Header on frame 220 differs from expected header
[hh:mm:ss] Got: A028B-5C-3E84B02E-EA1B7D4: 0220
[hh:mm:ss] Expected: A028B-5C-3E84B02E-EA1B7D4: 0219
Note that the hexadecimal identifiers on the two lines are the same; only the trailing frame number differs. This strange error only occurs with Tinker units. The only known cause is when two or more clients are started at once and are working in the same directory, but there may be other causes. This error often, bizarrely, occurs on an early frame but is not detected until the unit's end.

BAD_FRAME_CHECKSUM, similar to one type of Gromacs FILE_IO_ERROR, can also mean that a hardware error occurred where there was a slight discrepancy between what was read and what was expected: something like writing 101010 and reading back 110110. Again, this is commonly not detected until the unit finishes.
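
To see at a glance what differs between the "Got" and "Expected" lines, you can pull them apart like this. It's a sketch that assumes the lines look like the sample block in this section; the function name is made up.

```python
import re

# Matches e.g. "Got: A028B-5C-3E84B02E-EA1B7D4: 0220"
HEADER = re.compile(r"(Got|Expected):\s*([0-9A-Fa-f-]+):\s*(\d+)")

def compare_frame_headers(got_line, expected_line):
    """Split both header lines into identifier and frame number."""
    got = HEADER.search(got_line)
    expected = HEADER.search(expected_line)
    if not got or not expected:
        raise ValueError("could not parse header lines")
    return {
        "same_id": got.group(2) == expected.group(2),
        "got_frame": int(got.group(3)),
        "expected_frame": int(expected.group(3)),
    }
```

With the sample lines, the identifiers match and the frame numbers come out as 220 and 219, which is the pattern described above.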

Server reports digital signature does not match:
Some of the newer servers don't seem to like the older versions of the client. Upgrade to the latest client.

SPECIAL_EXIT:
This severe error means that something unknown happened inside the Gromacs core. The only known cause is when "-forceasm" (3.x) or "-forceSSE" (4.x) is applied to an AMD system that is not 100% stable with SSE. CPUs that have had problems include Thoroughbred B, Barton, and Opteron processors. In this case it should be dealt with as an EARLY_UNIT_END error (see above). Removing "-forceasm" or "-forceSSE" will almost certainly fix the problem. SSE-related errors are now fairly rare compared to a few months ago.

If you are not forcing use of SSE and this error occurs, a log should be posted as this is a serious problem.

Posting Log Files: A Guide
When you post any log straight onto the board, please edit out any insignificant details. Examples of this would be completion of frames, core download ("1024 bytes downloaded... 512000"), and Getwork errors (leave the first and last ones please).
This makes logs far easier to read and problems are easier to spot. If you're unsure if something should be cut, please just leave it in.
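
If you'd rather not trim by hand, here is a rough sketch of the idea. The matching is by plain substring, and the defaults only cover the "bytes downloaded" and Getwork cases mentioned above; anything else (frame completion lines, for instance) has to be passed in by you, since the exact wording depends on your client and core. The function name and the "[... cut ...]" marker are made up.

```python
def trim_log(lines, noise=("bytes downloaded",)):
    """Drop lines containing any `noise` substring and collapse each run of
    consecutive Getwork lines down to its first and last line."""
    kept = []
    run = []  # current run of consecutive Getwork lines

    def flush():
        if len(run) <= 2:
            kept.extend(run)
        else:
            marker = f"[... {len(run) - 2} similar line(s) cut ...]"
            kept.extend([run[0], marker, run[-1]])
        run.clear()

    for line in lines:
        if "getwork" in line.lower():
            run.append(line)
            continue
        flush()  # a non-Getwork line ends the current run
        if any(fragment in line for fragment in noise):
            continue
        kept.append(line)
    flush()
    return kept
```

When in doubt, leave a pattern out of `noise`; as noted above, it's better to post too much than to cut the line that matters.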

Folding@Home Unit Types and Cores
To tell which type of unit you are running, you can simply look at your log file. When the client first starts the core, you'll find one of the following strings somewhere. "Gromacs Core" means it's a Gromacs unit. "Double Gromacs Core" is an SSE2-enhanced double-precision version of regular Gromacs. "Protein Design Core" is a Genome unit; this core is no longer in use as of 15 March 2004. Tinker simply says "Folding@Home Client Core" and then, farther down, "TINKER: Software Tools for Molecular Design".
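
The string checks above are easy to automate. Here is a small sketch; it assumes you pass in only the part of the log around one core start, since a full log can cover several units. The function name is made up.

```python
def unit_type(log_text):
    """Guess the unit type from the strings the core prints at startup."""
    # Check "Double Gromacs Core" first: it also contains "Gromacs Core".
    if "Double Gromacs Core" in log_text:
        return "Double Gromacs"
    if "Protein Design Core" in log_text:
        return "Genome"
    if "Gromacs Core" in log_text:
        return "Gromacs"
    if ("Folding@Home Client Core" in log_text
            and "TINKER: Software Tools for Molecular Design" in log_text):
        return "Tinker"
    return None  # none of the known strings were found
```

The order of the checks matters: testing for plain "Gromacs Core" first would misreport every Double Gromacs unit as regular Gromacs.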

If you look at the currently running processes under WinNT/2K/XP (in Task Manager) or Linux, you'll find one of these cores running (the most current version number is also listed):
  • FAHCore_65.exe - Tinker - Version 3.80 (previously 2.50)
  • FAHCore_78.exe - Gromacs - Version 1.70 (previously 1.59)
  • FAHCore_79.exe - DGromacs - Version 1.60
  • FAHCore_ca.exe - Genome - Version 2.06 - No longer in use
  • FAHCore_82.exe - AMBER - new
    I don't think it is possible (or necessary) to keep the version numbers up to date, but I changed a couple anyway. -b
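
If you want to check a process name against the list above, a lookup table does the job. This is just the list restated as a sketch; version numbers are left out on purpose since they go stale, and the function name is made up.

```python
# Process name (lowercase, without ".exe") -> core, per the list above.
CORES = {
    "fahcore_65": "Tinker",
    "fahcore_78": "Gromacs",
    "fahcore_79": "DGromacs",
    "fahcore_ca": "Genome (no longer in use)",
    "fahcore_82": "AMBER",
}

def core_for_process(name):
    """Return the core for a process name such as "FAHCore_78.exe",
    or None if the name is not in the list."""
    base = name.strip().lower()
    if base.endswith(".exe"):
        base = base[:-4]  # drop the extension so bare names also match
    return CORES.get(base)
```
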
Special thanks to the Pande group for their excellent support of this project (in no particular order): Vijay, Guha, Youngmin, Vishal, Eric, Chris, Adam, Stefan, and the rest of you who never post. You've got to like people who get up in the morning and go over to kick the servers when they're not working. Thanks to all of you.

This document may be freely linked to or reproduced, so long as a link to this page is provided.
 
