CompactFlash corruption and EWF

lino

Hello,

I'm using 1GB PQI Industrial CF marked as fixed disk.
The CF is divided into three partitions:

1) C:, active primary, compressed NTFS, 220MB
2) D:, primary, NTFS, 9MB
3) R:, primary, NTFS, 769MB

The first partition holds the OS (XPe SP2 FP2007) and is EWF protected
(RAM-REG mode), the second one holds some persistent data and log files and
the last one holds data-recording data.
Data-rate on disks D: and R: is low, let's say 10-20 kB/min.
The system works continuously, but it reboots at least once a week.

Randomly, after the system has been running for some time, let's say one or
two months, errors are reported when writing data, for example "Windows was
unable to save all the data for the file D:\$Mft". If the system is rebooted
at this point, it becomes unusable because the CF gets corrupted ... even
the first partition, which is protected by EWF and not involved in data
writing at all! The CF is unrecoverable and cannot be reformatted, even with
low-level format utilities.

Has anyone experienced (and maybe solved) this?

Thank you.


Other info:
- PQI says this is due to the ECC (Error Correcting Code) function marking
some physical blocks as bad
- PQI has found that these CFs are not really damaged and can be restored
with a utility they have (but cannot distribute)
- this has happened on 12 systems out of 40 so far
- the system is not stressed when the errors are reported: CPU usage is 25%,
physical memory is available (40MB) and all disks have free space.
 
Sean Liming (MVP)

Flash can wear out over time, but this sounds a little too soon.

1. How frequently is the system writing data?
2. Have you tried other flash manufacturers?
3. Have you looked into using FBWF instead of EWF, with everything in the same
partition and write-through for the data to be saved?

--
Regards,

Sean Liming
www.sjjmicro.com / www.seanliming.com
Book Author - XP Embedded Advanced, XP Embedded Supplemental Toolkit
 
lino

Hi,
data recording on the third partition is about 50 kB every minute, while
the log rate on the second partition is almost irrelevant, let's say less than
one log line every hour.
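
For scale, the figures quoted in this thread can be turned into a rough
write-volume estimate. This is only a back-of-envelope sketch: it assumes
decimal units (1 MB = 1000 kB), a steady 50 kB/min on R:, and a 60-day window
taken from the "one or two months" mentioned earlier. It ignores NTFS metadata
updates and any write amplification inside the CF controller, neither of which
is known here.

```python
# Back-of-envelope estimate of the data written to R: before a failure.
# Inputs from the thread: 50 kB/min recording rate, 769 MB partition.
# Assumptions (not from the thread): decimal units, a 60-day window.

KB_PER_MIN = 50
PARTITION_MB = 769
DAYS = 60


def written_mb(kb_per_min, days):
    """Payload written over `days` days, in MB (1 MB = 1000 kB)."""
    minutes = days * 24 * 60
    return kb_per_min * minutes / 1000


total_mb = written_mb(KB_PER_MIN, DAYS)
overwrites = total_mb / PARTITION_MB

print(f"payload written in {DAYS} days: {total_mb:.0f} MB")
print(f"equivalent full-partition overwrites: {overwrites:.1f}")
```

Under these assumptions the recorded payload is only a few multiples of the
partition size over two months, so the file data alone does not look like a
heavy write load.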
Due to the very low failure rate it's not easy for me to test other
solutions.
I think the main questions are:
Is using more than one partition on the same device somehow risky?
Why can a data write failure on the second partition damage the first one?

Regards.
 
Oriane

Hi Lino,

lino said:
Hello,

I'm using 1GB PQI Industrial CF marked as fixed disk.
The CF is divided into three partitions:

1) C:, active primary, compressed NTFS, 220MB
2) D:, primary, NTFS, 9MB
3) R:, primary, NTFS, 769MB

I can't answer your questions, sorry :-( But I would like to know the
process you followed to install XPe on C:...

Thanks in advance

Oriane
 
lino

Hi,
simply by copying the XPe image files to C:, making sure to uncompress ntldr
the first time.
After FBA and refining, I use Norton Ghost to grab a disk image and then
go into production. If all the CFs are the same type (size, chipset, P/N, ...)
you can also use SDI scripts to do the same.

Paolo
 
Oriane

Hi Paolo (or Lino ?)

Thanks for your answer...

lino said:
Hi,
simply by copying the XPe image files to C:, making sure to uncompress ntldr
the first time.
After FBA and refining, I use Norton Ghost to grab a disk image and then
go into production. If all the CFs are the same type (size, chipset, P/N, ...)
you can also use SDI scripts to do the same.

Paolo

My question was about building the image for the CompactFlash! We
have tested an image on a hard disk without problems, but we haven't succeeded
in booting from the CF...

At first, we intended to load the OS into RAM to avoid writing to the CF. So
my colleague tried to load an ISO of XPe, written on the CF, into RAM. But
it failed...

Afterwards, we heard about the EWF concept, which should minimize the number
of write operations on the CF. So we are looking for manuals for this
solution... or for advice...

Oriane
 
lino

Hi Oriane,

Using the CF is exactly the same as using a hard drive. The only thing to
take into account is that you have to use an industrial-grade CF, marked as
fixed disk if you need more than one partition on it.
Apart from that, I have used a lot of CFs in the past without any kind of
write protection: the only real problem is if you want to turn off the system
by just cutting the power! (Of course you have to limit write activity, but
this doesn't seem to be a problem.) EWF primarily protects the OS against
damage due to power loss. There's a lot of documentation about it in this
newsgroup, on the web and in the XPe documentation itself. You can include EWF
in the XPe image directly or even install it at runtime, and you can enable or
disable it as you need. I'm using RAM-REG mode, which doesn't need additional
disk space.
Another important point is that you have to prepare the CF on the target
system to make it bootable; in particular, don't use a USB card reader. The
CF's disk geometry may vary from one system to another.

Bye

Paolo
 
Guest

Hi,
We are having similar problems with PQI turbo DOMs, using FAT32 and FBWF.
Only a config folder and "Documents and Settings" are not filtered. After a
few months, files in the protected area are corrupted. chkdsk reports them as
either the wrong size or cross-linked clusters. Opening up the fragments in a
hex editor shows the files to be mangled (some bits of the file repeated, long
strings of 2 bytes repeated).
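
One way to look for the "2 bytes repeated" pattern described above without
paging through a hex editor is a small scan like the following. It is a sketch
only: the function name and the `min_repeats` threshold are illustrative
choices, not part of the original investigation, and zero-filled regions will
be reported as well.

```python
def repeated_pair_runs(data, min_repeats=16):
    """Return (offset, length_in_bytes) for each run in which the same
    2-byte pattern repeats back to back at least `min_repeats` times."""
    runs = []
    i = 0
    n = len(data)
    while i + 2 <= n:
        pair = data[i:i + 2]
        j = i + 2
        # Advance while the next two bytes repeat the same pattern.
        # A short slice at the end of the buffer never equals `pair`.
        while data[j:j + 2] == pair:
            j += 2
        if (j - i) // 2 >= min_repeats:
            runs.append((i, j - i))
            i = j          # skip past the run that was just reported
        else:
            i += 1         # no run here, try the next byte offset
    return runs


# Synthetic example: 20 repeats of AB CD embedded in ordinary data.
sample = b"hello" + b"\xab\xcd" * 20 + b"tail"
print(repeated_pair_runs(sample))
```

Running this over a recovered fragment gives the offsets of suspicious runs,
which can then be checked against sector boundaries.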

Background disk defrag, prefetch and autolayout are off. Process Monitor
logging through boot and runtime shows no unexpected writes, and the FBWF
overlay detail contains the expected files.

The files being corrupted are sometimes ones only read on boot and sometimes
never read at all. Distribution of corrupted files seems to be random.

The system is writing to the config area very infrequently. There is the
usual amount of activity in documents and settings for a box running IIS.

Any ideas or questions would be appreciated.

regards

Will M
 
Guest

Corruption occurs every 512 bytes. The sector length is 512 bytes. Could this
be a problem with wear levelling on the disk?
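
To check whether the damage really lines up with 512-byte sectors, a recovered
file can be compared with a known-good copy one sector at a time. A minimal
sketch, assuming both copies are available as files; the function names and
the paths in the commented usage line are hypothetical.

```python
SECTOR_SIZE = 512  # sector length quoted above


def differing_sectors(good, bad, sector_size=SECTOR_SIZE):
    """Return the indices of the sectors whose bytes differ between buffers."""
    length = max(len(good), len(bad))
    diffs = []
    for offset in range(0, length, sector_size):
        if good[offset:offset + sector_size] != bad[offset:offset + sector_size]:
            diffs.append(offset // sector_size)
    return diffs


def compare_files(good_path, bad_path):
    """Read both files completely and report the differing sector indices."""
    with open(good_path, "rb") as f_good, open(bad_path, "rb") as f_bad:
        return differing_sectors(f_good.read(), f_bad.read())


# Hypothetical usage:
# print(compare_files("golden/app.dll", "recovered/app.dll"))
```

If the corruption is sector-aligned, the differences should start and stop
exactly on the reported sector boundaries instead of straddling them.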
 
lino

Hi Will,

Can you give me some more info about the hardware you are using and any
other information you think is relevant, please? I'm trying to avoid
long-uptime effects by rebooting the system every night ... it seems that in
all the cases I have seen, the OS corruption began after the system had been
on for 2-4 days.

Thank you.

Paolo
 
lino

We also thought that was the problem, but the PQI Customer Service Center says:

The activity of Wear-Leveling is not related to logical blocks, even as
logical partition. It only works on physical sector and will not affect the
logical data. Even if data location at physical block was changed or moved,
the register of controller will auto reassign the location of logical
blocks.
 
Guest

Sure. It's probably not going to help much, though.

We are using a PQI turbo DOM 432 44-pin IDE as a boot disk. The mainboard is
a Digital Logic PC-104 MSM800. The system runs headless and can only be
connected to via Ethernet.

The corruption seems to be incremental. Systems that have been on longer
have more corruption. Systems corrupted on the bench only have one area of
corruption.
Freezing or heating the DOM makes no difference to corruption rate. FBWF
makes no difference.

Useful tools for examining the problem are:
chkdsk from the command prompt.
DiskView and Process Monitor from Sysinternals.
fixmbr (from a recovery console or from a DOS boot).
A hex editor to examine any recovered files.

We are using a cloning machine to mass-produce the DOMs. I'm hoping it works
at the IDE level to copy things onto the flash... otherwise I might be
overwriting the "bad block" sector on the DOM.

hope this helps.
 
lino

Thank you Will,

Just one other question to compare our situations:
is your CF restorable after it gets corrupted? Can you reformat it?
My CFs can only be restored with a PQI production utility that they can't
distribute for license agreement reasons. I can't reformat them, even with
low-level utilities I found on the web (HDD-GURU site). I tried a lot of
tools without success. Also, I have quite a small production run (40 items)
on which to test different solutions, with failures taking months to appear.

Regards,

Paolo
 
Guest

We've lost a few that cannot be recovered even with a low-level format. These
failures happened when we were having power supply problems, so the disks were
possibly over-volted.

Most of the time a format or mbrfix will make the disk usable again. Looks
like our problems are similar but different. If I make any progress I'll post
it here and hope it's useful to someone.
 
Robert

Sorry, I don't have solutions to offer.

We are seeing similar issues - cards get randomly corrupted. We've not yet
determined the cause (we've tried operating at low voltage levels with power
cycling and so on and can't duplicate on the bench).

Today, I received a CF card from the field with corruption that prevents XPE
booting (get the logo with the crawl and then a black screen).

The weird thing is that even though the EWF is enabled, it appears that
Windows is STILL WRITING log files. Am I seeing things?

There are several logs present with timestamps that correspond with the
suspected malfunction time, and these files have entries going back almost two
months. Since the unit is powered down each day, I don't think we're seeing
entries cached from past activities.

We are still using SP1.

This problem is eating our lunch (and a lot of other people's lunches as
well)... it would be nice if MS could provide some insight (given the
per-unit license fees that we're all paying).

Robert
 
Rob

Hi all,

We're seeing similar problems with PQI disks - apparently random corruption of
sectors on the disks. After we contacted PQI support, they replied that certain
types of disk with firmware version "e" or "f" can develop this problem. It can
happen when writing, but *ALSO* when reading (due to ECC). So EWF is not a
solution. The only acceptable solution seems to be to replace the disk with a
newer "g" version, which doesn't have the issue. Or go back to NTe, which for
some unknown reason works differently at the IDE level. Firmware upgrades to
"g" are not possible.

Greetings,
Rob
 
lino

Hi Rob,
it seems to be correct, as we experienced this behavior on a particular P/N
range while newer CFs have worked fine so far.
I'm happy to know someone can get more information from PQI than we can!
Do you know how I can check the firmware? ... I have a tool but it reports
only FW v. 1.01 and nothing else.
Thank you in advance
Paolo.
 
Robert

Thanks for the comment, Rob.

We're using Transcend CF cards (marked as "Industrial" and "Non-DMA Fixed").
There are three partitions: C (EWF), D (No EWF), and an EWF partition.

I saw a post where Sean Liming recommended WinSystems CF cards but our
problem is rare enough (never seen in the lab) that I don't know how we'd see
whether or not a change in cards made a difference.

It turns out that there is an explanation for the growing log files
(specifically SETUPAPI.LOG). It seems that the application programmer IS saving
some user settings on C (typically not changed very often - but not a good
idea IMHO). Whenever one of these settings changes (and a commit is performed,
followed by a reboot), any new text added to the logs since startup will also
be committed. Reboots where there have been no prior commits do not grow the
logs.
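
The behaviour described above can be illustrated with a toy model of a RAM
overlay. This is a simplified sketch, not the real EWF driver: the class and
method names are invented for illustration, and commit is modelled as an
immediate flush of the whole overlay.

```python
class RamOverlayVolume:
    """Toy model of a volume protected by a RAM overlay (hypothetical)."""

    def __init__(self, disk_files):
        self.disk = dict(disk_files)   # persistent contents on the flash
        self.overlay = {}              # writes since boot, held only in RAM

    def read(self, name):
        # Reads see the overlay first, then fall back to the disk.
        if name in self.overlay:
            return self.overlay[name]
        return self.disk.get(name, "")

    def write(self, name, text):
        self.overlay[name] = text

    def append(self, name, text):
        self.write(name, self.read(name) + text)

    def commit(self):
        # The whole overlay is flushed, not only the file the app cared about.
        self.disk.update(self.overlay)
        self.overlay.clear()

    def reboot(self):
        # Uncommitted writes are lost on reboot.
        self.overlay.clear()


vol = RamOverlayVolume({"setupapi.log": "old entry\n", "settings.ini": "mode=1"})

# Boot 1: the log grows in the overlay only, then the unit reboots.
vol.append("setupapi.log", "boot 1 entry\n")
vol.reboot()
log_after_plain_reboot = vol.disk["setupapi.log"]

# Boot 2: a setting changes and the application commits before rebooting.
vol.append("setupapi.log", "boot 2 entry\n")
vol.write("settings.ini", "mode=2")
vol.commit()
vol.reboot()
log_after_commit = vol.disk["setupapi.log"]

print(repr(log_after_plain_reboot))
print(repr(log_after_commit))
```

In this model the log on disk stays unchanged across the plain reboot and
picks up only the "boot 2" entry, because that entry was in the overlay when
the settings change was committed.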

The question remains (in my mind): what happens if the EWF volume runs out
of space? While I don't think we've seen this happen, is there some amount of
free space that must remain? Will corruption occur if the size of the commit
exceeds free space?
 
lino

Hi Rob,

regarding PQI devices (CF in our case), we finally have an official
report directly from PQI, which I reproduce here as is:

I guess the customer used XPe on AC47

This solution "ra03" is not suit for XPe.
Its transfer mode PIO2 and EEC function 2 bit.
XP embedded 's booting will be loading a lot of files.
"ra03" can not handle such heavy loading well.
When ra03 works over its loading, it will let those good sectors which the
lots files on it become bad sectors.
There are appearing some mistakes let the XPe not booting abnormal and
running CHKDSK.
If your clean all partition then do format again, then perform scandisk
again, you will find all the bad sectors are gone
 
