.docx files have XML components, but what's their use?

Ghitorni

I read that if a Word 2003 file becomes corrupted, there is only a slim
chance of recovering it, whereas a Word 2007 file can be recovered almost
fully because the actual file is in ZIP format and contains many XML files.
But the "file" as such, the .docx, is still a single file (until it is
unzipped and extracted). So how does this format help when the file is
corrupted? Even with a ZIP file, if a small chunk is gone, you can never
open it. Could anyone shed some light on this?
Thanks
 
Doug Robbins - Word MVP

There could well be (and certainly are) cases where the corruption does not
preclude the Zip file from being opened.
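
As a rough illustration of that point, here is a minimal Python sketch using only the standard `zipfile` module. It builds a small stand-in archive (the member names and contents are made up for the demo, and this is not a real Word document), damages one byte inside one member, and shows that the archive still opens and the undamaged member still reads. The helper names `readable_members` and `make_damaged_sample` are hypothetical.

```python
import io
import zipfile
import zlib


def readable_members(raw: bytes) -> tuple[dict[str, bytes], list[str]]:
    """Return (members that still read cleanly, names of members that do not)."""
    good, bad = {}, []
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for name in zf.namelist():
            try:
                # read() checks the member's CRC-32 once all data is consumed
                good[name] = zf.read(name)
            except (zipfile.BadZipFile, zlib.error):
                bad.append(name)
    return good, bad


def make_damaged_sample() -> bytes:
    """Build a tiny stand-in archive, then corrupt one byte of one member."""
    buf = io.BytesIO()
    # ZIP_STORED keeps the demo deterministic: data is written uncompressed
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("word/document.xml", "<w:t>Hello</w:t>")
        zf.writestr("word/media/image1.bin", b"IMAGEDATA")
    return buf.getvalue().replace(b"IMAGEDATA", b"IMAGEDATX")


good, bad = readable_members(make_damaged_sample())
print("still readable:", sorted(good))
print("damaged:", bad)
```

In this sketch the damage falls inside a member's data, so the ZIP directory is intact and only that one member fails its checksum.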

--
Hope this helps.

Please reply to the newsgroup unless you wish to avail yourself of my
services on a paid consulting basis.

Doug Robbins - Word MVP, originally posted via msnews.microsoft.com
 
Peter Jamieson

.docx and .doc files (at least since about Word 6) have a more similar
structure than many people probably realise - even in .doc, which uses
OLE Compound Files, the content is divided into different "streams"
which can be opened separately.

That said, .docx does have considerable advantages, including
a. the ZIP file structure itself is a de facto standard - I don't
personally have any ZIP utilities for recovering "unopenable" ZIP files,
but I expect there are many. I don't think you will find as many
utilities that know how to recover the content of a corrupted OLE
Compound File.
b. each file within the ZIP is almost certainly going to be an XML
text file such as "document.xml", or a single binary object such as a
.jpg. If the ZIP is damaged, but you can still open it and get the
document.xml, you have already achieved quite a lot. Even if the ZIP is
damaged to the extent that you cannot open it, a recovery utility has a
much better chance of identifying the component files when it knows that
they are either XML or - in some cases at least - well-known types of
binary object such as .jpg. In contrast, in a .doc, the equivalent of
document.xml is a complex binary structure. It isn't even a simple
stream of text with markup. You have to have a utility that knows
precisely how to look through that binary representation in order to
extract anything at all. Although MS has now published the .doc standard
(it appears to be a work in progress), I suspect not many people will
want to spend resource developing new recovery software for obsolescent
formats.
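
To make point (b) a bit more concrete, here is a minimal Python sketch of the kind of scan a recovery utility might do when the ZIP can no longer be opened. It is only an illustration under stated assumptions, not how any particular tool works: the sample archive is a made-up stand-in built in memory, the function names `salvage_members` and `make_sample` are hypothetical, and it handles only stored and deflate-compressed members. It looks for the ZIP local file header signature (`PK\x03\x04`) and decodes whatever follows each one, ignoring the ZIP directory entirely.

```python
import io
import struct
import zipfile
import zlib

# 30-byte ZIP local file header: signature, version, flags, method, time,
# date, CRC-32, compressed size, uncompressed size, name length, extra length
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
SIGNATURE = b"PK\x03\x04"


def salvage_members(raw: bytes) -> dict[str, bytes]:
    """Scan raw bytes for local file headers and decode complete members."""
    found = {}
    pos = raw.find(SIGNATURE)
    while pos != -1 and pos + LOCAL_HEADER.size <= len(raw):
        (_, _, _, method, _, _, _, comp_size, _,
         name_len, extra_len) = LOCAL_HEADER.unpack_from(raw, pos)
        name_start = pos + LOCAL_HEADER.size
        data_start = name_start + name_len + extra_len
        name = raw[name_start:name_start + name_len].decode("utf-8", "replace")
        data = None
        if method == 8:  # deflate: the stream marks its own end
            inflater = zlib.decompressobj(-15)  # -15 = raw deflate, no header
            try:
                out = inflater.decompress(raw[data_start:])
                if inflater.eof:  # keep only streams that ended cleanly
                    data = out
            except zlib.error:
                pass
        elif method == 0 and data_start + comp_size <= len(raw):  # stored
            data = raw[data_start:data_start + comp_size]
        if data is not None:
            found[name] = data
        pos = raw.find(SIGNATURE, pos + 4)
    return found


def make_sample() -> bytes:
    """Build a tiny stand-in for a .docx package (not a real Word document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml",
                    "<w:document><w:t>Hello</w:t></w:document>")
        zf.writestr("[Content_Types].xml", "<Types/>")
    return buf.getvalue()


sample = make_sample()
# Simulate damage: cut the file off where the central directory begins
damaged = sample[:sample.index(b"PK\x01\x02")]
try:
    zipfile.ZipFile(io.BytesIO(damaged))
except zipfile.BadZipFile as exc:
    print("zipfile refuses to open it:", exc)
print("salvaged anyway:", sorted(salvage_members(damaged)))
```

Once document.xml has been pulled out this way, it is plain XML text that can be read or repaired in any editor, which is the advantage over the binary stream in a .doc.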

Peter Jamieson

http://tips.pjmsn.me.uk
 
