.docx files have XML components, but what's their use?

Ghitorni · May 5, 2010

I read that if a Word 2003 file becomes corrupted, the chances of recovering
it are slim, whereas a Word 2007 file can be recovered almost fully because
the actual file is in ZIP format and contains many XML files. But the "file"
as such, the .docx, is still a single file (until unzipped and extracted).
So how can the format help once the file is corrupted? Even with a ZIP file,
if a small chunk is gone, you can never open it. Could anyone shed some
light on this? Thanks

Doug Robbins - Word MVP · May 5, 2010

There could well be (and certainly are) cases where the corruption does not
preclude the Zip file from being opened.
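
As a rough illustration of that point, here is a minimal Python sketch that
opens a package with the standard zipfile module and reports which members
still read back cleanly. The function name member_status is made up for this
example, and it assumes the ZIP directory at the end of the file is intact.

```python
import io
import zipfile
import zlib


def member_status(data: bytes) -> dict[str, bool]:
    """Map each member name to True if it reads back and passes its CRC check."""
    status = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            try:
                zf.read(name)  # decompresses the member and verifies its CRC-32
                status[name] = True
            except (zipfile.BadZipFile, zlib.error, EOFError):
                status[name] = False  # this member is damaged; the others may be fine
    return status
```

If word/document.xml comes back as True, the text of the document is still
retrievable even when, say, an embedded picture is not.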

--
Hope this helps.

Please reply to the newsgroup unless you wish to avail yourself of my
services on a paid consulting basis.

Doug Robbins - Word MVP, originally posted via msnews.microsoft.com

Peter Jamieson · May 5, 2010

.docx and .doc files (at least since about Word 6) have a more similar
structure than many people probably realise - even in .doc, which uses
OLE Compound Files, the content is divided into different "streams"
which can be opened separately.

That said, .docx does have considerable advantages, including:
a. The ZIP file structure itself is a de facto standard. I don't
personally have any ZIP utilities for recovering "unopenable" ZIP files,
but I expect there are many. I don't think you will find as many
utilities that know how to recover the content of a corrupted OLE
Compound File.
b. Each file within the ZIP is almost certainly going to be an XML
text file such as "document.xml", or a single binary object such as a
.jpg. If the ZIP is damaged, but you can still open it and get the
document.xml, you have already achieved quite a lot. Even if the ZIP is
damaged to the extent that you cannot open it, a recovery utility has a
much better chance of identifying the component files when it knows that
they are either XML or - in some cases at least - well-known types of
binary object such as .jpg. In contrast, in a .doc, the equivalent of
document.xml is a complex binary structure. It isn't even a simple
stream of text with markup. You have to have a utility that knows
precisely how to look through that binary representation in order to
extract anything at all. Although MS has now published the .doc standard
(it appears to be a work in progress), I suspect not many people will
want to spend resource developing new recovery software for obsolescent
formats.
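
To make point b. a little more concrete, here is a minimal Python sketch of
what a recovery pass might do when the ZIP cannot be opened at all: ignore
the directory at the end of the file and scan for the "PK\x03\x04" signature
that starts each member's local header. The function name salvage_members is
invented for this example. It assumes the member sizes are recorded in the
local headers, handles only stored and deflated members, and does not verify
CRCs, so treat whatever it returns as candidate data rather than a verified
recovery.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"
# Fixed 30-byte local file header: signature, version, flags, method,
# time, date, CRC-32, compressed size, uncompressed size, name length,
# extra-field length.
HEADER = struct.Struct("<4sHHHHHIIIHH")


def salvage_members(raw: bytes) -> dict[str, bytes]:
    """Recover members by scanning for local headers, ignoring the ZIP directory."""
    found = {}
    pos = raw.find(LOCAL_SIG)
    while pos != -1:
        if pos + HEADER.size <= len(raw):
            (_, _, _, method, _, _, _, csize, _, nlen, xlen) = HEADER.unpack_from(raw, pos)
            name_start = pos + HEADER.size
            name = raw[name_start:name_start + nlen].decode("utf-8", "replace")
            start = name_start + nlen + xlen
            try:
                if method == 8:
                    # Raw deflate marks its own end, so no size is needed.
                    # A truncated stream yields whatever could be decoded.
                    found[name] = zlib.decompressobj(-15).decompress(raw[start:])
                elif method == 0:
                    # Stored data: rely on the size from the local header.
                    found[name] = raw[start:start + csize]
            except zlib.error:
                pass  # false signature match or badly damaged data
        pos = raw.find(LOCAL_SIG, pos + 4)
    return found
```

If "word/document.xml" turns up in the result, the markup can be inspected
as plain text, which is exactly the advantage over the binary .doc stream.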

Peter Jamieson

http://tips.pjmsn.me.uk
