Word html editing - attribute loss

ruggero.vecchio

Hello all,
I'm dealing with a serious problem and have not found a solution
so far, so any help is much appreciated.

I have an application that produces active documents, that is, documents
containing placeholders whose values are set by the application.
The main requirements were that the documents must be HTML and must be
editable either with Word or with our own editor (which is based on the
AxWebBrowser ActiveX control).

Now the problem: in order to recognise the placeholders and their
values and in order to correctly set or replace them, I tried many
solutions but the problem is always the same. Word is messing up the
contents of the html document, reorganizing it (and this is acceptable)
but losing some of the most standard html attributes!

More precisely: one of the ways I tried is to use span tags with a
name or id attribute set to the code of the placeholder, but
every time a document is opened with Word these attributes are lost.
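
For readers following along, here is a minimal Python sketch of the span-based placeholder approach described above, using only the standard library. The names `PlaceholderFiller` and `fill_placeholders` and the value "Acme & Co" are invented for illustration; the original application is not shown in this thread and may work quite differently. The sketch only fills spans whose id attribute is still present, so it does nothing useful once Word has removed the attribute.

```python
from html import escape
from html.parser import HTMLParser


class PlaceholderFiller(HTMLParser):
    """Re-emits HTML, replacing the content of <span id=CODE> placeholders.

    Hypothetical helper, not the original poster's code.
    """

    def __init__(self, values):
        super().__init__(convert_charrefs=False)
        self.values = values  # maps placeholder code -> replacement text
        self.out = []
        self.depth = 0  # span nesting depth inside a placeholder being replaced

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Old placeholder content is dropped; only track nested spans.
            if tag == "span":
                self.depth += 1
            return
        self.out.append(self.get_starttag_text())
        # The parser lowercases attribute names, so ID= and id= both match.
        code = dict(attrs).get("id")
        if tag == "span" and code in self.values:
            self.out.append(escape(self.values[code]))
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag != "span":
                return
            self.depth -= 1
            if self.depth:
                return
        # End tags are emitted in lowercase because the parser lowercases them.
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def fill_placeholders(html_text, values):
    filler = PlaceholderFiller(values)
    filler.feed(html_text)
    filler.close()
    return "".join(filler.out)


sample = "<P><SPAN ID=\"VAR-001\" style='font-size:10.0pt'>@VARIABLE</SPAN></P>"
print(fill_placeholders(sample, {"VAR-001": "Acme & Co"}))
```

The start tag is copied through verbatim, so the id stays in place for the next replacement; everything depends on the editor in between leaving that attribute alone.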

Take the very simple example below: if you save it as html and then
open it with Word and just save it again, you will see that the ID
attribute is gone (ID="VAR-001").

Can anybody explain the reason for this behavior and whether there is a
possible solution to this problem?
Is there any documentation on this topic?
Is there some html attribute that Word would respect?

Many thanks in advance
Roger

--- html code ---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE></TITLE>
<META http-equiv=Content-Type content="text/html; charset=windows-1252">
<BODY>
<P style="MARGIN: 0cm 0cm 0pt; TEXT-ALIGN: left" align=left>
<SPAN ID="VAR-001" style='font-size:10.0pt'>@VARIABLE</SPAN></P></BODY></HTML>
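
One way to check the round trip described above without reading the saved markup by eye is to collect the span ids before and after the save and compare them. The following is a small sketch using Python's standard `html.parser`; the helper names `span_ids` and `lost_ids` are invented here, and the `resaved` string is hand-written for illustration, not actual Word output.

```python
from html.parser import HTMLParser


class SpanIdCollector(HTMLParser):
    """Collects the id attribute of every span element."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so ID= and id= both match.
        if tag == "span":
            for name, value in attrs:
                if name == "id" and value:
                    self.ids.append(value)


def span_ids(html_text):
    collector = SpanIdCollector()
    collector.feed(html_text)
    collector.close()
    return collector.ids


def lost_ids(before_html, after_html):
    """Returns the span ids present before the save but missing afterwards."""
    after = set(span_ids(after_html))
    return [i for i in span_ids(before_html) if i not in after]


original = "<P><SPAN ID=\"VAR-001\" style='font-size:10.0pt'>@VARIABLE</SPAN></P>"
# Hypothetical stand-in for the file after a Word save (assumption, not real output).
resaved = "<p class=MsoNormal><span style='font-size:10.0pt'>@VARIABLE</span></p>"

print(lost_ids(original, resaved))  # prints ['VAR-001']
```

In practice the two strings would be read from the file before and after opening and saving it in Word.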
 
Cindy M.

> Now the problem: in order to recognise the placeholders and their
> values and in order to correctly set or replace them, I tried many
> solutions but the problem is always the same. Word is messing up the
> contents of the html document, reorganizing it (and this is acceptable)
> but losing some of the most standard html attributes!

Well, yes...

Word was never designed to be an HTML editor. The HTML file format for
Word is there to allow people to save Word documents that can be
displayed in a browser, and at the same time can still be edited as Word
documents.

When Word opens a document that's not in its proprietary format, a
converter is run. The converter puts the file into the proprietary
format. When the document is saved again as HTML, it's converted using
Word's internal rules for HTML. So, yes, it will strip things out and put
other things in. And you can't change how the converter does this.

Cindy Meister
INTER-Solutions, Switzerland
http://homepage.swissonline.ch/cindymeister (last update Jun 17 2005)
http://www.word.mvps.org

This reply is posted in the Newsgroup; please post any follow-up question or
reply in the newsgroup and not by e-mail :)
 
ruggero.vecchio

Hello Cindy,
thanks a lot for answering :)

Yes, the converter stuff ... that's OK, but some parts of an html
document are what hold it together, for instance a NAME or an ID
attribute. These could be referenced by some javascript in the document
or outside of it; not very fair that Word cleans everything!

Word 2003 can include XML nodes and attributes if a
namespace is declared, and this fits my needs (instead of using
"pure" html tags I could use xml nodes and attributes), because these
are never removed by the converter. The problem is that this feature is
not present in previous versions, and I'm looking for similar solutions
that are supported by Office XP (at least).

Best regards,
Roger
 
